[6] Don Chamberlin, James Clark, Daniela Florescu, Jonathan Robie, Jerome Simeon, and Mugur Stefanescu. XQuery 1.0: An XML query language.
Formal and conceptual models for XML structures - the past, present, and future

Arijit Sengupta

Sriram Mohan

Abstract

XML is universally recognized as the standard document format for the purpose of interchange and device-independent presentation. The literature shows many efforts towards the development of formal and conceptual models for XML, although no commonly accepted model exists as yet. In this paper, we present a survey of some formal and conceptual modeling techniques, and discuss the merits and shortcomings of such techniques. We then present HNR (Heterogeneous Nested Relations), a formal modeling method for representing XML, and discuss how this formal model could be used to derive many useful results on XML query languages, and how this could lead to a potential conceptual modeling technique as well.

1 Motivation

Since its first introduction in 1996, XML has been steadily progressing as the "format of choice" for data that has mostly textual content. From financial data and business transactions to data obtained from satellites, most of the data in today's web-driven world is being converted to XML. The two leading web application development platforms (.NET [9] and J2EE [16]) use XML web services, a standard mechanism for communication between applications. Given its huge success in the data and application domains, it is somewhat puzzling that so little has been done in the conceptual and formal design areas involving XML. The literature shows different areas of application of design principles that apply to XML. Since XML has been around for a while, and only recently has there been an effort towards formalizing and conceptualizing the model behind XML, such modeling techniques are still playing "catch-up" with the XML standard. The World Wide Web Consortium (W3C) has started an effort towards a formal model for XML, termed DOM (Document Object Model), a graph-based formal model for XML documents [20]. For the purpose of querying, W3C has also proposed another data model called the XPath data model, more recently known as the XQuery 1.0/XPath 2.0 data model [11]. However, both of these models are at too low a level to serve as good conceptual modeling tools.

Data modeling is nothing new in the field of databases. The most common model is the relational model, which is a formal model based on set-theoretic properties of tuples. The entity-relationship model is popular as a conceptual model in relational databases. Object-oriented databases, in the same way, have the object-oriented model as the underlying formal model, and the Unified Modeling Language (UML) as the conceptual model. Although XML has been in existence for over five years, it is not based on a formal or conceptual model. XML has a grammar-based model of describing documents, carried over from its predecessor SGML (Standard Generalized Markup Language [13]). Although fairly suitable for describing and validating document structures, grammars are not ideally suited for formal or conceptual description of data. The popularity of XML has, however, indicated that a convenient method for modeling would definitely be useful for understanding and formalizing XML documents.

The question naturally arises whether or not there is a necessity or use for a formal or conceptual model for XML documents. One could argue that the Entity-Relationship model has essentially all the aspects (except ordering) that one would need for modeling XML, but clearly the ER model is not entirely suited for XML. In this paper, we investigate the current research on abstracting XML in formal and conceptual models, and propose a model that provides the initial steps towards a complete abstraction.

2 Literature Survey

In this section, we present a set of ten data models and classify them according to the level of abstraction provided by each model. In this regard, we propose three levels of classification: (i) physical, (ii) formal, and (iii) conceptual. A physical model is used for implementation - it is at the lowest level, including data structures and traversal mechanisms to navigate the data. A formal model is at the next level of abstraction, having language-level elements to reason about the data. Having a good formal model provides the ability to design good languages and processing techniques for the data, and allows the creation of efficient optimization strategies to pass to the physical data structures. A conceptual model is primarily for the purpose of understanding the data, and can be used to generate the formal and hence the physical representations.

The classification presented here is not strict. Some of the models may lie on the border between classifications, and some may span multiple levels of classification. We have chosen the level that seems to best fit the model in question.

2.1 Physical models

2.1.1 Document Object Model (DOM)

The World Wide Web Consortium (W3C) has recommended the Document Object Model (DOM) [20] as a programming interface for XML documents. DOM helps capture the structure of these documents and also provides means to access and manipulate them. DOM can be used to change, delete, add, navigate and build XML structures. DOM is an inherently low-level representation of the data model of XML documents and is more of a programming interface to manipulate XML than a means to find information within the documents. Concepts are dispersed in the tree and it is not possible to view them as collections of homogeneous documents.

2.1.2 XML Data Model

The XML Data Model serves two purposes. First, it defines precisely the information that should be contained in the input to an XSLT [7] or XQuery [6] processor. Second, it defines all permissible values of expressions in the XSLT, XQuery, and XPath [8] languages. The XML Data Model is based on the XML Information Set, supports XML Schema types, and provides the ability to represent collections of documents and XML values. The data model can represent not only the input and the output of a stylesheet or query, but all values of expressions used during intermediate calculations. The XML Data Model can be used to visualize the structure of documents. The lexical structure is represented by a tree and can be extended by additional links representing features within the tree, generating a graph. The node types correspond to the logical structure. The innate simplicity of the model helps support conceptual modeling of document instances. A precise definition of the XML data model can be found in [11].

2.1.3 Structured data in Files

Structured data in files can benefit from work done on standard database technology. The study in [2] shows that data in structured files can be queried and updated, and that standard optimization techniques such as pushing selections can be applied. The study explains in great detail the optimization techniques that can be applied, and discusses naive and sophisticated methods to update structured data. It introduces the notion of a structuring schema: a grammar annotated with database programs, together with a schema, used to parse the file and to interpret the data contained in the file as a database. A formal definition of the notion can be found in [2]. The study presents in detail the issues to be considered while handling structured data stored in files, the optimization techniques to use, and methods to handle updates.

2.1.4 Semantic Network based Design

XML Schema, as a replacement for DTD, provides rich facilities for defining and constraining the content of XML documents. It does not, however, concentrate on the semantics that underlie these documents, but instead depicts a logical data model. A conceptual model taking the semantics into account has been proposed in [10]. The methodology described can be broken into two levels: (a) the semantic level and (b) the schema level. The first level is based on a semantic network, which provides a semantic model of the XML document through four major components:

1. A set of atomic and complex nodes representing real-world objects.
2. A set of directed edges representing semantic relationships between objects.
3. A set of labels denoting different types of semantic relationships such as aggregation, generalization, etc.
4. A set of constraints defined over nodes and edges to constrain these relationships.

The second level is based on a detailed XML schema design, including element/attribute declarations and simple/complex type definitions. The main idea is that the mapping between these two levels can be used to transform the XML semantic model into an XML schematic model, which can then be used to create, modify, manage and validate documents.

2.1.5 OEM and the Lorel approach

OEM (Object Exchange Model) is a simple self-describing model that can intuitively be thought of as a labeled directed graph. All entities are objects (uniquely identified) that can either be atomic or complex. The graphical view of an OEM database represents the objects as nodes, and complex objects have outgoing edges labeled with the relationships to their sub-objects. A detailed study of the Lorel approach to XML data modeling can be found in [12]. The study describes in detail the methods to represent an XML element, be it simple or composite, in the new data model, and also provides techniques to map an XML document into the data model. The data model makes it convenient to visualize the data as a labeled directed graph. The data can be viewed in two different modes: a semantic mode, in which the user views the database as an interconnected graph, and a literal mode, in which the user views the database as an XML document. Both the pictorial models described above provide an easily understandable model but do not provide the user with the ability to add, change and manipulate the model.
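The labeled-directed-graph view of OEM can be made concrete with a small sketch. The following Python fragment is only an illustration under stated assumptions: the class `OEMObject` and the helpers `atomic`, `complex_object` and `follow` are hypothetical names introduced here, not part of OEM or Lorel, and object identifiers are simply drawn from a counter.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple
import itertools

# Hypothetical identifier source; OEM only requires that objects be uniquely identified.
_ids = itertools.count(1)


@dataclass
class OEMObject:
    """An OEM object: atomic (carries a value) or complex (carries labeled edges)."""
    value: Optional[str] = None
    edges: List[Tuple[str, "OEMObject"]] = field(default_factory=list)
    oid: int = field(default_factory=lambda: next(_ids))


def atomic(value: str) -> OEMObject:
    """Build an atomic object holding a single value."""
    return OEMObject(value=value)


def complex_object(*edges: Tuple[str, OEMObject]) -> OEMObject:
    """Build a complex object from (label, sub-object) pairs."""
    return OEMObject(edges=list(edges))


def follow(obj: OEMObject, label: str) -> List[OEMObject]:
    """Return the sub-objects reached through outgoing edges with the given label."""
    return [target for (edge_label, target) in obj.edges if edge_label == label]


# A tiny graph: a book with a title and one author.
author = complex_object(("last", atomic("Abiteboul")), ("first", atomic("Serge")))
book = complex_object(("title", atomic("Data on the Web")), ("author", author))

titles = [o.value for o in follow(book, "title")]
last_names = [n.value for a in follow(book, "author") for n in follow(a, "last")]
```

Because the labels live on the edges rather than in a separate schema, the structure is self-describing in the sense used above: a consumer discovers the shape of the data by walking the graph.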

2.2 Formal Models

2.2.1 Complex Object Database Models

A variety of models generalize the relational database model by relaxing the first normal form constraint. These include the nested relational model, the complex object model and the semantic model. The study in [1] presents a formal model for complex objects and languages for complex objects. The study attempts to identify a common approach to the design of query languages and the generalization of known paradigms to new models for complex objects. It proves that the algebraic language is related to the functional style of programming, defines the concept of domain independence, and proves that calculus, algebra and logic programming have equivalent expressive power. A comparison of various query languages for complex objects has been carried out and the following results obtained:

1. There is a classic equivalence between domain-independent calculus and algebra.
2. Algebra and safe calculus (syntactic restrictions on the calculus that guarantee domain independence) are equivalent.
3. Algebra can be made equivalent by including the 'powerset' operation (which generates all subsets).
4. Calculus, algebra and the programming language are equivalent.

The study provides the means to generalize relational models by allowing arbitrary domains and generalizing the concept of a tuple. The study also identifies methods to design an algebra, language and calculus for these generalized relational models.

2.2.2 XGrammar

XML schemas expressed in schema languages such as Document Type Definitions (DTD) are widely used to represent data. The semantic data modeling capabilities of XML schemas have been studied in [15]. The study takes a systematic approach to data description using XML schemas and compares it to the more prevalently used conceptual model, the Entity-Relationship model [4]. The study formalizes a core set of features found in various XML schema languages into XGrammar, using a notation commonly used in formal language theory. XGrammar is an extension of the regular tree grammar definition in [17], and a six-tuple notation is used to describe the model of XML schema languages. XGrammar extends this notation with the ability to represent attribute definitions and data types. A formal definition of XGrammar can be found in [15]. XGrammar introduces the ability to represent ordered binary relationships and recursive relationships, and to represent a set of semantically equivalent but structurally different types as one type. XGrammar also supports composite attributes, generalization hierarchies and n-ary relationships. The study also provides methods to convert XGrammar into an Extended ER model and vice versa. XGrammar suffers in its representation because its grammar is loosely based on several existing schema languages rather than on a generalized representation.

2.2.3 SAL, An Algebraic Model

SAL (Semi-Structured Algebra) is an algebraic query language that [3] uses as a target language for translation from a declarative, user-oriented query language for XML. SAL provides a concise representation of query execution but still leaves room for cost-based optimization and the choice of execution strategy. Since static type checking is not possible with semi-structured data (the schema provides an irregular structure), the system should be capable of handling run-time errors. The study proposes a mechanism called 'data exceptions' for this purpose. The data model of SAL is a variation of OEM and views the data as an edge-labeled directed graph with a distinguished single root and an order on the outgoing edges of every node. Every node may have an ID and can be referenced from more than one node. Directed cycles are possible. A formal definition of the algebra and the model can be found in [3]. The algebra supports selection, mapping, joins, group-by, regular expression matching and variable binding. It also has support for functions and predicates. The study provides mechanisms to convert an XML document into the corresponding data model.

2.3 Conceptual Models

2.3.1 Unified Modeling Language (UML)

An approach to modeling DTDs conceptually using UML (Unified Modeling Language) has been studied in [19]. The study projects UML as the link between software engineering and document design, as it provides a mechanism to design object-oriented software together with the necessary XML structures. The main goal behind conceptual modeling is to separate the designer's intention from the implementation details. UML helps achieve this by combining object-oriented design with XML document structures, providing more clarity. The study explains in detail the relevant modeling concepts of UML and their transformation into DTDs. Using UML for modeling helps improve the redesign process and also helps reveal possible structural weaknesses in the document design. UML is extensible and adaptable and retains the expressiveness of the target language. Modeling is limited to the static portion, as an XML document is inherently static. The study has also extended UML to make it compliant with the concept of order. Extensions are necessary for the implementation-dependent concepts of modeling.

2.3.2 Entity Relationship for XML (ERX)

Conceptual modeling of XML documents based on an evolutionary Entity-Relationship model (ERX) is proposed in [18]. This model has the necessary modifications to cope with the features that are peculiar to XML. The basic parts of an ER model, such as entities, relationships and attributes, have been modified to support XML features such as order, complex structures and document classes. Order is supported in ERX by modeling it as an attribute of an entity, which determines where the instance of the entity appears in the document. ERX provides effective support for the development of complex XML processors for advanced applications by providing a model that takes into consideration classes of concepts and their relationships across documents. ERX does not use nested structures but instead makes use of a flat representation. It is not constrained by the syntactic structure of XML and is specifically focused on data management issues. The study demonstrates the capabilities of the ERX system and also provides a detailed explanation of the various entities that constitute the ERX model.

2.4 Discussion

Model             Physical  Formal  Conceptual  Diagrammatic  XML Native  Order  Heterogeneous
NR                No        Yes     No          No            No          No     No
ERX               No        No      Yes         Yes           Yes         Yes    No
UML               No        No      Yes         Yes           No          Yes    Yes
XGrammar          No        Yes     No          No            Yes         Yes    Yes
OEM               Yes       No      No          Yes           No          Yes    No
DOM               Yes       No      No          No            Yes         Yes    No
Semantic Network  Yes       No      No          Yes           No          Yes    No
HNR               No        Yes     Extensible  No            ENF         Yes    Yes

Table 1: Comparison of data models for XML

Table 1 compares some of the different models mentioned above. The models were compared on the basis of the type of representation followed, the support for XML features such as order, whether the models were built with XML in mind, and finally heterogeneity. It can be seen that most of the models were not originally meant for XML and are deficient in supporting XML's various intricacies. HNR, on the other hand, has been built with XML in mind and serves as a good formal modeling technique for XML. The next sections describe HNR in detail.

3 Heterogeneous Nested Relations (HNR)

Toward the goal of discovering an appropriate formal model for XML, we first consider the primary sources of difference between XML structures and relational structures. Relational databases are sufficient for representing objects with simple domains. Documents are inherently complex objects, and cannot be represented easily in the relational model because of the first normal form requirement, which requires all attributes of a tuple to be of atomic type. In documents, a "tuple" has contents that are structured, containing complex values or sets of values. In addition, elements in a document may have different structures, resulting in heterogeneous types in the same set.

3.1 Definitions

In order to appropriately describe the model in the context of documents, we first define a few key terms.

Element Normal Form (ENF). In this paper, we use the concept of ENF (Element Normal Form), a representation of XML documents that does not use any attributes (an informal introduction to ENF is provided in [14]). Technically, an XML document is in ENF if it does not have any attributes. It is trivial to show that any XML document with attributes can be transformed without loss of generality to XML without attributes, in which the attributes are converted to elements with a special naming convention. This transformation can be performed using a simple XSLT, and the original document can be recovered from an ENF representation using XSLT as well. In the rest of the paper, we assume that all the documents in our data set are in ENF. The following table shows a sample document and its ENF representation:

XML document: 1001 yes John Doe Joe Schmoe

ENF: John Doe Joe Schmoe

Table 2: An XML document and its ENF representation
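The attribute-to-element transformation can be sketched in a few lines. The paper performs it with XSLT; the fragment below swaps in Python's standard `xml.etree.ElementTree` purely for illustration. The `ATTV_` prefix follows the naming seen in the ENF DTD later in this paper, while the function names `to_enf` and `from_enf`, and the choice to place attribute elements after the existing children, are assumptions made here.

```python
import xml.etree.ElementTree as ET

# Naming convention for attribute-derived elements (prefix taken from the
# ATTV_year element in the ENF DTD; treating it as a general rule is an assumption).
ATTR_PREFIX = "ATTV_"


def to_enf(elem: ET.Element) -> ET.Element:
    """Return a copy of elem in which every attribute becomes a child element."""
    out = ET.Element(elem.tag)
    out.text = elem.text
    out.tail = elem.tail
    for child in elem:
        out.append(to_enf(child))
    # Assumption: attribute elements are appended after the original children.
    for name, value in elem.attrib.items():
        attr_elem = ET.SubElement(out, ATTR_PREFIX + name)
        attr_elem.text = value
    return out


def from_enf(elem: ET.Element) -> ET.Element:
    """Inverse of to_enf: turn ATTV_-prefixed children back into attributes."""
    out = ET.Element(elem.tag)
    out.text = elem.text
    out.tail = elem.tail
    for child in elem:
        if child.tag.startswith(ATTR_PREFIX):
            out.set(child.tag[len(ATTR_PREFIX):], child.text or "")
        else:
            out.append(from_enf(child))
    return out


doc = ET.fromstring('<book year="2000"><title>Data on the Web</title></book>')
enf = to_enf(doc)
enf_text = ET.tostring(enf, encoding="unicode")
restored_text = ET.tostring(from_enf(enf), encoding="unicode")
```

For this input, `enf_text` is `<book><title>Data on the Web</title><ATTV_year>2000</ATTV_year></book>` and `restored_text` equals the original document, which illustrates why the ENF assumption loses no information. The sketch ignores namespaces and assumes no ordinary element name already starts with the prefix.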

Document Relations. A document relation is defined as a set of documents of the same top-level type. A document relation is a strict set, which means there is no implicit order imposed among documents in the relation. A relation in general consists of distinct documents as well.

Complex Tuple. A complex tuple is one element of a document relation. We often refer to a complex tuple simply as a tuple.

Complex Attribute. A complex attribute is a component of a complex tuple. Complex attributes should not be confused with XML attributes (given the ENF assumption, documents do not contain any XML attributes).

3.2 The Formal Data Model

The HNR model is based on the nested relational model [1], also known as the NF2 or non-first normal form model, a generalization of the traditional relational model that does not enforce the first normal form. In this model, attributes in relations may be non-atomic, by allowing tuple-type and set-type attributes. The nested relational model allows users to view the database in a representation that is closer to the conceptual view of the real world. Complex objects with an internal hierarchical structure can be represented as one complex relation instead of being normalized into a number of flat relations as in the relational model.

The nested relational model, as we have seen earlier, however, has a few limitations that prevent its immediate use with document databases. First and foremost, the nested relational model is rigid, and does not allow any differences and exceptions in structure within a relation. In XML, however, it is common to have multiple alternative structures for the same complex type. Moreover, nested relations do not allow recursion in structure, and the attribute names must all be unique in a relation at any level of nesting. These limitations make the nested relational model difficult to use with XML documents. HNR removes all of these limitations.

To describe HNR, we first present the essential formal components of the model. Let $A$ be the universal set of attribute names and relation names. A document scheme or an HNR scheme is of the form $R(S)$, where $R \in A$ is the name of the HNR scheme, and $S$ is an ordered list of the form $(A_1, A_2, \ldots, A_n)$, where each $A_i$ is either an atomic type or another HNR scheme.

3.3 HNR Types

The concept of HNR types forms the core of the model. As expected, the type of an attribute is recursive, and is defined recursively as well. The domain of values for a given relation, $D(R)$, is constructed inductively from the domains of its components, as follows:

1. Atomic type: The domain of an atomic attribute $A_i$ is its corresponding domain $D_i$ of values.
2. Relation type: Each member of a relation type $R(A_1, A_2, \ldots, A_n)$ is a tuple with that structure, so the domain of values is the cross product of the individual domains, $D(A_1) \times D(A_2) \times \cdots \times D(A_n)$.
3. Multi-valued type: An attribute can be of a multi-valued type, where the domain of the attribute is the powerset of the corresponding sub-domain, represented as $\mathrm{powerset}(D(A_i))$.
4. Heterogeneous type: An attribute can be of heterogeneous type, indicating that it may belong to any of a number of different subtypes. In this case, the domain of the attribute is the union of the domains of all the subtypes belonging to that relation, $D(A_i) = D(A_{i1}) \cup D(A_{i2}) \cup \cdots \cup D(A_{in})$.

To take an example, let us consider a very simple Document Type Definition, from the W3C XML Query use cases [5], translated into its ENF equivalent:


<!ELEMENT bib (book* )>
<!ELEMENT book (title, (author+ | editor+ ), publisher, price, ATTV_year )>
<!ELEMENT author (last, first )>
<!ELEMENT editor (last, first, affiliation )>
<!ELEMENT title (#PCDATA )>
<!ELEMENT last (#PCDATA )>
<!ELEMENT first (#PCDATA )>
<!ELEMENT affiliation (#PCDATA )>
<!ELEMENT publisher (#PCDATA )>
<!ELEMENT price (#PCDATA )>
<!ELEMENT ATTV_year (#PCDATA )>
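As a worked illustration, the four domain rules of Section 3.3 can be applied to this DTD. The derivation below is a sketch rather than part of the formal model: the name author|editor for the combined attribute is chosen here for readability, every #PCDATA element is assumed to be atomic, and the powerset rule is used exactly as stated, so it does not by itself capture the "at least one" requirement of the + operator.

```latex
\begin{align*}
D(\mathit{bib}) &= \mathrm{powerset}\bigl(D(\mathit{book})\bigr) \\
D(\mathit{book}) &= D(\mathit{title}) \times D(\mathit{author}\,|\,\mathit{editor})
                   \times D(\mathit{publisher}) \times D(\mathit{price})
                   \times D(\mathit{ATTV\_year}) \\
D(\mathit{author}\,|\,\mathit{editor}) &= \mathrm{powerset}\bigl(D(\mathit{author})\bigr)
                   \cup \mathrm{powerset}\bigl(D(\mathit{editor})\bigr) \\
D(\mathit{author}) &= D(\mathit{last}) \times D(\mathit{first}) \\
D(\mathit{editor}) &= D(\mathit{last}) \times D(\mathit{first}) \times D(\mathit{affiliation})
\end{align*}
```

The third line is where heterogeneity enters: a book carries either a collection of authors or a collection of editors, and the two alternatives have different tuple structures.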

An XML document in the above DTD, when viewed as an HNR scheme, would look like Figure 1. The tabular representation is for viewability only; a conceptual representation would be more like a tree.

Title                         Author|Editor                      Publisher                    Price    ATTV_year
Data on the Web               Author                             Morgan Kaufmann Publishers   39.95    2000
                                last        first
                                Abiteboul   Serge
                                Buneman     Peter
                                Suciu       Dan
The economics of Technology   Editor                             Kluwer Academic Publisher    129.95   1999
and Content for Digital TV      last      first   affiliation
                                Gerbarg   Darcy   CITI

Figure 1: A representation of a Heterogeneous Nested Relation
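The four kinds of HNR types and the relation of Figure 1 can be sketched as data structures. The Python fragment below is a minimal illustration, not the authors' implementation: the class names `Atomic`, `Relation`, `MultiValued` and `Heterogeneous` and the function `conforms` are hypothetical, atomic values are assumed to be strings, and multi-valued attributes are held in Python lists so that their values keep an order.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Atomic:
    name: str


@dataclass(frozen=True)
class Relation:
    name: str
    attributes: Tuple["HNRType", ...]  # ordered left to right


@dataclass(frozen=True)
class MultiValued:
    member: "HNRType"


@dataclass(frozen=True)
class Heterogeneous:
    alternatives: Tuple["HNRType", ...]


HNRType = Union[Atomic, Relation, MultiValued, Heterogeneous]


def conforms(value, t) -> bool:
    """Check whether a Python value lies in the domain of HNR type t."""
    if isinstance(t, Atomic):
        return isinstance(value, str)
    if isinstance(t, Relation):
        return (isinstance(value, tuple)
                and len(value) == len(t.attributes)
                and all(conforms(v, a) for v, a in zip(value, t.attributes)))
    if isinstance(t, MultiValued):
        # Note: like the powerset rule, this also accepts an empty collection.
        return isinstance(value, list) and all(conforms(v, t.member) for v in value)
    if isinstance(t, Heterogeneous):
        return any(conforms(value, alt) for alt in t.alternatives)
    return False


last, first, affiliation = Atomic("last"), Atomic("first"), Atomic("affiliation")
author = Relation("author", (last, first))
editor = Relation("editor", (last, first, affiliation))
book = Relation("book", (
    Atomic("title"),
    Heterogeneous((MultiValued(author), MultiValued(editor))),
    Atomic("publisher"),
    Atomic("price"),
    Atomic("ATTV_year"),
))
bib = MultiValued(book)

# The two tuples of Figure 1.
data_on_the_web = (
    "Data on the Web",
    [("Abiteboul", "Serge"), ("Buneman", "Peter"), ("Suciu", "Dan")],
    "Morgan Kaufmann Publishers", "39.95", "2000",
)
digital_tv = (
    "The economics of Technology and Content for Digital TV",
    [("Gerbarg", "Darcy", "CITI")],
    "Kluwer Academic Publisher", "129.95", "1999",
)
# A book mixing an author with an editor matches neither alternative.
mixed = ("t", [("a", "b"), ("c", "d", "e")], "p", "1", "2")

both_conform = conforms([data_on_the_web, digital_tv], bib)
mixed_conforms = conforms(mixed, book)
```

Here `both_conform` is true even though the two tuples have differently structured second attributes, which is exactly what a plain nested relation would reject, while `mixed_conforms` is false because the heterogeneous type offers a choice between whole alternatives rather than between individual members.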

Order and Sequence. Sequence is an important element in XML data, and is treated somewhat differently than in relational (and even nested relational) data. In HNR, relations are strictly sets, i.e., there is no order between the tuples in a relation. However, the concept of sequence and ordering is important in the attributes. Attributes are ordered left-to-right (so, in Figure 1, title is before price, and that order is significant). In addition, for multi-valued attributes, the values of the attributes are in order as well. Hence, "Serge Abiteboul" would be considered the first author of the book titled "Data on the Web".


4 Applications

The main use of a formal model is for developing succinct theoretical foundations for query languages and processing mechanisms. Just as the relational model is used in the development of relational algebra and calculus, and in the formulation of SQL, our model can just as easily be used to develop such formal and practical query languages. In this section, we briefly describe three equivalent languages - an algebra-based language, a calculus-based language, and an SQL language. Technical details of these languages and proofs of their properties are beyond the scope of this paper.

4.1 The HNR Calculus (HNRC)

The HNR Calculus (HNRC), or simply Document Calculus (DC), is a calculus based on the HNR model. This calculus language is developed in the same spirit as the relational calculus, and hence has a similar look-and-feel.

Types. Types in this language include atomic domain types, tuple types, multi-valued (set) types, and heterogeneous types.

• Domain type: $D$ is the simplest atomic type.

• Tuple type: $[A_1 : \tau_1, \ldots, A_n : \tau_n]$ is a composite type, where $A_1, \ldots, A_n$ are attributes with types $\tau_1, \ldots, \tau_n$ respectively.

• Set type: $\{t\}$ is the type of a multi-valued attribute.

• Heterogeneous type: a heterogeneous type is represented as $\tau = \tau_1 | \tau_2 | \ldots | \tau_n$. The domain of a heterogeneous type is just the union of the domains of the component types.

A typed relation in HNRC is represented as $R^{\tau}$, referring to objects in $R$ having type $\tau$. An HNRC expression can be defined inductively as follows:

• Base expression: $R^{\tau}(x)$

• Path terms: $x.P$ is a path term obtained by traversing the path $P$ from $x$. We use the concept of a simple path expression (SPE), which includes only direct child traversal (.), descendant traversal (..), and index selection to choose the $i$th item in a sequence. For example, the path expression book..author[1].last selects the last name of the first author in a book, where author could be at any level of nesting under the level of book.

• Predicates:

– Comparison with atoms: $x.P = \text{`a'}$ is an equality comparison, where $P$ is an SPE. Inequality operators such as != >=