University of California Los Angeles

XML-Based Support for Database Histories and Document Versions

A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Computer Science

by

Fusheng Wang

2004

© Copyright by Fusheng Wang 2004

The dissertation of Fusheng Wang is approved.

Douglas S. Bell

Wesley W. Chu

D. Stott Parker

Carlo Zaniolo, Committee Chair

University of California, Los Angeles 2004


To my family


Table of Contents

1 Introduction

2 State of the Art
  2.1 Historical Information Management
    2.1.1 Temporal Databases and Grouped Representation
    2.1.2 Time in XML
    2.1.3 Version Management
  2.2 XML: the Universal Language for the Web
    2.2.1 Advantages of XML
    2.2.2 XML Usage
  2.3 XML Query Languages
    2.3.1 Overview
    2.3.2 XQuery
    2.3.3 Summary
  2.4 Publishing Relational Databases in XML

3 An XML-based Approach to Publishing and Querying the History of Databases
  3.1 Introduction
  3.2 History of Database Tables as XML Documents
    3.2.1 Publishing Each Table as an XML Document with Columns as Elements
  3.3 Temporal Queries using XQuery
    3.3.1 More Complex Queries
    3.3.2 Temporal Functions
    3.3.3 Support for ‘now’
  3.4 Temporal Transformations and Visualization with XSLT
  3.5 Representing and Querying Schema Histories in XML
    3.5.1 Schema History Queries
  3.6 Summary

4 Using XML to Build Transaction-Time Support in Relational Databases
  4.1 The ArchIS System
    4.1.1 H-tables
    4.1.2 Updating Table Histories
    4.1.3 Query Mapping
    4.1.4 Function Mapping
    4.1.5 Temporal Clustering and Indexing
  4.2 Performance Study
    4.2.1 Experimental Setup
    4.2.2 Query Performance
    4.2.3 Storage Utilization
  4.3 Database Compression
    4.3.1 Block-based Compression: BlockZIP
    4.3.2 Storage Utilization with Compression
    4.3.3 Query Performance with Compression
    4.3.4 Update Performance
  4.4 Summary

5 XBiT: An XML-based Bitemporal Data Model
  5.1 Temporal ER Modeling
    5.1.1 Introduction
    5.1.2 An Example
  5.2 Valid Time History in XML
    5.2.1 Valid Time Temporal Queries
    5.2.2 Temporal Operators
    5.2.3 Database Modifications
  5.3 An XML-based Bitemporal Data Model
    5.3.1 The XBiT Data Model
    5.3.2 Bitemporal Queries with XQuery
    5.3.3 Database Modifications
    5.3.4 Temporal Database Implementations
  5.4 Summary

6 General XML Documents
  6.1 Introduction
  6.2 A Logical Model for Versions
    6.2.1 Change Management
    6.2.2 A General Approach: the ICAP Project
  6.3 Complex Queries
    6.3.1 Temporal Functions
  6.4 Building V-Documents
    6.4.1 Handling Attributes
    6.4.2 Mixed Content
    6.4.3 Schema of the V-Document
    6.4.4 XChronicler: Generate V-Document from Versions
    6.4.5 Visualizing Versions
  6.5 Summary

7 Storage of Temporal XML Documents
  7.1 Current Approaches to XML Databases
    7.1.1 Categories of XML Databases
    7.1.2 Storage of XML Data in Object-Relational Database Systems
    7.1.3 Storage of XML data in Native XML Databases
  7.2 Storage of Temporal XML Documents
  7.3 Summary

8 Conclusion

References

List of Figures

3.1 The History of the employee Table Is Published as employees.xml
3.2 The History of the dept Table Is Published as depts.xml
3.3 Snapshot Query with XSLT
3.4 History Query with XSLT
3.5 The XSLT Document to Visualize Changes between Two Timestamps
4.1 ArchIS: the Archival Information System
4.2 Relative Storage Size with Different Umin
4.3 Query Performance of Segment-based Archiving on RDBMS vs Native XML DB
4.4 Query Performance on Historical Data with Segment-based Clustering vs without Clustering
4.5 Compression Ratios of Historical Data Storage on Different Systems (I)
4.6 Compression Ratios of Historical Data Storage on Different Systems (II)
4.7 Query Performance with Compression
5.1 TEER Schema Modeling of Employees and Departments (with Time Semantics Added)
5.2 The Valid Time History of Employees
5.3 XML Representation of the Valid-time History of Employees (VH-document)
5.4 Temporally Grouped Valid Time History of Employees
5.5 Bitemporal History of Employees
5.6 XML Representation of the Bitemporal History of Employees (BH-document)
5.7 Temporally Grouped Bitemporal History of Employees
6.1 Web Information Warehouses
6.2 Sample Versioned XML Documents
6.3 XML Representation of a Versioned Document
6.4 DTD of Snapshot XML Documents
6.5 DTD of the V-Document
6.6 XML Schema of the V-Document
7.1 Storage Organization

List of Tables

3.1 The Snapshot History of Employees
3.2 The Snapshot History of Departments
4.1 Temporal Queries on Archived History
4.2 BlockZIP Compression

Acknowledgments

First, I would like to express my deep appreciation to my advisor, Professor Carlo Zaniolo. He has guided me not only on numerous research and technical problems, but also on how to do research. His profound knowledge, his dedication to research and science, and his kindness and optimistic attitude toward the world will always inspire me.

I am also very thankful to Guogen Zhang at IBM Silicon Valley Lab for his continuous help with my research. In particular, he gave me the chance to prepare a solid background to start my dissertation. I am also grateful to Professor D. Stott Parker, Professor Wesley W. Chu, and Professor Douglas S. Bell for their participation on my doctoral committee and for taking the time to guide me through my dissertation.

I would like to thank my colleagues Xin Zhou, Chang Luo and Emanuele Ottavi for their contribution to the experiments and implementation of the ArchIS system. I am also fortunate to have worked closely with other members of the Web Information System Lab: Fang Chu, Yijian Bai, Yan-Nei Law, Hyun Jin Moon, Hetal Thakkar, Shu-Yao Chien, Cindy Chen, and Haixun Wang.

Finally, I am grateful to my family from the bottom of my heart for their unconditional support. In particular, I would like to thank my wife, Hua Gan, for her love and encouragement. Last but not least, I would like to thank my newborn son, Andrew, whose beaming smile always brings joy to my heart and brightens my day.


Vita

1994  B.S. in Physics, Tsinghua University, Beijing, China
1997  M.S. in Engineering Physics, Tsinghua University, Beijing, China
2000  M.S. in Computer Science, University of California, Los Angeles, USA
2004  Ph.D. in Computer Science, University of California, Los Angeles, USA

Publications

F. Wang, C. Zaniolo, Temporal Queries and Version Management for XML Document Archives. Submitted for Journal Publication.

F. Wang, C. Zaniolo, An XML-Based Approach to Publishing and Querying the History of Databases. To Appear in World Wide Web: Internet and Web Information Systems, Special Issue on Web Information Systems, Kluwer, 2005.

F. Wang, X. Zhou, C. Zaniolo, Using XML to Build Efficient Transaction-Time Temporal Database Systems on Relational Databases. Submitted for Conference Publication.


F. Wang, X. Zhou, C. Zaniolo, Temporal Information Management using XML. In Proc. of the 23rd International Conference on Conceptual Modeling (ER’04), Shanghai, China, November 2004.

F. Wang, C. Zaniolo, XBiT: An XML-based Bitemporal Data Model. In Proc. of the 23rd International Conference on Conceptual Modeling (ER’04), Shanghai, China, November 2004.

F. Wang, C. Zaniolo, Publishing and Querying the Histories of Archived Relational Databases in XML. In Proc. of the 4th International Conference on Web Information Systems Engineering (WISE’03), Roma, Italy, December 2003.

F. Wang, C. Zaniolo, Representing and Querying the Evolution of Databases and their Schemas in XML, In Proc. of International Workshop on Web Engineering (held in conjunction with SEKE’03), San Francisco Bay, USA, July 2003.

F. Wang, C. Zaniolo, Temporal Queries in XML Document Archives and Web Warehouses. In Proc. of the 10th International Symposium on Temporal Representation and Reasoning and 4th International Conference on Temporal Logic (TIME-ICTL’03), Queensland, Australia, July 2003.

F. Wang, C. Zaniolo, Preserving and Querying Histories of XML-Published Relational Databases. In Proc. of the Second International Workshop on Evolution and Change in Data Management (ECDM02) (held in conjunction with ER’02), Tampere, Finland, October 2002.


Abstract of the Dissertation

XML-Based Support for Database Histories and Document Versions

by

Fusheng Wang
Doctor of Philosophy in Computer Science
University of California, Los Angeles, 2004
Professor Carlo Zaniolo, Chair

Current database systems do not provide effective means for archiving the database history and querying past snapshots of the database and its temporal evolution. Better support for temporal applications by database systems represents an important objective that is difficult to achieve, since it requires an integrated solution for technical problems that are challenging on their own, including (i) expressive temporal representations and data models, (ii) powerful languages for temporal queries and snapshot queries, (iii) indexing, clustering and query optimization techniques for managing temporal information efficiently, and (iv) architectures that bring together the different pieces of enabling technology into a robust system.

There is much current interest in publishing and viewing databases as XML documents. The general benefits of this approach follow from the popularity of XML and the tool set available for processing information encoded in this universal standard. In this dissertation, we explore the additional and unique benefits achieved by this approach on temporal database applications. We show that XML with XQuery can provide surprisingly effective solutions to the problem of supporting historical queries on past contents of database relations and their evolution. Indeed, using XML, the histories of database relations can be represented naturally using a temporally grouped data model, and complex temporal queries can be expressed in XQuery without requiring extensions to the current standard.

Therefore, we present the ArchIS system that achieves these benefits. ArchIS’ architecture uses (a) XML to support a temporally grouped (virtual) representation of the database history, (b) XQuery to express powerful temporal queries on such views, (c) temporal clustering and indexing techniques for managing the actual historical data in an RDBMS, and (d) SQL/XML for executing the queries on the XML views as equivalent queries on the relational database. Extensive performance studies show that ArchIS is quite effective at storing the transaction-time history of relational databases and retrieving it under complex query conditions. By supporting database compression as an option, ArchIS also achieves excellent storage efficiency for archived histories. Therefore, ArchIS delivers full-functionality transaction-time databases without requiring temporal extensions in XML or database standards. Moreover, these techniques can be extended to valid-time and bitemporal databases, where complex temporal queries can also be expressed in standard XQuery.

The temporal modeling and querying approach proposed in this thesis is very general and can be extended to arbitrary XML documents. In our extension, we manage the XML document revision history effectively by (i) representing concisely the successive versions of a document as another XML document that implements a temporally grouped data model, (ii) using structured diff algorithms to build such documents, and (iii) using XQuery to express complex temporal queries on the evolution of the document structure and its content.


CHAPTER 1
Introduction

Because of its simple syntax and self-describing semantic structure, XML is fast becoming the standard information exchange language for web-based applications [8]. As a result, XML is exerting a major influence on existing application areas, particularly information systems, as these are quickly converging toward a web-based integration. The enabling technology of these application domains is also profoundly affected by this unifying trend. In particular, XML and databases are converging and influencing each other in many significant ways, and in the process, fostering major R&D efforts by companies and universities. Recent efforts have focused on the following three problems:

• Database Publishing. A large amount of information is stored in existing databases, and such information needs to be efficiently published as XML documents and exchanged on the Web.

• DBMS Extensions. As XML documents need to be efficiently stored and manipulated in XML repositories, commercial DBMS vendors are extending their Object-Relational Systems to support the storage and retrieval of XML documents. Various schemes are used to decompose (shred) documents into separate tables to enhance query performance, and a new XML data type is proposed.

• Native XML repositories. These directly support data definition languages based on XML standards, such as XML Schema [28]. Approaches that convert Object Oriented DBMSs into native XML repositories also fall in this group.

The convergence of XML and databases is now taking a major step forward with the introduction into the standards of XQuery, a powerful query language which extends SQL and is uniformly applicable to the three scenarios described above. XQuery [30] can specify XML documents on the basis of their complex structure and can serve as the bridge between XML repositories and the Web. Thus, with XQuery, databases and the Web can be more easily integrated; also, its expressive power provides opportunities to support new applications. In particular, the uniqueness of XML and XQuery provides an opportunity to manage evolving information.

XQuery uses path expressions to specify XML documents on the basis of their complex structure. Queries containing path expressions are difficult to support efficiently, and much current research focuses on indexing and query processing schemes to implement such queries efficiently.

A new area of research focuses on the dynamic aspects of XML documents. Change management is critical for many applications; it includes the ability to update documents and to deliver changes to various subscribers, as well as the ability to support multiversion XML documents and complex queries on the history of documents.

In this dissertation, the state of the art is first reviewed with a focus on the technical problems and research challenges created by the rapid convergence of database systems and web information systems. Then, the dissertation concentrates on the following areas of research opportunity:

• expressive temporal representation and data models,
• powerful languages for temporal queries,
• efficient support for temporal queries,
• temporal extension of RDBMS via XML, and
• version management in XML document archives.

Outline of this Dissertation

The dissertation is organized as follows. In Chapter 2 we present an overview of the state of the art, including historical information management, version management, and XML and its query languages. Then, in Chapter 3, we discuss publishing and querying relational database history using XML and its query languages; efficient database support for the XML-viewed history is discussed in Chapter 4. Chapter 5 shows that similar techniques can be applied to valid-time and bitemporal database history, and discusses support for complex temporal queries and updates. In Chapter 6, we show that our approach is general and can be applied to history management for arbitrary XML documents, focusing on version management and temporal query support in XML document repositories. In Chapter 7 we discuss techniques for storing XML documents and temporal XML documents. Chapter 8 concludes the dissertation.


CHAPTER 2
State of the Art

2.1 Historical Information Management

2.1.1 Temporal Databases and Grouped Representation

A large number of temporal data models have been proposed, and the design space for the relational data model has been exhaustively explored [82]. Clifford et al. [57] classified them into two main categories: temporally ungrouped and temporally grouped. The second representation is said to have more expressive power and to be more natural since it is history-oriented [57]. TSQL2 [105] tries to reconcile the two approaches [57] within the severe limitations of relational tables. Our approach is based on a temporally grouped data model, since this dovetails perfectly with the hierarchical structure of XML documents. TimeDB [96] is a layered architecture that translates temporal queries into queries on an underlying RDBMS, where temporal data is represented as tuples with intervals and is thus temporally ungrouped. Object-oriented temporal models are compared in [92], and a formal temporal object-oriented data model is proposed in [39]. The problem of version management in object-oriented and CAD databases has received a significant amount of attention [38, 55]. However, support for temporal queries is not discussed in that work, although query issues relating to time multigranularity were discussed in [63].
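To make the distinction between the two categories concrete, the following Python sketch contrasts them on a toy example. The employee data, the attribute names, and the helper `group_history` are hypothetical and serve only as an illustration; they are not taken from the models cited above.

```python
from collections import defaultdict

# Temporally ungrouped: each tuple carries one interval, so the history of an
# employee is scattered over several tuples. (Hypothetical sample data.)
ungrouped = [
    # (name, salary, title, tstart, tend)
    ("Bob", 60000, "Engineer", "1995-01-01", "1995-05-31"),
    ("Bob", 70000, "Engineer", "1995-06-01", "1995-09-30"),
    ("Bob", 70000, "Sr Engineer", "1995-10-01", "1996-01-31"),
]

def group_history(rows):
    """Build a temporally grouped view: one entry per employee, with a
    separate, coalesced history for each attribute."""
    grouped = defaultdict(lambda: {"salary": [], "title": []})
    for name, salary, title, tstart, tend in sorted(rows, key=lambda r: (r[0], r[3])):
        for attr, value in (("salary", salary), ("title", title)):
            history = grouped[name][attr]
            if history and history[-1][0] == value:
                # Same value as in the previous period: extend that period.
                # For brevity, the sketch assumes consecutive periods are adjacent.
                history[-1] = (value, history[-1][1], tend)
            else:
                history.append((value, tstart, tend))
    return dict(grouped)

grouped = group_history(ungrouped)
```

In the grouped view, the history of each attribute can be read off directly, whereas the ungrouped tuples must first be coalesced to answer a question such as "how long did Bob earn 70000?".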


Recently Oracle implemented Flashback [13], an advanced recovery technology that allows users to roll back to old versions of tables when they decide that errors have occurred. However, Flashback provides only limited support for temporal queries, since it only supports queries on the recent past by reading update logs.

2.1.2 Time in XML

Some interesting research work has recently focused on the problem of representing historical information in XML. In [44], an annotation-based object model is proposed to manage historical semistructured data, and a special Chorel language is used to query changes. In [72], a new markup tag for XML/HTML documents is proposed to support valid time on the Web, so that temporal visualization can be implemented on web browsers with XSLT. In [70], a dimension-based method is proposed to manage changes in XML documents; however, how to support queries is not discussed. In [34, 35], a data model is proposed for temporal XML documents. However, since a valid interval is represented as a mixed string, queries have to be supported by extending DOM APIs or XPath. Similarly, TTXPath [62] is another extension of the XPath data model and query language to support transaction-time semantics. (In our approach, we instead support XPath/XQuery without any extension to XML data models or query languages.) The τXQuery language is proposed in [69] to extend XQuery for temporal support by providing new constructs for the language. An archiving technique for scientific data was presented in [42], based on an extension of the SCCS [85] scheme. This approach timestamps elements only when they are different from the parent elements, so the structure of the representation is not fixed; this makes it difficult to support queries in XPath/XQuery, a topic that, in fact, is not discussed in [42]. The scheme we use here to publish the histories of relational tables presents several similarities to that proposed in [42], but it also provides full support for XML query languages such as XPath and XQuery.
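The point about a fixed structure can be illustrated with a small Python sketch. It assumes, purely for illustration, that every element carries its own interval in `tstart` and `tend` attributes; these names and the sample document are assumptions and are not defined in this chapter. Because the timestamps sit in the same place on every element, a snapshot can be extracted with ordinary path-based tooling. Python's `xml.etree.ElementTree` supports only a subset of XPath, so the interval test is written in Python here.

```python
import xml.etree.ElementTree as ET

# Hypothetical history document: every element carries its own interval,
# so the structure is the same no matter how often values change.
DOC = """
<employees>
  <employee tstart="1995-01-01" tend="1996-12-31">
    <name tstart="1995-01-01" tend="1996-12-31">Bob</name>
    <salary tstart="1995-01-01" tend="1995-05-31">60000</salary>
    <salary tstart="1995-06-01" tend="1996-12-31">70000</salary>
  </employee>
</employees>
"""

def snapshot(root, path, date):
    """Return the text of elements matching `path` that are valid on `date`.
    ISO dates compare correctly as plain strings."""
    return [
        e.text
        for e in root.findall(path)
        if e.get("tstart") <= date <= e.get("tend")
    ]

root = ET.fromstring(DOC)
salary_in_july = snapshot(root, "./employee/salary", "1995-07-15")
```

If timestamps were attached only to some elements and inherited by others, the same query would also have to walk up the ancestor chain, which is the difficulty noted above for the scheme of [42].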

2.1.3 Version Management

Change Detection. LaDiff [45] is a change detection algorithm for semistructured information, which approaches the problem by dividing it into (i) the Good Matching problem, and (ii) the Minimum Conforming Edit Script problem. A diff algorithm for XML documents, XyDiff, is proposed in [80, 59]. To match the largest identical parts of both documents, the algorithm utilizes ID attribute information, and a signature and a weight are computed bottom-up for each node. X-Diff [103] detects changes in XML documents based on an unordered model, which applies to published relational data. By utilizing node signatures and node XHash values, the algorithm tries to find the minimum-cost matching. Microsoft XMLDiff [1] provides a tool to generate an XML diff and represent it as another XML document.

Different approaches for supporting multiversion documents use different schemes for storing these deltas on secondary storage and processing them for version reconstruction and query execution. These are discussed next.

Managing the Deltas. The RCS [98] scheme stores the most current version plus reverse editing scripts to retrieve previous versions. An improvement to RCS was proposed in [48], where a temporal clustering scheme is introduced to improve the efficiency of retrieving past versions from secondary storage. These approaches lack an XML-compatible logical representation that can be used to support complex queries [48]. RBVM [49] unifies the logical representation and the physical one, by representing objects that remain unchanged as references to the objects in old versions. However, this scheme can only handle simple queries; in fact, different storage representations are needed for more complex queries [50, 51].

Another scheme often used in the past for version control is SCCS [85]. In SCCS, each text segment is clustered with all its successive changes; also, pairs of timestamps are associated with the segments and their changes to specify their lifespans. Version retrieval is performed by scanning the file and retrieving valid segments according to timestamps. At the physical level, this is a source of much inefficiency, since reconstruction of a single version now requires a complete scan of the file, which becomes larger and larger as successive versions are added. The addition of indexes [85] only improves the situation up to a point, since the segments belonging to the same version are not clustered together; thus retrieving a version can require accessing a different page for each of its segments [48].
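The retrieval cost of a timestamp-based scheme of this kind can be seen in a minimal Python sketch. It is a simplification under stated assumptions, not the actual SCCS file format: the segments, the use of integer version numbers in place of timestamps, and the function `retrieve_version` are all hypothetical.

```python
# Each segment carries a (created, deleted) pair of version numbers;
# deleted=None means the segment is still live. (Hypothetical sample file.)
SEGMENTS = [
    ("Chapter 1", 1, None),
    ("Old intro.", 1, 3),
    ("New intro.", 3, None),
    ("Appendix", 2, None),
]

def retrieve_version(segments, version):
    """Reconstruct one version by scanning the whole file and keeping the
    segments whose lifespan covers `version`."""
    scanned = 0
    lines = []
    for text, created, deleted in segments:
        scanned += 1
        if created <= version and (deleted is None or version < deleted):
            lines.append(text)
    return lines, scanned

v2, cost = retrieve_version(SEGMENTS, 2)
```

Every call visits all stored segments, including those that belong only to other versions, so the cost of retrieving any single version grows with the total size of the archive rather than with the size of that version.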

2.2 XML: the Universal Language for the Web

With the popularity of the World Wide Web, Extensible Markup Language (XML) [8] has emerged as a universal standard for Web documents. It is designed as a restricted subset of SGML, and its role has expanded from a semantic markup language for online documents to a broad range of applications, including data interchange, data integration, content publishing, and so on. Meanwhile, XML also brings a technology shift that provides new opportunities and poses new challenges for information management.


2.2.1 Advantages of XML

• XML is simple. XML is a text-based language, where markup tags clearly specify what kind of information they enclose. Since tags and contents are stored together, information can be understood without the application that created it. XML documents are easy for humans to read, and can be easily processed by machines as well. There are many XML parsers and processors that make XML processing convenient.

• XML is semantically structured. The structure of XML also represents the semantics of the data, which provides a great opportunity for information retrieval. Since XML documents can be categorized by their schemas, results can be made more precise by limiting the search to specific schemas. Meanwhile, ambiguous words in the search can be distinguished by the context they appear in. For example, if we search for the keyword “Washington” without further information, the search engine cannot tell whether it is a search for a person’s name or a state name. If “Washington” is enclosed in a tag that identifies it as, say, a state name, then search results will be much more precise. With the structure information, a specific relevant part of an XML document can also be retrieved.

• XML is an extensible meta-language. With XML, we can define whatever information representation we need. One of the goals of the XML standard is its extensibility. XML has been used as a standardization mechanism to define platform-independent grammars for specific purposes and schemas for all kinds of data models. DTD [8] is the standard schema language for SGML and XML, which can be used to define the syntax and structure of XML documents. XML Schema [28] is a much more comprehensive schema language for XML, expressed in the syntax of XML. XML also bridges the gap between document-oriented and record-oriented processing. With some rules, relational data can be encoded as XML documents, and XML documents can be shredded into relational tables or composed from relational tables.

• XML separates data from presentation. XML is very different from HTML in the sense that HTML mixes both data and presentation information in a single document, while XML separates presentation from data. Extensible Stylesheet Language (XSL) provides a mechanism to transform one XML document structure into another XML document, or to format an XML document for HTML display or for other devices.

• XML is multilingual. XML is based on Unicode, so it can represent most languages in the world, including Chinese and Japanese. XML tools can be written once and used for most languages. This greatly facilitates data and document exchange in a multilingual world.

• XML is open. XML is a standard endorsed by most major software vendors. They invest heavily in XML, and actively contribute to the standardization of XML. With XML as the base standard, there are many XML-related application standards, such as GML [9] and ebXML [7].
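The remark that relational data can be encoded as XML and shredded back into tables can be sketched in a few lines of Python. The encoding rule used here (one element per row, one child element per column), the table and column names, and the helpers `publish` and `shred` are assumptions chosen for illustration; many other mapping rules are possible.

```python
import xml.etree.ElementTree as ET

COLUMNS = ["empno", "name", "salary"]
ROWS = [("1001", "Bob", "60000"), ("1002", "Alice", "75000")]  # hypothetical data

def publish(table, columns, rows):
    """Encode a table as XML: one child element per row, one grandchild per column."""
    root = ET.Element(table + "s")
    for row in rows:
        row_el = ET.SubElement(root, table)
        for col, value in zip(columns, row):
            ET.SubElement(row_el, col).text = value
    return root

def shred(root, columns):
    """Shred the XML back into tuples by reading the column elements in order."""
    return [tuple(row_el.findtext(col) for col in columns) for row_el in root]

doc = publish("employee", COLUMNS, ROWS)
xml_text = ET.tostring(doc, encoding="unicode")
round_trip = shred(ET.fromstring(xml_text), COLUMNS)
```

With this simple rule the round trip is lossless for string-valued columns; typed columns and NULL values would need additional conventions that the sketch leaves out.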

2.2.2 XML Usage

Thanks to these characteristics and advantages, XML has been widely used in the following application areas.


• Data Exchange/Messaging. In the real world, information is heterogeneous: computer systems and databases contain data in different formats. With its flexibility in defining industry-specific vocabularies, XML provides a standard framework for exchanging different types of data. A typical application of data exchange is Web services: a Web-based framework for information exchange based on XML, which uses SOAP [16] to encode messages.

• Application Integration. Enterprise Application Integration (EAI) links diverse applications, platforms, and operating systems so that they work as one. As applications grow more complex and diverse, EAI becomes more complicated and expensive to implement. As a neutral language, XML provides a way to integrate applications more easily.

• Data Integration. A data integration system uses an automated method for querying across multiple heterogeneous databases in a uniform way. XML can be used to provide a virtual view of heterogeneous data that hides the heterogeneity from users, and information can be retrieved by querying the view.

• Content Publishing. There are many challenges for content management systems, for example, interconnected hypermedia, dynamic content combining databases and structured text, and cross-media publishing with content re-use. XML is a good match for content management: XML and namespaces can provide highly structured content, XML schemas and query languages support efficient content management, and XSL can flexibly customize and lay out content. With the integration of XML into content management, content can be more reusable, interoperable, and extensible.

2.3 XML Query Languages

Since XML provides a powerful model for representing information, there is much current interest in using XML to represent and manage information stored in databases. In fact, XML provides many of the database-like capabilities needed for the task, including schema languages (DTDs, XML Schema), query languages (XPath, XQuery), and programming interfaces (SAX, DOM, JDOM). Much current R&D work focuses on providing management systems for XML document repositories: these systems will have to efficiently support storage, indexes, security, transactions, and complex content-based queries. Therefore, all the major database vendors are now busy extending commercial ORDBMSs to store and process XML documents [18, 12, 5, 2]. Meanwhile, native XML database management systems are emerging to store XML [20], and some OODBMSs are being converted into XML database systems [17]. The extent to which XML and databases are converging and influencing each other is best illustrated by the recent developments in XML query languages discussed next.

2.3.1 Overview

XML query languages are used for querying, updating, transforming, and integrating XML data. They can extract fragments from XML data, and they can also transform XML data into new structures by grouping, aggregating, joining, nesting, element-constructing, and so on. XML query languages are often used as data integration languages. They can query a collection of XML documents and return a single XML document, or join data from different sources. Some XML query languages can also define XML views on multiple sources. While updates are an essential feature of database query languages, with a few exceptions [32, 97], most current XML query languages do not support updates. There are many other features that differentiate XML query languages, such as support for regular path expressions, order, similarity queries, namespaces, datatypes, functions, and so on [30].

2.3.2 XQuery

XQuery [30] is the XML query language proposed by the W3C. It was first released in February 2001, and the latest draft was released in November 2003. One of the design goals of XQuery is expressive power. XQuery can be implemented in many environments, such as traditional databases, XML repositories, or XML programming libraries, and can combine data from multiple sources. The language is based on Quilt, and borrows many features from other query languages.

Data Model. XQuery 1.0 [30] uses the same data model as XPath 2.0 and XSLT 2.0. The data model is a node-labeled, tree-constructor representation with the concept of node identity, which can be used for IDREF, XPointer, and URI values. In the data model, all the nodes of an XML document are ordered. In the XQuery data model [30], values fall into five categories. Node values comprise eight distinct kinds of nodes: document, element, attribute, text, namespace, processing instruction, comment, and reference. Simple values are of two types: primitive values and derived simple values. A primitive value belongs to one of the 19 primitive types defined in XML Schema, and a derived simple value corresponds to its derived simple type. A sequence value is an ordered collection of nodes. The data model can represent XML documents, well-formed fragments of a document, a sequence of documents, or a sequence of document fragments. An instance of the data model is always an ordered sequence of nodes (which can be nested).

XQuery Expressions. XQuery is a functional language, in which a query is an expression. XQuery is a case-sensitive language; its keywords use lower-case characters and are not reserved. A query consists of optional namespace declarations, optional function declarations, and query expressions. The following are the principal expressions in XQuery:

• Variables and constants. Variables have the prefix “$”, and constants can be numbers or strings.

• Function Calls. Call a function with a name and arguments.

• Constructors. Generate elements, attributes, or other nodes.

• Path expressions. XPath expressions used to locate part of an XML document.

• Expressions involving operators and functions. For example, x + y, -z * foo(x, y).

• Conditional expressions. An IF . . . THEN . . . ELSE clause evaluates a test expression and then returns one of the two result expressions.

• Quantified expressions. “SOME var IN expr SATISFIES expr” tests for the existence of some element that satisfies a condition, while “EVERY var IN expr SATISFIES expr” determines whether all elements in some collection satisfy a condition.

• FLWOR Expressions. These are discussed later.

Interesting features of XQuery include:

• Grouping. XQuery has several features to handle grouping. It has a distinct() function that can be used to bind variables to unique values, and the LET clause always binds a variable to a set of values. It also provides several aggregate functions, such as count(), avg(), and sum(), that can be applied to a set. The following query finds the part number and average price for parts that have at least 3 suppliers.

    FOR $pn in distinct(doc("catalog.xml")//partno)
    LET $i := doc("catalog.xml")//item[partno = $pn]
    WHERE count($i) >= 3
    RETURN {$pn} {avg($i/price)}
    order by (partno)

• Sorting. A sequence can be ordered by means of an ORDER BY clause that contains one or more ordering expressions.

• Collections in XQuery. In XQuery, collections can be ordered or unordered. Generally, collections are ordered; for example, “/bib/book/author” is an ordered collection of authors, and “LET $a := /bib/book” binds variable a to an ordered collection. Some functions can make the collection unordered; for example, “distinct(/bib/book/author)” returns an unordered collection.

• Functions. XQuery supports two types of functions: built-in functions and user-defined functions. Built-in functions can be scalar functions such as max and count, or string functions like distinct and contains. User-defined functions (UDFs) are defined in the XQuery syntax, and provide a native extensibility mechanism for XQuery. The following is a sample query that defines a function and calls it to return the depth of the XML document tree:

    NAMESPACE xsd = "http://www.w3.org/2001/XMLSchema"
    DEFINE FUNCTION depth($e) RETURNS xsd:integer
    {
      (: An empty element has depth 1; otherwise, add 1 to max depth of children :)
      IF (empty($e/*)) THEN 1
      ELSE max(depth($e/*)) + 1
    }
    depth(doc("partlist.xml"))
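For readers more familiar with general-purpose languages, the recursive depth function above can be approximated in Python with the standard xml.etree.ElementTree module. This is only an illustrative sketch, not part of XQuery; the inline PARTLIST document is a made-up stand-in for partlist.xml.

```python
import xml.etree.ElementTree as ET


def depth(element):
    """Depth of the tree rooted at element; a childless element has depth 1."""
    children = list(element)
    if not children:
        return 1
    # Otherwise add 1 to the maximum depth among the children.
    return 1 + max(depth(child) for child in children)


# Hypothetical stand-in for partlist.xml (not taken from the dissertation).
PARTLIST = (
    "<partlist>"
    "<part><name>bolt</name><sub><part/></sub></part>"
    "<part/>"
    "</partlist>"
)

tree_depth = depth(ET.fromstring(PARTLIST))
```

The longest chain here is partlist, part, sub, part, so the computed depth is 4.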

Path expressions. Path expressions are based on XPath [27]. XPath is an expression language primarily designed for addressing parts of an XML document represented as a tree of nodes, but it is also capable of performing some math, string manipulation, and boolean logic. A path expression begins with an expression that identifies a node or a sequence of nodes in a document. It can use the function doc(string) to return the root node or collection() to return a collection of documents from a URI, or it can begin with “/” or “//”, which represents an implicit root node determined by the current context. A path expression consists of a series of “steps”; each step represents movement through a document along a specified “axis”, and each step can apply predicates to filter out unqualified nodes. An axis can be child, descendant, parent, ancestor, self, attribute, and so on. The default axis is the child axis. Path expressions are often abbreviated, and the following is a list of some commonly used abbreviations:

    a         Element named a
    /a        Element named a at the root of the document
    a/b       Element b occurring as a direct child of element a
    a//b      Element b occurring any number of levels below a
    //a       Element a occurring any number of levels below root
    @a        Attribute named a
    *         Any element in current level
    @*        Any attribute of current element
    text()    Text node
    ..        The parent node
    b[n]      The nth b element
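Several of these abbreviations can be tried out with the limited XPath subset supported by Python's standard xml.etree.ElementTree module. The sketch below is only an illustration of the abbreviations, not an XQuery implementation; the small bib document and its contents are assumptions made for the example, and ElementTree evaluates paths relative to an element, so descendant steps are written as ".//".

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document (assumed for illustration only).
BIB = """
<bib>
  <book year="1994"><title>TCP/IP Illustrated</title><author>Stevens</author></book>
  <book year="2000"><title>Data on the Web</title>
    <author>Abiteboul</author><author>Buneman</author></book>
</bib>
"""
root = ET.fromstring(BIB)  # root is the <bib> element

titles = [t.text for t in root.findall("book/title")]    # a/b: direct child steps
authors = [a.text for a in root.findall(".//author")]    # a//b: any number of levels below
second_book = root.find("book[2]")                       # b[n]: the nth b element (1-based)
year = second_book.get("year")                           # @a: attribute named a
children = [c.tag for c in second_book.findall("*")]     # *: any element at the current level
parents = [p.tag for p in root.findall(".//author/..")]  # ..: the parent node
```

Each distinct parent is reported once, so parents contains one entry per book even though the second book has two authors.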

FLWOR expressions. A FLWOR expression is constructed from FOR, LET, WHERE, ORDER BY, and RETURN clauses. A FLWOR expression binds values to one or more variables and then uses these variables to construct a result.

FOR/LET clause. The first part of a FLWOR expression consists of FOR and/or LET clauses, which bind values to one or more variables. The values to be bound to the variables are specified by expressions, normally path expressions. A FOR clause starts a loop and iterates, in order, over the values returned by each variable’s respective expression. A LET clause binds one variable to one expression without iteration, so the variable is bound to the whole sequence of values.

WHERE clause. The WHERE clause is an optional clause used for further filtering. It may contain several predicates connected by AND or OR, and often contains references to the bound variables.

RETURN clause. This clause constructs the result of the FLWOR expression, and is executed once for each iteration of the FOR clause that satisfies the condition in the WHERE clause, preserving the order of these tuples. Element constructors are often used in the RETURN clause to construct new elements. The result is often a sequence of nodes, which can be nested.

Joins. XQuery can be used not only to extract fragments of an XML document, but also to transform XML data and integrate data from multiple sources. For example, the following query joins data from three XML documents by partno and suppno:

    {
      FOR $i IN doc("catalog.xml")//item,
          $p IN doc("parts.xml")//part[partno = $i/partno],
          $s IN doc("suppliers.xml")//supplier[suppno = $i/suppno]
      RETURN { $p/description, $s/suppname, $i/price }
      SORTBY(description, suppname)
    }
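To make the semantics of this three-way join concrete, here is a rough Python equivalent over small in-memory documents. It is a sketch under stated assumptions: the element names are taken from the query, the sample data is invented, and the lookup dictionaries are simply one convenient way to evaluate the two join predicates, not how an XQuery processor necessarily works.

```python
import xml.etree.ElementTree as ET

# Invented sample data standing in for catalog.xml, parts.xml and suppliers.xml.
CATALOG = ET.fromstring(
    "<catalog>"
    "<item><partno>p1</partno><suppno>s1</suppno><price>10</price></item>"
    "<item><partno>p2</partno><suppno>s1</suppno><price>25</price></item>"
    "<item><partno>p1</partno><suppno>s2</suppno><price>12</price></item>"
    "</catalog>"
)
PARTS = ET.fromstring(
    "<parts>"
    "<part><partno>p1</partno><description>bolt</description></part>"
    "<part><partno>p2</partno><description>anchor</description></part>"
    "</parts>"
)
SUPPLIERS = ET.fromstring(
    "<suppliers>"
    "<supplier><suppno>s1</suppno><suppname>Acme</suppname></supplier>"
    "<supplier><suppno>s2</suppno><suppname>Zenith</suppname></supplier>"
    "</suppliers>"
)


def join_catalog(catalog, parts, suppliers):
    """Join items with parts (on partno) and suppliers (on suppno)."""
    description = {p.findtext("partno"): p.findtext("description")
                   for p in parts.iter("part")}
    suppname = {s.findtext("suppno"): s.findtext("suppname")
                for s in suppliers.iter("supplier")}
    rows = []
    for item in catalog.iter("item"):
        partno, suppno = item.findtext("partno"), item.findtext("suppno")
        if partno in description and suppno in suppname:
            rows.append((description[partno], suppname[suppno],
                         float(item.findtext("price"))))
    # Mirror SORTBY(description, suppname).
    return sorted(rows, key=lambda row: (row[0], row[1]))


result = join_catalog(CATALOG, PARTS, SUPPLIERS)
```

Items whose partno or suppno has no match are dropped, as in the inner join expressed by the path predicates.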

2.3.3 Summary

XML query languages are bridges between the Web and XML repositories. With XML query languages, information can be retrieved from XML repositories, XML data can be restructured, and information can be integrated. Although XQuery provides the most expressive power, many features are missing from the language, including support for updates. In addition, efficient support for XQuery features, in particular path expressions, remains a challenging research problem.

2.4 Publishing Relational Databases in XML

There is much current interest in publishing relational databases in XML. One approach is to publish relational data at the application level, as in DB2’s XML Extender [6], which uses user-defined functions and stored procedures to map between XML data and relational data. Another approach is middleware-based, as in SilkRoute [66] and XPERANTO [43, 68], which define XML views on top of relational data for query support. For instance, XPERANTO can build a default view on the whole relational database, and new XML views and queries upon XML views can then be defined using XQuery. XQuery statements are then translated into SQL and executed inside the RDBMS. This approach utilizes RDBMS technology and provides users with a unified interface. Several DBMS vendors are jointly working toward new SQL/XML standards [18]; the objective is to extend SQL to support XML by defining mappings of data, schema, actions, etc., between SQL and XML. A new XML data type and a set of SQL XML publishing functions are also defined, and are currently supported in commercial databases [12, 5].

CHAPTER 3

An XML-based Approach to Publishing and Querying the History of Databases

3.1 Introduction

Information systems have yet to address satisfactorily the problem of how to preserve the evolving content and structure of their underlying databases, and to support the query and retrieval of past information. This is a serious shortcoming for the web, where documents frequently change in structure and content, or disappear altogether, causing serious problems, such as broken links and lack of accountability for sites of public interest. Database-centric information systems face similar problems, particularly since current database management systems (DBMSs) provide little help in that respect. Indeed, as of today, commercial DBMSs do not provide effective means for archiving their data and supporting historical queries on their past contents. Given the strong application demand and the significant research efforts spent on these problems [94, 71, 77], the lack of viable solutions must be attributed, at least in part, to the technical difficulty of introducing temporal extensions into relational databases and object-oriented databases. Schema changes represent a particularly difficult and important problem for modern information systems, which need to be designed for evolution [87, 86, 105, 84].

Meanwhile, there is much current interest in publishing and viewing database-resident data as XML documents. In fact, such XML views of the database can be easily visualized on web browsers and processed by web languages, including powerful query languages such as XQuery [30] and XSLT [21]. The definition of the mapping from the database tables to the XML view is in fact used to translate queries on these views into equivalent SQL queries on the underlying database [43, 68]. As the database is updated, its external XML view also evolves, and many users who are interested in viewing and querying the current database are also interested in viewing and querying its past snapshots and evolving history.

In this chapter, we show that the history of a relational database can be viewed naturally as an XML document. The various benefits of XML-published relational databases (browsers, web languages and tools, unification of database and web information, etc.) are now extended to XML-published relation histories. In fact, we show that we can define and query XML views that support a temporally grouped data model, which has long been recognized as very effective in representing temporal information [57], but could not be supported well using relational tables and SQL. Therefore, temporal queries that would be very difficult to express in SQL can now be easily expressed in standard XQuery. Indeed, unlike SQL, XQuery [30] is Turing-complete and extensible [78, 65]. Thus many additional constructs needed for temporal queries can be defined in XQuery itself, without having to depend on difficult-to-obtain extensions by standards committees. Therefore, while the challenge of expressing and supporting complex temporal queries should never be underestimated, in this chapter we will show that the additional complexity of going from standard queries to temporal ones is much less when starting from XML/XQuery than when starting from relational tables and SQL.

Table 3.1: The Snapshot History of Employees

    empno  name  salary  title        deptno  start       end
    1001   Bob   60000   Engineer     d01     1995-01-01  1995-05-31
    1001   Bob   70000   Engineer     d01     1995-06-01  1995-09-30
    1001   Bob   70000   Sr Engineer  d02     1995-10-01  1996-01-31
    1001   Bob   70000   Tech Leader  d02     1996-02-01  1996-12-31

3.2 History of Database Tables as XML Documents

Table 3.1 and Table 3.2 describe the history of employees and departments as they would be viewed in traditional transaction-time databases [82]. Instead of using this temporally ungrouped representation, we use temporally grouped representations that exploit the expressive power of XML and its query languages. Thus, instead of the representation shown in Table 3.1 and Table 3.2, we will use the representation shown in Figure 3.1 and Figure 3.2. We will call these H-documents. Each element in an H-document is assigned the attributes tstart and tend, to represent the inclusive time interval of the element. The value of tend can be set to now, to denote the ever-increasing current time. Note that there is a covering constraint that the interval of a parent node always covers that of its child nodes, which is preserved in the update process. The H-document also has a simple and well-defined schema. Updates on a node are handled as follows: when there is a delete, the value of tend is updated to the current timestamp; when there is an insert, a new node is appended with tstart set to the current timestamp and tend set to now; and an update can be implemented as a delete followed by an insert.

Table 3.2: The Snapshot History of Departments

    deptno  deptname  mgrno  start       end
    d01     QA        2501   1994-01-01  1998-12-31
    d02     RD        3402   1992-01-01  1996-12-31
    d02     RD        1009   1997-01-01  1998-12-31
    d03     Sales     4748   1993-01-01  1997-12-31

Figure 3.1: The History of the employee Table Is Published as employees.xml

Figure 3.2: The History of the dept Table Is Published as depts.xml

Our H-documents use a temporally grouped data model [57]. Clifford et al. [57] show that temporally grouped models are more natural and powerful than temporally ungrouped ones. Temporally grouped data models are, however, difficult to support in the framework of flat relations and SQL. Thus, many approaches proposed in the past instead timestamp the tuples of relational tables. These approaches run into several problems, including the coalescing problem [40]. TSQL2’s approach [105] attempts to achieve a compromise between these two [57], and is based on an implicit temporal model, which is not without its own problems [46].

An advantage of our approach is that powerful temporal queries can be expressed in XQuery without requiring the introduction of new constructs in the language. XML and XQuery support an adequate set of built-in temporal types, including dateTime, date, time, and duration [30]; they also provide a complete set of comparison and casting functions for duration, date, and time values, making snapshot and period-based queries convenient to express in XQuery. Furthermore, whenever more complex temporal functions are needed, they can be defined using XQuery functions, which provide a native extensibility mechanism for the language. User-defined functions are further discussed in Section 3.3.2.
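The update rules for H-documents described earlier in this section (a delete closes tend, an insert appends a node ending at now, an update is a delete followed by an insert) can be sketched in Python as follows. This is a minimal illustration, not the system's implementation: the helper names h_insert, h_delete, and h_update are hypothetical, and the element layout is assumed from the description of tstart and tend attributes. Following the wording of the text, the same timestamp closes the old version and opens the new one; Table 3.1 instead shows adjacent intervals one day apart, which would require closing the old version at the preceding day.

```python
import xml.etree.ElementTree as ET

NOW = "now"  # symbolic end time, as described in the text


def h_insert(parent, tag, value, timestamp):
    """Insert: append a node with tstart = timestamp and tend = now."""
    node = ET.SubElement(parent, tag, {"tstart": timestamp, "tend": NOW})
    node.text = value
    return node


def h_delete(node, timestamp):
    """Delete: close the interval; the node is kept as history."""
    node.set("tend", timestamp)


def h_update(parent, tag, new_value, timestamp):
    """Update: delete the current version(s), then insert the new one."""
    for node in parent.findall(tag):
        if node.get("tend") == NOW:
            h_delete(node, timestamp)
    return h_insert(parent, tag, new_value, timestamp)


# Replay part of Bob's salary history from Table 3.1.
employee = ET.Element("employee", {"tstart": "1995-01-01", "tend": NOW})
h_insert(employee, "salary", "60000", "1995-01-01")
h_update(employee, "salary", "70000", "1995-06-01")
history = [(s.text, s.get("tstart"), s.get("tend"))
           for s in employee.findall("salary")]
```

Because old versions are never removed, the salary history stays grouped under its employee element, which is the temporally grouped organization the text argues for.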

3.2.1 Publishing Each Table as an XML Document with Columns as Elements

A natural way of publishing relational data is to publish each table as an XML document by converting relational columns into XML elements [90]. Figure 3.1 shows the history of the employee table and Figure 3.2 shows the history of the dept table. Thus the history of each relation is published as a separate H-document. Alternative representations for table histories were studied and compared in [101]. For instance, multiple tables can first be joined together and then represented by a single XML document. This approach offers no advantage compared to the one described above. However, IDs can be added to this representation to make some join queries easier [101]. Yet another approach consists of representing the joined tables by a hierarchically structured XML document. This approach simplifies some queries but complicates others. The last approach is to represent the tuples of a relational table by the attribute values of the XML document. Then, the XML document reproduces the flat structure of tables with timestamped tuples, along with the well-known problems of this temporally ungrouped representation. In summary, the representation used in this dissertation is the overall approach of choice, according to [101].

3.3 Temporal Queries using XQuery

The key advantage of our approach is that powerful temporal queries can be expressed in XQuery without requiring the introduction of new constructs in the language. We next show how to express the main classes of temporal queries discussed in [105, 93] (temporal projection, temporal snapshot, temporal slicing, temporal join, temporal aggregate, and restructuring) on the employees.xml document (Figure 3.1) and the depts.xml document (Figure 3.2).

QUERY 1: Temporal Projection. Retrieve the title history of employee “Bob”:

    element title_history {
      for $t in doc("employees.xml")/employees/employee[name='Bob']/title
      return $t
    }

QUERY 2: Temporal Snapshot. Retrieve the managers on 1994-05-06:

    for $m in doc("depts.xml")/depts/dept/mgrno
        [tstart(.) <= "1994-05-06" and tend(.) >= "1994-05-06"]
    return $m
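The snapshot predicate in Query 2 amounts to checking that the requested date falls inside an element's inclusive interval. A Python sketch of the same test is shown below. It is illustrative only: the manager data comes from Table 3.2, but the document layout is an assumption (Figure 3.2 is not reproduced here), timestamps are shown only on mgrno for brevity, and the snapshot helper is a hypothetical name. ISO dates compare correctly as strings.

```python
import xml.etree.ElementTree as ET

# Assumed, simplified layout of depts.xml; values taken from Table 3.2.
DEPTS = ET.fromstring("""
<depts>
  <dept><deptno>d01</deptno>
    <mgrno tstart="1994-01-01" tend="1998-12-31">2501</mgrno></dept>
  <dept><deptno>d02</deptno>
    <mgrno tstart="1992-01-01" tend="1996-12-31">3402</mgrno>
    <mgrno tstart="1997-01-01" tend="1998-12-31">1009</mgrno></dept>
  <dept><deptno>d03</deptno>
    <mgrno tstart="1993-01-01" tend="1997-12-31">4748</mgrno></dept>
</depts>
""")


def snapshot(nodes, day):
    """Keep the nodes whose inclusive [tstart, tend] interval contains day."""
    return [n for n in nodes if n.get("tstart") <= day <= n.get("tend")]


managers = [m.text for m in snapshot(DEPTS.iter("mgrno"), "1994-05-06")]
```

On this data the snapshot for 1994-05-06 contains managers 2501, 3402, and 4748, while 1009 is excluded because that interval starts in 1997.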

Note that tstart($e) and tend($e) are user-defined functions (expressed in XQuery) that return the starting date and ending date of an element, respectively; their implementation is thus transparent to users. This is further discussed in Section 3.3.2.

QUERY 3: Temporal Slicing. Find employees who worked at any time between 1994-05-06 and 1995-05-06:

    for $e in doc("employees.xml")/employees/employee
        [toverlaps(., telement("1994-05-06", "1995-05-06"))]
    return $e/name

Here toverlaps($a, $b) is a user-defined function that returns true if the interval of one node overlaps that of the other, and false otherwise. The next query is a containment query:

QUERY 4: Temporal Join. Find the history of employees each manager manages:

    element manages {
      for $d in doc("depts.xml")/depts/dept
      for $m in $d/mgrno
      return element manage {
        $d/deptno, $m,
        element employees {
          for $e in doc("employees.xml")/employees/employee
          where $e/deptno = $d/deptno
                and not(empty(overlapinterval($e, $m)))
          return ($e/name, overlapinterval($e, $m))
        }
      }
    }

This query joins the depts.xml and employees.xml documents and generates a hierarchical XML document grouped by dept and manager. overlapinterval($a, $b) is a user-defined function that returns an interval element carrying the overlapped interval in its attributes tstart and tend. If there is no overlap, no element is returned, so the XQuery built-in function empty() evaluates to true on the result.

QUERY 5: Temporal Aggregate. Retrieve the history of the average salary:

    let $s := doc("employees.xml")/employees/employee/salary
    return tavg($s)

Here tavg($s) is a user-defined function computed as follows. First, a list of salary-timestamp pairs is generated, adding the salary value at the start of each interval and subtracting it at the end; then these pairs are sorted by timestamp, and for each timestamp all the changes are added up to get a delta sum. If the delta is different from zero, then the old interval is ended and a new one is started, where the new sum is the old one plus the delta. In this way, the temporal aggregate is computed in a single scan.
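The single-scan computation just described can be sketched in Python as follows. This is an assumption-laden illustration rather than the dissertation's XQuery code: intervals are given as (start, end, value) tuples with inclusive date bounds, a running count is kept next to the running sum so that an average can be produced, and the second employee in the sample data is invented.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)


def tavg(intervals):
    """Temporal average over (start, end, value) tuples with inclusive dates."""
    events = {}  # timestamp -> [delta of sum, delta of count]
    for start, end, value in intervals:
        opening = events.setdefault(start, [0, 0])
        opening[0] += value
        opening[1] += 1
        # An inclusive interval stops contributing the day after it ends.
        closing = events.setdefault(end + ONE_DAY, [0, 0])
        closing[0] -= value
        closing[1] -= 1

    history = []
    total, count, piece_start = 0, 0, None
    for timestamp in sorted(events):          # single scan in time order
        delta_sum, delta_count = events[timestamp]
        if delta_sum == 0 and delta_count == 0:
            continue                          # nothing changed at this point
        if count > 0:                         # end the old interval
            history.append((piece_start, timestamp - ONE_DAY, total / count))
        total += delta_sum                    # new sum = old sum + delta
        count += delta_count
        piece_start = timestamp
    return history


salaries = [
    (date(1995, 1, 1), date(1995, 5, 31), 60000),    # Bob, from Table 3.1
    (date(1995, 6, 1), date(1996, 12, 31), 70000),   # Bob, from Table 3.1
    (date(1995, 3, 1), date(1995, 12, 31), 80000),   # invented second employee
]
average_history = tavg(salaries)
```

Adjacent pieces that happen to have the same average are not merged here; a coalescing pass would be needed for that.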

Other temporal aggregates, such as RISING or moving-window aggregates, can also be supported through user-defined functions.

QUERY 6: Restructuring. Find the maximum length of time during which Bob worked continuously without changing title or department:

    for $e in doc("employees.xml")/employees/employee[name='Bob']
    let $d := $e/deptno
    let $t := $e/title
    let $overlaps := restructure($d, $t)
    return max($overlaps)

The user-defined function restructure takes two lists and returns all the overlapped intervals.

3.3.1 More Complex Queries

Here we discuss more advanced temporal queries, such as until, since, and contain, which are often used as a test for the expressive power of temporal languages [54, 53]. For instance, the following is a “since” query:

QUERY 7: A Since B. Find the employee who has been a Senior Engineer in dept “d001” since he/she joined the dept:

    for $e in doc("employees.xml")/employees/employee
    let $m := $e/title[. = "Sr Engineer" and tend(.) = current-date()]
    let $d := $e/deptno[. = "d001" and tcontains($m, .)]
    where not(empty($d)) and not(empty($m))
    return {$e/empno, $e/name}

Here tcontains($a, $b) is a user-defined function that checks whether one interval covers another.

QUERY 8: Period Containment. Find employees with the same employment history as employee “Bob”, i.e., they worked in the same department(s) as employee “Bob” and for exactly the same periods:

    for $e1 in doc("employees.xml")/employees/employee[name = 'Bob']
    for $e2 in doc("employees.xml")/employees/employee[name != 'Bob']
    where (every $d1 in $e1/deptno satisfies
             some $d2 in $e2/deptno satisfies
               (string($d1) = string($d2) and tequals($d2, $d1)))
      and (every $d2 in $e2/deptno satisfies
             some $d1 in $e1/deptno satisfies
               (string($d2) = string($d1) and tequals($d1, $d2)))
    return {$e2/name}

Here tequals($d1,$d2) is a user-defined function to check if two nodes have equal intervals.
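The interval predicates used in the queries of this section have simple definitions over inclusive intervals. The Python sketch below shows one plausible reading of toverlaps, tcontains, tequals, tprecedes, and overlapinterval; it is an illustration under assumptions, with intervals modeled as (start, end) tuples of dates rather than as XML elements carrying tstart and tend attributes.

```python
from datetime import date


def toverlaps(a, b):
    """True if the inclusive intervals a and b share at least one day."""
    return a[0] <= b[1] and b[0] <= a[1]


def tcontains(a, b):
    """True if interval a covers interval b."""
    return a[0] <= b[0] and a[1] >= b[1]


def tequals(a, b):
    """True if both intervals have the same start and end."""
    return a[0] == b[0] and a[1] == b[1]


def tprecedes(a, b):
    """True if a ends before b starts."""
    return a[1] < b[0]


def overlapinterval(a, b):
    """The common sub-interval of a and b, or None when they do not overlap."""
    if not toverlaps(a, b):
        return None
    return (max(a[0], b[0]), min(a[1], b[1]))


# Bob's stay in d01 (Table 3.1) and the tenure of manager 2501 (Table 3.2).
bob_d01 = (date(1995, 1, 1), date(1995, 9, 30))
mgr_2501 = (date(1994, 1, 1), date(1998, 12, 31))
query3_slice = (date(1994, 5, 6), date(1995, 5, 6))

managed_period = overlapinterval(bob_d01, mgr_2501)
```

Returning None for disjoint intervals plays the role of the empty result that Query 4 tests with empty().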

3.3.2 Temporal Functions

The use of functions tstart($e) and tend($e) in temporal queries offers the advantage of divorcing the queries from the low-level details used in representing time, e.g., whether the interval is closed at the end, or how now is represented. Other useful functions predefined in our system include:

Restructuring functions: coalesce($l) coalesces a list of nodes, and restructure($a,$b) returns all the overlapped intervals on two sets of nodes.

Interval functions: toverlaps($a,$b), tprecedes($a,$b), tcontains($a,$b), tequals($a,$b), and tmeets($a,$b) return true or false according to the relative positions of the two intervals; overlapinterval($a,$b) returns the overlapped interval if the two intervals overlap, in the form of an interval element with attributes tstart and tend.

Duration and date/time functions: timespan($e) returns the time span of a node; tstart($e) returns the start time of a node; tend($e) returns the end time of a node; tinterval($e) returns the interval of a node; telement($Ts, $Te) constructs an empty element telement with attributes tstart and tend set to the respective argument values; rtend($e) recursively replaces all occurrences of “9999-12-31” with the value of the current date; externalnow($e) recursively replaces all occurrences of “9999-12-31” with the string “now”.

3.3.3 Support for ‘now’

This is an important topic that has received considerable attention in temporal databases [58]. For historical transaction-time databases, ‘now’ can only appear as the end of a period, and means ‘no changes until now.’ That is to say, the values in the tuple are still current at the time the query is asked. Therefore, a simple strategy to support this concept consists in replacing every occurrence of the symbol ‘now’ in the database with the value of the current timestamp (or the current date, depending on the granularity used for time-stamping the tuples). This is basically the strategy we use in our implementation; of course, we do not perform the actual instantiation on all the ‘now’ occurrences in the database for each query Q. Rather, we perform instantiations conservatively, as needed to answer the query correctly. Internally, we use the “end-of-time” value to denote the ‘now’ symbol. For instance, for dates we use “9999-12-31.” The user does not access this value directly; instead, he/she accesses it through built-in functions. For instance, to refer to an employee on “2003-02-13”, the user might write tstart($e) Bob}

    TITLE(e)  = {[1995-01-01,1997-12-31] -> Engineer,
                 [1998-01-01,now] -> Sr Engineer}
    SALARY(e) = {[1995-01-01,1997-12-31] -> 65000,
                 [1998-01-01,1999-12-31] -> 70000,
                 [2000-01-01,now] -> 85000}

Here each attribute value is associated with a valid time lifespan. The surrogate is a system-defined identifier, which can be ignored if the key doesn’t change. The following is the list of temporal attribute values of entity dept (or d):

    SURROGATE(d) = {[1995-01-01,now] -> surrogate_id}
    NAME(d)      = {[1995-01-01,now] -> RD}

Similarly, for the instance rb of the relationship belongs to between employee ‘Bob’ and dept ‘RD’, the lifespan is T(rb)=[1995-01-01,now], and for the instance rm of the relationship manages between employee ‘Mike’ and dept ‘RD’, the lifespan is T(rm)=[1999-01-01,now]. In the next section, we show that such a temporal ER model can be supported well with XML.
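The temporal attribute values listed above map naturally onto a small data structure. The Python sketch below stores each attribute as a list of (start, end, value) triples and looks up the value valid on a given day. It is an illustration only: the helper names value_at and lifespan are hypothetical, and encoding now as the end-of-time date “9999-12-31” is an assumption borrowed from the internal representation described elsewhere in the text.

```python
NOW = "9999-12-31"  # assumed end-of-time encoding of 'now'

# Attribute histories of the employee entity e, as listed in the text.
employee = {
    "TITLE": [
        ("1995-01-01", "1997-12-31", "Engineer"),
        ("1998-01-01", NOW, "Sr Engineer"),
    ],
    "SALARY": [
        ("1995-01-01", "1997-12-31", 65000),
        ("1998-01-01", "1999-12-31", 70000),
        ("2000-01-01", NOW, 85000),
    ],
}


def value_at(history, day):
    """Value whose inclusive valid-time interval contains day, else None."""
    for start, end, value in history:
        if start <= day <= end:  # ISO dates compare correctly as strings
            return value
    return None


def lifespan(entity):
    """Smallest interval covering every attribute interval of the entity."""
    starts = [start for history in entity.values() for start, _, _ in history]
    ends = [end for history in entity.values() for _, end, _ in history]
    return (min(starts), max(ends))


salary_mid_1999 = value_at(employee["SALARY"], "1999-06-15")
```

Grouping the history under each attribute in this way is the same organization that the XML representation in the next section makes explicit with nested, timestamped elements.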

5.2 Valid Time History in XML

While transaction time identifies when data was recorded in the database, valid time concerns when a fact was true in reality. One major difference is that while transaction time is append-only and cannot be updated, valid time can be updated by users. We show that, with XML, we can model the valid time history naturally.

Figure 5.2: The Valid Time History of Employees

Figure 5.2 shows a valid time history of employees, where each tuple is timestamped with a valid time interval. This representation assumes valid time homogeneity, and is temporally ungrouped [57]. It has several drawbacks: first, redundant information is preserved between tuples, e.g., Bob’s department remained the same but is stored in every tuple; second, temporal queries need to frequently coalesce tuples, which is a source of complications in temporal query languages. These problems can be overcome using a representation where the timestamped history of each attribute is grouped under the attribute [57]. This produces a hierarchical organization that can be naturally represented by the hierarchical XML view shown in Figure 5.3 (VH-document). In the VH-document, every element is timestamped with an inclusive valid time interval, using the two XML attributes vstart and vend. vend can be set to now to denote the ever-increasing current date, which is internally represented as “9999-12-31” (Section 5.2.2). Note that an entity (e.g., employee ‘Bob’) always has a lifespan longer than or equal to that of its children; thus there is a valid time covering constraint that the valid time interval of a parent node always covers that of its child nodes, which is preserved in the update process (Section 5.2.3).

Figure 5.3: XML Representation of the Valid-time History of Employees (VH-document)

Unlike the relational data model, which is almost invariably depicted via tables, XML is not directly associated with a graphical representation. This creates both the challenge and the opportunity of devising the graphical representation most conducive to the application at hand, and of implementing it using standard XML tools such as XSL [21]. Figure 5.4 shows a representation of temporally grouped tables that we found effective as a user interface (and even more so after adding contrasting colored backgrounds and other browser-supported embellishments).

Figure 5.4: Temporally Grouped Valid Time History of Employees

5.2.1 Valid Time Temporal Queries

The data shown in Figure 5.4 is the actual data stored in the database, with the exception of the special “now” symbol discussed later. Thus a powerful query language such as XQuery can be directly applied to this data model. Next we show that we can specify temporal queries with XQuery on the VH-document, such as temporal projection, snapshot queries, temporal slicing, temporal joins, etc.

QUERY V1: Temporal projection: retrieve the history of departments where Bob was employed:

    for $s in doc("emps.xml")/employees/employee[name="Bob"]/dept
    return $s

QUERY V2: Snapshot: retrieve the managers of each department on 1999-05-01:

    for $m in doc("depts.xml")/depts/dept/mgrno
        [vstart(.) <= "1999-05-01" and vend(.) >= "1999-05-01"]
    return $m

Here depts.xml is the VH-document that includes the history of dept names and managers. vstart() and vend() are user-defined functions (expressed in XQuery) that return the starting date and ending date of an element’s valid time, respectively; their implementation is thus transparent to users.

QUERY V3: Continuous Period: find employees who worked as a manager for more than 5 consecutive years (i.e., 1826 days):

    for $e in doc("emps.xml")/employees/employee[title="Manager"]
    for $t in $e/title[.="Manager"]
    let $duration := subtract-dates(vend($t), vstart($t))
    where dayTimeDuration-greater-than($duration, "P1826D")
    return $e/name

Here “P1826D” is a duration constant of 1826 days in XQuery.

QUERY V4: Temporal Join: find employees who were making the same salaries on 2001-04-01:

    for $e1 in doc("emps.xml")/employees/employee
    for $e2 in doc("emps.xml")/employees/employee
    where $e1/salary[vstart(.) <= "2001-04-01" and vend(.) >= "2001-04-01"]
          = $e2/salary[vstart(.) <= "2001-04-01" and vend(.) >= "2001-04-01"]
      and $e1/name != $e2/name
    return ($e1/name, $e2/name)

This query joins emps.xml with itself. It is also easy to support the since and until connectives of first-order temporal logic [54], for example:

QUERY V5: A Until B: find the employee who was hired and worked in dept “RD” until Bob was appointed as the manager of the dept:

    for $e in doc("emps.xml")/employees/employee
    for $b in doc("emps.xml")/employees/employee[name="Bob"]
    let $t := $b/title[.="manager"]
    let $bd := $b/dept[.="RD"]
    let $d := $e/dept[1][.="RD"]
    where vmeets($d, $t) and vcontains($bd, $t)
    return {$e/name}

70

5.2.2

Temporal Operators

In the temporal queries, we used functions such as vstart and vend to shield users from the implementation details of representing time. Similar to the functions defined for transaction-time history (Section 3.3.2), functions predefined for valid-time history include: timestamp referencing functions, such as vstart and vend; interval comparison functions, such as voverlaps, vprecedes, vcontains, vequals, vmeets, and voverlapinterval; and duration and date/time functions, such as vtimespan and vinterval. For example, vcontains() is defined as follows:

define function vcontains($a, $b) {
  if ($a/@vstart <= $b/@vstart and $a/@vend >= $b/@vend)
  then true() else false()
}

Internally, we use “end-of-time” values (e.g., “9999-12-31”) to denote the ‘now’ and ‘UC’ symbols, and use user-defined functions to convert them to different formats according to the user’s needs. These valid-time queries are similar to the transaction-time queries discussed in Chapter 3. However, unlike transaction-time databases, valid-time databases must also support explicit updates. This topic is not covered in Chapter 3 or Chapter 4 and is discussed next.
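The interval comparison functions and the end-of-time encoding described above can be illustrated outside XQuery. The following Python sketch is not the dissertation's implementation: the tuple representation, the helper names parse, interval and external, and the reading of vmeets as "starts on the day after" for inclusive intervals are all assumptions made for illustration.

```python
from datetime import date, timedelta

# Assumption: 'now' is stored internally as an end-of-time date.
END_OF_TIME = date(9999, 12, 31)


def parse(value):
    """Map the 'now' symbol to the end-of-time date; parse ISO dates otherwise."""
    return END_OF_TIME if value == "now" else date.fromisoformat(value)


def interval(vstart, vend):
    """Build an inclusive (start, end) interval from attribute strings."""
    return (parse(vstart), parse(vend))


def vcontains(a, b):
    """True if interval a covers interval b."""
    return a[0] <= b[0] and a[1] >= b[1]


def voverlaps(a, b):
    """True if the inclusive intervals share at least one day."""
    return a[0] <= b[1] and b[0] <= a[1]


def vprecedes(a, b):
    """True if a ends strictly before b starts."""
    return a[1] < b[0]


def vmeets(a, b):
    """True if b starts on the day after a ends (assumed semantics)."""
    return a[1] != END_OF_TIME and a[1] + timedelta(days=1) == b[0]


def vequals(a, b):
    """True if both intervals have the same start and end."""
    return a == b


def external(d):
    """Convert the internal end-of-time value back to the 'now' symbol."""
    return "now" if d == END_OF_TIME else d.isoformat()
```

Because every comparison operates on the internal dates, queries never need a special case for ‘now’; only the conversion helpers know about the end-of-time value.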

5.2.3 Database Modifications

An update task force is currently working on defining standard update constructs for XQuery [88]; moreover, update constructs are already supported in several native XML databases [20]. Our approach to temporal updates consists in supporting the operations of insert, delete, and update via user-defined functions.


This approach will preserve the validity of end-user programs in the face of differences between vendors and evolving standards. It also shields the end-users from the complexity of the additional operations required by temporal updates, such as the coalescing of periods, and the propagation of updates to enforce the covering constraints.

INSERT. When a new entity is inserted, the new employee element with its child elements is appended to the VH-document; the vstart attributes are set to the valid starting timestamp, and the vend attributes are set to now. Insertion can be done through the user-defined function VInsert($path,$newelement). The new element can be created using the function VNewElement($valueset, $vstart, $vend). For example, the following query inserts Mike as an engineer into the RD dept with salary 50K, starting immediately:

for $s in doc("emps.xml")/employees/employee[last()]
return VInsert($s, VNewElement(["Mike", "Engineer", "RD", "50000"], current-date(), "now"))

DELETE. There are two types of deletion: deletion without valid time and deletion with valid time. The former assumes a default valid time interval, (current date, forever), and can be implemented with the user-defined function VNodeDelete($path). For deletion with a valid time interval v on node e, there are three mutually exclusive cases: (i) e is removed if its valid time interval is contained in v, (ii) the valid time interval of e is shortened if the two intervals overlap but do not contain each other, or (iii) e’s interval is split if it properly contains v. Deletions on a node are then propagated downward to its children to satisfy the covering constraint. Node deletion (with downward propagation) is supported by the function VTimeDelete($path, $vstart, $vend).

UPDATE. Updates can be on values or valid time, and coalescing is needed. Two functions are defined: VNodeReplace($path,$newValue) and VTimeReplace($path, $vstart,$vend). For a value update, propagation is not needed; for a valid time update, the valid time of the node’s children must also be updated downward. If a valid time update on a child node violates the valid time covering constraint, then the update will fail.
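The three cases of deletion with a valid time interval can be made concrete with a small sketch. The Python function below is an illustration under stated assumptions, not the VTimeDelete function itself: it works on a single inclusive interval of dates, the name delete_period is hypothetical, case (ii) is read as trimming the overlapping part of e, and downward propagation and coalescing are left out.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)


def delete_period(e, v):
    """Return the inclusive intervals of e that remain after deleting v.

    e and v are (start, end) pairs of dates with start <= end.
    """
    e_start, e_end = e
    v_start, v_end = v
    if v_end < e_start or e_end < v_start:
        return [e]                      # no overlap: nothing to delete
    if v_start <= e_start and e_end <= v_end:
        return []                       # case (i): e is contained in v
    remaining = []
    if e_start < v_start:
        remaining.append((e_start, v_start - ONE_DAY))   # part before v
    if v_end < e_end:
        remaining.append((v_end + ONE_DAY, e_end))       # part after v
    # One piece left: case (ii), e is trimmed. Two pieces: case (iii), e is split.
    return remaining
```

An empty result corresponds to removing the node, one interval to adjusting its vstart or vend, and two intervals to replacing the node by two value-equivalent nodes.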

5.3 An XML-based Bitemporal Data Model

Figure 5.5: Bitemporal History of Employees

5.3.1 The XBiT Data Model

In practice, temporal applications often involve both transaction time and valid time. We show next that, with XML, we can naturally represent a temporally grouped data model, and provide support for complex bitemporal queries.

Bitemporal Grouping

Figure 5.5 shows a bitemporal history of employees, using a temporally ungrouped representation. Although valid time and transaction time are generally independent, for the sake of illustration, we assume here that employees’ promotions are scheduled and entered in the database four months before they occur.

Figure 5.6: XML Representation of the Bitemporal History of Employees (BH-document)

XBiT supports a temporally grouped representation by coalescing attributes’ histories on both transaction time and valid time. Temporal coalescing on two temporal dimensions is different from coalescing on just one. On one dimension, coalescing is done when (i) two successive tuples are value-equivalent, and (ii) their intervals overlap or meet; the two intervals are then merged into a maximal interval. For bitemporal histories, coalescing is done when two tuples are value-equivalent and (i) their valid time intervals are the same and the transaction time intervals meet or overlap; or (ii) the transaction time intervals are the same and the valid time intervals meet or overlap. This operation is repeated until no tuples satisfy these conditions. For example, in Figure 5.5, to group the history of titles with value ‘Sr Engineer’ in the last three tuples, i.e., (title, valid time, transaction time), the last two transaction time intervals are the same, so they are coalesced as (Sr Engineer, 1998-01-01:now, 1999-09-01:UC). This tuple in turn has the same valid time interval as the previous one, (Sr Engineer, 1998-01-01:now, 1997-09-01:1999-08-31); thus finally they are coalesced as (Sr Engineer, 1998-01-01:now, 1997-09-01:UC), as shown in Figure 5.7.
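The two-dimensional coalescing rule can be sketched as a pairwise merge repeated until a fixpoint is reached. The Python code below is a minimal illustration, not the XBiT implementation: tuples are modelled as (value, valid interval, transaction interval) with inclusive dates, ‘now’ and ‘UC’ are both encoded as an assumed end-of-time date, and the function names are hypothetical.

```python
from datetime import date, timedelta

FOREVER = date(9999, 12, 31)   # assumed internal encoding of 'now' and 'UC'


def meets_or_overlaps(a, b):
    """True if inclusive intervals a and b overlap or are adjacent."""
    first, second = sorted([a, b])
    if first[1] == FOREVER:
        return True
    return second[0] <= first[1] + timedelta(days=1)


def union(a, b):
    """Smallest interval covering both a and b."""
    return (min(a[0], b[0]), max(a[1], b[1]))


def try_merge(x, y):
    """Merge two (value, valid, transaction) tuples, or return None."""
    if x[0] != y[0]:
        return None
    if x[1] == y[1] and meets_or_overlaps(x[2], y[2]):
        return (x[0], x[1], union(x[2], y[2]))   # same valid time
    if x[2] == y[2] and meets_or_overlaps(x[1], y[1]):
        return (x[0], union(x[1], y[1]), x[2])   # same transaction time
    return None


def coalesce(tuples):
    """Repeat pairwise merging until no pair satisfies the conditions."""
    result = list(tuples)
    changed = True
    while changed:
        changed = False
        for i in range(len(result)):
            for j in range(i + 1, len(result)):
                merged = try_merge(result[i], result[j])
                if merged is not None:
                    result[i] = merged
                    del result[j]
                    changed = True
                    break
            if changed:
                break
    return result
```

Applied to three ‘Sr Engineer’ tuples shaped like those in the example (the split point between the last two valid time intervals is not given in the text and would have to be assumed), the first pass merges the two tuples that share a transaction time interval, and the second pass merges the result with the earlier tuple that now has the same valid time interval.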

Data Modeling of Bitemporal History with XML

With temporal grouping, the bitemporal history is represented in XBiT as an XML document (BH-document). This is shown in the example of Figure 5.6, which is snapshot-equivalent to the example of Figure 5.5. Each employee entity is represented as an employee element in the BH-document, and table attributes are represented as child elements of the employee element. Each element in the BH-document is assigned two pairs of attributes: tstart and tend to represent the inclusive transaction time interval, and vstart and vend to represent the inclusive valid time interval. Elements corresponding to a table attribute value history are ordered by the starting transaction time tstart. The value of tend can be set to UC (until changed), and vend can be set to now. There is a covering constraint whereby the transaction time interval of a parent node must always cover that of its child nodes, and likewise for valid time intervals. Figure 5.7 displays the resulting temporally grouped representation, which is appealing to intuition, and also effective at supporting natural language interfaces, as shown by Clifford [56].

Figure 5.7: Temporally Grouped Bitemporal History of Employees

5.3.2 Bitemporal Queries with XQuery

The XBiT-based representation can also support powerful temporal queries, expressed in XQuery without requiring the introduction of new constructs in the language. We next show how to express bitemporal queries on employees.

QUERY B1: Temporal projection: retrieve the bitemporal salary history of employee “Bob”:

for $s in doc("emps.xml")/employees/employee[name="Bob"]/salary
return $s

This query is exactly the same as query V1, except that it retrieves both the transaction time and the valid time history of salaries.

QUERY B2: Snapshot: according to what was known on 1999-05-01, what was the average salary at that time?

let $s := doc("emps.xml")/employees/employee/salary
where tstart($s)<="1999-05-01" and tend($s)>="1999-05-01"
  and vstart($s)<="1999-05-01" and vend($s)>="1999-05-01"
return avg($s)

Here tstart(), tend(), vstart() and vend() are user-defined functions that get the starting date and ending date of an element’s transaction time and valid time respectively.

QUERY B3: Diff queries: retrieve employees whose salaries (according to our current information) did not change between 1999-01-01 and 2000-01-01:

let $s := doc("emps.xml")/employees/employee/salary
where tstart($s)<=current-date() and tend($s)>=current-date()
  and vstart($s)<="1999-01-01" and vend($s)>="2000-01-01"
return $s/..

This query will take a transaction time snapshot and a valid time slice of salaries.

QUERY B4: Change Detection: find all the updates of employee salaries that were applied retroactively.

for $s in doc("emps.xml")/employees/employee/salary
where tstart($s) > vstart($s) or tend($s) > vend($s)
return $s

QUERY B5: Find the manager for each current employee, as best known now:

for $e in doc("emps.xml")/employees/employee
for $d in doc("depts.xml")/depts/dept/name[.=$e/dept]
where tend($e)="UC" and tend($d)="UC" and vend($e)="now" and vend($d)="now"
return ($e, $d)

This query will take a current snapshot on both transaction time and valid time.

QUERY B6: Period Containment: find employees with the same employment history as employee “Bob”, i.e., at any time, they worked in the same department(s) as employee “Bob” and for exactly the same periods.

for $e1 in doc("emps.xml")/employees/employee[name="Bob"]
for $e2 in doc("emps.xml")/employees/employee[name!="Bob"]
where every $d1 in $e1/dept satisfies
        some $d2 in $e2/dept satisfies
          (string($d1) = string($d2) and tequals($d2, $d1) and vequals($d2, $d1))
  and every $d2 in $e2/dept satisfies
        some $d1 in $e1/dept satisfies
          (string($d2) = string($d1) and tequals($d1, $d2) and vequals($d1, $d2))
return $e2/name

Here tequals($d1,$d2) and vequals($d1,$d2) are user-defined functions in XQuery that check whether two nodes have the same transaction time interval or the same valid time interval respectively.

5.3.3 Database Modifications

For valid time databases, both attribute values and attribute valid time can be updated by users, and XBiT must perform some implicit coalescing to support the update process. Note that only elements that are current (ending transaction time as UC) can be modified. A modification combines two processes: explicit modification of valid time and values, and implicit modification of transaction time.

Modifications of Transaction Time Databases

Transaction time modifications can also be classified into three types: insert, delete, and update.

INSERT. When a new tuple is inserted, the corresponding new element (e.g., employee ‘Bob’) and its child elements in the BH-document are timestamped with the starting transaction time as the current date, and the ending transaction time as UC. The user-defined function TInsert($node) will insert the node with the transaction time interval (current date, UC).

DELETE. When a tuple is removed, the ending transaction time of the corresponding element and its current children is changed to the current time. This can be done by the function TDelete($node).

UPDATE. An update can be seen as a delete followed by an insert.

Database Modifications in XBiT

Modifications in XBiT can be seen as the combination of modifications on valid time and transaction time history. XBiT will automatically coalesce on both valid time and transaction time.

INSERT. Insertion is similar to valid time database insertion, except that the added element is timestamped with the transaction time interval (current date, UC). This can be done by the function BInsert($path, $newelement), which combines VInsert and TInsert.

DELETE. Deletion is similar to valid time database deletion, except that the function TDelete is called to change tend of the deleted element and its current children to the current date. Node deletion is done through the function BNodeDelete($path), and valid time deletion is done through the function BTimeDelete($path, $vstart, $vend).

UPDATE. Update is also a combination of valid time and transaction time operations, i.e., deleting the old tuple with tend set to the current date, and inserting the new tuple with the new value and valid time interval, tstart set to the current date, and tend set to UC. This is done by the functions BNodeReplace($path, $newValue) and BTimeReplace($path, $vstart, $vend) respectively.

5.3.4 Temporal Database Implementations

Two basic approaches are possible to manage the three types of H-documents discussed here: one is to use a native XML database, and the other is to use a traditional RDBMS. In Chapter 4 we showed that a transaction-time TH-document can be stored in an RDBMS, with significant performance advantages on temporal queries over a native XML database. Similarly, the RDBMS-based approach can be applied to valid-time and bitemporal histories. First, the BH-document is shredded and stored into H-tables. For example, the employee BH-document in Figure 5.6 is mapped into the following attribute history tables:

employee_name(id,name,vstart,vend,tstart,tend)
employee_title(id,title,vstart,vend,tstart,tend)
employee_salary(id,salary,vstart,vend,tstart,tend)
...

Since the BH-document and the H-tables have a simple mapping relationship, temporal XQuery queries can be translated into SQL queries based on this mapping, using the techniques discussed in Chapter 4.
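To give a feel for what a query over the shredded H-tables looks like, the sketch below runs a bitemporal snapshot (in the spirit of QUERY B2) against a salary history table. It is only an illustration and not the translation algorithm of Chapter 4: it uses Python's sqlite3 module with an in-memory database, the rows are hypothetical, and ‘now’ and ‘UC’ are assumed to be stored as the end-of-time date so that plain string comparison of ISO dates works.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee_salary ("
    "id INTEGER, salary INTEGER, "
    "vstart TEXT, vend TEXT, tstart TEXT, tend TEXT)"
)
# Hypothetical rows; 'now' and 'UC' are stored as the end-of-time date.
rows = [
    (1001, 60000, "1995-01-01", "1999-12-31", "1995-01-01", "9999-12-31"),
    (1001, 70000, "2000-01-01", "9999-12-31", "1999-09-01", "9999-12-31"),
    (1002, 50000, "1998-01-01", "9999-12-31", "1998-01-01", "9999-12-31"),
]
conn.executemany("INSERT INTO employee_salary VALUES (?, ?, ?, ?, ?, ?)", rows)


def average_salary(conn, as_known_on, valid_on):
    """Average salary valid on valid_on, as recorded in the database on as_known_on."""
    query = (
        "SELECT AVG(salary) FROM employee_salary "
        "WHERE tstart <= ? AND tend >= ? AND vstart <= ? AND vend >= ?"
    )
    return conn.execute(
        query, (as_known_on, as_known_on, valid_on, valid_on)
    ).fetchone()[0]
```

Calling average_salary(conn, "1999-05-01", "1999-05-01") selects only the rows whose transaction time and valid time intervals both contain that date, which is the relational counterpart of the four interval conditions in the XQuery snapshot.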

5.4 Summary

In this chapter, we showed that valid-time, transaction-time, and bitemporal databases can be naturally managed in XML using temporally-grouped data models. This approach is similar to the one we proposed for transaction-time databases in Chapter 3, but we have here shown that it also supports (i) the temporal EER model [64], and (ii) valid-time and bitemporal databases with the complex temporal update operations they require. Complex historical queries, and updates, which would be very difficult to express in SQL on relational tables, can now be easily expressed in XQuery on such XML-based representations.

CHAPTER 6

General XML Documents

The temporal modelling and querying approach proposed in the previous chapters is general and can be applied to arbitrary XML documents, ranging from strictly structured data-centric XML documents to loosely structured text-centric Web documents. An important issue in web archiving is how to efficiently preserve past versions of documents and query their evolution history. Our timestamping scheme can be used to archive the version history of arbitrary Web XML documents. Therefore, e-permanence of XML documents is achieved by representing their version history in XML and using XQuery to express complex queries on the evolution of the documents and their contents.

6.1 Introduction

Temporal queries and version management for web archives find important applications in web information systems and are particularly important in applications, such as document archives and digital libraries, that must assure the permanence of e-documents. Indeed, the very computing technology that makes digital repositories possible also makes it very easy to revise the documents and publish their revised versions on the Web. To avoid loss of critical information and achieve e-permanence, old versions must be preserved. We can expect that in the future, ‘e-permanence’ standards will be established for critical Web sites of public interest [11].

A first approach to preserving the content of successive document versions is to store each version as a separate document. But this is very undesirable because of (i) storage inefficiency, (ii) explosion of hits on content-based searches, and (iii) difficulty of answering queries on the evolution of a document and its content. Therefore, in this chapter, we propose XML-based techniques whereby a multiversion document is managed as a unit, and successive versions are represented by their delta changes to optimize storage utilization and efficiently support complex historical queries on the evolution of the document and its elements (e.g., abstract, figures, bibliography, etc.).

Similar problems and opportunities occur in data warehousing applications, where the need for Temporal Data Warehouses is well-established [104, 83]. Furthermore, as we move from traditional inter-company warehouses to intra-company warehouses, reliance on XML and Web-based environments is bound to grow. A related new trend is Web warehouses (Figure 6.1), designed to collect contents from Web sites of interest, and monitor them at regular intervals to detect changes and provide subscribed services to a community of users [80]. Typical services provided include (i) detecting changes from the previous version, (ii) forwarding significant deltas to subscribers, (iii) answering continuous queries over the stream of changes, (iv) retrieving old versions, and (v) supporting historical queries and trend analysis queries.

Thus, we focus on information systems, such as digital libraries and Web warehouses, that support sophisticated change management and temporal queries for focused application domains. This differentiates our web warehouses and digital archives from systems, such as the Wayback machine, which are primarily interested in preserving old snapshots of the global Web [22], rather than in providing sophisticated information services.

Figure 6.1: Web Information Warehouses

In addition to these new applications of version management, we find the more traditional ones, such as software configuration and cooperative work [25, 99]. As these applications migrate to a Web-based environment, they increasingly use XML for representing and exchanging information, often seeking standard vendor-supported tools for processing and exchanging their XML documents. The importance of version management has been fully recognized by XML standard groups [25], but because of the difficult research issues that remain open, standards have not been issued yet. This current situation provides a window of opportunity to solve the technical challenges and lay the foundations for this important piece of Web information systems technology.

The first problem that must be addressed is how to represent a multiversion document as an XML document that (i) can be viewed by conventional XML browsers at remote sites, and (ii) can support complex queries, including temporal ones, expressed in standard XML query languages, such as XQuery [30]. For instance, the reference-based representation proposed in [49] achieves the first objective but falls short of the second one. On the other hand, the model based on durable node numbers and document shredding presented in [50] was proven effective at delivering good performance at the physical level; however, the problem of how to represent the version history as an XML document for querying and/or forwarding it to a remote browser was never solved.

Indeed, the question of what data representation should be used for time-dependent information constitutes a difficult problem, and one that greatly influences the design of the temporal query language to be used. For relational databases, more than forty different approaches were counted [105, 82], each featuring a different combination of a temporal data model and query language. Although the design space of alternatives has been so extensively explored, as of today no temporal data model and query language exists that is generally accepted in the research community or supported by major vendors. However, the troublesome experience of relational systems with temporal information is due to the flat structure of relational tables, and does not necessarily carry over to XML [101]. Indeed, XML provides a richer data model, whose structured hierarchy can be used to support temporally grouped data models that have long been recognized as most natural and expressive for temporal information [57, 56]. Also, XML provides more powerful query languages, such as XQuery, that, unlike SQL, achieve native extensibility and Turing completeness [78].

In this chapter, we propose our new scheme to represent XML document changes, and show how complex temporal queries can be supported on this scheme. Then we discuss techniques and tools to build V-Documents.

6.2 A Logical Model for Versions

In SCCS [85], each text segment is clustered with all its successive changes, and pairs of timestamps are associated with the segments and their changes to specify their lifespans. Although the SCCS scheme has many shortcomings at the physical level, we will show next that it offers potential benefits at the logical level. These are not significant in conventional systems, where the basic document segments managed by SCCS are lines of text, and the multiversion document generated by SCCS lacks any obvious logical structure. However, when applied to the elements of a well-structured XML document, the basic SCCS scheme can be extended to produce a well-structured XML document that can be used to (i) display the version history of the document on a Web browser, and (ii) express complex queries on the document and its evolution.

We now discuss how to summarize and represent the successive versions of a document as an XML document, called a V-Document (Figure 6.3), upon which complex queries and temporal queries can be specified using languages such as XPath or XQuery. In a V-Document, each element is assigned two attributes, vstart and vend, which represent the valid version interval (inclusive) of the element. In general, vstart and vend can be version numbers or timestamps: vstart represents the initial version when the element is first added to the XML document, and vend represents the last version in which the element is valid. After the vend version, the element is either removed or changed. The value of vend can be set to now, to denote the ever-increasing current version number. It is clear that the version interval of an ancestor node always contains those of its descendant nodes.

There are significant advantages of our scheme, including the following:

• There is no storage redundancy, since multiple identical nodes in successive versions of the same document are represented as a single node timestamped with its validity interval,

• The document history is represented by a temporally grouped model encoded as an XML document whose DTD (XML Schema) is automatically generated from the DTD (the schema) of the original document,

• Temporal queries and other complex queries can be easily expressed on V-Documents using standard XML query languages.

6.2.1 Change Management

We now consider a very simple document and its successive versions, as shown in Figure 6.2. For simplicity, the only primitive changes used in our V-Documents are DELETE, INSERT and UPDATE. (Operations such as MOVE or COPY can be reduced to these.) These changes are detected by the XMLDiff [1] algorithm and represented in the V-Document by using the vstart and vend attributes, which denote the beginning and end time stamps of the version (or alternatively the version number).

Figure 6.2: Sample Versioned XML Documents

UPDATE. When an element is considered updated, a new element with the same name will be appended immediately after the original element; the attribute vstart of this new element is set to the current time stamp (or version number), and the attribute vend is set to the special symbol ‘now’ that represents the ever-increasing current time stamp (or version number). The vend attribute of the old element is set to the last version before it was changed. The change of an element is not viewed as a change of its ancestors, unless the ancestors themselves are changed. This is consistent with the results produced by the XMLDiff algorithms, where the only deltas listed are those of the changed elements.

INSERT. When a new element is inserted, that element is inserted into the corresponding position in the V-Document; the vstart attribute is set to the current time stamp (or version number), and vend is set to now.

DELETE. When an element is removed, the only information that must be changed is the vend attribute; this is set to the last time stamp (or version number) where the element was valid.

In the next section, we show that this representation is conducive to sophisticated temporal queries on the document history. The approach is also general and can be applied to different types of XML documents.
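The three primitive changes described above can be sketched with Python's standard xml.etree.ElementTree module. This is an illustration only, not the XChronicler implementation: the function names are hypothetical, integer version numbers are used instead of time stamps, and closing the still-current descendants of a deleted element is an assumption made here to keep each ancestor interval covering those of its descendants.

```python
import xml.etree.ElementTree as ET


def v_update(parent, old, new_text, version):
    """Close the old element at version - 1 and add its replacement right after it."""
    old.set("vend", str(version - 1))
    new = ET.Element(old.tag, {"vstart": str(version), "vend": "now"})
    new.text = new_text
    position = list(parent).index(old)
    parent.insert(position + 1, new)
    return new


def v_insert(parent, position, tag, text, version):
    """Insert a new element that is valid from the given version onward."""
    new = ET.Element(tag, {"vstart": str(version), "vend": "now"})
    new.text = text
    parent.insert(position, new)
    return new


def v_delete(element, version):
    """Set vend to version - 1 on the element and its still-current descendants."""
    for node in element.iter():
        if node.get("vend") == "now":
            node.set("vend", str(version - 1))
```

Note that v_update leaves the parent's own vstart and vend untouched, matching the rule that a change to an element is not viewed as a change to its ancestors.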

Figure 6.3: XML Representation of a Versioned Document

6.2.2 A General Approach: the ICAP Project

UCLA’s ICAP Project [10] aims at supporting the archiving and querying of the histories of a broad spectrum of documents, including (i) textual reports, such as the various versions of the W3C XLink proposed standards [26], (ii) semistructured documents, such as the UCLA course catalog that is revised every two years [24], and (iii) XML-published relational databases (which are highly structured). Toward this goal, the project applies the approach just described to (i) automate the process of deriving and archiving the V-Documents describing the history of multiversion documents, and (ii) support the efficient execution of temporal queries on these V-Documents. For instance, the V-Document representing the history of the W3C XLink specs makes it easy to query when any part of the specs was last revised, or when a reference to a related technology was first introduced (see footnote 1). For the UCLA course catalog, it becomes possible to find what courses stopped being offered by a certain department in year 2000, and how long keywords such as “nanotechnology” took to migrate from graduate course syllabi to undergraduate ones. For XML-published relational databases instead, the user will now be able to pose the vast assortment of historical queries that have been proposed in the temporal database literature [93, 92, 82]. The XML-based approach proposed here to model and query temporal information is general and applicable to the broad spectrum of documents previously described, i.e., ranging from loosely structured textual documents to highly structured relational databases.

Footnote 1: An additional benefit of V-Documents is the ability to use the standard XML primitives to add annotations to the changed elements. These, for instance, can be used to document and retrieve the justification and provenance of each revision introduced in the specs.

6.3 Complex Queries

Using our change representation scheme, complex queries can be expressed easily. The following is a list of queries expressed in XQuery [30]. All queries were tested with Quip [15], an XQuery engine implemented by SoftwareAG.

QUERY 1. Snapshot: what was the title of Chapter 1 on 2002-01-03?

for $e in doc("V-Document.xml")/document/chapter/title where vstart($e)

Figure 6.4: DTD of Snapshot XML Documents

Figure 6.5: DTD of the V-Document

The XML Schemas of V-Documents can also be constructed automatically, as shown in Figure 6.6.

6.4.4 XChronicler: Generate V-Document from Versions

We built a tool, XChronicler [10], to generate the V-Document from the diffs between successive versions of a document. First, using the Microsoft XMLDiff [1] tool, we compute the diff between the first two versions; the diff X1−2 is represented as an XML document in the Microsoft XML Diff Language, an XML representation of the changes. The initial document V1 is formatted to include vstart and vend attributes, with vend as “now”. Then we apply X1−2 to V1, to combine the changes and update the intervals for each node, and generate V-Document V2. Repeatedly, we apply X2−3 to V2 and generate V3, until we compose the whole history into a final V-Document.

Figure 6.6: XML Schema of the V-Document

6.4.5 Visualizing Versions

The visualization of changes is very helpful for end users, and has recently attracted attention from researchers [23]. The V-Document can be easily queried and processed with the XML stylesheet language XSLT [21], to translate it into other XML documents or HTML documents, and changes between any two versions can thus be visualized through colors. Three types of changes are highlighted: added (yellow), updated (green), and deleted (strikethrough) text. For example, to mark newly added nodes between V1 and V2, we can compare each node’s version interval: if the interval is valid at V2 and not valid at V1, then this is a new node and will be marked with a yellow background using the HTML tag span. Since the V-Document preserves the structures of the snapshot versions, the XSLT stylesheet for snapshot versions can be reused for version display on the Web.
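The interval test used to decide how a node should be highlighted can be sketched as follows. The dissertation performs this step in XSLT; the Python version below is only an illustration of the comparison logic, with hypothetical function names, integer version numbers, and ‘now’ treated as larger than any version.

```python
import xml.etree.ElementTree as ET

NOW = float("inf")   # 'now' compares greater than any version number


def version_interval(node):
    """Return the inclusive (vstart, vend) version interval of a node."""
    vstart = int(node.get("vstart"))
    vend = node.get("vend")
    return (vstart, NOW if vend == "now" else int(vend))


def valid_at(node, version):
    """True if the node belongs to the given version of the document."""
    start, end = version_interval(node)
    return start <= version <= end


def classify(node, v1, v2):
    """Label a node relative to versions v1 < v2."""
    in_v1, in_v2 = valid_at(node, v1), valid_at(node, v2)
    if in_v2 and not in_v1:
        return "added"
    if in_v1 and not in_v2:
        return "deleted"
    return "unchanged" if in_v1 else "absent"
```

In this sketch an updated element shows up as a "deleted" node immediately followed by an "added" node with the same name; a stylesheet that wants to color updates separately would need to pair such neighbours.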

6.5 Summary

Version management is important for web document archives and data warehouses and is being changed profoundly by XML, since a wide assortment of information sources (ranging from textual documents to transaction-time relational databases) can now be represented by a unified data format that supports complex historical queries on the evolution of documents and their contents. In this chapter, we generalize the XML-based approach used for modelling the relational database history, and use it to represent the version history of arbitrary XML documents. The main benefits of this approach are that (i) temporally grouped data models can be represented using the structure of XML, and (ii) complex temporal queries on the document evolution history can be supported in XQuery, without requiring temporal extensions to the XML standards.

CHAPTER 7

Storage of Temporal XML Documents

7.1 Current Approaches to XML Databases

Due to the advantages of XML, new information is increasingly being encoded as XML documents, and much data in existing databases is being transformed into XML documents. It is therefore important to provide a repository for XML documents, to efficiently support storage, indexes, security, transactions, queries across multiple documents, and so on. A natural approach is to extend existing database technology to support XML. Meanwhile, native XML database management systems are emerging to store and manage XML documents.

7.1.1 Categories of XML Databases

One of the key factors that determines how XML information is stored is whether it is document-centric or data-centric [41]: • Document-centric (or text-centric) XML documents. Document-centric XML documents normally have a less regular structure, larger granularity, and lots of mixed content (elements mixed with text data). The order of nodes is normally significant. For example, emails, papers, and XHTML documents are all document-centric. Document-centric XML documents are normally used for human-consumption.

99

• Data-centric XML documents. Data-centric XML documents normally have a very regular structure, fine-grained data, and little or no mixed content. The order of nodes is not significant. For example, product catalogs, orders, invoices, flight schedules, and restaurant menus are all data-centric. Data-centric XML documents are usually intended for machine consumption.

Basically, there are two main approaches to storing and managing XML documents:

• XML-enabled object-relational databases. By extending existing object-relational database technology, an ORDBMS can support the storage and querying of XML documents. XML-enabled databases are generally good at storing and retrieving data-centric documents. Since data-centric documents are well structured and the order of their nodes is not important, XML documents can be shredded into tables, or composed from tables, according to a set of mapping rules. One disadvantage of the relational approach is that the mapping between XML documents and relational data does not round-trip: the mapping preserves only element and attribute nodes, while other nodes, such as processing instructions and comments, are ignored. This approach also cannot handle document-centric documents well; such documents are normally stored in a VARCHAR or Large Object (LOB) column, where the contents of the XML documents are hard to index and retrieve. Meanwhile, reconstructing XML documents by composing nodes fragmented across multiple tables can cause performance problems. New SQL extensions in ORDBMSs are built (or planned) either to create new datatypes (XMLType) to support XML documents, or to create new built-in SQL functions to publish relational data efficiently as XML documents. SQL/XML [18] is emerging as a new ISO SQL standard for such extensions.

• Native XML databases. A native XML database is defined by the XML:DB Initiative [29, 41] as a database that:
– Defines a (logical) model for an XML document, as opposed to the data in that document, and stores and retrieves documents according to that model. At a minimum, the model must include elements, attributes, PCDATA, and document order. Examples of such models are the XPath data model, the XML Infoset, and the models implied by the DOM and the events in SAX 1.0.
– Has an XML document as its fundamental unit of (logical) storage, just as a relational database has a row in a table as its fundamental unit of (logical) storage.
– Does not assume any particular underlying physical storage model. For example, it can be built on relational, hierarchical, or object-oriented databases, or use proprietary storage formats such as compressed, indexed files.

Native XML databases can be categorized according to the storage model [41] as follows:
– Text-based storage: the entire document is stored in text form, and the system provides database-like functionality such as indexing, transactions, query optimization, and so on. Tamino is a native XML database in this category. The advantage of this approach is that it supports round-tripping of XML documents, and it can quickly return an entire XML document, or a fragment of one, in text form.
– Model-based storage: the document is pre-parsed into a binary model (for example, a persistent DOM tree), and the model-structured objects are then stored in an existing data store. One example is the Persistent DOM implementation (PDOM) [75], in which XML documents are pre-parsed into a DOM tree, and the objects in the DOM tree are mapped into a data store. eXcelon (now Sonic XML Server) [17], which also falls into this category, uses the object-oriented database ObjectStore to store XML documents as PDOM trees. The advantage of this approach is that fragments from different documents can be combined quickly.

There are several essential differences between the relational approach and the native approach:
– Native XML databases can preserve any XML node, including elements, attributes, DTDs, PIs, comments, and so on. In the relational approach, only elements and attributes are preserved in practice.
– In the relational approach, a well-defined XML schema is needed for mapping. In the native approach, a schema can be used to define storage and indexing, but it is not required. So native XML databases are better at handling document-centric XML information.
– The interfaces are different. In native XML databases, the database is normally accessed through XPath, DOM, XQuery, or other XML-based APIs, while the interfaces in the relational approach are exactly the same as in a traditional RDBMS, i.e., SQL, ODBC, JDBC, and so on.
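The shredding and composition of data-centric documents discussed above can be illustrated with a small sketch. It is not taken from any of the systems cited here: the catalog document, the table layout, and the function names `shred` and `compose` are all hypothetical, and SQLite merely stands in for an ORDBMS. The sketch maps each element to one row and then rebuilds a document from the rows, which shows why the mapping does not round-trip: the comment node is lost.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical data-centric document; the comment node is deliberate.
CATALOG = """<catalog>
  <!-- prices in USD -->
  <product id="p1"><name>Lamp</name><price>19.50</price></product>
  <product id="p2"><name>Desk</name><price>120.00</price></product>
</catalog>"""


def shred(xml_text, conn):
    """Map each <product> element to one row (elements and attributes only)."""
    conn.execute("CREATE TABLE product (id TEXT PRIMARY KEY, name TEXT, price REAL)")
    root = ET.fromstring(xml_text)  # the default parser discards comments
    for p in root.findall("product"):
        conn.execute(
            "INSERT INTO product VALUES (?, ?, ?)",
            (p.get("id"), p.findtext("name"), float(p.findtext("price"))),
        )


def compose(conn):
    """Rebuild an XML document from the rows, using the same mapping rule."""
    root = ET.Element("catalog")
    query = "SELECT id, name, price FROM product ORDER BY id"
    for pid, name, price in conn.execute(query):
        p = ET.SubElement(root, "product", id=pid)
        ET.SubElement(p, "name").text = name
        ET.SubElement(p, "price").text = f"{price:.2f}"
    return ET.tostring(root, encoding="unicode")


conn = sqlite3.connect(":memory:")
shred(CATALOG, conn)
rebuilt = compose(conn)
```

The rebuilt document contains both products but no comment, and the original whitespace is gone as well; a text-based native store would return the document exactly as it was inserted.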


7.1.2 Storage of XML Data in Object-Relational Database Systems

For all the merits of native XML databases, ORDBMSs are scalable and provide transactions, query optimization, and so on. Extending object-relational databases to support XML is therefore very important for their vendors. There are several challenges in storing XML documents in an ORDBMS.
• Storage of XML documents. For data-centric XML documents, an ORDBMS can use a DTD or XML Schema to guide the relational representation of XML documents. For document-centric XML documents, since a document is often retrieved as a whole, XML documents can be stored in a VARCHAR or CLOB column of a table. In many cases, a hybrid approach is used, where well-structured data is shredded into relational data, and mixed-content fragments are stored as a whole. There is much research on mapping between XML documents and ORDBMSs [91, 66, 61, 67].
• Query of XML documents. Although data can be easily retrieved from relational tables, in practice tagged XML documents must often be produced on the fly. So ORDBMSs have to provide XML publishing functions that can tag relational data as XML documents or fragments, and aggregate the data into hierarchical XML documents. SQL/XML [18] proposes a set of XML publishing functions in SQL. Another direction is to use middleware such as XPERANTO [43, 68] to publish XML documents from relational data. The relational data is exposed as XML views, and users can specify XML queries upon the views in the middleware. The queries are transformed and optimized into SQL queries, and the query results are tagged by the middleware as XML documents and returned to the users.


• Indexing of XML documents. When an XML document is shredded into relational data, existing indexing techniques can be applied to such data. If an XML document is stored as a LOB datatype, side tables can be built to store extracted node values, which are then used for indexing. This approach is not generic: if a node is not in the side tables, it is hard to query upon it. Oracle 10g [14] provides a new datatype, XMLType, for which specialized underlying storage and indexing can potentially be provided.

SQL/XML

New in the SQL 2003 standard as Part 14, SQL/XML [18] defines how SQL can be used in conjunction with XML in a database. The standard has several goals:
• Defining the mapping between SQL identifiers and XML identifiers, and between SQL datatypes and XML datatypes. XML and SQL have different identifier schemes; for example, no space is allowed in XML tag names, so the mapping replaces a space in a relational column name with an encoded string in the XML tag name. The mapping of datatypes is defined based on restricted (or derived) datatypes in XML Schema.
• Defining XML publishing functions in SQL. Basically, there are two types of XML publishing functions: scalar publishing functions and aggregate publishing functions. XML scalar functions construct an XML element or attribute by concatenating and tagging column values. XML aggregate functions aggregate a group of XML elements generated by XML scalar functions and generate a hierarchically structured XML document or fragment.
• Defining an XML datatype in RDBMSs. To efficiently support XML storage and querying, a new XML datatype (or XMLType) in RDBMSs is proposed, which provides the opportunity for special storage and indexing handling of this type. The implementation of XMLType is vendor-specific.
• Defining SQL functions to search, extract, or update XML documents. SQL/XML also defines a set of functions to search, extract, and update XML documents stored in XMLType columns based on XPath expressions.
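To make the first two goals concrete, the following sketch emulates identifier mapping and the scalar and aggregate publishing functions in Python. It only illustrates the idea and is not an implementation of the standard: the helper names `sql_to_xml_name`, `xml_element`, and `xml_agg` and the sample rows are hypothetical, and the escaping rule is simplified to ASCII names (a space becomes `_x0020_`).

```python
import re
import xml.etree.ElementTree as ET


def sql_to_xml_name(identifier):
    """Escape characters that cannot appear in an XML name as _xHHHH_.

    Simplified assumption: only ASCII letters, digits, '_', '.', and '-'
    pass through unchanged."""
    def escape(match):
        return "_x%04X_" % ord(match.group(0))
    return re.sub(r"[^A-Za-z0-9_.\-]", escape, identifier)


def xml_element(name, *children, **attributes):
    """Scalar publishing: tag values (text or elements) as one element.

    Simplification: text values are gathered before any child elements."""
    elem = ET.Element(sql_to_xml_name(name), attributes)
    for child in children:
        if isinstance(child, ET.Element):
            elem.append(child)
        else:
            elem.text = (elem.text or "") + str(child)
    return elem


def xml_agg(rows, make_element):
    """Aggregate publishing: one element per row, returned as a forest."""
    return [make_element(row) for row in rows]


# Hypothetical query result with a column name that contains a space.
rows = [{"emp name": "Bob", "dept": "d01"}, {"emp name": "Ann", "dept": "d02"}]
doc = xml_element(
    "employees",
    *xml_agg(rows, lambda r: xml_element("emp name", r["emp name"], dept=r["dept"])),
)
published = ET.tostring(doc, encoding="unicode")
```

In a SQL/XML-capable database the same result would be produced inside the engine, by nesting the aggregate function within the scalar element constructor in the SELECT clause.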

Oracle XML Support

Oracle XML DB [12] is the technology used for XML storage and retrieval. It provides a native XML datatype for storing and managing XML documents, a set of methods and SQL operators on XML content, integration of XML Schema into the database, and support for SQL/XML and XPath functions.

XMLType. A new native server datatype, XMLType, is built into Oracle and can be used like any other datatype. In Oracle XML DB, XML can be stored in an XMLType column of a relational table, or as an XML object in an XMLType table. Non-schema-based XML is always stored as a CLOB, while schema-based XML can be stored as a CLOB or as a set of objects mapped from the schema. Relational and external data can also be viewed through XML views.

XML Schema Support. Oracle XML DB provides comprehensive XML Schema support. An XML Schema can be used to validate XMLType content. An XML Schema is first registered with the database under a URL; then a default table is created for each global element defined by the XML Schema, and XML instances are decomposed into these default tables. The default tables created are XMLType tables, where each row is an instance of the XMLType datatype. Oracle XML DB can also use XML Schema information to map an XML Schema into SQL types, and allows users to add annotations to the XML Schema to customize the mapping between XML Schema and SQL datatypes.


SQL/XML functions. Oracle XML DB provides two sets of SQL/XML functions: i) SQL/XML publishing functions, including XMLElement() to generate an XMLType element, XMLAttributes() to generate XML attributes, XMLForest() to produce a forest of XML elements, XMLConcat() to concatenate content into an XML fragment, and the XMLAgg() aggregate function, which produces a forest of XML elements from a collection of XML elements; and ii) operators to query and access XML content as part of normal SQL operations. The operators include existsnode(), used in the where clause to test an XPath condition on XML content; extract(), which returns the nodes matching an XPath expression; extractValue(), which returns the leaf-level node value for an XPath expression; updatexml(), which allows a partial update of an XML document; and xmlsequence(), which converts a document fragment into a set of well-formed XML documents.

7.1.3 Storage of XML data in Native XML Databases

Native XML databases store XML documents natively, i.e., an XML document is the smallest unit of storage. One leading native XML database system is SoftwareAG’s Tamino XML Server [20]. To store an XML document with XML Schema, a Tamino Schema needs to be defined for a collection (all documents to be stored with the same schema). Tamino Schema is based on XML Schema, and provides some extensions to define how the data is stored, mapped, indexed, compressed, and so on. Tamino [20] stands for Transaction Architecture for the Management of INternet Objects. In Tamino, documents can be stored with/without a schema, and non-XML documents like images and DTDs can be stored as well. Tamino can also integrate relational data with a component called X-node.


Tamino Architecture Tamino uses HTTP as the primary protocol for inserting and retrieving XML documents. A query language X-Query (an extension of XPath) is supported, and XQuery draft is also partly supported. Data Map is a repository of schemas and data mapping information, which is analogous to the catalog in RDBMS. A Tamino schema is extended upon XML Schema of XML documents. The schema can specify several mapping options. First, an element or attribute can be stored natively as text, and used for text oriented processing. Secondly, an element or attribute can be mapped as a typed storage, and used for value-based searches. Thirdly, an element or attribute can be mapped to a relational database(table, column). The last option is that the element or attribute can be mapped to a UDF (Server Extensions or X-Tension). XML objects can contain logical attributes that are calculated at execution time through Server Extensions. A single document can be mapped to multiple data sources (including Tamino itself). Tamino schemas also define the rules for indexing XML data. Index options can be enabled for text-based indexing or value-based indexing. If a document does not have a schema, it is stored without validation. If the structure of a node in an XML document is not well known, it can be defined as open-content, and the sub-tree under this node will not be validated. This provides the flexibility for schema evolution. X-Machine is the principal run-time engine to handle and store XML data. It includes query optimization, transactions, concurrency control, etc. A tag will be indexed automatically by X-Machine if the index option is enabled in the Data Map. Document contents can be stored in a compressed form, and still with full text searching capability. Requests to create new XML objects or update existing objects are first vali-


dated by the XML Parser with schema information in the Data Map. The Object Processor manages the storage of the document transparently. X-Node is a component for integrating heterogeneous data. It is like a skeletal XML-engine that communicates with a non-XML data source. It pushes incoming data to external systems (via ODBC, OLE-DB) or extract data from these systems as defined in the Data Map. The mapping is bi-directional, and supports both query and update access. X-Node provides a uniform XML view of heterogeneous data and integrates data together.

Storage and Indexing

XML data can be stored in an XML store, or mapped into a SQL store. Tamino compresses documents that are larger than 32KB. Indexes and data are stored separately. Tamino provides three types of indexes for XML data: standard index, text index, and structure index.

Query Processing

A retrieval request is parsed first, and the query optimizer checks whether any index information is available for the current request. If so, the request is processed, the corresponding objects are retrieved with the index, and the Object-Composer builds the final XML document. If there is no index for the current request, the postprocessor receives the request, locates the data, and builds a DOM tree for the Object-Composer, which then builds the final XML document.

Tamino is a native XML database that stores XML data, indexes XML data, supports XML queries and XML schemas, and also integrates heterogeneous data. Tamino offers the most functionality as an XML repository.


7.2 Storage of Temporal XML Documents

XML and XQuery provide effective support for temporal queries at the logical level, but many research issues remain with respect to their efficient implementation, and they will be briefly discussed next. For instance, at the logical level, V-Documents are first clustered by the document structure, and then by the change history. However, a different storage organization is needed [51] to achieve efficient support for document retrieval and querying. In fact, the logical arrangement shown in Figure 6.3 must be supported in physical organization as shown in Figure 7.1. In the organization of Figure 7.1, all elements of the first version are stored at the beginning, and the deltas in the following versions are appended. So the storage is first clustered by vstart, and secondly by the order of the elements in the document (thus, the opposite with respect to the logical representation).

Figure 7.1: Storage Organization

To efficiently support version reconstruction and complex queries, we improve this basic organization with Usefulness-based clustering and the SPaR node numbering scheme.


Usefulness-based Clustering. This scheme [47, 51] clusters the objects of a version into a new page if the percentage of valid objects in a page (its usefulness) falls below a threshold; thus the worst-case I/O cost for version retrieval is greatly decreased. This technique can be applied to our scheme as well. Since the records for a given version are clustered, reconstructing the document at a version only requires retrieving the pages that were useful at that version [50]. The retrieved elements are then ordered according to their Durable Node Numbers (DNN) [50].
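The clustering rule can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the algorithm of [47, 51]: the page capacity, the threshold, and the names `Record`, `Page`, and `VersionStore` are hypothetical, and a page is checked only when one of its records is deleted. When the share of valid records in a page drops below the threshold, the page is retired and its surviving records are copied into a fresh page, so that a snapshot reads only the pages that were useful at the requested version.

```python
from dataclasses import dataclass, field

INF = float("inf")
PAGE_CAPACITY = 4  # records per page (hypothetical value)
THRESHOLD = 0.5    # minimum usefulness of a page (hypothetical value)


@dataclass
class Record:
    elem_id: str
    vstart: int        # first version in which this copy is valid
    vend: float = INF  # first version in which it is no longer valid


@dataclass
class Page:
    useful_from: int
    useful_until: float = INF
    records: list = field(default_factory=list)

    def usefulness(self):
        live = sum(1 for r in self.records if r.vend == INF)
        return live / PAGE_CAPACITY


class VersionStore:
    def __init__(self, elements):
        self.version = 1
        self.pages = []
        for elem_id in elements:
            self._append(elem_id)

    def _append(self, elem_id):
        # Reuse the last page only if it is still useful and has room.
        last = self.pages[-1] if self.pages else None
        if last is None or last.useful_until != INF or len(last.records) >= PAGE_CAPACITY:
            last = Page(useful_from=self.version)
            self.pages.append(last)
        last.records.append(Record(elem_id, self.version))

    def new_version(self, inserted=(), deleted=()):
        self.version += 1
        for page in list(self.pages):
            if page.useful_until != INF:
                continue  # already retired
            hit = False
            for r in page.records:
                if r.vend == INF and r.elem_id in deleted:
                    r.vend = self.version
                    hit = True
            if hit and page.usefulness() < THRESHOLD:
                # Retire the page and copy its surviving records forward.
                page.useful_until = self.version
                for r in page.records:
                    if r.vend == INF:
                        self._append(r.elem_id)
        for elem_id in inserted:
            self._append(elem_id)

    def snapshot(self, v):
        """Return (element ids valid at version v, number of pages read)."""
        useful = [p for p in self.pages if p.useful_from <= v < p.useful_until]
        ids = [r.elem_id for p in useful for r in p.records if r.vstart <= v < r.vend]
        return ids, len(useful)


store = VersionStore("abcdefghijkl")  # version 1 fills three pages
store.new_version(deleted={"a", "b", "c", "e", "f", "g"})
ids_v2, pages_v2 = store.snapshot(2)
ids_v1, pages_v1 = store.snapshot(1)
```

In this run the two sparsely populated pages are retired at version 2, so the version 2 snapshot reads two pages instead of three, while version 1 is still served from the original three. The snapshot returns elements in storage order; restoring document order is the job of the durable node numbers.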

Numbering Scheme. An XML document can be viewed as an ordered tree consisting of tree nodes (elements). A pre-order traversal number can be used to identify the elements of the XML tree [106, 79, 95, 50]. The SPaR (Sparse Preorder and Range) numbering scheme [50] uses durable node numbers (DNN, range) that can sustain frequent updates, an approach first proposed in [79]. Thus the interval [dnn(X), dnn(X)+range(X)] is associated with element X, and we represent it as [nstart, nend] in the V-Document.
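A small sketch shows how such intervals behave. The way numbers are handed out here (a fixed gap of 10 around every node) is an assumption made for illustration and is not the allocation policy of [50]; the function names and the sample document are hypothetical as well. The properties illustrated are the ones the scheme relies on: a descendant's interval lies inside its ancestor's interval, and unused numbers remain available for later insertions.

```python
import xml.etree.ElementTree as ET

GAP = 10  # unused numbers left between neighbouring nodes (hypothetical value)


def assign_numbers(elem, start, table):
    """Give elem and its descendants [nstart, nend] intervals in preorder.

    Returns nend of elem so the caller can place the next sibling after it."""
    cursor = start + GAP
    for child in elem:
        cursor = assign_numbers(child, cursor, table) + GAP
    table[elem] = (start, cursor)
    return cursor


def is_ancestor(a, d):
    """True if interval a strictly contains interval d."""
    return a[0] < d[0] and d[1] < a[1]


doc = ET.fromstring("<book><chapter><title/><section/></chapter><appendix/></book>")
table = {}
assign_numbers(doc, 1, table)
by_tag = {elem.tag: interval for elem, interval in table.items()}

# A later insertion between <title> and <section> can take numbers from the
# gap between them, so no existing node has to be renumbered.
new_note = (33, 38)
```

With these settings `by_tag["chapter"]` is `(11, 61)` and `by_tag["section"]` is `(41, 51)`, so the ancestor test reduces to two integer comparisons, and the inserted interval `(33, 38)` is recognized as a descendant of the chapter without touching any stored number.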

Support for Complex Queries. The use of DNNs also facilitates the maintenance of indexes on multiversion documents. In fact, by using DNNs, efficient indexing schemes [50] and query processing algorithms [51, 52] can be used to support complex queries on multiversion documents. For instance, multiversion B-Tree (MVBT) [37] indexing is used to support complex queries. The scheme of [51] supports both conventional and path expression queries.

7.3 Summary

To manage repositories of XML documents, ORDBMSs usually take one of the following two approaches: i) mapping data between relational tables and XML


documents with extenders or utilities, and ii) building SQL extensions inside the engine to support XML documents, e.g., efficient XML publishing functions, new XML datatypes, and so on. With a new XML datatype, an XML document can be stored “natively” inside an ORDBMS. How to efficiently support XML queries will be a challenging issue for ORDBMS vendors in the near future. A middleware approach hides the relational data from users and provides XML views on the relational data, so that users can query the views directly with XML query languages. Native XML database approaches have more flexibility to store, query, and index XML data. In current approaches, either an OODBMS is adapted to store XML data, or a new native XML database engine is built from scratch. Although native XML databases provide a great deal of functionality to support XML, how to support queries efficiently is still under study.


CHAPTER 8
Conclusion

Archiving web information is becoming very important in the information age: the final objective is to allow users to retrieve past snapshots and query the evolution history of documents and databases. In this dissertation, we have investigated i) extending RDBMSs with temporal support via XML, and ii) temporal queries and version management for XML document repositories.

We first proposed a temporally-grouped data model, which has long been recognized as natural and expressive, to represent the history of a relational database using XML. Indeed, we showed that XML provides a more supportive environment for querying and manipulating temporal information than SQL. Then we showed that temporal queries that would be very difficult to express in SQL can now be easily expressed in standard XQuery, which provides greater expressive power and native extensibility. In addition, the XML-viewed history can be effectively transformed and visualized using XSLT. The approach is also complete, since its realization requires neither the invention of new techniques nor the extension of existing standards.

Next, we turned to the problem of efficient support for these temporal queries in DBMSs. The ArchIS system we built demonstrates that the transaction-time histories of relational databases can be stored and queried efficiently by using (i) XML to provide temporally-grouped representations of such histories, (ii)


SQL/XML to implement queries expressed against these representations, and (iii) a segment-based temporal clustering scheme for efficient temporal query support. Furthermore, we proposed a block-based data compression technique for archived history based on LZ77 and Huffman encoding, which provides efficient storage and, given the growing disparity between I/O and CPU speeds, efficient queries. The ArchIS system demonstrates the query mapping, indexing, clustering, and compression techniques we used to achieve performance levels well above those of a native XML DBMS.

The approach realized by ArchIS can be applied to transaction-time, valid-time, and bitemporal databases. Complex historical queries and updates, which would be very difficult to express in SQL on relational tables, can now be easily expressed in XQuery on such XML-based representations. Our approach is general and can be applied to arbitrary XML documents.

Version management is important for web document archives and data warehouses and is being changed profoundly by XML, since a wide assortment of information sources can now be represented by a unified data format that supports complex historical queries on the evolution of documents and their contents. We therefore proposed a logical version model for XML documents, which concisely represents the revision history of an XML document as yet another XML document. The main benefits of this approach are that (i) temporally grouped data models can be represented using the structure of XML, and (ii) complex temporal queries can be supported in XQuery, without requiring temporal extensions to the XML standards.

Finally, we generalized the storage techniques for temporal XML documents, and showed the various clustering and indexing approaches that can achieve efficient execution of temporal queries.


In summary, the XML-based approach we proposed for managing evolving information is simple and efficient, and provides a unified solution for a wide spectrum of temporal application problems, requiring no extension to existing standards.

Future Work

At the physical level, many clustering and indexing techniques have been proposed for temporal databases [89]; these deserve further investigation, particularly because, for valid-time and bitemporal databases, the optimal techniques could be different from those that are most effective with transaction-time databases.

Difficult problems result from database schema evolution, since schema changes can occur reasonably often, and this trend has been accelerated by the globalization of services brought by the Internet [84]. Database schema evolution brings very challenging problems, including the historical management of metadata and the ability to answer queries involving historical data stored under previous versions of the database schema. These research problems are well beyond the scope of this thesis, even though the general representation we have proposed for temporal information can provide a unifying framework for evolving data and metadata.


References

[1] Microsoft XML Diff. http://apps.gotdotnet.com/xmltools/xmldiff/. [2] Yukon SQL Database Server. http://www.microsoft.com/sql/yukon. [3] ATLaS. http://wis.cs.ucla.edu/atlas. [4] BerkeleyDB. http://www.sleepycat.com. [5] DB2 V8.1 Documentation. http://www.ibm.com/db2/. [6] DB2 XML Extender. http://www-3.ibm.com/software/data/db2/extenders/xmlext/. [7] Electronic Business using eXtensible Markup Language (ebXML). http://www.ebxml.org.

[8] Extensible Markup Language (XML). http://www.w3.org/XML/. [9] Geography Markup Language (GML). http://www.opengis.org. [10] ICAP: Incorporating Change Management into Archival Processes. http://wis.cs.ucla.edu/projects/icap/.

[11] National Archives of Australia's policy statement Archiving Web Resources: A Policy for Keeping Records of Web-based Activity in the Commonwealth Government. http://www.naa.gov.au/recordkeeping. [12] Oracle Documentation. http://otn.oracle.com. [13] Oracle Flashback Technology. http://otn.oracle.com/deploy/availability/htdocs/flashback overview.htm. [14] Oracle XML. http://otn.oracle.com/xml/. [15] Quip: Software AG’s XQuery Prototype. http://www.softwareag.com/tamino.

[16] SOAP. http://www.w3.org/tr/soap/. [17] Sonic XML Server. http://www.sonicsoftware.com. [18] SQL/XML. http://www.sqlx.org. [19] Table Compression in Oracle9i Release 2. http://otn.oracle.com/oramag/webcolumns/2003/techarticles/poess tablecomp.html.


[20] Tamino XML Server. http://www.tamino.com. [21] The Extensible Stylesheet Language (XSL). http://www.w3.org/Style/XSL/.

[22] The Internet Archive–The Wayback Machine. http://www.archive.org/. [23] The Versioning Machine. http://mith2.umd.edu/products/ver-mach/. [24] UCLA Catalog. http://www.registrar.ucla.edu/catalog/. [25] WebDAV, WWW Distributed Authoring and Versioning. www.ietf.org/html.charters/webdav-charter.html. [26] XML Linking Language (XLink) Version 1.0. http://www.w3.org/TR/xlink/.

[27] XML Path Language (XPath). http://www.w3.org/TR/xpath. [28] XML Schema. http://www.w3.org/XML/Schema. [29] XML:DB Initiative for XML Databases. http://www.xmldb.org. [30] XQuery 1.0: An XML Query Language. http://www.w3.org/XML/Query. [31] XQuery 1.0 and XPath 2.0 Functions and Operators. http://www.w3.org/tr/xquery/.

[32] XUpdate - XML Update Language. http://www.xmldb.org/xupdate/. [33] Zlib. http://www.gzip.org/zlib/. [34] T. Amagasa, M. Yoshikawa, and S. Uemura. A Data Model for Temporal XML Documents. In DEXA, 2000. [35] T. Amagasa, M. Yoshikawa, and S. Uemura. Realizing Temporal XML Repositories using Temporal Relational Databases. In CODAS, pages 63–68, 2001. [36] B. R. Iyer and D. Wilhite. Data Compression Support in Databases. In VLDB, 1994. [37] B. Becker, S. Gschwind, T. Ohler, B. Seeger, and P. Widmayer. On Optimal Multiversion Access Structures. In Proc. of Symposium on Large Spatial Databases, 1993.


[38] D. Beech and B. Mahbod. Generalized Version Control in an Object-Oriented Database. In ICDE, pages 14–22, 1988. [39] E. Bertino, E. Ferrari, and G. Guerrini. A Formal Temporal Object-Oriented Data Model. In EDBT, 1996. [40] M. H. Böhlen, R. T. Snodgrass, and M. D. Soo. Coalescing in Temporal Databases. In VLDB, 1996. [41] R. Bourret. XML and Databases. http://www.rpbourret.com/xml/xmlanddatabases.htm. [42] P. Buneman, S. Khanna, K. Tajima, and W. Tan. Archiving Scientific Data. In ACM SIGMOD, 2002. [43] M. Carey, J. Kiernan, J. Shanmugasundaram, et al. XPERANTO: A Middleware for Publishing Object-Relational Data as XML Documents. In VLDB, 2000. [44] S.S. Chawathe, S. Abiteboul, and J. Widom. Managing Historical Semistructured Data. Theory and Practice of Object Systems, 24(4):1–20, 1999. [45] S.S. Chawathe, A. Rajaraman, H. Garcia-Molina, and J. Widom. Change Detection in Hierarchically Structured Information. In SIGMOD, 1996. [46] C. Chen and C. Zaniolo. Universal Temporal Extensions for Database Languages. In ICDE, 1999. [47] S.Y. Chien, V.J. Tsotras, and C. Zaniolo. Version Management of XML Documents. In WebDB, 2000. [48] S.Y. Chien, V.J. Tsotras, and C. Zaniolo. Copy-Based versus Edit-Based Version Management Schemes for Structured Documents. In RIDE, 2001. [49] S.Y. Chien, V.J. Tsotras, and C. Zaniolo. Efficient Management of Multiversion Documents by Object Referencing. In VLDB, 2001. [50] S.Y. Chien, V.J. Tsotras, C. Zaniolo, and D. Zhang. Storing and Querying Multiversion XML Documents Using Durable Node Numbers. In WISE, 2001. [51] S.Y. Chien, V.J. Tsotras, C. Zaniolo, and D. Zhang. Efficient Complex Query Support for Multiversion XML Documents. In EDBT, 2002.


[52] S.Y. Chien, Z. Vagena, D. Zhang, V.J. Tsotras, and C. Zaniolo. Efficient Structural Joins on Indexed XML Documents. In VLDB, 2002. [53] J. Chomicki and D. Toman. Temporal Logic in Information Systems. In Logics for Databases and Information Systems, pages 31–70. Kluwer, 1998. [54] J. Chomicki, D. Toman, and M.H. Böhlen. Querying ATSQL Databases with Temporal Logic. TODS, 26(2):145–178, June 2001. [55] H. Chou and W. Kim. A Unifying Framework for Version Control in a CAD Environment. In VLDB, 1986. [56] J. Clifford. Formal Semantics and Pragmatics for Natural Language Querying. Cambridge University Press, 1990. [57] J. Clifford, A. Croker, F. Grandi, and A. Tuzhilin. On Temporal Grouping. In Recent Advances in Temporal Databases, pages 194–213. Springer Verlag, 1995. [58] J. Clifford, C.E. Dyreson, T. Isakowitz, C.S. Jensen, and R.T. Snodgrass. On the Semantics of “Now” in Databases. TODS, 22(2):171–214, 1997. [59] G. Cobena, S. Abiteboul, and A. Marian. Detecting Changes in XML Documents. In ICDE, 2002. [60] D. DeHaan, D. Toman, M. P. Consens, and M. T. Ozsu. A Comprehensive XQuery to SQL Translation Using Dynamic Interval Encoding. In SIGMOD, 2003. [61] A. Deutsch, M. Fernandez, and D. Suciu. Storing Semistructured Data with STORED. SIGMOD Record, 28(2), 1999. [62] C.E. Dyreson. Observing Transaction-Time Semantics with TTXPath. In WISE, 2001. [63] E. Camossi, E. Bertino, G. Guerrini, and M. Mesiti. Automatic Evolution of Multigranular Temporal Objects. In TIME, 2002. [64] R. Elmasri and G.T.J. Wuu. A Temporal Model and Query Language for ER Databases. In ICDE, pages 76–83, 1990. [65] M. Fernandez and J. Simon. Growing XQuery. In ECOOP, 2003. [66] M. Fernandez, W. Tan, and D. Suciu. SilkRoute: Trading Between Relations and XML. In 8th Intl. WWW Conf., 1999.


[67] D. Florescu and D. Kossmann. Storing and Querying XML Data using an RDBMS. IEEE Transactions: Data Engineering, 1999. [68] J.E. Funderburk, G. Kiernan, J. Shanmugasundaram, E. Shekita, and C. Wei. XTABLES: Bridging Relational Technology and XML. IBM Systems Journal, 41(4), 2002. [69] D. Gao and R. T. Snodgrass. Temporal Slicing in the Evaluation of XML Queries. In VLDB, 2003. [70] M. Gergatsoulis and Y. Stavrakas. Representing Changes in XML Documents using Dimensions. In Xsym, 2003. [71] F. Grandi. An Annotated Bibliography on Temporal and Evolution Aspects in the World Wide Web. In TimeCenter Technical Report TR-75, 2003. [72] F. Grandi and F. Mandreoli. The Valid Web: An XML/XSL Infrastructure for Temporal Management of Web Documents. In ADVIS, 2000. [73] H. Gregersen and C. Jensen. Conceptual Modeling of Time-varying Information. In TimeCenter Technical Report TR-35, September 1998. [74] H. Gregersen and C. S. Jensen. Temporal Entity-Relationship Models - A Survey. Knowledge and Data Engineering, 11(3):464–497, 1999. [75] G. Huck, I. Macherius, and P. Fankhauser. PDOM: Lightweight Persistency Support for the Document Object Model. In OOPSLA Workshop. [76] J. Goldstein, R. Ramakrishnan, and U. Shaft. Compressing Relations and Indexes. In ICDE, 1998. [77] C. S. Jensen and C. E. Dyreson (eds). A Consensus Glossary of Temporal Database Concepts - February 1998 Version. Temporal Databases: Research and Practice, pages 367–405, 1998. [78] S. Kepser. A Proof of the Turing-Completeness of XSLT and XQuery. In Technical report SFB 441, Eberhard Karls Universität Tübingen, 2002. [79] Q. Li and B. Moon. Indexing and Querying XML Data for Regular Path Expressions. In VLDB, 2001. [80] A. Marian, S. Abiteboul, G. Cobena, and L. Mignet. Change-Centric Management of Versions in an XML Warehouse. In The VLDB Journal, pages 581–590, 2001.


[81] N. Alur, P. Haas, D. Momiroska, et al. DB2 UDB’s High Function Business Intelligence in e-Business. http://www.redbooks.ibm.com/, 2002. [82] G. Ozsoyoglu and R.T. Snodgrass. Temporal and Real-Time Databases: A Survey. IEEE Transactions on Knowledge and Data Engineering, 7(4):513–532, 1995. [83] D. Papadias, Y. Tao, P. Kalnis, and J. Zhang. Indexing Spatio-Temporal Data Warehouses. In ICDE, 2002. [84] S. Ram and G. Shankaranarayanan. Research Issues in Database Schema Evolution: the Road Not Taken. In Working Paper No. 2003-15, Department of Information Systems, Boston University, 2003. [85] M.J. Rochkind. The Source Code Control System. IEEE Transactions on Software Engineering, SE-1(4):364–370, 1975. [86] J.F. Roddick. A Survey of Schema Versioning Issues for Database Systems. Information and Software Technology, 37(7):383–393, 1995. [87] J.F. Roddick. A Model for Schema Versioning in Temporal Database Systems. In Proc. 19th. ACSC Conf., pages 446–452, 1996. [88] M. Rys. Proposal for an XML Data Modification Language. In Microsoft Report, 2002. [89] B. Salzberg and V. J. Tsotras. Comparison of Access Methods for Time-evolving Data. ACM Comput. Surv., 31(2):158–221, 1999. [90] J. Shanmugasundaram et al. Efficiently Publishing Relational Data as XML Documents. In VLDB, 2000. [91] J. Shanmugasundaram, K. Tufte, C. Zhang, G. He, D. J. DeWitt, and J. F. Naughton. Relational Databases for Querying XML Documents: Limitations and Opportunities. In VLDB, pages 302–314. [92] R. T. Snodgrass. Temporal Object-Oriented Databases: a Critical Comparison. Addison-Wesley/ACM Press, 1995. [93] R. T. Snodgrass. The TSQL2 Temporal Query Language. Kluwer, 1995. [94] R. T. Snodgrass. Developing Time-Oriented Database Applications in SQL. Morgan Kaufmann, 1999.


[95] D. Srivastava, S. Al-Khalifa, H.V. Jagadish, N. Koudas, J.M. Patel, and Y. Wu. Structural Joins: A Primitive for Efficient XML Query Pattern Matching. In ICDE, 2002. [96] A. Steiner. A Generalisation Approach to Temporal Data Models and their Implementations. PhD thesis, ETH Zurich, 1997. [97] I. Tatarinov, Z.G. Ives, et al. Updating XML. In SIGMOD, 2001. [98] W.F. Tichy. RCS - A System for Version Control. Software - Practice and Experience, 15(7):637–654, 1985. [99] F. Vitali. Versioning Hypermedia. In ACM Computing Surveys 31(4es), 1999. [100] W. K. Ng and C. V. Ravishankar. Relational Database Compression Using Augmented Vector Quantization. In ICDE, 1995. [101] F. Wang and C. Zaniolo. Preserving and Querying Histories of XML-Published Relational Databases. In ECDM, 2002. [102] H. Wang and C. Zaniolo. Using SQL to Build New Aggregates and Extenders for Object-Relational Systems. In VLDB, 2000. [103] Y. Wang, D. J. DeWitt, and J. Cai. X-Diff: A Fast Change Detection Algorithm for XML Documents. In ICDE, 2003. [104] J. Yang. Temporal Data Warehousing. PhD thesis, Stanford University, 2001. [105] C. Zaniolo, S. Ceri, C. Faloutsos, R.T. Snodgrass, V.S. Subrahmanian, and R. Zicari. Advanced Database Systems. Morgan Kaufmann Publishers, 1997. [106] C. Zhang, J. F. Naughton, D. J. DeWitt, Q. Luo, and G. M. Lohman. On Supporting Containment Queries in Relational Database Management Systems. In SIGMOD, 2001.
