Multi-dimensional Approach to Indexing XML Data

Multi-dimensional Approach to Indexing XML Data Ph.D. Thesis

Michal Krátký

Department of Computer Science
VŠB – Technical University of Ostrava
Czech Republic

Ph.D. Thesis

© 2004

Michal Krátký
http://www.cs.vsb.cz/kratky

Thesis supervisor: Václav Snášel
http://www.cs.vsb.cz/snasel

Department of Computer Science, Faculty of Electrical Engineering and Computer Science
VŠB – Technical University of Ostrava
17. listopadu 15, 708 33 Ostrava–Poruba
Czech Republic
http://www.cs.vsb.cz http://fei.vsb.cz http://www.vsb.cz

Typeset by PDFLATEX

Contents

List of Definitions
List of Figures
List of Listings
List of Tables
List of Examples
Acknowledgement
Preface

I Introduction

1 Extensible Mark-up Language (XML)
  1.1 Introduction
  1.2 DTDs and XML Schemas
  1.3 Model of XML Documents
  1.4 XML Query Languages
    1.4.1 XPath
    1.4.2 XQuery
  1.5 Another XML Technologies

2 Indexing XML Data – State of the Art
  2.1 Numbering Scheme
  2.2 Approaches Based on Relation Decomposition
    2.2.1 STORED
    2.2.2 XML Indexing and Storage System (XISS)
    2.2.3 XML Indexing and Retrieval with a Hybrid Storage Model
  2.3 Multi-dimensional Approaches
    2.3.1 Indexing XML Data as a Multi-dimensional Problem
    2.3.2 XPath Accelerator
  2.4 Approaches Based on Trie Representation of XML Documents
    2.4.1 The Index Fabric
    2.4.2 RegXP

3 Multi-dimensional Indexing
  3.1 Introduction
  3.2 Query Types
  3.3 Curse of Dimensionality

II Multi-dimensional Approach to Indexing XML Data

4 Multi-dimensional Approach to Indexing XML Data
  4.1 Introduction
  4.2 Model of XML Documents
  4.3 Query Processing
    4.3.1 Queries for Values of Elements and Attributes
    4.3.2 Implementation of XPath Axes
    4.3.3 Query Defined by a Regular Path Expression
  4.4 Indexing More XML Documents
  4.5 Inserting, Updating, and Deleting in Indexed XML Documents
  4.6 Problem Issues of the Multi-dimensional Approach
    4.6.1 Different Dimensions of Points
    4.6.2 Efficiency of Processing the Narrow Range Query
  4.7 Cost Analysis
  4.8 Assessment

5 Multi-dimensional Term Indexing for Efficient Processing of Complex Queries
  5.1 Introduction
  5.2 Existing Approaches
  5.3 Multi-dimensional Term Indexing for Efficient Processing of Complex Queries
    5.3.1 Term as n-dimensional Point
    5.3.2 Regular Expression Query Construction
    5.3.3 Term Clustering using Z-ordering
  5.4 Cost Analysis

III Multi-dimensional Index Data Structures

6 UB-tree and its Variants
  6.1 Introduction
  6.2 Z-ordering and Z-region
  6.3 Operations of the (B)UB-tree

7 R-tree and its Variants
  7.1 Introduction
  7.2 Split Procedure
  7.3 R-tree Variants

8 Multi-dimensional Forests
  8.1 Introduction
  8.2 BUB-forest
    8.2.1 Operations on BUB-forest
  8.3 Storage Volume Reduction
  8.4 Cost Analysis

9 Signature Multi-dimensional Trees
  9.1 Introduction
  9.2 Processing a Narrow Range Query in Multidimensional Data Structures
  9.3 Signature Methods
  9.4 Signature Multi-dimensional Trees
  9.5 Signature R-tree
    9.5.1 Operations of the Signature R-tree
    9.5.2 Range Query Operation for a Narrow Hyper Box
    9.5.3 Cost Analysis
  9.6 Signature Generating

IV Experimental Results

10 Multi-dimensional Approach to Indexing XML Data
11 Signature Multi-dimensional Data Structures
12 Multi-dimensional Term Indexing for Efficient Processing of Complex Queries

Conclusion
Bibliography
Author's publications
Index

List of Definitions

2.1 XML document as 3-dimensional points
3.1 Range query
3.2 Complex range query
3.3 Narrow range query
4.1 Point of n-dimensional space representing a labelled path
4.2 Point of n-dimensional space representing a path
5.1 Term as n-dimensional point
6.1 Z-address
6.2 Z-region
8.1 The BUB-forest
8.2 BUB-forest range query
9.1 Intersect and relevant regions, relevant ratio
9.2 Quality ratio of a range query algorithm
9.3 n-dimensional signature
9.4 Range query processing with the n-dimensional signature

List of Figures

1.1 An XML tree of XML document in Listing 1.1
1.2 (a) Preorder and (b) postorder traversal of an XML tree
1.3 XPath semantics: signposted nodes are nodes of the result forest if an (a) parent::*, (b) ancestor::*, and (c) ancestor-or-self::* step is taken from context node 7.
1.4 XPath semantics: signposted nodes are nodes of the result forest if an (a) child::*, (b) descendant::*, and (c) descendant-or-self::* step is taken from context node 1.
1.5 XPath semantics: signposted nodes are nodes of the result forest if an (a) preceding::* and (b) following::* step is taken from context node 6 and 1, respectively.
1.6 XPath semantics: signposted nodes are nodes of the result forest if an (a) preceding-sibling::* and (b) following-sibling::* step is taken from context node 6 and 2, respectively.
2.1 An XML tree whose nodes are annotated by Dietz's numbering scheme
2.2 Numbering schemes of an XML tree: (a) Dietz's numbering scheme (b) XISS numbering scheme
2.3 XISS element index
2.4 XISS structure index
2.5 XML documents in a 3-dimensional cube
2.6 (a) An XML tree whose nodes are annotated by Dietz's numbering scheme (b) Node distribution in the preorder/postorder plane and XML document regions as seen from context node h
2.7 String-encoding of XML paths
2.8 PATRICIA trie with additional layers
3.1 Examples of the narrow range queries in spaces with the dimensions n = 2 and n = 3, respectively.
3.2 Sphere in a 2-dimensional space
3.3 Search region and probability of query box with respect to increasing dimensionality
4.1 Example of XML tree with unique numbers id_N(u_i) of elements and attributes u_i and unique numbers id_T(s_i) of names of elements and attributes and their values s_i (values in parenthesis)
4.2 Example of an XML tree suitable for document-centric XML document in Listing 1.3 with unique numbers id_N(u_i) of elements u_i and unique numbers id_T(s_i) of names of elements and their text values s_i (values in parenthesis)
4.3 Trees modelling XPath queries: (a) /books/books[author='Joseph Heller'] (b) /books//author
4.4 Procedure of searching preceding and following children using two range queries after finding one child
5.1 Spaces of BUB-forest BF_2(2, 3) and query boxes for processing of query (t*)
6.1 (a) The Z-curve filling the entire 2-dimensional space 8 × 8. (b) 2-dimensional space 8 × 8 with tuples T_1 – T_8. These tuples define the BUB-tree Z-regions partitioning [0:2],[7:11],[25:30],[57:62] with the node capacity 2.
6.2 This BUB-tree indexes the tuples presented in Figure 6.1
7.1 Structure of the R-tree
8.1 (a) Path length frequency histogram of path dataset extracted from the Protein Sequence Database XML document (b) Dimension frequency of points representing paths in the document
8.2 (a) Term length frequency histogram of term dataset extracted from the TREC's document collections LATIMES and FBIS (b) Number of terms with growing maximal term length
8.3 Example of BUB-forest BF_2
9.1 Points T_1, T_2, and T_3 in MBB (4, 1) : (6, 5) and the narrow range query (1, 2) : (5, 2)
9.2 Structure of the Signature R-Tree
9.3 Tuples and regions from Examples 9.1 and 9.2: (a) MBB (b) signature region (c) the result region – an intersection of the MBB and signature region
12.1 Statistics of right extension test
12.2 Statistics of left-right extension test
12.3 Statistics of left extension test

List of Listings

1.1 Well formed XML document valid w.r.t. DTD in Listing 1.2
1.2 DTD of documents which contain information about books and authors
1.3 A part of document-centric XML document from the INEX collection
1.4 XML document with a DTD
1.5 XML Schema of documents which contain information about books and authors (e.g. a document in Listing 1.1)
1.6 Example of XQuery queries
6.1 Insertion algorithm for tuple T
6.2 Point query algorithm to find tuple T
6.3 Deletion algorithm for a tuple T
6.4 Original (B)UB-tree range query algorithm
6.5 Novel (B)UB-tree range query algorithm based on the intersection operation
9.1 Range query operation for a narrow hyper box

List of Tables

1.1 Semantics of axes supported by XPath for context node u
2.1 XPath axes α and their corresponding query windows window(α, v) (context node v)
4.1 Characteristics of some document-centric XML document collections
4.2 Characteristics of the largest documents in project XML Data Repository [78]
10.1 Path index characteristics
10.2 Results of queries for values of elements and attributes and queries defined by simple path based on a parent-child relationship in the path index
10.3 Results of XPath axes queries in the path index
11.1 A characterization of the test data collection and the size of index file
11.2 A characterization of index multi-dimensional data structures
11.3 A characterisation of narrow range query sets
11.4 Experimental results of processing the narrow range queries – NRQ
11.5 Experimental results of processing the narrow range queries – the ratio of searched leaf nodes
11.6 Experimental results of processing the narrow range queries – cQ ratio
11.7 Experimental results of processing the narrow range queries – the disk access cost
11.8 Experimental results of processing the narrow range queries – time of query processing

List of Examples

1.1 Well formed XML document valid w.r.t. DTD
1.2 Document-centric and data-centric XML document
3.1 Sphere in a multi-dimensional space
3.2 Cube in a multi-dimensional space
4.1 Decomposition of XML tree to paths and labelled paths
4.2 The XML tree suitable for an XML document with the mixed content
4.3 Evaluation plan of the XPath query /books/book[author='Joseph Heller']
4.4 Evaluation plan of the XPath query /books//author
5.1 Transformation of terms into points
5.2 Term indexing and querying in BUB-forest
6.1 Z-curve
6.2 Z-region
8.1 Path dataset
8.2 Term dataset
8.3 Storage volume reduction for term index
9.1 Reason of inefficiency processing of the narrow range query in R-tree
9.2 Usage of the n-dimensional signature for filtration of irrelevant tree pages
9.3 Signature region

Acknowledgement. I would like to thank my girlfriend Eve for her love and support. My thanks also go to my parents. I would like to thank Václav Snášel (VŠB–Technical University of Ostrava) and Jaroslav Pokorný (Charles University, Prague) for their expert help during my Ph.D. study.

July 2004, Michal Krátký

Preface

The Extensible Mark-up Language (XML) has been proposed by the World Wide Web Consortium (W3C) as a standard for data representation and exchange. XML represents information by elements, which can be nested, and attributes, which are parts of elements. Moreover, XML allows users to define their own set of tags for describing domain-specific data. Consequently, XML is a tool for heterogeneous data exchange, for data representation on the Internet, and for describing semi-structured data, e.g. text documents, spatial data, mathematical formulas, chemical compounds, biological data and so on.

XML has recently come to be understood as a new approach to data modelling. A well-formed XML document, or a set of such documents, is an XML database, and the associated DTD or schema specified in the XML Schema language is its database schema. Implementing a system that stores and queries XML documents efficiently (a so-called native XML database) requires the development of new techniques and is of great importance in information technology today.

An XML document is usually modelled as a graph whose nodes correspond to XML elements and attributes. The graph is mostly a tree. To obtain specified data from an XML database, a number of special query languages have been developed, e.g. XML-QL, XPath, and XQuery. A common feature of these languages is the possibility of formulating paths in the XML graph. Such a path is a sequence of element or attribute names from the root element to a leaf. Regular expressions provide a valuable method for specifying paths. In fact, most XML query languages are based on the XPath language, which uses a form of path expressions for composing more general queries. XPath defines a family of 13 axes, i.e. relationship types by which the current element can be associated with other elements of the XML tree. The family of axes defined in XPath is designed to allow a set of graph traversal operations that are regarded as atomic in XML document trees.

In the past, the use of existing relational or object-relational DBMSs for storing and querying XML data was considered many times. Since a tree is traversed during the evaluation of a query, approaches built on the conventional database languages SQL or OQL fail or are not efficient enough. Consequently, another form of indexing is necessary.

Several approaches to indexing XML or, more generally, semi-structured data have appeared recently. Some of them are based on traditional relational technology; others use special data structures for representing XML data, such as tries, multi-dimensional data structures, or signature data structures. Existing approaches often allow efficient processing of only a minor subset of an XML query language. Often, the processing of XML queries is not efficient, or the algorithms and data structures cannot be applied to indexing huge volumes of XML data.

In the course of the development of XML databases, the need for a benchmark framework has become more and more evident, because many different ways to store and query XML data have been suggested. Examples of such frameworks are XMark, XML Data Repository, and INEX.

This book summarizes the work the author carried out during his Ph.D. study. The aim of this work was to develop an approach to indexing XML data that is more general and allows more efficient processing of XML queries than existing approaches. An additional important aim was to develop an approach to indexing a huge number of large XML documents.

A novel multi-dimensional approach was developed and published. The multi-dimensional approach splits an XML document into root-to-leaf paths, which are mapped to multi-dimensional points. Such points are indexed by balanced and paged multi-dimensional data structures. Consequently, XML queries are processed using the queries of these data structures: point and range queries are applied. Because the obtained points have specific properties, a straightforward application of current data structures such as the R-tree and UB-tree does not succeed. The first peculiarity is that the points have very different dimensions. The overhead of current data structures when indexing such points is enormous, so query processing is not efficient. The second is that the range query used in the multi-dimensional approach to indexing XML data is the so-called narrow range query, which existing multi-dimensional data structures process inefficiently. This work describes data structures that solve both problems. As these data structures are paged and balanced, they are suitable for indexing a huge number of large XML documents. Consequently, the multi-dimensional approach can serve as an alternative way of implementing native XML databases. Moreover, our proposal is more general and enables efficient querying of the text content of an element or of an attribute value, as well as queries based on regular path expressions and the axes of the XPath specification. Inserting, updating, and deleting previously indexed XML documents is possible as well.

The work is organized as follows. This book is divided into four parts, each containing a coherent and closely related set of chapters. These parts are not self-contained and should be read in order. The four parts are as follows:

Part I: Introduction
Part II: Multi-dimensional Approach to Indexing XML Data
Part III: Multi-dimensional Index Data Structures
Part IV: Experimental Results

Part I contains preliminary chapters. XML, XML query languages and related topics are described in Chapter 1. Chapter 2 presents the state of the art: several approaches to indexing XML data are shown. These approaches are also evaluated: the extent of the implemented query language subset and the efficiency of query processing are compared. The multi-dimensional approach applies multi-dimensional data structures to indexing XML data. Hence, Chapter 3 describes multi-dimensional indexing. Principal terms such as the point query, the multi-dimensional range query and the narrow range query are defined. Emphasis is put on the curse of dimensionality, which decreases the efficiency of query processing in multi-dimensional data structures.

Part II describes the multi-dimensional approach to indexing XML data. Chapter 4 describes the primary principles of the approach: the mapping of an XML document into points of a multi-dimensional space, the implementation of a query language subset in the approach, and so on. A significant part is a comparison of existing approaches to indexing XML data with our approach. Term indexing is a relevant problem of indexing XML data; consequently, a novel multi-dimensional approach to term indexing is explained in Chapter 5. The approach makes it possible to process complex queries over a large term collection.

Both existing and novel multi-dimensional data structures are described in Part III. In Chapters 6 and 7, we review existing multi-dimensional indexes, the UB-tree and the R-tree, respectively, and their variants. Chapter 8 describes a novel data structure, the multi-dimensional forest, which is employed for indexing multi-dimensional points of different dimensions. Such points are obtained from an XML document in the multi-dimensional approach. In the multi-dimensional approach to indexing XML data, XML queries are often processed using narrow range queries in a multi-dimensional data structure. Processing such a query is inefficient in existing data structures. Consequently, Chapter 9 puts forward a novel data structure for efficient processing of the narrow range query. A signature is applied in the data structure; therefore, we call it the signature multi-dimensional data structure. Importantly, the data structure significantly mitigates the curse of dimensionality.

Part IV puts forward experimental results of the described approaches. In Chapter 10, we show experimental results of the multi-dimensional approach to indexing XML data. Various XPath queries are tested. Chapter 11 describes experimental results of the signature multi-dimensional data structures. In Chapter 12, experimental results of the multi-dimensional approach to term indexing are presented. These results demonstrate the efficiency of the multi-dimensional forest for indexing points of different dimensions. Finally, the Conclusion summarizes the contributions and discusses possibilities for future work.
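The decomposition of a document into root-to-leaf paths mentioned in the Preface can be illustrated by a small sketch. It is only an illustration under stated assumptions: the precise mapping of paths to points is given later in the thesis (Definitions 4.1 and 4.2) and is not reproduced here, so the function names and the simple first-seen numbering of element names below are hypothetical stand-ins. The sketch also shows why the resulting points may have different dimensions.

```python
import xml.etree.ElementTree as ET


def root_to_leaf_paths(root):
    """Yield every root-to-leaf path as a list of element names, in document order."""
    stack = [(root, [root.tag])]
    while stack:
        node, path = stack.pop()
        children = list(node)
        if not children:
            yield path
        # Push children in reverse so that the first child is visited first.
        for child in reversed(children):
            stack.append((child, path + [child.tag]))


def paths_to_points(paths):
    """Map each path to a tuple of integers (hypothetical numbering:
    every distinct element name gets the next free integer)."""
    ids = {}
    points = []
    for path in paths:
        points.append(tuple(ids.setdefault(tag, len(ids)) for tag in path))
    return points, ids


doc = ET.fromstring(
    "<books>"
    "<book><title>Catch 22</title><author>Joseph Heller</author></book>"
    "<note/>"  # hypothetical extra element, added to get a shorter path
    "</books>"
)
paths = list(root_to_leaf_paths(doc))
points, ids = paths_to_points(paths)
for path, point in zip(paths, points):
    print("/".join(path), point)
```

Paths of different lengths yield points of different dimensions (here two 3-dimensional points and one 2-dimensional point), which is the first of the two problems the Preface attributes to a straightforward use of existing multi-dimensional data structures.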

Part I Introduction

Chapter 1 Extensible Mark-up Language (XML)

1.1 Introduction

The Extensible Mark-up Language (XML) [79] has been proposed by the World Wide Web Consortium (W3C) as a standard for data representation and exchange. XML represents information by elements, which can be nested, and attributes, which are parts of elements. Languages like HyperText Markup Language (HTML) [90] are defined by a fixed set of tags, and the data contained in a document is intertwined with information about its presentation. In contrast to such languages, XML, which is a simple and flexible text format derived from the Standard Generalized Markup Language (SGML) [40], allows users to define their own set of tags for describing domain-specific data. The definition of tags and structure (using e.g. the Document Type Definition (DTD) language) is called an XML application or an XML-based language. Consequently, XML is a tool for heterogeneous data exchange, for data representation on the Internet, and for describing semi-structured data, e.g. text documents (for example the XML application DocBook [61]), spatial data (Geography Markup Language (GML) [60]), mathematical formulas (MathML [86]), chemical compounds (CML [95]), biological data and so on.

Each element has a type, identified by name, sometimes called its generic identifier (GI), and may have a set of attribute specifications. Each attribute specification has a name and a value. The beginning of every non-empty XML element is marked by a start-tag (e.g. <name> for the element identified by name). The end of every element that begins with a start-tag must be marked by an end-tag containing the same name as the start-tag (e.g. </name>). The text between the start-tag and the end-tag is called the element's content. Elements may contain character data and/or child elements. An element type has mixed content when elements of that type may contain character data, optionally interspersed with child elements. An element with no content is said to be empty. Of course, an empty element may contain attribute specifications. The representation of an empty element is either a start-tag immediately followed by an end-tag, or an empty-element tag (e.g. <name/>).

All XML documents must conform to these basic grammar rules (for a complete description see [79]). Such documents can be processed by an XML parser and are said to be well formed. Importantly, it is not necessary to write a separate parser for each XML document. A well-formed XML document can be validated against a DTD or an XML Schema (see Section 1.2). An XML document that conforms to a given DTD or schema is said to be valid.

Example 1.1 (Well formed XML document valid w.r.t. DTD). Let us take the well-formed XML document in Listing 1.1, which is valid with respect to the DTD in Listing 1.2. The document contains information about books and authors. We can see that the XML document includes the root element identified by the name books. The child of the root element, element type book, contains a single attribute with the name id. For example, the attribute id of the first book has the value 003-04312. Elements identified by title and author are children of element type book.

Listing 1.1. Well formed XML document valid w.r.t. DTD in Listing 1.2

<books>
  <book id="003-04312">
    <title>The Two Towers</title>
    <author>J.R.R. Tolkien</author>
  </book>
  <book id="...">
    <title>The Return of the King</title>
    <author>J.R.R. Tolkien</author>
  </book>
  <book id="...">
    <title>Catch 22</title>
    <author>Joseph Heller</author>
  </book>
</books>

Listing 1.2. DTD of documents which contain information about books and authors

<!DOCTYPE books [
  ...
]>
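The well-formedness rules above (matching start- and end-tags, empty-element tags, attribute specifications) can be tried out with any XML parser. The following is a minimal sketch using the parser from the Python standard library; the helper name is_well_formed and the sample strings are illustrative assumptions, not part of the thesis. Note that this parser checks well-formedness only and does not validate against a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Start-tags and end-tags match; the book element carries an id attribute.
good = (
    '<books><book id="003-04312">'
    "<title>The Two Towers</title><author>J.R.R. Tolkien</author>"
    "</book></books>"
)

# The title element is never closed, so the document is not well formed.
missing_end_tag = "<books><book><title>The Two Towers</book></books>"

# An empty element may be written as an empty-element tag and still carry attributes.
empty_element = '<books><book id="003-04312"/></books>'

for sample in (good, missing_end_tag, empty_element):
    print(is_well_formed(sample))

first_book = ET.fromstring(good).find("book")
print(first_book.get("id"), [child.tag for child in first_book])
```

A single generic parser handles all three inputs, which is the point made in the text: no document-specific parser has to be written.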

The XML data are instance of semi-structured data. An unstructured data may occur into the structured elements. Further, an XML document can be classified on the basis of contained data, as a data-centric or document-centric one. The data-centric XML documents have defined regular structure and capture a structured data rather. On the

1.2. DTDs and XML Schemas

9

other hand, the document-centric documents are often much unstructured, contain fewer elements with amount of unstructured data (e.g. an XML database of articles). Documentcentric documents often contain elements with the mixed content. The most of XML documents are combined from the both types (so-called hybrid documents). Example 1.2 (Document-centric and data-centric XML document). In Listing 1.1 data-centric XML document is depicted. In Listing 1.3 a part of documentcentric XML document from the INEX collection [35] is shown. We can see the element crt has the mixed content. Listing 1.3. A part of document-centric XML document from the INEX collection < article > A1003 10.1041/A1003s−1995 IEEE Annals of the History of Computing 1058−6180/95/$4.00 © 1995 IEEE ... ...

The first issue of our 17th volume ...



A common rule for modelling information using XML [15] is that data are contained in the character values of elements, and tags describe what the data elements are, whereas attributes tell us something about how to interpret the data elements (so-called metadata). Additional information on XML can be found in [79, 37, 15].

1.2 DTDs and XML Schemas

Both DTDs [79] and XML Schemas [84, 85] are languages used to define the structure of XML documents. They determine what elements can be contained within an XML document, what elements may be nested in other elements, what default values their attributes can have, and so on. Given a DTD or schema and a corresponding XML document, a parser can validate whether the document conforms to the desired structure and constraints.


XML DTDs are a subset of SGML DTDs. A DTD defines the document structure with a list of legal elements. The keywords !ELEMENT and !ATTLIST introduce element type and attribute list declarations, respectively. A DTD also allows us to declare how many times a child element must occur inside its parent element: one or more times (using +), zero or more times (using *), or zero or one time (using ?). Attributes of type ID and IDREF (or IDREFS) make it possible to create relationships between elements.

Example 1.3 (A DTD). Listing 1.4 shows an XML document, containing information about books and authors, together with a DTD. The DTD is interpreted as follows:
• !DOCTYPE books defines that this is a document of the type books.
• !ELEMENT books defines the books element as containing book elements.
• !ELEMENT book defines the book element as having the child elements title and author.
• !ATTLIST book id defines the id attribute of the element type book. This attribute is required for each book element.
• !ELEMENT title defines the title element to be of the type PCDATA. Consequently, the element contains character data.
• !ELEMENT author defines the author element to be of the type PCDATA.

Listing 1.4. XML document with a DTD
<!DOCTYPE books [
  <!ELEMENT books (book+)>
  <!ELEMENT book (title, author)>
  <!ATTLIST book id CDATA #REQUIRED>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT author (#PCDATA)>
]>
<books>
  <book id="003-04312">
    <title>The Two Towers</title>
    <author>J.R.R. Tolkien</author>
  </book>
  <book id="001-00863">
    <title>The Return of the King</title>
    <author>J.R.R. Tolkien</author>
  </book>
  <book id="045-00012">
    <title>Catch 22</title>
    <author>Joseph Heller</author>
  </book>
</books>

However, a DTD cannot express constraints such as the number of instances of a particular element within a document, the type of data within each element, and so on. Since data typing and instantiation constraints are less critical in document-centric XML documents, the DTD is inherently suitable for such documents. For more information about DTDs, see [79, 37, 88].

XML Schemas differ from DTDs in that the XML Schema definition language is itself based on XML. XML Schema is richer and consequently more suitable for structure description than DTD. XML Schemas also support namespaces and data types. Many built-in data types are defined, e.g. xs:string, xs:decimal, xs:integer, xs:boolean, xs:date, and xs:time.

Example 1.4 (An XML Schema). Listing 1.5 shows an XML Schema of the document in Listing 1.1. The books and book elements are said to be of a complex type because they contain other elements. The title and author elements, as well as the id attribute, are said to be of a simple type because they contain only text; they cannot contain any other elements or attributes. Note the constraints placed on the value of the attribute id.

Listing 1.5. XML Schema of documents which contain information about books and authors (e.g. a document in Listing 1.1)
<xsd:attribute name="id" type="IdType" use="required"/>
<xsd:restriction base="xsd:string">

For additional information concerning XML Schema, see [84, 85, 89].

1.3 Model of XML Documents

An XML document is usually modelled as a graph whose nodes correspond to XML elements and attributes. The graph is mostly a tree (we suppose that none of the attributes is of IDREF/IDREFS type), the so-called XML tree. An attribute is modelled as a child of the related element. String values of elements or attributes, or empty values, occur in the leaves. A common feature of XML query languages (see Section 1.4) is the possibility to formulate paths in the XML graph.

Example 1.5 (An XML tree). Figure 1.1 shows the XML tree of the XML document in Listing 1.1. Since tree nodes differ in meaning, element nodes, attribute nodes, and text values of elements and attributes are drawn using different shapes in this figure: an element node is marked by a circle, an attribute node by a triangle, and a node holding the text value of an element or attribute by a rectangle.

books
  book (id = 003-04312; title = The Two Towers; author = J.R.R. Tolkien)
  book (id = 001-00863; title = The Return of the King; author = J.R.R. Tolkien)
  book (id = 045-00012; title = Catch 22; author = Joseph Heller)

Fig. 1.1. An XML tree of XML document in Listing 1.1

Document order
There is an ordering, document order [79], defined on all the nodes in the document, corresponding to the order in which the first character of the XML representation of each node occurs in the XML representation of the document. Thus, the root node is the first node. Element nodes occur before their children, so document order orders element nodes by the occurrence of their start-tags in the XML. The attribute nodes and namespace nodes of an element occur before the child elements of the element. The namespace nodes are defined to occur before the attribute nodes.

In [38] the document order is characterized informally: the document order of an XML instance orders its nodes in the order in which a sequential read of the XML (textual) representation of the instance would encounter them. A much more useful characterization is that document order is determined by a preorder traversal of the document tree. In a preorder traversal, a tree node v is visited and assigned its preorder rank pre(v) before its children are recursively traversed from left to right. A postorder traversal is the dual of the preorder traversal: a node v is assigned its postorder rank post(v) after all its children have been traversed from left to right.

Example 1.6 (Postorder and preorder traversal). Figure 1.2 shows the preorder and postorder traversal of an XML tree. The document order is a < b < c < d < e < f < g < h < i < j, and thus pre(a) = 0, pre(b) = 1, . . . , pre(j) = 9. The postorder traversal yields: post(d) = 0, post(e) = 1, post(c) = 2, post(f) = 3, post(b) = 4, post(i) = 5, post(j) = 6, post(h) = 7, post(g) = 8, post(a) = 9.

Tree shape (children in parentheses): a(b(c(d, e), f), g(h(i, j)))

Fig. 1.2. (a) Preorder and (b) postorder traversal of an XML tree
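The ranks of Example 1.6 can be reproduced with a short Python sketch. The adjacency-list encoding TREE and the function name ranks are assumptions made for illustration; the tree shape is the one implied by the pre(v) and post(v) values listed in Example 1.6.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding of the tree from Example 1.6: node -> ordered children.
TREE: Dict[str, List[str]] = {
    "a": ["b", "g"],
    "b": ["c", "f"],
    "c": ["d", "e"],
    "g": ["h"],
    "h": ["i", "j"],
}


def ranks(root: str, tree: Dict[str, List[str]]) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Assign preorder and postorder ranks in one depth-first traversal."""
    pre: Dict[str, int] = {}
    post: Dict[str, int] = {}

    def visit(v: str) -> None:
        pre[v] = len(pre)            # rank assigned before the children are visited
        for child in tree.get(v, []):
            visit(child)
        post[v] = len(post)          # rank assigned after all children are visited

    visit(root)
    return pre, post


pre, post = ranks("a", TREE)
print(pre)
print(post)
```

Sorting the nodes by pre gives the document order a < b < ... < j.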

1.4 XML Query Languages

To obtain specified data from an XML database, a number of special query languages have been developed, e.g. XML-QL [20], XQL [66], XPath [81], and XQuery [80]. A common feature of these languages is the possibility to formulate paths in the XML graph. Such a path is a sequence of element or attribute names from the root element to a leaf. Regular expressions provide a valuable method for path specification. In fact, most XML query languages are based on the XPath language, which uses a form of path expressions for composing more general queries.

1.4.1 XPath

The core of the XPath language [83, 81], the path expression, directly reflects the recursive nature of tree-shaped data. To be more precise, XPath expressions operate on trees of element or attribute nodes. The basic XPath query expression in unabbreviated notation is axis::tag[filter]; evaluated on a context node u, it yields the set of nodes u′ for which:
• the relation given by the axis contains (u, u′),
• the tag of u′ is tag,
• the condition given by filter evaluates to true on u′.

In other words [38], XPath expressions specify a tree traversal via two parameters: (1) a context node (not necessarily the root), which is the starting point of the traversal, and (2) a sequence of location steps, syntactically separated by /, evaluated from left to right. Given a context node, a step’s axis establishes a subset of document nodes (a document region). This set of nodes, or forest, provides the context nodes for the next step, which is evaluated for each node of the forest in turn. The results are unioned together and sorted in document order. Let us note that the name test axis::* succeeds for any element tag or attribute name.

XPath Axes
XPath defines a family of 13 axes, i.e. relationship types by which a context element can be associated with other elements in the XML tree. Among these, the child and descendant-or-self axes are probably the most widely known, by their mnemonic abbreviations / and //, respectively. The family of axes defined in XPath is designed to allow a set of graph traversal operations that were seen to be atomic in XML document trees. In [55, 54] the author proposes the Conditional XPath language, which is more expressive than XPath. Table 1.1 lists all XPath axes and verbally sketches their semantics. Note that regular path expressions correspond to the XPath axes child and descendant-or-self plus name tests.

Example 1.7 (XPath semantics). To illustrate the semantics of the XPath axes, Figures 1.3–1.6 depict the result forests for steps along the different axes (except the attribute, self, and namespace axes) taken from various context nodes.

Example 1.8 (XPath queries). To illustrate the complexity of the XPath language, some queries on the collections DBLP [48] and Shakespeare in XML [11] are shown:
• //[author = 'Rudolf Bayer']//title – find all title elements where Rudolf Bayer is the author. The ancestors of the matching author elements are not constrained. The title elements are descendants of the parent elements of the matching author elements.
• /dblp/conference/issues/issue/inproceedings[year = '2001']/title – find the titles of all proceedings articles which were published in 2001. The path to the element year is defined exactly.
• //[speaker = 'Mark Anthony']//line – find all line elements whose ancestor element contains a child element speaker containing Mark Anthony. The ancestors of the element speaker are not constrained.
• //[title = 'The Tragedy of Anthony and Cleopatra']/act//line – find all line elements where the title is The Tragedy of Anthony and Cleopatra. The ancestors of the matching title elements are not constrained. The line elements have an act element as an ancestor, and the parent element of the matching title element is also the parent of the act element.
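Queries of this kind can be tried out with a minimal Python sketch. It uses the findall method of the standard xml.etree.ElementTree module, which supports only a limited subset of XPath (child and descendant steps with simple predicates), so it illustrates the flavour of path expressions rather than the full language. The variable names and the two query strings are assumptions chosen for illustration; the data are the three books described in Example 1.1.

```python
import xml.etree.ElementTree as ET

# The books document of Example 1.1, embedded as a string for a self-contained run.
BOOKS = """
<books>
  <book id="003-04312"><title>The Two Towers</title><author>J.R.R. Tolkien</author></book>
  <book id="001-00863"><title>The Return of the King</title><author>J.R.R. Tolkien</author></book>
  <book id="045-00012"><title>Catch 22</title><author>Joseph Heller</author></book>
</books>
"""

root = ET.fromstring(BOOKS)  # the context node is the root element books

# Descendant step with a filter on a child element, followed by a child step.
tolkien_titles = [t.text for t in root.findall(".//book[author='J.R.R. Tolkien']/title")]

# Child step with a filter on an attribute value.
by_id = [t.text for t in root.findall("./book[@id='045-00012']/title")]

print(tolkien_titles)
print(by_id)
```

Results are returned in document order, in line with the step semantics described above.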

Table 1.1. Semantics of axes supported by XPath for context node u

Axis                 Result forest
parent               the first node on the path from u to the root node
ancestor             nodes lying on the path from u to the root node (recursive closure of parent)
ancestor-or-self     u and the nodes lying on the path from u to the root node
child                direct descendants of the node u
descendant           all nodes for which the node u is an ancestor (recursive closure of child)
descendant-or-self   like descendant, plus u
preceding            nodes preceding the node u in document order (except ancestors)
following            nodes following the node u in document order (except descendants)
preceding-sibling    siblings of the node u preceding it in document order (like preceding, same parent as u)
following-sibling    siblings of the node u following it in document order (like following, same parent as u)
attribute            attribute nodes of the node u
self                 u
namespace            namespace nodes of the node u
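The axis semantics of Table 1.1 can be made concrete with a small Python sketch. The ten-node tree below is an assumption: it has the same shape as the tree of Example 1.6, with nodes named by their preorder rank, so that comparing node names compares positions in document order. The function names are illustrative, and only element nodes are modelled (no attribute or namespace nodes).

```python
from typing import Dict, List

# Hypothetical tree: node -> ordered children; node names equal preorder ranks.
CHILDREN: Dict[int, List[int]] = {0: [1, 6], 1: [2, 5], 2: [3, 4], 6: [7], 7: [8, 9]}
PARENT: Dict[int, int] = {c: p for p, cs in CHILDREN.items() for c in cs}
NODES = range(10)


def ancestors(u: int) -> List[int]:
    """Recursive closure of parent."""
    out = []
    while u in PARENT:
        u = PARENT[u]
        out.append(u)
    return sorted(out)


def descendants(u: int) -> List[int]:
    """Recursive closure of child."""
    out = []
    for c in CHILDREN.get(u, []):
        out.append(c)
        out.extend(descendants(c))
    return sorted(out)


def preceding(u: int) -> List[int]:
    """Nodes before u in document order, except ancestors."""
    anc = set(ancestors(u))
    return [v for v in NODES if v < u and v not in anc]


def following(u: int) -> List[int]:
    """Nodes after u in document order, except descendants."""
    desc = set(descendants(u))
    return [v for v in NODES if v > u and v not in desc]


def preceding_sibling(u: int) -> List[int]:
    return [v for v in preceding(u) if PARENT.get(v) == PARENT.get(u)]


def following_sibling(u: int) -> List[int]:
    return [v for v in following(u) if PARENT.get(v) == PARENT.get(u)]


print(ancestors(7), descendants(1))
print(preceding(6), following(1))
print(preceding_sibling(6), following_sibling(2))
```

For every node u, the sets ancestors(u), descendants(u), preceding(u), following(u) and {u} partition the whole tree, which is the property that makes these four axes a natural basis for indexing.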

Fig. 1.3. XPath semantics: highlighted nodes are nodes of the result forest if an (a) parent::*, (b) ancestor::*, and (c) ancestor-or-self::* step is taken from context node 7.

Fig. 1.4. XPath semantics: highlighted nodes are nodes of the result forest if an (a) child::*, (b) descendant::*, and (c) descendant-or-self::* step is taken from context node 1.

Fig. 1.5. XPath semantics: highlighted nodes are nodes of the result forest if an (a) preceding::* and (b) following::* step is taken from context node 6 and 1, respectively.

Fig. 1.6. XPath semantics: highlighted nodes are nodes of the result forest if an (a) preceding-sibling::* and (b) following-sibling::* step is taken from context node 6 and 2, respectively.

1.4.2 XQuery

XQuery [80] is a promising XML query language today. A subset of XPath is part of XQuery, but more complex constructs are added to the language. Therefore, XQuery is sometimes said to be too complex.

Example 1.9 (XQuery queries). In Listing 1.6 three XQuery queries are shown.

Listing 1.6. Example of XQuery queries
1. doc('books.xml')/books/book[price50 order by $x/name return $x/name
3. { for $b in doc('books.xml')/books/book, $n in $b/name, $a in $b/author return { $n } { $a } }

1.5 Another XML Technologies

There are many technologies related to XML. Since this work is aimed only at indexing XML data, just some of them are briefly described here.

SOAP [91], the Simple Object Access Protocol, is used to invoke code offered by Web services over the Internet using XML and HTTP.

Since an XML document does not contain any presentation information, it can be formatted in a flexible manner. A standard approach to formatting XML documents is to use XSL [87], the eXtensible Stylesheet Language. It consists of XSL Transformations (XSLT), a language for transforming XML, and XSL Formatting Objects (XSL-FO), an XML vocabulary for specifying formatting semantics.

The XML Linking Language (XLink) [92] allows elements to be inserted into XML documents in order to create and describe links between resources. It uses XML syntax to create structures that can describe links similar to the simple unidirectional hyperlinks of today’s HTML, as well as more sophisticated links.

The XML Pointer Language (XPointer) [82] is used as the basis for a fragment identifier for any URI reference that locates a resource. XPointer, which is based on XPath, supports addressing into the internal structures of XML documents. It allows for examination of a hierarchical document structure and choice of its internal parts based on various properties, such as element types, attribute values, character content, and relative position.

Chapter 2
Indexing XML Data – State of the Art

From the database point of view, XML has recently been understood as a new approach to data modelling [63]. A well-formed XML document or a set of documents is an XML database, and the associated DTD or schema specified in the XML Schema language is its database schema. Implementing a system that stores and queries XML documents efficiently (a so-called native XML database) requires the development of new techniques [63, 12] and is of great importance in information technology today.

A common feature of XML query languages, as described in Section 1.4, is the possibility to formulate paths in the XML graph. Such a path is a sequence of element or attribute names from the root element to a leaf. Regular expressions provide a valuable method for path specification. In the past, there were many considerations about the use of existing relational or object-relational DBMSs for storing and querying XML data. Since a tree is traversed during the evaluation of a query, conventional approaches based on the database languages SQL or OQL fail or are not very efficient. Consequently, another form of indexing is necessary.

Recently, several approaches to indexing XML or, more generally, semi-structured data have appeared. Some of them are based on traditional relational technology (e.g. Lore [56], STORED [21], and XISS [49]); others use special data structures for the representation of XML data, such as tries (e.g. Index Fabric [17] and DataGuide [64]), multi-dimensional data structures (e.g. XPath Accelerator [38]), or signature data structures (e.g. [97]). A more complete summary of various approaches to indexing XML data can be found e.g. in [15, 2].

In the course of the development of XML databases, the need for a benchmark framework has become more and more evident, because many different ways to store and query XML data have been suggested. Examples of benchmarks and test collections are XMark [69], XML Data Repository [78], DBLP [48], and INEX [35].

As a numbering scheme is often applied for quick assessment of the ancestor-descendant relationship between XML tree nodes, such a scheme is described in Section 2.1. In Section 2.2 approaches based on relational decomposition are described; among others, STORED and XISS are briefly presented. In Section 2.3 some multi-dimensional approaches are described. Section 2.4 shows approaches based on the trie representation of XML documents. Existing approaches often allow efficient processing of only a minor subset of an XML query language. Often, the processing of XML queries is not very efficient, or the algorithms and data structures cannot be applied to indexing huge volumes of XML data. Consequently, these approaches are evaluated as well: the extent of the implemented query language subset and the efficiency of the query processing are compared.

2.1 Numbering Scheme

To facilitate XML query processing, it is crucial to provide mechanisms that quickly determine the ancestor-descendant relationship between XML tree nodes. Dietz’s numbering scheme [22, 23] was the first to use tree traversal order to determine the ancestor-descendant relationship between any pair of tree nodes. Dietz’s proposition is: for two given nodes x and y of a tree T, x is an ancestor of y if and only if x occurs before y in the preorder traversal of T and after y in the postorder traversal.

Example 2.1 (Dietz’s numbering scheme). Figure 2.1 shows an XML tree whose nodes are annotated by Dietz’s numbering scheme. Each node is labelled with a pair of preorder and postorder numbers. In the tree, we can tell that node (1,4) is an ancestor of node (5,3), because node (1,4) comes before node (5,3) in the preorder (i.e., 1 < 5) and after node (5,3) in the postorder (i.e., 4 > 3). On the other hand, node (2,2) is not an ancestor of node (5,3): although 2 < 5 in the preorder, 2 < 3 in the postorder.

Node labels (preorder, postorder): (0,9), (1,4), (2,2), (3,0), (4,1), (5,3), (6,8), (7,7), (8,5), (9,6)

Fig. 2.1. An XML tree whose nodes are annotated by Dietz’s numbering scheme

An obvious benefit of this approach is that the ancestor-descendant relationship can be determined in constant time by examining the preorder and postorder numbers of tree nodes.
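The constant-time test can be sketched in a few lines of Python. The type alias Node and the function name is_ancestor are illustrative assumptions; the node labels are the (preorder, postorder) pairs discussed in Example 2.1.

```python
from typing import Tuple

Node = Tuple[int, int]  # (preorder rank, postorder rank)


def is_ancestor(x: Node, y: Node) -> bool:
    """Dietz's criterion: x precedes y in preorder and follows y in postorder."""
    return x[0] < y[0] and x[1] > y[1]


print(is_ancestor((1, 4), (5, 3)))  # the ancestor pair from Example 2.1
print(is_ancestor((2, 2), (5, 3)))  # the non-ancestor pair from Example 2.1
```

The test needs only two integer comparisons and no tree traversal, which is why such labels are attractive as index keys.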

2.2 Approaches Based on Relation Decomposition

XML data management systems often store the document schema and data in proprietary repositories [56, 30]. Such solutions provide great flexibility, but they sometimes incur space and time costs because the schema is replicated and the replicated schema must be processed. The more significant disadvantage, however, is that valuable features developed in commercial database systems, such as concurrency control, recovery, and transaction management, cannot be used. A number of research efforts aim to bring the power of commercial DBMSs into the XML world, and the majority of them are based on the idea of decomposing XML documents into a set of relational tables. It is known that semi-structured data can be stored as a ternary relation (XML requires some additional minor attributes), but such a straightforward mapping is not efficient [32]. The approaches developed so far propose more sophisticated and efficient mappings.

2.2.1 STORED

The classical work based on relational decomposition of XML documents is STORED [21]. It attempts to generate a good relational schema automatically for an existing XML data instance. The generated schema is based on patterns discovered in the data instance. Subsequently, queries and updates over the semi-structured view are rewritten into queries and updates over the relational store. If a query mix is known in advance, it may be used during the schema generation phase.

The generated STORED mapping may not capture the whole data instance. Portions of data which do not fit into the relational schema are stored in an overflow graph, which can be implemented by a non-relational storage or by the simple ternary relation. The authors claim that the efficiency of the overflow graph is not crucial because the graph should be small. Of course, this is not always true.

Generating a good mapping involves competing goals. One may want to limit the number of tables, bound the disk space, minimize the number of nulls, and possibly meet other application-dependent requirements. These goals may be modelled as a storage cost optimization problem, which unfortunately is known to be NP-hard in the size of the input data. Instead, a heuristic based on data mining is proposed. The mapping-building algorithm uses the notions of storage patterns and pattern support. A storage pattern consists of a prefix path and a set of attributes and sub-elements called the body. Pattern support is the number of XML elements reachable in the document by the pattern prefix path and containing the nodes described by the pattern body. The algorithm calculates the patterns with the highest support and then groups them to satisfy the restrictions on the number of tables, attributes, and total disk space. Finally, the patterns are translated into STORED mappings and the accompanying overflow mappings.

Experiments with the DBLP dataset [48] showed that the algorithm is able to generate relational mappings which cover from 60% up to 90% of the data instance size while producing three relations. STORED is one of the most outstanding contributions in the XML data storage field, and it proved that relational storage may be used efficiently for XML data. However, modern approaches compete with STORED with respect to both space and time costs, while being much more flexible when working with dynamically changing data.

2.2.2 XML Indexing and Storage System (XISS)

In the paper [49] a system called XISS for indexing and storing XML data is proposed. This system is based on a numbering scheme for nodes which quickly determines the ancestor-descendant relationship between elements and/or attributes in the XML tree. The paper also proposes several algorithms for processing regular path expressions.

Numbering Scheme
The authors argue that the limitation of the previously described Dietz’s numbering scheme is its lack of flexibility: the preorder and postorder numbers may have to be recomputed for many tree nodes when a new node is inserted. To get around this problem, a new numbering scheme is proposed. It uses an extended preorder and a range of descendants, and associates each node with a pair of numbers <order, size> as follows:
• For a tree node y and its parent x, order(x) < order(y) and order(y) + size(y) ≤ order(x) + size(x). In other words, the interval [order(y), order(y) + size(y)] is contained in the interval [order(x), order(x) + size(x)].
• For two sibling nodes x and y, if x is the predecessor of y in the preorder traversal, then order(x) + size(x) < order(y).

Then, for a tree node x, size(x) ≥ Σ_y size(y), where the sum ranges over all direct children y of x. The ordering induced by the order values is equivalent to that of the preorder traversal.

Lemma 2.1. For two given nodes x and y of a tree T, x is an ancestor of y if and only if order(x) < order(y) ≤ order(x) + size(x).
Proof. See [49].

Example 2.2 (XISS numbering scheme). Figure 2.2 shows XML trees whose nodes are annotated by both Dietz’s and the XISS numbering schemes. In Figure 2.2(b) each node is labelled by a pair which defines an interval. The interval of a node is properly contained in the interval of its parent node. For example, the node (11,15) is contained in both (10,30) and (1,100). Hence, the node with order 11 is a descendant of the nodes with order 10 and 1.

Compared with Dietz’s scheme, this numbering scheme is more flexible and can deal with dynamic updates of XML data more efficiently. A common technique for handling insertions into an XML document is to reserve extra space for future insertions.

(a) node labels (preorder, postorder): (0,9), (1,4), (2,2), (3,0), (4,1), (5,3), (6,8), (7,7), (8,5), (9,6)
(b) node labels (order, size): (1,100), (10,30), (11,15), (15,5), (22,5), (30,10), (45,30), (50,20), (55,5), (65,5)

Fig. 2.2. Numbering schemes of an XML tree: (a) Dietz’s numbering scheme (b) XISS numbering scheme

Global reordering is not necessary until all the reserved spaces (i.e., unused order values) are consumed. Note that for both numbering schemes, deleting a node does not cause renumbering of the nodes. However, it is easier for the XISS numbering scheme to recycle the order values of deleted nodes.
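Lemma 2.1 and the use of reserved space can be sketched in Python as follows. The function names and the inserted node (27, 2) are assumptions made for illustration; the other labels are the <order, size> pairs mentioned in Example 2.2.

```python
from typing import Tuple

Label = Tuple[int, int]  # (order, size)


def is_ancestor(x: Label, y: Label) -> bool:
    """Lemma 2.1: order(x) < order(y) <= order(x) + size(x)."""
    order_x, size_x = x
    order_y, _ = y
    return order_x < order_y <= order_x + size_x


def fits_as_child(parent: Label, child: Label) -> bool:
    """Check the interval containment required between a node and its parent."""
    return parent[0] < child[0] and child[0] + child[1] <= parent[0] + parent[1]


root, b, c = (1, 100), (10, 30), (11, 15)

# Hypothetical insertion: a new child of (10,30) placed into unused order values
# after the sibling (11,15), whose interval ends at 26. No existing label changes.
new_node = (27, 2)

print(is_ancestor(root, c), is_ancestor(b, c))
print(fits_as_child(b, new_node), is_ancestor(c, new_node))
```

Because the new label lies in a gap of unused order values, no other node has to be renumbered, which is the flexibility argument made above.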

Index Structure
The index structure of XISS is composed of three major components: the element index, the attribute index, and the structure index. All distinct name strings are collected in the name index, which is implemented as a B+-tree. Each distinct name string is then uniquely identified by a name identifier (nid) returned from the name index. All string values (i.e. attribute values and text values) are collected in the value table. Each XML document is also assigned a unique document identifier (did), which is an index key used to retrieve the document name. An element or attribute can be uniquely identified in the entire system by its order and did.

Both the element index and the attribute index are implemented as B+-trees using name identifiers (nid) as keys. Each entry in a leaf node points to a set of fixed-length records for elements (or attributes) having an identical name string, grouped by the document they belong to. The element index makes it possible to find all elements with the same name string. Each element record includes an <order, size> pair and other related information about the element, and the element records are sorted by their order values, as shown in Figure 2.3. The attribute index has almost the same structure as the element index, except that a record in the attribute index has a value identifier vid, which is a key used to obtain the attribute value from the value table.

The structure index is a collection of linear arrays (see Figure 2.4), each of which stores a set of fixed-length records for all elements and attributes of an XML document. Within an array, the elements and attributes are sorted together by their order values (i.e., in preorder traversal). Each record of the structure index stores a nid, the order values of the first sibling, the first child, and the first attribute, and so on.

B+-tree keyed by element nid → document ID list → element list with the same name in the same document; element record: <Order, Size>, Depth, Parent ID

Fig. 2.3. XISS element index

B+-tree keyed by document ID (did) → array of all elements and attributes in the same document; element record: nid, <Order, Size>, Parent order, Child order, Sibling order, Attribute order

Fig. 2.4. XISS structure index

Processing of a Regular Path Expression
The authors propose path-join algorithms for processing regular path expression queries. Consider the following sample query: /chapter/ */figure[@caption='Tree Frogs']. The main idea of the proposed path-join algorithms is that a complex path expression is decomposed into several simple path expressions. Each simple expression produces an intermediate result that can be used in the subsequent stage of processing. The results of the simple path expressions are then combined or joined together to obtain the final result of the given query. A regular path expression can be decomposed into a combination of the following basic subexpressions:
1. a subexpression with a single element or a single attribute,
2. a subexpression with an element and an attribute (e.g., figure[@caption='Tree Frogs']),
3. a subexpression with two elements in an ancestor-descendant relationship (e.g., chapter/figure or chapter/ */figure),
4. a subexpression that is a Kleene closure (+, ∗) of another subexpression (e.g., chapter* or chapter+), and
5. a subexpression that is a union of two other subexpressions.

A subexpression of type (1) can be processed by accessing the element index or the attribute index. A subexpression of type (5) can be processed by merging two intermediate results and grouping them by documents. For the other three subexpression types (2), (3), and (4), the authors propose three path-join algorithms, namely EA-Join, EE-Join, and KC-Join, respectively.

The EA-Join algorithm joins two intermediate results from subexpressions, which are a list of elements and a list of attributes (subexpression type (2)). Both the list of elements and the list of attributes are scanned sequentially. An attribute and an element are merged if they belong to the same document and the element is the parent of the attribute.

The EE-Join algorithm joins two intermediate results, each of which is a list of elements obtained from a subexpression (subexpression type (3)). As in the EA-Join algorithm, both lists of elements are scanned sequentially. Two elements are merged if they belong to the same document and they are in an ancestor-descendant relationship.

The KC-Join algorithm processes a regular path expression that represents zero, one, or more occurrences of a subexpression. In each processing stage, the KC-Join algorithm applies EE-Join to the result from the previous stage repeatedly until no more results can be produced.
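The idea behind the EE-Join can be sketched in Python as follows. This is a simplified merge over two sorted lists under stated assumptions, not the exact algorithm of [49]: the record type Elem, the function name ee_join, and the sample data are hypothetical, and both input lists are assumed to be sorted by (did, order), as the element index delivers them.

```python
from typing import List, NamedTuple, Tuple


class Elem(NamedTuple):
    did: int    # document identifier
    order: int  # extended preorder number
    size: int   # range reserved for descendants


def ee_join(ancestors: List[Elem], descendants: List[Elem]) -> List[Tuple[Elem, Elem]]:
    """Pair elements of the same document that are in ancestor-descendant relationship.

    Both lists must be sorted by (did, order).
    """
    result = []
    start = 0
    for a in ancestors:
        # Candidates at or before a in (did, order) cannot be descendants of a,
        # nor of any later ancestor candidate, so they are skipped for good.
        while (start < len(descendants)
               and (descendants[start].did, descendants[start].order) <= (a.did, a.order)):
            start += 1
        i = start
        while i < len(descendants):
            d = descendants[i]
            # Stop once the scan leaves a's document or a's interval (Lemma 2.1).
            if d.did != a.did or d.order > a.order + a.size:
                break
            result.append((a, d))
            i += 1
    return result


# Hypothetical intermediate results for chapter and figure elements.
chapters = [Elem(1, 10, 30), Elem(2, 5, 10)]
figures = [Elem(1, 15, 5), Elem(1, 50, 2), Elem(2, 7, 1)]

pairs = ee_join(chapters, figures)
print(pairs)
```

In this sketch the inner scan restarts from the same position for nested ancestor candidates, so a descendant can be reported once for each of its matching ancestors.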

Assessment
Regular path expressions require many traversal steps, which leads to poor performance in trivial storage schemas due to the necessity of multiple parent-child tests. XISS proposes a numbering of elements which makes it possible to determine the ancestor-descendant relationship between two nodes instantly. The XISS approach is reported to be efficient on both real and synthetic data sets for queries matching the patterns mentioned above. However, it is not clear whether it is efficient for other types of queries, which include traversal along XPath axes other than ancestor-descendant, or more complex regular expressions.

2.2.3 XML Indexing and Retrieval with a Hybrid Storage Model

In this subsection we outline a method introduced in [71]. This method differentiates between indexing text in elements and indexing attributes. Each of the two parts uses a different indexing model. In principle, the first one is realized as a full-text index, while the second one is realized using a conventional relational DBMS. The resulting hybrid system architecture, called XRS-II, allows the user to issue complex queries that combine similarity queries used in classical full-text search engines with attribute matching used in relational databases.

Text Indexing Indexing text in XML elements is based on the approach called Bottom Up Scheme (BUS), see [72]. All text elements (elements containing only text) are indexed similarly to documents in a conventional full-text system, where indexing is based on the creation of an inverted index of terms. The BUS modification keeps, for each term, its frequency in the element's content and, instead of the document ID (DID), a unique element UID. The element UID is constructed from the document ID and the path in the document tree from the root element to the element containing the term. For more details about constructing the UID, see [72]. For example, an entry (6, ⟨5, 3, 6, 9⟩), (3, ⟨12, 1, 4⟩) in the inverted index means that the term "retrieval" occurs six times in the XML document with DID = 5, in the element determined by the path 3 → 6 → 9, and three times in the XML document with DID = 12, in the element determined by the path 1 → 4. The QEP (Query Evaluation Procedure) on elements is realized through the inverted index. Moreover, the UID construction makes it possible to determine the parent-child relationship. This structural information allows the engine to process the query in a more sophisticated way and to aggregate the term frequencies in a bottom-up fashion. The advantage is that some elements located on higher levels of the XML hierarchy can be returned as a query result even though the lower-level elements are irrelevant to the query. This behaviour can be viewed as desirable because it can reflect the structural semantics of the XML document.
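The bottom-up aggregation described above can be illustrated with a small sketch. It is not the BUS implementation of [72]: here the UID is simply modelled as a Python tuple (DID followed by the path steps), and the names `index`, `is_parent` and `aggregate_bottom_up` are hypothetical helpers introduced only for this illustration.

```python
from collections import defaultdict

# Assumed posting format: term -> list of (frequency, uid), where
# uid = (DID, step_1, ..., step_k) mirrors the entry quoted in the text.
index = {
    "retrieval": [(6, (5, 3, 6, 9)), (3, (12, 1, 4))],
}

def is_parent(parent_uid, child_uid):
    """True if parent_uid identifies the parent element of child_uid."""
    return len(child_uid) > 2 and child_uid[:-1] == parent_uid

def aggregate_bottom_up(postings):
    """Add each text element's term frequency to the element and all its ancestors."""
    totals = defaultdict(int)
    for freq, uid in postings:
        # Every prefix of length >= 2 (DID plus at least one step) is an ancestor-or-self.
        for end in range(2, len(uid) + 1):
            totals[uid[:end]] += freq
    return dict(totals)

totals = aggregate_bottom_up(index["retrieval"])
```

With this representation, `totals[(5, 3)]` holds the frequency accumulated at the topmost element of document 5, so a higher-level element can be ranked even though the term occurs only in one of its descendants.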

Attribute Indexing The attributes are put into database tables in such a way that all the attribute lists pertaining to the same element type are saved in the same table. In other words, a table is allocated to each element type that has an attribute list. Every instance of an element type (with attributes) is stored as a single record in the appropriate table. The table consists of one column for each attribute and several columns for various identifiers. The QEP for attributes is quite similar to that for full text. Attribute queries are first transformed into SQL queries and run against the corresponding attribute tables. An advantage of using a relational database for processing attributes is that it can handle a set of comparison operators and join operators efficiently. If a user query mixes text and attributes, the steps described in both paragraphs are performed together and the results are merged. In merging, all the elements retrieved from the database are part of the search result, and those from the full-text index carry the similarity values to the elements extracted from the database.

2.3 Multi-dimensional Approaches

Multi-dimensional approaches to indexing XML data are based on the idea that a path or an element may be represented as a point in a vector space. Multi-dimensional data structures like the R-tree [39] and the UB-tree [6] are employed to index such points.

2.3.1 Indexing XML Data as a Multi-dimensional Problem

An approach introduced in [4, 5] combines the relational approach with the multi-dimensional one. Three building blocks of XML documents are used:

• Paths - the notion of a path is a basic concept of XML and of query languages for XML.
• Values - a value is defined as the content of an element or the value of an attribute.
• Documents - document identifiers group the paths of one document.

A set XMLdocs of XML documents stored in the database, each with a document identifier docid, is defined as follows.

Definition 2.1 (XML document as 3-dimensional points). Let P be a set of paths,S C) {
    choose β ∈ [α : γ], so that 1/2C − ε ≤ count(α : β) ≤ 1/2C + ε
    split page (α : γ) into page (α : β) and page (β + 1 : γ)
    update ancestor inner nodes
}

Listing 6.2. Point query algorithm to find tuple T
Input: T : tuple, only index attributes are specified
Output: T : tuple, all attributes are specified
ξ = Z(T )
find [α : β] in the UB-tree, such that α ≤ ξ ≤ β
retrieve page (α : β) into main memory
search content of page (α : β) to find T
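The point query relies on the Z-address Z(T) and on locating the Z-region [α : β] that contains it. The following is a minimal sketch, assuming that the Z-address is obtained by interleaving the coordinate bits (most significant bit first, dimensions in tuple order) and that the region bounds are kept in a sorted in-memory list. The names `z_address`, `find_region` and `region_upper_bounds` are hypothetical and stand in for the page directory of a real (B)UB-tree.

```python
import bisect

def z_address(point, bits):
    """Interleave the bits of all coordinates into a single Z-address."""
    z = 0
    for bit in range(bits - 1, -1, -1):          # most significant bit first
        for coord in point:
            z = (z << 1) | ((coord >> bit) & 1)
    return z

# Assumed directory: sorted upper bounds β of consecutive, adjacent Z-regions.
region_upper_bounds = [15, 31, 47, 63]

def find_region(xi):
    """Return (alpha, beta) of the Z-region with alpha <= xi <= beta."""
    i = bisect.bisect_left(region_upper_bounds, xi)
    alpha = 0 if i == 0 else region_upper_bounds[i - 1] + 1
    return alpha, region_upper_bounds[i]

xi = z_address((3, 5), bits=3)     # 011 and 101 interleave to 011011
region = find_region(xi)
```

In a disk-based tree the binary search would descend the inner nodes instead of a flat list; the comparison α ≤ ξ ≤ β is the same.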

Listing 6.3. Deletion algorithm for a tuple T
Input: T : tuple to delete from the UB-tree
Output: none
// for easy illustration the algorithm does not handle the special
// case of the tuple being stored on the root page or last leaf
ξ = Z(T )
search [α : β] in the UB-tree, such that α ≤ ξ ≤ β
retrieve page (α : β)
delete T from page (α : β)
if (count (α : β) < 1/2C − ε) {
    merge page (α : β) with the neighboring page (β + 1 : γ) into page (α : γ)
    if (count (α : γ) > C) {
        choose δ ∈ [α : γ] with count (α : δ) ≥ 1/2C − ε and count (δ + 1 : γ) ≥ 1/2C − ε
        split page (α : γ) into page (α : δ) and page (δ + 1 : γ)
    }
    update ancestor inner nodes
}

The most important and most difficult algorithm of the (B)UB-tree is the range query algorithm. An algorithm that is exponential in the dimension is presented in [6]. An algorithm that is linear in the Z-address bit-length is presented in [53, 65]; it is depicted in Listing 6.4.

Listing 6.4. Original (B)UB-tree range query algorithm
Input: tuples QL,QH which define the query box QB
Output: a set of tree tuples in the query box stored in an array R
ξstart = Z(QL)
ξ = ξstart
ξend = Z(QH)
while (1) {
    search [α : β] in the UB-tree, such that α ≤ ξ ≤ β
    retrieve page (α : β)
    find tuples matched by the query box
    if (ξ >= ξend ) {
        break
    }
    ξ = getNextZAddress(ξ,QL,QH)
}


A crucial operation of the algorithm is getNextZAddress(). The operation finds the next Z-address of the query box and the following Z-region intersecting the query box. In [53, 65] the description of the operation is very vague. For that reason, we have developed our own range query algorithm [74, 73] based on a linear algorithm for the intersection of a Z-region and a query box (see Listing 6.5). Because it builds on the intersection algorithm, the range query algorithm is very similar to the R-tree range query algorithm, which employs the intersection of an MBB and a query box. The range query is processed by iterating through the tree and filtering out irrelevant parts of the tree, i.e. (super)Z-regions in the case of the (B)UB-tree, which do not intersect the query box.

Listing 6.5. Novel (B)UB-tree range query algorithm based on the intersection operation
Input: tuples QL,QH which define the query box QB
Output: a set of tree tuples in the query box stored in an array R
Variables: a node N , a stack Z which contains a current path in the tree
Z.Remove()
R.Remove()
N = the root node
Z.Push(N )
while (!Z.Empty()) {
    if (N is not leaf) {
        if ( there is the next Z-region, [α : β], in N which intersects QB) {
            Z.Push(N )
            read a child of the region item into N
        } else {
            N = Z.Pop()
        }
    } else {
        add points of N which lie in the query box (if any) into R
        N = Z.Pop()
    }
}

Chapter 7 R-tree and its Variants

7.1 Introduction

Since 1984, when Guttman proposed his method [39], R-trees have become the most cited and most widely used reference data structure in this area. As required and expected by applications, they support the usual point and range queries as well as some forms of spatial joins. Another interesting query supported by R-trees, to some extent, is the k-NN query. The R-tree can be thought of as an extension of the B-tree to a multi-dimensional space. It corresponds to a hierarchy of nested n-dimensional minimum bounding boxes (MBBs), each of which may be defined by two n-dimensional points. If N is an interior node, it contains couples of the form (Ri, Pi), where Pi is a pointer to a child of the node N. If R is the MBB of N, then the boxes Ri corresponding to the children Ni of N are contained in R. Boxes at the same tree level may overlap. If N is a leaf node, it contains couples of the form (Ri, Oi), so-called index records, where Ri contains a spatial object Oi. In Figure 7.1 a general structure of the R-tree for indexing point data is depicted. Each node of the R-tree corresponds to a disk page and, unless it is the root, contains between K and C entries. Other properties of the R-tree include the following:

• Whenever the number of a node’s children drops below K, the node is deleted and its descendants are distributed among the sibling nodes. The upper bound C depends on the size of the disk page.

• The root node has at least two entries, unless it is a leaf.

• The R-tree is height-balanced; that is, all leaves are at the same level. The height of an R-tree is at most $\lfloor \log_K(m) \rfloor - 1$ for m index records (m > 1).
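The nested MBB hierarchy described above drives the range query: a subtree is visited only if its MBB intersects the query box. The following sketch illustrates this on an in-memory tree. The `Node` class and the functions `intersects`, `inside` and `range_query` are hypothetical names chosen for this illustration; disk pages and the capacity bounds K and C are not modelled.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    lo: tuple                                      # lower corner of the node's MBB
    hi: tuple                                      # upper corner of the node's MBB
    children: list = field(default_factory=list)   # child nodes (inner node)
    points: list = field(default_factory=list)     # indexed tuples (leaf node)

def intersects(lo, hi, ql, qh):
    """True if box [lo, hi] and query box [ql, qh] overlap in every dimension."""
    return all(l <= h2 and l2 <= h for l, h, l2, h2 in zip(lo, hi, ql, qh))

def inside(p, ql, qh):
    """True if point p lies in the query box [ql, qh]."""
    return all(l <= x <= h for x, l, h in zip(p, ql, qh))

def range_query(root, ql, qh):
    result, stack = [], [root]
    while stack:
        node = stack.pop()
        if not intersects(node.lo, node.hi, ql, qh):
            continue                               # the whole subtree is filtered out
        result.extend(p for p in node.points if inside(p, ql, qh))
        stack.extend(node.children)
    return result

leaf_a = Node((4, 1), (6, 5), points=[(4, 1), (4, 5), (6, 4)])
leaf_b = Node((0, 6), (2, 7), points=[(0, 6), (2, 7)])
root = Node((0, 1), (6, 7), children=[leaf_a, leaf_b])
found = range_query(root, (3, 0), (5, 5))
```

Here `leaf_b` is never searched because its MBB does not intersect the query box, while `leaf_a` is searched and contributes two of its three points.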

Fig. 7.1. Structure of the R-tree (figure labels: super-region; region (MBB); index (hierarchy of MBBs); indexed tuples; tuples in the region)

7.2 Split Procedure

Since the R-tree is a dynamic data structure, most attention in previous work has been devoted to the split procedure performed when new index records are added to an R-tree. It significantly affects the index performance. Three split techniques (Linear, Quadratic, and Exponential) proposed in [39] are based on heuristic optimization. The Quadratic algorithm has turned out to be the most effective, and other improved versions of R-trees are based on this method. The algorithm uses the following strategy: given a set of C + 1 entries, each entry is assigned to one of the two produced nodes according to the criterion of minimum area, i.e., the selected node is the one that will be enlarged the least in order to include the new entry. Unfortunately, this criterion is taken for granted and has not been proved to be the best possible. The Quadratic algorithm tends to prefer the group with the largest size and higher population. In most cases this group will be enlarged the least. Hence, there is a high chance that it will need less additional area to accommodate the next entry, so it will be enlarged again. Over time, this creates a very uneven distribution, with most entries in one node. Also, when one of the groups becomes full, the rest of C − K + 1 entries are assigned to the second group without any geometric criteria. A minimum node capacity constraint also exists; thus a number of entries are assigned to the least populated node without any control at the end of the split procedure. This usually causes a significant overlap between the two nodes.
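The minimum-area criterion described above can be sketched as follows. This is only an illustration of the "least enlargement" test, not the complete Quadratic split of [39]: seed selection, tie-breaking and the capacity constraints are omitted, and the names `area`, `enlarge`, `enlargement` and `choose_group` are hypothetical. Boxes are modelled as pairs (lo, hi) of coordinate tuples.

```python
from math import prod

def area(lo, hi):
    """Volume of the box [lo, hi]."""
    return prod(h - l for l, h in zip(lo, hi))

def enlarge(lo, hi, e_lo, e_hi):
    """Smallest box covering both [lo, hi] and the entry [e_lo, e_hi]."""
    return tuple(map(min, lo, e_lo)), tuple(map(max, hi, e_hi))

def enlargement(group, entry):
    """Area that the group's MBB gains when the entry is added to it."""
    new_lo, new_hi = enlarge(*group, *entry)
    return area(new_lo, new_hi) - area(*group)

def choose_group(group_a, group_b, entry):
    """Return 0 or 1: the group whose MBB is enlarged the least by the entry."""
    return 0 if enlargement(group_a, entry) <= enlargement(group_b, entry) else 1

group_a = ((0, 0), (4, 4))        # large group, area 16
group_b = ((10, 10), (11, 11))    # small group, area 1
entry = ((5, 5), (6, 6))
chosen = choose_group(group_a, group_b, entry)
```

In this example the entry lies roughly between the two groups, yet the larger group is chosen because its enlargement is smaller, which illustrates the bias towards the bigger group mentioned in the text.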

7.3 R-tree Variants

R-tree performance is usually measured with respect to the retrieval cost (in terms of disk accesses) of queries. The majority of performance studies concern point, range, and k-NN queries. When considering R-tree performance, the concepts of node coverage and overlap between nodes are important. Obviously, an efficient R-tree search requires that both the overlap and the coverage are minimized. Minimal coverage reduces the amount of dead area covered by R-tree nodes. Minimal overlap is even more critical than minimal coverage: when searching for objects falling in the area of k overlapping nodes, up to k paths to the leaf nodes may have to be followed. Variants of R-trees differ in the way they perform the split algorithm during insertions, i.e. in which minimization criteria are used. The literature has identified a variety of criteria for the layout of keys on nodes that affect retrieval performance. These criteria are: minimal node area, minimal overlap between nodes, minimal node margins, and maximized node utilization. It is impossible to optimize all of these parameters simultaneously. We will briefly present two well-known approaches to R-tree optimization, R∗-trees and R+-trees. The authors of [50] present another six variants in their recent exhaustive overview.

The main feature of R∗-trees [7] is the node-splitting policy. Therefore, the R∗-tree differs from the R-tree mainly in the insertion algorithm. Whereas the original R-tree algorithms tried only to minimize the area covered by MBBs, the R∗-tree algorithms also take the following objectives into account:

• The overlap between MBBs at the same (non-leaf) tree level should be minimized. The smaller the overlap, the smaller the probability that one has to follow multiple search paths.
• Perimeters (margins) of MBBs should be minimized. For example, in 2D the preferred rectangle is the square, since this is the most compact rectangular representation.
• Storage utilization should be maximized. Nodes should store as many entries as possible so that the height of the tree is kept low.

According to the R∗-tree split algorithm, the split axis is the one that minimizes a cost value S (S being equal to the sum of all margin values of the different distributions). Then the distribution which achieves the minimum overlap value is selected to be the final one along the chosen split axis. On the other hand, the distinction between the "minimum margin" criterion for selecting a split axis and the "minimum overlap" criterion for selecting a distribution along the split axis could cause the loss of a "good" distribution if, for example, that distribution belongs to the rejected axis. The design of the R∗-tree also introduces a policy called forced reinsert: if a node overflows, it is not split right away. Instead, p entries, p > 0, are removed from the node and reinserted into the tree. The authors of [7] suggest that p should be about 30% of the maximal number of entries per page. Through all the above-mentioned techniques they reached performance improvements of up to 50% compared to the basic R-tree.

Clipping-based schemes do not allow any overlap between bucket regions; the regions have to be mutually disjoint. A typical access method of this kind is the R+-tree [70], a variant of the R-tree which allows no overlap between regions corresponding to nodes at the same tree level, and in which an object can be stored in more than one leaf node. R+-trees are considered to be one of the most efficient indexes for supporting point and range queries. Other approaches to improving the original R-trees relax some of their basic features. For example, the MBBs have been replaced by minimum bounding spheres or polygons. In [8] R+-trees are extended to support k-NN queries. Special attention should be devoted to the use of signatures in connection with R-trees. The approach of [62] offers an RS-tree that consists of an R-tree and an S-tree [19], i.e. a well-known hierarchical signature file. The main application of this data structure is an improvement of the incremental k-NN query algorithm. As far as structures like the R-tree are concerned, processing a narrow range query is inefficient. In Chapter 9 the Signature R-tree is described. This data structure makes it possible to process a narrow range query more efficiently.
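The two-step R∗-tree split described earlier in this section (choose the split axis by the minimum sum of margins, then choose the distribution with the minimum overlap along that axis) can be sketched as follows. This is a simplified illustration, not the algorithm of [7] in full: entries are sorted only once per axis (by lower, then upper bound), the margin is taken as the half-perimeter, ties in overlap are broken by the total area, and the names `mbb`, `margin`, `overlap`, `area` and `rstar_split` are hypothetical.

```python
from math import prod

def mbb(boxes):
    """Minimum bounding box of a list of (lo, hi) boxes."""
    los, his = zip(*boxes)
    return (tuple(min(c) for c in zip(*los)),
            tuple(max(c) for c in zip(*his)))

def area(box):
    lo, hi = box
    return prod(h - l for l, h in zip(lo, hi))

def margin(box):
    lo, hi = box
    return sum(h - l for l, h in zip(lo, hi))      # half-perimeter

def overlap(a, b):
    (alo, ahi), (blo, bhi) = a, b
    return prod(max(0, min(h1, h2) - max(l1, l2))
                for l1, h1, l2, h2 in zip(alo, ahi, blo, bhi))

def rstar_split(entries, min_fill):
    dims = len(entries[0][0])
    best = None                                    # (margin sum, distributions)
    for axis in range(dims):
        ordered = sorted(entries, key=lambda e: (e[0][axis], e[1][axis]))
        dists = [(ordered[:k], ordered[k:])
                 for k in range(min_fill, len(ordered) - min_fill + 1)]
        s = sum(margin(mbb(g1)) + margin(mbb(g2)) for g1, g2 in dists)
        if best is None or s < best[0]:
            best = (s, dists)
    # Along the chosen axis: minimum overlap, ties resolved by minimum total area.
    return min(best[1], key=lambda d: (overlap(mbb(d[0]), mbb(d[1])),
                                       area(mbb(d[0])) + area(mbb(d[1]))))

entries = [((0, 0), (2, 2)), ((1, 0), (3, 2)), ((6, 0), (8, 2)), ((7, 0), (9, 2))]
g1, g2 = rstar_split(entries, min_fill=1)
```

For the four boxes above, the only distribution with zero overlap separates the two left boxes from the two right ones, so that distribution is returned.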

Chapter 8 Multi-dimensional Forests

8.1 Introduction

As in the case of term tuples or points representing paths, some tuple sets contain tuples of different dimensions. Such variously dimensional tuples can be indexed in a single n-dimensional vector space, where n is the maximal dimension over all the tuples in a given set. Shorter (lower-dimensional) tuples can be aligned to n-dimensional tuples, where the extra dimensions are set to a blank value. However, this simple approach has a major drawback. Since all the tuples are modelled in a high-dimensional space, a large amount of redundant information (the blank values) is stored within the extra dimensions of a possibly great number of aligned tuples.

Example 8.1 (Path dataset). Let us take paths extracted from the Protein Sequence Database XML document [78]. Approximately 17 million paths are obtained from this document. In Figure 8.1 we see the path length frequency histogram and the dimension frequency of points representing paths, respectively. We see that the creation of a 9-dimensional R∗-tree will lead to unnecessary storage overhead.

Example 8.2 (Term dataset). Since we deal with term indexing, let us take a term dataset as an example. The term dataset was extracted from TREC's collections of text documents [59], in particular from the LATIMES and FBIS collections. These collections contain 816,716 unique terms. Figure 8.2(a) shows a term length frequency histogram of the whole term dataset. Figure 8.2(b) is another interpretation of Figure 8.2(a) and shows how the number of terms grows with growing maximal allowed term length (i.e. with growing dimension of the term space). We can observe that for the majority of the terms the term length is smaller than 15. Thus, the creation of a 40-dimensional BUB-tree will lead to unnecessary storage overhead.

Fig. 8.1. (a) Path length frequency histogram of path dataset extracted from the Protein Sequence Database XML document (b) Dimension frequency of points representing paths in the document

8.2 BUB-forest

In this section we introduce a new multi-dimensional data structure, the BUB-forest, which was designed to avoid unnecessary storage and performance overhead when indexing variously dimensional tuple sets.

Definition 8.1 (The BUB-forest). BUB-forest $BF_k(n_1, n_2, \ldots, n_k)$ is a forest data structure consisting of k BUB-trees $BT_i(n_i)$, $1 \le i \le k$. Let n be the dimension of the original high-dimensional vector space $\Omega = D^n$. Every BUB-tree $BT_i(n_i)$ indexes an $n_i$-dimensional space, where $1 \le i \le k$, $1 \le j < k \Rightarrow n_{j+1} > n_j$, and $n_k = n$. Let us have m tuples $T_i$ of various dimensions, where $d_i$ is the dimension of the tuple $T_i$, $1 \le i \le m$. Then the tuple $T_i$ is indexed by the BUB-tree $BT_j(n_j)$ for which $d_i \le n_j \wedge (j > 1 \Rightarrow d_i > n_{j-1})$, for $1 \le i \le m$, $1 \le j \le k$. If $d_i < n_j$ then the dimension of $T_i$ is increased to $n_j$ and the values $T_{i,l}$ = blank value $\in D$, $d_i < l \le n_j$.

In other words, the BUB-forest indexes each tuple $T_i$ using the BUB-tree $BT_j(n_j)$ whose dimension is the lowest one greater than or equal to the dimension of $T_i$. The blank value is often zero. An example of a BUB-forest $BF_2$, i.e. one consisting of two BUB-trees, is presented in Figure 8.3. In this example the BUB-trees are of the same height, but that is not a rule.
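The assignment rule of Definition 8.1 can be illustrated with a short sketch: a tuple goes to the tree with the smallest dimension $n_j \ge d_i$ and is padded with blank values. The function name `assign`, the constant `BLANK` and the 0-based tree index are assumptions made for this illustration only.

```python
import bisect

BLANK = 0   # blank value; the text notes that it is often zero

def assign(tuple_, tree_dims):
    """Return (j, padded): 0-based index of the tree and the tuple aligned to n_j.

    tree_dims is the strictly increasing list (n_1, ..., n_k).
    """
    j = bisect.bisect_left(tree_dims, len(tuple_))   # smallest n_j >= d_i
    if j == len(tree_dims):
        raise ValueError("tuple dimension exceeds n_k")
    padded = tuple(tuple_) + (BLANK,) * (tree_dims[j] - len(tuple_))
    return j, padded

dims = [3, 5]                       # a forest BF_2(3, 5)
first = assign((1, 2, 3), dims)     # fits the 3-dimensional tree exactly
second = assign((1, 2, 3, 4), dims) # padded to 5 dimensions
```

A 4-dimensional tuple thus costs one blank coordinate in the 5-dimensional tree, instead of being padded to the maximal dimension of a single large space.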

Fig. 8.2. (a) Term length frequency histogram of term dataset extracted from the TREC's document collections LATIMES and FBIS (b) Number of terms with growing maximal term length

Fig. 8.3. Example of BUB-forest BF2

8.2.1 Operations on BUB-forest

Basic operations, i.e. insertion, deletion and point query, are performed by the BUB-tree $BT_j$ to which the argument tuple is assigned.

Definition 8.2 (BUB-forest range query). Let a query box QB be defined by two $n_{qb}$-dimensional boundary points QL and QH, $n_{qb} \le n$. The range query in the BUB-forest $BF_k(n_1, n_2, \ldots, n_j, \ldots, n_k)$ is performed as a sequence of range queries $QB^j$, defined by $n_j$-dimensional boundary points $QL^j$ and $QH^j$, on the BUB-trees $BT_j(n_j)$ for which $n_{qb} \le n_j$, $1 \le j \le k$. Let $1 \le l \le n_j$. If $l \le n_{qb}$ then $QL^j_l = QL_l$ and $QH^j_l = QH_l$, else $QL^j_l = QH^j_l$ = blank value. The result of the BUB-forest range query is the union of the particular BUB-tree query results.

Notes: Obviously, the number of single range queries may be lower than k. Forests could be defined in a similar way for other existing data structures, e.g. the B-tree or the R-tree. In the case of the B-tree there is only one dimension, but the forest could serve as a compression tool, since keys of variable lengths are stored in multiple B-trees. The forest data structure has already been applied to signature S-trees [19]. Since the BUB-forest is a persistent data structure, we can further consider two variants of the disk cache. First, the BUB-trees of a BUB-forest share a single disk management and thus a single disk cache. Second, each BUB-tree has its own disk cache. The latter possibility is obviously more efficient. Each BUB-tree of a BUB-forest can index a different number of tuples, thus the tree heights can differ.
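The range query of Definition 8.2 can be sketched as follows. Each tree is modelled only as a pair (dimension, list of padded tuples), and a linear scan stands in for the actual BUB-tree range query; the name `forest_range_query` and this representation are assumptions made for the illustration.

```python
BLANK = 0   # blank value used for padding, assumed to be zero

def forest_range_query(trees, ql, qh):
    """trees: list of (n_j, tuples) pairs, tuples already padded to n_j."""
    n_qb = len(ql)
    result = []
    for n_j, tuples in trees:
        if n_j < n_qb:
            continue                       # this tree is skipped: n_qb > n_j
        pad = (BLANK,) * (n_j - n_qb)      # extra coordinates are fixed to blank
        ql_j, qh_j = tuple(ql) + pad, tuple(qh) + pad
        # Stand-in for the range query on BT_j: a linear scan over its tuples.
        result += [t for t in tuples
                   if all(l <= x <= h for x, l, h in zip(t, ql_j, qh_j))]
    return result

trees = [
    (2, [(1, 1), (3, 0)]),
    (4, [(1, 2, 0, 0), (1, 2, 5, 0), (2, 2, 7, 7)]),
]
hits = forest_range_query(trees, (1, 1), (2, 3))
```

The result is the union of the per-tree results; a query box of dimension greater than 2 would skip the first tree entirely, which is why the number of single range queries may be lower than k.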

8.3 Storage Volume Reduction

Let us now compare the storage volume required when indexing variously dimensional tuple sets using a single BUB-tree and using a BUB-forest. Let the tuple set consist of m tuples, where $m_i$ is the number of $i$-dimensional tuples, $\sum_{i=1}^{n} m_i = m$, $1 \le i \le n$. The following calculations are only auxiliary, since the disk management of the BUB-tree as well as of the BUB-forest produces some additional storage overhead. Suppose that a single coordinate of a tuple requires b bytes of storage. When using a single space for indexing, the number of bytes $V_{bt}$ required to store the whole tuple set is $V_{bt} = m \times n \times b$. When using k spaces of dimensions $n_1, n_2, \ldots, n_k$ for indexing, the number of bytes $V_{bf}$ required for the tuple set storage is

$$V_{bf} = \sum_{i=1}^{k} \sum_{j=n_{i-1}}^{n_i} (m_j \times n_i \times b).$$

Thus the number of saved bytes (when using k spaces) is

$$V_{bt} - V_{bf} = \sum_{i=1}^{k} \Big( \sum_{j=n_{i-1}}^{n_i} m_j \times n \times b - \sum_{j=n_{i-1}}^{n_i} m_j \times n_i \times b \Big),$$

where $n_0 = 1$. If we consider the minimal storage volume V for the tuple set, $V = \sum_{i=1}^{n} m_i \times i \times b$ bytes, then obviously $V \le V_{bf} \le V_{bt}$. The equality $V = V_{bf}$ can be achieved if we increase the number of BUB-trees in the BUB-forest to k = n. At the same time, we must bear in mind that a greater number of BUB-trees in the BUB-forest leads to a greater number of range query executions. Thus, the number of BUB-trees should be chosen heuristically, following the statistical distribution of tuples. In general, the lower-dimensional BUB-trees should index the major part of the tuple set.

Example 8.3 (Storage volume reduction for term index). Let us take the term dataset used in Figure 8.2. Let the maximal term length be n = 40. When using a 40-dimensional space, the storage volume will be $V_{bt}$ = 816,716 × 40 × 1 = 31.2MB. If we use 3 spaces of dimensions 9, 17 and 40, we will get the storage volume $V_{bf}$ = 509,258 × 9 × 1 + 280,050 × 17 × 1 + 27,408 × 40 × 1 = 10MB. The smallest possible storage volume is V = 7.1MB. This simple example shows that the BUB-forest saves (theoretically) 68% of the single BUB-tree's storage volume. However, its storage volume is still about 41% higher than in the ideal case.
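The arithmetic of Example 8.3 can be checked with a few lines of code. The tuple counts per tree are taken from the example; the assumption that the MB figures are binary megabytes (1024 × 1024 bytes) is ours, and the variable names are illustrative.

```python
MB = 1024 * 1024     # assumption: MB in the example means 2**20 bytes

b = 1                                             # bytes per coordinate
counts = {9: 509_258, 17: 280_050, 40: 27_408}    # tree dimension -> number of tuples
m, n = sum(counts.values()), max(counts)

v_bt = m * n * b                                  # single 40-dimensional space
v_bf = sum(m_j * n_j * b for n_j, m_j in counts.items())
saving = 1 - v_bf / v_bt

print(m)                       # 816716
print(round(v_bt / MB, 1))     # 31.2
print(round(v_bf / MB, 1))     # 10.0
print(round(100 * saving))     # 68
```

The minimal volume V cannot be recomputed here, because it requires the full term length histogram of Figure 8.2(a), which is not given numerically in the text.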

8.4 Cost Analysis

A simple range query is processed by retrieving those regions (tree leaf nodes, respectively) that intersect a given query box. Let $N_I$ be the number of such regions and m be the number of indexed tuples. Then the complexity of the range query is $O(\log_c(m) \times N_I)$, where c is a fixed node capacity (tree arity, respectively). In the case of the BUB-forest $BF_k$, the complexity of a range query is $O(\sum_{i=1}^{k} \log_c(m_i) \times N_{I_i})$, where $\sum_{i=1}^{k} m_i = m$ and $N_{I_i}$ is the number of regions intersecting the i-th query box (i.e. the query box constructed for the BUB-tree $BT_i$). The use of the BUB-forest significantly reduces the storage volume, and it can be calculated that the additional BUB-forest overhead is relatively low (thanks to node utilization of over 50%). The number of BUB-trees in the BUB-forest was chosen in order to maximize the range query efficiency. Multi-dimensional forests do not solve the problem of the curse of dimensionality itself. However, the disk access cost is decreased during the processing of range queries when the forest is employed for indexing points of different dimensions. Chapter 12 presents experimental results which support the above-mentioned auxiliary outcomes.

Chapter 9 Signature Multi-dimensional Trees

9.1 Introduction

As far as structures like the n-dimensional B-tree, the R-tree, and the BUB-tree are concerned, processing a narrow range query is inefficient. The efficiency decreases with rising dimension, i.e. the curse of dimensionality takes place. An efficient solution of this problem does not seem to exist. In [47] the Signature R-tree, which makes it possible to process a narrow range query more efficiently, is introduced. In [43, 44] the data structure is employed for indexing XML data. In this work the signature [51] is applied for more effective processing of a narrow range query over a point data index. This approach enriches the existing multi-dimensional data structure while preserving its native functionality. The novel data structure is more resistant to the curse of dimensionality. This feature is grounded in the better preservation of data distribution achieved by a signature method over a multi-dimensional index. Since the R-tree is a well-known data structure used in many recent DBMSs, we apply the signature extension to the R-tree. Of course, other multi-dimensional data structures (e.g. the (B)UB-tree) may be enriched by the signature as well. Section 9.2 describes the problem of processing a narrow range query in a multi-dimensional data structure. Since the signature is applied for more efficient processing of a narrow range query, Section 9.3 summarizes signature methods. Section 9.4 introduces the multi-dimensional signature for efficient processing of a narrow range query. In Section 9.5 the Signature R-tree is described. Emphasis is put on processing the narrow range query. Section 9.6 outlines a technique for generating the multi-dimensional signature. In Chapters 10 and 11 we compare the BUB-tree and the R∗-tree with their respective signature variants. The results confirm that our approach is significantly better than the BUB-tree and the R∗-tree, respectively.

9.2 Processing a Narrow Range Query in Multidimensional Data Structures

In general, multi-dimensional data structures divide an n-dimensional space into sub-spaces (regions). In the case of the R-tree and the (B)UB-tree, the tuples are clustered into MBBs and Z-regions, respectively. The index is made up of hierarchies of the regions (so-called super-regions). The tuples of a region are stored in one leaf node. The inner nodes contain definitions of super-regions, which in the case of the R-tree are again MBBs. A range query algorithm filters out the irrelevant tree nodes (regions); only leaf nodes intersected by the query box are searched.

Example 9.1 (Reason for the inefficient processing of the narrow range query in the R-tree). Let us take a 2-dimensional space which contains the points (4,1), (4,5), and (6,4). These points define the MBB (4,1):(6,5) (see Figure 9.1) and they are stored in a single leaf node. Now, a range query is defined by the query box (1,2):(5,2). The region is intersected by the query box and it will be searched. Consequently, this region is relevant to the query box from the R-tree point of view, but it contains no point from the query box.

Fig. 9.1. Points T1, T2, and T3 in MBB (4,1):(6,5) and the narrow range query (1,2):(5,2)

Definition 9.1 (Intersect and relevant regions, relevance ratio). Let RQ be the range query defined by the box QB. Regions intersecting the query box during the processing of a range query are called intersect regions, and regions containing at least one point of the query box are called relevant regions. We denote their numbers by $N_I$ and $N_R$, respectively. The relevance ratio is $c_R = \frac{N_R}{N_I}$.


Experiments show (see Chapters 10 and 11) that the ratio $c_R$ is close to zero ($c_R \ll 1$) for a narrow range query in the case of the R-tree and the (B)UB-tree. The efficiency of the narrow range query processing is not optimal. A lot of irrelevant regions must be searched, and therefore a lot of extra disk accesses must be performed.

Definition 9.2 (Quality ratio of a range query algorithm). Let us take a range query algorithm which searches $N_{RQ}$ regions for a query RQ. The quality ratio of the range query algorithm is $c_Q = \frac{N_R}{N_{RQ}}$. In the optimal case, $c_Q = 1$.

The value $c_Q$ decreases sharply with increasing dimension of the indexed space in current multi-dimensional data structures. Consequently, the processing of a narrow range query is ineffective. Note that the number of inner nodes $\ll$ the number of leaf nodes (the number of regions) in the case of a tree data structure. Hence such ratios capture the efficiency of a range query algorithm rather precisely. The probability that an irrelevant region is matched decreases with decreasing region volume. Reducing the region volume is therefore a way to improve the efficiency of processing the narrow range query. We need to insert a piece of information into the data structure in order to reduce the region volume and, consequently, to better filter out irrelevant tree nodes (regions). The result of increasing $c_Q$ is a decrease of the disk access cost and of the data structure overhead. In this case we apply the n-dimensional signature as such a piece of information. In this work we describe such an extension of the R-tree; the novel data structure is called the Signature R-tree.

9.3 Signature Methods

The signature file method has been widely advocated as an efficient access method for applications demanding a large volume of textual databases, such as libraries, office information, and medical information systems [13, 14]. Therefore, the signature file approach has become a well-known concept for implementing associative retrieval on data files kept in stable storage. Recently, the use of signature files was extended to support multimedia data, such as images, voice, and video [7]. Many recent DBMSs support multimedia data and require a dynamic storage structure which efficiently performs not only retrieval operations, but also insertion, deletion, and update operations. As a result, several dynamic signature files have been proposed, for example the S-tree [19] or the Quick filter [98]. The signature file is an abstraction which acts as a filtering mechanism to reduce the number of block accesses and the CPU time needed to execute a query. A signature is a bit string formed from the terms which are used to index a record in a data file. Signature files typically make use of the superimposed coding technique in order to create a record signature [27]. When we assume that a record consists of n terms, each term is converted into a bit string, called the term signature, using a hash function. The record signature is formed by superimposing (inclusive ORing) the n term signatures. The number of 1's in the signature S is called the weight γ(S). To answer a query, we first examine the signature file rather than the data file, in order to immediately discard non-qualifying records. For this, the set of terms in a query is hashed to form a query signature in the same way as the record signature. If the record signature contains 1's in the same positions as the query signature (i.e. the query signature is included in the record signature), the record can be considered a potential match. A bitwise AND operation is employed for the decision. However, there can be a case where the record signature qualifies for a query signature, but the record itself does not satisfy the query. This is called a false drop.
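The superimposed coding scheme described above can be sketched in a few lines. The signature length, the number of bits set per term and the use of SHA-256 as the hash function are assumptions chosen for the illustration; the text does not prescribe them, and the function names are hypothetical.

```python
import hashlib

SIG_BITS = 16        # assumed signature length
BITS_PER_TERM = 3    # assumed maximal number of bits set by one term

def term_signature(term):
    """Hash a term to a bit string with at most BITS_PER_TERM bits set."""
    digest = hashlib.sha256(term.encode("utf-8")).digest()
    sig = 0
    for byte in digest[:BITS_PER_TERM]:
        sig |= 1 << (byte % SIG_BITS)
    return sig

def signature(terms):
    """Record (or query) signature: inclusive OR of the term signatures."""
    sig = 0
    for t in terms:
        sig |= term_signature(t)
    return sig

def weight(sig):
    """Number of 1's in the signature."""
    return bin(sig).count("1")

def may_match(record_sig, query_sig):
    """True if the query signature is included in the record signature."""
    return (record_sig & query_sig) == query_sig

record = signature(["xml", "index", "tree"])
query = signature(["xml", "tree"])
candidate = may_match(record, query)
```

A `True` result only marks the record as a potential match: a query term that is absent from the record may still hash to bits already set by other terms, which is exactly the false drop, so candidates must be verified against the data file.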

9.4 Signature Multi-dimensional Trees

As far as the Signature R-tree is concerned the n-dimensional signature helps to filter irrelevant parts of an R-tree preferably during a narrow range query processing. The ndimensional signature can be applied to various multi-dimensional data structures (e.g. (B)UB-trees). Definition 9.3 (n-dimensional signature). Let Ω be an n-dimensional discrete space, Ω = Dn , |D| = 2lD . Let us take a set of m points (tuples) T 1 , T 2 , . . . , T m , where T i = (t1 , t2 , . . . , tn ), T i ∈ Ω, T i j = tj ∈ D, 1 ≤ i ≤ m, 1 ≤ j ≤ n. Let F be a mapping creating a signature: {0, 1}lD → {0, 1}lS . n-dimensional signature S n (T 1 , T 2 , . . . , T m ) = (S1 , . . . , Sn ) = (F (T 1 1 ) OR . . . OR F (T m 1 ), . . . , F (T 1 n ) OR . . . OR F (T m n )), where Si is a signature ∈ {0, 1}lS , lS is the length of the signature, n × lS is the length Pn of the n-dimensional signature. The weight of the n-dimensional signature n γ(S ) = k=1 γ(Sk ), where γ(Sk ) is the weight of the signature Sk . We can discover the absence of relevant points (the points belonging to a query box) using the AND operation for n-dimensional signature of a query and the n-dimensional signature of points in a region during the processing of a range query. Consequently, we apply the AND operation to coordinate values of points how it is usual in signature methods. Note that the meaning of the constants ψ and φ is introduced in Definition 3.3. Definition 9.4 (Range query processing with the n-dimensional signature). Let us take the range query defined by two points of an n-dimensional space QL = (ql1 , . . . , qln ) and QH = (qh1 , . . . , qhn ). Let us create the n-dimensional signature of the query box S n QB = (SQB1 , . . . , SQBn ): if qli = qhi , or qli 6= qhi and qhi −qli ≤ ψ, then SQBi = F (qli ) = F (qhi ) and SQBi = F (qli ) OR F (qli + 1) OR . . . OR F (qhi ), respectively. If qhi − qli ≥ φ then SQBi = 2lS −1 (the number with only true bits). Let us take the n-dimensional signature S n = (S1 , . . 
. , Sn ) of points T 1 , T 2 , . . . , T m . The points generating the n-dimensional signature can belong to the query box if all partial signatures Si and SQBi , 1 ≤ i ≤ n, are matched by the AND operation. A partial signatures Si and SQBi are matched if:


• for $ql_i = qh_i$ it holds $S_i \text{ AND } S_{QB_i} = S_{QB_i}$.
• for $ql_i \ne qh_i$ and $qh_i - ql_i \le \psi$ it holds $\gamma(S_i \text{ AND } S_{QB_i}) \ge 1$.

Consequently, the n-dimensional signatures $S^n$ and $S^n_{QB}$ are matched by the AND operation if all partial signatures $S_i$ and $S_{QB_i}$, $1 \le i \le n$, are matched. Of course, if $S_{QB_i}$ contains only true bits (the case $qh_i - ql_i \ge \phi$), the partial signatures are considered matched and the AND operation can be omitted. If $\gamma(S_{QB_i}) \to l_S$, the probability of a false drop is close to one (see Section 9.6). Consequently, this algorithm can be applied only for low values of $\psi$.

Example 9.2 (Usage of the n-dimensional signature for filtration of irrelevant tree pages). Let us show the creation and application of a simple n-dimensional signature for better filtration of irrelevant tree nodes in the R-tree. Let us take the points from Example 9.1. The first coordinate of the n-dimensional signature contains the superimposed first coordinates of the points: 4 (100) OR 4 (100) OR 6 (110). The second coordinate equals: 1 (001) OR 5 (101) OR 4 (100). In this way, the n-dimensional signature (110,101) is created. Since the second coordinates of both query box points contain the same value (the number 2), all relevant points contain the value 2 in the second coordinate. Consequently, the n-dimensional signature of the query hyper box is (111,010). The region (MBB (4,1):(6,5)) is recognized as irrelevant by the signature operation (111,010) AND (110,101): since (010) AND (101) ≠ (010), the region is irrelevant, although the query box intersects the region and the region would therefore be searched during narrow range query processing in the classical R-tree.
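The construction and matching rules of Definitions 9.3 and 9.4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: F is taken as the identity on 3-bit values (as the bit patterns of Example 9.2 suggest), every interval wider than ψ is simplified to the all-ones signature, the pairing of the coordinates into points and the query box (0,2):(7,2) are hypothetical, and all function names are invented for this sketch.

```python
from functools import reduce

L_S = 3                      # bits per partial signature, as in Example 9.2
ALL_ONES = (1 << L_S) - 1    # 2^{l_S} - 1, the signature with only true bits


def F(value):
    # Assumption: identity mapping on 3-bit values, so 4 -> 100.
    return value & ALL_ONES


def nd_signature(points):
    # Superimpose (OR) the coordinate signatures, dimension by dimension.
    dims = range(len(points[0]))
    return tuple(reduce(lambda acc, p: acc | F(p[j]), points, 0) for j in dims)


def query_signature(ql, qh, psi):
    sig = []
    for lo, hi in zip(ql, qh):
        if hi - lo <= psi:
            # Point or narrow interval: OR of F over all covered values.
            sig.append(reduce(lambda acc, v: acc | F(v), range(lo, hi + 1), 0))
        else:
            # Simplification of this sketch: wider intervals get all true bits.
            sig.append(ALL_ONES)
    return tuple(sig)


def matched(sig, q_sig, ql, qh):
    for s, q, lo, hi in zip(sig, q_sig, ql, qh):
        if q == ALL_ONES:
            continue                      # the AND operation can be omitted
        if lo == hi and (s & q) != q:
            return False                  # point coordinate: all bits required
        if lo != hi and (s & q) == 0:
            return False                  # narrow interval: one common bit required
    return True


region = [(4, 1), (4, 5), (6, 4)]         # hypothetical pairing of the coordinates
sig = nd_signature(region)
q_sig = query_signature((0, 2), (7, 2), psi=1)
print(format(sig[0], "03b"), format(sig[1], "03b"))   # 110 101
print(matched(sig, q_sig, (0, 2), (7, 2)))            # False: region is filtered out
```

The test never rejects a region that contains a relevant point; it can only fail to reject an irrelevant one (a false drop).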

9.5

Signature R-tree

The Signature R-tree is the R-tree data structure with an added n-dimensional signature for better filtration of irrelevant tree nodes. A general structure of the Signature R-tree is presented in Figure 9.2. Leaf nodes include indexed tuples, which are clustered into regions (MBBs). The MBBs of leaf nodes can be hierarchically clustered into MBBs again and, in this way, super-regions are created. A definition of the regions and super-regions (two points in n-dimensional space) is stored in inner tree nodes. An n-dimensional signature is assigned to each region. The node item with the definition of a super-region holds an n-dimensional signature as well, created by superimposing the signatures of the node's direct children. Consequently, such a tree contains two hierarchies, the hierarchy of MBBs and the hierarchy of n-dimensional signatures. Operations of the R-tree are preserved, and we apply the n-dimensional signature for better filtration of irrelevant nodes during the processing of a narrow range query. The signature refines the intersection test which decides whether a node is or is not relevant to a user's query box. Note that the number of inner nodes ≪ the number of leaf nodes in the case of a tree data structure. Since the n-dimensional signatures are inserted only into inner


Chapter 9. Signature Multi-dimensional Trees

nodes, enlargement of a data structure is not enormous (see Chapters 10 and 11). Now, operations of the Signature R-tree shall be described.

[Figure 9.2: the index is a hierarchy of MBBs and n-dimensional signatures. Each inner node item holds a region or super-region definition Bl:Bh together with the n-dimensional signature S of the tuples in that region or super-region; leaf nodes hold the indexed tuples T of a region.]

Fig. 9.2. Structure of the Signature R-Tree

9.5.1

Operations of the Signature R-tree

Operations Insert, Delete, and Find (point query) are handled by the algorithms of the selected R-tree variant. Consequently, an arbitrary splitting algorithm, of any complexity, can be chosen for the Insert operation, see [39, 70, 7]. Moreover, in the case of the Signature R-tree's Insert and Delete operations, a change of tuples in a leaf node must be reflected by changes of the n-dimensional signatures in all inner nodes of the current path. The Hamming distance [19] is applied for measuring the similarity of signatures. The propagation of the changes towards the root node is finished as soon as the Hamming distance between the old and the new n-dimensional signature equals 0 in some node.
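The propagation rule can be illustrated by the following Python sketch. It is not the implementation used in the thesis: the representation of a path as a list of (item signatures, slot) pairs and the names hamming, superimpose, and propagate are assumptions made for this example only.

```python
def hamming(a, b):
    # Hamming distance of two n-dimensional signatures (tuples of ints).
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def superimpose(signatures):
    # OR the signatures together, dimension by dimension.
    out = [0] * len(signatures[0])
    for sig in signatures:
        for j, part in enumerate(sig):
            out[j] |= part
    return tuple(out)


def propagate(path, new_sig):
    """Push a changed leaf signature towards the root.

    path lists, from the leaf's parent up to the root, pairs
    (item_signatures, slot): the signatures stored in the items of the
    node and the slot of the item that leads to the changed child.
    Returns the number of inner nodes that had to be rewritten.
    """
    rewritten = 0
    for item_signatures, slot in path:
        if hamming(item_signatures[slot], new_sig) == 0:
            break                          # nothing changes above this node
        item_signatures[slot] = new_sig
        rewritten += 1
        new_sig = superimpose(item_signatures)   # signature seen by the parent
    return rewritten


parent = [(0b100, 0b001), (0b010, 0b010)]
root = [(0b110, 0b011), (0b001, 0b100)]
# The leaf in slot 0 received a point that sets bit 010 in the second dimension.
count = propagate([(parent, 0), (root, 0)], (0b100, 0b011))
print(count)   # 1: the root item already covered the new bit
```

In the example the new bit is already set by a sibling, so the superimposed signature of the parent does not change and the root is left untouched.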

9.5.2

Range Query Operation for a Narrow Hyper Box

An advantage of the described approach is that the range query algorithm is not changed for a general range query. In the case of processing a narrow range query, we additionally apply the n-dimensional signature for a better filtration of irrelevant tree nodes. Let us suppose the well-known Intersection operation, which ascertains whether an MBB is intersected by a query box in linear time.

Listing 9.1. Range query operation for a narrow hyper box

Input: tuples T1, T2 which define the query box QB
Output: a set of tree tuples in the query box stored in an array R
Variables: a node N, a stack Z which contains the current path in the tree

Z.Clear()
R.Clear()
N = the root node
Z.Push(N)
while (!Z.Empty()) {
  if (N is not leaf) {
    if (there is the next minimal bounding box, MBB, in N with non-empty MBB ∩ QB) {
      determine whether MBB can be relevant by applying the AND operation on $S^n_{QB}$ and $S^n_{MBB}$
      if (it is matched) {
        Z.Push(N)
        read the child of the region item into N
      }
      // otherwise stay in N and continue with its next MBB
    } else {
      N = Z.Pop()    // all items of N were processed, return to the parent
    }
  } else {
    add the points of N lying in the query box into R
    N = Z.Pop()
  }
}
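For readers who prefer executable code, the traversal of Listing 9.1 can be sketched in Python as follows. The sketch is a simplification under stated assumptions: the tree is held in memory, the classes Item and Node and all function names are hypothetical, and the query signature is passed in precomputed (all-ones partial signatures are skipped, as described in Definition 9.4).

```python
from dataclasses import dataclass, field


@dataclass
class Item:                 # inner-node item: MBB, its signature, child node
    low: tuple
    high: tuple
    sig: tuple
    child: "Node"


@dataclass
class Node:
    items: list = field(default_factory=list)    # filled in inner nodes
    points: list = field(default_factory=list)   # filled in leaf nodes


def intersects(low, high, ql, qh):
    return all(lo <= b and a <= hi for lo, hi, a, b in zip(low, high, ql, qh))


def matched(sig, q_sig, ql, qh, all_ones):
    for s, q, lo, hi in zip(sig, q_sig, ql, qh):
        if q == all_ones:
            continue
        if (lo == hi and (s & q) != q) or (lo != hi and (s & q) == 0):
            return False
    return True


def range_query(root, ql, qh, q_sig, all_ones):
    result, searched_leaves = [], 0
    stack = [root]
    while stack:
        node = stack.pop()
        if not node.items:                        # leaf node
            searched_leaves += 1
            result += [p for p in node.points
                       if all(a <= x <= b for x, a, b in zip(p, ql, qh))]
            continue
        for it in node.items:
            # The classical R-tree test, refined by the signature test.
            if intersects(it.low, it.high, ql, qh) and \
               matched(it.sig, q_sig, ql, qh, all_ones):
                stack.append(it.child)
    return result, searched_leaves


leaf_a = Node(points=[(4, 1), (4, 5), (6, 4)])
leaf_b = Node(points=[(1, 2), (3, 2)])
root = Node(items=[
    Item((4, 1), (6, 5), (0b110, 0b101), leaf_a),
    Item((1, 2), (3, 2), (0b011, 0b010), leaf_b),
])
found, visited = range_query(root, (0, 2), (7, 2), (0b111, 0b010), 0b111)
print(found, visited)   # [(1, 2), (3, 2)] 1
```

Both MBBs intersect the query box, but only one leaf is read: the first region is rejected by the signature test alone.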

The increase of the data structure volume after adding the n-dimensional signatures is not enormous, but the time complexity of the algorithm is improved (see Section 9.5.3). The important point is that the algorithm stays unchanged for a general range query. In general, a signature reduces the volume of a region because it follows the exact distribution of the points. The n-dimensional signature forms a spatial region as well. Let $R_S$ be the signature region formed for a set of points with the MBB $R_{MBB}$. Both the intersection and the signature operations are applied to the filtration of irrelevant regions. Consequently, regardless of the shape of the signature region: $R_{MBB} \cap R_S \ne \emptyset$, $R_{MBB} \cap R_S \subseteq R_{MBB}$. It follows that $N_{RQ} \le N_I$. The most important point is that the efficiency of the signature extension is always better than or equal to that of the R-tree. Our experiments show that the efficiency of the Signature R-tree is always better for the tested real data.


Example 9.3 (Signature region). In Figure 9.3, tuples and regions from Examples 9.1 and 9.2 are depicted. Signature regions are shown in particular dimensions. We see that the result region is smaller than the MBB.

[Figure 9.3: three panels over the two-dimensional space with coordinates 0–7 showing the tuples T1, T2, T3.]

Fig. 9.3. Tuples and regions from Examples 9.1 and 9.2: (a) MBB (b) signature region (c) the result region – an intersection of the MBB and signature region

9.5.3

Cost Analysis

The complexity of the basic operations Find, Insert, and Delete is not changed in the case of the Signature R-tree. The policy of node splitting and the complexity of the splitting algorithm depend on the chosen R-tree variant or the selected splitting algorithm [39, 70, 7]. In the case of the Signature R-tree, a change of tuples in a leaf node must be propagated to changes of n-dimensional signatures in all inner nodes of the current path. Consequently, the complexity is preserved. The complexity of the general range query algorithm is $O(N_I \times \log_c m)$, where $N_I$ is the number of intersected regions and c is the node capacity. The value $c_R$ becomes much lower than 1 for a narrow range query (particularly with increasing dimension of the indexed space). In the case of the Signature R-tree the complexity is $O(N_{RQ} \times \log_c m)$, where $N_{RQ}$ is the number of searched regions (leaf nodes). Our experiments show $N_I \gg N_{RQ} \ge N_R$, consequently $c_Q \to 1$ in the case of the Signature R-tree. In other words, additional space is spent in order to reduce the time complexity. The R-tree clusters points of n-dimensional space into regions and follows an approximate distribution of data. The properties of clustering weaken the curse of dimensionality. Since the signatures follow the exact distribution of data, the signatures suppress the curse of dimensionality explicitly. Consequently, disk access cost is decreased during the processing of narrow range queries.

9.6

Signature Generating

Now, we shall describe a method for generating an n-dimensional signature suitable for more effective processing of a narrow range query. In the case of a multi-dimensional data structure, point clustering is controlled by the principles of the particular structure. As far as the R-tree is concerned, points are clustered into MBBs. In the case of signature data structures (e.g. the S-tree), signatures with a minimal Hamming distance are clustered. Of course, the signatures of points clustered in a region of a multi-dimensional data structure do not have minimal Hamming distances. If the signature of a point contained more true bits, the signature of a region's tuples would contain almost only true bits. Such a signature does not filter any irrelevant regions, because the probability of a false drop is close to one. Evidently, the weight of a signature must be as small as possible. Therefore, a hash function mapping each value to only one bit in a signature is used. Of course, we cannot thicken the signature of a region by adding extra true bits for the reduction of the false drop probability, as is known from the S-tree. In this case, the n-dimensional signatures of the higher tree levels would contain almost only true bits again.

Let $F: D \to H$ be a hash function. Let us take the domain $D = \{0, 1, \ldots, 2^{l_D} - 1\}$ and the range $H = \{0, 1, \ldots, 2^{l_S} - 1\}$ (see Definition 9.3). The hash function F is created by a generator of pseudo-random numbers (e.g. a generator with a normal distribution). If $|D| = l_S$, it is suitable to define F as a simple mapping. If $H = \{2^0, 2^1, \ldots, 2^{l_S - 1}\}$, only one bit is generated for each value and $\gamma(S^n)$ is the smallest. If $ql_i = qh_i$ for a query box's coordinate, then $\gamma(S_i) = 1$. For pure detection of irrelevant regions it must hold $\gamma(S_i)$ = the number of node items. Taking the hierarchy of n-dimensional signatures into consideration, it must hold $l_S = |D|$ for pure detection of irrelevant tree nodes. Such a length of a signature is not possible in real cases (often $|D| = 2^{32}$). The suitable length of the n-dimensional signature is a subject for experiments.

An n-dimensional signature is created by superimposing signatures independently for each dimension. Consequently, the following case can arise: the region containing points (2, 3, 4) and (1, 5, 1) is relevant to the query box QL = (2, 5, 0), QH = (2, 5, max(D)) from the signature filtering point of view, although it contains no point of the query box. Experimental results show that this is not the case for the R-tree, because the clustering does not work in this way. Of course, it depends on the data distribution.
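A one-bit signature function and the false drop described above can be reproduced with a short Python sketch. The multiplicative hash used for F is only a hypothetical stand-in for the pseudo-random generator mentioned in the text, and the signature length of 64 bits per dimension is chosen for illustration.

```python
L_S = 64                       # bits per partial signature (illustrative choice)
ALL_ONES = (1 << L_S) - 1


def F(value):
    # Hypothetical one-bit hash: the result always lies in {2^0, ..., 2^(L_S-1)}.
    return 1 << ((value * 2654435761) % 2**32 % L_S)


def weight(sig):
    return bin(sig).count("1")


def nd_signature(points):
    sig = [0] * len(points[0])
    for p in points:
        for j, v in enumerate(p):
            sig[j] |= F(v)
    return tuple(sig)


region = [(2, 3, 4), (1, 5, 1)]
sig = nd_signature(region)

# Query box QL = (2, 5, 0), QH = (2, 5, max(D)): two point coordinates and one
# unrestricted coordinate, whose partial signature has only true bits.
q_sig = (F(2), F(5), ALL_ONES)

signature_match = all(q == ALL_ONES or (s & q) == q for s, q in zip(sig, q_sig))
really_relevant = any(p[0] == 2 and p[1] == 5 for p in region)
print(signature_match, really_relevant)   # True False, i.e. a false drop
```

The false drop occurs for any choice of F, because the value 2 in the first dimension and the value 5 in the second dimension come from two different points of the region.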

Part IV Experimental Results


Chapter 10

Multi-dimensional Approach to Indexing XML Data

In our experiments¹, we used the Protein Sequence Database XML document from the XML Data Repository project [78]. The document size is 683 MB. It includes 21,305,818 elements and 1,290,647 attributes. 17,007,879 paths, 76 labelled paths, and 2,156,552 terms were obtained from this document. With respect to the frequency of the path lengths, a multi-dimensional forest with two trees indexing spaces of dimension n = 7 and n = 9 was created for indexing the XML data. We used the BUB-tree, the R∗-tree, and their signature variants with the length of the n-dimensional signature n × 64. The length was determined on the basis of the experiments with signature multi-dimensional trees described in the next chapter.

Table 10.1. Path index characteristics

Dimension n   Number of points   Index size [MB]
                                 BUB-tree   Sign. BUB-tree   R∗-tree   Sign. R∗-tree
7             8,268,357          440.9      471 [+7%]        478.6     512.2 [+7%]
9             8,739,522          562.1      635.2 [+13%]     603.1     680.7 [+13%]

Dimension n   Number of nodes (BUB-tree)   Node item size [B]/Node capacity
              inner      leaf              inner node                leaf node
                                           R∗-tree   Sign. R∗-tree   R∗-tree   Sign. R∗-tree
7             10,917     214,842           56/34     112/17          32/63     32/63
9             17,751     270,065           72/26     144/13          40/51     40/51

¹ The experiments were executed on an Intel Pentium® 4 2.4 GHz, 512 MB DDR333, under Windows XP. The Visual C++ .NET 2003 compiler was employed.


Table 10.1 summarizes the path index characteristics; square brackets include the increase of index volume for the signature multi-dimensional trees. The node size of 2048 B was used for all trees. An average utilisation of 62% was reached.

First, queries for values of elements and attributes and queries defined by a simple path based on a parent-child relationship, similar to /ProteinDatabase/ProteinEntry[reference/refinfo/authors/author='Smith, E.L.'], were tested. For each space two sets of queries were selected. The first space of dimension 7 includes paths of the maximal length 5 and the second one of dimension 9 includes paths of the maximal length 7. The first set includes queries with a smaller result size (below 10), the second one includes queries with a larger result size ($10^3$–$10^4$). In all cases, the ratio of the number of searched leaf nodes to the number of all leaf nodes, the disk access cost (DAC), and the query processing time were measured. The results of query processing are presented in Table 10.2. Evidently, the R∗-tree shows better properties than the BUB-tree during narrow range query processing. Signature variants of multi-dimensional data structures provide better efficiency than the classical data structures. In the case of the Signature R∗-tree only 0.14% of leaf nodes were searched and the average time of query processing is 70 ms.

Table 10.2. Results of queries for values of elements and attributes and queries defined by simple path based on a parent-child relationship in the path index

Searched leaf nodes [%]
Query set   BUB-tree   Sign. BUB-tree   R∗-tree   Sign. R∗-tree
1           0.05       0.04             0.32      0.061
2           2.3        0.4              0.60      0.45
3           0.04       0.03             0.20      0.013
4           2.2        0.4              0.05      0.03
Avg.        1.15       0.22             0.29      0.14

DAC
Query set   BUB-tree   Sign. BUB-tree   R∗-tree   Sign. R∗-tree
1           324        319              960       74
2           15,275     1,124            1,671     1,043
3           513        489              817       477
4           14,451     1,245            220       184
Avg.        7,566      794              917       445

Time [s]
Query set   BUB-tree   Sign. BUB-tree   R∗-tree   Sign. R∗-tree
1           0.13       0.1              0.08      0.015
2           5.8        0.9              0.19      0.094
3           0.2        0.2              0.20      0.140
4           5.6        0.8              0.017     0.015
Avg.        2.9        0.5              0.12      0.07

Second, the efficiency of the XPath axes implementation was tested. In all cases, the DAC and the query processing time for the Signature R∗-tree were measured. Results of query processing are presented in Table 10.3. Since some axes (child, following-sibling, and preceding-sibling) are implemented by a sequence of range queries, the DAC for one range query is presented in Table 10.3 as well. The number of searched leaf nodes is very low again. Consequently, the DAC and the time of query processing are low as well.

Table 10.3. Results of XPath axes queries in the path index

Axis                 Number of    Number      Searched     DAC       DAC for        Time [s]
                     resultant    of points   leaf nodes             simple range
                     elements                 [%]                    query
descendant           1,121        982         0.20         621       -              0.06
child                9            9           0.05         225       25             0.1
descendant-or-self   1,245        1,015       0.23         648       -              0.05
ancestor             -            5           -            -         -              -
parent               -            1           -            -         -              -
ancestor-or-self     -            6           -            -         -              -
following            2,487        2,017       0.45         1,387     -              0.1
preceding            2,312        1,803       0.39         1,124     -              0.09
following-sibling    5            5           0.04         187       27             0.09
preceding-sibling    7            7           0.03         165       24             0.09
Avg.                 1026.6       585         0.20         622.4     25.3           0.08

Chapter 11

Signature Multi-dimensional Data Structures

Whereas the previous chapter describes the experimental results of the application of multi-dimensional data structures for indexing XML data, this chapter is devoted to detailed tests of the signature multi-dimensional data structures. The Protein Sequence Database XML document was used for the experiments. 17,007,879 points were obtained from this document. With respect to the frequency of the path lengths (see Figure 8.1), two multi-dimensional indices indexing spaces of dimension n = 7 and n = 9 were created for indexing the XML data. The domain cardinality of the spaces is $|D| = 2^{32}$. The R∗-tree and Signature R∗-tree data structures were used for indexing the spaces. The lengths of the n-dimensional signatures were chosen as n × 32, n × 64, and n × 128. Tables 11.1 and 11.2 summarize the characteristics of the data collection and index size, and of the index multi-dimensional data structures, respectively. Square brackets include the increase of index volume for Signature R∗-trees. An average utilisation of 62% was reached in all cases.

Table 11.1. A characterization of the test data collection and the size of index file

Dimension n   Number of points   Index size [MB]
                                 R∗-tree   Signature R∗-tree
                                           n × 32         n × 64          n × 128
7             8,268,357          478.6     493 [+3%]      512.2 [+7%]     536 [+12%]
9             8,739,522          603.1     651.4 [+8%]    680.7 [+13%]    711.7 [+18%]


Table 11.2. A characterization of index multi-dimensional data structures

Number of inner nodes
Dimension n   R∗-tree   Signature R∗-tree
                        n × 32    n × 64    n × 128
7             15,731    22,456    33,186    65,412
9             24,750    36,451    55,750    112,412

Number of leaf nodes
Dimension n   R∗-tree   Signature R∗-tree
                        n × 32    n × 64    n × 128
7             256,520   257,124   258,187   260,741
9             318,370   320,741   331,474   335,846

Item size [B]/Node capacity, inner node
Dimension n   R∗-tree   Signature R∗-tree
                        n × 32    n × 64    n × 128
7             56/34     84/23     112/17    168/11
9             72/26     108/18    144/13    216/9

Item size [B]/Node capacity, leaf node
Dimension n   R∗-tree   Signature R∗-tree
                        n × 32    n × 64    n × 128
7             32/63     32/63     32/63     32/63
9             40/51     40/51     40/51     40/51

Height of tree
Dimension n   R∗-tree   Signature R∗-tree
                        n × 32    n × 64    n × 128
7             5         5         6         8
9             5         6         8         9
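The inner-node item sizes in Table 11.2 are consistent with a simple accounting: an MBB is stored as two n-dimensional points and the n-dimensional signature adds its length in bits divided by eight. The following Python sketch reproduces the tabulated sizes under the assumption of 4-byte coordinates (suggested by $|D| = 2^{32}$); the function name is hypothetical, and the node capacities are not modelled because they also depend on per-node overhead that is not given here.

```python
def inner_item_size(n, sig_bits_per_dim=0, coord_bytes=4):
    """Size in bytes of one inner-node item (assumed layout, see lead-in)."""
    mbb_bytes = 2 * n * coord_bytes          # two corner points of the MBB
    sig_bytes = n * sig_bits_per_dim // 8    # n partial signatures
    return mbb_bytes + sig_bytes


for n in (7, 9):
    sizes = [inner_item_size(n, bits) for bits in (0, 32, 64, 128)]
    print(n, sizes)
# 7 [56, 84, 112, 168]
# 9 [72, 108, 144, 216]
```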

Inserted n-dimensional signatures enlarge the size of inner node items, so the capacity of the inner nodes decreases (for equal node size) with the growing length of the signature. Consequently, the tree height increases. For a fair comparison, we chose the same node size of 2048 B for all trees. Of course, the node size can be extended and the properties of the Signature R-tree can be improved (the height will be lower). We can see that the index size of the Signature R∗-trees grows by 3–18% in comparison to the R∗-tree. The overhead of a Signature R∗-tree grows with the signature length. We must choose an appropriate trade-off between the smaller number of searched regions during range query processing and the growth of the data structure overhead. The following results show that the index size grows only by units of percent for a signature length which filters the irrelevant tree nodes very well.

Two sets of queries were tested for each space. The first set includes queries with a smaller result set (< 10), the second set holds queries with a larger result set (about $10^3$). Each query processes a simple XPath query for values of elements and attributes of the XML document like /ProteinDatabase/ProteinEntry[reference/refinfo/citation='Nature']. In Table 11.3 a characterization of the narrow range query sets is shown. Note that the results are averages over all tests. The values $N_I$ (number of the intersected regions, see Definition 9.1), $N_R$ (number of the relevant regions), and $c_R$ (ratio of the relevant and intersected regions) are given for the R∗-tree. Naturally, the relevance ratio $c_R$ is approximately the same for all data structures. We see that the ratio is rather low (≪ 1) for the narrow range queries. Consequently, the efficiency of query processing is not optimal for current multi-dimensional data structures.

Table 11.3. A characterisation of narrow range query sets

Query set   Dimension n   n_ψ (see Definition 3.3)   Result size   N_I     N_R     c_R
1           7             2                          5             828     1       0.0012
2           7             2                          3,397         1,542   717     0.47
3           9             2                          8             641     7       0.01
4           9             2                          2,794         171     136     0.8
Avg.        -             -                          1,551         795.5   215.3   0.32

Table 11.4. Experimental results of processing the narrow range queries – N_RQ

Query set   N_RQ
            R∗-tree   Signature R∗-tree
                      n × 32   n × 64   n × 128
1           828       162      16       10
2           1,542     1,142    792      761
3           641       258      44       36
4           171       165      136      136
Avg.        795.5     431.8    247      235.8

The efficiency of narrow range query processing was measured by the number of searched leaf nodes (regions) $N_{RQ}$, the ratio $c_Q$ (ratio of the relevant and searched regions, see Definition 9.2), the disk access cost, and the time of query processing. In Table 11.4 the $N_{RQ}$ values are presented for the R∗-tree and for the Signature R∗-tree with several lengths of the n-dimensional signature.
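As a worked illustration of the two ratios, the following Python sketch computes $c_R = N_R / N_I$ and $c_Q = N_R / N_{RQ}$ for query set 1 from the values of Tables 11.3 and 11.4. The formulas follow the verbal descriptions given in the text (Definitions 9.1 and 9.2 themselves are not repeated here), the variable names are hypothetical, and the computed ratios can differ slightly from the tabulated ones because the tables report averages over all tests.

```python
def ratio(relevant, visited):
    # Share of relevant regions among the regions that were visited.
    return relevant / visited


# Query set 1, dimension n = 7 (Tables 11.3 and 11.4).
n_i, n_r = 828, 1
n_rq = {"R*-tree": 828, "n x 32": 162, "n x 64": 16, "n x 128": 10}

c_r = ratio(n_r, n_i)
c_q = {name: ratio(n_r, searched) for name, searched in n_rq.items()}

print(round(c_r, 4))                                   # 0.0012
for name, value in c_q.items():
    print(name, round(value, 4), f"{value / c_r:.1f}x")
```

For the n × 64 signature the sketch gives $c_Q = 1/16 \approx 0.06$, that is, roughly one searched region in sixteen contains a result, compared with one in 828 for the R∗-tree.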


Table 11.5 presents the ratio of $N_{RQ}$ to the number of all leaf nodes. We can see that the ratio is smaller in the case of the Signature R∗-tree. Of course, the percentage is smaller for longer n-dimensional signatures. Another view of the trend is given by the higher values of the $c_Q$ ratio in Table 11.6. We see that the ratio is closer to one than in the case of the R∗-tree. Consequently, the improvement of the Signature R∗-tree (compared to the R∗-tree) is 1.2–83× (see the second part of Table 11.6).

Table 11.5. Experimental results of processing the narrow range queries – the ratio of searched leaf nodes

Query set   Ratio of searched leaf nodes [%]
            R∗-tree   Signature R∗-tree
                      n × 32   n × 64   n × 128
1           0.32      0.061    0.006    0.004
2           0.60      0.45     0.310    0.300
3           0.20      0.16     0.013    0.011
4           0.05      0.04     0.040    0.040
Avg.        0.29      0.18     0.11     0.089

In Table 11.7 the DAC is presented for the processing of the test queries. The number of searched leaf nodes (and of inner nodes as well) is lower in the case of the Signature R∗-tree. Although the n-dimensional signature increases the data structure overhead, the DAC is decreased. We see that we must choose a compromise between the better quality of a longer signature and a lower data structure overhead. The optimal length of the n-dimensional signature is n × 64 in this case. Table 11.8, which contains the times of query processing, supports the conclusion. We see that the Signature R∗-tree provides a better efficiency of processing the narrow range queries.


Table 11.6. Experimental results of processing the narrow range queries – cQ ratio

Query set   cQ
            R∗-tree   Signature R∗-tree
                      n × 32   n × 64   n × 128
1           0.0012    0.006    0.06     0.1
2           0.47      0.63     0.91     0.94
3           0.01      0.03     0.16     0.2
4           0.80      0.96     1        1
Avg.        0.32      0.41     0.53     0.56

Query set   Improvement of cQ ratio, Signature R∗-tree
            n × 32   n × 64   n × 128
1           5×       53×      83×
2           1.3×     2×       2×
3           3×       16×      20×
4           1.2×     1.3×     1.3×
Avg.        2.6×     18.1×    26.6×

Table 11.7. Experimental results of processing the narrow range queries – the disk access cost

Query set   DAC
            R∗-tree   Signature R∗-tree
                      n × 32   n × 64   n × 128
1           960       478      74       108
2           1,671     1,393    1,043    1,285
3           817       396      477      340
4           220       200      184      201
Avg.        917       616.8    444.5    483.5

Query set   Improvement ratio of DAC, Signature R∗-tree
            n × 32   n × 64   n × 128
1           2×       13×      9×
2           1.2×     1.6×     1.3×
3           2×       1.7×     1.4×
4           1.1×     1.2×     1.1×
Avg.        1.6×     4.3×     3.2×


Table 11.8. Experimental results of processing the narrow range queries – time of query processing

Query set   Time of query processing [s]
            R∗-tree   Signature R∗-tree
                      n × 32   n × 64   n × 128
1           0.08      0.05     0.015    0.03
2           0.19      0.17     0.094    0.16
3           0.20      0.14     0.140    0.15
4           0.017     0.015    0.015    0.015
Avg.        0.12      0.094    0.066    0.089

Query set   Improvement ratio, Signature R∗-tree
            n × 32   n × 64   n × 128
1           1.7×     5×       3×
2           1.1×     2×       1.2×
3           1.5×     1.5×     1.3×
4           1.1×     1.1×     1.1×
Avg.        1.4×     2.4×     1.7×

Chapter 12

Multi-dimensional Term Indexing for Efficient Processing of Complex Queries

Whereas Chapter 10 describes the experimental results of the application of multi-dimensional data structures for indexing XML data, this chapter is devoted to detailed tests of the multi-dimensional forest. In our experiments, we used terms from the TREC document collections (see Figure 8.2), including 816,716 unique terms. Several regular expression queries were processed, each by the classical B-tree-based inverted list as well as by the multi-dimensional approach using the UB-tree, BUB-tree, and BUB-forest. Several term datasets were created, according to the choice of the maximal allowed term length. The dataset for the maximal term length 9 contained 509,258 terms; the dataset for the maximal term length 40 contained 816,716 terms. All the term datasets were indexed by the B-tree, UB-tree, BUB-tree, and BUB-forest. The following tables summarize the index characteristics:

B-tree characteristics
tree height        4
node capacity      26
number of nodes    27,857–46,494
utilization        73–68%
index size         9.5–52.9 MB

UB-tree characteristics
|D|                $2^8$
dimension          9–40
tree height        4
number of nodes    29,121–46,657
Z-regions          27,182–43,594
utilization        71.3–71.2%
node capacity      26
node size          355–1192 B
index size         9.9–53 MB

BUB-tree characteristics
|D|                  $2^8$
dimension            9–40
tree height          4
number of nodes      26,358–37,027
Z-regions            24,322–34,180
utilization          66.9–66.6%
inner node capacity  19
leaf node capacity   31–36
node size            422–1600 B
index size           10.6–56.5 MB


According to the maximal allowed term lengths, BUB-forests BF1(9) – BF4(9, 13, 17, 40) were used.

BUB-forest BF4(9, 13, 17, 40) characteristics
|D|                    $2^8$
index size             20.8 MB

BT1:
dimension              9
number of items        509,258
number of nodes        26,358
tree height            4
inner node capacity    19
leaf node capacity     31
node size              422 B
number of Z-regions    24,322
utilization            67.8%

...

BT4:
dimension              40
number of items        27,408
number of nodes        1,252
tree height            3
inner node capacity    19
leaf node capacity     36
node size              1600 B
number of Z-regions    1,159
utilization            68.2%

The left, right and left-right extensions were tested. For the left extension, expressions soft*, atom*, and sub* were specified, for the right extension, expressions *soft, *less, and *session were specified, and for the left-right extension, expressions *machine*, *nalist*, and *scient* were specified. In all cases, disk access costs, the number of compared terms, and query processing real times were observed with respect to the increasing length of terms. The values of particular results were averaged. The DAC was computed as the number of logical accesses to disk pages times the size of a disk page (which is fixed). Depending on the particular regular expression query and the maximal length of terms, the number of retrieved terms (i.e. the query selectivity) was between 0 and 1182.

Results of the right extension query are presented in Figure 12.1. In the case of the B-tree, this query was performed very efficiently since the disk access costs and the number of compared terms are the lowest. This fact is also reflected by the achieved real times. When compared with the UB-tree and BUB-tree, the BUB-forest stores shorter Z-addresses, which is reflected by lower disk access costs and query processing times (see Figures 12.1a and 12.1c).

Results of the left-right extension query and the left extension query are presented in Figures 12.2 and 12.3. For processing of the queries by the B-tree, all the terms must be sequentially retrieved and compared against the query (see Figure 12.2b). The costs are thus linear in the number of terms. For the multi-dimensional approach, the number of compared terms (Figures 12.2b and 12.3b) as well as the disk access costs (see Figures 12.2a and 12.3a) are lower than for the B-tree. For the multi-dimensional approach, the efficiency significantly decreases with growing dimension: between dimensionalities 9 and 15 the number of indexed terms increased by 50%, but the number of compared terms increased up to 32 times during the execution of the queries.
The results show that the BUB-forest by itself does not solve the problem of the curse of dimensionality. However, the storage of shorter Z-addresses is beneficial, as we can observe from the disk access costs (see Figures 12.2a and 12.3a) as well as from the query processing real times (see Figures 12.2c and 12.3c). The efficiency improvement of the BUB-forest over the UB-tree or the BUB-tree is up to 50%. Due to the fact that a regular expression is processed by a sequence of narrow range queries, the efficiency of the UB-tree and the BUB-tree is almost the same. If we take into account that most of the real-world terms are shorter than 15 characters (see the term length distribution in Figure 8.2), we could claim that the multi-dimensional approach is very efficient.

[Figure 12.1: panels (a), (b), (c).]

Fig. 12.1. Statistics of right extension test

[Figure 12.2: panels (a), (b), (c).]

Fig. 12.2. Statistics of left-right extension test

[Figure 12.3: panels (a), (b), (c).]

Fig. 12.3. Statistics of left extension test

Conclusion

In this work the multi-dimensional approach to indexing XML data was described. The multi-dimensional approach splits an XML document into root-to-leaf paths, which are mapped into multi-dimensional points. Such points are indexed by balanced and paged multi-dimensional data structures. Consequently, XML queries are processed using the queries of these data structures: point and range queries are applied. The implementation of a query for values of elements and attributes, of a query defined by a simple path based on an ancestor-descendant relationship, and of the XPath axes was described. Two techniques for indexing multiple XML documents as well as inserting, updating, and deleting in previously indexed XML documents were presented. Consequently, the approach is promising for the implementation of a native XML database. A relevant problem of indexing XML data is term indexing; consequently, a novel multi-dimensional approach to term indexing was explained. The approach makes it possible to process complex queries over a large term collection.

Due to the fact that the obtained points are specific, a straightforward application of existing data structures like the R-tree and the UB-tree does not lead to success. The first particularity is that the points have very different dimensions. The overhead of current data structures is enormous for indexing such points. Therefore, the query processing is not very efficient. A multi-dimensional forest (e.g. the BUB-forest) is employed for indexing such points. The second particularity is that the range query used in the multi-dimensional approach to indexing XML data is the so-called narrow range query. Processing of this query is not very efficient in existing multi-dimensional data structures. Signature multi-dimensional data structures (e.g. the Signature R-tree) were developed for more efficient processing of narrow range queries.

Conventional approaches index particular elements (and attributes). A simple query for values of elements and a query defined by a simple path based on an ancestor-descendant relationship are processed by a consecutive filtering of elements which are not in the ancestor-descendant relationship until the result is retrieved. In our approach such queries are processed using one query in a data structure, and the filtering of a large number of irrelevant elements does not occur.

In our future work, we would like to further improve the abilities and the efficiency of the multi-dimensional approach. In particular, we will develop an implementation of other complex XML queries which are defined by XML query languages such as XPath and XQuery. We would like to use the data types described by XML Schema for querying and to develop an efficient implementation of approximate querying of XML documents. Moreover, querying of compressed XML data is an important research challenge these days.

Bibliography [1] Ricardo Baeza-Yates and Berthier Ribiero-Neto. Modern Information Retrieval. Addison Wesley, New York, 1999. [2] Dmitry Barashev, Michal Kr´atk´ y, and Tom´aˇs Skopal. Modern Approaches to Indexing ˇ XML Data. Transactions of VSB-Technical University of Ostrava, Computer Science and Mathematics Series, 2(2):19–30, 2003. [3] Dmitry Barashev and Boris Novikov. Indexing XML Data to Support Regular Expressions. In Proceedings of Advances in Databases and Information Systems, ADBIS 2002, 6th East European Conference, Bratislava, Slovakia, volume Research Commmunications, pages 1–10, September 8-11, 2002. [4] Michael G. Bauer, Frank Ramsak, and Rudolf Bayer. Indexing XML as a Multidimensional Problem. Technical Report TUM-I0203, Technical University M¨ unchen, 2002, http://www3.informatik.tu-muenchen.de/~bauermi/papers/TUM-I0203.pdf. [5] Michael G. Bauer, Frank Ramsak, and Rudolf Bayer. Multidimensional Mapping and Indexing of XML. In Proceedings of BTW 2003, Leipzig, volume 26 of LNI. GI, 2003. [6] Rudolf Bayer. The Universal B-Tree for multidimensional indexing: General Concepts. In Proceedings of World-Wide Computing and Its Applications’97 (WWCA’97), Tsukuba, Japan, Lecture Notes in Computer Science. Springer–Verlag, 1997. [7] Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, and Bernhard Seeger. The R∗ tree: An efficient and robust access method for points and rectangles. In Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data, pages 322–331. ACM Press, 1990. [8] A. Belussi, E. Bertino, and B. Cataniac. Using spatial data access structures for filtering nearest neighbor queries. Data & Knowledge Engineering, 40(1):1–31, 2002. [9] Stefan Berchtold, Christian B¨ohm, Daniel A. Keim, and Hans-Peter Kriegel. A cost model for nearest neighbor search in high-dimensional data space. In Proceedings of ACM PODS Symphosium on Principles of Database Systems. ACM Press, 1997. 111

112

BIBLIOGRAPHY

[10] Christian B¨ohm, Stefan Berchtold, and Daniel A. Keim. Searching in Highdimensional Spaces – Index Structures for Improving the Performance Of Multimedia Databases. ACM Computing Surveys, 33(3):322–373, 2001. [11] Jon Bosak. Shakespeare in XML, 1999, http://www.ibiblio.org/xml/examples/ shakespeare/. [12] R. Bourret. XML and Databases, 2001, http://www.rpbourret.com/xml/XMLAndDatabases.htm. [13] Jae-Woo Chang, Joon Ho Lee, and Yoon-Joon Lee. Multikey access methods based on term discrimination and signature clustering. In Proceedings of 12th International Conference on Research and Development in Information Retrieval, SIGIR’89, Cambridge, Massachusetts, USA, pages 176–185. ACM Press, June, 1989. [14] Walter W. Chang and Hans-J¨org Schek. A signature access method for the starburst database system. In Proceedings of the 15th International Conference on Very Large Data Bases (VLDB’89), Amsterdam, The Netherlands, pages 145–153. Morgan Kaufmann, August, 1989. [15] Akmal B. Chaudhri, Awais Rashid, and Roberto Zicari. XML Data Management: Native XML and XML-Enabled Database Systems. Addison Wesley Professional, 2003. [16] Paolo Ciaccia, Marco Patella, and Pavel Zezula. M-tree: An Efficient Access Method for Similarity Search in Metric Spaces. In Proceedings of 23rd International Conference on Very Large Databases (VLDB’97), pages 426–435. Morgan Kaufmann, 1997. [17] Brian Cooper, Neal Sample, Michael J. Franklin, G´ısli R. Hjaltason, and Moshe Shadmon. A Fast Index for Semistructured Data. In Proceedings of the 27th International Conference on Very Large Data Bases (VLDB’01), pages 341–350. Morgan Kaufmann, 2001. [18] M. Crochemore and W. Rytter. Text Algorithms. Oxford University Press, 1994. [19] Uwe Deppisch. S-tree: A dynamic balanced signature index for office retrieval. In Proceedings of 9th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’86), Pisa, Italy, pages 77–87. ACM Press, September, 1986. 
[20] Alin Deutsch, Mary Fernandez, Daniela Florescu, Alon Levy, and Dan Suciu. XMLQL: A Query Language for XML. Technical report, WWW Consortium, August, 1998, http://www.w3.org/TR/NOTE-xml-ql/. [21] Alin Deutsch, Mary Fernandez, and Dan Suciu. Storing semistructured data with STORED. In Proceedings of 1999 ACM SIGMOD International Conference on Management of Data, pages 431–442. ACM Press, 1999.

BIBLIOGRAPHY

113

[22] Paul F. Dietz. Maintaining order in a linked list. In Proceedings of 14th annual ACM symposium on Theory of Computing (STOC 1982), pages 122–127, 1982. [23] Paul F. Dietz and Daniel D. Sleator. Two Algorithms for Maintaining Order in a List. In Conference Record of the 19th Annual ACM Symposium on Theory of Computing (STOC 1987), pages 365–372. ACM Press, May 1987. [24] Vlastislav Dohnal, Claudio Gennaro, and Pavel Zezula. A Metric Index for Approximate Text Management. In Proceedings of IASTED International Conference Information Systems and Database (ISDB 2002). ACTA Press, 2002. [25] Jiˇr´ı Dvorsk´ y, Michal Kr´atk´ y, Tom´aˇs Skopal, and V´aclav Sn´aˇsel. Term Indexing in Information Retrieval Systems. In Proceedings of Communications in Computing 2003 (CIC 2003), pages 263–270. CSREA Press, 2003. [26] Christos Faloutsos. Gray Codes for Partial Match and Range Queries. IEEE Transactions on Software Engineering, 14(10), 1988. [27] Christos Faloutsos and Stavros Christodoulakis. Signature Files: An Access Method for Documents and its Analytic Performance Evaluation. ACM Transactions on Information Systems, 2(4):267–288, 1984. [28] Robert Fenk. The BUB-Tree. In Proceedings of 28rd VLDB International Conference on Very Large Data Bases (VLDB’02), Hongkong, China. Morgan Kaufmann, 2002. [29] Robert Fenk, Volker Markl, and Rudolf Bayer. Improving Multidimensional Range Queries of non rectangular Volumes specified by a Query Box Set. In Proceedings of International Symposium on Database, Web and Cooperative Systems (DWACOS), 1999. [30] Mary Fernandez, Daniela Florescu, Jaewoo Kang, Alon Levy, and Dan Suciu. STRUDEL: a Web site management system. In Proceedings of 1997 ACM SIGMOD International Conference on Management of Data, pages 549–552. ACM Press, 1997. [31] P. Ferragina and R. Grossi. A fully-dynamic data structure for external substring search. In Proceedings of ACM Symposium on Theory of Computing (STOC 1995). ACM Press, 1995. 
[32] Daniela Florescu and Donald Kossmann. Storing and Querying XML Data using an RDMBS. IEEE Data Engineering Bulletin, 22(3):27–34, 1999. [33] Edward Fredkin. Trie memory. Communications of the ACM, 3(9):490–499, 1960. [34] Michael Freeston. A General Solution of the n-dimensional B-tree Problem. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, San Jose, USA. ACM Press, 1995.

114

BIBLIOGRAPHY

[35] Norbert Fuhr, Norbert Gvert, Saadia Malik, Mounia Lalmas, and Gabriella Kazai. INEX – Initiative for the Evaluation of XML Retrieval, 2004, http://www.is.informatik.uni-duisburg.de/projects/inex/index.html.en. [36] Volker Gaede and Oliver G¨ unther. Multidimensional Access Methods. ACM Computing Surveys, 30(2):170–231, 1998. [37] Charles F. Goldfarb and Paul Prescod. XML Handbook. Publisher Prentice Hall, December 2001. [38] Torsten Grust. Accelerating XPath Location Steps. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, Madison, USA. ACM Press, June 4-6, 2002. [39] Antonin Guttman. R-Trees: A Dynamic Index Structure for Spatial Searching. In Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data, Annual Meeting, Boston, USA, pages 47–57. ACM Press, June 1984. [40] International Organization for Standardization. ISO 8879: Standard Generalized Markup Language (SGML), http://www.iso.org, 1986. [41] Nikos Karayannidis, Aris Tsois, Timos Sellis, Roland Pieringer, Volker Markl, Frank Ramsak, Robert Fenk, Klaus Elhardt, and Rudolf Bayer. Processing Star Queries on Hierarchically-Clustered Fact Tables. In Proceedings of 28th International Conference on Very Large Data Bases (VLDB’02), Hongkong, China. Morgan Kaufmann, 2002. [42] Michal Kr´atk´ y, Jaroslav Pokorn´ y, Tom´aˇs Skopal, and V´aclav Sn´aˇsel. The Geometric Framework for Exact and Similarity Querying XML Data. In Proceedings of First EurAsian Conference, EurAsia-ICT 2002, Shiraz, Iran, volume 2510 of Lecture Notes in Computer Science. Springer–Verlag, October 27-31, 2002. [43] Michal Kr´atk´ y, Jaroslav Pokorn´ y, and V´aclav Sn´aˇsel. Implementation of XPath Axes in the Multi-dimensional Approach to Indexing XML Data. In Accepted at Proceedings of International Workshop on Database Technologies for Handling XML information on the Web, DataX, Int’l Conference on Extending Database Technology (EDBT 2004), Lecture Notes in Computer Science. 
Springer–Verlag, 2004. [44] Michal Kr´atk´ y, Jaroslav Pokorn´ y, and V´aclav Sn´aˇsel. Implementation of XPath Axes in the Multi-dimensional Approach to Indexing XML Data. In Proceedings of International Workshop on Database Technologies for Handling XML information on the Web, DataX, Int’l Conference on Extending Database Technology (EDBT 2004), 2004. [45] Michal Kr´atk´ y, Jaroslav Pokorn´ y, and V´aclav Sn´aˇsel. Indexing XML data with UBtrees. In Proceedings of Advances in Databases and Information Systems, ADBIS 2002, 6th East European Conference, Bratislava, Slovakia, volume Research Commmunications, pages 155–164, September 8-11, 2002.

BIBLIOGRAPHY

115

[46] Michal Kr´atk´ y, Tom´aˇs Skopal, and V´aclav Sn´aˇsel. Multidimensional Term Indexing for Efficient Processing of Complex Queries. Kybernetika, Journal of the Academy of Sciences of the Czech Republic, 40(3):381–396, 2004. [47] Michal Kr´atk´ y, V´aclav Sn´aˇsel, P. Zezula, Jaroslav Pokorn´ y, and Tom´aˇs Skopal. Efficient Processing of Narrow Range Queries in the R-Tree. Technical Report ARG-TR-01-2004, Amphora Research Group (ARG), Deˇ partment of Computer Science, VSB–Technical University of Ostrava, 2004, http://www.cs.vsb.cz/arg/techreports/sigrtree.pdf. [48] M. Ley. How to mirror DBLP, 1998, http://www.informatik.uni-trier.de/~ley/db/about/instr.html. [49] Quanzhong Li and Bongki Moon. Indexing and Querying XML Data for Regular Path Expressions. In Proceedings of 27th International Conference on Very Large Data Bases (VLDB’01). Morgan Kaufmann, 2001. [50] Yannis Manolopoulos, Alexandros Nanopoulos, Apostolos N. Papadopoulos, and Yannis Theodoridis. R-trees Have Grown Everywhere. Submitted to ACM Computing Surveys, 2003. [51] Yannis Manolopoulos, Alexandros Nanopoulos, and Eleni Tousidou. Advanced Signature Indexing for Multimedia and Web Applications. The Kluwer International Series on Advances in Database Systems, 2003. [52] Yannis Manolopoulos, Yannis Theodoridis, and Vassilis J. Tsotras. Advanced Database Indexing. Kluwer Academic Publisher, 2001. [53] Volker Markl. Mistral: Processing Relational Queries using a Multidimensional Access Technique. Ph.D. thesis, Technical University M¨ unchen, Germany, 1999, http://mistral.in.tum.de/results/publications/Mar99.pdf. [54] Maarten Marx. Conditional XPath, the first order complete XPath dialect. In Proceedings of Symposium on Principles of Database Systems (PODS 2004), Paris, France. ACM Press, 2004. [55] Maarten Marx. XPath with Conditional Axis Relations. 
In Proceedings of 9th International Conference on Extending Database Technology (EDBT’04), Heraklion, Crete, Greece, volume 2992 of Lecture Notes in Computer Science, pages 477–494. Springer– Verlag, March 14-18, 2004. [56] Jason McHugh, Serge Abiteboul, Roy Goldman, Dallas Quass, and Jennifer Widom. Lore: A Database Management System for Semistructured Data. ACM SIGMOD Record, 26(3):54–66, 1997.

116

BIBLIOGRAPHY

[57] Laurent Mignet, Denilson Barbosa, and Pierangelo Veltri. The XML Web: a First Study. In Proceedings of Twelfth International World Wide Web Conference, WWW 2003. ACM Press, 2003. [58] Donald R. Morrison. PATRICIA - Practical Algorithm To Retrieve Coded in Alphanumeric. Journal of the ACM (JACM), 15(4):514–534, 1968. [59] NIST. Text REtrieval Conference (TREC), 2004, http://trec.nist.gov/. [60] Open GIS Consortium. Geography Markup Language (GML) Implementation Specification, Version: 3.00, http://www.opengis.org/docs/02-023r4.pdf, January 2003. [61] O’Reilly. Docbook: The definitive guide, 2003, http://www.docbook.org. [62] Dong-Joo Park, Shin Heu, and Hyoung-Joo Kim. The RS-tree: An Efficient Data Structure for Distance Browsing Queries. Information Processing Letters, 80(4):195– 203, November, 2001. [63] Jaroslav Pokorn´ y. XML: a challenge for databases?, pages 147–164. Kluwer Academic Publishers, Boston, 2001. [64] J. Widom R. Goldman. DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases. In Proceedings of 23rd International Conference on Very Large Data Bases (VLDB’97), pages 436–445. Morgan Kaufmann, 1997. [65] F. Ramsak, V. Markl, R. Fenk, M. Zirkel, K. Elhardt, and R. Bayer. Integrating the UB-tree into a Database System Kernel. In Proceedings of 26th International Conference on Very Large Data Bases (VLDB’00). Morgan Kaufmann, 2000. [66] Jonathan Robie. XML Query Language (XQL), 1999, http://metalab.unc.edu/ xql/xql-proposal.xml. [67] S. Roman. Advanced Linear Algebra. Springer Verlag, 1995. [68] Gerard Salton and Michael J. McGill. Introduction to Modern Information Retrieval. McGraw Hill Publications, 1st edition, 1983. [69] Albrecht R. Schmidt, Florian Waas, Martin L. Kersten, Daniela Florescu, Ioana Manolescu, Michael J. Carey, and Ralph Busse. The XML Benchmark. Technical Report INS-R0103, CWI, Amsterdam, The Netherlands, April, 2001, http://monetdb.cwi.nl/xml/. [70] Timos K. 
Sellis, Nick Roussopoulos, and Christos Faloutsos. The R+ -Tree: A Dynamic Index For Multi-Dimensional Objects. In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB’97), pages 507–518. Morgan Kaufmann, 1997.

BIBLIOGRAPHY

117

[71] Dongwook Shin. XML Indexing and Retrieval with a Hybrid Storage Model. Knowledge and Information Systems, 3(2), 2001. [72] Dongwook Shin, H. Jang, and H. Jin. BUS: an effective indexing and retrieval scheme in structured documents. In Proceedings of the third ACM Conference on Digital Libraries, New York, USA, pages 235–243, 1998. [73] Tom´aˇs Skopal, Michal Kr´atk´ y, V´aclav Sn´aˇsel, and Jaroslav Pokorn´ y. On Range Queries in Universal B-trees. Technical Report ARG-TR-01-2003, Amphora Research ˇ Group (ARG), Department of Computer Science, VSB–Technical University of Ostrava, 2003, http://www.cs.vsb.cz/arg/techreports/range.pdf. [74] Tom´aˇs Skopal, Michal Kr´atk´ y, V´aclav Sn´aˇsel, and Jaroslav Pokorn´ y. A New Range Query Algorithm for the Universal B-trees. Submitted to Information Systems Journal, March 9, 2004. [75] Tom´aˇs Skopal, Pavel Moravec, Michal Kr´atk´ y, V´aclav Sn´aˇsel, and Jaroslav Pokorn´ y. An Efficient Implementation of the Vector Model in Information Retrieval. In Proceedings of the fifth National Russian Research Conference, RCDL’2003, Digital Libraries: Advanced Methods and Technologies, Digital Collections, Saint-Petersburg, Russia, pages 170–179. Saint-Petersburg State University Published Press, 2003. [76] Stefan Berchtold and Daniel A. Keim and Hans-Peter Kriegel. The X-Tree: An Index Structure for High-Dimensional Data. In Proceedings of the 22nd International Conference on Very Large Databases (VLDB’96), pages 28–39, San Francisco, USA, 1996. Morgan Kaufmann. [77] Graham A. Stephen. String Searching Algorithms. World Scientific’s Lecture Notes Series on Computing, 1998. [78] University of Washington’s database group. The XML Data Repository, 2002, http://www.cs.washington.edu/research/xmldatasets/. [79] W3 Consortium. Extensible Markup Language (XML) 1.0, W3C Recommendation, 10 February 1998, http://www.w3.org /TR/REC-xml. [80] W3 Consortium. 
XQuery 1.0: An XML Query Language, W3C Working Draft, 12 November 2003, http://www.w3.org/TR/xquery/. [81] W3 Consortium. XML Path Language (XPath) Version 2.0, W3C Working Draft, 15 November 2002, http://www.w3.org/TR/xpath20/. [82] W3 Consortium. XML Pointer Language (XPointer) Version 1.0, W3C Working Draft, 16 August 2002, http://www.w3.org/TR/xptr/. [83] W3 Consortium. XML Path Language (XPath) Version 1.0, W3C Recommendation, 16 November 1999, http://www.w3.org/TR/xpath/.

118

BIBLIOGRAPHY

[84] W3 Consortium. XML Schema Part 1: Structure, W3C Recommendation, 2 May 2001, http://www.w3.org/TR/xmlschema-1/. [85] W3 Consortium. XML Schema Part 2: Datatypes, W3C Recommendation, 2 May 2001, http://www.w3.org/TR/xmlschema-2/. [86] W3 Consortium. MathML, 2001, http://www.w3.org/Math/. [87] W3 Consortium. The Extensible Stylesheet Language Family (XSL) , 2004, http://www.w3.org/Style/XSL/. [88] W3 Consortium. DTD Tutorial, 2004, http://www.w3schools.com/dtd/. [89] W3 Consortium. XML Schema Tutorial, 2004, http://www.w3schools.com/schema/. [90] W3 Consortium. HyperText Markup Language (HTML) 4.01 Specification, W3C Recommendation, 24 December 1999, http://www.w3.org/TR/html4/. [91] W3 Consortium. Simple Object Access Protocol (SOAP), W3C Recommendation, 24 June 2003, http://www.w3.org/TR/soap/. [92] W3 Consortium. XML Linking Language (XLink) Version 1.0, W3C Recommendation, 27 June 2001, http://www.w3.org/TR/xlink/. [93] Niklaus Wirth. Algorithms and Data Structures. Prentice Hall, 1984. [94] Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes, Compressing and Indexing Documents and Images. Van Nostrand Reinhold, 1994. [95] xml-cml.org. Chemical Markup http://www.xml-cml.org/.

Language

(CML),

15

November

2003,

[96] Cui Yu. High-Dimensional Indexing, volume 2341 of Lecture Notes in Computer Science. Springer–Verlag, 2002. [97] Pavel Zezula, Giuseppe Amato, Franca Debole, and Fausto Rabitti. Tree Signatures for XML Querying and Navigation. In Proceedings of XML Database Symposium, XSym 2003, volume 2824 of Lecture Notes in Computer Science, pages 149–163. SpringerVerlag, 2003. [98] Pavel Zezula, Fausto Rabitti, and Paolo Tiberio. Dynamic Partitioning of Signature Files. ACM Transactions on Information Systems, 9(4):336–369, October 1991.

Author’s publications

[1] Marek Andrt, Michal Krátký, Vojtěch Svátek, and Václav Snášel: AmphoraWS – webová služba pro vyhledávání ve strukturovaných dokumentech. In Czech. Accepted at DATAKON 2004, Brno, Czech Republic.
[2] Dmitry Barashev, Michal Krátký, and Tomáš Skopal: Modern Approaches to Indexing XML Data. Transactions of the VŠB-Technical University Ostrava, Computer Science and Mathematics Series, 2(2): 19-30, ISBN 80–248–0455–7, ISSN 1213–4279, 2003.
[3] Jiří Dvorský, Michal Krátký: Multidimensional Sparse Matrix Storage. In Proceedings of the Annual International Workshop on Databases, Texts, Specifications and Objects (DATESO 2004), Desná–Černá Říčka, Czech Republic, ISBN 80-248-0457-3. Published on CEUR Workshop Proceedings, ISSN 1613–0073, http://CEUR-WS.org/Vol-98/, 2004.
[4] Jiří Dvorský, Michal Krátký, Tomáš Skopal, and Václav Snášel: Benchmarking the Multidimensional Approach for Term Indexing. In Proceedings of the Annual International Workshop on Databases, Texts, Specifications and Objects (DATESO 2003), Desná–Černá Říčka, Czech Republic, ISBN 80–248–0330-5, 2003.
[5] Jiří Dvorský, Michal Krátký, Tomáš Skopal, and Václav Snášel: Term Indexing in Information Retrieval Systems. In Proceedings of Communications in Computing 2003 (CIC 2003), CSREA Press, pages 263-270. Las Vegas, USA, 2003.
[6] Jiří Kosek, M. Krátký, and Václav Snášel: Struktura reálných XML dokumentů a metody indexování. In Czech. Accepted at ITAT 2003, High Tatras, Slovakia, 2003.
[7] Michal Krátký: Distribuovaný systém pro práci s prostorovými daty a jejich vizualizaci na WWW v prostředí CORBA. In Czech. In Proceedings of GIS 2002, Ostrava, Czech Republic, ISSN 1213-239X, 2002.
[8] Michal Krátký: Využití SVD pro indexování latentní sémantiky. In Czech. Technical Report ARG-TR-02-2002. Department of Computer Science, VŠB-Technical University of Ostrava, Czech Republic, 2002, http://www.cs.vsb.cz/arg/techreports/lsi-svd_ma.pdf.


[9] Michal Krátký and Marek Andrt: On Efficient Part-match Querying of XML Data. In Proceedings of the Annual International Workshop on Databases, Texts, Specifications and Objects (DATESO 2004). Desná–Černá Říčka, Czech Republic, ISBN 80-248-0457-3. Published on CEUR Workshop Proceedings, ISSN 1613–0073, http://CEUR-WS.org/Vol-98/, 2004.
[10] Michal Krátký, Jaroslav Pokorný, Václav Snášel, and Tomáš Skopal: Implementace os XPath ve vícerozměrném přístupu pro indexování XML dat. In Czech. In Proceedings of Znalosti 2004, Brno, Czech Republic, ISBN 80-248-0456-5, 2004.
[11] Michal Krátký, Jaroslav Pokorný and Václav Snášel: Implementation of XPath Axes in the Multi-dimensional Approach to Indexing XML Data. Accepted at International Workshop on Database Technologies for Handling XML information on the Web, DataX, Int’l Conference on Extending Database Technology (EDBT 2004), LNCS, Springer-Verlag, page count 10, Heraklion - Crete, Greece, 2004.
[12] Michal Krátký, Jaroslav Pokorný and Václav Snášel: Implementation of XPath Axes in the Multi-dimensional Approach to Indexing XML Data. In Proceedings of International Workshop on Database Technologies for Handling XML information on the Web, DataX, Int’l Conference on Extending Database Technology (EDBT 2004), page count 15, Heraklion - Crete, Greece, 2004.
[13] M. Krátký, Jaroslav Pokorný, and V. Snášel: Indexing XML data with UB-trees. In Proceedings of Advances in Databases and Information Systems, ADBIS 2002, 6th East European Conference, Bratislava, Slovakia, ISBN 80-227-1744-4, 2002.
[14] M. Krátký, Jaroslav Pokorný, Tomáš Skopal, and Václav Snášel: The Geometric Framework for Exact and Similarity Querying XML Data. In Proceedings of First EurAsian Conferences, EurAsia-ICT 2002, Shiraz, Iran, Springer–Verlag, LNCS 2510, 2002.
[15] Michal Krátký, Jaroslav Pokorný, Tomáš Skopal, and Václav Snášel: The Geometric Approach for Indexing XML Data. In Proceedings of DATAKON 2002, Brno, Czech Republic, ISBN 80-210-2958-7, 2002.
[16] Michal Krátký and Tomáš Skopal: Benchmarking the UB-tree. In Proceedings of the Annual International Workshop on Databases, Texts, Specifications and Objects (DATESO 2003), Desná–Černá Říčka, Czech Republic, ISBN 80–248–0330-5, 2003.
[17] Michal Krátký, Tomáš Skopal, and Václav Snášel: Geometrické indexování a dotazování multimediálních dat. In Czech. In Proceedings of DATAKON 2002, Brno, Czech Republic, ISBN 80–210–2958–7, 2002.
[18] Michal Krátký, Tomáš Skopal, and Václav Snášel: Vícerozměrný přístup pro netriviální vyhledávání termů. In Czech. In Proceedings of Znalosti 2003, Ostrava, Czech Republic, ISBN 80-248-0229-5, 2003.


[19] Michal Krátký, Tomáš Skopal, and Václav Snášel: Efektivní vyhledávání v kolekcích obrázků tváří. In Czech. In Proceedings of DATAKON 2003, Brno, Czech Republic, ISBN 80–248–0330–5, 2003.
[20] Michal Krátký, Tomáš Skopal, and Václav Snášel: Image Compression Using Space-Filling Curves. Accepted at ITAT 2003, High Tatras, Slovakia, 2003.
[21] Michal Krátký, Tomáš Skopal, and Václav Snášel: Multidimensional Term Indexing for Efficient Processing of Complex Queries. Kybernetika, Journal of the Academy of Sciences of the Czech Republic, 40(3):381–396, 2004.
[22] Michal Krátký, Václav Snášel, Pavel Zezula, Jaroslav Pokorný, and Tomáš Skopal: Efficient Processing of Narrow Range Queries in the R-Tree. Technical Report ARG-TR-01-2004, Amphora Research Group (ARG), Department of Computer Science, VŠB–Technical University of Ostrava, 2004, http://www.cs.vsb.cz/arg/techreports/sigrtree.pdf.
[23] Michal Krátký, Svatopluk Štolfa, Václav Snášel, and Ivo Vondrák: Vyhledávání v rozsáhlých hierarchiích tříd. In Czech. In Proceedings of Objekty 2003, VŠB–Technical University of Ostrava, Ostrava, Czech Republic, ISBN 80-248-0274-0, 2003.
[24] Pavel Moravec, Michal Krátký, and Václav Snášel: Random Projections for Dimension Reduction in Information Retrieval Systems. In Proceedings of Industrial Mathematics and Mathematical Modelling, IMAMM’03, Rožnov pod Radhoštěm, Czech Republic, 2003.
[25] Tomáš Skopal, Michal Krátký, and Václav Snášel: Metrické a semi-metrické indexování vektorových modelů pro dokumentografické informačné systémy. In Czech. In Proceedings of Znalosti 2004, Brno, Czech Republic, ISBN 80–248–0456–5, 2004.
[26] Tomáš Skopal, Michal Krátký, and Václav Snášel: Porovnání některých metod pro vyhledávání a indexování multimediálních dat. In Czech. In Proceedings of Kybernetika 2002, Žilina, Slovakia, ISBN 90–967609–7–1, 2002.
[27] Tomáš Skopal, Michal Krátký, and Václav Snášel: Properties of Space Filling Curves and Usage with UB-trees. In Proceedings of ITAT 2002, Brdo, High Fatra, ISBN 80-7097-499-0, Slovakia, 2002.
[28] Tomáš Skopal, Michal Krátký, and Václav Snášel: Efektivní implementace vektorového modelu pro dokumentografické informační systémy. In Czech. In Proceedings of DATAKON 2003, Brno, Czech Republic, ISBN 80–248–0330–5, 2003.
[29] Tomáš Skopal, Jaroslav Pokorný, Michal Krátký, and Václav Snášel: Revisiting M-tree Building Principles. In Proceedings of Advances in Databases and Information Systems (ADBIS 2003). Dresden, Germany, Springer–Verlag, LNCS 2798, 2003.


[30] Tomáš Skopal, Michal Krátký, Jaroslav Pokorný, and Václav Snášel: On Range Queries in Universal B-trees. Technical Report ARG-TR-01-2003, Department of Computer Science, VŠB–Technical University of Ostrava, Czech Republic, 2003, http://www.cs.vsb.cz/arg/techreports/range.pdf.
[31] Tomáš Skopal, Pavel Moravec, Jaroslav Pokorný, Michal Krátký, and Václav Snášel: Efficient Implementation of Vector Model in Information Retrieval. In Proceedings of the fifth National Russian Research Conference, RCDL’2003, Digital Libraries: Advanced Methods and Technologies, Digital Collections, St. Petersburg, Russia, pp. 170-179, ISBN 5-94158-056-8, 2003.
[32] Tomáš Skopal, Václav Snášel, and Michal Krátký: Image Recognition Using Finite Automata. In Proceedings of Prague Stringology Conference, PSC’02, Prague, Czech Republic, ISBN 80–01–02616–7, 2002.
[33] Tomáš Skopal, Václav Snášel, Michal Krátký, and Vojtěch Svátek: Searching Internet Using Topological Analysis of Web pages. In Proceedings of Communications in Computing 2003 (CIC 2003), CSREA Press, pages 271–277, Las Vegas, USA, 2003.
[34] Václav Snášel, Daniela Ďuráková, and Michal Krátký: Navigation through Query Result Using Concept Lattice. In Proceedings of the Annual International Workshop on Databases, Texts, Specifications and Objects (DATESO 2002), Desná-Černá Říčka, Czech Republic, ISBN 80–248–0330–5, 2002.

Index

n-dimensional signature, 84
B+-tree, 65
B-tree, 35
BUB-forest, 75, 76
BUB-forest range query, 77
BUB-tree, 35, 65
complex range query, 36
curse of dimensionality, 37
database management system (DBMS), 21, 81
disk access cost (DAC), 54, 73, 89
disk cache, 78
document order, 12
DTD, 10
Extensible Mark-up Language (XML), 7
feature transformation, 35
feature vector, 35
intersect regions, 82
labelled path, 44
labelled path index, 47
M-tree, 35
minimum bounding box (MBB), 71
minimum bounding polygon, 74
minimum bounding sphere, 74
mixed content, 45
model of XML documents, 44
multi-dimensional approach to indexing XML data, 43
multi-dimensional data structure, 35, 43
multi-dimensional forest, 75
narrow range query, 37, 82
numbering scheme, 20
path, 44
path index, 47
persistent data structure, 78
persistent tree, 65
point query, 36
point representing a labelled path, 45
point representing a path, 46
postorder, 13
preorder, 13
quality ratio of a range query algorithm, 83
query window, 36
R*-tree, 35, 73
R+-tree, 74
R-forest, 75
R-tree, 35, 71
range query, 36
range query processing with the n-dimensional signature, 84
relevant ratio, 82
relevant regions, 82
signature, 83
Signature (B)UB-tree, 81
signature methods, 83
signature multi-dimensional trees, 84
Signature R-tree, 85
STORED, 21
term as n-dimensional point, 59
term index, 46
UB-tree, 35, 65
valid XML document, 8
well formed XML document, 8
X-tree, 35
XISS, 22
XISS numbering scheme, 22
XML Linking Language (XLink), 18
XML Pointer Language (XPointer), 18
XML query languages, 13
XML schema, 11
XML tree, 12
XPath, 14
XPath axes, 14
XQuery, 16
Z-address, 65
Z-ordering, 65
Z-region, 66
