Reconstructing XML Subtrees from Relational ... - Semantic Scholar

0 downloads 0 Views 317KB Size Report
maximum ID of e's descendants. For a shared node, parent-. Type is introduced to indicate the parent type of an element. A column (e or f) is introduced for each ...
Reconstructing XML Subtrees from Relational Storage of XML documents Artem Chebotko, Dapeng Liu, Mustafa Atay, Shiyong Lu and Farshad Fotouhi Wayne State University Department of Computer Science 5143 Cass Avenue, Detroit, Michigan 48202, USA {artem, dliu, matay, shiyong, fotouhi}@cs.wayne.edu Abstract

nent of such systems and its performance has great impact on the query response time.

Numerous researchers have proposed to use relational databases to store and query XML documents. One important component of such systems is the XML subtree reconstruction, which reconstructs the subtrees rooted at the matching nodes of an XML query and returns them to the user as the query result. Existing reconstruction algorithms either do not support recursive XML view schema, or require expensive nested queries or joins of multiple relations. In this paper, we propose an efficient XML subtree reconstruction algorithm, Reconstruct, which overcomes these limitations and uses an efficient stack-based structural join algorithm to recover all the parent-child relationships between elements. One salient advantage of this algorithm is that it employs the inlining feature of the inlining-based storage of XML documents, which is known as one of the best relational XML storage schemes. Both our algorithmic analysis and experimental study show that Reconstruct is efficient and scalable.

While existing work on XML publishing [19, 18, 5, 6, 7, 3] focuses on presenting existing relational data in XML format, these algorithms are not directly applicable to the XML subtree reconstruction problem we pose here for the following reasons: 1) the output XML view is defined by an application in XML publishing, while in XML subtree reconstruction, the structure of the output XML subtrees must conform to the structure imposed by the original input XML schema or the input XML documents, 2) they only support non-recursive XML view schema while an input XML schema might be recursive or the input XML documents might contain element e under another element e, 3) they do not exploit the inlining feature of the relational storage of XML documents, where the main idea of inlining is to store an element f in the same tuple as its parent e if e contains at most one f . The inlining approach has been proved extremely effective in storing and quering XML documents in relational databases [21, 13], 4) they use expensive nested queries and joins of multiple relations.

1. Introduction Numerous researchers have proposed to use relational databases to store and query XML documents [4, 8, 16, 21, 12, 14, 2, 13]. In these systems, the elements selected by an XML query are returned to an application in one of the following two modes [23]: 1) Select mode: the unique identifiers (IDs) of the selected elements are returned to the application, which can extract the contents of these elements if necessary, or 2) Reconstruct mode: the XML subtrees rooted at the selected elements are extracted and reconstructed from the storage of XML documents and returned to the application. One constraint is that these XML subtrees must conform to the structure imposed by the original input XML schema or the input XML documents. Therefore, XML subtree reconstruction is one important compo-

While some authors briefly discussed the XML subtree reconstruction in various contexts [11], no algorithmic details were provided. Similarly, [23, 24, 8] only presented some experimental results on XML subtree reconstruction without the algorithmic details. Although [20] did present an algorithm for generating reconstruction XML view, the reconstruction relies on an XML query engine that is already in place: the output XML view is specified in terms of an XML query and the reconstruction is performed by executing the XML query against a default XML view of the underlying relational data over the XML query engine. The assumption of the existence of such an XML query engine, only pushes the problem of XML subtree reconstruction inside the query engine, with the immediate limitation that the performance of the reconstruction will be highly dependant on the performance of the XML query engine. In addition, as the XML query engine is one major goal of our system, we cannot assume the existence of such an XML query en-

gine which internally supports XML reconstruction to implement XML reconstruction to avoid the egg and chicken problem. Recently, [11] proposes a strategy to deal with recursive XML view schemas in reconstructing XML subtrees, unfortunately, it needs to construct the root-to-leaf path dynamically and reconstruction involves joins of multiple relations. In addition, the inlining feature of the relational storage of XML documents is not exploited. The main contributions of this paper are: 1. We propose an efficient XML subtree reconstruction algorithm, Reconstruct, which overcomes the limitations of existing algorithms and uses an efficient stack-based structural join algorithm to recover all the parent-child relationships between elements. One salient advantage of this algorithm is that it employs the inlining feature of the inlining-based storage of XML documents, which is known as one of the best relational XML storage schemes. 2. We present our algorithmic analysis to show that our Reconstruct algorithm is both correct and efficient with the time complexity of O(n+ne log2 ne ), where ne and na are the numbers of elements and attributes respectively in the output XML subtree, and n = ne + na . 3. We conduct our experimental study to show that the proposed algorithm Reconstruct is efficient and scalable. Organization. The rest of the paper is organized as follows. In section 2, we discuss related work, and in Section 3, we give a brief overview of our schema mapping algorithm. In Sections 4, 5, and 6, we present XML subtree reconstruction algorithm Reconstruct, analyze its time complexity, prove its correctness, and provide experimental study results. Finally, we present our conclusions and future work in Section 7.

2. Related Work Numerous researchers have proposed to use relational databases to store and query XML documents [4, 8, 16, 21, 12, 14, 2, 13]. The main challenge of this approach is that, one needs to resolve the conflict between the hierarchical, ordered nature of the XML data model and the flat, unordered nature of the relational data model. This conflict can be resolved by the following three XML-to-Relational mappings: • Schema mapping. Either a fixed generic relational schema (Schema-oblivious XML storage [4, 8, 16, 9]) is used, or a relational schema is generated from an XML schema or DTD (Schema-based XML storage [21, 12, 2, 14, 13]) for the storage of XML documents. To support the ordered nature of the XML

data model, an element order encoding scheme such as those proposed in [23] can be used and additional columns are introduced to store the ordinals of XML elements. • Data mapping, which shreds an input XML document into relational tuples and inserts them into the relational database [1]. • Query mapping, which translates an XML query into SQL queries, executes them against the database and returns the query result to the user [11, 3, 6, 10, 15, 18, 20]. If the query result is to be returned as XML documents, then a reconstruction algorithm is needed to reconstruct the XML subtrees rooted at the matching nodes. Although XML publishing [19, 18, 5, 6, 7, 3] is very relevant to the XML subtree reconstruction posed in this paper, there is a fundamental difference between these two problems: XML publishing focuses on presenting existing relational data in XML format, hence, there is no direct correspondance between the relational schema and the output XML view schema. However, XML subtree reconstruction is conducted over relational data which is the storage of XML documents. The relational schema is generated from the input XML schema, and it is required that the reconstructed XML documents must conform to the input XML schema in structure.

3. Relational Storage of XML documents In this section, we give an overview of our schemabased relational storage of XML documents. In particular, we present a schema mapping algorithm, ODTDMap, which generates a relational schema from an XML DTD for storing and querying ordered XML documents. Although our proposed XML subtree reconstruction algorithm is described in this context, it can be easily adapted to other relational storage schemes of XML documents. First, to facilitate the processing of the input XML DTD1 , we model it with a DTD graph, in which nodes represent element or attribute types, and edges represent parent-child relationships. In addition, for each edge from node e to f in the graph, if e can contain at most one f , then this edge will not be labeled; otherwise, it will be labeled by a *. For example, given the DTD in Figure 1, the corresponding DTD graph is shown in Figure 2 where attribute labels are preceded by a @2 . To 1 2

Since the input DTD might be very complex due to its hierarchical nesting nature, the simplification step described in [13] will be applied first to simplify the input DTD. In implementation, to ensure the uniqueness of attribute labels, we can use the concatenation of an attribute name and its owner element name as the attribute label. For example, attribute id of element book can have label book.id.

xbib(ID, endID) book(ID, parentID, endID, book.id, book.year, book.publisher, title.ID, title.endID, title, citation.ID, citation.endID) author(ID, parentID, endID, fname.ID, fname.endID, fname, lname.endID, lname) chapter(ID, parentID, endID) section(ID, parentID, endID, parentType) paragraph(ID, parentID, endID, parentType, paragraph) cites(ID, parentID, cites)



Figure 4. Generated relational schema • Inlinable node, a node with exactly one incoming edge and that edge is not labeled by a *. For example, id, year, publisher, title, and citation are inlinable nodes of the book element in the DTD graph shown in Figure 2.

Figure 1. Sample XML DTD

xbib * book *

* @id

@year @publisher

title

author

citation

chapter

*

* fname

*

lname

@cites

section

* paragraph

*

Figure 2. DTD graph deal with a set-valued attribute, the edge between the attribute and its parent (the owner element) is labeled by a *. In Figure 2, cites is a set-valued attribute. Second, we classify the nodes in the DTD graph into three categories: • Root node, a node without incoming edges. For example, xbib is the root node of the DTD graph shown in Figure 2. • Shared node, a node with more than one incoming edge. For example, paragraph is a shared node of the DTD graph shown in Figure 2. xbib

* book (@id, @year, @publisher, title, citation)

author (fname , lname )

*

*

*

chapter

@cites

* *

section

*

*

paragraph

Figure 3. Inlined DTD graph

lname.ID,

An element node is called a leaf element node if it has no child element nodes. We then inline each hierarchy of inlinable nodes to its root which is not inlinable. The resulting graph is called an inlined DTD graph. For example, the DTD graph shown in Figure 2 will be transformed into the inlined DTD graph shown in Figure 3. Finally, for each node e in the inlined DTD graph, a relation e is generated with the following relational attributes, where e.inlinedSet represents the set of nodes inlined into node e: • ID, the primary key. • parentID, if e is a non-root node, • endID, if e is an element node. • parentT ype, if e is a shared node. • e, if e is a leaf element node. • f.ID, for each element f ∈ e.inlinedSet. • f.endID, for each element f ∈ e.inlinedSet. • f , for each f ∈ e.inlinedSet such that f is a leaf element or an attribute node. Basically, in the generated relational schema, we associate each element (including the inlinable elements) with a unique ID (ID or f.ID), which uses the global order encoding proposed in [23]. However, our algorithm can be easily adapted to other order encodings [23]. Attribute parentID is introduced for each non-inlinable element to preserve the tree structure of an XML document. To facilitate the processing of recursive XML queries (queries with // axis), each element e is associated with an endID, which stores the maximum ID of e’s descendants. For a shared node, parentType is introduced to indicate the parent type of an element. A column (e or f ) is introduced for each leaf element or attribute to store its string value. For example, given the inlined DTD graph shown in Figure 3, the relational schema shown in Figure 4 is generated.

01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38

Algorithm Reconstruct Input: element e with e.ID, e.endID and e.name have been initialized Output: return the XML document rooted at element e Begin Element list elist = ReturnDescendants(e) String xmlstr = “” Stack S = EmptyStack() root = elist.next() xmlstr.append(“”) S.push(root) // reconstruct the descendants of root e = elist.next() While e is not NULL do If e.endID = e.ID AND ID = e.ID AND ID S[top].endID, then no element in elist following e is a child of S[top]. Proof: Suppose there is an element f following e in elist that is a child of S[top], then all the elements between S[top] and f in elist including e must be descendants of S[top]. Therefore, we have e.endID ≤ S[top].endID. This contradicts the condition of the lemma. Hence, no element in elist following e is a child of S[top]. Theorem 5.4 Algorithm Reconstruct returns the XML subtree rooted at element e. Proof: [Sketch] First, we call algorithm ReturnDescendants to retrieve the structure-encoded sequence elist of the to-be-reconstructed XML subtree. Second, we output the opening tag of the first element of elist, which is the root element, and push it into the stack. Third, for each next element e from elist, if e.endID ≤ S[top].endID, then according to Lemma 5.2, e must be a child of S[top], we output its opening tag and attributes and push it into the stack.

As a result, the stack contains the path from the root to the current element. Otherwise (e.endID > S[top].endID), according to Lemma 5.3, all the children of S[top] have been output. We then output the value and closing tag of the top element of the stack and pop the stack. Finally, after all elements in elist are processed, lines 33-35 output a value and a closing tag for each element in the stack and pop it. Since Reconstruct correctly establishes parent-child relationships and respects the order of the structure-encoded sequence, thus, it will output correctly all the children of an element in the document order. Therefore, Reconstruct correctly returns the XML subtree rooted at element e. Finally, Theorem 5.5 and Theorem 5.6 state the time complexity of algorithms ReturnDescendants and Reconstruct. Theorem 5.5 The time complexity of algorithm ReturnDescendants is O(n + ne log2 ne ), where ne and na are the numbers of elements and attributes respectively in elist, the output of ReturnDescendants, and n = ne + na . Proof: [Sketch] It follows from the fact that we retrieve each descendant element/attribute of e from the database in constant time and the sorting of elist in the order of id takes time O(ne log2 ne ). Theorem 5.6 The time complexity of algorithm Reconstruct is O(n + ne log2 ne ), where ne and na are the numbers of elements and attributes respectively in the output XML subtree, and n = ne + na . Proof: [Sketch] Calling ReturnDescendants takes time O(n + ne log2 ne ). The algorithm processes each element in the output XML subtree in constant time: (1) output opening tag and push the element to a stack; 2) pop the element from the stack and output its closing tag. Similarly, the algorithm processes each attribute in constant time. Therefore, the time complexity of algorithm Reconstruct is O(n + ne log2 ne ).

6. Experimental Study For our experimental study, we implemented algorithm Reconstruct and three versions of algorithm ReturnDescendants: 1. N aive: naively retrieves each descendant element from the database with a separate query. 2. Server − side − ReturnDescendants(S − RD): retrieves all descendant elements in ID order with a few queries according to Figure 6 (inlined elements are retrieved in the same query of their non-inlinable parents) and stores them in a database server in-memory buffer, which serves as elist described in the algorithm. Client software stores only a cursor to access elist elements.

3. Client − side − ReturnDescendants(C − RD): retrieves all descendant elements in ID order with a few queries according to Figure 6 (inlined elements are retrieved in the same query of their non-inlinable parents) and stores them in a client in-memory buffer, which serves as elist described in the algorithm. The above algorithms were coded using Java 1.4.1 software development kit, Personal Oracle Database 10g Release 10.1 was used as an XML storage. Experiments were run on the computer with CPU Pentium IV 2.4 GHz and RAM 512 MB operated by Windows XP Professional. For each experiment, we performed the reconstruction of the whole XML document for 6 times and computed the average of last 5 runs ignoring the first run. In all experiments, we reconstructed XML data in memory and did not output it into a file or on a screen to avoid unnecessary I/O operations. Further in this section, when we refer to N aive, S − RD or C −RD as a version of ReturnDescendants, the reader should assume it is used in conjunction with Reconstruct, which performs actual XML document reconstruction.

6.1. Comparison of Naive, S-RD and C-RD Using the XMark benchmark [26, 17], we generated a set of XML documents, which conform to the auction.dtd, and stored them in the database. To evaluate scalability and performance of our algorithms, we reconstructed XML documents of size 5, 10, 15, 20, 25, and 30 MB. Experimental results for N aive, S − RD and C − RD are given in Figure 9 (time axis has a logarithmic scale). Based on the performance, implementations are ranked from the best to the worst as follows: S − RD, C − RD and N aive. Although S − RD and C − RD are algorithmically similar, C − RD is slower, because it additionally performs an allocation of a client-side in-memory buffer to copy data from the server to the client. N aive is the slowest one, as it runs a lot of queries, equal to the number of elements in an XML document. We did not consider N aive in our further experiments, because it is too slow to compete with S − RD and C − RD.

6.2. S-RD and C-RD performance on different data sets. To evaluate the performance of S − RD and C − RD on different data sets, we ran our program on 500KB XML documents that conform to auction.dtd [26, 17], article.dtd [25] and sigmodrecord.dtd [22]. Table 1 shows important statistical information about these data sets, such as the number of elements in DTD documents, the number of generated relational tables, the number of elements in XML documents, and the number of tuples in

Figure 9. Comparison of Naive, S-RD and CRD.

tables. Both auction.dtd and article.dtd are quite complex and contain recursive definitions, sigmodrecord.dtd is simple and enables intensive inlining.

Auction Article SigmodRecord

DTD elements 77 26 11

Tables 26 8 4

XML elements 6639 1796 11526

Tuples 4126 1738 5309

Table 1. The data set statistics

The histogram given in Figure 10 shows the experimental results for S−RD and C −RD performance on different data sets. The best performance is achieved on the XML instance article.xml, which conforms to article.dtd. Even though article.xml reveals that only 1796 − 1738 = 58 XML elements were inlined, article.dtd allows inlining of 26 − 8 = 18 distinct DTD elements, which decreases the number of tables and queries, making S − RD and C − RD run faster. Additionally, the number of XML elements in article.xml is smaller than in the other data sets. For the second winner sigmodrecord.xml, the number of tables is small (4), however, the number of XML elements is large (11526), which decreases the performance. Finally, auction.xml contains many elements (6639) which are stored in many tables (26), so that our algorithms take the most time in this case. In general, our algorithms run faster when: 1. the number of inlined DTD elements is larger, which results in the smaller number of relational tables and queries; 2. the number of XML elements is smaller, which results in the smaller number of elements in elist and faster sorting and reconstruction.

Figure 10. S-RD and C-RD performance on different data sets and indexes.

6.3. Effect of database indexes on S-RD and C-RD performance. All experiments, discussed above, were run on tables indexed by element ID. We also studied the performance of S − RD and C − RD on data without an index and with the multi-column index on ID and endID. Comparison of these three approaches is illustrated in Figure 10. Experiments on XML documents of size 5, 10, and 15 MB that conform auction.dtd showed the same tendency – the best performance with the multi-column index, medium performance with the index on ID, and the worst performance without an index. However, the advantage of the multi-column index over the index on ID (and non-index approach) is numerically small. For example, S − RD ran faster on data with the multi-column index than on the same data with the index on ID in approximately 0.2, 0.35, and 0.5 seconds for 5, 10 and 15 MB respectively, which is equivalent to 2-3% of the total running time. Further study should be conducted on files of much larger sizes. In summary, our Reconstruct and ReturnDescendants (S − RD and C − RD) implementations are efficient and scale well with the size of XML documents. Their performance highly depend on intensity of inlining and the quantity of elements in an XML instance. In particular, the performance increases when the number of inlined DTD elements increases and the number of XML elements decreases. Preliminary experimental results on indexing techniques imply an advantage of the multi-column index on ID and endID over the other approaches (no index, the index on ID), however, further study should be conducted on files of much larger sizes.

7. Conclusions and Future Work We presented algorithm Reconstruct to reconstruct XML subtrees, rooted at an arbitrary node, from a relational storage of XML documents. We showed that this algorithm is applicable in conjunction with ODTDMap, however, it can be easily adapted for usage with other schema mapping algorithms. Reconstruct has the following advantages over existing algorithms. It can reconstruct XML documents with recursive schema and does not require the user to specify XML view manually. It is efficient in terms of time complexity – order of nlogn, number of relational joins – it uses stackbased structural join, and number of SQL queries – it exploits the inlining feature of the inlining-based storage of XML documents. Reconstruct is formally proven to be correct. The extensive experimental study showed that Reconstruct and ReturnDescendant implementations are efficient and well-scalable. Our ongoing work includes: (1) the development of a generic reconstruction algorithm that can be used across various XML database platforms; (2) the improvement of the reconstruction algorithm so that it is linear rather than logarithmic.

References [1] M. Atay, D. Liu, Y. Sun, S. Lu, and F. Fotouhi. Mapping XML data to relational data: A DOM-based approach. In 8th IASTED International Conference on Internet and Multimedia Systems and Applications, pages 59–64, Kauai, Hawaii, August 2004. [2] P. Bohannon, J. Freire, P. Roy, and J. Simeon. From XML schema to relations: A cost-based approach to XML storage. In ICDE, 2002. [3] P. Bohannon, H. Korth, P. Narayan, S. Ganguly, and P. Shenoy. Optimizing view queries in ROLEX to support navigable tree results. In VLDB Conference, 2002. [4] A. Deutsch, M. F. Fernandez, and D. Suciu. Storing semistructured data with STORED. SIGMOD Conference, pages 431–442, 1999. [5] M. F. Fernandez, Y. Kadiyska, D. Suciu, A. Morishima, and W. C. Tan. SilkRoute: a framework for publishing relational data in XML. TODS, 27(4):438–493, 2002. [6] M. F. Fernandez, A. Morishima, and D. Suciu. Efficient evaluation of XML middle-ware queries. In SIGMOD Conference, 2002. [7] M. F. Fernandez, A. Morishima, and W. C. Tan. SilkRoute: trading between relations and XML. In WWW Conference, 2000. [8] D. Florescu and D. Kossmann. Storing and querying XML data using an RDBMS. IEEE Data Engineering Bulletin, 22(3):27–34, 1999. [9] T. Grust. Accelerating XPath location steps. In SIGMOD Conference, pages 109–120, Madison, Wisconsin, USA, June 2002.

[10] S. Jain, R. Mahajan, and D. Suciu. Translating XSLT programs to efficient SQL queries. In WWW, 2002. [11] R. Krishnamurthy, V. T. Chakaravarthy, R. Kaushik, and J. F. Naughton. Recursive XML schemas, recursive XML queries, and relational storage: XML-to-SQL query translation. In Proc. of the 20th International Conference on Data Engineering, pages 42–53, Boston, Massachusetts, USA, March 2004. [12] D. Lee and W. Chu. Constraints-preserving transformation from XML document type definition to relational schema. In ER, 2000. [13] S. Lu, Y. Sun, M. Atay, and F. Fotouhi. A new inlining algorithm for mapping XML DTDs to relational schemas. In Proc. of the First International Workshop on XML Schema and Data Management, in conjunction with the 22nd ACM International Conference on Conceptual Modeling (ER2003), Chicago, Illinois, October 2003. [14] M. Mani and D. Lee. XML to relational conversion using theory of regular tree grammars. In VLDB Workshop on EEXTT, 2002. [15] I. Manolescu, D. Florescu, and D. Kossman. Answering XML queries over heterogeneous data sources. In VLDB, 2001. [16] A. Schmidt, M. Kersten, M. Windhouwer, and F. Waas. Efficient relational storage and retrieval of XML documents. In WebDB, 2000. [17] A. Schmidt, F. Waas, M. L. Kersten, M. J. Carey, I. Manolescu, , and R. Busse. XMark: a benchmark for XML data management. In VLDB, pages 974–985, 2002. [18] J. Shanmugasundaram, J. Kiernan, E. Shekita, C. Fan, and J. Funderburk. Querying XML views of relational data. In VLDB Conference, pages 204–215, 2001. [19] J. Shanmugasundaram, E. Shekita, R. Barr, M. Carey, B. Lindsay, H. Pirahesh, and B. Reinwald. Efficiently publishing relational data as XML documents. In VLDB Conference, 2000. [20] J. Shanmugasundaram, E. Shekita, J. Kiernan, R. Krishnamurthy, E. Viglas, J. Naughton, and I. Tatarinov. A general technique for querying XML documents using a relational database system. SIGMOD Record, 30(3):261–270, 2001. [21] J. Shanmugasundaram, K. Tufte, C. Zhang, G. He, D. J. DeWitt, and J. F. Naughton. Relational databases for querying XML documents: Limitations and opportunities. In VLDB, pages 302–314, 1999. [22] Sigmod Record: XML Edition. http://www.dia.uniroma3.it/Araneus/Sigmod/Record/DTD/. [23] I. Tatarinov, S. Viglas, K. S. Beyer, J. Shanmugasundaram, E. J. Shekita, and C. Zhang. Storing and querying ordered XML using a relational database system. In SIGMOD Conference, pages 204–215, 2002. [24] F. Tian, D. J. DeWitt, J. Chen, and C. Zhang. The design and performance evaluation of alternative XML storage strategies. SIGMOD Record, 31(1):5–10, 2002. [25] XBench - A Family of Benchmarks for XML DBMSs. http://db.uwaterloo.ca/∼ddbms/projects/xbench/. [26] XMark: an XML benchmark project. http://www.xmlbenchmark.org.