VLEI Code: An Efficient Labeling Method for Handling XML Documents in an RDB

Kazuhito Kobayashi∗†, Wenxin Liang†, Dai Kobayashi†, Akitsugu Watanabe†, Haruo Yokota†‡

† Department of Computer Science
‡ Global Scientific Information and Computing Center
Tokyo Institute of Technology, 2-12-1 O-okayama, Meguro, Tokyo, Japan
{kkobayashi, wxliang, daik, aki, yokota}@de.cs.titech.ac.jp

Abstract

A number of XML labeling methods have been proposed to store XML documents in relational databases. However, they share a weak point: insertion operations are expensive. We propose the Variable Length Endless Insertable (VLEI) code and apply it to XML labeling to reduce the cost of insertion operations. Results of our experiments indicate that a combination of the VLEI code and Dewey order is effective for handling skewed insertions.

∗ This work was done while the author was with Tokyo Institute of Technology. The author is now working at Software Division, Hitachi, Ltd.

Proceedings of the 21st International Conference on Data Engineering (ICDE 2005) 1084-4627/05 $20.00 © 2005 IEEE

1. Introduction

By storing XML documents in a relational database, many useful functions of an off-the-shelf relational database management system, such as concurrency control and transaction recovery, can easily be used for XML databases [1]. To create such databases, a number of XML node-labeling methods that capture the ancestor-descendant containment relationships of XML tags have been proposed. Two well-known labeling methods capable of handling the containment relationship are the preorder-postorder method [2] and the Dewey order method [3]. However, when integer codes are used with these methods, insertion becomes expensive: the labels of many nodes must be changed after a new node is inserted.

Several methods have been proposed to tackle this problem. SPARSE [4] leaves an interval between the numbers used for preorder-postorder labels in advance, while QRS uses floating-point numbers [5]. ORDPATH uses a pre-calculated binary code for Dewey order labels to prepare the intervals [6]. However, all these methods require relabeling of many nodes once the prepared space is used up. In this paper, we propose a new bit sequence code, the Variable Length Endless Insertable code (VLEI code for short), and apply this code to XML labeling to reduce the cost of insertions.
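To illustrate why a prepared interval eventually runs out under skewed insertion, the following is a minimal sketch of generic gap-based integer labeling. It is not the SPARSE, QRS, or ORDPATH algorithm; the function names and the midpoint strategy are assumptions made only for this illustration.

```python
def midpoint_label(left, right):
    """Return an unused integer strictly between two labels, or None if none is left."""
    if right - left < 2:
        return None  # the prepared interval is used up; relabeling would be needed
    return (left + right) // 2


def inserts_until_exhausted(left, right):
    """Count how many nodes fit when every new node goes directly after `left`."""
    count = 0
    mid = midpoint_label(left, right)
    while mid is not None:
        count += 1
        right = mid  # the next insertion lands in the same, shrinking gap
        mid = midpoint_label(left, right)
    return count


# A gap of 1024 between two integer labels absorbs only 10 insertions
# when they all hit the same position.
print(inserts_until_exhausted(0, 1024))
```

Under this assumed strategy the number of insertions a gap can absorb at one position grows only logarithmically with the gap size, which is why skewed workloads trigger relabeling quickly.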

2. VLEI Code

Definition: A bit sequence v = 1 · {0|1}* is a VLEI code iff it satisfies the following condition:

    v · 0 · {0|1}* < v < v · 1 · {0|1}*

For instance, 10 < 1 < 11 and 100 < 10 < 101 < 1 < 110 < 11 < 111.

For any two existing adjacent VLEI codes vl and vr (vl < vr), a new code vi can always be allocated between them without modifying already allocated codes by the following InsertCode algorithm, where l(v) is defined as the bit length of a VLEI code v:

    InsertCode(vl, vr)
        if l(vl) ≤ l(vr) then vi = vr · 0
        else vi = vl · 1
        return vi

The correctness of the algorithm is proved in [7].
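The ordering and the InsertCode algorithm can be sketched in Python as follows. Codes are represented as strings of "0" and "1"; this representation and the helper `vlei_key` are assumptions for illustration, not part of the paper. The key maps each 0 bit to 0, each 1 bit to 2, and appends 1 as an end marker, so that ordinary tuple comparison places v·0·… before v and v before v·1·….

```python
def vlei_key(code):
    """Sort key that reproduces the VLEI order for a bit string starting with '1'."""
    if not code or code[0] != "1" or any(b not in "01" for b in code):
        raise ValueError("a VLEI code is a bit string that starts with 1")
    # 0 bit -> 0, end of code -> 1, 1 bit -> 2
    return tuple(0 if b == "0" else 2 for b in code) + (1,)


def insert_code(vl, vr):
    """InsertCode from the text: return a code strictly between adjacent vl < vr."""
    if len(vl) <= len(vr):
        return vr + "0"
    return vl + "1"


# The example ordering from the definition.
codes = ["111", "1", "100", "11", "101", "10", "110"]
print(sorted(codes, key=vlei_key))

# Repeated insertion at the same place never touches existing codes.
left, right = "1", "11"
for _ in range(5):
    new = insert_code(left, right)
    assert vlei_key(left) < vlei_key(new) < vlei_key(right)
    right = new
print(right)
```

Each insertion only lengthens the new code by one bit, so existing labels stay valid; the cost is that codes grow under heavily skewed insertion.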

3. XML Labeling with the VLEI Code

It is not difficult to apply the VLEI code to the preorder-postorder method. We developed a packing method for storing the VLEI code in a fixed-length integer [7].

A label in the Dewey order method, on the other hand, is composed of identifiers (usually “.”) and numbers describing the order of siblings, for example “12.5.6”. We can apply the VLEI code to the numbers: when a new XML node is inserted, a new VLEI code for the node is allocated by the InsertCode algorithm. However, the straightforward representation, which combines binary VLEI codes with the character code for “.”, requires a large amount of storage in the database. To reduce the space needed, we propose an efficient Dewey order representation with octal VLEI codes, using “9” as the identifier. When a child node is inserted, the label for the node becomes “the label of its parent node” + “9” + “octal VLEI code”. We call this the suffixed octal VLEI code.
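The label composition can be sketched as follows. This is a minimal illustration under assumptions not stated here: octal VLEI codes are treated as opaque digit strings over 0 to 7 (how binary VLEI codes are turned into octal digits is left to [7]), the whole label is read as a decimal number for storage, and BIGINT is taken to be a signed 64-bit integer. The names `child_label`, `split_label`, and `fits_bigint` are hypothetical.

```python
BIGINT_MAX = 2**63 - 1  # assumed: signed 64-bit BIGINT


def child_label(parent_label, octal_code):
    """Build a child label as parent label + '9' + octal VLEI code."""
    if not octal_code or any(d not in "01234567" for d in octal_code):
        raise ValueError("an octal code may only contain the digits 0-7")
    return parent_label + "9" + octal_code


def split_label(label):
    """Recover the per-level codes; '9' never occurs inside an octal code."""
    return label.split("9")


def fits_bigint(label):
    """Check whether the label, read as a decimal number, fits in BIGINT."""
    return int(label) <= BIGINT_MAX


root = "1"                      # hypothetical root label
child = child_label(root, "2")  # "192"
grandchild = child_label(child, "4")
print(grandchild, split_label(grandchild), fits_bigint(grandchild))
```

Because the digit 9 cannot appear in an octal code, it can serve as the level separator without a separate character code.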

4. Evaluation

To evaluate retrieval and insertion performance, we stored XML documents using three labeling methods: SPARSE [4], QRS [5], and Dewey order with the suffixed octal VLEI code. SPARSE stores the preorder and postorder numbers as INT (four bytes each), while QRS stores them as FLOAT (four bytes each). Both techniques therefore use eight bytes for each label. To make the space usage the same, we use BIGINT (eight bytes) for the suffixed octal VLEI code. Note that the Dewey order with SPARSE or QRS cannot be stored in such a limited space. The following table shows the experimental environment.

    CPU      Celeron 1.7 GHz
    Memory   PC-2100 512 MB
    DBMS     MySQL 4.0.16-nt
    HDD      Seagate ST340016A
    OS       Windows XP Professional
    P. Env   Sun JDK 1.4.2

We measured the execution time of several XPath queries on hamlet.xml from [8]. The following table shows the average execution time over 100 executions.

    Query                                        SPARSE      QRS         VLEI
    Q1: /PLAY//LINE                              137 ms      179 ms      104 ms
    Q2: /PLAY/ACT/SCENE/SPEECH/LINE/STAGEDIR     11,885 ms   13,099 ms   16,577 ms
    Q3: LINE[STAGEDIR="Aside"]                   112 ms      109 ms      148 ms

The results indicate that there is little difference between the three methods in terms of retrieval performance.

We then inserted artificial elements into hamlet.xml as represented by each of the three methods. We used the Zipf distribution [9] (θ = 0, 0.5, 1) to select the places at which to insert elements. If an overflow occurred in some number, all labels were renumbered. The following table shows the execution times for 1000 insertions.

    Skew parameter   SPARSE         QRS            VLEI
    θ = 0            50,235 ms      51,875 ms      25,340 ms
    θ = 0.5          50,241 ms      51,588 ms      27,078 ms
    θ = 1            4,343,493 ms   3,757,849 ms   818,652 ms
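A skewed choice of insertion places can be sketched as below. The parameterization is an assumption: each candidate position of rank r gets weight 1/r^θ, so θ = 0 is uniform and larger θ concentrates insertions on the first few positions. How ranks were mapped to document positions in the experiment is not stated here, and the function names are hypothetical.

```python
import random


def zipf_weights(n, theta):
    """Unnormalized Zipf weights 1 / rank**theta for ranks 1..n."""
    return [1.0 / (rank ** theta) for rank in range(1, n + 1)]


def pick_insert_positions(n_positions, theta, n_inserts, seed=0):
    """Draw insertion positions (0-based ranks) with Zipf-like skew."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    weights = zipf_weights(n_positions, theta)
    return rng.choices(range(n_positions), weights=weights, k=n_inserts)


for theta in (0, 0.5, 1):
    picks = pick_insert_positions(10, theta, 1000)
    # Share of insertions that hit the most popular position.
    print(theta, picks.count(0) / len(picks))
```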

The results indicate that the insertion performance of the VLEI code is far superior to that of the other methods, especially in highly skewed situations. For instance, the VLEI code is more than four times faster than either of the other two methods when θ = 1. This difference is caused by the reduction in renumbering. More detailed evaluations are reported in [7].
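The speedups implied by the insertion table can be checked with a few lines of arithmetic. The times are copied from the table; the dictionary layout and the function name `speedup_over_vlei` are only for this illustration.

```python
# Execution times for 1000 insertions, in milliseconds, keyed by skew parameter.
times_ms = {
    "SPARSE": {0.0: 50235, 0.5: 50241, 1.0: 4343493},
    "QRS": {0.0: 51875, 0.5: 51588, 1.0: 3757849},
    "VLEI": {0.0: 25340, 0.5: 27078, 1.0: 818652},
}


def speedup_over_vlei(method, theta):
    """How many times longer `method` takes than VLEI at the given skew."""
    return times_ms[method][theta] / times_ms["VLEI"][theta]


for theta in (0.0, 0.5, 1.0):
    print(
        theta,
        round(speedup_over_vlei("SPARSE", theta), 2),
        round(speedup_over_vlei("QRS", theta), 2),
    )
```

With these numbers the ratio is roughly two at θ = 0 and θ = 0.5, and between about 4.6 (QRS) and 5.3 (SPARSE) at θ = 1.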

5. Conclusions and Future Work

In this paper, we proposed the VLEI code, which can allocate any number of new codes without modifying already allocated codes, and applied it to XML labeling. We also proposed the suffixed octal VLEI code to apply the VLEI code to the Dewey order efficiently.


We then compared the retrieval and insertion performance of Dewey order codes using the suffixed octal VLEI code with that of the SPARSE and QRS methods, with all three consuming the same amount of storage. The experimental results indicate that insertion with the proposed code is more than four times faster than with the others in highly skewed situations, while there is little difference between them in retrieval speed. In the experiments, we used only artificial data for insertion operations. In the future, we plan to use an XML benchmark such as [10] to evaluate the VLEI code in actual use.

Acknowledgement

We thank Dr. Toshiyuki Amagasa of NAIST for his advice on applying the VLEI code to the Dewey order. This work is partially supported by a Grant-in-Aid for Scientific Research of MEXT Japan (#16016232), by CREST of JST, and by the TokyoTech 21COE Program “Framework for Systematization and Application of Large-Scale Knowledge Resources”.

References

[1] Igor Tatarinov, Stratis Viglas, Kevin S. Beyer, Jayavel Shanmugasundaram, Eugene J. Shekita, and Chun Zhang. Storing and querying ordered XML using a relational database system. In Proc. of SIGMOD Conf., pages 204–215, 2002.
[2] Paul F. Dietz. Maintaining Order in a Linked List. In Proc. of the 14th Annual ACM Symposium on Theory of Computing, pages 122–127, 1982.
[3] Online Computer Library Center. Introduction to the Dewey Decimal Classification. http://www.oclc.org/oclc/fp/about/about the ddc.htm.
[4] Takeharu Eda, Toshiyuki Amagasa, Masatoshi Yoshikawa, and Shunsuke Uemura. A Robust Node-Labeling Scheme for XML Trees. DBSJ Letters, 1(1):35–38, 2002.
[5] Toshiyuki Amagasa, Masatoshi Yoshikawa, and Shunsuke Uemura. QRS: A Robust Numbering Scheme for XML Documents. In Proc. of 19th International Conference on Data Engineering (ICDE 2003), pages 705–707, 2003.
[6] P. O’Neil, E. O’Neil, S. Pal, I. Cseri, and G. Schaller. ORDPATHs: Insert-friendly XML node labels. In Proc. of SIGMOD Conf. 2004, pages 903–908, 2004.
[7] Kazuhito Kobayashi, Wenxin Liang, Dai Kobayashi, Akitsugu Watanabe, and Haruo Yokota. Update Conscious XML Labeling Methods using Dedicated Codes. Technical Report TR04-0006, CS Dep., Tokyo Inst. of Tech., 2004.
[8] Jon Bosak. The Plays of Shakespeare in XML. http://www.oasis-open.org/cover/bosakShakespeare200.html.
[9] William J. Reed. The Pareto, Zipf and other power laws. http://linkage.rockefeller.edu/wli/zipf/reed01 el.pdf.
[10] Kanda Runapongsa, Jignesh M. Patel, H. V. Jagadish, Yun Chen, and Shurug Al-Khalifa. The Michigan benchmark: Towards XML query performance diagnostics. In Proceedings of the 29th VLDB Conference, 2003.