Fast transform for effective XML compression

Przemyslaw Skibinski, Szymon Grabowski, Jakub Swacha

Przemyslaw Skibinski – Uniwersytet Wroclawski, Instytut Informatyki, ul. Joliot-Curie 15, 50-383 Wroclaw, POLAND. E-mail: [email protected].
Szymon Grabowski – Politechnika Lodzka, Katedra Informatyki Stosowanej, al. Politechniki 11, 90-924 Lodz, POLAND. E-mail: [email protected].
Jakub Swacha – Uniwersytet Szczecinski, Instytut Informatyki w Zarzadzaniu, ul. Mickiewicza 64, 71-101 Szczecin, POLAND. E-mail: [email protected].

Abstract – The main drawback of the XML format is its verbosity, a key problem especially in the case of large documents. Therefore, efficient encoding of XML constitutes an important research issue. In this work, we describe a preprocessing transform meant to be used with popular LZ77-style compressors. We show experimentally that our transform, albeit quite simple, leads to better compression ratios than existing XML-aware compressors. Moreover, it offers high decoding speed, which is often of the utmost priority.

Keywords – XML compression, text transform.
I. INTRODUCTION

The Extensible Markup Language (XML) is one of the most important formats for data interchange on the Internet. The chief benefit of XML is its extreme simplicity and flexibility. Thanks to adopting only a few simple rules for organizing data, the format is extremely useful and portable. XML is a metalanguage: the set of tags used for marking up the data is chosen by the author of a given document. In this way, various entities from the real world can be described naturally with XML tag names. The main disadvantage of XML documents is their large size, caused by the highly repetitive (sub)structures of those documents and the often long tag and attribute names. Therefore, the need to compress XML, both efficiently and conveniently, was identified early on as a pressing research issue in the scientific community.
Apparent disappointment with the slow progress of universal compression in recent years has directed many researchers and practitioners towards specialized compression. A common approach to specialized compression is to preprocess a file of a given type and then submit it to a general-purpose compressor. Preprocessing ideas usually exploit specific features of text, record-aligned data, executable files, and XML. Nowadays, numerous real-world compressors and archivers make use of data-specific “tricks”, especially for compressing text and executables, with gains in many cases on the order of 5–10% for those data types; see the frequently updated MaximumCompression site (http://www.maximumcompression.com).
II. REVIEW OF EXISTING XML COMPRESSION METHODS

One of the first XML-oriented compressors was XMill [6], presented in 2000. It parses the XML data and splits it into three components: element and attribute symbol names, plain
text and the document tree structure. As those components are typically vastly different, it pays to compress them as separate streams − possibly even using different compressors, although, to the best of our knowledge, such time/compression trade-offs with XMill have not been sought. XMill component streams were originally [6] compressed with gzip, and later [2] also with bzip2, PPMD+ and PPM* (references to all the general-purpose compressors mentioned in this work can be found at http://www.maximumcompression.com). With gzip and order-5 PPMD+ the XMill transform improves compression by about 18% [2], but once higher-order contexts come into play (bzip2, PPM*), the gains disappear, and XMill even compresses worse than the respective compressors on unpreprocessed documents. The supposed reason is that, e.g., high-order PPM compressors already handle the different contexts well enough, so the XMill transform helps little if at all, while breaking the original structure makes it impossible to exploit cross-component redundancy.
Cheney’s XML-PPM is a streaming compressor which uses a technique named multiplexed hierarchical modeling (MHM). It switches between four models: one for element and attribute names, one for element structure, one for attributes, and one for strings, and encodes them in one stream using PPMD+ or, in newer implementations, Shkarin’s PPMd. Tag and attribute names are replaced by shorter codes. An important idea in Cheney’s algorithm is injecting the previous symbol from another model into the current symbol’s context. Injecting means that both the encoder and decoder assume there is such a symbol in the context of the current symbol, but do not explicitly encode or decode it. The purpose of symbol injection is to preserve (at least to some degree) contextual dependencies across different structural models, which were totally lost in XMill.
SCMPPM [1] can be seen as an extreme case of XML-PPM. Instead of using only a few structural classes, it uses a separate model for each element symbol. All structure elements having the same ancestor path are encoded in the same PPM model, but different elements use different models. This technique, called Structure Context Modeling (SCM), wins over XML-PPM on large documents (tens of megabytes), but loses on smaller files. Also, SCMPPM requires a lot of memory for housing multiple statistical models, and under limited-memory scenarios it may lose significantly, even compared to pure PPMd [3].
In a recent work [3] Cheney proposed a hybrid solution (Hybrid Context Modeling, HCM), trying to combine the best features of MHM and SCM. This algorithm initially uses a single model for each structural class, with symbol injection, i.e., it starts exactly as MHM. The novelty is to keep a counter of occurrences for each element. Once it exceeds a predefined
threshold, the given element gets its own model space and is thus separated from the other elements. Such a context-splitting technique could potentially require a huge amount of memory on very large files (similarly to SCM), so a limit on the number of models is also imposed. Those two parameters are chosen experimentally. Albeit sound, the HCM algorithm rarely dominates both SCM and MHM.
Several proposals (see e.g. [5] and the references therein) make use of the observation that a valid XML structure can be described by a context-free grammar, and grammar-based compression techniques can then be applied. Grammar-based compression can be seen as a generalization of dictionary-based compression, and it can identify and succinctly encode potentially complex patterns in the text. Still, this approach, albeit promising, has so far not yielded compressors competitive with, e.g., XML-PPM in compression ratio.
A recent trend in XML compression is to support queries directly in the compressed representation. At the moment, the most advanced solution in this domain is XBzip [4]. Although this scheme is quite impressive in both compression ratio and search/navigation capabilities, it loses to SCMPPM in compression ratio even if no support for queries is implemented. Together with the auxiliary structures for searching, it sometimes needs even more space than the respective gzip archive, at least with the default settings (cf. Table 2, XBzipIndex column, and Fig. 1 (top) in [4]).
Yet another line of research is to construct DTD- or Schema-aware compressors (XCQ, ICT XML-Xpress™). Taking into account that the syntax of the document is already stored in a DTD, impressive compression ratios can be obtained, provided a DTD for a given XML document is available to both compressor and decompressor, and the given document fully conforms to it. Theoretically, this could be the best way to handle XML compression, but in practice XML documents with an unavailable (or even undefined) DTD are often used, and many documents are frequently restructured, which raises compressor/decompressor incompatibility issues.
In this paper we address neither the problem of making compressed XML queryable, nor the use of a DTD in the process. Instead, we focus on devising a method to store XML in a very compact form, in a way as simple and fast as possible.
III. REDUNDANCY IN XML DOCUMENTS

We have made several observations concerning typical XML documents. In this section we mention them briefly; in the next section we present in detail how we exploit those redundancies.
Firstly, in a well-formed XML document, every end tag must match the corresponding start tag. Therefore, each end tag may be replaced with merely a closing flag.
Secondly, in every XML document there are words which tend to appear with high frequency. This is particularly the case with tag and attribute names, but attribute values or some of the element content words can also appear many times. Such frequent words can be extracted in a prescan over the document to form a dictionary. Every time a dictionary word
is found in the document, it can be replaced by its short dictionary index. If encoded properly, the dictionary index is always shorter than the word it references. However, the dictionary must be known explicitly to the decoder, e.g., written word-by-word at the beginning of the preprocessor output.
Thirdly, leading blanks in XML document lines are usually more or less regular in their count, which is beyond the grasp of general-purpose compression models. Special encoding of the leading blanks can help to handle them optimally.
Fourthly, in many documents an end tag is usually followed by a newline character. A single symbol can thus be used to represent such a concatenation.
Fifthly, many fields in databases are numeric, and storing numbers as text is inefficient. Numbers can be encoded more efficiently using a numerical system with a base higher than 10.
Some of these observations are backed up in the literature, and as such have been utilized in existing algorithms. These techniques lead to better compression performance, and we have used them as an integral part of our transform.
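To illustrate the first of these observations, the following C++ sketch (C++ being the language the transform was implemented in) collapses each end tag to a single closing flag; in well-formed XML the decoder can restore the element name from a stack of open elements. This is only a minimal sketch of the idea, not the actual XML-WRT code: the flag byte, the skipping of comments and self-closing tags, and the assumption of well-formed input are simplifications made for the example.

// Illustrative sketch (not the actual XML-WRT code): replacing end tags with a
// one-byte closing flag. The decoder keeps a stack of open element names, so the
// name inside an end tag is redundant in well-formed XML. The flag byte 0x01 is
// an arbitrary choice for this example; self-closing tags, comments and
// declarations are not handled, and well-formed input is assumed.
#include <iostream>
#include <stack>
#include <string>

const char CLOSE_FLAG = '\x01';

std::string stripEndTags(const std::string& xml) {
    std::string out;
    std::stack<std::string> open;
    for (size_t i = 0; i < xml.size(); ) {
        if (xml[i] == '<' && i + 1 < xml.size() && xml[i + 1] == '/') {
            // end tag: emit a single flag byte instead of "</name>"
            size_t end = xml.find('>', i);
            open.pop();                          // name is recoverable from the stack
            out += CLOSE_FLAG;
            i = end + 1;
        } else if (xml[i] == '<' && i + 1 < xml.size()
                   && xml[i + 1] != '!' && xml[i + 1] != '?') {
            // start tag: remember its name so the decoder can restore the end tag
            size_t end = xml.find('>', i);
            size_t nameEnd = xml.find_first_of(" \t>/", i + 1);
            open.push(xml.substr(i + 1, nameEnd - i - 1));
            out.append(xml, i, end - i + 1);
            i = end + 1;
        } else {
            out += xml[i++];
        }
    }
    return out;
}

int main() {
    std::string xml = "<book><title>XML</title><year>2007</year></book>";
    std::cout << stripEndTags(xml) << '\n';      // end tags collapse to single flag bytes
}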
IV. XML-WRT TRANSFORM

In this section we introduce the proposed XML Word Replacing Transform (XML-WRT, or XWRT for short) through a detailed description of its main constituents.
According to our experiments, the most important component of the transform, from the compression viewpoint, is replacing the most frequent words with references to a dictionary. The dictionary is obtained in a preliminary pass over the data and contains all non-overlapping, case-sensitive sequences containing letters, of length at least lmin = 2, that appear at least fmin = 6 times in the document. Moreover, the dictionary can contain start tags (without attributes), that is, sequences of characters that start with < and end with >. Start tags can be preceded by one or more space symbols. Also, the phrases =" and ">, which typically surround attribute values, are each replaced with a 1-byte code. The words selected for the dictionary are written explicitly, with separators, at the beginning of the output file. For most documents the dictionary contains no more than several hundred items, hence the codewords take one or two bytes. Additionally, the dictionary entries may contain leading blanks, which helps on regular document layouts. Dictionary references are encoded using a byte-oriented prefix code, as described in [9]. Although it produces slightly longer output than, for instance, Huffman coding, the resulting data can easily be compressed further, which is not the case with Huffman coding. The coding scheme is optimized for further LZ77 compression; as noted in [9], a PPM-friendly transform should instead use a less dense code, with non-intersecting ranges for different codeword bytes. Our scheme applies the spaceless word model [7], in which single spaces before encoded words in the textual content are omitted, as they can be automatically inserted on decoding.
Another idea in XML-WRT is compact encoding of numbers, or more precisely, of digit sequences. Any sequence of digits, of length at least 1, is replaced with two adjacent codes. The first code is a single character from ‘1’ to ‘4’, which identifies a numeral sequence and tells the length (in
bytes) of the second code. Longer digit sequences are simply broken into several shorter ones, but this case happens very rarely in practice. The second code represents the given digit sequence in a compact form, namely as a base-256 number. In case the digit sequence starts with one or more zeroes, the initial zeroes are left intact in the text. We observed that for some datasets slightly better results were obtained with other radix bases, 64 or 100. Still, using the densest possible encoding seems the best choice on average. There is no special encoding for fractional numbers, like 1123.550, so they are represented as two encoded integers separated by the decimal point. Using only the ASCII symbols ‘1’..‘4’ for the number length code leaves the symbols ‘5’..‘9’ free for the dynamic dictionary codewords. Overall, the number encoding gives about a 1–3% gain with gzip, although there are files for which a compression loss of about 1% has been observed.
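To make the dictionary stage more concrete, the following C++ sketch performs the preliminary pass that collects frequent words and then replaces their occurrences. It is only an illustration of the idea under simplifying assumptions: dictionary candidates are restricted to plain runs of letters (the real transform also admits start tags and entries with leading blanks), and the two-byte codewords are naive placeholders rather than the byte-oriented prefix code of [9].

// Sketch of the dictionary stage. Assumptions: only maximal letter runs are
// considered as dictionary candidates, and a naive two-byte placeholder code is
// used instead of the prefix code of [9].
#include <cctype>
#include <iostream>
#include <map>
#include <string>
#include <vector>

const size_t LMIN = 2;   // minimum length of a dictionary word
const int    FMIN = 6;   // minimum number of occurrences

// Helper: end position of the maximal run of letters starting at i.
static size_t letterRunEnd(const std::string& t, size_t i) {
    while (i < t.size() && std::isalpha((unsigned char)t[i])) ++i;
    return i;
}

// Preliminary pass: collect case-sensitive letter sequences of length >= LMIN
// that occur at least FMIN times.
std::vector<std::string> buildDictionary(const std::string& text) {
    std::map<std::string, int> freq;
    for (size_t i = 0; i < text.size(); ) {
        if (std::isalpha((unsigned char)text[i])) {
            size_t j = letterRunEnd(text, i);
            if (j - i >= LMIN) ++freq[text.substr(i, j - i)];
            i = j;
        } else ++i;
    }
    std::vector<std::string> dict;
    for (const auto& wc : freq)
        if (wc.second >= FMIN) dict.push_back(wc.first);
    return dict;   // written word-by-word, with separators, at the start of the output
}

// Second pass: replace dictionary words with placeholder two-byte codewords and
// drop the single space preceding a replaced word (spaceless word model [7]).
// A real implementation must flag the rare case of a codeword not preceded by a space.
std::string encodeWords(const std::string& text, const std::vector<std::string>& dict) {
    std::map<std::string, int> index;
    for (size_t k = 0; k < dict.size(); ++k) index[dict[k]] = (int)k;
    std::string out;
    for (size_t i = 0; i < text.size(); ) {
        if (std::isalpha((unsigned char)text[i])) {
            size_t j = letterRunEnd(text, i);
            std::string word = text.substr(i, j - i);
            auto it = index.find(word);
            if (it == index.end()) {
                out += word;
            } else {
                if (!out.empty() && out.back() == ' ') out.pop_back();
                out += char(0x80 | (it->second >> 7));    // placeholder code, NOT the
                out += char(0x80 | (it->second & 0x7F));  // prefix code of [9]
            }
            i = j;
        } else out += text[i++];
    }
    return out;
}

int main() {
    std::string xml;
    for (int i = 0; i < 6; ++i) xml += "<price currency=\"USD\">1123</price>\n";
    auto dict = buildDictionary(xml);   // here: {"USD", "currency", "price"}
    std::cout << encodeWords(xml, dict).size() << " vs " << xml.size() << " bytes\n";
}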
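The digit-sequence encoding can be sketched in the same spirit. The sketch below handles runs of up to nine digits, which always fit in four base-256 bytes; how XML-WRT splits longer runs is not detailed above, so that step is omitted here. Leading zeroes are copied to the output as text, as described.

// Sketch of the digit-sequence encoding: a length code '1'..'4' followed by the
// value written in base 256. Runs longer than nine digits are assumed to be
// split by the caller (the paper notes such runs are rare); leading zeroes stay as text.
#include <cstdint>
#include <iostream>
#include <string>

std::string encodeDigitRun(const std::string& digits) {
    std::string out;
    size_t pos = 0;
    while (pos < digits.size() && digits[pos] == '0')   // leading zeroes stay as text
        out += digits[pos++];
    if (pos == digits.size()) return out;               // the run was all zeroes
    uint32_t value = 0;
    for (size_t k = pos; k < digits.size(); ++k)        // at most 9 digits assumed here
        value = value * 10 + (digits[k] - '0');
    std::string payload;                                // base-256 representation
    do { payload.insert(payload.begin(), char(value & 0xFF)); value >>= 8; } while (value);
    out += char('0' + payload.size());                  // length code '1'..'4'
    out += payload;
    return out;
}

int main() {
    // "1123" becomes '2' followed by the bytes 0x04 0x63 (1123 = 4*256 + 99),
    // i.e. three bytes instead of four characters.
    std::string code = encodeDigitRun("1123");
    for (unsigned char c : code) std::cout << std::hex << (int)c << ' ';
    std::cout << '\n';
}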
V. EXPERIMENTAL RESULTS

In order to compare the performance of our algorithm to existing XML compressors, as well as to widely used general-purpose compressors, a set of experiments has been run. In compression benchmarking, a proper selection of the datasets used in experiments is essential. To our knowledge, there is no publicly available and widely respected XML dataset corpus to date. We decided to base our test suite on the XML dataset corpus proposed in [8], as it was devised to “cover a wide range of XML data formats and structures”. As we were unable to reproduce the original corpus exactly (one of its files could not be located and fetched, and the exact versions of some other datasets could not be tracked down), we modified the corpus, making use of the datasets available at the University of Washington XML Data Repository (http://www.cs.washington.edu/research/xmldatasets/www/repository.html). As a result, our experimental corpus consists of:
• DBLP, bibliographic information on major computer science journals and proceedings,
• Lineitem, line items from the 10 MB version of the TPC-H benchmark,
• Nasa, astronomical data,
• Shakes, a corpus of marked-up Shakespeare plays,
• SwissProt, a curated protein sequence database,
• UWM, university courses.
Table 1 presents detailed information for each dataset: its size (in bytes), the number of elements, the number of attributes, the number of distinct element types, and the maximum depth.
The test machine was an Intel Pentium 4 2.8 GHz with 512 MB of RAM, running Windows XP. The XML-WRT transform was implemented in C++ and compiled with Visual C++ 6.0. The application, XML-WRT v1.0, is available with sources at http://www.ii.uni.wroc.pl/~inikep/research/XML/XML-WRT10.zip.
In the experiment, data transformed with XWRT were then passed to three general-purpose compression programs: gzip, LZMA and PPMd. Gzip uses Deflate, the most widely used compression algorithm, known for its fast compression and
very fast decompression, but limited efficiency. LZMA uses a proprietary compression method, also implemented in the better-known 7-Zip compression utility, known for its high efficiency and very fast decompression, but slow compression. PPMd uses the PPMII compression algorithm, achieving the highest compression efficiency at the price of slow compression and decompression.

TABLE 1
BASIC CHARACTERISTICS OF THE XML DATASETS USED

dataset     file size (B)   # of elements   # of attributes   # of distinct elements   max. depth
DBLP        133 862 735     3 332 130         404 276         370 435                  6
Lineitem     32 295 475       561 871               1             817                  3
Nasa         25 050 288       476 646          56 317          33 714                  8
Shakes        7 894 983       179 690               0          28 159                  7
SwissPr     114 820 211     2 977 031       2 189 859         117 852                  5
UWM           2 337 522        66 729               6           4 054                  5
sum         316 261 214     7 594 097       2 650 459         555 031                  –
Notice that XWRT has been designed for LZ77-descendant compression algorithms, such as Deflate and LZMA. Related work on WRT shows that PPM efficiency can be improved by using a different form of the textual transform output. Improving PPM efficiency was out of the scope of our research, as our primary intention was to keep the decompression process fast. Therefore, the results for XWRT+PPM are presented for comparison purposes only.
In Table 2, the compression results obtained for XWRT-transformed datasets are compared to those achieved by the same compression algorithms on the datasets in their original form. Existing XML-aware compressors are represented in the results by the fast XMill 0.7 and the current state-of-the-art XML compressor, XML-PPM 0.98.2. Another reason supporting this choice was their mild memory requirements, less than 20 MB, which more or less corresponds to the memory use of our XML-WRT transform. Unfortunately, it appears that XML-PPM is not truly lossless (it did not exactly reproduce any of our test files during decompression), and XMill fails to exactly reproduce the DBLP file.
It is apparent from the results that transforming XML data greatly improves its compression ratio (by 30% on average in the case of gzip). If the transform output is encoded with LZMA instead of gzip, the average improvement rises to 41%, with decompression time only 20% longer than gzip on the original data. If the reverse transform were included in the decompressor, the time gap would be even smaller, due to significant savings in I/O operations. The XWRT+LZMA compression ratio surpasses the gzip-based XMill result by 27%, and even though XWRT has not been tuned for PPM at all, XWRT+PPMd attains a compression ratio 9% better than the current state-of-the-art XML compressor, with much faster decompression (the exact time measurement for XML-PPM cannot be quoted, as the program froze during decompression of the Nasa dataset).
TABLE 2
COMPRESSION RATIOS IN BITS PER CHARACTER AND COMPRESSION / DECOMPRESSION TIMES

            gzip      xwrt +       xwrt +      xmill     xmill 0.8   xwrt + ppmd   xml-ppm
            1.2.4     gzip 1.2.4   lzma 4.35   0.7       ppmd        -o6 -m16      0.98.2
DBLP        1.463     1.029        0.868       1.250     0.940       0.757         0.857
Lineitem    0.721     0.488        0.383       0.380     0.270       0.258         0.273
Nasa        1.208     0.851        0.686       1.011     0.823       0.644         0.729
Shakes      2.182     1.560        1.452       2.044     1.584       1.251         1.367
SwissPr     0.985     0.675        0.451       0.619     0.477       0.438         0.465
UWM         0.553     0.383        0.329       0.382     0.310       0.252         0.259
average     1.186     0.831        0.695       0.948     0.734       0.600         0.659
ctime (s)   31.65     73.69        194.83      41.70     85.79       112.99        121.26
dtime (s)   32.74     36.95        41.25       36.49     78.34       84.06         failed
VI. CONCLUSIONS AND FUTURE WORK

We have presented a fast XML transform aiming to improve lossless XML compression in combination with existing general-purpose compressors. We focused on fast decoding of a compressed document, i.e., the reverse transform is not only fast, but also optimized for the LZ77 compression family, which is characterized by very fast decompression. The main components of our algorithm are: a semi-dynamic dictionary of frequent alphanumerical phrases (not limited to “words” in a conventional sense), the spaceless word model, binary encoding of numbers, and a succinct representation of the document layout. Thanks to the proposed transform, the XML compression of a widely used LZ77-type algorithm, Deflate (used by default in the gzip and zip formats), can be improved by as much as 30%. The price is that encoding with gzip becomes more than twice as slow, but it is still much faster than with PPM-based compressors. We suppose that the main advantage of our algorithm over its competitors comes from applying the dictionary encoding not only to structural elements (tag names, attributes) but also to the textual content. We expect that relaxing the rules for the items in our dictionary (e.g., accepting pairs of words, formatted dates, fractional numbers, e-mail addresses, etc.) could help a little more.
XML-WRT works in two passes over the text, the first of which gathers text tokens and then generates the dictionary of the most frequent tokens. We wanted the proposed transform to be simple. For this reason, we have given up on several promising ideas, which are left for future work. The most important one is breaking up the document into so-called containers [6]. This would complicate the transform considerably, but preliminary experiments show that implementing this idea can significantly improve compression with gzip and, to a lesser degree, with LZMA. A related idea is the separation of numerical data (digits) into another stream. Similarly, textual data (i.e., the textual remnants after the dictionary-based text encoding) can also be moved to another stream. Another line of our research is to optimize the transform for PPM compression, to increase its advantage over XML-PPM even more.
REFERENCES

[1] J. Adiego, P. de la Fuente, and G. Navarro, “Merging Prediction by Partial Matching with Structural Contexts Model,” in Proc. of the IEEE Data Compression Conf., Snowbird, UT, USA, p. 522, 2004. (Also available at http://www.dcc.uchile.cl/~gnavarro/ps/dcc04.2.ps.gz.)
[2] J. Cheney, “Compressing XML with multiplexed hierarchical PPM models,” in Proc. of the IEEE Data Compression Conf., Snowbird, UT, USA, pp. 163–172, 2001.
[3] J. Cheney, “Tradeoffs in XML Database Compression,” in Proc. of the IEEE Data Compression Conf., Snowbird, UT, USA, pp. 392–401, 2006.
[4] P. Ferragina, F. Luccio, G. Manzini, and S. Muthukrishnan, “Compressing and Searching XML Data Via Two Zips,” in Proc. of the Int. World Wide Web Conf. (WWW), Edinburgh, Scotland, pp. 751–760, 2006.
[5] G. Leighton, “Two New Approaches for Compressing XML,” M.Sc. Thesis, Acadia University, Wolfville, Nova Scotia, 2005. (Also available at http://cs.acadiau.ca/~005985l/MThesis.zip.)
[6] H. Liefke and D. Suciu, “XMill: an efficient compressor for XML data,” in Proc. of the 2000 ACM SIGMOD Int. Conf. on Management of Data, Dallas, TX, USA, pp. 153–164, 2000.
[7] E.S. Moura, G. Navarro, and N. Ziviani, “Indexing Compressed Text,” in R. Baeza-Yates, editor, Proc. of the 4th South American Workshop on String Processing (WSP’97), Valparaiso, Carleton University Press, pp. 95–111, 1997.
[8] W. Ng, W.-Y. Lam, and J. Cheng, “Comparative Analysis of XML Compression Technologies,” World Wide Web, Vol. 9, No. 1, pp. 5–33, 2006.
[9] P. Skibinski, Sz. Grabowski, and S. Deorowicz, “Revisiting dictionary-based compression,” Software–Practice and Experience, Vol. 35, No. 15, pp. 1455–1476, 2005.
CADSM’2007, February 20-24, 2007, Polyana, UKRAINE