ASTRONOMICAL DATA ANALYSIS SOFTWARE AND SYSTEMS XIV ASP Conference Series, Vol. 347, 2005 P. L. Shopbell, M. C. Britton, and R. Ebert, eds.
XML Data in the Virtual Observatory
R. G. Mann
Institute for Astronomy, University of Edinburgh, Royal Observatory, Blackford Hill, Edinburgh, EH9 3HJ, United Kingdom
R. M. Baxter, R. Carroll, and Q. Wen
National e-Science Centre, University of Edinburgh, 15 South College Street, Edinburgh, EH8 9AA, United Kingdom
O. P. Buneman, B. Choi, W. Fan, R. W. O. Hutchison, and S. D. Viglas
School of Informatics, University of Edinburgh, Appleton Tower, Crichton Street, Edinburgh, EH8 9LE, United Kingdom
Abstract. XML is the lingua franca of the Web services world and so will play a major role in the construction of the Virtual Observatory. Its great advantages are its flexibility, platform-independence, ease of transformation, and the wide variety of existing software that can process it. An obvious disadvantage in its use as an astronomical data format is its verbosity; the number of bytes taken up writing the XML tags can easily outnumber those constituting the actual astronomical data. This verbosity is a problem in many other disciplines, and computer scientists are developing solutions more generic than that found in the VOTable specification. In this paper we describe two such projects currently under way in Edinburgh: one focuses on the compression and querying of XML, and the other provides a technology for representing the structure of a binary file in XML, enabling it to be read as if it were XML.
1. Introduction: what’s wrong with VOTable?
One of the first fruits of the nascent Virtual Observatory (VO) movement was the draft specification for VOTable¹, an XML standard for the exchange of astronomical data in tabular form. The goal in designing a new data format was to aid interoperability between the distributed data centres coming together in the VO. The choice of XML as the basis for that format was determined by its central place in the Internet world, together with the existing XSLT² technologies for transforming XML documents.
¹ See Ochsenbein et al. (2004) for V1.1 of the VOTable specification.
² http://www.w3.org/TR/xslt
The flexibility of XML as a way of storing data comes at a cost, namely the overhead in data volume of the tags surrounding every data value in an XML document. The definers of VOTable understood that this overhead would make VOTable a prohibitively expensive format for storing large quantities of astronomical data. So, in addition to the pure XML version of VOTable, in which every data value is fully tagged inside a <TD> element, they provided two additional ways of storing the tabular data: (i) a FITS Serialization, in which the metadata describing the table are presented in fully tagged XML in the VOTable document, but the data values themselves are stored in an external FITS binary file, which is referenced within the VOTable document; and (ii) a Binary Serialization, in which the table of data values is stored in a binary format within the VOTable document, using some suitable encoding.
These two additional options were pragmatic choices on the part of a group addressing the pressing needs of the VO community, but they lack some elegance, and each has specific design problems. For example, the VOTable specification leaves unclear what a parser should do if, when manipulating a data set using the FITS serialization, it finds that the metadata in the <FIELD> elements of the VOTable document do not match those in the header of the linked FITS file. Also, the Binary serialization is applied at the level of the full table, thereby rendering the whole table unreadable by humans. At a more practical level, the additional complexity of the FITS and Binary variants of VOTable meant that few data centres published data using them and, at least initially, few applications supported ingesting them, although some (e.g., TOPCAT³) now do.
While the VO community was debating the pros and cons of binary and XML data formats, the same issues were being addressed in a number of other data-rich scientific disciplines, and computer scientists began to devise generic solutions to these problems. In this paper we introduce to the VO community two of these, developed at the University of Edinburgh: the first, VX (Buneman et al. 2004), circumvents the verbosity of the pure XML variant of VOTable by decomposing the XML document into a vectorized form which admits efficient querying; while the second, BinX (Carroll et al. 2004), is based on a language for describing binary data and their layout in XML.
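The tag overhead discussed in this section can be made concrete with a small Python sketch. It compares a fully tagged, TABLEDATA-style serialization of a table with the same values packed as raw binary. The column values are invented for illustration, and the XML is a simplified fragment rather than a complete, valid VOTable document.

```python
import struct
import xml.etree.ElementTree as ET

# Invented sample rows (ra, dec, magnitude); not real catalogue data.
rows = [(10.0 + 0.001 * i, -5.0 + 0.002 * i, 18.0 + 0.01 * i)
        for i in range(1000)]


def to_tabledata(rows):
    """Serialize rows as a simplified TABLEDATA fragment, one TD per value."""
    tabledata = ET.Element("TABLEDATA")
    for row in rows:
        tr = ET.SubElement(tabledata, "TR")
        for value in row:
            ET.SubElement(tr, "TD").text = repr(value)
    return ET.tostring(tabledata)


def to_binary(rows):
    """Pack every value as a big-endian 64-bit float, with no markup."""
    return b"".join(struct.pack(">3d", *row) for row in rows)


xml_bytes = to_tabledata(rows)
bin_bytes = to_binary(rows)

# Bytes spent on tags alone: total XML size minus the text of the values.
value_text_bytes = sum(len(repr(v)) for row in rows for v in row)
tag_bytes = len(xml_bytes) - value_text_bytes

print(f"XML:    {len(xml_bytes)} bytes, of which {tag_bytes} are markup")
print(f"binary: {len(bin_bytes)} bytes")
```

In this toy case each value costs nine bytes of TD markup before its text representation is counted, against eight bytes for the 64-bit float itself, so the markup alone exceeds the size of the binary payload.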
2. VX: Vectorizing XML
Vertical partitioning is a well-known technique for optimizing query performance in relational databases. An extreme form of this technique, which we call vectorization, is to store each column separately. Liefke & Suciu (2000) applied the idea of vectorization to XML documents, decomposing them into vectors (the sequences of data values appearing under all paths bearing the same sequence of tag names) and the skeleton, which stores the tree-like structure of an arbitrary XML document. This work yielded XMILL, a tool for compressing XML documents. The aim of VX is somewhat broader than simple compression, as it aspires to develop techniques whereby large XML repositories can be efficiently queried. It also differs in some technical aspects from the work of Liefke & Suciu (2000): it does not compress the vectors, and it uses a different method for compressing the skeleton.
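The decomposition into skeleton and vectors can be sketched in a few lines of Python. This is an illustrative toy, not the VX implementation: the function names, the sample document, and in particular the run-length collapsing of repeated subtrees are assumptions made for this sketch, standing in for whatever skeleton compression VX actually uses (see Buneman et al. 2004).

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def vectorize(root):
    """Split an XML tree into a skeleton (structure only) and vectors.

    The skeleton is a nested (tag, children) tuple with all text removed.
    Each vector holds, in document order, the text values found under one
    root-to-node tag path.
    """
    vectors = defaultdict(list)

    def walk(node, path):
        path = path + "/" + node.tag
        text = (node.text or "").strip()
        if text:
            vectors[path].append(text)
        return (node.tag, tuple(walk(child, path) for child in node))

    skeleton = walk(root, "")
    return skeleton, dict(vectors)


def compress(skeleton):
    """Collapse runs of identical consecutive subtrees into (subtree, count)."""
    tag, children = skeleton
    runs = []
    for child in map(compress, children):
        if runs and runs[-1][0] == child:
            runs[-1][1] += 1
        else:
            runs.append([child, 1])
    return (tag, tuple((sub, count) for sub, count in runs))


# Invented tabular document for illustration.
doc = ET.fromstring(
    "<table>"
    "<row><ra>10.1</ra><dec>-5.2</dec></row>"
    "<row><ra>10.3</ra><dec>-5.4</dec></row>"
    "<row><ra>10.5</ra><dec>-5.6</dec></row>"
    "</table>"
)
skeleton, vectors = vectorize(doc)
compact = compress(skeleton)

# A column-style selection touches only the two vectors it needs. Pairing
# values by position assumes every row contains both an ra and a dec.
selected = [ra for ra, dec in zip(vectors["/table/row/ra"],
                                  vectors["/table/row/dec"])
            if float(dec) < -5.3]
print(compact)
print(selected)
```

For a regular table the compressed skeleton reduces to a single row pattern with a repeat count, which is why tabular data are a favourable case: nearly all of the document's volume sits in the vectors, and a query reads only the vectors it needs.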
³ http://www.starlink.rl.ac.uk/TOPCAT
Figure 1. The tree representation of an XML document.
Figure 2. Its VX decomposition: (a) skeleton; and (b) vectors.
Those interested in the details of VX should consult the paper by Buneman et al. (2004), but some of its essential features can be understood by comparing Figures 1 and 2, which represent, respectively, the original tree form of an XML document and its vectorized decomposition. The numbers next to the edges in Figure 2(a) denote structures which appear more than once in a depth-first traversal of the tree. Queries on the XML document can then be run by a processor which holds the skeleton in main memory and reads data values from the vectors on disk only when needed. The current processor works with a fragment of XQuery that can express all relational conjunctive queries. The regular structure of a tabular dataset yields a trivial skeleton under VX decomposition, allowing very quick querying. Tests (Buneman et al. 2004) on the PhotoObj table of the SDSS EDR SkyServer database (Stoughton et al. 2002) show that the VX prototype is comparable with the EDR SkyServer for those queries where SQL Server does not make heavy use of indexes, since these have not yet been implemented in VX.
3. BinX
BinX (Carroll et al. 2004) has been developed by the edikt⁴ project at the National e-Science Centre, as a solution to a dilemma expressed within the VO community and in many other data-rich disciplines: "We want the flexibility of storing our data as XML, but we also want the compactness of storing our data in a binary format." BinX comprises an XML annotation language which describes the data types and structures in binary data files, and an associated library for making use of that language in the manipulation of XML and binary data files.
⁴ http://www.edikt.org
Figure 3. FITS–to–VOTable conversion using BinX.
The BinX language can describe primitive data types, such as characters, bytes, integers (16, 32, and 64 bits), floating point numbers (32 and 64 bits) and strings, as well as arrays, sequences (structs), and unions, and it allows the user to define a type based on an array, sequence or union.
The description of the binary data is stored in an XML file known as a SchemaBinX document. This contains metadata describing each data element within the binary data file, as well as a reference to the binary data file itself: like the FITS variant of VOTable, BinX separates metadata (stored as XML) from bulk data (which remains in a compact binary format). The availability of a SchemaBinX descriptor enables one to manipulate a binary file as if it were an XML document (e.g., to run an XPath query over it) via the BinX library. A second representation of the binary data is the DataBinX document, in which the schema information describing a binary data file is supplemented by all the actual data values.
Figures 3 and 4 show schematically one use to which BinX has been put: converting between FITS binary tables and VOTable files (or any other format which can be specified using an XSLT transformation of the DataBinX file). Note that the VOTable-to-FITS conversion is more complicated, due to the ASCII header of the FITS file. This conversion works well, but the intermediate DataBinX files can be large, so one possible development would be to combine VX and BinX.
A working group⁵ of the Global Grid Forum is extending the ideas of BinX in the development of a Data Format Description Language (DFDL), and BinX may evolve into a reference implementation of that standard as it develops.
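The idea of an XML description that lets a binary file be read as if it were XML can be illustrated with a short Python sketch. The schema vocabulary below (schema, record, field, byteOrder, repeat) is invented for this example and is not the real SchemaBinX syntax, and the code uses only the Python standard library rather than the BinX library. The binary payload is likewise made-up sample data.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical, simplified schema: element and attribute names are invented
# for this sketch and are NOT the real SchemaBinX vocabulary.
SCHEMA = """
<schema byteOrder="big">
  <record repeat="3">
    <field name="id" type="int32"/>
    <field name="ra" type="float64"/>
    <field name="dec" type="float64"/>
  </record>
</schema>
"""

# Mapping from the invented type names to struct format codes.
STRUCT_CODES = {"int16": "h", "int32": "i", "int64": "q",
                "float32": "f", "float64": "d"}


def binary_to_xml(schema_text, payload):
    """Expand a binary payload into a fully tagged XML tree using the schema."""
    schema = ET.fromstring(schema_text)
    record = schema.find("record")
    fields = [(f.get("name"), f.get("type")) for f in record.findall("field")]
    order = ">" if schema.get("byteOrder") == "big" else "<"
    layout = struct.Struct(order + "".join(STRUCT_CODES[t] for _, t in fields))

    dataset = ET.Element("dataset")
    for i in range(int(record.get("repeat"))):
        values = layout.unpack_from(payload, i * layout.size)
        row = ET.SubElement(dataset, "record")
        for (name, _), value in zip(fields, values):
            ET.SubElement(row, name).text = repr(value)
    return dataset


# Invented sample data standing in for the contents of a binary file.
payload = b"".join(struct.pack(">idd", i, 10.0 + i, -5.0 - i)
                   for i in (1, 2, 3))

dataset = binary_to_xml(SCHEMA, payload)
# ElementTree supports a limited XPath subset, enough for this predicate.
match = dataset.find("./record[id='2']")
print(match.find("ra").text, match.find("dec").text)
```

The tree built here plays the role of a DataBinX-style document, with every value expanded into tagged text, which is exactly why such intermediates become large. A library working from the schema alone could instead compute byte offsets on demand and leave the bulk data in binary form.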
⁵ http://forge.gridforum.org/projects/dfdl-wg
Figure 4. VOTable–to–FITS conversion using BinX.
4. Summary and Conclusions
As the VO moves from development to production, and users begin to move large volumes of tabular data around, the pure XML version of VOTable will not suffice, and experience suggests that the FITS and Binary serialized versions will not be used widely within the community. The VO community must therefore think about how it will handle large volumes of tabular data, and the purpose of this paper is to introduce to that community two research projects which may point toward promising solutions to that problem. Those interested in learning more about VX or BinX should contact Bob Mann ([email protected]).
References
Buneman, P., Choi, B., Fan, W., Hutchison, R. W. O., Mann, R. G., & Viglas, S. D. 2004, accepted for publication in Proc. of The 21st International Conference on Data Engineering (ICDE 2005). Preprint available from http://homepages.inf.ed.ac.uk/v1bchoi/paper.pdf
Carroll, R., Virdee, D., & Wen, Q. 2004, in Proc. of The UK e-Science All Hands Meeting 2004, http://www.allhands.org.uk/proceedings/papers/124.pdf
Liefke, H., & Suciu, D. 2000, in Proc. ACM SIGMOD International Conference on Management of Data
Ochsenbein, F., et al. 2004, IVOA Recommendation, http://www.ivoa.net/Documents/latest/VOT.html
Stoughton, C., et al. 2002, AJ, 123, 485