Paper ID: 331
Making legacy data accessible for XML applications
Volker Turau
FH Wiesbaden, University of Applied Sciences, Department of Computer Science, Kurt-Schumacher-Ring 18, 65197 Wiesbaden, Germany
WWW: http://www.informatik.fh-wiesbaden.de/turau/
Abstract. This paper presents the design and implementation of DB2XML, a tool for transforming data from relational databases into XML documents. Document type declarations (DTDs) are generated that describe the characteristics of the data, making the documents self-contained and usable as a data exchange format. DB2XML is written in Java and accesses databases through JDBC drivers. It can be used as a standalone tool or as a servlet in web-based applications. Combined with client-side interpreted XSLT stylesheets, DB2XML provides a novel approach for publishing data from relational databases on the web.
1 Introduction

The extensible markup language (XML) is increasingly finding acceptance as a standard not only for describing documents, but also as a data description language for all types of information. It will fundamentally change applications which rely on electronic data interchange. XML will eliminate the need for custom-built interfaces to access data. With the adoption of XML by the leading web browser companies, the vision of web browsers as universal user front ends could become reality. Other tools such as text processing tools or spreadsheet programs will also use XML as a format for data exchange. There is a strong tendency to run more applications on the World Wide Web. The logical structuring capabilities of XML will turn the information network WWW into a global computing platform.

As with many new technologies, a crucial question is how to transform data from legacy applications such as relational databases or workflow systems into the new format in order to make this data accessible to XML applications. According to some sources more than 75% of all current web pages are generated from data contained in underlying databases. At the moment this data is transformed into HTML by server-side applications. The task of the browsers is to render the HTML document. An alternative is to transform this data into XML and to generate a corresponding stylesheet (XSL or DSSSL), leaving the actual visual rendering to the clients [5]. This approach has several advantages:
- Better design: This approach separates database aspects, business logic and layout design. These different domains are very often tightly coupled in current web applications, as can be seen in server-side applications where C or Perl code, HTML tags and SQL statements are highly interleaved. This results in applications which are usually hard to maintain and difficult to extend.
- Support of different clients: The generation of stylesheets for individual client software (e.g. browsers or handheld devices) or for printing formats such as PDF or RTF is separated from the content.
- Load balancing: The computational load can be partitioned between servers and clients.
- Lower bandwidth: The bandwidth requirements for HTML documents are on average much higher than for XML documents and stylesheets together.

DB2XML is a tool for transforming data from relational databases into XML documents. It is written in pure Java. DB2XML provides three main functions:

- Transforming the results of database queries or the complete content of databases into XML documents.
- Generation of meta data describing the characteristics of the data in the form of a document type definition (DTD).
- Transformation of the generated XML documents using XSLT stylesheets.

DB2XML can be used in two ways: as a standalone tool (with or without a graphical user interface) or as a servlet to dynamically generate XML documents in a web server. Databases are accessed through JDBC drivers.

This paper is organized as follows. Section 2 discusses the problem of mapping relational data into XML documents. The following section gives an overview of the approach taken in DB2XML. Section 4 presents some details of the implementation. Finally, we summarize our work, compare it to other approaches and list future work. In an appendix we discuss an example demonstrating the usage of XSLT stylesheets in DB2XML.
2 The Problem

There are quite a few technical problems involved in mapping relational data into XML documents, which stem from the differences between the two models. The relational model as proposed by E.F. Codd differs considerably from the data model of XML. It is basically a three tier model (not considering catalogs): a database is a collection of tables consisting of records, and a record consists of a fixed number of fields. Values of record fields must be atomic, that is, they cannot themselves be lists of values, or else names of whole relations standing for lists of values. On the other hand, XML documents are annotated trees of arbitrary depth. Hence from a structural
point of view XML is capable of representing the relational model. The question is how to define a useful mapping. Note that mapping general XML documents into relational structures is a much more difficult problem.

Despite the fact that XML documents are trees, XML has a flat name space: no element type may be declared more than once. Since all element types are defined in the same scope, they all must have different names. Element types can declare attributes, and different element types may use common attribute names. This resembles the restrictions in the relational model: the names of all tables must be different, and different tables may use the same field names. This suggests a mapping from tables to element types and from fields to attributes. Each row of a table would then correspond to an empty element, and the values of the attributes would be the field values. Figure 1 shows a relational table, the resulting DTD, and the corresponding XML document.

While this seems to be a natural mapping rule, it has some serious drawbacks. Field values are not represented as first class entities, i.e. not as element types. This makes the storage of metadata (types, read/write restrictions, precisions, encodings etc.) about field values clumsy at best. On the other hand, mapping the fields of all tables to element types while retaining the names leads to name clashes in case two tables have a common field name.
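The attribute-based mapping rule described above can be sketched in a few lines of Python. This is only an illustration and not DB2XML code: it uses an in-memory SQLite database in place of a JDBC data source, and the helper name `rows_as_empty_elements` is hypothetical.

```python
import sqlite3
from xml.sax.saxutils import quoteattr


def rows_as_empty_elements(conn, table):
    """Render each row of `table` as an empty element; field values become attributes.

    Hypothetical helper: the table name is used as the element type name and
    the column names as attribute names, so both must be legal XML names.
    """
    cur = conn.execute(f'SELECT * FROM "{table}" ORDER BY rowid')
    names = [d[0] for d in cur.description]
    lines = []
    for row in cur:
        attrs = " ".join(f"{n}={quoteattr(str(v))}" for n, v in zip(names, row))
        lines.append(f"<{table} {attrs}/>")
    return lines


# Sample data corresponding to table t of Figure 1.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b TEXT, c TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(1, "hello", "true"), (2, "bye", "false")],
)
for line in rows_as_empty_elements(conn, "t"):
    print(line)
```

Because every value ends up in an attribute, there is no natural place to attach per-field metadata, which is the drawback discussed above.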
Table t:

a   b         c
1   "hello"   true
2   "bye"     false

DTD (fragment):

... #REQUIRED
b CDATA #REQUIRED
c (true | false) #REQUIRED>
Fig. 1. Simple mapping of a relational table into an XML document

An XML document contains text, a sequence of characters, which may represent markup or character data. Legal characters are white space and the legal graphic characters of Unicode and ISO/IEC 10646 [7]. In addition to character data, relational databases support a variety of binary types such as numeric types, byte, varbinary etc. There are two possibilities to represent values of these types in XML: in external entities (i.e. in separate files) or as legal XML characters after a suitable transformation. Some data types such as numeric types have a unique character representation which can be used. For other binary types a character encoding scheme (such as base64) must be employed. Character fields of type VARCHAR or TEXT may be transformed directly into XML as long as they contain neither the characters & and < nor the string ]]>. Otherwise the strings have to be protected in CDATA sections or stored externally.
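The encoding options named above can be illustrated with the Python standard library. The function names are hypothetical and the sketch makes no claim about how DB2XML implements these steps. Entity escaping is shown as a further alternative to CDATA sections, and the CDATA helper splits an embedded `]]>` across two sections, since a single CDATA section cannot contain that string.

```python
import base64
from xml.sax.saxutils import escape


def encode_text(value: str) -> str:
    """Escape &, < and > so the string is legal element content."""
    return escape(value)


def encode_cdata(value: str) -> str:
    """Wrap a string in a CDATA section.

    An embedded ']]>' would end the section early, so it is split
    across two adjacent CDATA sections.
    """
    return "<![CDATA[" + value.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def encode_binary(value: bytes) -> str:
    """Represent binary data as legal XML characters using base64."""
    return base64.b64encode(value).decode("ascii")


print(encode_text("a < b & c"))
print(encode_cdata("x]]>y"))
print(encode_binary(b"\x00\xff"))
```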
Databases allow fields to hold null values to signal that the field currently has no value assigned to it. The mapping has to treat null values in such a way that they can be distinguished from special values such as empty strings.
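One way to keep null values apart from empty strings is a marker attribute on the field element. The sketch below assumes an attribute named ISNULL, as listed in Table 1; the helper `field_element` is hypothetical.

```python
from xml.etree import ElementTree as ET


def field_element(name, value):
    """Build a field element; a database NULL (Python None) is flagged explicitly."""
    elem = ET.Element(name)
    if value is None:
        elem.set("ISNULL", "true")   # null: no content, marker attribute set
    else:
        elem.text = str(value)       # empty string: no content, no marker
    return elem


null_field = field_element("b", None)
empty_field = field_element("b", "")
print(ET.tostring(null_field, encoding="unicode"))
print(ET.tostring(empty_field, encoding="unicode"))
```

Both elements have no character content, but only the first carries the marker, so a consumer can tell the two cases apart.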
3 The mapping

The design goals for the mapping of relational data into XML used in DB2XML are:

- The mapping must allow efficient implementations.
- The generation of document type declarations must be possible.
- The processing of the generated documents (e.g. by XSL processors) should be simple and efficient.
- The amount of meta data included in the document should be configurable (from none to everything the database system offers).

DB2XML generates an element type for each table and for each column of each table. Meta data is associated with these elements using attributes. The root element of the generated DTD represents the content of different tables or the results of several queries or both. The user can choose the name of this element; the default name is database. This element has one attribute URL, whose value is the URL of the database using the driver specific syntax.

All records belonging to one query or one table are grouped in a single element. The user can choose the name of this element; by default this is the name of the corresponding table or table i, where i is the sequence number of the table. A database element may contain several table elements. A table element type has an attribute QUERY, whose value is the query string for this element. Optionally the attributes PKEY and FORKEY are supported. Their values are comma separated lists with the names of the element types representing the table's primary key columns and imported foreign key columns, respectively (note that these names are unique).

Records are also represented as elements. Users can choose names for record elements; the default is record i, where i is the sequence number of the table. If applicable, users may choose to use the table name. The content of a record element is a sequence of elements corresponding to the individual fields. If null values are to be discarded, those fields which are nullable are optional. The default settings ensure that the generated XML documents are valid.
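The nesting described so far (database, table, record, field) can be sketched as follows. This is a minimal illustration under stated assumptions, not the DB2XML implementation: SQLite stands in for a JDBC source, the function `query_to_xml` and the record name pattern `record1`, `record2`, ... are hypothetical, and null values are simply discarded.

```python
import sqlite3
from xml.etree import ElementTree as ET


def query_to_xml(conn, url, queries):
    """Map query results to a database/table/record/field element hierarchy.

    `queries` is a list of (table_element_name, sql) pairs. Column names
    are used as field element names and must therefore be legal XML names.
    """
    root = ET.Element("database", URL=url)
    for i, (name, sql) in enumerate(queries, start=1):
        table = ET.SubElement(root, name, QUERY=sql)
        cur = conn.execute(sql)
        columns = [d[0] for d in cur.description]
        for row in cur:
            record = ET.SubElement(table, f"record{i}")  # assumed naming pattern
            for col, value in zip(columns, row):
                if value is None:
                    continue  # null values are discarded in this sketch
                ET.SubElement(record, col).text = str(value)
    return root


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "hello"), (2, None)])
root = query_to_xml(conn, "sqlite::memory:", [("t", "SELECT a, b FROM t ORDER BY a")])
print(ET.tostring(root, encoding="unicode"))
```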
If users choose individual names for element types, they have to ensure that all names are different. DB2XML warns users if some element types have the same name. Alternatively, a hierarchical naming scheme avoiding all naming conflicts is available.

Each field in a record is represented as an individual element type. The name depends on the system settings. If the field value is stored in an external entity (currently only binary fields are supported) the content is EMPTY; otherwise it is either CharData or CDSect, depending on the database type of the field. These element types can have a variety of attributes depending on the type. Table 1 gives an overview of these attributes. Only the attributes HREF, ISNULL and NAME are declared as #IMPLIED; the remaining attributes are #FIXED (i.e. the attribute has a fixed default value). Figure 2 shows an excerpt of a DTD generated by DB2XML.

Table 1. Attributes of element types for record fields

Name            DB Types        Range        Meaning
TYPE            all types       types*       local type name of the field
NULLABLE        all types       true|false   field may have null value
ISNULL          all types       true|false   field value is null
ISPKEY          all types       true|false   field is part of the primary key
ACCESS          all types       R|RW         field value is write protected
CASESENSITIVE   char types      true|false   field value is case sensitive
PRECISION       num. types      CDATA        maximal number of digits
AUTOINCREMENT   integer types   true|false   automatically numbered field
CURRENCY        num. types      true|false   field denotes a cash value
ENCODING        binary types    CDATA        encoding scheme
SCALE           num. types      CDATA        maximal number of digits to the right of the decimal point
HREF            binary types    CDATA        external reference
NAME            all types       CDATA        name of the record column

* Enumeration of type names appearing in the corresponding table.

DB2XML supports the transformation of complete or partial databases. In the latter case the output is controlled by a sequence of SQL queries and table names such as:

[orders] select * from orders as Orders where ORDERID < 10261 | [sup] suppliers
Each entry in this sequence can optionally be prefixed with a name. This will be the name for the corresponding table element type. The usage of SQL queries allows very fine tuning of the resulting XML document. Further restructuring of documents is based on the proposed XML transformation language XSLT [5]. The language allows the transformation of the flat table structure of the database into arbitrarily deep hierarchical structures. An example of such a transformation is given in appendix A. The details of the mapping rules and further examples can be found in [11].
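To make the format of such a sequence concrete, the following sketch parses it into (name, SQL) pairs. The grammar is inferred from the single example above and is an assumption: entries are separated by `|`, an optional name appears in square brackets, and an entry that does not start with `select` is treated as a table name. The function `parse_sequence` is hypothetical; the exact rules are documented in [11].

```python
import re

# Optional "[name]" prefix followed by the query text or a table name.
ENTRY = re.compile(r"^(?:\[(?P<name>[^\]]+)\]\s*)?(?P<body>.+)$", re.S)


def parse_sequence(spec):
    """Split a query sequence into (name, sql) pairs; name is None if not given.

    Splitting on '|' is naive: a query that uses the SQL '||' operator
    would need a real tokenizer.
    """
    entries = []
    for part in spec.split("|"):
        part = part.strip()
        if not part:
            continue  # ignore empty entries
        match = ENTRY.match(part)
        name, body = match.group("name"), match.group("body")
        is_query = body.lower().startswith("select")
        sql = body if is_query else f"select * from {body}"
        entries.append((name, sql))
    return entries


spec = "[orders] select * from orders as Orders where ORDERID < 10261 | [sup] suppliers"
for name, sql in parse_sequence(spec):
    print(name, "->", sql)
```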
4 The implementation

DB2XML is implemented in Java 1.1, is 100% pure Java, and consists of more than 40 classes organized in five packages with over 15,000 lines of code. Access to databases is based on JDBC version 1.20, and therefore virtually any relational database can be used as a data source. DB2XML has been tested on
different platforms (Unix and Win32) using different databases (Oracle, SQLServer, MySql, Access) and different drivers (JDBC-ODBC bridge, type 3 and 4). Our experience has shown that not all JDBC drivers fully implement the standard, especially those parts dealing with metadata. Special precautions against incomplete drivers have been taken to prevent system crashes or undefined behaviour.

[Figure 2, DTD excerpt (fragment):]

... QUERY CDATA #REQUIRED
    PKEY CDATA #IMPLIED>
]>
....

Fig. 2. An internal document type declaration generated by DB2XML

DB2XML relies on the ability to convert text from Unicode to the local coding and vice versa. In Java 1.1 this can be done with the classes InputStreamReader and OutputStreamWriter. A wide variety of encodings is supported (including 8 bit Unicode, EBCDIC and ISO latin codes).

There are significant variations between the SQL types supported by different database products. JDBC defines its own SQL types (called JDBC types) and a mapping between the type systems. DB2XML is solely based on JDBC types. The class JDBCType encapsulates all general methods and information to generate XML data for JDBC types (e.g. attributes, encoding schemes etc.). Behavior which is specific to particular JDBC types is implemented in subclasses of JDBCType (e.g. JDBCBinaryType or JDBCBitType). These classes define type specific access methods and are responsible for the generation of dedicated attributes. This design encapsulates the mapping in a few classes, which eases transitions to different mappings.

DB2XML can generate internal and external DTDs. An important requirement was that each query should be executed only once (in order to prevent
undesirable situations such as phantom reads). The generation of internal DTDs therefore required the usage of temporary files, since DTDs must be defined at the beginning of a file. Multiple select statements can be executed within a single transaction.

Fig. 3. The main panel of DB2XML

The generated XML documents can be exported in their textual form (to a file or a stream) or as structured objects using the Document Object Model (DOM) [13]. The DOM structure is useful for applications which perform further processing of the document. One example is the transformation of the document using XSLT stylesheets. DB2XML includes support for XSL processors. The current version includes an interface for the XSL processor developed by the Lotus Corporation [9] and the IBM XML parser [6]. In case a web browser does not support XSL stylesheets (currently only Microsoft's Internet Explorer 5 includes an XSL processor), the transformation to HTML is performed inside DB2XML. The DOM structure can also be used for updating the document (i.e. inserting, deleting or changing elements). The updated document can be written back to the database (not supported in release 1.1). This will allow interactive web-based applications.

All SQL queries for one XML document use the same connection to a database. The default behaviour is to use a different connection for each XML document. For use as a servlet this is an unacceptable overhead; therefore open connections can be reused. Support for connection pooling is planned for future releases.

To support the usage of DB2XML with and without a graphical user interface, all configuration parameters are stored in a central repository (about 75 different
properties). These parameters can be saved into a file for later usage. Figure 3 shows the main panel of the DB2XML graphical user interface.
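The temporary-file technique mentioned earlier in this section (reading each query result only once while still emitting the DTD at the top of the document) can be sketched as follows. The sketch rests on an assumption that is not taken from DB2XML: the content model is derived from the data itself, a column becoming optional if a null value was seen. The function `write_with_internal_dtd` and the element names `table` and `record` are hypothetical.

```python
import io
import shutil
import tempfile
from xml.sax.saxutils import escape


def write_with_internal_dtd(columns, rows, out):
    """Write an XML document with an internal DTD, iterating `rows` only once."""
    saw_null = {c: False for c in columns}
    with tempfile.TemporaryFile("w+", encoding="utf-8") as body:
        # Single pass over the query result: records go to the temporary file.
        for row in rows:
            body.write("<record>")
            for col, value in zip(columns, row):
                if value is None:
                    saw_null[col] = True  # null fields are left out
                else:
                    body.write(f"<{col}>{escape(str(value))}</{col}>")
            body.write("</record>\n")
        # Only now is everything known that the DTD depends on.
        model = ", ".join(c + "?" if saw_null[c] else c for c in columns)
        out.write('<?xml version="1.0"?>\n<!DOCTYPE table [\n')
        out.write("<!ELEMENT table (record*)>\n")
        out.write(f"<!ELEMENT record ({model})>\n")
        for c in columns:
            out.write(f"<!ELEMENT {c} (#PCDATA)>\n")
        out.write("]>\n<table>\n")
        body.seek(0)
        shutil.copyfileobj(body, out)  # append the buffered records
        out.write("</table>\n")


out = io.StringIO()
write_with_internal_dtd(["a", "b"], iter([(1, "x"), (2, None)]), out)
print(out.getvalue())
```

Passing the rows as a one-shot iterator in the usage example mirrors the constraint that the query cannot be re-read.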
5 Related work

To our knowledge DB2XML is the first effort to transform relational data into XML documents (this has been done for SGML documents). Database companies such as Oracle or vendors of application platforms such as Software AG have announced similar tools for the near future.

There have been proposals to define schemas in XML for structured data. One of the first efforts was the XML-Data proposal [1]. The main focus was to describe schemas using XML itself, rather than DTD syntax. XML-Data is used by the Microsoft XML parser as a way to expose DTD information. XML-Data covers a large variety of concepts: class hierarchies, datatypes, constraints etc.

The Document Content Description (DCD) facility for XML is designed for describing constraints to be applied to the structure and content of XML documents [4]. The proposal incorporates a subset of XML-Data and conforms to the RDF Model and Syntax Specification [8]. DCD includes support for describing datatypes supported by SQL, but full support for a database interface is currently not included (it is also listed under future work).

One of the newest efforts is XML Schema, a specification supported by the WWW consortium [2]. The purpose of the XML schema language is to provide an inventory of XML markup constructs with which to write schemas. The purpose of a schema is to define and describe a class of XML documents by using these constructs to constrain and document the meaning, usage and relationships of their constituent parts: datatypes, elements and their content, attributes and their values, entities and their contents. Schema constructs may also provide for the specification of implicit information such as default values.

The rationale behind the above mentioned proposals is that most of the current application programming interfaces for XML (SAX and DOM) have no adequate support for accessing DTDs.
Furthermore, XML will probably always be compatible with SGML, thus all extensions have to conform to SGML. By defining an XML-based schema language, existing XML technologies (parsers, tools) can be reused.

DB2XML relies on DTDs and does not introduce a new meta layer. First of all, DB2XML focuses on databases, while the above mentioned languages aim at a wider range of applications. The second reason is efficiency and simplicity. The introduction of meta layers makes the processing of the data more complex. Since one application area for DB2XML is web-based systems, the simple design of stylesheets is another goal. Heavy usage of meta structures makes the design and processing of stylesheets more difficult.

Unidex has developed the tool XML Convert to convert flat files (e.g., comma separated value files, fixed length records, etc.) into XML documents [12]. The structure of the file has to be described by the user with an XML based language
called XFlat. In DB2XML this is not necessary, since all relevant information is automatically retrieved from the database.
6 Conclusions and Future Work

The widespread usage of XML is still ahead of us. In order to fully utilize the potential of XML, access to legacy data has to be provided for XML tools. DB2XML provides this access for data stored in relational databases (or in other data sources for which a JDBC driver exists). It has been tested as a standalone tool and as a servlet using different databases and drivers. The generated XML documents have been successfully validated. If the proposals for the schema languages mentioned in section 5 become stable, DB2XML will support them.

Currently only the generation of XML documents is supported. The next step would be to allow the import of documents conforming to our specification into the database. This way, XML could be used as an interchange format for relational data. If the proposals for an XML query language (such as XQL [10]) become fixed, such a language could be implemented in DB2XML by mapping XML queries to equivalent SQL queries, executing these in the database and transforming the results into XML documents.
Availability. The latest version of DB2XML including all documentation, servlets, and examples can be downloaded from the DB2XML web site at:
http://www.informatik.fh-wiesbaden.de/turau/DB2XML/index.html
A Appendix: Usage of XSLT in DB2XML

XSLT (XSL Transformations) is a new, powerful tree transformation language based on XML [5]. An XSLT program is called a stylesheet. It consists of a set of template rules which have two parts: a pattern which is matched against a node in the source document and a template which can be instantiated to form part of the result document. An XSLT processor reads the source document, applies the templates in a depth-first manner and produces the output document. The ability of XSLT stylesheets to transform flat XML documents into hierarchically structured documents is very useful for DB2XML. It enables the transformation of relational tables into arbitrarily complex documents.

To illustrate this principle we consider a typical hierarchically structured object: a rooted tree. A directed tree can be represented in a relational database as a table tree with two columns: start and end. There is a row for each edge of the tree. Each node has a unique identifier. The labels for the nodes of the tree are stored in a second table not considered in this example. The root of the tree has a symbolic predecessor root. Figure 4 illustrates from left to right: a graphical, a relational and the XML representation of the tree generated by DB2XML using the query select * from tree.
Graphical representation: a rooted tree with root node 1. Node 1 has the children 2, 3 and 4, node 3 has the children 5 and 6, and node 6 has the child 7.

Relational representation:

start   end
root    1
1       2
1       3
1       4
3       5
3       6
6       7

Simple XML representation (field values only): root 1, 1 2, .....
Fig. 4. Representation of a rooted tree in a relational table and in XML

The following XSLT stylesheet transforms the simple XML representation into a hierarchical representation.
The resulting XML document is shown below. Exchanging its element tags for the corresponding HTML list tags yields a valid HTML representation, based on nested lists.
[Hierarchical XML document; node labels in document order:] 1 2 3 5 6 7 4
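The restructuring step itself, turning the flat edge table into nested elements, can also be sketched outside XSLT. The following Python fragment is not the stylesheet used in DB2XML; the element name `node` and the attribute `id` are hypothetical. It builds the nested structure from the rows of the table tree and lists the nodes in document order.

```python
from xml.etree import ElementTree as ET


def edges_to_tree(edges, root_marker="root"):
    """Turn (start, end) rows into nested <node> elements."""
    children = {}
    for start, end in edges:
        children.setdefault(start, []).append(end)

    def build(node_id):
        elem = ET.Element("node", id=str(node_id))
        for child in children.get(node_id, []):
            elem.append(build(child))  # recurse: depth-first construction
        return elem

    (top,) = children[root_marker]  # the symbolic predecessor has one successor
    return build(top)


# Rows of the table tree from Figure 4.
edges = [("root", 1), (1, 2), (1, 3), (1, 4), (3, 5), (3, 6), (6, 7)]
tree = edges_to_tree(edges)
print([n.get("id") for n in tree.iter("node")])
# prints ['1', '2', '3', '5', '6', '7', '4']
```

The document order of the nested result matches the order of the node labels in the hierarchical document shown above.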
References

1. Andrew L., Jung E., Maler E., Thompson H.S., Paoli J., Tigue J., Mikula N.H., De Rose S.: XML-Data, W3C Note, 05-Jan-1998, (http://www.w3.org/TR/1998/NOTE-XML-data).
2. Beech, D., Lawrence, S., Maloney, M., Mendelsohn, N., Thompson, H.: XML Schema Part 1: Structures, Part 2: Datatypes, W3C Working Draft, 6-May-1999, (http://www.w3.org/TR/xmlschema-1).
3. Bray, T., Paoli, J., Sperberg-McQueen, C.M.: Extensible Markup Language (XML) 1.0, W3C Recommendation, 10-Feb-1998, (http://www.w3.org/TR/REC-xml).
4. Bray, T., Frankston, C., Malhotra, A.: Document Content Description for XML, W3C Note, 31-July-1998, (http://www.w3.org/TR/NOTE-dcd).
5. Deach, S. (editor): Extensible Stylesheet Language (XSL), W3C Working Draft, 21-April-1999, (http://www.w3.org/TR/WD-xsl).
6. IBM: XML for Java Parser v2.0, (http://www.alphaworks.ibm.com/).
7. The Unicode Consortium: The Unicode Standard, Version 2.0. Reading, Mass.: Addison-Wesley Developers Press, 1996.
8. Lassila, O., Swick, R.R.: Resource Description Framework (RDF) Model and Syntax Specification, W3C Recommendation, 22-Feb-1999, (http://www.w3.org/TR/REC-rdf-syntax).
9. Lotus Development Corporation: LotusXSL Version 0.17.2, (http://www.alphaworks.ibm.com/formula/LotusXSL).
10. Robie, J., Lapp, J., Schach, D.: XML Query Language (XQL), W3C-QL '98 Workshop, September 1998, (http://www.w3.org/TandS/QL/QL98/pp/xql.html).
11. Turau, V.: The DB2XML user manual (Version 1.1), Technical report TR-0199, FH Wiesbaden, May 1999, (http://www.informatik.fh-wiesbaden.de/turau/DB2XML/index.html).
12. Unidex Inc.: XML Convert 1.0, 1999, (http://www.unidex.com/).
13. Wood, L. (WG chair): Document Object Model Specification, W3C Working Draft, 16-April-1998, (http://www.w3.org/TR/WD-DOM).