Converting Business Documents: A Classification of Problems and Solutions using XML/XSLT

Erik Wüstner, Thorsten Hotzel, Peter Buxmann
Freiberg University of Technology
Chair of Information Systems / Wirtschaftsinformatik
Lessingstr. 45, 09596 Freiberg / Germany
Phone: ++49 3731 392611, Fax: ++49 3731 393117
WWW: http://www.bwl.tu-freiberg.de/wi

Abstract

The exchange of business documents can be realized either by standardizing or converting these documents. In this paper, we examine different conversion strategies from an economic perspective. In particular, we provide a classification of conversion problems as well as solutions using XML/XSLT.

1. Introduction

Coordination among actors is always based upon the exchange of information. For example, in supply chains the partners exchange business documents, such as forecasts, orders, and invoices. A common problem in this setting is the usage of different standards and formats, e.g. different EDI standards. To achieve compatibility, the partners basically have two possibilities to choose from: the usage of a neutral, standardized format (footnote 1) or a conversion between varying formats. The goal of this paper is to examine different conversion strategies and to provide a classification of potential conversion problems for data-centric [14] XML business documents. Recent work in this area [6] concentrates primarily on transforming DTDs and XML Schemas. We extend this approach by going beyond the pure element-to-element transformation. Our research includes problems that stem from differently formatted data as well as problems due to the varied grouping of XML data.

Footnote 1: The term format is seen in this context as a means for the structured description of information subjects.

In the next section we examine the relationship between standardization and conversion. We will show that both strategies have advantages and disadvantages, which result in a trade-off between standardization and conversion costs. The third section discusses different conversion strategies and their influence on conversion costs. In the fourth section a classification for potential XML conversion problems is presented and methods of resolution are suggested. In terms of finding solutions for these problems we focus on the use of XSLT. Finally, we sum up and give an outlook on further research.

2. The trade-off between standardization and conversion

As already mentioned, actors who are willing to communicate basically have the choice between standardization and conversion in order to achieve interoperability. In the following, we will concentrate on the economic aspects of this issue. In many cases an overall standard can meet individual needs only to a limited extent. If, in contrast, all participants adopt a format of their own, they gain more individuality. As mentioned above, the use of a variety of different formats requires conversion. Figure 1 shows overall standardization costs and overall conversion costs depending on the standardization level. In this context, overall standardization costs include both standardization costs and opportunity costs.

Figure 1: Trade-off between standardization costs and conversion costs.

Standardization costs contain all the costs that are necessary to implement a standard [2], e.g. software costs, hardware costs, and personnel costs. Obviously, standardization costs are proportional to the level of standardization. Opportunity costs represent the drawback resulting from individuality loss. If one is bound to use a non-fitting standard, opportunity costs are higher compared to the use of a completely customized format. We assume diminishing marginal utility for the gradient of the overall standardization costs curve, since, regarding the opportunity costs, the marginal benefit of lowered standardization pressure decreases as more and more flexibility is gained. The graph of overall conversion costs is just reversed, since with high standardization hardly any conversion is necessary, whereas precise conversions between multitudes of formats cause comparatively high costs. Cost models of conversion are discussed in [3] and [6]. In our model, overall conversion costs are the sum of:

• costs for generating the converter:
Costs for generating the converter accrue through developing the necessary software and through acquiring thorough knowledge of the data that has to be converted in terms of “what data in format A is represented by what data in format B”. Commercial converters are available for a limited number of format combinations.

• costs for the actual process of converting business documents:
With a sophisticated conversion instrument, the actual process of converting should be rather simple and cost-efficient on a per document basis. Different conversion strategies can influence this factor.

• and costs resulting from an insufficient conversion result:
An insufficient conversion result can occur if the conversion instrument is error-prone or if information loss could not be avoided. The probability of information loss obviously increases with the heterogeneity of the used formats, as we will demonstrate in the fourth section. Costs resulting from insufficient results include expenses for manual post-editing of the conversion result where this is possible and, where it is not, opportunity costs for the drawback of having insufficiencies.

Apparently there is a trade-off between overall costs of standardization and overall conversion costs. The optimum is where the sum of standardization and conversion costs is minimal. Now, what could such an optimum look like?

Figure 2: A possible scenario for an optimum.

In this example the actors have agreed to use the standards 1 and 2 on a lower level, but to use different formats on a higher level. In a possible scenario, standard 1 could be SOAP [9] for sending the documents, standard 2 could be XML, and on top of these standards, partners would use XML-based vocabularies such as xCBL, OAGIS or cXML for defining their business documents. Furthermore, they would use XSLT stylesheets or Java classes to convert between their formats. However, there are different strategies to organize such conversions. These strategies and their influence on conversion costs are examined in the next section.

3. Conversion strategies for business documents

Conversions can be conducted using different strategies, which stem primarily from network topologies. Their use is not restricted to inter-organizational information exchange; they are also applicable for information exchange between different departments or production sites within larger companies. We assume that bi-directional communication between the partners has to be supported. The following structures represent rather abstract, ideal solutions; in practice, these structures are often combined or mixed.

3.1. Ring Structure

The actors could agree to position their respective formats in a ring structure as shown in figure 3:

Figure 3: A ring structure.

Conversion is conducted by passing the document in a single, pre-defined direction along the ring, until the document finally is converted into the requested target format. The actual implications on conversion costs are:

• costs for generating the converter:
With n participants, n converters are necessary, which keeps the development costs for such converters comparably low. It is crucial to use or develop loss-free converters. The ring structure implies a multiplication of information loss because a possible error tends to propagate throughout the entire conversion process. An advantage of ring structures is their easy extensibility. If a new format is added, only two new converters have to be developed while one converter becomes superfluous. It is a disadvantage that an additional converter is needed if a format of the ring is dismissed.

• costs for the actual process of converting business documents:
These costs are comparatively high, since the document is passed along the ring. In the worst case the document has to be converted n-1 times in order to reach its target format.

• costs resulting from an insufficient conversion result:
These costs also seem to be rather high in ring structures. As mentioned, in this scenario information losses can multiply. Furthermore, if one conversion process is corrupt, many other conversions cannot be done error free.

All in all, ring structures can be recommended for small to medium sized communication scenarios. Furthermore, ring structures are applicable if one can ensure that all conversion processes are reliable and free of information loss. In order to overcome the limitations of this scenario, one could try to temporarily store occurring information loss and then insert this information in a later conversion stage. Obviously, additional conversion costs arise if such techniques are used.

Another idea is to enable conversion in both directions. In this case the maximum number of conversions would decrease from n-1 to n/2, but the number of needed converters would double. This very approach of shortcuts between formats leads to peer-to-peer structures, where every possible shortcut is utilized.

3.2. Peer-to-Peer Structure

A peer-to-peer structure is the equivalent of an intermeshed network, so every format can be directly converted into any other format.

Figure 4: A peer-to-peer scenario.

Implications on conversion costs are:

• costs for generating the converter:
The participation of n actors with up to n formats implies the use of up to n(n-1) converters. Therefore, development costs can be considerably high. Peer-to-peer scenarios are restricted in terms of extensibility since up to 2n new converters have to be developed for every new format.

• costs for the actual process of converting business documents:
Conversion is fast and costs are comparatively low, since documents can be converted in just one conversion step.

• costs resulting from an insufficient conversion result:
If the converter is accurate, insufficient conversion results should occur rather seldom, since every converter can be thoroughly adjusted to both involved formats. In contrast to ring structures, an information loss does not alter the results of other conversions.

Thus, a peer-to-peer strategy is worthwhile for scenarios with a small and stable number of actors. Only in this way can they benefit from sufficient and fast conversion processes while avoiding unreasonably high costs for developing converters.
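The converter counts and step counts stated for ring and peer-to-peer structures can be checked with a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names are hypothetical, formats are represented by ring positions 0 to n-1, and a converter is counted once per direction.

```python
def ring_steps(source: int, target: int, n: int, bidirectional: bool = False) -> int:
    """Conversion steps needed to get from ring position source to target."""
    forward = (target - source) % n          # steps in the pre-defined direction
    if bidirectional:
        return min(forward, n - forward)     # may also travel the other way round
    return forward


def worst_case_ring_steps(n: int, bidirectional: bool = False) -> int:
    """Largest number of steps any document needs in a ring of n formats."""
    return max(ring_steps(0, target, n, bidirectional) for target in range(n))


def ring_converters(n: int, bidirectional: bool = False) -> int:
    """One converter per ring edge, or two if both directions are supported."""
    return 2 * n if bidirectional else n


def peer_to_peer_converters(n: int) -> int:
    """One converter for every ordered pair of distinct formats."""
    return n * (n - 1)


n = 6  # assumed example size
print("ring, one direction :", ring_converters(n), "converters,",
      worst_case_ring_steps(n), "steps at most")
print("ring, both directions:", ring_converters(n, True), "converters,",
      worst_case_ring_steps(n, True), "steps at most")
print("peer-to-peer         :", peer_to_peer_converters(n), "converters, 1 step")
print("converters added by a 7th peer-to-peer format:",
      peer_to_peer_converters(n + 1) - peer_to_peer_converters(n))
```

For six formats the sketch reproduces the figures from the text: n-1 = 5 steps at most in a one-directional ring, n/2 = 3 with twice the converters, and 2n = 12 additional converters when a peer-to-peer network grows by one format.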

3.3. Usage of an intermediary format

This strategy utilizes the concept of an intermediary format S that mediates in all document exchange processes within the network. Therefore, with n formats, 2n converters are needed. As shown in figure 5, any of the formats can be converted into each other via S:

Figure 5: Usage of an intermediary format S.

The implications of the usage of an intermediary format S on conversion costs are:

• costs for generating the converter:
In this scenario the costs include two components: costs for developing the intermediary format itself and costs for developing the necessary converters. In order to avoid opportunity costs of standardization, the intermediary format should align to the other formats, instead of vice versa. Costs for developing S can be considerable, since S should allow for all possible elements and structures of the participating formats. The expenses for developing the converters should not be too high, because S itself should be designed for proper convertibility. The extensibility of this strategy is good; a new format requires the creation of two new converters. If a format is dismissed, no additional costs emerge.

• costs for the actual process of converting business documents:
These costs are rather low, since every document is converted twice.

• costs resulting from an insufficient conversion result:
To ensure a conversion with minimum loss of information, S has to cover the full range of possibilities of all the involved formats, which indeed can be difficult. Similar to peer-to-peer structures, a possibly occurring information loss does not influence the results of other conversions.

The use of an intermediary format is recommended for scenarios with a large number of actors. In this way, good conversion speed and relatively low costs for creating conversion means on a per actor basis can be achieved. Table 1 gives a summary of the suitability of the mentioned strategies for scenarios with a different number of actors, where “+” means good suitability, “o” means average suitability and “-” means poor suitability.

Table 1: Suitability of conversion strategies.

Strategy / No. of Actors   Ring Structures   Peer-to-Peer Structures   Intermediary Format
Small                      +                 +                         o
Medium                     +                 o                         o
Large                      -                 -                         +

4. Problems and their solution when converting XML business documents

The Extensible Markup Language (XML) [10] is a standard to describe and exchange business documents. XML is a rather fundamental specification, which is just the basis for developing concrete standards, which then, on a higher layer, define means for storing and exchanging data in a content-dependent manner. According to a recent study [5], there are more than 250 different XML-based e-business vocabularies that describe the content and the structure of business documents. Partners willing to communicate can use these XML vocabularies to exchange different data [1]. They agree on the basic standard XML, but can adopt an XML vocabulary that fits their specific needs. Technologies such as Extensible Stylesheet Language Transformations (XSLT) can be used to convert XML documents [12]. Our working hypothesis is that in our model the use of XML and of related standards such as XSLT and XPath implies a right-shift of the overall conversion costs curve, as illustrated in figure 6:

Figure 6: XML and XSLT imply lowered overall costs.

We base this hypothesis on the improved conversion possibilities when using XML as a means for electronic data interchange. This concerns especially the costs for generating the converter, which is, in the case of XML, often an XSLT stylesheet. Such a stylesheet is applied to an XML document by an XSLT processor such as Xalan of Apache [13]. The decreased costs for generating the converter in comparison to traditional converting result from the following aspects:

• XML grammar is defined in DTDs or XML Schemas [11]. In terms of conversion, schemas provide additional functionality since they have more data types and can be parsed for further processing. This results in more implicit knowledge about the data that shall be converted.

• XML is human readable, so conversion information can often be extracted directly from the document.

• XML works well with common programming languages such as C++ and Java.

• There is a large developer community around XML, and much of the available software is open source.

Therefore, the right-shift of the overall conversion costs curve implies a new optimum that allows more individuality for the involved partners, while the overall costs have decreased. To sum up, the use of XML and related technologies results in more flexibility and lower conversion costs. In the following, we examine the translation and conversion of XML-defined business documents via XSLT at the syntactical and lexical level. Since we do not focus on the semantic level, creating the stylesheet requires human assistance for building conversion rules (footnote 2). We try to categorize problems involved in the process of XML transformation and provide solutions and XSLT snippets where applicable to solve them. Recent work in this area [6] concentrates mainly on converting DTDs or XML Schemas. This approach focuses on the conversion of tree structures, while the actual data enclosed within the tags is left aside.
Because the conformity of this data is crucial to further processing, our research extends this approach by concentrating on this data and on structural problems that cannot be solved by simply transforming DTDs or XML Schemas. Before we go on with describing possible problems, we introduce a terminology for the following analysis.

• We say markup when we mean tag names and the corresponding attributes,

• we say structure when we mean the nesting level of the tags in the document,

• we say core data when we mean the content of the character data encapsulated in tags and attributes,

• we speak of core data format when we mean the format of the character data encapsulated in tags, and

• we say grouping when we mean a set of tags which form a group by belonging to the same context or sub context.

Finally, we say overall content information when we mean the information provided by the combination of core data, markup and structure. Permutations of these elements raise a number of problems, which are presented in the following classification.

4.1. Markup is different, overall content information is equal (footnote 3)

Converting XML documents which only differ by their markup is the easiest of all possible cases. The relation of the markup in the source document to that in the target document is 1:1, as shown in the example below. That means that for each tag or attribute to be converted, there is a corresponding tag or attribute in the target document containing the same core data. In this case, transformation is simply done by inserting the core data into the matching target markup, which often results just in “renaming the tags”.
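The idea of renaming tags through a 1:1 mapping can also be illustrated outside XSLT. The following Python sketch uses the standard library module xml.etree.ElementTree instead of a stylesheet; the tag names and the mapping table are hypothetical and are not taken from figure 7 or from any real vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical 1:1 mapping from source tag names to target tag names.
TAG_MAP = {
    "Order": "PurchaseOrder",
    "Buyer": "Customer",
    "Name": "CompanyName",
}


def rename_markup(source_xml: str, tag_map: dict) -> str:
    """Rename every mapped tag; core data and structure stay untouched."""
    root = ET.fromstring(source_xml)
    for element in root.iter():
        # Tags without an entry in the mapping are kept as they are.
        element.tag = tag_map.get(element.tag, element.tag)
    return ET.tostring(root, encoding="unicode")


source = "<Order><Buyer><Name>ACME</Name></Buyer></Order>"
converted = rename_markup(source, TAG_MAP)
print(converted)
```

Because only the tag names change, the reverse conversion is obtained by inverting the mapping table, provided the mapping really is 1:1.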

Figure 7: Example for different markup.

The XSLT statement to transform the document of figure 7a into the document of 7b is:

4.2. Core data format differs, whereas core data itself is the same (footnote 4)

As already mentioned, the conversion of XML documents includes the conversion of the core data stored inside an XML document. The congruency of core data is crucial to its further automated processing. Figure 8 shows a simple example of the variety of possibilities to express core data in different core data formats.

Figure 8: A date in US format and in German format.

Footnote 2: A completely automatic generation of the stylesheet can, if possible at all, only be achieved if the conversion process starts at the semantic level by utilizing technologies such as RDF [8], RDF-Schema and ontologies [7]. These technologies provide meta-information to describe the tags used in the XML representation of a business document.

Footnote 3: An example in practice is the conversion of xCBL’s into OAGIS’ and vice versa.

Footnote 4: An example in practice is the conversion of xCBL’s into OAGIS’ and vice versa.

Converting information with XSLT in this case is comparably easy, since the core data is equal and only its format differs. The XSLT statement needed to transform the data has to be based on a known and fixed algorithm. This algorithm contains information on character order, separating characters, calculation rules, etc. Within the XSLT statement, a number of string operations have to be performed. In more complex cases, such as country codes, lookup tables can be used to determine the appropriate identifier. A lookup table contains replacements and equivalences of certain values.

Figure 9: Example for different country codes.

In the example of figure 9 a lookup table can be used to get the appropriate country code identifier [4] for “taiwan”. Other examples for the usage of lookup tables include units, currencies and metrics.
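Both techniques of this section, a fixed string algorithm and a lookup table, can be sketched in a few lines of Python. The concrete formats are assumptions made for illustration only: the US date is taken to be MM/DD/YYYY, the German date DD.MM.YYYY, and the lookup table is a small hypothetical excerpt that maps country names to ISO 3166 alpha-2 codes.

```python
def us_to_german_date(us_date: str) -> str:
    """Reorder MM/DD/YYYY into DD.MM.YYYY (assumed formats)."""
    month, day, year = us_date.split("/")
    # Only character order and separator change; the date itself is the same.
    return f"{day.zfill(2)}.{month.zfill(2)}.{year}"


# Hypothetical excerpt of a lookup table: country name -> ISO 3166 alpha-2 code.
COUNTRY_CODES = {
    "taiwan": "TW",
    "germany": "DE",
    "united states": "US",
}


def country_code(name: str) -> str:
    """Look up the code for a country name; unknown names raise KeyError."""
    return COUNTRY_CODES[name.strip().lower()]


print(us_to_german_date("12/31/2001"))
print(country_code("taiwan"))
```

Letting unknown names raise an error rather than guessing a code keeps an incomplete lookup table from silently producing an insufficient conversion result.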

4.3. Different structure depth, while the structure itself contains meta information (footnote 5)

The structure of an XML document is the extent to which information is drilled down into logical subunits. Figure 10 illustrates two different formats, which contain, strictly speaking, just two names. Format 10a) explicitly states the information that the first name is “John” and the last name is “Doe”, while one needs semantic background knowledge in order to retrieve the same information from format 10b), because the -Tag of format 10b) does not give enough meta information on its core data.

Figure 10: Content can be stored in different ways.

The XSLT statement to convert document 10a) into 10b) is:

Footnote 5: An example in practice is the conversion of the sub-elements of OAGIS’ into xCBL’s .

One problem is that the structure of 10b)’s core data has to be known. This knowledge can be defined in an algorithm for generating the appropriate structure. A similar problem occurs when converting from 10b) to 10a). In this case, one has to know what character is used to separate the sub groups of 10b)’s core data and which sub group belongs to which tag of 10a). The above XSLT example assumes that a blank space is used as a separator between first name and last name.
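Both directions of this conversion can be sketched in Python with the standard library. The tag names Person, FirstName, LastName and Name are hypothetical stand-ins for the markup of figure 10, and, as in the text, a single blank is assumed to separate first name and last name.

```python
import xml.etree.ElementTree as ET


def merge_name(person_xml: str) -> str:
    """Deep structure to flat structure: join the two sub-elements with a blank."""
    person = ET.fromstring(person_xml)
    name = ET.Element("Name")
    name.text = f"{person.findtext('FirstName')} {person.findtext('LastName')}"
    return ET.tostring(name, encoding="unicode")


def split_name(name_xml: str) -> str:
    """Flat structure to deep structure: split the core data at the first blank."""
    name = ET.fromstring(name_xml)
    # Assumption: everything before the first blank is the first name.
    first, last = name.text.split(" ", 1)
    person = ET.Element("Person")
    ET.SubElement(person, "FirstName").text = first
    ET.SubElement(person, "LastName").text = last
    return ET.tostring(person, encoding="unicode")


deep = "<Person><FirstName>John</FirstName><LastName>Doe</LastName></Person>"
flat = merge_name(deep)
print(flat)
print(split_name(flat))
```

The split direction is the fragile one: a value with more than one blank, such as a double first name, would be divided at the first blank whether or not that is semantically right, which is exactly the background knowledge the flat format does not carry.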

4.4. Core data is the same, overall content information differs (footnote 6)

Figure 11 shows two examples where the overall content information is different while the core data is the same. Format 11a) just stores two plain phone numbers, whereas format 11b) provides additional information about the network type of the phone number. This causes a problem if a conversion from 11a) to 11b) is needed, since there is simply no knowledge available about the network type in format 11a). In a few cases, lookup tables might be helpful. Otherwise, one could only try to guess. Hence, without additional information a correct and reliable conversion is rather unlikely in this case. Converting from 11b) to 11a) poses no immediate difficulty, but nevertheless implies the loss of information about the network type of the phone numbers. This issue is a typical example of a situation where information loss cannot be avoided.

Figure 11: Content can be stored in different ways.

4.5. Different grouping

An XML document consists of groups of tags, which form semantic units through their nesting. When converting XML documents, one implicitly assumes that the structuring of data is unambiguous. In fact, reality is more complex: there are a variety of possibilities to logically structure information. This situation is illustrated in figure 12. While in 12a) information about the VAT rate logically belongs to the 'item' context, the document in 12b) presumes that information on the VAT rate is part of the 'Invoice Summary'. In such cases, exact and reliable conversion is strongly questionable, because it is by all means unsafe to infer information between varying contexts. Nevertheless, in practice a conversion can be done, e.g. 12b)’s element can be generated with:

Figure 12: Different Grouping.

Footnote 6: An example in practice is the conversion of telephone numbers between OAGIS and xCBL.

4.6. Redundant data (footnote 7)

Some XML business vocabularies contain additional summaries or calculated results, which may be useful for the further processing of the document. This type of information is redundant since it can be obtained by processing, e.g. through calculating, summarizing, filtering, or formatting other data given in the document. The following figure shows two formats. The tags , and of 13b) are redundant, since their core data is the result of processing other data, which is already contained in the document. Conversion from 13a) to 13b) requires the generation of the summary tags by processing core data of 13a), whereas in the opposite direction the summary tags get lost. A sample XSLT statement to generate the value of the -tag of 13b) is

Figure 13: Example for redundant information.

Footnote 7: An example for redundant data is the tag within xCBL’s .
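Generating redundant summary data from the core data, and dropping it again in the opposite direction, can be sketched in Python as follows. All tag names (Invoice, Item, Quantity, UnitPrice, Summary, ItemCount, Total) are hypothetical and are not taken from figure 13; the summary is assumed to consist of an item count and a total amount.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal


def add_summary(invoice_xml: str) -> str:
    """Derive the redundant summary tags from the items already in the document."""
    invoice = ET.fromstring(invoice_xml)
    items = list(invoice.iter("Item"))
    total = sum(
        (Decimal(item.findtext("Quantity")) * Decimal(item.findtext("UnitPrice"))
         for item in items),
        Decimal("0"),
    )
    summary = ET.SubElement(invoice, "Summary")
    ET.SubElement(summary, "ItemCount").text = str(len(items))
    ET.SubElement(summary, "Total").text = str(total)
    return ET.tostring(invoice, encoding="unicode")


def strip_summary(invoice_xml: str) -> str:
    """Opposite direction: the summary tags are simply dropped."""
    invoice = ET.fromstring(invoice_xml)
    for summary in invoice.findall("Summary"):
        invoice.remove(summary)
    return ET.tostring(invoice, encoding="unicode")


plain = (
    "<Invoice>"
    "<Item><Quantity>2</Quantity><UnitPrice>10.50</UnitPrice></Item>"
    "<Item><Quantity>1</Quantity><UnitPrice>4.25</UnitPrice></Item>"
    "</Invoice>"
)
with_summary = add_summary(plain)
print(with_summary)
```

Decimal is used instead of float so that monetary amounts are not distorted by binary rounding. Since the summary can always be recomputed, stripping it loses no overall content information.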

4.7. Missing Information

It is obvious that documents which shall be converted into each other have to contain the same overall information. Correct and precise conversion between formats with asymmetric overall information is by definition not possible. In practice, however, this presumption is too strict. A common example is information about time in business documents. A number of XML business vocabularies require this kind of information, while many others do not. If such information is not necessary for the recipient of an XML document, the insertion of a “null” value might be a pragmatic alternative compared to no conversion and hence no data interchange at all.
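The pragmatic insertion of a “null” value can be sketched in Python as follows. The tag names and the choice of the literal text "null" as placeholder are assumptions for illustration; a real target vocabulary may prescribe its own way of marking absent values.

```python
import xml.etree.ElementTree as ET


def fill_missing(document_xml: str, required_tags: list, placeholder: str = "null") -> str:
    """Append every required top-level element that the source does not provide."""
    root = ET.fromstring(document_xml)
    for tag in required_tags:
        if root.find(tag) is None:
            # The source format carries no such information; insert a placeholder.
            ET.SubElement(root, tag).text = placeholder
    return ET.tostring(root, encoding="unicode")


order = "<Order><Date>2001-12-18</Date></Order>"
completed = fill_missing(order, ["Date", "Time"])
print(completed)
```

Existing elements are left untouched, so the function only compensates for information the source format never contained.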

5. Outlook

In this paper, we examined different strategies for converting business documents. In particular, we showed how documents which are described in XML can be converted using XSLT. To this end, the paper provides a classification of conversion problems and respective solutions. Our thesis remains that the usage of this technology reduces conversion costs for the various reasons stated in section 4. In the future, we want to support this thesis by conducting case studies in order to compare recent experiences of large enterprises in the field of traditional EDI to the new conversion possibilities of XML/XSLT. Furthermore, we aim at applying existing cost models for conversion [3][6] to the usage of XML and XSLT. Moreover, we plan to develop the existing conversion module STYX [1] into a conversion server and to offer it as a web service, which will be registered in UDDI and described in WSDL. Currently, we are developing a graphical user interface that supports a mapping between different formats and automatically generates the respective XSLT stylesheets.

However, the conversion is conducted on a syntactical and lexical level. In contrast, approaches like RDF [8] and ontologies [7] touch on issues such as the semantic web and artificial intelligence, whereas the practical approach is largely based on technologies like XSLT and on the human factor. As long as there is no way of storing information free of context, there will be no possibility to entirely automate the process of exchanging data. Even if there were a way to do so, the necessity of an exact, general and unique definition of meaning at some level of data storage would remain. Probably the conversion problem would rather shift to standardizing ‘atomic’ information units on a lower level. But then another question arises: what is an ‘atomic’ information unit? Is it the lowest semantic content of data one can think of? Is it a word or a character, and if so, in which language?

Acknowledgements: We gratefully acknowledge the support from the German National Science Foundation for our work.

References:

[1] P. Buxmann, L. Martín Díaz, E. Wüstner: XML-based supply chain management – as SIMPLEX as it is -, in Proceedings of the 35th Hawaii International Conference on System Sciences (HICSS 2002), 2002.
[2] P. Buxmann, T. Weitzel, F. Westarp, W. König: The Standardization Problem in Networks - A General Framework, in Jakobs, K. (ed.): Standards and Standardization: A Global Perspective, Idea Publishing Group, 1999.
[3] S. S. Chawathe, A. Rajaraman, H. Garcia-Molina, J. Widom: Change detection in hierarchically structured information, in SIGMOD, pages 493-504, 1996.
[4] ISO: ISO 3166, http://www.din.de/gremien/nas/nabd/iso3166ma/index.html, visited on 2001-12-18.
[5] A. Kotok: Extensible and more: an updated survey of XML business vocabularies, http://www.xml.com/pub/2000/08/02/ebiz/extensible.html, August 2000, visited on 2001-12-17.
[6] H. Su, H. Kuno, E. Rundensteiner: Automating the transformation of XML documents, Workshop on web information and data management (WIDM ´01), November 2001.
[7] M. Uschold, M. Gruninger: Ontologies: Principles, Methods and Applications, Knowledge Engineering Review, Volume 11 Number 2, June 1996.
[8] W3C: RDF, http://www.w3.org/TR/REC-rdf-syntax, 1999, visited on 2001-12-17.
[9] W3C: SOAP, http://www.w3.org/TR/soap12-part1/, 2001, visited on 2002-01-02.
[10] W3C: XML, http://www.w3.org/TR/2000/REC-xml20001006, 2000, visited on 2001-12-17.
[11] W3C: XML Schema, http://www.w3.org/XML/Schema, 2001, visited on 2001-12-17.
[12] W3C: XSL, http://www.w3.org/TR/xslt, 1999, visited on 2001-12-17.
[13] XML Apache Group: XSLT-Processor Xalan, http://xml.apache.org, visited on 2001-12-22.
[14] No Author: The Attribute/Text Conundrum – “Document-Centric” vs. “Data-Centric”, http://www.xmleverywhere.com/newsletters/20000525.htm, April 2001, visited on 2002-01-02.
