Checking Determinism of XML Schema Content Models in Optimal Time

Pekka Kilpeläinen, University of Eastern Finland, School of Computing, Kuopio

Abstract We consider the determinism checking of XML Schema content models, as required by the W3C Recommendation. We argue that currently applied solutions have flaws and make processors vulnerable to exponential resource needs by pathological schemas, and we help to eliminate this potential vulnerability of XML Schema based systems. XML Schema content models are essentially regular expressions extended with numeric occurrence indicators. A previously published polynomial-time solution to check the determinism of such expressions is improved to run in linear time, and the improved algorithm is implemented and evaluated experimentally. When compared to the corresponding method of a popular production-quality XML Schema processor, the new implementation runs orders of magnitude faster. Enhancing the solution to take further extensions of XML Schema into account without compromising its linear scalability is also discussed. Key words: Regular expression, numeric occurrence indicator, one-unambiguity, weak determinism, unique particle attribution, Java

1. Introduction and motivation

XML (Bray et al., 1998) is a popular method for representing data as text documents, especially for delivery over the Internet. Applications of XML have become ubiquitous. It is used, for example, to represent Web pages (W3C, 2000), as a file format for office documents (ISO, 2008), and as a basis of specifications for exchanging medical records (ANSI, 2005) and for conducting business over the Internet (ebXML, 2002), just to mention a few application areas. XML uses markup tags to denote the boundaries of nested elements that constitute the structure of documents. Processing of XML documents becomes more reliable by applying grammars, called schemas, for validating document instances by checking that their structure and their content agree with the constraints specified by the schema. Various XML schema definition languages have

Preprint submitted to Information Systems

October 26, 2010

been proposed.1 The oldest and the simplest of them is the DTD (Document Type Definition) formalism, which is defined in XML Recommendation (Bray et al., 1998) as a simplification of a similar notation originally introduced for SGML (ISO, 1986). Schema languages more powerful than DTDs have been proposed since then, the most popular of them currently being Relax NG (Clark and Murata, 2001) and W3C XML Schema (Thompson et al., 2001, 2004). A good theoretical comparison of these dominant XML schema languages is given by Murata, Lee, Mani, and Kawaguchi (2005). Readable practical introductions to XML Schema include (Fallside and Walmsley, 2004) and (Walmsley, 2002). A study of which features of DTDs and XML Schema are used in practical document schemas has been carried out by Bex et al. (2004). W3C XML Schema has become among database practitioners the most popular of these dominant XML schema languages; see, e.g., Bex et al. (2009). Indeed, XML Schema includes features that are expected of database schema languages, like a type system with base types similar to those of SQL and the ability to attach schema-defined types to document elements, and to express key and uniqueness constraints. The type system of XML Schema also forms the basis for the typing of the W3C XML query language XQuery (Boag et al., 2007). Major XML schema languages all use (their syntactic variants of) regular expressions (Kleene, 1956) called content models for constraining instances of element content types defined by the schema. Content model expressions describe sequences of elements using operators for concatenation, choices, and iteration. XML Schema uses numeric occurrence indicators, given with attributes minOccurs and maxOccurs, to express the required minimum and the allowed maximum number of iterations of a subexpression. This is a syntactic extension to the unrestricted Kleene star operator of standard regular expressions. 
(Numeric occurrence indicators appear also in POSIX “interval expressions” (IEEE, 2001, Chap. 9) and in Perl patterns (Wall and Schwartz, 1991; Hazel, 2003).) Below is an example of a regular expression with numeric occurrence indicators, call it $E$, using a simplified abstract syntax:

$$(a^{2..3} \mid b)^{2..2}\, b$$

An XML Schema content model which corresponds to this expression is shown in Figure 1. An intuitive reading of this expression could be “accept sequences that consist of exactly two repeats of a choice between a sequence of two-to-three a elements or a single b, followed by a single b”. Declarative semantics for an expression $E$ is given by stating the language $L(E)$ defined by the expression, which is here the following set of element sequences:

$$\{aaaab,\ aaaaab,\ aaaaaab,\ aabb,\ aaabb,\ baab,\ baaab,\ bbb\}$$
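The declarative semantics can be made concrete with a small enumerator. The following is only a sketch under stated assumptions: the tuple encoding of expressions and the function name `language` are hypothetical and not taken from the article, only finite maximum bounds are supported, and the enumeration is exponential in general, so it is suitable for toy expressions only.

```python
# Hypothetical tuple encoding of expressions (an assumption of this sketch):
#   ("sym", "a")        a single symbol
#   ("cat", F, G)       concatenation F G
#   ("alt", F, G)       choice F | G
#   ("opt", F)          optional F?
#   ("rep", F, m, n)    numeric iteration F^{m..n} with a finite maximum bound n

def language(e):
    """Return L(e) as a set of strings (one character per symbol)."""
    kind = e[0]
    if kind == "sym":
        return {e[1]}
    if kind == "cat":
        return {u + v for u in language(e[1]) for v in language(e[2])}
    if kind == "alt":
        return language(e[1]) | language(e[2])
    if kind == "opt":
        return language(e[1]) | {""}
    if kind == "rep":
        body, m, n = language(e[1]), e[2], e[3]
        words, power = set(), {""}      # power holds L(F)^i, starting with i = 0
        for i in range(n + 1):
            if i >= m:
                words |= power          # collect L(F)^i for m <= i <= n
            if i < n:
                power = {u + v for u in power for v in body}
        return words
    raise ValueError(f"unknown node kind: {kind}")

# E = (a^{2..3} | b)^{2..2} b
E = ("cat",
     ("rep", ("alt", ("rep", ("sym", "a"), 2, 3), ("sym", "b")), 2, 2),
     ("sym", "b"))

print(sorted(language(E), key=lambda w: (len(w), w)))
```

Running the sketch on $E$ lists the eight words given above. Unfolding iterations this way is exactly what becomes infeasible for large occurrence bounds.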

1 We use “schema” written in lower case as a generic noun, and “XML Schema” capitalized to refer to the schema definition language specified by the W3C XML Schema Recommendation.


Figure 1: An XML Schema content model.

According to the study of Bex et al. (2004), most content model expressions in practical schemas are simple; nested iterations in particular are rare. The same appears to hold for numeric occurrence indicators, too, but they do appear in some production schemas. For example, the ISO/IEC MPEG-7 Version 2 Schema (footnote 2) gives for IntegratedCoordinateSystem elements the following content model with nested numeric occurrence indicators:

$$(\mathit{TimeIncr}\ \mathit{MotionParams}^{2..12})^{0..65535}$$

This specifies that those elements consist of 0, ..., 65535 repeats of a concatenation of a TimeIncr element with 2, ..., 12 MotionParams elements. This content model happens to be a hard case for current XML Schema processors; we will return to this while discussing experiments in Section 4. It is easy to see that numeric occurrence indicators do not extend the expressive power of regular expressions, since they can be replaced by repeated concatenations of subexpressions. The problem with such expansion of numeric occurrence indicators is that it may lengthen content model expressions by an exponential factor (Kilpeläinen and Tuhkanen, 2003, 2007). Indeed, the succinctness of the numeric notations does have an essential impact on the complexity of problems related to regular expressions (Gelade et al., 2009b). Earlier formalizations of XML Schema content models, for example (Martens et al., 2006), have mostly ignored numeric occurrence indicators, until the preliminary inspections by Kilpeläinen and Tuhkanen (2003), and the more recent and more thorough studies by Gelade, Martens, and Neven (2007; 2009b). The latter works consider, in addition to numeric occurrence indicators, also the interleaving operator which is used in Relax NG content models, but they do not consider the determinism of content models. XML Schema content models are restricted by a requirement that they are deterministic. The XML Schema Recommendation calls this condition the unique particle attribution (UPA) property.
2 http://standards.iso.org/ittf/PubliclyAvailableStandards/MPEG-7_schema_files/mpeg7-v2.xsd

The same condition has also been called unambiguity (ISO, 1986), one-unambiguity (Brüggemann-Klein and Wood, 1998), and weak determinism (Sperberg-McQueen, 2005; Gelade et al., 2009a). Walmsley (2002) has given the following simplified explanation for the determinism constraint of XML Schema:

A schema processor, as it makes its way through the children of an instance element, must be able to find only one branch of the content model that is applicable, without having to look ahead to the rest of the children.

Determinism of expressions can be discussed informally by considering possible paths through the symbols of the expression. For this we consider markings of expressions, where symbol occurrences are called positions, which are represented by uniquely indexed copies of symbols. For example, $E' = (a_1^{2..3} \mid b_1)^{2..2}\, b_2$ is a marking of the above expression $E$, where symbol $a$ occurs at position $a_1$, and symbol $b$ occurs at positions $b_1$ and $b_2$. It may be possible to continue some initial path of positions through an expression by two different positions $a_i$ and $a_j$ that are occurrences of a common symbol $a$. If this is the case, we say that these positions compete, and the expression is nondeterministic; otherwise the expression is deterministic. As a simple example, the expression $F = a_1^{1..2}\, a_2$ is nondeterministic, since an initial position $a_1$ could be continued, as a path through $F$, by the competing occurrences $a_1$ and $a_2$ of symbol $a$. The equivalent expression $a_1^{2..3}$ is, on the other hand, trivially deterministic, simply because no symbol occurs twice in it. The above expression $E$ is seen to be deterministic by considering the language of its marking $E'$, which is

$$L(E') = \{a_1a_1a_1a_1b_2,\ a_1a_1a_1a_1a_1b_2,\ a_1a_1a_1a_1a_1a_1b_2,\ a_1a_1b_1b_2,\ a_1a_1a_1b_1b_2,\ b_1a_1a_1b_2,\ b_1a_1a_1a_1b_2,\ b_1b_1b_2\}.$$

It is easy to check that the positions $b_1$ and $b_2$, which are the only repeated occurrences of any symbol in $E'$, never compete as a continuation of a common prefix in the words of $L(E')$. A slight change in the occurrence bounds of an expression can change its status from deterministic to nondeterministic, or vice versa. For example, consider the variation $E_1' = (a_1^{1..2} \mid b_1)^{2..2}\, b_2$ of the above (deterministic) expression $E'$.
This modified expression has positions $b_1$ and $b_2$ as competing continuations of paths that start with $a_1 a_1$, which makes $E_1'$ nondeterministic.

The rationale for requiring determinism of content models is not quite clear. The original intent of determinism, as introduced in SGML, was to make content model expressions easier for humans to read (Goldfarb, 1990, p. 414). Currently determinism appears to be mainly useful as an aid to efficient implementations. Deterministic content models of XML DTDs can be translated to deterministic automata in linear time (footnote 3) (Brüggemann-Klein, 1993a), which supports the implementation of efficient validating XML parsers. While nondeterministic regular expressions can be translated to Glushkov automata in quadratic time, which is worst-case optimal (Brüggemann-Klein, 1993a), the determinization of these automata requires in the worst case exponential time and space (Meyer and Fischer, 1971). The determinism constraint appears also useful for implementing efficient XML Schema validators: Kilpeläinen and Tuhkanen (2004) have sketched how to implement deterministic XML Schema content models by a hierarchy of deterministic automata extended with counters. Solutions for validating nondeterministic XML Schema content models with automata of polynomial size have not been presented. Kilpeläinen and Tuhkanen (2003) have shown that nondeterministic regular expressions with numeric occurrence indicators can be matched in polynomial time, but their dynamic programming based solution is mainly of theoretical interest, since it requires quadratic space with respect to the length of the word to be matched.

The problem of checking the determinism of XML Schema content models is of both theoretical and practical interest. Solvability and complexity of algorithmic problems are of academic interest in general. As a practical concern, it is crucial that technologies which are used in an open environment like the Internet are robust; that is, they should work correctly and without wasting resources. It is not sufficient that implementations are able to deal with benevolent inputs which satisfy expectations of their implementers. In addition, they shall be immune to problems caused by pathological inputs, which might be created intentionally by attackers who try to paralyze systems. XML Schema is used as a typing mechanism to describe the format of XML messages interchanged between clients and servers in Web services (Christensen et al., 2001; Alonso et al., 2004; Papazoglou, 2008). It is important that processing of schemas which might be loaded, say, through some external service directory does not use excessive amounts of resources. The experiments discussed in Section 4 indicate that currently this is not the case: the implementation of UPA checking in a popular XML Schema validator is vulnerable to denial-of-service attacks using pathological content models.

3 As pointed out by Gelade et al. (2009a), this holds in the case of a fixed alphabet; otherwise translating a deterministic content model into a deterministic automaton requires quadratic time in the worst case.
Efficient and correct solutions for checking the determinism of XML Schema content models are not widely known. Appendix H of the XML Schema Recommendation outlines a complex sequence of operations for testing determinism; the procedure includes unfolding numeric occurrence indicators, translating the expression into an automaton with epsilon transitions, and determinizing the automaton. Direct implementations of this procedure are inherently exponential: Both the unfolding of numeric occurrence indicators and the determinization can lead to automata of exponential size (Kilpeläinen and Tuhkanen, 2007; Meyer and Fischer, 1971). Kilpeläinen and Tuhkanen (2007) published a polynomial-time method for testing whether a given regular expression with numeric occurrence indicators is deterministic. They also reviewed earlier attempts of solutions for this problem, and showed them erroneous. A recent in-depth study of deterministic regular expressions with numeric occurrence indicators by Gelade, Gyssens, and Martens (2009a) considers the difference between weak determinism, as considered here, and its more restrictive variant called strong determinism, which differs from the XML Schema UPA constraint. Gelade, Gyssens, and Martens present a cubic-time algorithm for testing whether a given expression is strongly deterministic. They also show that weakly deterministic expressions (with numeric occurrence indicators) are

strictly more expressive than strongly deterministic ones; the expressive power of the latter is shown to be equal to that of ordinary deterministic expressions (without numeric occurrence indicators). Weakly deterministic expressions are also more succinct than strongly deterministic ones: there are weakly deterministic expressions with numeric occurrence indicators for which the size of the smallest equivalent strongly deterministic expression is exponentially bigger (Gelade et al., 2009a, Th. 6). The current article is a continuation of (Kilpeläinen and Tuhkanen, 2007), which formed the theoretical breakthrough for the polynomial-time checking of weak determinism of regular expressions with numeric occurrence indicators. The main motivation of the current article is, in addition to a small order-of-magnitude improvement of the previous result, to draw attention to the potential vulnerability of XML Schema processors caused by exponential-time methods, and, as a cure for this vulnerability, to make the new determinism-checking method better known by discussing its efficient implementation, usability, and relevance for practical XML Schema processing. Shortcomings of solutions for checking the determinism of XML Schema content models are not restricted to published algorithms; they afflict practical implementations, too. Apache Xerces (1999-2005) may be the most popular XML Schema processor currently. It is installed on millions of computers as the default XML parser of the Java Platform. Implementations in C++ and Perl are also available. Despite its good reputation as a reliable processor, Xerces also fails on some surprisingly simple tasks.
The content model $E = (a^{2..3} \mid b)^{2..2}\, b$ of Figure 1 was argued above to be deterministic, but the version of Xerces which is included in JDK 1.6, as the default XML validator of the current Java Platform, erroneously rejects it as nondeterministic (footnote 4). The implementer has likely followed an erroneous suggestion by Sperberg-McQueen (2005), which was to simplify UPA checking by first replacing subexpressions $F^{m..n}$ with (potentially large) occurrence bounds $1 \le m < n$ by iteration $F^{1..2}$. This modification transforms the deterministic expression $E = (a^{2..3} \mid b)^{2..2}\, b$ into the expression $E_1 = (a^{1..2} \mid b)^{2..2}\, b$, which we discussed above to be nondeterministic. The UPA checking algorithm of Apache Xerces is also vulnerable to efficiency problems, as is shown by the experiments reported in Section 4. For example, Xerces fails to process the content model of the MPEG-7 IntegratedCoordinateSystem elements which was discussed above.

We use regular expressions extended with numeric occurrence indicators as the model for studying XML Schema content models. Definitions and some previously published results on their determinism are reviewed in Section 2, mainly relying on the article by Kilpeläinen and Tuhkanen (2007). In that article, determinism was shown to be testable in quadratic time for expressions over a fixed alphabet, and in cubic time for expressions over an unlimited alphabet. In Section 3 we improve those results by refining the determinism-checking algorithm so that it works in linear time for expressions over a fixed alphabet, and in time $O(n^2 / \log n)$ for expressions over an unlimited alphabet, where $n$ is the length of a binary encoding of the expression. Section 4 presents results of implementation experiments, which verify the linear scalability of the algorithm and show that it is more efficient than its predecessor and also practically efficient. Comparisons with Apache Xerces show that the implementation of the new algorithm outperforms the corresponding method of Xerces by several orders of magnitude. XML Schema extends regular expressions also with additional features like all groups, substitution groups, and element wildcards. We sketch in Section 5 how these extensions could be taken into account in the new algorithm without compromising its asymptotic complexity. Section 6 is a discussion of the relevance of the results, and Section 7 is a conclusion.

4 The version of Xerces which was included in JDK 1.5 accepts this content model correctly.

2. Numeric regular expressions and determinism

This section provides a compact review of previously published concepts and results, in order to get a basis for discussing and improving the checking of the determinism of content models. More complete definitions, and proofs of the results, can be found in the article by Kilpeläinen and Tuhkanen (2007). Central features of XML Schema content models can be studied by modeling them as regular expressions extended with numeric occurrence indicators. XML Schema content-model particles element, choice, and sequence correspond to symbols and operators of traditional regular expressions. The minOccurs and maxOccurs attributes of the content-model particles correspond to numeric occurrence indicators of their corresponding expressions. For brevity, we will call regular expressions with numeric occurrence indicators simply expressions here. Further extensions of XML Schema, all groups, substitution groups, and element wildcards, are considered in Section 5. Expressions are built of symbols $a \in \Sigma$ of an alphabet $\Sigma$.
The alphabet of content model expressions in document schemas consists of element type names like title, purchaseOrder, xhtml:table etc. As binary operators we have choice, denoted by the infix operator '|', and concatenation, denoted implicitly by juxtaposition of subexpressions; as unary operators applied to a subexpression $F$ we have numeric iteration, denoted by $F^{m..n}$, and optionality, denoted by $F?$. The occurrence bounds $m$ and $n$ of an iteration $F^{m..n}$ are non-negative integers, where the minimum bound $m$ may not exceed the maximum bound $n$. The maximum bound may also be $\infty$, corresponding to XML Schema notation maxOccurs="unbounded". An expression $E$ describes a language $L(E)$ which consists of sequences of alphabet symbols called words. The language $L(E)$ is a subset of the set of all words that can be constructed using symbols of alphabet $\Sigma$, which is denoted by $\Sigma^*$. The catenation of two languages $L_1$ and $L_2$ is denoted by $L_1 L_2$, and defined as

$$L_1 L_2 = \{uv \mid u \in L_1,\ v \in L_2\}.$$

That is, $L_1 L_2$ comprises the words that consist of some word $u$ of $L_1$ as their prefix followed by some word $v$ of $L_2$ as their suffix. The n-fold catenation of a

language $L$ by itself is denoted by $L^n$, and defined by $L^0 = \{\epsilon\}$ and $L^{n+1} = L^n L$; here $L^0 = \{\epsilon\}$ is a language that consists of the empty word $\epsilon$ alone. That is, the language $L^n$ comprises the words that can be formed by concatenating $n$ words chosen from the language $L$. Using these notations, the semantics of expressions can be defined inductively as follows:

$$\begin{aligned}
L(a) &= \{a\} \text{ for } a \in \Sigma; & L(F?) &= L(F) \cup \{\epsilon\};\\
L(F \mid G) &= L(F) \cup L(G); & L(FG) &= L(F)\,L(G);\\
L(F^{m..n}) &= \bigcup_{i=m}^{n} L(F)^i.
\end{aligned}$$

The semantics of an iteration $F^{m..n}$ means to accept the language formed by concatenating at least $m$ but at most $n$ words from the language $L(F)$ of the body $F$ of the iteration. With an unlimited maximum bound $\infty$ the iteration accepts words that result from concatenating arbitrarily many words from $L(F)$, provided that their number is at least that of the minimum bound $m$. We define subexpressions in the usual way. That is, if $E$ is an expression of any of the forms shown above, then the expressions $F$ and $G$ (if present) are proper subexpressions of $E$, as well as the proper subexpressions of $F$ and of $G$ (if present). The subexpressions of an expression $E$ include the expression $E$ in addition to its proper subexpressions. For brevity, we exclude from expressions the symbols $\emptyset$ and $\epsilon$ that are normally used for expressing the trivial special cases of an empty language and the language $\{\epsilon\}$ of a single empty sequence. We assume that the maximum bound of numeric iterations is at least two; other cases can be eliminated by applying simple rules based on the equivalences $F^{0..0} \equiv \epsilon$, $F^{0..1} \equiv F?$, and $F^{1..1} \equiv F$. (Resulting occurrences of $\epsilon$ can further be eliminated by straightforward simplifications, unless the entire expression is equivalent to $\epsilon$, in which case the expression reduces to a single $\epsilon$.) We define the size of expressions similarly to (Gelade et al., 2009a), but extend their definition slightly to cover unlimited alphabets, too. For this, we assume some unspecified but fixed encoding scheme which uses fixed-length symbols to represent operators of the expressions, and binary numbers to represent integer occurrence bounds of iterations. The unlimited maximum bound $\infty$ can also be represented by some fixed symbol, and the symbols of a finite alphabet $\Sigma$ likewise. The size, or length, of an expression over a finite alphabet is then the number of these fixed symbols plus the total length of the binary representation of integers in its representation.
If the alphabet is unlimited, we assume that binary strings of unrestricted length are used to represent alphabet symbols. The size, or length, of an expression over an unlimited alphabet is then the number of operator symbols, plus the total length of the binary representation of integers and of the alphabet symbols that occur in the expression. We treat expressions as their markings, where occurrences of symbols $a \in \Sigma$ are represented as uniquely indexed copies $a_i$ called positions. We denote the set of all positions by symbol $\Pi$. We use notation $\mathit{Pos}(E)$ to refer to the set of positions in an expression $E$, and $(\cdot)^\natural$ as an unmarking operator that removes subscripts of positions and thereby makes it possible to refer to their underlying


symbol. The unmarking operation is applied with a straightforward and obvious extension also to words made of positions, and to languages and expressions over positions. So, the language $L(E)$ of a marked expression $E$ consists of words over $\mathit{Pos}(E)$, while $(L(E))^\natural = L((E)^\natural)$ is a language over the symbols of $\Sigma$ which occur at positions of $E$. With these notions, determinism of expressions can be defined in terms of languages accepted by their markings:

Definition 2.1 (Brüggemann-Klein and Wood, 1998). A marked expression $E$ is nondeterministic if there are two different positions $x, y \in \mathit{Pos}(E)$ and some words $u, v, w \in \mathit{Pos}(E)^*$ such that $uxv \in L(E)$, $uyw \in L(E)$, and $(x)^\natural = (y)^\natural$. An expression is deterministic if it is not nondeterministic.

For example, the (marked) expression $E = (a_1)^{1..2}\, a_2$ is nondeterministic, since it satisfies the conditions $uxv \in L(E)$, $uyw \in L(E)$ and $(x)^\natural = (y)^\natural$ with the words $u = a_1$, $v = \epsilon$, and $w = a_2$, and with the competing positions $x = a_2$ and $y = a_1$. On the other hand, the equivalent expression $F = (a_1)^{2..3}$ is deterministic simply because it contains only a single position. Analyzing the determinism of expressions is based on examining their First positions, and transitions between Last and First positions of subexpressions, which were defined already in the early works of McNaughton and Yamada (1960) and Glushkov (1961). The First set of positions that can begin words accepted by an expression $F$ is defined as follows:

$$\mathit{First}(F) = \{x \in \mathit{Pos}(F) \mid xw \in L(F) \text{ for some } w \in \mathit{Pos}(F)^*\} \tag{1}$$

The dual set of Last positions that can terminate words accepted by expression $F$ is defined correspondingly:

$$\mathit{Last}(F) = \{x \in \mathit{Pos}(F) \mid wx \in L(F) \text{ for some } w \in \mathit{Pos}(F)^*\} \tag{2}$$
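For small expressions with finite bounds, Definition 2.1 and the First and Last sets can be checked directly against the enumerated language of a marking. The sketch below does this by brute force; the tuple encoding and all function names are hypothetical choices made for this illustration, and the approach is exponential, so it serves only to make the definitions concrete, not as a practical determinism test.

```python
# Hypothetical encoding of marked expressions (an assumption of this sketch):
#   ("pos", "a", 1) is the position a_1; ("cat", F, G), ("alt", F, G),
#   ("opt", F) and ("rep", F, m, n) with finite n build larger expressions.

def language(e):
    """L(e) of a marked expression as a set of tuples of positions."""
    kind = e[0]
    if kind == "pos":
        return {((e[1], e[2]),)}
    if kind == "cat":
        return {u + v for u in language(e[1]) for v in language(e[2])}
    if kind == "alt":
        return language(e[1]) | language(e[2])
    if kind == "opt":
        return language(e[1]) | {()}
    if kind == "rep":
        body, m, n = language(e[1]), e[2], e[3]
        words, power = set(), {()}
        for i in range(n + 1):
            if i >= m:
                words |= power
            if i < n:
                power = {u + v for u in power for v in body}
        return words
    raise ValueError(f"unknown node kind: {kind}")

def first(e):
    """Equation (1): positions that begin some word of L(e)."""
    return {w[0] for w in language(e) if w}

def last(e):
    """Equation (2): positions that end some word of L(e)."""
    return {w[-1] for w in language(e) if w}

def is_deterministic(e):
    """Definition 2.1: no prefix u is continued by two different positions
    x and y that are occurrences of the same symbol."""
    continuations = {}                      # prefix u -> positions that may follow u
    for w in language(e):
        for i, x in enumerate(w):
            continuations.setdefault(w[:i], set()).add(x)
    for following in continuations.values():
        symbols = [symbol for symbol, _ in following]
        if len(symbols) != len(set(symbols)):
            return False
    return True

a1, a2 = ("pos", "a", 1), ("pos", "a", 2)
b1, b2 = ("pos", "b", 1), ("pos", "b", 2)

F_nondet = ("cat", ("rep", a1, 1, 2), a2)                               # a_1^{1..2} a_2
F_det = ("rep", a1, 2, 3)                                               # a_1^{2..3}
E_marked = ("cat", ("rep", ("alt", ("rep", a1, 2, 3), b1), 2, 2), b2)   # E'
E1_marked = ("cat", ("rep", ("alt", ("rep", a1, 1, 2), b1), 2, 2), b2)  # E_1'

for name, expr in [("F_nondet", F_nondet), ("F_det", F_det),
                   ("E_marked", E_marked), ("E1_marked", E1_marked)]:
    print(name, "deterministic:", is_deterministic(expr))
```

The four test expressions are the markings discussed in Section 1 and after Definition 2.1; the sketch agrees with the classification given in the text.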

A pair of positions $(x, y)$ is a transition of expression $E$ if $xy$ can appear as a subword of some word in the language $L(E)$. We say that a pair $(x, y)$ is a forward transition of $E$ if $(x, y) \in \mathit{Last}(G) \times \mathit{First}(H)$ for some subexpression $F = GH$ of $E$. If $(x, y)$ is not a forward transition but $(x, y) \in \mathit{Last}(G) \times \mathit{First}(G)$ for some iterative subexpression $F = G^{m..n}$ of $E$, we say that it is a backward transition of $E$. Obviously there are no other transitions than forward or backward transitions, and each forward or backward transition is indeed a transition (footnote 5). As a simple example, consider an expression $E = a_1^{2..2}\, a_2$, which accepts the language $L(E) = \{a_1 a_1 a_2\}$. Expression $E$ has two transitions, a backward transition $(a_1, a_1)$ and a forward transition $(a_1, a_2)$.

5 Notice that if the expression $E$ contained “bogus iterations” $G^{m..n}$ with $n \le 1$, some backward transitions might not be transitions. Similarly, if $E$ contained “dead ends” caused by the symbol $\emptyset$, some forward or backward transitions might fail to be transitions.


Transitions of an expression can be stored in a Follow relation for analyzing the determinism of the expression. (See, e.g., (Brüggemann-Klein, 1993b). More commonly, Follow relations are used for defining and constructing automata for expressions; see, e.g., (Brüggemann-Klein, 1993a) and (Aho, 1994, Sec. 5.2).) An expression can be nondeterministic only if its First set contains two positions that are occurrences of a common symbol, or if it has transitions $(x, y)$ and $(x, z)$ from some position $x$ to two different occurrences $y$ and $z$ of a common symbol. The converse does not hold, however: With numeric occurrence indicators, some transitions never materialize as competing continuations of a path through the expression. For example, the positions $a_1$ and $a_2$ of the above expression $E$ are not competing since the transitions $(a_1, a_1)$ and $(a_1, a_2)$ do not occur as a continuation of a common prefix in the only word of $L(E)$. For this reason, certain backward transitions are excluded from Follow relations. The backward transitions that are included are those of so-called flexible iterations, which are defined formally as follows:

Definition 2.2 (Kilpeläinen and Tuhkanen, 2007). Let $E$ be a marked expression. An iterative subexpression $F = G^{m..n}$ of $E$ is flexible in $E$ if there is some word $uws \in L(E)$ such that $w \in L(F)^l \cap L(G)^k$ for some $l \in \mathbb{N}$ and $k < l \times n$. In this case, we call the word $w$ a witness to the flexibility of $F$ in $E$.

Observe that above the subwords $u$, $w$ and $s$ of a word accepted by a marked expression $E$ correspond to unique paths through positions of the expression $E$. Flexibility is intuitively about alternative interpretations for counting the number of iterations needed to match some input. On the other hand, determinism is intuitively about the uniqueness of expression positions that can be used for matching symbols of the input word. So, the notions are different, but as we have seen already in Section 1, flexibility can be the cause of nondeterminism.
According to Definition 2.2, a word $w$ is a witness to the flexibility of iteration $F = G^{m..n}$ if it can be accepted both as $l$ iterations of $F$ and as fewer than $l \times n$ iterations of its body $G$. Intuitively, this means that a traversal through the expression could continue after $w$ either by some position in $\mathit{First}(G)$, reiterating the body of the iteration, or by some (potentially competing) position outside $F$. (Notice that any of the words $u$, $w$ and $s$ used in the definition may also be empty.) On the other hand, if iteration $F = G^{m..n}$ is not flexible, its backward transitions do not cause competition between positions of $\mathit{First}(G)$ and positions that are outside of $F$ (Kilpeläinen and Tuhkanen, 2007). If an iteration is flexible in itself, and not only as a proper subexpression of a larger expression, then we simply say that it is flexible. Iterations $F = G^{m..n}$ whose occurrence bounds $m$ and $n$ differ are immediately seen to be flexible. An expression $G$ is called nullable if it accepts the empty word, that is, if $\epsilon \in L(G)$. If the body of an iteration $F$ is nullable, the empty word is an immediate witness to the flexibility of $F$. In addition, flexibility can result from rather subtle interaction of occurrence bounds of nested iterations. As examples, consider


the (marked) expression

$$E = (a_1^{2..3} \mid b_1)^{2..2}\, b_2 = F\, b_2 = G^{2..2}\, b_2$$

with iteration body $G = (a_1^{2..3} \mid b_1)$, and some of their variations. The language $L(F)$ consists of eight words, which are $a_1^4$, $a_1^5$, $a_1^6$, $a_1^2 b_1$, $a_1^3 b_1$, $b_1 a_1^2$, $b_1 a_1^3$, and $b_1 b_1$, using superscripts as a shorthand notation for a number of repeated symbols. These words are the candidates for a witness to the flexibility of $F$ in $E$. It is straightforward to check that none of them can be accepted as fewer than two iterations of the body $G$ of the iteration, and thus iteration $F$ is not flexible in $E$. On the other hand, its variation $F_1 = G_1^{2..2}$ with $G_1 = (a_1^{1..2} \mid b_1)$ is flexible in the expression

$$E_1 = (a_1^{1..2} \mid b_1)^{2..2}\, b_2 = F_1\, b_2 = G_1^{2..2}\, b_2;$$

the word $w = a_1 a_1$ is a witness to the flexibility of $F_1$ in $E_1$, by $w b_2 \in L(E_1)$ and $w \in L(F_1)^1 \cap L(G_1)^1$. The positions $b_1$ and $b_2$ of expression $E_1$ are competing continuations of $w$ as a consequence of this flexibility. This is the reason for the nondeterminism of $E_1$, which was seen in Section 1. As a second variation, consider the expression

$$E_2 = (E)^{2..2} = (F\, b_2)^{2..2} = ((a_1^{2..3} \mid b_1)^{2..2}\, b_2)^{2..2}.$$

Iteration $F$ is not flexible in $E_2$ either: The language $L(E_2)$ consists of concatenations of two words, each of which is a word of $L(F)$ terminated by the position $b_2$, which does not belong to $F$. Thus the candidates for a witness to the flexibility of $F$ in $E_2$ are the same as the candidates for a witness to its flexibility in $E$. Finally, if we make the position $b_2$ optional to get the variation

$$E_3 = (F\, b_2?)^{2..2} = ((a_1^{2..3} \mid b_1)^{2..2}\, b_2?)^{2..2},$$

iteration $F = G^{2..2}$ with $G = (a_1^{2..3} \mid b_1)$ becomes flexible in $E_3$: The word $w = a_1^8$ is a witness to this, by $w b_2 \in L(E_3)$ and $w \in L(F)^2 \cap L(G)^3$ where $3 < 2 \times 2$. As a consequence, positions $b_1$ and $b_2$ are competing continuations of $w$, which makes expression $E_3$ nondeterministic. The definition of flexibility is somewhat involved for the reason that it has been formulated to meet two goals simultaneously.
First, it gives a sufficient condition for constructing Follow relations correctly; and secondly, there is an efficient method for locating and recognizing flexible iterations in an expression by a single scan of the expression. We will review this method in Section 3. The flexibility of iterations is taken into account in the construction of Follow relations as follows: Definition 2.3 (Kilpel¨ ainen and Tuhkanen, 2007). Let E be a marked expression. The Follow relation Foll(F ) is defined for each subexpression F of E inductively as follows: 11

• If F = x for any x ∈ Π, then Foll(F ) = ∅; • If F = G?, then Foll(F ) = Foll(G); • If F = G | H, then Foll(F ) = Foll(G) ∪ Foll(H); • If F = GH, then Foll(F ) = Foll(G) ∪ Foll(H) ∪ [Last(G) × First(H)]; • If F = Gm..n , then  Foll(G) ∪ [Last(G) × First(G)] Foll(F ) = Foll(G)

if F is flexible in E, otherwise;

That is, all forward transitions are included in the Follow relation of expression E, but backward transitions are included only if they are transitions of an iteration which is flexible in E. The main result of (Kilpeläinen and Tuhkanen, 2007) was that the Follow relations thus defined, together with the First and Last sets, are sufficient for checking the determinism of expressions:

Theorem 2.4 (Kilpeläinen and Tuhkanen, 2007). A marked expression E is nondeterministic if and only if there are two different positions x, y ∈ Pos(E) with (x)♮ = (y)♮ such that
(A) x, y ∈ First(E), or
(B) (z, x), (z, y) ∈ Foll(E) for some position z ∈ Pos(E), or
(C) (z, x) ∈ Foll(G) and y ∈ First(G) for some iteration $F = G^{m..n}$ in E and for some position z ∈ Last(G).

As an example of Theorem 2.4, consider the following expressions:

\[ E_4 = (b_1?\, a_1) \mid a_2 \]
\[ E_5 = (a_1 b_1)^{1..2}\, a_2 \]
\[ E_6 = (a_1 b_1)^{2..2}\, a_2 \]
\[ E_7 = (a_1 b_1 a_2?)^{2..2} \]

Expression $E_4$ is nondeterministic by satisfying the condition (A) with $a_1, a_2 \in \mathrm{First}(E_4)$. Expression $E_5$ is also nondeterministic by the competition of positions $a_1$ and $a_2$, which is observed through the condition (B) with $(b_1, a_1), (b_1, a_2) \in \mathrm{Foll}(E_5)$. Expression $E_6$, on the other hand, is deterministic. In particular, it does not satisfy the condition (B) because the backward transition $(b_1, a_1)$ is not included in the Follow relation of the non-flexible iteration $(a_1 b_1)^{2..2}$. Finally, expression $E_7$ is nondeterministic even though it consists of a non-flexible iteration $F_7 = G_7^{2..2}$ with $G_7 = (a_1 b_1 a_2?)$. Again the backward transition $(b_1, a_1)$ is not included in the Follow relation, but the expression satisfies the condition (C) with $(b_1, a_2) \in \mathrm{Foll}(G_7)$, $a_1 \in \mathrm{First}(G_7)$, and $b_1 \in \mathrm{Last}(G_7)$.

In order to check whether condition (C) of Theorem 2.4 holds, we need to examine, for each iteration $F = G^{m..n}$, the competitions potentially caused by its backward transitions. We will discuss in the next section how to realize the Follow relation efficiently, and how to optimize the checking of this condition (C).

3. A linear-time algorithm

In this section we show how the determinism of expressions can be checked in linear time, based on the ideas reviewed in Section 2. A method that checks the determinism of expressions over a finite alphabet in quadratic time, and of expressions over an unlimited alphabet in cubic time, was introduced in (Kilpeläinen and Tuhkanen, 2007). We improve this result by optimizing the examination of the Follow relation used for checking the conditions of Theorem 2.4. The Follow relation is often realized as follow lists, which are computed for each position of the expression. The improvement of the determinism-checking algorithm is based on the simple trick of combining the follow lists, for the last positions of each subexpression, together into a single FollowLast list, which allows the contents of all of them to be examined simultaneously.

We use the standard word RAM with unit-cost operations on integers of w = O(log n) bits as the model for analyzing algorithms, where n denotes the size of the input (Hagerup, 1998). That is, we assume that algorithms are executed on a standard single-processor architecture such that the input can be loaded in main memory, and whose processor word-length is restricted but long enough for indexing the contents of the main memory. Implementation of the algorithms using Java, and practical experiments on the Java virtual machine, are discussed in Section 4.

We assume that the given expression E has been parsed and translated into an expression tree, whose inner nodes are labeled by the operators of the expression, and whose leaves are labeled by positions of the expression. This can be done within linear time using well-known parsing techniques; see, e.g., Sippu and Soisalon-Soininen (1988). Nullability of subexpressions F of the expression, that is, whether ε ∈ L(F), is relevant information for analyzing the determinism of the expression. Information on nullability can be recorded in the nodes of the expression tree easily while the tree is being built: Leaves (that is, positions) are never nullable, while optional nodes always are. A concatenation is nullable iff both of its subexpressions are nullable, and a choice is nullable iff either of its subexpressions is nullable. Finally, an iteration $G^{m..n}$ is nullable if and only if m = 0 or the body G is nullable.

The above Follow relations can be represented as follow sets, denoted by $\mathrm{follow}_F(x)$, consisting for each position x ∈ Pos(F) of those positions to which x is related in the Follow relation of expression F. That is,

\[ \mathrm{follow}_F(x) = \{\, y \in \mathrm{Pos}(F) \mid (x, y) \in \mathrm{Foll}(F) \,\} . \tag{3} \]

Testing for the case (C) of Theorem 2.4 might require the $\mathrm{follow}_G(z)$ sets to be examined potentially for several positions z ∈ Last(G). In order to examine the contents of all of these follow sets simultaneously, we combine them together into a single FollowLast set, which is defined for each subexpression F of E as follows:

\[ \mathrm{FollowLast}(F) = \bigcup_{x \in \mathrm{Last}(F)} \mathrm{follow}_F(x) . \tag{4} \]
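The nullability rules stated above can be sketched in Java as follows. This is only an illustrative sketch, not the implementation evaluated in Section 4: the class `NullabilitySketch`, the `Node` type and the builder methods `pos`, `opt`, `alt`, `cat` and `iter` are hypothetical names introduced here. The flag is computed once in the constructor, mirroring the idea of recording nullability while the tree is being built.

```java
class NullabilitySketch {
    enum Kind { POSITION, OPTIONAL, CHOICE, CONCAT, ITERATION }

    /** Hypothetical expression-tree node; not the paper's actual data structure. */
    static final class Node {
        final Kind kind;
        final Node left, right;   // right is null for leaves and unary nodes
        final int min;            // lower occurrence bound, used by ITERATION only
        final boolean nullable;   // recorded once, when the node is built

        Node(Kind kind, Node left, Node right, int min) {
            this.kind = kind;
            this.left = left;
            this.right = right;
            this.min = min;
            this.nullable = isNullable(kind, left, right, min);
        }
    }

    /** The five nullability rules, one per node kind. */
    static boolean isNullable(Kind kind, Node left, Node right, int min) {
        switch (kind) {
            case POSITION:  return false;                            // leaves are never nullable
            case OPTIONAL:  return true;                             // G? always is
            case CONCAT:    return left.nullable && right.nullable;  // both parts
            case CHOICE:    return left.nullable || right.nullable;  // either branch
            case ITERATION: return min == 0 || left.nullable;        // m = 0 or nullable body
            default:        throw new AssertionError(kind);
        }
    }

    static Node pos()                { return new Node(Kind.POSITION, null, null, 1); }
    static Node opt(Node g)          { return new Node(Kind.OPTIONAL, g, null, 0); }
    static Node alt(Node g, Node h)  { return new Node(Kind.CHOICE, g, h, 1); }
    static Node cat(Node g, Node h)  { return new Node(Kind.CONCAT, g, h, 1); }
    static Node iter(Node g, int m)  { return new Node(Kind.ITERATION, g, null, m); }

    public static void main(String[] args) {
        Node body = cat(iter(alt(pos(), pos()), 2), opt(pos()));
        System.out.println("nullable: " + body.nullable);
    }
}
```

Because every child is built before its parent, each flag is derived from already recorded child flags in constant time, so the whole tree is annotated in time linear in its size.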

procedure markFlexible(F: Expression, N: integer) returns double:
    // Returns the “flexibility value” of F; N is the product of maximum bounds
    // of nested iterations, which is used for testing the flexibility of F = G^{m..n}
    case F = x ∈ Π:
        return 1;
    case F = G?:
        markFlexible(G, N);
        return ∞;
    case F = (G | H):
        return max(markFlexible(G, N), markFlexible(H, N));
    case F = (GH):
        if ε ∈ L(H) then fG ← markFlexible(G, N);
        else fG ← markFlexible(G, 1);
        if ε ∈ L(G) then fH ← markFlexible(H, N);
        else fH ← markFlexible(H, 1);
        if ε ∈ L(G) and ε ∈ L(H) then return ∞;
        if ε ∈ L(G) and ε ∉ L(H) then return fH;
        if ε ∉ L(G) and ε ∈ L(H) then return fG;
        if ε ∉ L(G) and ε ∉ L(H) then return 1;
    case F = G^{m..n}:
        fG ← markFlexible(G, N × n);
        if m < n or fG ≥ (N × n)/(N × n − 1) then
            Mark that F is flexible;
        return fG × n/m;

Figure 2: A procedure for recognizing flexible iterations of an expression E with the call markFlexible(E, 1)
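The procedure of Figure 2 translates almost line by line into Java. The following is a minimal sketch under stated assumptions, not the implementation used in the experiments: the class `MarkFlexibleSketch`, the `Node` type and its builders are hypothetical, occurrence bounds are stored as `double` so that an unbounded maximum could be written as `Double.POSITIVE_INFINITY`, and the division `bound / (bound - 1)` relies on IEEE arithmetic yielding infinity when `bound` is 1.

```java
class MarkFlexibleSketch {
    enum Kind { POSITION, OPTIONAL, CHOICE, CONCAT, ITERATION }

    /** Hypothetical expression-tree node with a precomputed nullability flag. */
    static final class Node {
        final Kind kind;
        final Node g, h;          // operands; h is used by CHOICE and CONCAT only
        final double min, max;    // occurrence bounds, ITERATION only
        final boolean nullable;   // recorded when the node is built
        boolean flexible;         // set by markFlexible on ITERATION nodes

        Node(Kind kind, Node g, Node h, double min, double max, boolean nullable) {
            this.kind = kind; this.g = g; this.h = h;
            this.min = min; this.max = max; this.nullable = nullable;
        }
    }

    static Node pos() { return new Node(Kind.POSITION, null, null, 1, 1, false); }
    static Node opt(Node g) { return new Node(Kind.OPTIONAL, g, null, 0, 1, true); }
    static Node alt(Node g, Node h) {
        return new Node(Kind.CHOICE, g, h, 1, 1, g.nullable || h.nullable);
    }
    static Node cat(Node g, Node h) {
        return new Node(Kind.CONCAT, g, h, 1, 1, g.nullable && h.nullable);
    }
    static Node iter(Node g, double min, double max) {
        return new Node(Kind.ITERATION, g, null, min, max, min == 0 || g.nullable);
    }

    /** Marks flexible iterations below f and returns the flexibility value of f. */
    static double markFlexible(Node f, double n) {
        switch (f.kind) {
            case POSITION:
                return 1;
            case OPTIONAL:
                markFlexible(f.g, n);              // value ignored, body still scanned
                return Double.POSITIVE_INFINITY;
            case CHOICE:
                return Math.max(markFlexible(f.g, n), markFlexible(f.h, n));
            case CONCAT: {
                double fG = markFlexible(f.g, f.h.nullable ? n : 1);
                double fH = markFlexible(f.h, f.g.nullable ? n : 1);
                if (f.g.nullable && f.h.nullable) return Double.POSITIVE_INFINITY;
                if (f.g.nullable) return fH;
                if (f.h.nullable) return fG;
                return 1;
            }
            case ITERATION: {
                double bound = n * f.max;          // N × n of Figure 2
                double fG = markFlexible(f.g, bound);
                if (f.min < f.max || fG >= bound / (bound - 1)) f.flexible = true;
                return fG * f.max / f.min;
            }
            default:
                throw new AssertionError(f.kind);
        }
    }

    public static void main(String[] args) {
        // E3 = ((a^{2..3} | b)^{2..2} b?)^{2..2} from Section 2
        Node inner = iter(alt(iter(pos(), 2, 3), pos()), 2, 2);
        Node outer = iter(cat(inner, opt(pos())), 2, 2);
        double value = markFlexible(outer, 1);
        System.out.println(inner.flexible + " " + outer.flexible + " " + value);
    }
}
```

Each node is visited once and handled with a constant number of arithmetic operations, which is the property behind the linear running time of the procedure.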

These FollowLast sets have been applied also by Brüggemann-Klein and Wood (1992) in their study of deterministic regular languages, and by Fuchs and Brown (2003), as their confusion set, for testing the unique-particle attribution property of XML Schema content models. [Footnote 6: Unfortunately Fuchs and Brown ignored the complications caused by the flexibility of iterations, which made their algorithm incorrect (Kilpeläinen and Tuhkanen, 2007).]

Computation of the FollowLast sets requires knowledge of the flexibility of iterations. A linear-time method for recognizing flexible iterations of an expression was developed and proved correct by Kilpeläinen and Tuhkanen (2007, Sect. 4), based on defining and computing numeric flexibility values for subexpressions. Intuitively, the flexibility value of an expression is the maximal relative variation in the number of iterations that the expression can take to match any word. That is, if some word belongs to $L(E)^m \cap L(E)^n$ such that m/n is

maximal, this ratio is the flexibility value of expression E. Notice that a nullable expression can match the empty word using any number of iterations, including zero. Because of this, we define the flexibility value of nullable expressions to be infinite. For example, consider the expression $E = a^{5..11}$. Now the word $a^{55}$ belongs to both $L(E)^{11}$ and $L(E)^{5}$, by which the flexibility value of E is at least 11/5 = 2.2. Since it is easy to see that no bigger relative difference can arise, this is also the flexibility value of E.

We do not repeat the proofs of (Kilpeläinen and Tuhkanen, 2007) here. Instead, for completeness and in order to help practical implementers, we present a procedure for recognizing and marking the flexible iterations of an expression, based on their flexibility values, in Figure 2. (The inductive definition of the flexibility values is actually easy to extract from the return statements of the procedure.) All flexible iterations of an expression E are recognized and marked by invoking the procedure with the call markFlexible(E, 1). The unlimited value ∞ is used in the formulation of the procedure assuming that it behaves similarly to the value Double.POSITIVE_INFINITY of Java. That is, x × ∞ = x/0 = ∞ for all positive values x.

As an example of the operation of the markFlexible procedure, we show how it would recognize and mark the flexible iterations of the expression

\[ ((a_1^{2..3} \mid b_1)^{2..2}\, b_2?)^{2..2} , \]

which was considered as example $E_3$ in Section 2. The initial invocation is

\[ \mathrm{markFlexible}(((a_1^{2..3} \mid b_1)^{2..2}\, b_2?)^{2..2},\ 1) , \]

which leads to the case $F_1 = G_1^{m..n}$ with $G_1 = ((a_1^{2..3} \mid b_1)^{2..2}\, b_2?)$ and m = n = 2. (We use subscripts to distinguish subexpressions $F_i$, $G_i$, and $H_i$ considered by different procedure invocations, and their flexibility values $f_{G_i}$ and $f_{H_i}$.) The flexibility value $f_{G_1}$ of the body $G_1$ of the iteration is computed by the recursive call

\[ f_{G_1} \leftarrow \mathrm{markFlexible}(((a_1^{2..3} \mid b_1)^{2..2}\, b_2?),\ 1 \times 2) . \]

This call leads to the case $F_2 = (G_2 H_2)$, where $G_2 = (a_1^{2..3} \mid b_1)^{2..2}$ and $H_2 = b_2?$. Since $b_2?$ is nullable, the flexibility value of the expression $G_2$ is computed with

\[ f_{G_2} \leftarrow \mathrm{markFlexible}((a_1^{2..3} \mid b_1)^{2..2},\ 2) . \]

This call leads to the case $F_3 = G_3^{m..n}$ with $G_3 = (a_1^{2..3} \mid b_1)$ and m = n = 2. The flexibility value of $G_3$ is computed by the recursive call

\[ f_{G_3} \leftarrow \mathrm{markFlexible}((a_1^{2..3} \mid b_1),\ 2 \times 2) , \]

by N = 2. This leads to the case $F_4 = (G_4 \mid H_4)$ with $G_4 = a_1^{2..3}$ and $H_4 = b_1$, whose result is computed by evaluating the calls markFlexible($a_1^{2..3}$, 4) and markFlexible($b_1$, 4), and returning their maximum. The call markFlexible($a_1^{2..3}$, 4) is evaluated as an instance of $F_5 = G_5^{m..n}$ with $G_5 = a_1$, m = 2, and n = 3. The flexibility value of subexpression $G_5$ is computed by the call

\[ f_{G_5} \leftarrow \mathrm{markFlexible}(a_1,\ 4 \times 3) , \]

which evaluates to 1. Since the occurrence bounds m and n of the iteration $a_1^{2..3}$ differ, the subexpression $F_5 = a_1^{2..3}$ is marked flexible. The value $f_{G_5} \times n/m = 1 \times 3/2$ is returned as the value of the call markFlexible($a_1^{2..3}$, 4). Since markFlexible($b_1$, 4) = 1, the maximum of the flexibility values of the subexpressions of $(a_1^{2..3} \mid b_1)$ is 3/2, which is returned as the value of markFlexible($(a_1^{2..3} \mid b_1)$, 4) = $f_{G_3}$. Since this value 3/2 exceeds (N × n)/(N × n − 1) = (2 × 2)/(2 × 2 − 1) = 4/3, the procedure marks the subexpression $F_3 = (a_1^{2..3} \mid b_1)^{2..2}$ flexible, and then returns 3/2 × 2/2 = 3/2, which is stored as the flexibility value $f_{G_2}$.

Since $G_2 = (a_1^{2..3} \mid b_1)^{2..2}$ is a non-nullable branch of the expression $F_2 = (G_2 H_2)$, the flexibility value of the expression $H_2 = b_2?$ is computed with the call

\[ f_{H_2} \leftarrow \mathrm{markFlexible}(b_2?,\ 1) . \]

This leads to the call markFlexible($b_2$, 1), whose return value 1 is ignored, and ∞ is returned as the flexibility value $f_{H_2}$ of the nullable expression $b_2?$. The flexibility value $f_{G_2} = 3/2$ of the non-nullable branch of $F_2 = (G_2 H_2)$ is then returned as the flexibility value $f_{G_1}$ of the expression $G_1 = ((a_1^{2..3} \mid b_1)^{2..2}\, b_2?)$.

Finally, we are back at the initial invocation. Since the occurrence bounds of the outermost iteration $G_1^{2..2}$ are equal, and the flexibility value $f_{G_1} = 3/2$ is less than (1 × 2)/(1 × 2 − 1) = 2, the outermost iteration is not marked flexible. The value $f_{G_1} \times 2/2 = 3/2$ is returned as the (unused) flexibility value of the expression. □

It is easy to see that procedure markFlexible performs only a fixed number of operations at each node of the expression tree, and therefore runs in linear time with respect to the length of the expression.
Simple rules for computing the First and FollowLast sets for the subexpressions F of E are given in Figure 3. Let us call these rules DC rules (for Determinism Constraints). The rules are meant to be applied on the expression tree of E in a bottom-up order, so that the computation of the sets for a node is based on the sets computed for its child nodes. The rules include determinism constraints prefixed by DC1, ..., DC4, which specify constraints for the determinism of subexpressions of E. The correctness of the First and FollowLast computation by the rules is straightforward to verify:

Lemma 3.1. The First and FollowLast sets are computed correctly by the DC rules of Figure 3.

Proof. Let Fi(F) and FL(F) denote the sets that are computed according to the rules to be the First and the FollowLast set, respectively, for expression F. We use induction to show that Fi(F) = First(F) and FL(F) = FollowLast(F) for

case F = x ∈ Π:
    First(F) ← {x};
    FollowLast(F) ← ∅;
case F = G?:
    First(F) ← First(G);
    FollowLast(F) ← FollowLast(G);
case F = (G | H):
    DC1: (First(G))♮ ∩ (First(H))♮ = ∅
    First(F) ← First(G) ∪ First(H);
    FollowLast(F) ← FollowLast(G) ∪ FollowLast(H);
case F = (GH):
    DC2: (FollowLast(G))♮ ∩ (First(H))♮ = ∅
    if ε ∈ L(G) then
        DC3: (First(G))♮ ∩ (First(H))♮ = ∅
        First(F) ← First(G) ∪ First(H);
    else First(F) ← First(G);
    endif
    if ε ∈ L(H) then
        FollowLast(F) ← FollowLast(G) ∪ First(H) ∪ FollowLast(H);
    else FollowLast(F) ← FollowLast(H);
    endif
case F = G^{m..n}:
    DC4: x ∈ FollowLast(G) ∧ y ∈ First(G) ∧ (x)♮ = (y)♮ ⇒ x = y
    First(F) ← First(G);
    if F is flexible in E then
        FollowLast(F) ← FollowLast(G) ∪ First(G);
    else FollowLast(F) ← FollowLast(G);
    endif

Figure 3: “DC rules” for checking the determinism of subexpressions
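The DC rules of Figure 3 can be sketched in Java as a single bottom-up traversal. This is an illustrative sketch only: the class `DcRulesSketch`, its `Node` type and the helpers `clash` and `union` are hypothetical names, the `flexible` flag is assumed to have been set by an earlier markFlexible pass, and hash-based sets are swapped in for the ordered lists discussed in the text, so the sketch does not by itself achieve the stated running time.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

class DcRulesSketch {
    enum Kind { POSITION, OPTIONAL, CHOICE, CONCAT, ITERATION }

    /** Hypothetical node; distinct POSITION objects are distinct positions. */
    static final class Node {
        final Kind kind; final Node g, h; final String symbol;
        final boolean nullable, flexible;   // flexible: assumed precomputed
        Set<Node> first, followLast;        // filled in by deterministic()
        Node(Kind kind, Node g, Node h, String symbol, boolean nullable, boolean flexible) {
            this.kind = kind; this.g = g; this.h = h; this.symbol = symbol;
            this.nullable = nullable; this.flexible = flexible;
        }
    }

    static Node pos(String s) { return new Node(Kind.POSITION, null, null, s, false, false); }
    static Node opt(Node g) { return new Node(Kind.OPTIONAL, g, null, null, true, false); }
    static Node alt(Node g, Node h) {
        return new Node(Kind.CHOICE, g, h, null, g.nullable || h.nullable, false);
    }
    static Node cat(Node g, Node h) {
        return new Node(Kind.CONCAT, g, h, null, g.nullable && h.nullable, false);
    }
    static Node iter(Node g, int min, boolean flexible) {
        return new Node(Kind.ITERATION, g, null, null, min == 0 || g.nullable, flexible);
    }

    static Set<Node> union(Set<Node> a, Set<Node> b) {
        Set<Node> out = new LinkedHashSet<>(a);
        out.addAll(b);
        return out;
    }

    /** True if some x in xs and a different position y in the First set ys share a symbol. */
    static boolean clash(Set<Node> xs, Set<Node> ys) {
        Map<String, Node> bySymbol = new HashMap<>();
        for (Node y : ys) bySymbol.put(y.symbol, y);   // First sets hold one position per symbol
        for (Node x : xs) {
            Node y = bySymbol.get(x.symbol);
            if (y != null && y != x) return true;
        }
        return false;
    }

    /** Computes First and FollowLast bottom-up; false as soon as a constraint fails. */
    static boolean deterministic(Node f) {
        if (f.g != null && !deterministic(f.g)) return false;
        if (f.h != null && !deterministic(f.h)) return false;
        switch (f.kind) {
            case POSITION:
                f.first = new LinkedHashSet<>();
                f.first.add(f);
                f.followLast = new LinkedHashSet<>();
                return true;
            case OPTIONAL:
                f.first = f.g.first;
                f.followLast = f.g.followLast;
                return true;
            case CHOICE:
                if (clash(f.g.first, f.h.first)) return false;                  // DC1
                f.first = union(f.g.first, f.h.first);
                f.followLast = union(f.g.followLast, f.h.followLast);
                return true;
            case CONCAT:
                if (clash(f.g.followLast, f.h.first)) return false;             // DC2
                if (f.g.nullable && clash(f.g.first, f.h.first)) return false;  // DC3
                f.first = f.g.nullable ? union(f.g.first, f.h.first) : f.g.first;
                f.followLast = f.h.nullable
                        ? union(union(f.g.followLast, f.h.first), f.h.followLast)
                        : f.h.followLast;
                return true;
            case ITERATION:
                if (clash(f.g.followLast, f.g.first)) return false;             // DC4
                f.first = f.g.first;
                f.followLast = f.flexible ? union(f.g.followLast, f.g.first) : f.g.followLast;
                return true;
            default:
                throw new AssertionError(f.kind);
        }
    }

    public static void main(String[] args) {
        Node e5 = cat(iter(cat(pos("a"), pos("b")), 1, true), pos("a"));   // (ab)^{1..2} a
        Node e6 = cat(iter(cat(pos("a"), pos("b")), 2, false), pos("a"));  // (ab)^{2..2} a
        System.out.println(deterministic(e5) + " " + deterministic(e6));
    }
}
```

A single `clash` helper serves all four constraints because the comparison `y != x` only matters for DC4, where FollowLast(G) and First(G) may share positions; in DC1 to DC3 the two sets come from distinct subexpressions.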

each subexpression F, as defined by equations (1), (3), (4), and by Definition 2.3 of Foll(F). There are five cases to consider:

When F = x ∈ Π, the assignment Fi(F) ← {x} is correct by L(F) = {x}, and the assignment FL(F) ← ∅ is correct by Foll(F) = ∅.

When F = G?, the First and Last sets and the Follow relation are the same for F and G. This implies that Fi(F) gets the correct value and FollowLast(F) = FollowLast(G), which is assigned to FL(F) correctly.

When F = (G | H), the First set, the Last set, and the Follow relation of F are, respectively, the disjoint union of the First sets, of the Last sets, and of the Follow relations, of G and H. This implies that Fi(F) ← First(G) ∪ First(H) is a correct assignment, and FollowLast(F) = FollowLast(G) ∪ FollowLast(H), which is assigned to FL(F) correctly.

When F = (GH), the value of First(F) depends on the nullability of G, and the value of Last(F) depends on the nullability of H. So, for both Fi(F) and FL(F) there are two cases to consider. If ε ∈ L(G), then First(F) = First(G) ∪ First(H), and otherwise First(F) = First(G). In both cases the correct value is assigned to Fi(F). If ε ∉ L(H), then Last(F) = Last(H). This implies that $\mathrm{FollowLast}(F) = \bigcup_{x \in \mathrm{Last}(H)} \mathrm{follow}_F(x)$. For each x ∈ Last(H) we have $\mathrm{follow}_F(x) = \mathrm{follow}_H(x)$. Together these give that

\[ \mathrm{FollowLast}(F) = \bigcup_{x \in \mathrm{Last}(H)} \mathrm{follow}_H(x) = \mathrm{FollowLast}(H) , \]

which is assigned to FL(F). On the other hand, if ε ∈ L(H), then Last(F) = Last(G) ∪ Last(H). Again, for each x ∈ Last(H) we have $\mathrm{follow}_F(x) = \mathrm{follow}_H(x)$, and for each x ∈ Last(G) we have $\mathrm{follow}_F(x) = \mathrm{follow}_G(x) \cup \mathrm{First}(H)$. Plugging these into equation (4) gives

\[ \mathrm{FollowLast}(F) = \mathrm{FollowLast}(G) \cup \mathrm{First}(H) \cup \mathrm{FollowLast}(H) , \]

which is assigned to FL(F).

Finally, when $F = G^{m..n}$, we have First(F) = First(G) and Last(F) = Last(G). So, the correct value is assigned to Fi(F). If F is not flexible, Foll(F) = Foll(G), and thus

\[ \mathrm{FollowLast}(F) = \bigcup_{x \in \mathrm{Last}(G)} \mathrm{follow}_G(x) = \mathrm{FollowLast}(G) . \]

If F is flexible, Foll(F) = Foll(G) ∪ [Last(G) × First(G)], which means that for each x ∈ Last(F) we have $\mathrm{follow}_F(x) = \mathrm{follow}_G(x) \cup \mathrm{First}(G)$. This implies that

\[ \mathrm{FollowLast}(F) = \mathrm{FollowLast}(G) \cup \mathrm{First}(G) . \]

In both cases the correct value of FollowLast(F) is assigned to FL(F). □

The following theorem states that computing the First and FollowLast sets and checking the related determinism constraints are sufficient for establishing the determinism of a given expression E.

Theorem 3.2. A marked expression E is nondeterministic if and only if there are two different positions x, y ∈ Pos(E) with (x)♮ = (y)♮, and a subexpression F of E such that
(1) x, y ∈ First(F), or
(2) F = (GH) with x ∈ FollowLast(G) and y ∈ First(H), or
(3) $F = G^{m..n}$ with x ∈ FollowLast(G) and y ∈ First(G).

Proof. First assume that E is nondeterministic, which is equivalent to at least one of the conditions (A), (B), and (C) of Theorem 2.4 holding. Condition (A) implies condition (1). Condition (B) holds only if condition (2) or (3) holds. Condition (C) is equivalent to condition (3). Then assume that at least one of the conditions (1), (2), and (3) holds. Condition (1) holds only if condition (A) or (B) of Theorem 2.4 holds. Condition (2) implies condition (B), and condition (3) is again equivalent to condition (C). □

Observe that the determinism constraints DC1, ..., DC4 of Figure 3 ensure that the subexpressions of E do not satisfy the conditions of Theorem 3.2: Constraints DC1 and DC3 prevent condition (1) (competing First positions), and constraints DC2 and DC4 are negations of conditions (2) and (3), respectively. The formulation of constraint DC4 is slightly more involved than the others, because FollowLast(G) and First(G) may have some positions in common, while the sets controlled by the other conditions originate from distinct subexpressions and are thus disjoint. As an example, consider the deterministic expression

\[ G = (x?\, a_1? \mid y\, a_2? \mid z\, a_3?) . \]

Here we have First(G) = {x, $a_1$, y, z} and FollowLast(G) = {$a_1$, $a_2$, $a_3$}, which share the position $a_1$. As this example shows, the FollowLast sets may contain several occurrences of the same symbol even though the expression is deterministic, while multiple occurrences of any symbol in a First set are an immediate indication of nondeterminism.

We can terminate the examination and report that E is nondeterministic as soon as one of its subexpressions violates any of the determinism constraints. This implies that no symbol of the alphabet Σ appears more than once in any First set, and thus each union of the First sets can be performed in O(|Σ|) steps, simultaneously keeping track of related determinism constraints. A simple strategy is to represent the sets as lists of positions, maintained in increasing order of their underlying symbols, and to implement the unions and the checking of constraints DC1 and DC3 by merging the lists.

As discussed above, the FollowLast sets may contain several occurrences of the same symbol even though the expression is deterministic. Nevertheless, the verification of constraints DC2 and DC4 can be supported by storing in each FollowLast set at most two occurrences of any symbol. Constraint DC2 is only concerned with checking that FollowLast(G) does not contain any occurrences of symbols that occur in First(H). Similarly, constraint DC4 is only concerned with checking that FollowLast(G) does not contain, for any position y ∈ First(G), an occurrence of the symbol (y)♮ at any position different from y. Similarly to the implementation of First sets, it is straightforward to realize the FollowLast sets as lists of positions ordered by their underlying symbol, and realize their unions and the checking of the determinism constraints by merging the lists. Since each FollowLast list contains at most 2|Σ| positions, each of these merge steps can be performed in O(|Σ|) steps.

As an example, consider the expression $F = G^{2..3}$ whose body is the expression $G = (x?\, a_1? \mid y\, a_2? \mid z\, a_3?)$ considered above. This expression would be found nondeterministic as follows: Now First(G) = {x, $a_1$, y, z} and FollowLast(G) = {$a_1$, $a_2$, $a_3$}. Let A and B be list representations of these sets. Ordering the positions by the alphabetic order of their symbols, list A is [$a_1$, x, y, z], and list B would contain any two of $a_1$, $a_2$, and $a_3$. The violation of determinism constraint DC4 would appear while merging the lists A and B, by observing that list B contains either position $a_2$ or $a_3$, which competes with the position $a_1$ ∈ A.

Based on the above discussion, the rules and constraints of Figure 3 can be evaluated in O(|Σ|) steps for each subexpression of E. This gives us the following theorem:

Theorem 3.3. The determinism of a regular expression with numeric iterations over a fixed alphabet can be checked in linear time. If the alphabet is unlimited, the check can be performed in time $O(n^2 / \log n)$, where n is the length of the binary representation of the expression.

Proof. An expression of length n contains O(n) subexpressions, each of which can be examined by the above discussion in O(|Σ|) steps. The size of a fixed alphabet is O(1), and each of its symbols can be represented in constant space and thus operated on in constant time, which gives the first result.
In the case of an unlimited alphabet, the result follows from the simple observation that only the symbols that occur in the expression are relevant for its determinism, and thus it suffices to run the algorithm on this finite part of the alphabet alone. Let n be the length of the representation of E as a binary string, and k be the number of different alphabet symbols that occur in E. Then n = Ω(k log k), since we need Ω(log k) bits to represent k different symbols. Thus n ≥ ck log k for some constant c and all sufficiently large n. For such c and n also

\[ \frac{k}{n / \log n} \le \frac{k}{ck \log k / \log(ck \log k)} \le 3/c , \]

when k ≥ c. Therefore k = O(n / log n). The symbols in E can be identified and substituted by numbers {1, ..., k} in linear time, say, by storing them in a trie; see, e.g., Navarro and Raffinot (2002, Sec. 3.1). The size of this new alphabet is then O(n / log n), which implies the second claim. □

The above linear complexity bound is obviously optimal for expressions over a fixed alphabet. On the other hand, the $O(n^2 / \log n)$ upper bound for expressions over an unlimited alphabet was derived using a crude counting argument. This bound might be lowered, either by a more careful analysis or a more elaborate algorithm. It seems difficult to go below time Θ(n|Σ|) using a method that is based on examining, at each of the O(n) subexpressions, whether sets of O(|Σ|) symbols are disjoint. Two ordered lists of length m and n can be merged in time O(m log(n/m)), where m ≤ n, and this bound is optimal for comparison-based algorithms (Brown and Tarjan, 1979). Thus an asymptotically more efficient algorithm would seem to require an approach which is different from merging-based examination of First and FollowLast sets.

4. Experimentation

This section describes the experiments that were carried out for verifying and demonstrating the viability of our approach. The algorithms described above were implemented using Java 6.0 (Sun Microsystems, 2006b). The implementation was written for these experiments, and it does not constitute a complete XML Schema processor.

The implementation includes a simple recursive-descent parser for a subset of XML Schema complexType definitions, which recognizes the content-model particles choice, sequence and element, and their minOccurs and maxOccurs attributes. These features are sufficient for representing all regular expressions with numeric iterations that were discussed above. The parser uses StAX (Streaming API for XML) (BEA, 2003; Sun Microsystems, 2005, Chap. 3) for scanning its input, and translates content models into expression trees built of leaves for element particles, and of operator nodes for choices, sequences, optionality, and numeric iterations of subtrees. Operator nodes have fixed arities; choice and sequence nodes are binary, and nodes for optionality and numeric iteration are unary. One relevant design decision was to realize sequences of subexpressions within a choice or a sequence as balanced binary trees, since they were to be quite long in our experiments; see test collections $Rep_n$, $Seq_n$, and $Choice_n$ below.
This strategy leads to shallow expression trees, which can be processed in an intuitive manner using recursion, without stumbling to limitations of Java’s recursion stack (Sun Microsystems, 2002). Another design decision was to make the run-time comparison of symbols to take constant time only, no matter how long their string representations might be. For this, the parser uses a hash table to map element names to small integers, which are used as the internal representation of symbol names. While building the expression tree, the parser also determines the nullability of each subexpression and records it in the corresponding node of the expression tree. The actual UPA checking begins with a preprocessing that traverses the expression tree for locating and marking its flexible iterations, using an implementation of procedure markFlexible in Figure 2. This is followed by another traversal of the expression tree that computes the First and FollowLast sets according to the rules of Figure 3, and checks that the determinism constraints DC1,. . . ,DC4 are not violated. The First and FollowLast sets are represented using Java ArrayList objects, and the positions contained in them are represented as (references to) leaves of the expression tree.
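The two representation choices mentioned above, small integers for element names and ordered position lists that are merged to form unions, can be sketched together as follows. This is a hedged illustration rather than the evaluated implementation: the class `FirstListMergeSketch`, the `Position` type and the methods `intern` and `mergeDisjoint` are hypothetical, and returning `null` to signal a DC1 or DC3 violation is a simplification chosen for brevity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FirstListMergeSketch {
    /** A position (leaf of the expression tree) carrying its symbol code. */
    static final class Position {
        final int symbol;    // small integer standing for the element name
        final String name;   // kept only for readable output
        Position(int symbol, String name) { this.symbol = symbol; this.name = name; }
    }

    private final Map<String, Integer> codes = new HashMap<>();

    /** Maps an element name to a small integer; equal names get equal codes. */
    int intern(String name) {
        Integer code = codes.get(name);
        if (code == null) {
            code = codes.size();
            codes.put(name, code);
        }
        return code;
    }

    /** Creates a new position for an occurrence of the given element name. */
    Position position(String name) {
        return new Position(intern(name), name);
    }

    /**
     * Merges two First lists that are sorted by symbol code. Returns the merged
     * list, or null if the same symbol occurs in both lists, which corresponds
     * to a violation of constraint DC1 or DC3.
     */
    static List<Position> mergeDisjoint(List<Position> a, List<Position> b) {
        List<Position> out = new ArrayList<>(a.size() + b.size());
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int sa = a.get(i).symbol, sb = b.get(j).symbol;
            if (sa == sb) return null;                 // competing positions
            out.add(sa < sb ? a.get(i++) : b.get(j++));
        }
        while (i < a.size()) out.add(a.get(i++));
        while (j < b.size()) out.add(b.get(j++));
        return out;
    }

    public static void main(String[] args) {
        FirstListMergeSketch table = new FirstListMergeSketch();
        List<Position> left = new ArrayList<>();
        left.add(table.position("apple"));
        List<Position> right = new ArrayList<>();
        right.add(table.position("orange"));
        System.out.println(mergeDisjoint(left, right).size());
    }
}
```

Since a First list of a subexpression that has passed its checks holds at most one position per symbol, each merge touches at most 2|Σ| entries, and comparing integer codes keeps every comparison constant-time regardless of the length of the element names.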


For comparison, the previous quadratic-time determinism-checking algorithm of (Kilpeläinen and Tuhkanen, 2007) was also implemented, using data structures and sub-routines similar to those that were used for the new linear-time algorithm. The main difference between these algorithms is that the quadratic-time algorithm computes for each position of the expression a separate follow list, whereas the linear-time method computes for each subexpression a single FollowLast list. In order to update the follow lists, the quadratic-time algorithm also computes for each subexpression a list of its Last positions.

The built-in XML Schema validator of the current Java platform, Apache Xerces (Apache Software Foundation, 2007), was used as an efficiency baseline in experimental comparisons. Its source code is available as a part of the snapshot release of the Java platform. [Footnote 7: http://download.java.net/jdk6/6u3/promoted/b05/index.html] For measuring the time of the built-in XML Schema UPA checking of current Java (JDK 1.6), a slightly modified copy of Xerces was created by manually inserting timing instructions around the single call of the method checkUniqueParticleAttribution, in Xerces class XSConstraints.

All experiments were run on a single-user 3 GHz Pentium 4 workstation with 1 GB of main memory. The experiments were carried out under the Red Hat Linux (version 6.0.52) operating system. Applications were compiled and run using OpenJDK version 1.6.0. Java class files were executed using the -Xincgc option, which causes the Java garbage collector to run continuously in the background; this was done in order to avoid noticeable pauses caused by garbage collection, which could have interfered with the measurements. All times were measured using the System.nanoTime() method of Java.

It is likely that the time needed for UPA checking is negligible for most content models which appear in XML schemas in practical use. For example, the schema for XML Schema instances [Footnote 8: http://www.w3.org/2001/XMLSchema.xsd] is likely more complex than most domain-specific schemas. Its length as a text file is about 2400 lines, or 86 KB, and it contains 56 complexType definitions. Parsing this schema and translating it into an internal Schema object using Xerces took about 1.3 seconds of elapsed time. This processing includes 55 UPA checks, whose total time was about 3 ms only. [Footnote 9: One complexType definition, openAttrs, includes no element content model.] Measuring method invocations whose length is in the scale of microseconds is rather inaccurate, but these results suggest that the efficiency of UPA checking might not be an issue with many real-world schemas.

On the other hand, some schemas developed for production use may also be challenging for current implementations. For example, Xerces fails to process the ISO/IEC MPEG-7 Version 2 Schema in our experiments. The challenging element declaration of this schema is shown in a simplified form in Figure 4. Our implementation processes the shown schema in total time of about 0.3 seconds, out of which UPA testing takes about 5 ms. Apache Xerces, on the other hand, fails with this schema completely. When we increased the heap space of the Java virtual


Figure 4: A simplified fragment of the ISO/IEC MPEG-7 schema

machine to 1000 MB, Xerces processed the same schema for about 2 minutes before throwing an OutOfMemoryError.

Instead of harvesting further existing schemas, classes of test cases consisting of increasingly large content models were constructed, in order to be able to measure observable run times and to systematically estimate the scalability of the implemented algorithms with respect to large and complex content models. All test cases consisted of deterministic content models, which were also correctly accepted by each test run. That is, the implementations examined each test case completely, without aborting the examination, which would have happened with nondeterministic content models.

Examination of the source code of Xerces reveals that Xerces, essentially, first translates the content model into an automaton, and then checks the outgoing transitions of this automaton for potential overlap of their symbols. This leads to the hypothesis that expressions which translate into automata with a large number of transitions would be hard cases for the UPA check of Xerces. As such, we created a collection of content model expressions $Rep_n$ of the form

\[ (t_n \mid t_{n-1} \mid \cdots \mid t_1)^{0..\infty} , \]

where $t_i$ for i = 1, ..., n are distinct element names. That is, $Rep_n$ is an optional and unlimited iteration of a choice of n different symbols. These expressions were represented concretely as XML Schema files like the one shown in Figure 5. [Footnote 10: Schema files were generated using XSLT (Clark, 1999) scripts parametrized with n.]

The times used by Xerces for UPA checking while processing content models of $Rep_{200}$, $Rep_{400}$, ..., $Rep_{3000}$ are displayed in Figure 6 together with times of our implementation (“quadratic UPA”) of the previous quadratic-time algorithm of (Kilpeläinen and Tuhkanen, 2007). The size of the schema files for these expressions ranges from 14 KB to 216 KB. Each test run was repeated 10 times. Most plots contain the timing results of each of the 10 repeats. The measurements of each test case are connected by a vertical bar, and the medians


Figure 5: Representation of the test Rep 3 as an XML schema.

of consecutive sets of measurements are connected by horizontal line segments. (When there are two medians this sometimes leads to ragged curves, especially in the case of short times whose measuring is less accurate. As an example, see the lowest curve in Figure 7.) The initial runs in each set of 10 repeats were systematically the longest ones. This is likely due to the Java just-in-time compiler, which obviously takes most effect during a couple of initial repeats. In some of the plots these maximum times have been manually eliminated, if their large divergence from the others would have confused the results. The time usage of both Xerces UPA checking and our “quadratic UPA” shown in Figure 6 clearly increases quadratically with respect to the size of the expressions. The times reported of Xerces include the UPA checking only, and not the times of parsing expressions or constructing automata. Similarly, the times measured of our implementations include the preprocessing of the expression tree by the markFlexible procedure and the computation of the sets used for checking the determinism—that is, the First, Last and follow lists in the case of the quadratic-time algorithm, and the First and FollowLast lists in the case of the linear-time algorithm—but not the parsing of the input nor the construction of the expression tree. For both implementations the UPA checking reported in Figure 6 dominates the overall processing time of these tests, ranging from 40 % (for Rep 200 ) to 90 % (for Rep 3000 ) of total time. The reason is, obviously, that in this case the time of UPA checking grows an order of magnitude faster than the time of constructing expression trees or automata. The UPA checking times of Xerces, “quadratic UPA”, and our implementation of the new algorithm described in Section 3 (“linear UPA”) are displayed together in Figure 7. The reported UPA checking times of our linear-time implementation amount to 10–20 % of its total execution times. 
A logarithmic scale is used for displaying values of greatly differing magnitude. The time usage of our linear-time implementation increases an order of magnitude more slowly than the time usage of Xerces; for $\mathit{Rep}_{200}$ it is roughly 40 %, and for $\mathit{Rep}_{3000}$ it is less

[Plot not reproduced. Vertical axis: UPA checking time, 0.2 s to 1.8 s. Horizontal axis: n, 500 to 3000. Series: Xerces Rep n, quadratic UPA Rep n.]

Figure 6: UPA checking times with Xerces and the algorithm of (Kilpeläinen and Tuhkanen, 2007) on $\mathit{Rep}_n$ as a function of n.

than 1 % of the time used by Xerces.

The class of the $\mathit{Rep}_n$ test cases was intentionally designed to be hard for Xerces. As presumably unbiased test cases, we found content types called Sequence and Choice in an XML schema of the XSDBench project (http://www.codesynthesis.com/projects/xsdbench/). The complexType definition of the Sequence type is shown in Figure 8. The content model of the Sequence type is essentially of the form

$(F)?\; x\; (F)^{0..\infty}\; y\; (F)\; z\; (F)^{1..\infty}$,

where the repeated subexpression F is a sequence of the form $a?\; b^{0..\infty}\; c\; d^{1..\infty}$. We generalized this expression into a class of increasingly long expressions $\mathit{Seq}_n$ of the form

$(F \cdots F)?\; x\; (F \cdots F)^{0..\infty}\; y\; (F \cdots F)\; z\; (F \cdots F)^{1..\infty}$,

where the subexpression F is repeated n times within each of the four parentheses. The content model of type Choice in the XSDBench schema is rather similar; it is essentially of the form

$(G)?\; x\; (G)^{0..\infty}\; y\; (G)\; z\; (G)^{1..\infty}$,

[Plot not reproduced. Vertical axis: UPA checking time, logarithmic scale from 1 ms to 10 s. Horizontal axis: n, 500 to 3000. Series: Xerces Rep n, quadratic UPA Rep n, linear UPA Rep n.]

Figure 7: UPA checking times of $\mathit{Rep}_n$ by Xerces and our quadratic-time and linear-time algorithms.

where the repeated subexpression G is this time a choice of the form $a? \mid b^{0..\infty} \mid c \mid d^{1..\infty}$. This content model was generalized into a class of expressions $\mathit{Choice}_n$ of the form

$(G_1 \mid \cdots \mid G_n)?\; x\; (G_1 \mid \cdots \mid G_n)^{0..\infty}\; y\; (G_1 \mid \cdots \mid G_n)\; z\; (G_1 \mid \cdots \mid G_n)^{1..\infty}$.

The subexpressions $G_i$ above are choice expressions $a_i? \mid b_i^{0..\infty} \mid c_i \mid d_i^{1..\infty}$, where $a_i$, $b_i$, $c_i$, and $d_i$ are element names generated as freshly numbered copies of the original ones. This was done to keep the choice subexpressions deterministic. (The actual schema files use element names derived from the fruit names, such as apple and orange, used in the original XSDBench schema.)

The UPA checking times of Xerces on tests $\mathit{Seq}_n$ and $\mathit{Choice}_n$ for n = 20, 30, . . . , 100 are shown in Figure 9. Again, the times are seen to increase quadratically with respect to the length of the expressions. For some reason, the $\mathit{Seq}_n$ tests that involve sequential subexpressions appear to be more demanding for Xerces than those made of choices, roughly by a factor of three. The execution times keep rising steeply beyond the values shown in Figure 9. For example, Xerces uses roughly 75 seconds for the UPA checking of $\mathit{Seq}_{200}$.

Figure 8: XSDBench complexType definition of Sequence.

[Plot not reproduced. Vertical axis: UPA checking time, 1 s to 9 s. Horizontal axis: n, 20 to 100. Series: Xerces Seq n, Xerces Choice n.]

Figure 9: Xerces UPA checking times of $\mathit{Seq}_n$ and $\mathit{Choice}_n$ for n = 20, . . . , 100.

Our implementation can handle much larger cases of $\mathit{Seq}_n$ and $\mathit{Choice}_n$ than Xerces. (Xerces throws a StackOverflowError in method calcFollowList on $\mathit{Seq}_n$ with n ≥ 800.) The results displayed in Figure 10 confirm our expectation of the linear scalability of the new algorithm. The time of the “quadratic UPA” algorithm on the $\mathit{Seq}_n$ tests is also shown in Figure 10. In this case its worst-case complexity does not arise, and the time usage of the algorithm is comparable to “linear UPA” on the $\mathit{Choice}_n$ tests. On the other hand, the time usage of “quadratic UPA” on the $\mathit{Choice}_n$ tests increases quadratically. This is the reason why these times are not shown in this figure: “quadratic UPA” takes over 5 seconds already at the start of the plots, on the test $\mathit{Choice}_{1000}$. In contrast to Xerces, “linear UPA” finds the $\mathit{Choice}_n$ tests harder than the $\mathit{Seq}_n$ tests, roughly by a factor of two.

[Plot not reproduced. Vertical axis: UPA checking time, 0.5 s to 2.0 s. Horizontal axis: n, 0 to 20000. Series: linear UPA Choice n, quadratic UPA Seq n, linear UPA Seq n.]

Figure 10: Linear UPA checking times of $\mathit{Choice}_n$ and $\mathit{Seq}_n$ for n = 1000, . . . , 20000.

The size of the schema file for $\mathit{Choice}_{1000}$ is about 1.5 MB, and the size of the schema file for $\mathit{Choice}_{20000}$ is about 30 MB. Elapsed total time for processing the schema of $\mathit{Choice}_{20000}$ by our linear-time implementation is roughly 10 seconds. As a comparison, the Linux wc utility, which counts the number of lines, words and characters in a text file, uses about 3 seconds on this file.

Observe that the size of the alphabet increases here from the 4003 elements of $\mathit{Choice}_{1000}$ up to the 80 003 elements of $\mathit{Choice}_{20000}$, that is, by a factor of 20. Interestingly, this increase does not show up as an additional factor in run times, as it could according to the O(|Σ|n) worst-case complexity estimate. (The last experiment discussed in this section demonstrates that an increasing alphabet size may also lead to the quadratic worst-case behavior.)

The UPA checking times of Xerces and our implementations on $\mathit{Choice}_{10}$, $\mathit{Choice}_{20}$, . . . , $\mathit{Choice}_{150}$ are shown together in Figure 11. Recall that, of the test collections $\mathit{Seq}_n$ and $\mathit{Choice}_n$, $\mathit{Choice}_n$ is the

[Plot not reproduced. Vertical axis: UPA checking time, logarithmic scale from 1 ms to 10 s. Horizontal axis: n, 20 to 140. Series: Xerces Choice n, quadratic UPA Choice n, linear UPA Choice n.]

Figure 11: UPA checking times of $\mathit{Choice}_n$ by Xerces and by our implementations.

easier one for Xerces and the harder one for our implementations. We see that at the end of the plot “linear UPA” is roughly 10 times faster than “quadratic UPA”, and roughly 1000 times faster than the UPA check of Xerces.

The treatment of numeric occurrence indicators is a challenge to traditional automata-based implementations of regular expressions. As one of them, Xerces eliminates numeric iterations by repeating the involved subexpressions the number of times specified by the occurrence bounds, which easily leads to problems. Xerces seems to avoid this expansion if the iterated expression is a plain element particle. For example, Xerces is able to check the determinism of an expression like $F^{1..15000}$ within a few microseconds when F is a single element, but if F is a sequence or a choice, the processing fails by running out of memory.

For experimenting with the effect of increasing occurrence bounds we constructed a class of test cases $\mathit{Occ}_n$ of the form $(t_1 \mid \cdots \mid t_{100})^{1..n}$. That is, content model $\mathit{Occ}_n$ is a choice of 100 different elements, with maxOccurs = n. The UPA checking times of $\mathit{Occ}_{10}$, . . . , $\mathit{Occ}_{200}$ by Xerces and our “linear UPA” implementation are shown in Figure 12. The times of the “quadratic UPA” are not shown since they are rather similar to those of “linear UPA”, just 1–2 ms longer. The corresponding schema files are almost identical and each about 8 KB long, differing only in the value of their maxOccurs attribute. We were not able to test Xerces with much larger values of n without running out of memory. The time usage of Xerces UPA checking clearly increases linearly with

[Plot not reproduced. Vertical axis: UPA checking time, 2 ms to 10 ms. Horizontal axis: n, 50 to 200. Series: Xerces Occ n, linear UPA Occ n.]

Figure 12: UPA checking times of $\mathit{Occ}_n$ by Xerces and our implementation.

respect to n, which is on the other hand exponential with respect to the length of the input. UPA checking is a rather negligible part of total processing by Xerces here, ranging from 8 % of total time on $\mathit{Occ}_{10}$ down to 0.4 % on $\mathit{Occ}_{200}$. The time usage of our implementation stays constant, as expected.

Large occurrence values are not the only source of exponentially increasing processing times. Xerces may expand nested iterations with small occurrence values into large automata, too. For investigating this behavior, we created a class of test cases $\mathit{Nest}_n$, which are content model expressions of the form

$(\cdots(t_1 \mid t_2 \mid \cdots \mid t_{100})^{2..2} \cdots)^{2..2}$,

where a choice of 100 different elements appears inside n nested iterations with minOccurs = maxOccurs = 2. The UPA checking times by Xerces and our linear-time implementation on test cases $\mathit{Nest}_1$, $\mathit{Nest}_2$, . . . , $\mathit{Nest}_8$ are displayed in Figure 13. The length of the schema files for these tests ranges between 209 and 223 lines; the size of each is about 8 KB. Values of n beyond 8 caused Xerces to run out of memory.

The times measured for Xerces showed large upward variations. For clarity, only the shortest UPA checking times of Xerces are displayed in Figure 13. For our implementation, all times except the two longest ones are displayed for each set of 10 repeats. The maximum times were systematically those of the first two repeats of each test run. The time usage of Xerces clearly increases exponentially with respect to the length of the content models, while the time used by our implementation hardly changes. The linear increase in execution times of the new algorithm starts to be observable on test cases much larger than those displayed in Figure 13; see below.

[Plot not reproduced. Vertical axis: UPA checking time, 2 ms to 12 ms. Horizontal axis: n, 1 to 8. Series: Xerces Nest n, linear UPA Nest n.]

Figure 13: UPA checking times of $\mathit{Nest}_n$ for n = 1, . . . , 8 by Xerces and the linear-time algorithm.
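The blow-up observed on the Occ and Nest tests can be illustrated by counting element positions after numeric iterations are unrolled. The following is a minimal sketch under a stated assumption: an iteration with maxOccurs = k is taken to be expanded into k copies of its body. The actual automaton construction of Xerces is not described here, and the function names are hypothetical.

```python
def expanded_positions(base_positions, max_occurs_per_level):
    """Count symbol positions after unrolling nested numeric iterations.

    Assumption (illustration only): an iteration with maxOccurs = k is
    unrolled into k copies of its body, so sizes multiply across levels.
    """
    size = base_positions
    for k in max_occurs_per_level:  # innermost iteration first
        size *= k
    return size


def occ_positions(n, choices=100):
    # Occ_n: a choice of `choices` elements under one iteration, maxOccurs = n.
    return expanded_positions(choices, [n])


def nest_positions(n, choices=100):
    # Nest_n: the same choice inside n nested iterations with maxOccurs = 2.
    return expanded_positions(choices, [2] * n)
```

Under this assumption the expanded size of an Occ test grows linearly in n, although the schema file stays at the same size, while the expanded size of a Nest test doubles with every additional nesting level.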

A comparison between “linear UPA” and “quadratic UPA” on the $\mathit{Nest}_n$ tests for n = 100, 150, . . . , 600 is shown in Figure 14. The sizes of the schema files for $\mathit{Nest}_{100}$, . . . , $\mathit{Nest}_{600}$ range from 76 KB to 1280 KB. These tests demonstrate the difference in efficiency between the previous quadratic-time algorithm and the new linear-time algorithm. The difference in speed, a factor of roughly 70 in favor of “linear UPA”, is explained by the fact that while “quadratic UPA” examines the follow lists of 100 positions $t_1, \ldots, t_{100}$ at each of the n nested iterations, “linear UPA” computes just a single FollowLast list at each of those iterations.

As our last experiment we consider a collection of test cases called $Q^m_n$ (where Q refers to Quadratic), which exhibits the worst-case behavior of all the considered algorithms. The content model $Q^m_n$ consists of n nested iterations, each of which begins with an optional choice group $C^m_i$ that consists of m unique elements, according to the following inductive rules:

$$Q^m_n = \begin{cases} (C^m_1)^{0..2} & \text{when } n = 1, \text{ and} \\ \bigl((C^m_n)?\; Q^m_{n-1}\bigr)^{1..2} & \text{when } n > 1; \end{cases}$$

Above, the subexpression $C^m_i$ is, for each level of the iterations i = 1, . . . , n, a choice of m unique element names, that is, $C^m_i = a_{i,1} \mid a_{i,2} \mid \cdots \mid a_{i,m}$. So, $Q^m_n$ comprises the following nesting of n levels of iterations:

$$\Bigl((a_{n,1} \mid \cdots \mid a_{n,m})?\; \cdots \bigl((a_{2,1} \mid \cdots \mid a_{2,m})?\; (a_{1,1} \mid \cdots \mid a_{1,m})^{0..2}\bigr)^{1..2} \cdots \Bigr)^{1..2}$$
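The inductive definition above can be turned into a small generator, which is convenient for checking the shape of the expression by hand. This is only a sketch: the textual syntax (element names such as a2_1 and the ad hoc suffix {lo..hi} for occurrence bounds) and the function names are assumptions made for illustration, not the XML Schema representation used in the experiments.

```python
def choice_group(i, m):
    """C_i^m: a choice of m unique element names a_{i,1} | ... | a_{i,m}."""
    return "(" + " | ".join(f"a{i}_{j}" for j in range(1, m + 1)) + ")"


def q_expression(n, m):
    """Build Q_n^m as a string by following the inductive rules."""
    # Base case n = 1: (C_1^m)^{0..2}
    expr = choice_group(1, m) + "{0..2}"
    # Inductive case n > 1: ((C_n^m)? Q_{n-1}^m)^{1..2}
    for i in range(2, n + 1):
        expr = "(" + choice_group(i, m) + "? " + expr + "){1..2}"
    return expr


def position_count(n, m):
    """Number of element positions in Q_n^m: m unique names on each of n levels."""
    return n * m
```

For example, `q_expression(2, 2)` yields `((a2_1 | a2_2)? (a1_1 | a1_2){0..2}){1..2}`. Because every element name is unique, the alphabet size equals the number of positions, n × m.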

[Plot not reproduced. Vertical axis: UPA checking time, logarithmic scale from 1 ms to 1 s. Horizontal axis: n, 100 to 600. Series: quadratic UPA Nest n, linear UPA Nest n.]

Figure 14: UPA checking times of $\mathit{Nest}_n$ by our “quadratic” and “linear” implementations.

Notice that both the First and the FollowLast list of the expression $Q^m_i$ contain i × m positions. Since our new algorithm computes these lists for each i = 1, . . . , n, this leads to quadratic complexity. This behavior is clearly visible in Figure 15, where the UPA checking times of our “linear UPA” implementation have been plotted for XML Schema representations of tests $Q^{100}_{20}, Q^{100}_{40}, \ldots, Q^{100}_{400}$. The sizes of these schema files range from 64 KB to 1240 KB. The quadratic UPA checking time dominates total processing in these tests; for $Q^{100}_{20}$ the UPA checking takes about 45 % of total time, and for $Q^{100}_{400}$ around 80–90 % of total time. Notice that these results do not contradict the O(|Σ|n) complexity estimate, but the quadratic worst-case complexity requires that the expressions use increasingly large alphabets. Here, the alphabet ranges from the 2000 elements of the content model $Q^{100}_{20}$ up to the 40 000 elements of the content model $Q^{100}_{400}$.

The test cases $Q^m_n$ are even more challenging for “quadratic UPA” and for Xerces. Since these expressions comprise nested iterations, their expansion into automata causes an exponential blow-up in memory requirements. In our test setting, the largest nesting depth (n) for iterations built of choice groups with 100 elements that Xerces was able to process was 7. The size of the schema file for the test $Q^{100}_7$ is about 21 KB, and Xerces used about one minute to process it. The time of Xerces UPA checking for this schema was a negligible fraction of this, only about 0.6 ms. Complete processing of this schema by our “linear UPA” implementation took about 0.15 seconds, out of which determinism checking took about 20 %. The largest nesting depth of iterations made of choice groups with up to 13 elements that Xerces was able to process was 10; expression

[Plot not reproduced. Vertical axis: UPA checking time, 1 s to 4 s. Horizontal axis: schema file size, 200 KB to 1200 KB. Series: linear UPA on $Q^{100}_n$ for n = 20, 40, . . . , 400.]

Figure 15: UPA checking times by “linear UPA” with respect to schema file size, on $Q^{100}_n$ for n = 20, 40, . . . , 400.

depths above 10 with choice group sizes (m) above 5 caused Xerces to run out of memory. UPA checking times from experiments with schemas for content models $Q^{13}_n$ for nesting depths n = 1, . . . , 10 are displayed in Figure 16. For clarity, only the smallest of the Xerces UPA checking times is displayed in the figure, and the two longest (initial) runs have been excluded from the times measured for our implementations. (Notice also the use of a logarithmic scale on the time axis.)

The efficiency of “linear UPA” and “quadratic UPA” on medium-size schemas for $Q^{100}_1, Q^{100}_2, \ldots, Q^{100}_{20}$ is compared in Figure 17. The quadratic complexity of “linear UPA” is less obvious here than it is in Figure 15, but the new algorithm clearly scales here an order of magnitude better than “quadratic UPA”.

[Plot not reproduced. Vertical axis: UPA checking time, logarithmic scale from 1 ms to 100 ms. Horizontal axis: schema file size, 1000 B to 5000 B. Series: Xerces, quadratic UPA, and linear UPA on $Q^{13}_n$.]

Figure 16: UPA checking times by Xerces and by our implementations on $Q^{13}_n$, n = 1, . . . , 10, with respect to schema file size.

[Plot not reproduced. Vertical axis: UPA checking time, logarithmic scale from 1 ms to 1 s. Horizontal axis: schema file size, 10 KB to 60 KB. Series: quadratic UPA and linear UPA on $Q^{100}_n$ for n = 1, 2, . . . , 20.]

Figure 17: UPA checking times by “quadratic UPA” and by “linear UPA” on $Q^{100}_n$, n = 1, . . . , 20, with respect to schema file size.

In summary, the presented experiments are in harmony with the theoretical worst-case complexity estimates. With respect to actual run times, the new linear-time algorithm was faster on all tests than the previous quadratic-time algorithm and the UPA check of Xerces, often by several orders of magnitude.

5. Enhancements for full XML Schema

XML Schema contains, in addition to numeric occurrence indicators, other features that affect the determinism of content models. These include all groups, substitution groups, and element wildcards. Introductory discussions of them can be found in (Fallside and Walmsley, 2004) and (Walmsley, 2002). In this section we outline how all groups, substitution groups, and element wildcards could be supported in the linear-time determinism-checking algorithm based on First and FollowLast lists.

The XML Schema all group is a restricted version of the ’&’ operator of SGML (ISO, 1986) and of the interleave operator of Relax NG (Clark and Murata, 2001). An XML Schema all group consists of element particles only, and it may not be included in any sequence or choice group. Its semantics is to describe sequences where each of the specified elements occurs once, in any order. Individual elements or the entire group can be made optional by setting their minOccurs attribute to zero, but no other occurrence constraints are allowed. So, all groups form a special case quite distinct from other content models. Testing the determinism of an all group is easy: the group is deterministic if and only if no element name appears more than once in it.

XML Schema substitution groups are a mechanism which allows elements to appear in document instances as a substitute for another element, called the substitution group head. The name of the head element is given as the value of the substitutionGroup attribute in the element declaration. Substitution groups form a hierarchy, where a member of one substitution group may also be the head of another substitution group. An example of element declarations with substitution groups is given in Figure 18. According to those declarations, element A is the head of a substitution group that contains elements B and C. Similarly, element B is the head of the substitution group with members D and E.

Figure 18: XML Schema element declarations with two substitution groups

A straightforward way to implement substitution groups is to modify content model expressions by replacing any head element X of a substitution group by a choice of all the elements that can be used in the place of X. That is, given the declarations of Figure 18, any occurrences of B would be replaced by (B | D | E), and any occurrences of A would be replaced by (A | B | C | D | E).
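The two rules described above, the determinism test for all groups and the replacement of a substitution group head by a choice of its substitutes, can be sketched as follows. The data representation is an assumption made for illustration: an all group is given as a list of element names, and the substitution hierarchy as a mapping from each member to the head named in its substitutionGroup attribute. The function names are hypothetical, and the sketch assumes the hierarchy contains no cycles.

```python
from collections import Counter, deque


def all_group_is_deterministic(element_names):
    """An all group is deterministic iff no element name occurs more than once."""
    return all(count == 1 for count in Counter(element_names).values())


def substitutes(head, head_of):
    """Return `head` followed by all direct and indirect substitution members.

    `head_of` maps a member element name to the head of its substitution group.
    """
    result = [head]
    queue = deque([head])
    while queue:
        current = queue.popleft()
        members = sorted(m for m, h in head_of.items() if h == current)
        result.extend(members)
        queue.extend(members)  # a member may itself head another group
    return result


def expand_head(name, head_of):
    """Replace an element by a choice of everything that may appear in its place."""
    names = substitutes(name, head_of)
    if len(names) == 1:
        return name  # no substitution group: the element stays as it is
    return "(" + " | ".join(names) + ")"
```

With the hierarchy of Figure 18 written as `{"B": "A", "C": "A", "D": "B", "E": "B"}`, `expand_head("B", ...)` gives `(B | D | E)` and `expand_head("A", ...)` gives `(A | B | C | D | E)`.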
The head element of a substitution group can also be declared abstract, which means that only its substitutes, but not the head element itself, are allowed in document instances. This can be realized simply by excluding the head element from the choice group that is used as its replacement in the expressions.

Replacing substitution group heads by a choice of their alternatives increases the length of content models by a factor that is at most proportional to the number of different element names. If we assume that the size of the element name alphabet is bounded by a constant, the resulting UPA processing times remain linear with respect to the size of the schema. If a head element of a substitution group appears in an all group, replacing


it by a choice of its substitutes leads to a content model whose syntax is illegal. This should not be a problem, though, since all groups containing choices would be used only internally by the implementation. The rule for determinism remains the same as before: no element name may appear more than once in an all group expanded with choice groups either.

One of the useful features of XML Schema is that it supports the definition of schemas that utilize namespaces. XML Namespaces (Bray et al., 2006) is a mechanism for using multiple sets of element or attribute names in XML documents, identifying these sets, called namespaces, with URIs (Uniform Resource Identifiers). Two identifiers or names are deemed equal only if both their namespace URI and their local name (e.g. the element tag name) are the same. The namespace mechanism allows, for example, XHTML markup to be imported and used in domain-specific schemas in a controlled manner.

XML Schema element wildcards are related to namespaces. They are represented in content models by an any particle, which stands for an element with an arbitrary name. The namespaces from which the element name can be taken are specified in a namespace attribute. Wildcards extend the notion of nondeterminism of content models in an obvious manner: two competing positions are in conflict not only if they are occurrences of a common symbol (in the same namespace), but also when they belong to the same namespace and either of them is a wildcard. The value of the namespace attribute can also be, e.g., “##any”, meaning that the wildcard denotes elements from any namespace, and “##local”, meaning that the wildcard denotes elements that are not declared to belong to a namespace.

The linear-time UPA checking algorithm could be extended to handle element wildcards as follows. Use separate First and FollowLast lists for each namespace declared by, or imported into, the schema, including the ##local namespace. Treat occurrences of wildcards similarly to other positions, inserting them in the First and FollowLast lists of their namespace(s). A wildcard with namespace ##any is included in the First and FollowLast list of each namespace. The determinism constraints are evaluated for each namespace separately, applying straightforward extensions of constraints DC1, . . . , DC4 of Figure 3 (on page 17). That is, as an extension to constraints DC1, DC2 and DC3, report a UPA violation also if either of the lists being examined contains a wildcard and the other one is not empty. Constraint DC4, which is used for testing the determinism of iterations $G^{m..n}$, should be extended with the following condition over the First and FollowLast list of each namespace:

$x \in \mathit{FollowLast}(G) \land y \in \mathit{First}(G) \land \bigl((x)^{\natural} = \mathit{any} \lor (y)^{\natural} = \mathit{any}\bigr) \Rightarrow x = y$

It is reasonable to assume that the number of namespaces in a schema is bounded by a constant. With this assumption, the above extensions do not increase the asymptotically linear complexity of UPA checking. There will be more lists to merge and to examine at each subexpression, but symbol occurrences will be distributed among them disjointly according to the namespace of their element name. Thus the total size of First and FollowLast lists at each node of the expression tree remains in O(|Σ|), as before.
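The conflict rule for wildcards described above can be sketched as a check between two lists of competing positions. This is an illustration of the rule only, not of the linear-time list merging of Section 3: it uses plain set operations, and the representation of positions as (namespace, name) pairs, the `WILDCARD` marker, and the function names are all assumptions made for this example.

```python
ANY = "##any"    # namespace attribute value denoting every namespace
WILDCARD = None  # hypothetical marker used as the name of an `any` particle


def per_namespace(positions, namespaces):
    """Distribute (namespace, name) positions into one list per namespace."""
    lists = {ns: [] for ns in namespaces}
    for ns, name in positions:
        # A wildcard with namespace ##any goes into the list of every namespace.
        targets = namespaces if ns == ANY else [ns]
        for target in targets:
            lists[target].append(name)
    return lists


def lists_conflict(left, right, namespaces):
    """Return True if two lists of competing positions violate determinism."""
    left_lists = per_namespace(left, namespaces)
    right_lists = per_namespace(right, namespaces)
    for ns in namespaces:
        a, b = left_lists[ns], right_lists[ns]
        names_a = {n for n in a if n is not WILDCARD}
        names_b = {n for n in b if n is not WILDCARD}
        if names_a & names_b:
            return True  # a common element name in the same namespace
        if (WILDCARD in a and b) or (WILDCARD in b and a):
            return True  # a wildcard competes with any position of its namespace
    return False
```

With namespaces `["urn:x", "urn:y", "##local"]` (example values), a wildcard restricted to `urn:x` does not conflict with element `a` of `urn:y`, whereas a wildcard with namespace `##any` does.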

6. Discussion

We now discuss the theoretical and practical relevance of this work, and some closely related theoretical and practical considerations.

Brüggemann-Klein has shown that determinism can be tested in linear time both for traditional regular expressions (1993a) and for regular expressions extended with the SGML ’&’ operator (1993b). Her linear-time algorithms are based on first transforming the expression into so-called star normal form, which ensures that any unions performed for computing follow sets are disjoint. This transformation is not directly applicable to expressions with numeric occurrence indicators, though. The star-normal-form transformation eliminates directly nested iterations like $(a^*)^*$, which is replaced by the equivalent form $a^*$. On the other hand, replacing nested numeric iterations by a single one generally changes the meaning of the expression. For example, the expression $(a^{4..5})^{1..3}$ describes the language $\{a^4, a^5\} \cup \{a^8, a^9, a^{10}\} \cup \{a^{12}, a^{13}, a^{14}, a^{15}\}$, which cannot be described by a single iteration. In general, elimination of nested numeric iterations may lengthen expressions by an exponential factor (Kilpeläinen and Tuhkanen, 2003).

The linear-time method presented in the current paper is applicable to standard regular expressions, too, by treating the Kleene star $F^*$ as the corresponding numeric iteration $F^{0..\infty}$. In this sense, the new algorithm is an extension of the previously published method of Brüggemann-Klein (1993a), and shows that the star-normal-form transformation is not necessary for checking the determinism of standard regular expressions in linear time.

The determinism constraint of XML Schema has been criticized as an unnecessary and harmful SGML legacy feature. For example, Mani (2001) has suggested that the determinism constraint be left out of XML schema languages.
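Returning to the nested-iteration example $(a^{4..5})^{1..3}$ given earlier in this section, the claim that it cannot be collapsed into a single iteration is easy to verify by enumerating word lengths. The sketch below handles finite occurrence bounds over a one-letter alphabet only; the function name is hypothetical.

```python
def iteration_lengths(inner_lengths, lo, hi):
    """Word lengths of F^{lo..hi}, given the set of word lengths of F.

    Only finite bounds are supported in this sketch.
    """
    result = {0} if lo == 0 else set()
    totals = {0}  # lengths obtainable with the repetitions taken so far
    for count in range(1, hi + 1):
        totals = {t + length for t in totals for length in inner_lengths}
        if count >= lo:
            result |= totals
    return result


# (a^{4..5})^{1..3}: first the inner iteration, then the outer one.
inner = iteration_lengths({1}, 4, 5)
nested = iteration_lengths(inner, 1, 3)
```

Here `nested` contains 4, 5, 8, 9, 10, and 12 to 15, whereas the single iteration `iteration_lengths({1}, 4, 15)` also contains 6, 7 and 11, so the two expressions describe different languages.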
Gelade, Gyssens, and Martens have shown how strongly deterministic expressions can be translated in polynomial time to deterministic counter automata. These automata, originally outlined by Sperberg-McQueen (2004), can be used to validate the contents of XML elements in low-order polynomial time (Gelade et al., 2009a). Based on these results, the authors raise the question whether the (weak) determinism constraint of XML Schema should be replaced by strong determinism. Such considerations and suggestions are surely worthwhile for the future development of XML Schema and of alternative XML schema languages. On the other hand, the practical goal of the current paper is of a more short-term nature, that is, to help implementers deal with the features of XML Schema as it is currently specified.

Bex et al. (2009) do not consider the UPA constraint a sensible one either, but since it is enforced by the XML Schema specification, they present solutions for helping to create regular expressions which satisfy the determinism constraint. As a negative complexity result they prove that deciding whether a (nondeterministic) regular expression is equivalent to some deterministic one is PSPACE-hard. As constructive results they present and experimentally evaluate algorithms for translating nondeterministic regular expressions into deterministic regular expressions that would be close generalizations of the original

one. Bex et al. (2009) do not consider numeric occurrence indicators, which were our main concern in the current paper. A method for deciding whether a given regular language can be represented by a deterministic expression with numeric occurrence indicators is mentioned as a major open problem.

XML Schema implementations may try to circumvent the problem of pathological content models by introducing some implementation limits on them. Such external constraints are somewhat problematic, though. It is difficult to pose restrictions that would be strict enough to reject all malicious cases, and yet liberal enough to accept all benevolent cases. For this reason we prefer robust and scalable algorithms as implementations of specified technologies, instead of restrictions added as an afterthought.

JAXP, the Java API for XML Processing, provides a SchemaFactory class for initiating validators for given XML schemas. The SchemaFactory class specifies a method for enabling a feature called FEATURE_SECURE_PROCESSING. The purpose of this feature is to “limit [. . . ] XML Schema constructs that would consume large amounts of resources” (Sun Microsystems, 2006a). In Apache Xerces the effect of this setting includes restricting the values of maxOccurs attributes to be at most 5000, which sounds reasonable. Bounding the occurrence indicators does not alone solve the problem caused by the expansion of numeric iterations, though. Deeply nested iterations, like $(\cdots(a^{2..2})\cdots)^{2..2}$, remain as unrecognized pathological cases. When we gave Xerces a schema with a single content model like the above, with 15 nested iterations, in the test environment described in Section 4, the processing took 21 minutes of total time. The time of UPA checking was in this case a negligible fraction of this, only about 4 ms. With 16 nested iterations Xerces threw an OutOfMemoryError. The size of both these schemas is less than 50 lines, or only about 2 KB.
XML Schema is used in Web services as the default schema language for specifying the forms of messages that are accepted and returned by distributed services whose interfaces have been specified using WSDL (Christensen et al., 2001). In practice, the potential risks caused by pathological XML content models are not as severe for Web services as they might appear, though, because real systems rarely load schemas from unreliable sources. Rather, communication with and between Web services is normally based on agreed-upon schemas, say, as negotiated by an industry consortium for a business domain.

A UDDI (Universal Description, Discovery, and Integration; see, e.g., Alonso et al. (2004, Sec. 6.4) or Papazoglou (2008, Ch. 6)) service registry can be used by applications to locate Web services, and to bind to them by loading interface specifications from the service provider’s server. Currently applied Web service invocation frameworks are predominantly used through client-side stubs that have been precompiled at design time (Leitner et al., 2009). In addition to this, the UDDI architecture also allows and supports dynamic binding, which might be useful, for example, in situations where the client application finds out that a service has become unavailable, while alternative implementations of the same

high-level service interface can be located through a UDDI registry (Papazoglou, 2008, Sec. 6.3). Indeed, such dynamic run-time binding has been postulated by Leitner et al. (2009) to be necessary for the realization of loosely coupled, dynamically configured systems which are needed for implementing service-oriented architectures.

Dynamic binding makes Web services more vulnerable to attacks known as XML schema poisoning (MITRE, 2009), where an attacker alters the contents of a schema for the purpose of undermining security. As we have seen above and in the test cases like $\mathit{Occ}_n$ and $\mathit{Nest}_n$ of Section 4, even small pathological schema files can be problematic for current implementations. Scenarios of whether and how XML schema poisoning attacks could actually be carried out on real systems depend on particular solutions used in those systems, and are beyond the scope of this article.

Preventing malicious attacks is not the only reason for applying resource-conscious and efficient methods. Design applications, where the processing of schemas happens under the control of a human user, are less prone to problems caused by excessive resource usage than systems based on fully automatic processing. Nevertheless, utilizing robust solutions that avoid risks of excessive resource usage is reasonable in human-controlled applications, too. For example, schema development tools would also benefit from methods that will not fail if the user happens to create pathological schemas inadvertently.

UPA checking is only a sub-problem of implementing XML Schema validators; efficient UPA checking is of little help if other processing requires exponential amounts of resources. Currently this is the case, for example, with the method that Apache Xerces uses to translate content models into automata.
The initial ideas of Kilpeläinen and Tuhkanen (2004) about implementing deterministic content models as linear-sized automata, which can validate element contents in low-order polynomial time, are an important further motivation for efficient UPA checking. Discussion of efficient implementation of XML Schema content models would require a more thorough treatment and is beyond the scope of this article.

7. Conclusion

We have concentrated on the problem of checking the determinism of XML Schema content models, which is a property of document schemas required by the W3C XML Schema Recommendation. Currently applied solutions, which require exponential time in the worst case, were argued to be problematic, for example as potential sources of denial-of-service attacks. XML Schema content models were studied in the simplified model of regular expressions with numeric occurrence indicators. A previously published polynomial-time solution for checking the determinism of such expressions was refined in Section 3 to work in linear time with expressions over a fixed alphabet. The algorithm was implemented, and results of experiments were presented in Section 4. The results confirmed the predicted linear scalability of the new algorithm, and showed it to be more efficient than its predecessor and orders of magnitude more efficient than the corresponding UPA checking method of Apache Xerces. Extensions to the algorithm for taking all groups, substitution groups, and element wildcards of XML Schema into account were discussed in Section 5. The relevance of the results was discussed in Section 6.

The new algorithm is an optimal-time solution for checking the determinism of regular expressions with numeric occurrence indicators, and thus extends the corresponding previous result for traditional regular expressions. On the practical side, the result is, hopefully, a useful step towards robust, polynomial-time XML Schema processing.

Determinism checking was shown in Section 3 to be computable for expressions over an unlimited alphabet in time $O(n^2 / \log n)$, where $n$ is the length of a binary representation of the expression. It would be of theoretical interest to consider whether this bound can be lowered. In the light of Section 4, the practical relevance of this question does not seem very high, though.

References

Aho, A., 1994. Algorithms for finding patterns in strings. In: van Leeuwen, J. (Ed.), Handbook of Theoretical Computer Science – Volume A: Algorithms and Complexity. Elsevier/MIT Press, Ch. 5, pp. 255–300.
Alonso, G., Casati, F., Kuno, H., Machiraju, V., 2004. Web Services—Concepts, Architectures and Applications. Springer-Verlag.
ANSI, 2005. HL7 Clinical Document Architecture, Release 2.0. ANSI, New York, NY, USA.
Apache Software Foundation, 1999–2005. Welcome to Xerces. Xerces homepage at http://xerces.apache.org.
Apache Software Foundation, 2007. Xerces2 Java Parser Readme. Apache Software Foundation, http://xerces.apache.org/xerces2-j/.
BEA, October 2003. Streaming API for XML JSR-173 Specification. BEA Systems, Inc, http://ftpna2.bea.com/pub/downloads/jsr173_1.0.pdf.
Bex, G. J., Gelade, W., Martens, W., Neven, F., June-July 2009. Simplifying XML Schema: Effortless handling of nondeterministic regular expressions. In: SIGMOD’09. ACM, pp. 731–743.
Bex, G. J., Neven, F., Van den Bussche, J., 2004. DTDs versus XML Schema: A practical study. In: Proceedings of the 7th International Workshop on the Web and Databases. ACM, pp. 79–84.
Boag, S., Chamberlin, D., Fernández, M., Florescu, D., Robie, J., Siméon, J. (Eds.), January 2007. XQuery 1.0: An XML Query Language. W3C Recommendation, available at http://www.w3.org/TR/xquery.
Bray, T., Hollander, D., Layman, A., Tobin, R. (Eds.), August 2006. Namespaces in XML 1.0 (Second Edition). W3C Recommendation.


Bray, T., Paoli, J., Sperberg-McQueen, C. M. (Eds.), February 1998. Extensible Markup Language (XML) 1.0. W3C Recommendation, the latest version is available at http://www.w3.org/TR/REC-xml.
Brown, M., Tarjan, R., April 1979. A fast merging algorithm. Journal of the ACM 26 (2), 211–226.
Brüggemann-Klein, A., 1993a. Regular expressions into finite automata. Theoretical Computer Science 120, 197–213.
Brüggemann-Klein, A., 1993b. Unambiguity of extended regular expressions in SGML document grammars. In: Lengauer, T. (Ed.), Algorithms — ESA 93. Springer-Verlag, pp. 73–84.
Brüggemann-Klein, A., Wood, D., 1992. Deterministic regular languages. In: Proceedings of the 9th Annual Symposium on Theoretical Aspects of Computer Science. Springer-Verlag, pp. 173–184.
Brüggemann-Klein, A., Wood, D., 1998. One-unambiguous regular languages. Information and Computation 142, 182–206.
Christensen, E., Curbera, F., Meredith, G., Weerawarana, S., 2001. Web Services Description Language (WSDL) 1.1. W3C Note 15 March 2001. Available at http://www.w3.org/TR/wsdl.
Clark, J. (Ed.), November 1999. XSL Transformations (XSLT) Version 1.0. W3C Recommendation.
Clark, J., Murata, M. (Eds.), December 2001. RELAX NG Specification. OASIS, available at http://www.oasis-open.org/committees/relax-ng/spec-20011203.html.
ebXML, 2002. Electronic business XML (ebXML) home page. http://www.ebxml.org.

Fallside, D., Walmsley, P. (Eds.), October 2004. XML Schema Part 0: Primer Second Edition. W3C Recommendation.
Fuchs, M., Brown, A., August 2003. Supporting UPA on an extension of XML Schema. In: Extreme Markup Languages 2003. IDEAlliance, Montréal, Québec, available at http://www.mulberrytech.com/Extreme/Proceedings/.
Gelade, W., Gyssens, M., Martens, W., August 2009a. Regular expressions with counting: Weak versus strong determinism. In: 34th International Symposium on Mathematical Foundations of Computer Science (MFCS 2009). Springer-Verlag, pp. 369–381, full version available at http://lrb.cs.uni-dortmund.de/~martens/pubs-year.html


Gelade, W., Martens, W., Neven, F., 2007. Optimizing schema languages for XML: Numerical constraints and interleaving. In: International Conference on Database Theory (ICDT 2007). Springer-Verlag, pp. 269–283.
Gelade, W., Martens, W., Neven, F., 2009b. Optimizing schema languages for XML: Numerical constraints and interleaving. SIAM Journal on Computing 38 (5), 2021–2043.
Glushkov, V., 1961. The abstract theory of automata. Russian Mathematical Surveys 16, 1–53.
Goldfarb, C. F., 1990. The SGML Handbook. Clarendon Press, Oxford.
Hagerup, T., February 1998. Sorting and searching on the word RAM. In: Proceedings of STACS 98, 15th Annual Symposium on Theoretical Aspects of Computer Science. Springer-Verlag, pp. 366–398.
Hazel, P., 2003. Perl Compatible Regular Expressions. University of Cambridge, http://www.pcre.org/.
IEEE, 2001. IEEE Std 1003.1-2001 Standard for Information Technology — Portable Operating System Interface (POSIX) Base Definitions, Issue 6. IEEE, New York, NY, USA.
ISO, October 1986. ISO 8879: Information Processing—Text and Office Systems—Standard Generalized Markup Language (SGML). ISO, Geneva, Switzerland.
ISO, April 2008. ISO/IEC 29500-1, Information technology – Document description and processing languages – Office Open XML File Formats – Part 1: Fundamentals and Markup Language Reference. ISO, Geneva, Switzerland.
Kilpeläinen, P., Tuhkanen, R., 2003. Regular expressions with numerical occurrence indicators—preliminary results. In: Proc. of the Eighth Symposium on Programming Languages and Software Tools. University of Kuopio, Department of Computer Science, pp. 163–173.
Kilpeläinen, P., Tuhkanen, R., 2004. Towards efficient implementation of XML Schema content models. In: Proc. of the 2004 ACM Symposium on Document Engineering. ACM Press, pp. 239–241.
Kilpeläinen, P., Tuhkanen, R., 2007. One-unambiguity of regular expressions with numeric occurrence indicators. Information and Computation 205, 890–916.
Kleene, S., 1956. Representation of events in nerve nets and finite automata. In: Shannon, C., McCarthy, J. (Eds.), Automata Studies. Princeton University Press, Princeton, New Jersey, pp. 3–42.
Leitner, P., Rosenberg, F., Dustdar, S., May-June 2009. Daios: Efficient dynamic Web service invocation. IEEE Internet Computing 13 (3), 72–80.

Mani, M., August 2001. Keeping chess alive: Do we need 1-unambiguous content models? Talk given at Extreme Markup Languages 2001 in Montréal. Slides are available at http://web.cs.wpi.edu/~mmani/publications/extreme2001chess.ppt
Martens, W., Neven, F., Schwentick, T., Bex, G. J., September 2006. Expressiveness and complexity of XML Schema. ACM Transactions on Database Systems 31 (3), 770–813.
McNaughton, R., Yamada, H., March 1960. Regular expressions and state graphs for automata. IRE Transactions on Electronic Computers 9 (1), 39–47.
Meyer, A., Fischer, M., October 1971. Economy of description by automata, grammars, and formal systems. In: 12th Annual Symposium on Switching and Automata Theory. IEEE, pp. 188–191.
MITRE, September 2009. CAPEC-146: XML Schema Poisoning. MITRE Corporation, in Common Attack Pattern Enumeration and Classification, Release 1.4, http://capec.mitre.org/.
Murata, M., Lee, D., Mani, M., Kawaguchi, K., November 2005. Taxonomy of XML schema languages using formal language theory. ACM Transactions on Internet Technology 5 (4), 660–704.
Navarro, G., Raffinot, M., 2002. Flexible Pattern Matching in Strings: Practical on-line search algorithms for texts and biological sequences. Cambridge University Press.
Papazoglou, M. P., 2008. Web Services: Principles and Technology. Pearson Education Limited.
Sippu, S., Soisalon-Soininen, E., 1988. Parsing Theory. Vol. I: Languages and Parsing. Springer-Verlag.
Sperberg-McQueen, C. M., May 2004. Notes on finite state automata with counters, http://www.w3.org/XML/2004/05/msm-cfa.html.
Sperberg-McQueen, C. M., August 2005. Applications of Brzozowski derivatives to XML Schema processing. In: Extreme Markup Languages 2005. IDEAlliance, Montréal, Québec, available at http://www.mulberrytech.com/Extreme/Proceedings/.
Sun Microsystems, August 2002. RFE: Tail call optimization. Bug # 4726340 in Sun Developer Network Bug Database, available at http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4726340.
Sun Microsystems, June 2005. The Java Web Services Tutorial For Java Web Services Developer’s Pack, v1.6. Sun Microsystems, http://www.j2ee.me/webservices/docs/1.6/tutorial/doc/JavaWSTutorialFront.html.


Sun Microsystems, 2006a. Java™ Platform, Standard Edition 6 API Specification. Sun Microsystems, http://java.sun.com/javase/6/docs/api.
Sun Microsystems, 2006b. Java™ Platform, Standard Edition 6 Overview. Sun Microsystems, http://java.sun.com/javase/6/docs/technotes/guides/index.html.
Thompson, H., Beech, D., Maloney, M., Mendelsohn, N. (Eds.), May 2001. XML Schema Part 1: Structures. W3C Recommendation.
Thompson, H., Beech, D., Maloney, M., Mendelsohn, N. (Eds.), October 2004. XML Schema Part 1: Structures Second Edition. W3C Recommendation.
W3C, 2000. XHTML™ 1.0 The Extensible HyperText Markup Language (Second Edition). W3C Recommendation, revised August 2002.
Wall, L., Schwartz, R., 1991. Programming perl. O’Reilly & Associates, Inc., Sebastopol, CA.
Walmsley, P., 2002. Definitive XML Schema. Prentice Hall PTR.
