A high-level language for specifying XML data transformations Tadeusz Pankowski Institute of Control and Information Engineering, Poznan University of Technology, Poland,
[email protected]
Abstract. We propose a descriptive high-level language XDTrans devoted to specify transformations over XML data. The language is based on unranked tree automata approach. In contrast to W3C’s XQuery or XSLT which require programming skills, our approach uses high-level abstractions reflecting intuitive understanding of tree oriented nature of XML data. XDTrans specifies transformations by means of rules which involve XPath expressions, node variables and non-terminal symbols denoting fragments of a constructed result. We propose syntax and semantics for the language as well as algorithms translating a class of transformations into XSLT.
1
Introduction
Transformation of XML data becomes increasingly important along with development of web-oriented applications (e.g. Web Services, e-commerce, information retrieval, dissemination and integration on the Web), where data structures of one application must be transformed into a form accepted by another one. A transformation of XML data can be carried out by means of W3C’s languages XSLT [1] or XQuery [2]. However, every time when an XML document should be transformed into some other form, a new XSLT (or XQuery) program must be written, which requires programming skills. Thus, the operational nature of XSLT and XQuery makes them less desirable candidates for high-level transformation specification [3]. To avoid programming, other transformation languages have been proposed [3–5]. In this paper we propose a new language called XDTrans that is devoted to specify transformations for XML data. Some preliminary ideas of the approach underlying XDTrans were presented in [6] and [7]. We discuss both syntax and semantics of the language as well as some representative examples illustrating using of it. We assume that the user perceives XML documents as data trees according to DOM model [8], and is familiar with syntax and semantics of XPath expressions [9]. Main advantages and novelties of the approach are as follows: – XDTrans expressions are high-level transformation rules reflecting intuitive understanding how an output data tree is constructed from input data trees, and involve XPath expressions, node variables and some non-terminal symbols (concepts) denoting subtrees in a constructed data tree;
– expressive power of XDTrans corresponds to structural recursion [10] and a fragment of top-down XSLT (without functions), however, in XDTrans we can join arbitrary number of different documents (not supported by XSLT [11]); – a transformation of a single XML document can be translated into XSLT program – we propose algorithms for such translations; – in contrast to XSLT, which can be used for transforming documents having only a standard form, XDTrans semantic functions could be applied to transform non-standard representations of XML documents (e.g. in relational databases [6],[12],[13]). The structure of the paper is as follows. In Section 2 we propose XDTrans as a language for high-level transformation specification. We formulate its syntax and semantics, and discuss some examples illustrating its use. In Section 3, algorithms translating a class of XDTrans programs into XSLT are proposed. Section 4 concludes the paper.
2 2.1
High-level transformation specification – XDTrans Syntax of XDTrans
According to W3C standard, any XML document can be represented as a data tree [8], where a node conforms to one of the seven node types: root, element, attribute, text, namespace, processing instruction, and comment. In this paper, we restrict our attention to four first of them. Every node has a unique node identifier (nid). A data tree can be formalized as follows [7]: Definition 1. A data tree is an expression defined by the syntax: data tree ::= nid(tree), tree ::= e-tree | a-tree | t-tree, e-tree ::= (tree, ..., tree), (element tree), a-tree ::= (s), (attribute tree), t-tree ::= nid(s), (text tree), where nid, e, a, and s are from, respectively, a set N of node identifiers, a set ΣE of element labels, a set ΣA of attribute labels, and a set S of string values. By DΣ,S (N ), where Σ = ΣE ∪ ΣA , will be denoted a set of all data trees over Σ and S with node identifiers from N . ¤ Further on we assume that C is a set of non-terminal symbols called concepts, and P is a set of XPath expressions in which some variables can appear. The goal of transformation is to convert a set of input data trees into an expected single output data tree. A transformation can be specified by a set of transformation rules (or rules for short). Every rule determines a type of expected final or intermediate result data tree in a form of a terminal or non-terminal tree expression.
Definition 2. A tree expression over Σ, C, S and P conforms to the following syntax: τ ::= s | E | a(s) | a(E) | C(E) | e(τ, ..., τ ), where: s ∈ S, E ∈ P, a ∈ ΣA , C ∈ C, e ∈ ΣE . The set of all tree expressions will be denoted by TΣ,S (C, P). ¤ Definition 3. A transformation specification language is a system XDT rans = (Σ, C, S, START, P, R), where START ∈ C is the initial concept, and R is a finite set of rules of the following two forms: (C, E) → τ, (C, ($v1 : E1 , ..., $vp : Ep )) → τ , where C ∈ C, any E (possibly with subscripts) is from P, $v (possibly with subscripts) is a node variable, τ ∈ TΣ,S (C, P), and every node variable (if any) occurring in the body occurs also in the head of the rule. ¤ The head of a rule includes a concept C which will be rewritten by the body of the rule. A rule with concept C in the head defines this concept. We assume that any concept in a given set of rules must be defined, and that every concept has exactly one definition. So, our system is deterministic. Recursive definitions for concepts are allowed (e.g. SUB PART in Example 3(1)). There must be exactly one rule, the initialization rule, defining the initial concept START. In order to refer to the root of a document we use ”/” (if the document is understood) or a root node identifying the document (e.g. function document(”URI”) returning the root node). In Fig. 1 there is an example of XML document suppliers, its DTD and data tree. The document parts in Fig. 2 is a transformed form of (a selection of) the document from Fig. 1 (see Example 1). p1 p2 p3 p1 ]>
0 suppliers 1 supplier
supplier
2
10
@sname 3 s1
part 4
part 6
part
@sname
8
11 s2
5
7
p1
p2
part 12
9 p3
Fig. 1. XML document suppliers, its DTD and date tree
13 p1
]>
50 parts 51 part
part
part
52
55
58
@pname
@sname @pname
@sname @pname
@sname
53
54
56
57
59
60
p1
s1
p2
s1
s1
s2
Fig. 2. XML document parts, its DTD and date tree
The following example illustrates how XDTrans can be used to transform suppliers into parts: Example 1. Transform suppliers into parts, where each part has as attributes name of the part and name of its supplier, include only two first parts of each supplier. (1) transformation specification without variables: (START, /suppliers) → parts(PART(.)) (PART, supplier/part[position() ≤ 2]) → part(PNAME(.), SNAME(.)) (PNAME,.) → @pname(text()) (SNAME,.) → ../@sname (2) transformation specification with variables: (START, /suppliers) → parts(PART(.)) (PART,($v:supplier, $p:$v/part[position()≤2], $s:$v/@sname)) → part(PNAME($p), $s) (PNAME,.) → @pname(text()) ¤ 2.2
Semantics for XDTrans
Every XDTrans rule is evaluated in an evaluation context (or context for short) like XPath expressions are [9], [14]. We will write (C, E)hS, ni to denote that the head of a rule is evaluated in a context hS, ni, where S is an ordered set of distinct nodes (a context set), and n is a context node, n ∈ S (hi denotes the empty context). A context is used to evaluate XPath expression(s) included in the head of the rule. Evaluation of the head produces a new context set (the output context set). Every output context determined by the output context set is then used to process the body of the rule, where recursively the same or other rules can be invoked.
A rule is formulated against an input data tree(s) and the result of the rule is an output data tree (possibly non-terminal). An XPath expression (or a sequence of XPath expressions labeled with variables) occurring in the head of the rule, determines a number of (sub)trees instantiated from the tree expression being the body of the rule. All trees developed from the tree expression have a common structure but each of them has unique nodes (node identifiers). If each leaf of such a tree does not contain any concept then the tree is a terminal tree otherwise it is a non-terminal tree. Now we define how a rule transforms input date tree(s) into an output data tree. By NE hS, ni we denote an ordered set of nodes obtained by evaluating E in a context hS, ni. For example, for the data tree from Fig. 1, we have: Nsupplier/part h(1), 1i = (4, 6, 8, 12). We will use a Skolem function new() that for every invocation returns a unique node identifier. Now, we will define a semantics for heads of XDTrans rules and next a semantics for bodies of the rules. Evaluation of (C, E) → τ in the context hi and hS, ni 1. First, the head of the rule must be evaluated in a given context: – the result of the evaluation is an ordered set S1 = NE hS, ni = (n1 , ..., nm ), m ≥ 1, referred to as an output context set, – for any output context hS1 , ni i, 1 ≤ i ≤ m, the body τ of the rule must be processed; – it follows from the restrictions on XML documents that for an initialization rule its head qualification E must return a singleton, i.e. S1 = NE hi = (n1 ), i.e. m = 1. 2. In the next step, τ should be evaluated in m contexts, i.e. the expression τ (hS1 , n1 i, ..., hS1 , nm i), m ≥ 1 must be processed: H1. For an initialization rule we start from the expression (START, E)hi and rewrite it by the expression new()(τ h(n1 ), n1 i), where (n1 ) is the result of evaluation E in the empty context, Fig. 3.
new() START¢²
== > τ¢(n1 ),n1 ²
Fig. 3. Rewriting specified by a transformation rule (START, E) → τ in the empty context.
H2. For a non initialization rule (C, E) → τ and a context hS, ni, where NE hS, ni = S1 = (n1 , ..., nm ), the expression x((C, E)hS, ni) is rewritten by x(τ hS1 , n1 i, ..., τ hS1 , nm i), Fig. 4.
x
x == > .. .
C¢S,n²
τ¢S1,n1 ²
τ¢S1 ,nm ²
Fig. 4. Rewriting specified by a rule (C, E) → τ in a context hS, ni.
Evaluation of (C, ($v1 : E1 , ..., $vp : Ep )) → τ in the context hi and hS, ni 1. The result of processing an expression ($v1 : E1 , ..., $vp : Ep ) in a context hS, ni (or in hi) is an ordered set Ω = (ω1 , ..., ωm ), m ≥ 1, of distinct valuations of variables. Every valuation ω ∈ Ω assigns a node of the input data tree to $vi such that ω($vi ) ∈ NEi hS, ni (or ω($vi ) ∈ NEi hi), for every i, 1 ≤ i ≤ p. For an initialization rule, Ω must have at most one element. H3. An initialization rule (START, ($v1 : E1 , ..., $vp : Ep ))hi is rewritten by new()(τ h(ω1 ), ω1 i), where ω1 is the single valuation satisfying the expression ($v1 : E1 , ..., $vp : Ep ) in the empty context, Fig. 5.
new() START¢²
== > τ¢(ω1),ω1²
Fig. 5. Rewriting specified by a rule (START, ($v1 : E1 , ..., $vp : Ep )) → τ in the empty context.
H4. For a non initialization rule (C, ($v1 : E1 , ..., $vp : Ep )) → τ and a context hS, ni, where Ω = (ω1 , ..., ωm ), m ≥ 1, x(C, ($v1 : E1 , ..., $vp : Ep ))hS, ni) is converted into x(τ hΩ, ω1 i, ..., τ hΩ, ωm i), Fig.6.
x
x == > .. .
C¢S,n²
τ¢Ω,ω1²
τ¢Ω,ωm ²
Fig. 6. Rewriting specified by a rule (C, ($v1 : E1 , ..., $vp : Ep )) → τ in a context hS, ni.
2. Let E[$v]hΩ, ω 0 i denote that an expression E with the only variable $v is to be evaluated in a context hΩ, ω 0 i (see e.g. Fig. 5 and Fig. 6.). In this case the context is given by means of valuations, so it should be resolved first. The resolution is achieved by replacing every valuation with its value for the variable $v. Thus the resolved version of the expression is: E[$v]hS, ni, where S = {ω($v) | ω ∈ Ω}, and n = ω 0 ($v). Evaluation of the body of a rule The meaning of transformations specified by expressions of the form τ hS, ni is illustrated in Fig. 7. If the result does not depend on a whole context we use the notation h., .i.
(1)
(2) x
x
x
x
== >
== > new() s
s¢.,.²
E¢S,n²
(3)
... copy(n1)
copy(nm )
(4) x
x
x == >
a(s)¢.,.²
x == >
a new() s
a(E)¢S,n²
a new() value(n1 ) x
(5)
(6) x
== >
== > C(E)¢S,n²
e new()
x
x
...
.. . C¢S1,n1²
C¢S1,nm ²
e(τ1,...,τq)¢S,n²
τ1¢S,n²
τq¢S,n²
Fig. 7. Graphical interpretation of rewritings specified by bodies of rules
For the all possible tree expressions forming the body of a rule (according to Definition 2), we have: B1. The body of the form s creates a text node with the string value s. Rewriting imposed by the expression is independent of a context, Fig. 7(1): x(shS, ni) ⇒ x(new()(s)). B2. For an expression of the form E in a context hS, ni, where NE hS, ni = (n1 , ..., nm ), subtrees from the input tree(s) identified by ni , 1 ≤ i ≤ m, are copied into the output tree. The expression copy(ni ) recursively copies a source subtree denoted by ni and creates a new node identifier for any copy of a source node, Fig. 7(2): x(EhS, ni) ⇒ x(copy(n1 ), ..., copy(nm )).
B3. An expression of the form a(s) creates an attribute node labeled a with the string value s. Rewriting defined by the expression is independent of a context, Fig. 7(3): x(a(s)hS, ni) ⇒ x(< a, new() > (s)). B4. For an expression of the form a(E) in a context hS, ni, where NE hS, ni = (n1 ), an attribute node labeled a with the string value equal to value(n1 ) is created, Fig. 7(4): x(a(E)hS, ni) ⇒ x(< a, new() > (value(n1 ))). B5. If S1 = NE hS, ni = (n1 , ..., nm ), then an expression x(C(E)hS, ni) is replaced by the expression x(ChS1 , n1 i, ..., ChS1 , nm i), where ChS1 , ni i denotes an invocation of the rule identified by C in a context hS1 , ni i, 1 ≤ i ≤ m, Fig. 7(5): x(C(E)hS, ni) ⇒ x(ChS1 , n1 i, ..., ChS1 , nm i). B6. For an expression e(τ1 , ..., τq ) in a context hS, ni a new element node labeled e is created with q subtrees, where i-th subtree will be developed by evaluating expressions τi in the context hS, ni, 1 ≤ i ≤ q, Fig. 7(6): x(e(τ1 , ..., τq )hS, ni) ⇒ x(< e, new() > (τ1 hS, ni, ..., τq hS, ni). Transformations from Example 1 are carried out as follows: Transformation (1) 1. Evaluation of the initialization rule in the empty context: – Rule: (START, /suppliers) → parts(PART(.)) – Value of the head qualifier in the evaluation context: N/suppliers hi = (1) – Rewritings: H1 STARThi ⇒ 50(parts(PART(.)h(1), 1i) B6
parts(PART(.)h(1), 1i) ⇒ (PART(.)h(1), 1i) B5
PART(.)h(1), 1i ⇒ PARTh(1), 1i 2. Evaluation of the rule defining PART in context h(1), 1i: – Rule: (PART, supplier/part[position() ≤ 2]) → part(PNAME(.), SNAME(.)) – Value of the head qualifier in the evaluation context: Nsupplier/part[position()≤2] h(1), 1i = (4, 6, 12) – Rewritings: H2 PARTh(1), 1i ⇒ (part(PNAME(.),SNAME(.))h(4, 6, 12), 4i, part(PNAME(.),SNAME(.))h(4, 6, 12), 6i, part(PNAME(.),SNAME(.))h(4, 6, 12), 12i) B6
part(PNAME(.),SNAME(.))h(4, 6, 12), 6i ⇒ B6 ⇒ (PNAME(.)h(4, 6, 12), 4i,SNAME(.)h(4, 6, 12), 4i) ... B5 PNAME(.)h(4, 6, 12), 4i ⇒ PNAMEh(4), 4i, since N. h(4, 6, 12), 4i = (4) B5 SNAME(.)h(4, 6, 12), 4i ⇒ SNAMEh(4), 4i ...
3. Evaluation of the rule defining PNAME in context h(4), 4i: – Rule: (PNAME, .) → @pname(text()) – Value of the head qualifier in the evaluation context: N. h(4), 4i = (4) – Rewritings: H2 PNAMEh(4), 4i ⇒ @pname(text())h(4), 4i B4
@pname(text())h(4), 4i ⇒ (value(5)), since Ntext() h(4), 4i = 5 4. Evaluation of the rule defining SNAME in context h(4), 4i: – Rule: (SNAME, .) → ../@sname – Value of the head qualifier in the evaluation context: N. h(4), 4i = (4) – Rewritings: H2 SNAMEh(4), 4i ⇒ ../@snameh(4), 4i B2
../@snameh(4), 4i ⇒ copy(3), since N../@sname h(4), 4i = (3) 5. Analogously for the remaining cases. Transformation (2) 1. Evaluation of the initialization rule in the empty context - as for transformation (1). 2. Evaluation of the rule defining PART in context h(1), 1i: – Rule: (PART,($v:supplier, $p:$v/part[position()≤2], $s:$v/@sname)) → part(PNAME($p), $s) – Value of the head qualifier in the evaluation context: a sequence Ω = (ω1 , ω2 , ω3 ) of three valuations defined as follows: ω1 : (ω1 ($v) = 2, ω1 ($p) = 4, ω1 ($s) = 3), ω2 : (ω2 ($v) = 2, ω2 ($p) = 6, ω2 ($s) = 3), ω3 : (ω3 ($v) = 10, ω3 ($p) = 12, ω3 ($s) = 11). – Rewritings: H4 PARTh(1), 1i ⇒ (part(PNAME($p),$s))hΩ, ω1 i, part(PNAME($p),$s))hΩ, ω2 i, part(PNAME($p),$s))hΩ, ω3 i) B6
part(PNAME($p),$s))hΩ, ω1 i ⇒ B6 ⇒ (PNAME($p)hΩ, ω1 i,$shΩ, ω1 i) ... B5 PNAME($p)hΩ, ω1 i ⇒ PNAMEh(4), 4i, since N$p hΩ, ω1 i = (ω1 ($p)) = (4) ... B2 $shΩ, ω1 i ⇒ copy(3), since N$s hΩ, ω1 i = (ω1 ($s)) = (3) ... 3. Evaluation of the rule PNAME in the context h(4), 4i: in the same way as for the transformation (1). 4. Similarly for the remaining cases.
2.3
Examples
The transformation discussed in Example 1 can be formulated without using variables (specification 1) or with variables (specification 2). Now we will show examples of transformations, where it is necessary to use variables. In general, variables are necessary when the transformation requires a join condition comparing two or more XML values belonging to the same or to two different documents. Example 2. (Join) Join books and papers documents into a bib document. Input ”papers.xml”: ... Input ”books.xml”: ...
Output: ... ...
Transformation specification: (START,($p:document("papers.xml)/papers, $b:document("books.xml)/books) → bib($p, $b)
¤
Example 3. (Recursion) Convert a flat list of part elements, partlist, into a tree representation, parttree, based on partof attributes [15], and do the inverse transformation. Input DTD: ]>
Output DTD: ]>
(1) Transformation specification: (START,/partlist) → parttree(MAIN PART(.)) (MAIN PART,part[not(@partof)]) → part(@partid,@name,SUB PART(.)) (SUB PART,($v1 :@partid, $v2 :../part[@partof=$v1 ])) → part($v2 /@partid,$v2 /@name, SUB PART($v2 )) (2) Inverse transformation specification: (START,/parttree) → partlist(MAIN PART(.),SUB PART(.)) → part(@id,@name) (MAIN PART,part) (SUB PART,part//part) → part(@id,@name,@partof(../@id)) ¤ The enrich the expression power of transformation, we will use a filtering of a set of contexts. We explain it by the following example.
Example 4. (Context filtering) Let us assume that we want to set a filter on a contexts using the position of a context node. For example, the transformation (1) defined in Example 1 would have the form: (START, /suppliers) → parts(PART(.)) (PART, supplier/part[position() ≤ 2]) → part(PNAME(.[$currpos != 2])) (PNAME,.) → @pname(text()) Note that in PART rule we want to ignore the second context which is passed to the right-hand side of the rule. As we have seen, in our example we have three contexts determined by the context set (4,6,12) – so, the evaluation in the context h(4, 6, 12), 6i must be ignored. ¤ We assume that an XPath expression can use the following context variables: – $curr, denoting the current context node – the same as the dot (.); – $currpos, denoting the position of the context node in a context set; – $size, denoting a size (number of elements) of a context set.
3
Translation from XDTrans to XSLT
The following two algorithms transform a specification made in XDTrans into an XSLT program which carries out transformation on document instances. In general, there are many possible XSLT programs that can perform a given transformation. This method can be applied only to transformations over one document because joining of many documents is not supported by XSLT. Algorithm 1 defines translation for the head of a rule and uses Algorithm 2 to translate the bode of a rule. As the result, an XSLT program (stylesheet) is obtained. Algorithm 1 (generating XSLT templates). Input: a specification rule r ∈ R Output: XSLT template Λ(r) Λ(r) = case r of (START,E) → τ : ρ(τ ) (C, E) → τ :