Transformation of XML data using an unranked tree transducer

Tadeusz Pankowski
Institute of Control and Information Engineering, Poznan University of Technology, Poland

Abstract. Transformation of data documents is of special importance when XML is used as the universal data interchange format on the Web. Data transformation is used in many tasks that require data to be transferred between existing, independently created Web-oriented applications. To perform such a transformation one can use W3C's XSLT or XQuery. However, these languages are oriented toward detailed programming of transformation procedures. In this paper we show how data transformation can be specified by means of high-level rule specifications based on uniform unranked tree transducers. We show that our approach is both descriptive and expressive, and we illustrate how it can be used to specify and perform transformations of XML documents.

1 Introduction

In data exchange, data structured in one system must be restructured and translated into a different form conforming to the requirements of the other system. Such data transformation is used in many tasks that require data to be transferred between existing, independently created applications. The need for data transformation has become more important recently, as data exchange has expanded with the proliferation of Web-oriented applications such as Web services, Web collaboration, and e-commerce. Interoperability between such applications is usually achieved by exchanging data encoded in XML. Such documents carry important information to the participants of communication processes, so they must be interpreted correctly by both requesters and providers.

In the transformation process, the following three phases can be distinguished [1]:
(1) structure identification: in this phase we identify the applied data structuring mechanisms and discover the schemas of the input and output documents (in this paper we restrict ourselves to XML and DTD);
(2) transformation specification: in this phase we specify mappings by means of inter-schema correspondences capturing input and output constraints imposed on the documents;
(3) performing transformation: in this phase mapping specifications are translated into operations over the input document that produce an output document satisfying the constraints and structures of the output schema.

A transformation can be carried out by means of W3C's languages XSLT [2] or XQuery [3]. Typically, XSLT is used to add styling information to an XML source document by transforming it into another presentation-oriented format such as HTML. However, XSLT is also used to perform XML-to-XML transformations, expressing rules for transforming one or more source data trees into one or more result data trees. In constructing a result tree, nodes from the source trees can be filtered and reordered, and arbitrary structures can be added. Data transformation can also be done by means of any XML query language (e.g. XQuery). In such a case, mapping specifications are translated into appropriate XQuery queries over the input document. The result of the query is the expected output document, and it must satisfy the output schema. Both languages are functional and have powerful computational capabilities. However, their operational nature makes them less desirable candidates for high-level transformation specification [1].

In this paper we propose a method for the specification of XML-to-XML transformations based on tree transducers on unranked trees. The formal framework of the method is that of top-down uniform unranked tree transducers by Martens and Neven [4]. Originally, unranked tree transducers were proposed for typechecking problems for XML queries: statically verifying that every answer to a query conforms to a given output schema. The uniform tree transducers investigated in [4] deal with DTDs and are intended to define tree-to-tree transformations, where the input tree is a regular expression and transformation rules deal with individual symbols. Such a transformation is suitable for DTD transformations. However, in the class of applications that we are interested in, we want to transform XML documents, not just their DTDs.
The main contributions of this paper are the following:
– we generalize tree transducers to a specification formalism appropriate for specifying transformations of a wide class of XML documents; the specification involves XPath expressions;
– we show that transformation rules can be defined by a high-level rule specification formalism, and we show how transformation rules can be applied to the input data tree in a systematic and inductive way.

In Section 2 we discuss the problem of representing and processing XML data. The method for specifying transformation rules based on unranked tree transducers is proposed in Section 3. An example illustrating how to use the transformation is given in Section 4. Section 5 concludes the paper.

2 Exchange, representation and processing of XML data

An XML document is a textual representation of data and consists of hierarchically nested element structures starting with a root element. The basic component of an XML document is an element. An element consists of a start-tag, the corresponding end-tag and content, i.e., the structure between the tags. Attributes can be associated with elements. Because XML documents are tree-structured, a data model for representing them can use conventional terminology for trees. In the XQuery/XPath Data Model proposed by the W3C [5], an XML document is represented by an ordered node-labeled tree (data tree) which includes a concept of node identity. A node in a data tree conforms to one of seven node kinds: root (or document), element, attribute, text, namespace, processing instruction, and comment. Every node has at most one parent, which is either an element node or the root node. Root nodes and element nodes have sequences of child nodes. A tree contains a root plus all nodes that are reachable directly or indirectly from the root. Every node belongs to exactly one tree, and every tree has exactly one root node. In this paper, we restrict our considerations to four types of nodes: root, element, attribute and text nodes. A data tree considered in this paper can be formalized as follows:

Definition 1. Let Σ and OID be two alphabets of labels and (object) identifiers, respectively. Let Σ = Σ_A ∪ Σ_E, where Σ_A is a set of attribute labels and Σ_E is a set of element labels. A data tree conforms to the following syntax, where oid ∈ OID, a ∈ Σ_A, e ∈ Σ_E, and text denotes a standard string:

data tree ::= oid(tree, ..., tree),
tree ::= e-tree | a-tree | t-tree,
e-tree ::= ⟨e, oid⟩(tree, ..., tree),
a-tree ::= ⟨a, oid⟩(string),
t-tree ::= ⟨oid⟩(string). □

Each node has a unique identifier, oid, and all nodes except the root node and text nodes have labels assigned to the incoming edge. Attribute and text nodes have values assigned to them. An example of a data tree (with no attribute nodes) is given in Fig. 1.

[Figure 1: data tree. The root node 0 has one child, the partlist element 1, whose children are the part elements 2, 7, 14, 21 and 28. Each part has an id and a name child; parts 7, 14 and 21 also have a partof child. The (id, name, partof) values of the five parts are ('1', 'a', none), ('2', 'b', '1'), ('3', 'c', '2'), ('4', 'd', '2') and ('5', 'e', none).]

Fig. 1. Data tree describing a flat list of parts
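As an illustration of Definition 1, the data tree of Fig. 1 can be written down as a small Python structure. This is only a sketch: the class `Node` and the helpers `text`, `elem`, `part` and `oids` are names introduced here for illustration and are not part of the formalism. The node numbering follows the identifiers shown in Fig. 1.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One node of a data tree: root, element, attribute or text node."""
    oid: int                       # unique node identifier
    kind: str                      # "root", "element", "attribute" or "text"
    label: Optional[str] = None    # element/attribute label; None for root and text
    value: Optional[str] = None    # string value; only attribute and text nodes
    children: List["Node"] = field(default_factory=list)


def text(oid: int, value: str) -> Node:
    """t-tree: a text node carrying a string."""
    return Node(oid, "text", value=value)


def elem(label: str, oid: int, *children: Node) -> Node:
    """e-tree: a labelled element node with a sequence of subtrees."""
    return Node(oid, "element", label=label, children=list(children))


def part(oid: int, pid: str, name: str, partof: Optional[str] = None) -> Node:
    """Build one part subtree, numbering its nodes consecutively as in Fig. 1."""
    kids = [elem("id", oid + 1, text(oid + 2, pid)),
            elem("name", oid + 3, text(oid + 4, name))]
    if partof is not None:
        kids.append(elem("partof", oid + 5, text(oid + 6, partof)))
    return elem("part", oid, *kids)


# data tree ::= oid(tree, ..., tree), here with root identifier 0
fig1 = Node(0, "root", children=[
    elem("partlist", 1,
         part(2, "1", "a"),
         part(7, "2", "b", partof="1"),
         part(14, "3", "c", partof="2"),
         part(21, "4", "d", partof="2"),
         part(28, "5", "e"))])


def oids(node: Node) -> List[int]:
    """Collect node identifiers in document order (pre-order)."""
    result = [node.oid]
    for child in node.children:
        result.extend(oids(child))
    return result


print(oids(fig1))
```

Collecting the identifiers in document order is a quick way to check that every node has a unique oid, as the definition requires.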

To select parts of an XML document represented by a data tree, one can use XPath expressions [6]. A path expression locates nodes within a tree and returns a sequence of distinct nodes. A path expression is always evaluated with respect to a context. If E1/E2 or E1[E2] are path expressions, the context for E2 is a pair (x, S), where: (1) x ∈ S is called the context node; (2) S is the context set, i.e., an ordered set of nodes obtained by evaluating E1 in some context for E1. For every context (x, S) we can obtain the context size, i.e., the number of nodes in S, and the context position, i.e., the position of the context node x in the context set S. Notice that the values of both context position and context size are integers greater than zero (if the result of E1 is the empty sequence, it cannot be a context set for any expression).

A path expression in XPath consists of one or more steps separated by '/'. Each step selects a sequence of nodes. A step begins at the context node, navigates to those nodes that are reachable from the context node via a predefined axis, and selects some subset of the reachable nodes. A step has three parts: an axis, a node test, and zero or more predicates. The axis specifies the relationship between the nodes selected by the step and the context node; the node test specifies the node type and label (name) of the selected nodes; a predicate is a further filter for the set of selected nodes. A predicate consists of a predicate expression enclosed in square brackets; it filters a node set, retaining some nodes and discarding others. For example, in the path expression (qualified step) part[@price="100"], [@price="100"] is the predicate. In Tab. 1 we describe some axes, their abbreviated forms and examples of use (the node test node() is true for any node) [3, 6].

Table 1. Abbreviated syntax for XPath steps and their meaning

Abbreviated syntax   Meaning                        Description and example
.                    self::node()                   selects the context node; e.g., ./part is short for self::node()/child::part
..                   parent::node()                 selects the parent of the context node; e.g., ../part is short for parent::node()/child::part
//                   descendant-or-self::node()     e.g., part//part is short for part/descendant-or-self::node()/child::part and selects all part descendants of the context node
label                child::label                   selects the label element children of the context node; child:: can be omitted because it is the default axis; e.g., part/name is short for child::part/child::name
@label               attribute::label               selects the label attribute of the context node; e.g., [@price="100"] is short for [attribute::price="100"]
*                    child::*                       selects all element children of the context node
@*                   attribute::*                   selects all the attributes of the context node
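Some of the abbreviated steps of Tab. 1 can be tried out with Python's standard `xml.etree.ElementTree` module, which implements only a limited subset of XPath (not the full XPath 2.0 referred to in this paper). The sample document below, including the price attribute, is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# A small, made-up document; the nested part is a descendant, not a child,
# of the partlist element.
doc = ET.fromstring(
    "<partlist>"
    "<part price='100'><id>1</id><name>a</name></part>"
    "<part price='50'><id>2</id><name>b</name>"
    "<part price='100'><id>3</id><name>c</name></part>"
    "</part>"
    "</partlist>"
)

children = doc.findall("./part")            # child::part from the context node
descendants = doc.findall(".//part")        # part elements at any depth below it
names = doc.findall("part/name")            # child::part/child::name
priced = doc.findall("part[@price='100']")  # predicate on the price attribute
all_children = doc.findall("*")             # child::*

print(len(children), len(descendants), len(names), len(priced), len(all_children))
```

Here the context node is the partlist element itself, so the predicate step only filters its part children; the nested part with price 100 is reached by `.//part` but not by `part[@price='100']`.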

The semantics of XPath expressions is given by means of three semantic functions:

S : SExpr → [Node → 2^Node],
B : BExpr → [Node × 2^Node → {true, false}],
V : VExpr → [Node × 2^Node → String ∪ 2^String].

For a path expression p ∈ SExpr, S(p) is a function from the set Node of context nodes into the power set of nodes. Similarly, for a predicate expression b ∈ BExpr, B(b) is a function from the set Node × 2^Node of contexts into Boolean values, and for a value expression v ∈ VExpr, V(v) is a function from the set of contexts into the union of the set of string values and the power set of string values. The definition of the semantic functions is given in Tab. 2. The predicate axis(x, y), where axis ∈ {self, child, parent, descendant, descendant-or-self, attribute}, means that x is connected with y by means of the axis; the node tests e, a, ∗, and text() denote, respectively, the label of an element node, the label of an attribute node, any element or attribute node, and any text node. The function label(x) returns the label of the element or attribute node x, and the function value(x) returns the string-value of node x. The definition is coherent with that of Wadler [7].

Table 2. Semantics of XPath expressions

S(/p)(x) = S(p)(r), where r is the root of the document
S(axis::e)(x) = {y | axis(x, y) ∧ label(y) = e}, where axis ∈ {self, child, parent, descendant, descendant-or-self}
S(attribute::a)(x) = {y | attribute(x, y) ∧ label(y) = a}
S(axis::∗)(x) = {y | axis(x, y) ∧ y is an element node}, where axis ∈ {child, descendant, descendant-or-self}
S(attribute::∗)(x) = {y | attribute(x, y)}
S(axis::text())(x) = {y | axis(x, y) ∧ y is a text node}, where axis ∈ {self, child, descendant, descendant-or-self}
S(p1/p2)(x) = {z | ∃y ∈ S(p1)(x) ∧ z ∈ S(p2)(y)}
S(p1|p2)(x) = S(p1)(x) ∪ S(p2)(x)
S(p[b])(x) = {y | y ∈ S(p)(x) ∧ B(b)(y, S(p)(x))}
B(p)(x, S) = S(p)(x) ≠ ∅
B(v1 = v2)(x, S) = V(v1)(x, S) = V(v2)(x, S) (coercion and existential semantics may be used when needed)
B(not b)(x, S) = ¬B(b)(x, S)
B(b1 or b2)(x, S) = B(b1)(x, S) ∨ B(b2)(x, S)
B(b1 and b2)(x, S) = B(b1)(x, S) ∧ B(b2)(x, S)
V(string)(x, S) = string
V(p)(x, S) = {value(y) | y ∈ S(p)(x)}
V(position())(x, S) = the position of x in the ordered set S
V(last())(x, S) = size(S)
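The equations of Tab. 2 translate fairly directly into a compositional evaluator. The Python sketch below covers only part of the table and makes several simplifying assumptions that are not in the paper: string-values are stored directly on element nodes instead of in separate text nodes, only the self, child, descendant and descendant-or-self axes are handled, a string constant is coerced to a singleton set, and equality uses existential semantics. All function names (`step`, `seq`, `filt`, and so on) are chosen here for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)                      # nodes are compared by identity
class Node:
    oid: int
    label: str
    value: str = ""                       # string-value, stored directly (simplification)
    children: List["Node"] = field(default_factory=list)


def axis_nodes(axis: str, x: Node) -> List[Node]:
    """Nodes y with axis(x, y), in document order."""
    if axis == "self":
        return [x]
    if axis == "child":
        return list(x.children)
    below = [d for c in x.children for d in axis_nodes("descendant-or-self", c)]
    if axis == "descendant":
        return below
    if axis == "descendant-or-self":
        return [x] + below
    raise ValueError(f"unsupported axis: {axis}")


def step(axis, label):                    # S(axis::e)
    return lambda x: [y for y in axis_nodes(axis, x) if y.label == label]


def seq(p1, p2):                          # S(p1/p2)
    def run(x):
        out = []
        for y in p1(x):
            for z in p2(y):
                if z not in out:          # keep the result duplicate-free
                    out.append(z)
        return out
    return run


def filt(p, b):                           # S(p[b]): b sees the context (y, S(p)(x))
    def run(x):
        s = p(x)
        return [y for y in s if b(y, s)]
    return run


def exists(p):                            # B(p): the path result is non-empty
    return lambda x, s: len(p(x)) > 0


def neg(b):                               # B(not b)
    return lambda x, s: not b(x, s)


def val(p):                               # V(p): set of string-values
    return lambda x, s: {y.value for y in p(x)}


def const(string):                        # V(string), coerced to a singleton set
    return lambda x, s: {string}


def eq(v1, v2):                           # B(v1 = v2), existential semantics
    return lambda x, s: bool(v1(x, s) & v2(x, s))


def position_is(k):                       # B(position() = k)
    return lambda x, s: s.index(x) + 1 == k


def part(oid: int, pid: str, name: str, partof: Optional[str] = None) -> Node:
    kids = [Node(oid + 1, "id", pid), Node(oid + 3, "name", name)]
    if partof is not None:
        kids.append(Node(oid + 5, "partof", partof))
    return Node(oid, "part", children=kids)


partlist = Node(1, "partlist", children=[
    part(2, "1", "a"), part(7, "2", "b", "1"), part(14, "3", "c", "2"),
    part(21, "4", "d", "2"), part(28, "5", "e")])

top_level = filt(step("child", "part"), neg(exists(step("child", "partof"))))
in_part_2 = seq(filt(step("child", "part"),
                     eq(val(step("child", "partof")), const("2"))),
                step("child", "name"))
second = filt(step("child", "part"), position_is(2))

print([n.oid for n in top_level(partlist)])    # part[not partof]
print([n.value for n in in_part_2(partlist)])  # part[partof = "2"]/name
print([n.oid for n in second(partlist)])       # part[position() = 2]
```

Representing each expression as a closure keeps the correspondence with the table visible: path expressions take a context node, while predicates and value expressions take a context pair (x, S).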

3 Transformation of data trees

Uniform unranked transducers have been proposed by Martens and Neven [4] to transform input Σ-trees into output Σ-trees, where Σ is a labeling alphabet over which trees are defined. The set T_Σ of unranked Σ-trees is the smallest set of strings over Σ and parentheses such that if σ ∈ Σ and w ∈ T_Σ*, then σ(w) ∈ T_Σ. There is no a priori bound on the number of children of a node, so Σ-trees are unranked.

In this paper we follow the idea of unranked tree transducers. However, in our approach:
– we are interested in the transformation of data trees representing XML documents;
– transformation rules are defined for classes of nodes rather than for individual nodes; we therefore propose a notation for specifying transformation rules, which serves as a schema for producing transformation rules.

Now we define the tree transducer used in this paper. For a set Q, we denote by C_Σ,OID(Q) the set of tree components. By a tree component c ∈ C_Σ,OID(Q) we understand an expression with the syntax:

c ::= oid⟨e, oid′⟩(q1, ..., qn) | oid(tree).

Definition 2. A data tree transducer is a tuple (Q, Σ, OID, q0, R), where Q is a finite set of states, Σ is a labeling alphabet, OID is a set of identifiers, q0 ∈ Q is the initial state, and R is a finite set of rules of the form (q, oid) → c, where q ∈ Q, oid ∈ OID, and c ∈ C_Σ,OID(Q). □

The transformation defined by T = (Q, Σ, OID, q0, R) on a tree t in state q, denoted by T^q(t), is inductively defined as follows:

1. If t = oid(t1, ..., tN) or t = ⟨e, oid⟩(t1, ..., tN), and there is a rule (q, oid) → oid″⟨e′, oid′⟩(q1, ..., qn), then

T^q(t) = oid″⟨e′, oid′⟩(T^q1(t1), ..., T^q1(tN), ..., T^qn(t1), ..., T^qn(tN)),

i.e., T^q(t) is obtained from the right-hand side of the rule by replacing every qi by the sequence T^qi(t1), ..., T^qi(tN). We say that the transformation T^qi(tj) denotes the application of a rule with left-hand side (qi, oidj) in the context node oid, where oidj identifies the tree tj. In the transformation T^qi(tj) two identifiers are available: $qi = oidj and &qi = oid′.

2. Otherwise, the rule has the form (q, oid) → oid″(tree); its right-hand side constructs the tree according to the given specification, and oid″ becomes the parent of the tree.
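The first case of the inductive definition can be sketched as a short recursive procedure. The Python code below is an illustration under assumptions made here, not the paper's implementation: the input tree is a nested (oid, children) pair, the output is collected as (parent oid, label, oid) triples rather than as a nested term, and when no rule matches a pair (q, oid) nothing is produced. The rule table contains the part-level rules listed in Tab. 4; the copy rules for states q3 and q4 are left out, so the id and name children yield no output in this sketch.

```python
from typing import Dict, List, Tuple

# Input tree as (oid, children); labels are not needed to look up rules.
Tree = Tuple[int, list]

# (state, input oid) -> (output parent oid, label, output oid, next states)
Rule = Tuple[int, str, int, Tuple[str, ...]]


def transform(q: str, t: Tree, rules: Dict[Tuple[str, int], Rule],
              out: List[Tuple[int, str, int]]) -> None:
    """Compute T^q(t), appending (parent oid, label, oid) triples to out."""
    oid, children = t
    rule = rules.get((q, oid))
    if rule is None:
        return                      # no rule for (q, oid): nothing is produced
    parent, label, new_oid, states = rule
    out.append((parent, label, new_oid))
    for qi in states:               # T^q1(t1), ..., T^q1(tN), ..., T^qn(tN)
        for tj in children:
            transform(qi, tj, rules, out)


def leaf(oid: int) -> Tree:
    return (oid, [])


# Element nodes of the input tree of Fig. 1 (text nodes omitted).
tree: Tree = (1, [
    (2, [leaf(3), leaf(5)]),
    (7, [leaf(8), leaf(10), leaf(12)]),
    (14, [leaf(15), leaf(17), leaf(19)]),
    (21, [leaf(22), leaf(24), leaf(26)]),
    (28, [leaf(29), leaf(31)]),
])

rules: Dict[Tuple[str, int], Rule] = {
    ("q0", 1): (100, "parttree", 101, ("q1", "q2")),
    ("q1", 2): (101, "part", 102, ("q3", "q4")),
    ("q2", 7): (102, "part", 107, ("q3", "q4")),
    ("q2", 14): (107, "part", 114, ("q3", "q4")),
    ("q2", 21): (107, "part", 121, ("q3", "q4")),
    ("q1", 28): (101, "part", 128, ("q3", "q4")),
}

out: List[Tuple[int, str, int]] = []
transform("q0", tree, rules, out)
for triple in out:
    print(triple)
```

Because each rule names the output parent explicitly, the triples can be assembled into the output tree afterwards, independently of the order in which the rules fire.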

The set R of rules in a data tree transducer is defined on individual nodes. Now we define a notation for specifying such transformation rules. There are four kinds of transformation rules:

1. Initial rule: maps the root of the input document and its child nodes into the root and child nodes of the output document; it also defines states for further transformations. An initial rule specification is

(q0, /label) → new(/)⟨label′, new(.)⟩(q1, ..., qn), n ≥ 1.

The specification invoked in a context node x produces the following set of initial rules:

{(q0, oid) → oid″⟨label′, oid′⟩(q1, ..., qn) | oid ∈ S(label)(root(t)), oid″ = new(root(t)), oid′ = new(oid)}.

2. Intermediate rule: maps any node except the children of the root into appropriate nodes of the output document, and defines states for further transformations. An intermediate rule is specified by

(q, SExpr) → new(SExpr′)⟨label, new(.)⟩(q1, ..., qn), n ≥ 1.

The specification invoked in a context node x produces the following set of intermediate rules:

{(q, oid) → oid″⟨label, oid′⟩(q1, ..., qn) | oid ∈ S(SExpr)(x), oid″ = new(y), y ∈ S(SExpr′)(oid), oid′ = new(oid)}.

3. Copy rule: copies a subtree of the input tree and includes it in the output tree. It does not define states for further transformations, so every copy rule ends one transformation branch. A copy rule has the specification

(q, SExpr) → new(SExpr′)⟨label, copy(.)⟩

and, invoked in a context node x, produces the following set of copy rules:

{(q, oid) → oid″⟨label, oid′⟩ | oid ∈ S(SExpr)(x), oid″ = new(y), y ∈ S(SExpr′)(oid), oid′ = copy(oid)}.

4. Create rule: creates a tree and includes it in the output tree. It does not define states for further transformations and ends one transformation branch. A create rule is specified by

(q, SExpr) → new(SExpr′)(tree).

The specification invoked in a context node x produces the following set of create rules:

{(q, oid) → oid″(tree) | oid ∈ S(SExpr)(x), oid″ = new(y), y ∈ S(SExpr′)(oid)}.

The function new() used in rule specifications is a Skolem function. It returns a new node identifier only on its first invocation with a given argument, or when it is invoked without arguments. On subsequent invocations with the same argument it returns the identifier of the node created by the first invocation with this argument. The function copy(.) recursively traverses the subtree identified by its argument and computes new() (without arguments) for each node of this subtree. In this way an output image of the input subtree is created. The image inherits labels and values from the input subtree. Within a create rule, the tree is created according to its specification.
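The behaviour of new() and copy(.) described above can be illustrated with a memoising identifier generator. The following Python sketch is one possible reading, with assumptions introduced here: identifiers are integers drawn from a counter with an arbitrary starting value, arguments are any hashable values, and subtrees are represented as (oid, label, value, children) tuples.

```python
import itertools
from typing import Dict, Hashable, Optional

_counter = itertools.count(100)     # arbitrary starting identifier (assumption)
_memo: Dict[Hashable, int] = {}


def new(arg: Optional[Hashable] = None) -> int:
    """Skolem-style identifier generator.

    Without an argument a fresh identifier is returned on every call; with an
    argument the identifier created by the first call is returned again.
    """
    if arg is None:
        return next(_counter)
    if arg not in _memo:
        _memo[arg] = next(_counter)
    return _memo[arg]


def copy(tree: tuple) -> tuple:
    """Build an output image of tree: fresh identifiers, same labels and values."""
    oid, label, value, children = tree
    return (new(), label, value, [copy(c) for c in children])


a, b = new(2), new(2)               # same argument: same identifier
c, d = new(), new()                 # no argument: always fresh
image = copy((3, "id", None, [(4, None, "1", [])]))
print(a == b, c != d, image[1], image[3][0][2])
```

The memo table is what lets several rules refer to the same output node: any rule that evaluates new() on the same input node obtains the same output identifier, regardless of the order in which rules are applied.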

4 Illustrating example

Now we show how the tree transducer may be used to construct a hierarchical document of arbitrary depth from the flat data tree from Fig. 1. (This transformation problem was considered in [1] and [3].) In the input document each part may or may not be part of a larger part; if it is, the id of the larger part is contained in a partof element (see Fig. 1).

[Figure 2: output data tree. The root node 100 has one child, the element node 101, whose part children are 102 (id '1', name 'a') and 128 (id '5', name 'e'). Part 107 (id '2', name 'b') is a child of part 102, and parts 114 (id '3', name 'c') and 121 (id '4', name 'd') are children of part 107.]

Fig. 2. The result of the transformation of data tree from Fig. 1 by the transformation rules specified in Tab. 3

We want to convert the flat representation into a hierarchical representation in which part containment is represented by the structure of the document. Each partof element matches exactly one id. Parts having no partof element are not contained in any other part. The DTDs of the input and output documents are as follows: DTD of input document: DTD of output document: ]> ]>

Data trees representing instances of these DTDs are given in Fig. 1 and Fig. 2, respectively. Specifications of transformation rules are given in Tab. 3. Applying the specifications to the data tree t from Fig. 1 produces the set of transformation rules shown in Tab. 4 and in Fig. 3.

Table 3. Specifications of transformation rules

(q0, /partlist) → new(/)⟨parttree, new(.)⟩(q1, q2)
(q1, ./part[not partof]) → &q1⟨part, new(.)⟩(q3, q4)
(q2, ./part[partof]) → new(../part[id = $q2/partof])⟨part, new(.)⟩(q3, q4)
(q3, ./id) → &q3⟨id, copy(.)⟩
(q4, ./name) → &q4⟨name, copy(.)⟩

Table 4. Transformation rules produced by the rule specifications from Tab. 3

(q0, 1) → 100⟨parttree, 101⟩(q1, q2)
(q1, 2) → 101⟨part, 102⟩(q3, q4)
(q2, 7) → 102⟨part, 107⟩(q3, q4)
(q2, 14) → 107⟨part, 114⟩(q3, q4)
(q2, 21) → 107⟨part, 121⟩(q3, q4)
(q1, 28) → 101⟨part, 128⟩(q3, q4)
(q3, 3) → 102⟨id, 103⟩(⟨104⟩("1"))
...
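The step from the specifications of Tab. 3 to the part-level rules of Tab. 4 can be retraced with a short Python sketch. It is hand-written for this one example rather than a general interpreter of rule specifications, and it rests on assumptions made here: the flat part list of Fig. 1 is given as (oid, id, partof) triples, the Skolem function is replaced by the stand-in new(oid) = 100 + oid, which happens to reproduce the identifiers used in Tab. 4, and the states (q1, q2) and (q3, q4) on the right-hand sides are omitted.

```python
from typing import Dict, List, Optional, Tuple

# Flat part list of Fig. 1: (part oid, id value, partof value or None).
PARTS: List[Tuple[int, str, Optional[str]]] = [
    (2, "1", None), (7, "2", "1"), (14, "3", "2"), (21, "4", "2"), (28, "5", None),
]
ROOT, PARTLIST = 0, 1

RuleT = Tuple[Tuple[str, int], Tuple[int, str, int]]


def new(oid: int) -> int:
    """Stand-in Skolem function; 100 + oid reproduces the numbering of Tab. 4."""
    return 100 + oid


def part_rules() -> List[RuleT]:
    """Instantiate the q0, q1 and q2 specifications of Tab. 3."""
    by_id: Dict[str, int] = {pid: oid for oid, pid, _ in PARTS}
    # (q0, /partlist) → new(/)⟨parttree, new(.)⟩
    rules: List[RuleT] = [
        (("q0", PARTLIST), (new(ROOT), "parttree", new(PARTLIST)))]
    amp_q1 = new(PARTLIST)      # &q1: output node built by the invoking rule
    for oid, _, partof in PARTS:
        if partof is None:
            # (q1, ./part[not partof]) → &q1⟨part, new(.)⟩
            rules.append((("q1", oid), (amp_q1, "part", new(oid))))
        else:
            # (q2, ./part[partof]) → new(../part[id = $q2/partof])⟨part, new(.)⟩
            rules.append((("q2", oid), (new(by_id[partof]), "part", new(oid))))
    return rules


for (q, oid), (parent, label, out_oid) in part_rules():
    print((q, oid), "->", (parent, label, out_oid))
```

The q2 case shows why new() must be a Skolem function: the parent of a nested part is found by evaluating new() on the input part whose id equals the partof value, and this must yield the same output node that the rule for that input part creates.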

[Figure 3: step-by-step expansion of T^q0(t_partlist,1) into the output tree, showing the results of T^q1(t_part,2), T^q2(t_part,7), T^q2(t_part,14), T^q2(t_part,21), T^q1(t_part,28), T^q3(t_id,3) and T^q4(t_name,5).]

Fig. 3. Transformation of data tree from Fig. 1 by the transducer rules from Tab. 4

5 Conclusion

Exchange of information between heterogeneous Web applications requires that both the format and the structure of data be transformed frequently. The transformation must preserve the meaning of the data, so that sender and receiver can interpret it correctly. At present, XML has been accepted as a common data format for exchanging information. Therefore, the following issues are of special importance: (1) effective transformation of XML data preserving its meaning, and (2) a high-level mechanism for the specification of XML data transformations. In this paper, we propose a method based on uniform tree transducers to address both of these problems. The starting points of our work are W3C's standards as well as transformation methods for XML data based on tree automata [4, 8–10]. The proposed method is part of our project on processing semistructured data and XML [11–13].

References

1. Tang, X., Tompa, F.W.: Specifying transformations for structured documents. In Mecca, G., Simeon, J., eds.: Proceedings of the 4th International Workshop on the Web and Databases, WebDB 2001. (2001) 67–72
2. XSL Transformations (XSLT) 2.0. W3C Working Draft. www.w3.org/TR/xslt20 (2002)
3. XQuery 1.0: An XML Query Language. W3C Working Draft. www.w3.org/TR/xquery (2002)
4. Martens, W., Neven, F.: Typechecking top-down uniform unranked tree transducers. In: Database Theory - ICDT 2003. Lecture Notes in Computer Science 2572 (2003) 64–78
5. XQuery 1.0 and XPath 2.0 Data Model. W3C Working Draft. www.w3.org/TR/query-datamodel (2002)
6. XML Path Language (XPath) 2.0. W3C Working Draft. www.w3.org/TR/xpath20 (2002)
7. Wadler, P.: Two semantics for XPath. www.research.avayalabs.com/user/wadler/ (2000)
8. Vianu, V.: A Web odyssey: from Codd to XML. In: Proceedings of the 20th ACM Symposium on Principles of Database Systems PODS 2001, ACM Press (2001) 1–15
9. Milo, T., Suciu, D., Vianu, V.: Typechecking for XML transformers. In: Proceedings of the 19th ACM Symposium on Principles of Database Systems PODS 2000, ACM Press (2000) 11–22
10. Gottlob, G., Koch, C., Pichler, R.: Efficient algorithms for processing XPath queries. In: Proceedings of the 28th Conference on VLDB, Hong Kong, China. (2002) 95–106
11. Pankowski, T.: PathLog: A query language for schemaless databases of partially labeled objects. Fundamenta Informaticae 49 (2002) 369–395
12. Pankowski, T.: XML-SQL: An XML query language based on SQL and path tables. Lecture Notes in Computer Science 2490 (2002) 184–209
13. Pankowski, T.: Querying semistructured data using a rule-oriented XML query language. In: 15th European Conference on Artificial Intelligence ECAI 2002, (F. van Harmelen, ed.). IOS Press, Amsterdam (2002) 302–306
