of expressiveness w.r.t. S, an expression E of the form (A, B) cc, such as cc is equal to. MaxOccurs='unbounded' in G, simplifies to a finite series of ((A, B)|Empty).
An XML Document and Grammar Structure-based Comparison Framework Technical Report Joe Tekli1 and Richard Chbeir1 1 LE2I Laboratory UMR-CNRS, University of Bourgogne 21078 Dijon Cedex France {joe.tekli, richard.chbeir}@u-bourgogne.fr
1. Proof of Lemma 1 Recall the lemma itself: Lemma 1. Given an XML document tree S made of n nodes { S[1], S[2], …, S[n]}, and an XML grammar G to which the XML document conforms, without loss of expressiveness w.r.t. S, an expression E of the form (A, B) cc, such as cc is equal to MaxOccurs=’unbounded’ in G, simplifies to a finite series of ((A, B)|Empty) expressions repeated
MaxDeg(S) |E|
times, where MaxDeg(S)= Max {n. Deg } is the n∈S
maximum node degree in document tree S, and |E| the cardinality of E w.r.t. the main And alternativeness operator (e.g., for E = (A, B, C)*, |E| = 3) ● To guaranty XML grammar expressiveness, any XML document conforming to the grammar prior to the simplification phase must still conform to the transformed grammar afterwards. Here too cases have to be mentioned. Consider grammar expression E=(a, b)+, with ‘+’ MaxOccurs=‘unbounded’ (cf. Transformation Rule 2 in manuscript).
Root a
b
a
Original DTD grammar G
b
a
b
XML Tree P Root
c
a
b
a
b
a
b
c
c
c
c
c
c
XML Tree Q
DTD grammar G’ transformed w.r.t. XML document tree P c
DTD grammar G’’ transformed w.r.t. XML document Q
a. XML document trees
b. Simplifying expressions of the form (A, B)+
Fig. 1. XML grammar transformation w.r.t. document trees.
The first case is when the children of the maximum degree node in the XML document tree match the E expression to be simplified. Consider for instance XML tree P in Fig. 1 where the maximum degree node is that of its root P[0] having P[1] = a, P[2] = b, P[3] = a , …, P[6] = b as its children (here we consider single elements for the sake of simplicity). Then, expression E simplifies to (a, b)? ((a, P[0].Deg b)|Empty)) repeated 3 times, . The XML Document of tree representation P 2 remains valid w.r.t. the resulting grammar G’, G’ being equivalent in its expressiveness to G, w.r.t. document tree P. The second case is when the children of the maximum degree node in the XML document tree do not correspond to the E expression to be simplified. Consider for instance XML tree Q in Fig. 1 where the maximum degree node is Q[2] bearing 8 children nodes of label c. Then, following Lemma 1, expression (a, b)* simplifies to Q[2].Deg . The XML Document of tree representation Q ((a, b)|Empty) repeated 4 times, 2 remains valid w.r.t. the resulting grammar D’’, D’’ being equivalent in its expressiveness to D, w.r.t. document tree Q. 2. Proof of Lemma 2 Lemma 2. Let S be and XML tree and G be a conjunctive XML grammar tree. Algorithm TEDXDoc_XGram produces the minimum-cost edit script transforming XML document tree S to one conforming to grammar tree G •
On one hand, alforithm TEDXDoc_XGram simplifies to the classic tree edit distance algorithm introduced in [8] (producing minimal cost edit scripts) when all nodes of grammar tree G are mandatory. In other words, when all nodes of G are associated null cardinality constraints (MinOcc=MaxOcc=1), the extended tree edit distance formulation (Fig. 10 in manuscript, lines 27-28) simplifies to its classic counterpart: α = Dist[i-1][j-1] + TEDXDoc_XGram(Si, Gj) + CostInsTree(Gj) × (MinOcc_Counter[j] – MinOcc_InvCounter[j])
Ø
(1)
α = Dist[i-1][j-1] + TEDXDoc_XGram(Si, Gj)
On the other hand, when cardinality constraints (i.e., MinOccurs and MaxOccurs) come to play, they are handled via dedicated counters keeping track of the required number of minimum (MinOcc_Counter) and maximum (MaxOcc_Counter) repetitions for each grammar node in the corresponding document tree, as well as the actual number of node repetitions (MinOcc_InvCounter) in the document tree. These allow to consider, for each grammar node/sub-tree, the exact number of occurrences to be: − Added: when the actual number of occurrences, underlined by MinOcc_InvCounter, is lesser than the minimum number of repetitions required by the grammar, designated via MinOcc_Counter, − Removed: when the actual number of occurrences exceeds the maximum number of repetitions required by the grammar constraint, underlined via MaxOcc_Counter (cf. Section 3.4.2 in manuscript, for a detailed description of each counter and how it is handled in the TEDXDoc_XGram algorithm).
Thus, by extending the classic edit distance formulation in [8] (which produces minimal cost edit scripts) so as to consider precisely the ‘exact’ number of occurrences to be added or removed, for each node/sub-tree, when MinOccurs and MaxOccurs constraints come to play (as described in Section 3.4.2 of the manuscript), the resulting TEDXDoc_XGram algorithm would consequently produce minimum cost edit scripts. 3. Proof of Lemma 3 Lemma 3. Let |S| and |G| be the respective cardinalities of the XML document and XML grammar being compared, NG the maximum number of conjunctive grammars corresponding to G, and |CG| the cardinality of the largest conjunctive grammar tree of G. Consequently, the complexity of the edit distance phase, O(NG×|S|×|CG|), simplifies to O(|S|×|G|2) ●
In a grammar definition, ‘Or’ operators could occur at the same structural level or could be encapsulated in each other. Thus, three cases have to be considered in assessing TEDXDoc_XGram’s time complexity. First, consider the case where ‘Or’ operators appear exclusively at the same level. In other words, the corresponding grammar contains a single element definition of the form such as ‘a’ is the root node or an inner node, and ‘b’, ‘c’, ‘d’, … are leaf nodes or composite elements with ‘And’ operators (we utilize here the DTD syntax for ease of presentation). In such grammars, the maximum number of ‘Or’ operators is |G|-2, |G|-1 being the maximum number of conjunctive grammars corresponding to the XML grammar at hand. Hence, when the number of ‘Or’ operators as well as the number of conjunctive grammars are maximal, the maximum cardinality among conjunctive grammars becomes constant: |CG| = 2, conjunctive grammars coming down to a maximum of two nodes: the root node and a single sibling. Thus, when ‘Or’ operators appear exclusively at the same level, the time complexity of TEDXDoc_XGram, O(NG×|S|×|CG|) is equal to O(|S|×(|G|1)×2), which simplifies to O(|S|×|G|). Table 1. XML grammars made of concatenated ‘Or’ operators, such as the number of ‘Or’ operators, and thus the number of conjunctive grammars NG are maximized.
Grammar ... … … …
Number of ‘Or’ operators
|G|
NG
|CG|
1
3
2
2
2
4
3
2
3
5
4
2
4
6
5
2
|G|-2
|G|
|G|-1
2
… Recursively,
Second, consider the case where ‘Or’ operators are encapsulated in each other, such as no two ‘Or’ operators appear at the same level. Corresponding grammars are of the form: … where ‘a’ is the root node or an inner node, and ‘b’, ‘c’, ‘e’, … are leaf nodes or composite elements with ‘And’ operators. In such cases, the maximum number of ‘Or’
operators is
|G|-1
, the maximum number of conjunctive grammars being
|G|+1
(Table 2).
|G|+1
. Hence, Here, the maximum cardinality among conjunctive grammars is equal to when ‘Or’ operators are encapsulated, TEDXDoc_XGram’s complexity O(|S|×NG×|CG|) is |G|+1
equal to O(|S|×
×
|G|+1
|G|2
), which simplifies to O( |S|×
) < O(|G|2×|S|).
Table 2. Grammars made of encapsulated ‘Or’ operators, such as the number of ‘Or’ operators, and thus the number of conjunctive grammars NG, are maximized.
Grammar
Number of ‘Or’ operators
|G|
NG
|G’|
1
3
2
2
2
5
3
3
3
7
4
4
4
9
5
5
|G|-1 2
|G|
|G|+1 2
|G|+1 2
... … … …
… Recursively,
Third, consider the general case where ‘Or’ operators could be encapsulated or concatenated at the same level (e.g., definitions are of the form a((b|c)(d(e|f)|g))…). Complexity analysis here corresponds to the worst case scenario such as: the maximum number of ‘Or’ operators is |G|-2, the maximum number of conjunctive grammars is |G|-1, and the maximum cardinality among conjunctive grammars is |G|+1 . Hence, TEDXDoc_XGram’s time complexity, O(|S|×NG×|CG|) is equal to O(|S|×(|G|1)×
|G|2
|G|+1
), which simplifies to O(|S|×
) < O(|S|×|G|2).
4. Computation Example
Here, we present the result of comparing sample XML document PaperDoc in Fig. 1 to the XML grammar Paper.xsd in Fig. 2.a. *
… … … … … …
a. Sample XML document PaperDoc.
XML tree S Paper Title
Publisher
FirstName
Version
LastName
Length
url
Homepage
Download
b. XML tree representation S of PaperDoc.
Fig. 1. Tree representation of an XML document structure.
a. A sample XML grammar Paper.xsd.
c. Second conjunctive grammar Paper2.xsd.
b. First conjunctive grammar Paper1.xsd.
d. Third conjunctive grammar Paper3.xsd.
Fig. 2. Sample XML grammar, and corresponding conjunctive grammars.
Following our approach, comparing XML document PaperDoc to grammar Paper.xsd comes down to computing the edit distance between corresponding tree representations: S and P={C1, C2, C3}, the latter representing the disjunctive normal form of grammar Paper.xsd (cf. Fig. 3, Paper.xsd encompassing two Or operators): − Dist(S, C1) = 4, is the cost of edit script ES(S, C1) = {Upd(S[2], C1[3]), Ins(C1[3]), Ins(C1[4]), Ins(C1[6])}, and reflects the differences between document tree S and the grammar tree C1, i.e., the modifications that have to be done to S, to become valid w.r.t. C1. These modifications amount to i) updating node S[2] of label Publisher, transforming it into Author, which is of unit cost, and ii) inserting nodes of labels Author, FirstName and LastName, of individual unit costs, since the latter must occur twice in tree S (cf. Table 3). − Dist(S, C2) = 2, is the cost of deleting nodes S[3] and S[4] (of labels FirstName and LastName respectively) from document tree S, ES(S, C2)={Del(S[3]), Del(S[4])}, so that S becomes valid w.r.t. C2.
−
Dist(S, C3) = 0. In other words, no changes need to be made so as to transform S, when computing edit distance, since it is already conforms to conjunctive grammar tree C3 (detailed edit distance matrixes, when computing Dist(S, C2) and Dist(S, C3), are provided in Tables 3, 4 and 5).
Paper
Category MinOcc=0
Author MinOcc=2
Title
MaxOcc=10 Version
MiddleName MinOcc=0
FirstName
Length MinOcc=0 url MinOcc=0
LastName
Homepage
MaxOcc=∞
Download MaxOcc=0
a. Tree representation C1 of Paper1.xsd. Paper
Category MinOcc=0
Publisher
Title
Version
url MinOcc=0
Length MinOcc=0
Homepage
MaxOcc=∞
Download MaxOcc=2
b. Tree representation of C2 of Paper2.xsd.
Paper
Category MinOcc=0
Title
Publisher
Version
MiddleName MinOcc=0
FirstName
Length
url MinOcc=0
MinOcc=0
LastName
MaxOcc=∞
Download MaxOcc=2
Homepage
c. Tree representation of C3 of Paper3.xsd.
Fig. 3. Tree representations corresponding to the set of conjunctive grammars representing the disjunctive normal form of grammar Paper.xsd (cf. Fig. 2).
The first line of the distance matrix, in Table 3, i.e., Dist[0][], corresponds to the sum of the costs of inserting every node of the grammar tree C1. Likewise, the first column, Dist[][0], underlines the sum of the costs of deleting every node of XML tree S. Consequently, the algorithm identifies the combination of insertion/deletion operations of minimum overall cost in populating the remainder of the matrix, TEDXDoc_XGram(S, C1) = Dist[|S|][|C1|] underlining the final distance value. Table 3. Tree edit distance computatirons when comparing XML document tree S and XML conjunctive grammar tree C1. R(C1) C11 (Category, (Paper) MinOcc=0) R(S) (Paper) S1 (Title) S2 (sub-tree of root Publisher) S3 (Version) S4 (Length) S5 (sub-tree of root url)
C12 (Title)
C13 (sub-tree of root Author, MinOcc=2 MaxOcc=10)
C14 (Version)
C15 (Length, MinOcc=0)
C16 (sub-tree of root url, MinOcc=0 MaxOcc=∞)
0
0
1
9
10
10
10
1
1
0
8
9
9
9
4
4
3
4
5
5
5
5
5
4
5
4
4
4
6
6
5
6
5
4
4
9
9
8
9
8
8
4
Table 4. Tree edit distance computations when comparing XML document tree S and XML conjunctive grammar tree C2.
R(S) (Paper) S1 (Title) S2 (sub-tree of root Publisher) S3 (Version) S4 (Length) S5 (sub-tree of root url)
R(C2) (Paper)
C21 (Category, MinOcc=0)
C22 (Title)
C23 (Publisher)
C24 (Version)
C25 (Length, MinOcc=0)
C26 (sub-tree of root url, MinOcc=0 MaxOcc=∞)
0
0
1
2
3
3
3
1
1
0
1
2
2
2
4
4
3
2
3
3
3
5
5
4
3
2
2
2
6
6
5
4
3
2
2
9
9
8
7
6
5
2
Table 5. Tree edit distance computations when comparing XML document tree S and XML conjunctive grammar tree C3. R(C3) C31 (Category, C32 (Paper) MinOcc=0) (Title) R(S) (Paper) S1 (Title) S2 (sub-tree of root Publisher) S3 (Version) S4 (Length) S5 (sub-tree of root url)
C33 (sub-tree of root Publisher)
C34 C35 (Length, C36 (sub-tree of root url, (Version) MinOcc=0) MinOcc=0 MaxOcc=∞)
0
0
1
2
3
3
3
1
1
0
1
2
2
2
4
4
3
0
1
1
1
5
5
4
1
0
1
1
6
6
5
2
1
0
0
9
9
8
5
4
3
0
5. Applications of XML Document and Grammar Comparison
5.1. XML Document Classification
XML document/grammar similarity evaluation enables the classification of XML documents gathered from the web against a set of grammars declared in an XML database (data/type comparison layer). A scenario provided by Bertino et al. [1] comprises a number of heterogeneous XML databases that exchange documents with each other, each database storing and indexing the local documents according to a set of predefined DTD grammars. Consequently, XML documents introduced in a given database are matched, via an XML structural similarity method, against the local grammars. In such an application, a similarity threshold should be identified, underlining the minimal degree of similarity required to bind an XML document to a grammar. The XML grammar for which the similarity degree is highest, and above the specified threshold, is selected. Thus, the XML document at hand is accepted as valid for that grammar. When the similarity degree is below the threshold, for all grammars in the XML database, the XML document is considered unclassified and is stored in a repository of unclassified documents. As a result, none of the access protection, indexing and retrieval facilities specified at XML grammar level can be applied to such documents (similarly to schemas and traditional DBMS) [1].
5.2. XML Document Transformation
An issue complementary to XML document classification, when populating an XML database on the web, is that of document transformation. After having migrated a set of documents, collected from various data sources, into an XML database (defined by a set of grammars), collected documents may be similar to, but not satisfy, any grammar in the database. Hence, storing and consequently managing the documents in the database require i) identifying which of the grammars is most similar to the document at hand (i.e., classification phase [1], where the similarity measure itself is needed, cf. Section 5.1), and then ii) transforming the documents into valid ones w.r.t. their most similar grammars (i.e., document transformation, also known as document revalidation [7]). Here, the advantage of using an XML document/grammar comparison method based on the concept of edit distance becomes obvious. With such a method, a mapping between the nodes in the compared structures is provided in terms of the edit script, along with the similarity value itself. The mapping would thus describe the set of transformation operations to be applied to the document, so that it would become valid w.r.t. the grammar under which it was classified. 5.3. Selective Dissemination of XML Documents
SDI (Selective Dissemination of Information) systems for XML-based data become increasingly popular with the growing use of XML on the web. An SDI system basically manages user preferences so as to identify the users to whom incoming web documents should be broadcasted [9]. It basically allows the filtering of streams of XML documents w.r.t. user preferences. In this context, XML classification and evolution techniques could be employed to build an effective SDI system dedicated to XML-based data. A user profile could be expressed as an XML grammar (DTD or XSD). Consequently, XML classification could be utilized to filter documents based on their similarities w.r.t. the considered grammars. The grammar, describing the user profile, could be initially specified by the user, or automatically inferred from documents previously deemed valuable by the user via means of document clustering [4, 8] and structure extraction techniques [5]. The selective dissemination of XML data can then be undertaken by matching each XML document in the incoming data stream against the grammars modeling user profiles. Documents are distributed to the users whose document/grammar similarities are above a predefined threshold. 5.4. XML Grammar Evolution
XML grammar evolution underlines the modification of a grammar describing a class of documents, in order to capture more accurately the structural characteristics of corresponding XML documents. In other words, it allows to reduce the divergence between the structures of the documents being classified under the grammar, and the structure as specified by the corresponding grammar. However, the evolution phase is a costly process [1] since it requires the use of data mining association rules [6] and structure extraction techniques [5], so as to capture frequent patterns of element structures in the document instances and consequently generate the updated grammars [3]. Hence, it ought to be triggered when the grammar (profile) is not representative anymore of its classified documents (accessed documents) [1], which is where XML document/grammar similarity comes to play. Here, the user/administrator could i) specify an XML document/grammar similarity threshold, that a minimum number of
documents in the XML document class must respect, so as to prevent the evolution phase, ii) or specify the maximum number of non-conforming documents a class must encompass (i.e., documents with SimXDoc_XGram