XCentric: A logic-programming language for XML ... - CiteSeerX

XCentric: A logic-programming language for XML Processing Jorge Coelho1 and Mário Florido2 1

Instituto Superior de Engenharia do Porto & LIACC Porto, Portugal 2 University of Porto, DCC-FC & LIACC Porto, Portugal {jcoelho,amf}@ncc.up.pt

Abstract. In this paper we present the logic-programming language XCentric, discuss design issues, and show its adequacy for XML processing. Distinctive features of XCentric are a powerful unification algorithm for terms with functors of arbitrary arity (which correspond closely to XML documents) and a very rich type language that uses operators such as repetition (*), alternation, etc, as types allowing a compact representation of terms with functors with an arbitrary number of arguments (closely related to standard type languages for XML). This non-standard unification together with an appropriate use of types yields a substantial degree of flexibility in programming.

1

Introduction

XML is a powerful format for tree-structured data. A need for programming language support for XML processing led to the definition of XML programming languages, such as XSLT [22], XDuce [10], CDuce [1], XMλ [15], SXSLT [11] and Xtatic [23]. In this paper we present XCentric, a language based on the unification of terms with flexible arity function symbols extended with a new type system for dealing with sequences and new features for searching sequences inside trees at arbitrary depth. The main contributions of this work are: Regular expression types and unification: regular expression types give us a compact representation of sequences of arguments of functors with flexible arity. They also add extra expressiveness to the unification process. Let us present an illustrating example. The following declaration introduces regular expression types describing terms in a simple bibliographic database: :-type :-type :-type :-type

bib ---> bib(book+). book ---> book(author+, name). author ---> author(string). name ---> name(string).

Type expressions of the form f (...) classify tree nodes with the label f (XML structures of the form < f > ... < /f >). Type expressions of the form t∗ denote a sequence of arbitrary many ts, and t+ denote a sequence of arbitrary many ts with at least one t. Thus terms with type bib have bib as functor and their arguments are a sequence of one or more books. Terms with type book have book as functor and their arguments are a sequence consisting of one or more authors followed by the name of the book. The next type describes arbitrary sequences of authors with at least two authors: :-type type_a ---> (author(string),author(string),author(string)*).

To get the names of all the books with two or more authors we just need the following unification (= ∗ = stands for unification of terms with flexible arity functors and t :: τ means that term t has type τ ): bib(_,book(X::type_a,name(N)),_) =*= BibDoc::bib.

This unifies two terms typed by bib. The type declaration in the first argument is not needed because it can be easily reconstructed from the term. In this case we bind the variable N to the name’s content of the first book with at least two authors. Note how the type type a in the first argument of the unification is used to jump over an arbitrary number of arguments and extract the name of the first book with at least two authors. All the results can then be obtained, one by one, by backtracking. Incomplete terms in depth. The previous point gives XCentric the power of partially specifying terms in breath (i.e. within the subterms of the same parent term). In XCentric one can also partially specify a term in depth using the deep predicate. A query term of the form deep(t1 , t2 ) matches all subterms of t2 that match the term t1 . Consider the following example: in XCentric we can explicitly refer to sequences of terms t1 , ..., tn as < t1 , ..., tn >. Suppose we want to find sequences of two authors in a document instantiating the variable XML. We can use the query: ? − deep(< author(A1 ), author(A2 ) >, XM L). The names of the two authors will bind variables A1 and A2 , and all solutions can be found by backtracking. Note that we can search not only single elements but also sequences of elements. XCentric is not yet another language for XML processing in the sense that it is not aimed to replace or even compete with mainstream XML processing languages, which are based on the functional programming paradigm. The goal of XCentric is to give programmers using the logic programming paradigm a tool that makes it easier to process XML. This is relevant for the logic programming community because subjects such as databases or data-mining are quite relevant application areas of logic programming, and in these areas XML is becoming more and more a standard data format for information exchange.

Dealing directly with sequences of terms and using types as a compact representation of these sequences has been used intensively by functional mainstream XML processing languages based on pattern matching, such as XDuce and CDuce [10, 1]. In contrast logic programming languages usually have libraries [17, 16] to deal with XML that do not use the notions of sequence or regular types: they just translate XML documents to a list of terms. XCentric uses recent results of unification theory (unification of terms with functors of flexible arity [12, 5]) and type systems for logic programming (regular types [25, 9]) as the theoretical basis of a logic programming language for XML processing where sequences of terms are first class objects of the language. As the next examples show, this leads to a much more declarative way of processing XML when compared to the standard Prolog libraries for XML processing. This article focuses on language design, shows its adequacy to write applications that transform and query XML documents, and sketches solutions to implementation issues. To keep the presentation short, we just present some highlights of the language. In the homepage of XCentric, http://www.ncc.up.pt/xcentric/, one can find other references, many XCentric programs, and the XCentric distribution. Theoretical foundations of the non-typed version of flexible arity term unification can be found in [12, 5]. In the rest of the paper we assume that the reader is familiar with logic programming ([14]) and XML ([20]). We start in section 2 by presenting the related work, then in section 3 we present some examples. In section 4 we present sequence variables and flexible arity functions and in section 5 we present the type system of XCentric. In section 6 we explain the role of types in the unification process, then in section 7 we show how XCentric searches sequences in depth and finally in section 8 we conclude and outline the future work.

2

Related Work

In a previous work reported in [5] we proposed unification of terms with functors of arbitrary arity (see [5] for further references) as a basis for XML processing in a logic programming language. The language presented in the previous paper had no types: basically it presented a depth-first formulation of an unification algorithm for flexible arity untyped terms. The system presented in [19] typed a small subset of the logic XML query language Xcerpt [3] with a formalism based on tree automata. The previous work extended tree automata to deal directly with terms with functors of arbitrary arity. In our work we do not need such extension because the arbitrary number of arguments in our framework is denoted by the notion of sequence that is expressible using two symbols of fixed arity 2 and 0. Our work is very much inspired by functional languages for XML processing such as XDuce ([10]) and CDuce ([1]). These languages relied on the notion of trees with an arbitrary number of leaf nodes to abstract XML documents. However they were based on functional programming where the key feature is pattern matching, not unification. Regular expression patterns are often ambiguous and

thus functional XML processing languages presume a fixed matching strategy for making the matching deterministic. In contrast, our work just leaves ambiguous matching non-deterministic and examines every possible matching case by backtracking. This makes it possible to write complicated XML processing tasks in a quite declarative way. Regular expression matching was also used in [13] to extend context sequence matching with context and sequence variables. This previous work dealt with matching, not unification, and it was not integrated in a programming language. In another previous work [4] we proposed the use of regular types for static type-checking in processing XML in standard Prolog. Other approaches based on standard Prolog unification to process XML include [16] and [2]. There are other approaches to the unification of infinite trees based on rational trees [7], but in these approaches the problem addressed is different from ours: rational trees deal with trees with an infinite depth and our approach deals with trees where each node may have an infinite number of leafs.

3

XCentric by example

Here we present several simple examples to familiarize the reader with the language before presenting the details. 3.1

Types and Unification

An XML document is translated to a term with flexible arity function symbol. This term has a main functor (the root tag) and zero or more arguments. Although our actual implementation translates attributes to a list of pairs, since attributes do not play a relevant role in this work we will omit them in the examples, for the sake of simplicity. Example 31 Consider the simple XML file and the corresponding term: John New York [email protected] Sofia Rio de Janeiro 123456789 [email protected]

addressbook(record( name(’John’), address(’New York’), email(’[email protected]’)), record( name(’Sofia’), address(’Rio de Janeiro’), phone(’123456789’), email(’[email protected]’)))

This XML file is valid accordingly to the following DTD:

addressbook (record*)> record (name,address,phone?,email)> name (#PCDATA)> address (#PCDATA)> phone (#PCDATA)> email (#PCDATA)>

Where type addr is described by the following type rule: :-type type_addr ---> addressbook(record(name(string),address(string), phone(string)?,email(string))*)

From now on, whenever a variable is presented without any type information it is implicitly associated with the universal type any (which types any term). Through the following examples we will use the built-in predicates xml2pro and pro2xml which respectively convert XML files into terms and vice-versa. We will also use the predicate newdoc(Root,Args,Doc) where Doc is a term with functor Root and arguments Args. Example 32 We use a new syntax for sequences of terms: stands for the empty sequence and < e1 , . . . , en > stands for the sequence of elements e1 to en . Also we use t :: τ to assign the type τ to the term t. In this example we use the address book document presented before. In this address book we have sometimes records with a phone tag. We want to build a new XML document without this tag. This can be done by the following program (this example is similar to one presented in XDuce [10]): :-type type_a ---> record(name(string),address(string),phone(string)?, email(string)). :-type type_r ---> record(name(string),address(string),email(string)). translate:xml2pro(’addressbook.xml’,Xml), process(Xml,NewXml), pro2xml(NewXml,’addressbook2.xml’). process(,NewA):r_without_phone(A2,A3), newdoc(addressbook,A3,NewA). r_without_phone(,):-!, filter(X,X2), r_without_phone(S2,S3). r_without_phone(_,()). filter(,):-!. filter(X,X).

Example 33 In this example we have two XML documents with a catalogue of books in each (“bookstore1.xml” and “bookstore2.xml”). These catalogues refer to two different book stores. Both “bookstore1.xml” and “bookstore2.xml” have the same DTD and may have similar books. One of these XML documents can be:

... Haskell: The Craft of Functional Programming (2nd Edition) Simon Thompson 41 1999 ...

To check which books are cheaper at bookstore 1 we have the following program: :-type tc ---> catalog(book(name(string),author(string), price(string),year(string))*). best_prices(B):xml2pro(’bookstore1.xml’,B1), xml2pro(’bookstore2.xml’,B2), process(B1,B2,B). process(Books1,Books2,[N,A]):Books1::tc =*= catalog(_,book(name(N),author(A),price(P1),year(Y)),_), Books2::tc =*= catalog(_,book(name(N),author(A),price(P2),year(Y)),_), atom2number(P1,P1f),atom2number(P2,P2f),P1f < P2f.

The predicate best prices/1 returns the cheaper books at “bookstore1.xml”, one by one, by backtracking. Example 34 Given the files bibs.xml and reviews.xml, both containing entrys of books, create a new file with the prices on both files for each book: translate:xml2pro(’./examples-xquery/bibs.xml’,Bib), xml2pro(’./examples-xquery/reviews.xml’,Rev), process(Bib,Rev,NewXml), pro2xml(NewXml,’./examples-xquery/prices.xml’). process(,,NewA):findall(F,query_pub(B,R,F),FL), newdoc(books_with_prices,[],FL,NewA).

query_pub(B,R,book_with_price([],title([],T),price_bstore2([],P2),price_bstore1([],P1))): =~= B, =~= R.

The result is the file: TCP/IP Illustrated 65.95

65.95 Advanced Programming in the Unix environment 65.95 65.95 Data on the Web 34.95 39.95

Example 35 In this example we have two XML files, a text.xml file conforming to the following DTD:

and a bib.xml file with references conforming to the DTD:
bib (author, name)> bibliography (bib+)> author (#PCDATA)> name (#PCDATA)>

Our text.xml < ref > elements contain author names which appear in the < author > element of bib.xml. Our goal is to process text.xml and bib.xml and generate new text2.xml and bib2.xml files. The text2.xml file is almost the same than text.xml but < ref > elements are replaced by new < i > elements where the content is replaced by a number. This number represents the index of that reference ordered by author within the document. The bib2.xml contains only the references appearing in text.xml, ordered by author, and with a label element with the corresponding number that occurs in the text2.xml. This larger example illustrates one advantage of using unification in XML processing. As we shall see the use of unification will solve the problem processing the input XML file only once. Without using unification a direct algorithm for the same problem would have to process the input XML file twice. Of course that this could be optimized but would lead to a much less declarative and direct implementation. The program follows (; stands for type disjunction): :-type type_t ---> (string; ref(string); b(string))*. :-type type_i ---> (string; i(string); b(string))*. :-type type_b ---> bibliography(bib(author(string),name(string))*). run:xml2pro(’text.xml’,T), process(T). process(text(T)):-

process2(T,T2,[],Refs), xml2pro(’bib.xml’,B), add_bib(Refs,B,BibXML,1), newdoc(bibliography,[],BibXML,NewBIB), pro2xml(text(T2),’text2.xml’), pro2xml(NewBIB,’bib2.xml’). process2((),(),L,L). process2(A,B,L,RRefs):A::type_t =*= , B::type_i =*= ,!, insert_sorted(R,NewVar,L,Refs), process2(Y,Y2,Refs,RRefs). process2(S,S,L,L). add_bib([],_,[],_). add_bib([(A,AI)|Refs],B,[bib(label(AI),author(A),name(N))|XML],I):B::type_b =*= bibliography(_,bib(author(A),name(N)),_), I1 is I + 1, atom_chars(I,AI), add_bib(Refs,B,XML,I1).

The run predicate starts the translation process. In a first step, the process2 predicate translates the original text into a new one where the < ref > elements are replaced by new < i > elements and the content is replaced by a new free variable. At the same time pairs (Ref content, F reeV ariable) (corresponding to the author in < ref > and the new variable in < i > respectively) are inserted in a list ordered by Ref content. This is done by predicate insert sorted. At the end we have a list of pairs ordered by author. In a second step, add bib queries the bibliography file, retrieving the references found in the text and replacing the free variables with their correct content, thus, avoiding processing the text file twice. For example, if text.xml corresponds to: Mainstream languages for XML processing such as XDuce Hosoya, CDuce Frish Casagna and Benzaken and Xtatic Pierce rely on the notion of trees with an arbitrary number of leaf nodes to abstract XML documents.

and the bib.xml to: Coelho and Florido Type-based XML Processing in Logic Programming Hosoya

XDuce: A Typed XML processing language Frish Casagna and Benzaken CDuce an XML-centric general-purpose language Pierce Xtatic. http://www.cis.upenn.edu/~bcpierce/xtatic/

the resulting text2.xml will be: Mainstream languages for XML processing such as XDuce 2, CDuce 1 and Xtatic 3 rely on the notion of trees with an arbitrary number of leaf nodes to abstract XML documents.

and the bib2.xml: 1 Frish Casagna and Benzaken CDuce an XML-centric general-purpose language 2 Hosoya XDuce: A Typed XML processing language 3 Pierce Xtatic. http://www.cis.upenn.edu/~bcpierce/xtatic/

Example 36 In this example we go back to the first example presented in Introduction of the paper to show the gain of using unification of terms with flexible arity and types in contrast to standard Prolog unification. In the example presented at the Introduction, to get the names of all the books with two or more authors we just need the following query: ? - bib(_,book(X::type_a,name(N)),_) =*= BibDoc::bib,write(N).

To do the same thing using only SWI-Prolog (which has a quite good library for processing XML in Prolog): pbib([element(_,_,L)]):pbib2(L). pbib2([]). pbib2([element(’book’,_,Cont)|Books]):authors(Cont),!,pbib2(Books). pbib2([_|Books]):pbib2(Books).

authors([element(’author’,_,_),element(’author’,_,_)|R]):write_name(R). write_name([element(’name’,_,[N])]):write(N),nl. write_name([_|R]):write_name(R).

3.2

Incomplete terms in depth

We provide predicates that allow the programmer to find a sequence of elements at arbitrary depth, to search for the nth occurrence of a sequence of elements and to count the number of occurrences of a sequence. The predicates are deep/2, deepp/3 and deepc/3, respectively. For example, consider we have an XML document bound to the variable XML. Suppose we want to find a sequence of elements between two elements named incision and store it in variable Critical. We can do the query: ? - deep(,XML). If we want to find the text of the third occurrence of element author in document Bib we can simply write: ? - deepp(author(T),Bib,3). Here, variable T will be instantiated with the name of the author. If we want to count the number of occurrences of author elements in document Book we can simply do: ? - deepc(author( ),Book,C). Note that any of these predicates use sequences of elements. We now present some examples that use the deep family of built-ins. Example 37 This example is based on a medical report using the HL7 Patient Record Architecture and inspired by the XQuery use cases available at [21]. Given report1.xml:
Procedure The patient was taken to the operating room where she was placed... induced under general anesthesia. A Foley catheter was placed to decompress the bladder and the abdomen was then prepped and draped in sterile fashion. A curvilinear incision was made in the midline immediately infraumbilical and the subcutaneous tissue was divided using electrocautery.

The fascia was identified and #2 0 Maxon stay sutures were placed on each side of the... The fascia was divided using electrocautery and the peritoneum was entered. The small bowel was identified. and the Hasson trocar was placed under direct visualization. The trocar was secured to the fascia using the stay sutures.

1. Find what happened between the first incision and the second incision and write the result in a file named critical.xml: translate:xml2pro(’report1.xml’,Rep), deep(,Rep), newdoc(critical_sequence,[],Critical,FL), pro2xml(FL,’critical.xml’).

The result is: The fascia was identified and #2 0 Maxon stay sutures were placed on each side of the midline.

2. In the Procedure section what Instruments were used in the second Incision? instruments:xml2pro(’report1.xml’,Rep), deep(section(section_title(’Procedure’),S),Rep), deepp(incision(_,instrument(I),_),S,2), write(’’),write(I),write(’’).

Result: electrocautery

Example 38 Given an XML document for a book validated by the following DTD ( taken from the XQuery use cases available at [21]):
book (title, author+, section+)> title (#PCDATA)> author (#PCDATA)> section (title, (p | figure | section)* )> section id ID #IMPLIED difficulty CDATA #IMPLIED>

width height
CDATA CDATA

#REQUIRED #REQUIRED >

CDATA

#REQUIRED >

The following program computes how many sections and figures are in the book: count:xml2pro(’book.xml’,Book), deepc(section(_),Book,SC), deepc(figure(_),Book,FC), write(’’),write(SC),write(’’),nl, write(’’),write(FC),write(’’).

4

Sequence Variables and Flexible Arity Functions

In [5] we extended the domain of discourse of Prolog (trees over uninterpreted functors) with finite sequences of trees. Definition 41 A sequence t˜, is defined as follows: is the empty sequence and t1 , t˜ is a sequence if t1 is a term and t˜ is a sequence. We now proceed with the syntactic formalization, by extending the standard notion of Prolog term with flexible arity function symbols and sequence variables. Consider an alphabet consisting of the following sets: the set of standard variables, the set of sequence variables, the set of constants, the set of fixed arity function symbols and the set of flexible arity function symbols. Definition 42 The set of terms over the previous alphabet is the smallest set that satisfies the following conditions: 1. Constants, standard variables and sequence variables are terms. 2. If f is a flexible arity function symbol and t1 , . . . , tn (n ≥ 0) are terms, then f (t1 , . . . , tn ) is a term. 3. If f is a fixed arity function symbol with arity n, n ≥ 0 and t1 , . . . , tn are terms such that for all 1 ≤ i ≤ n, ti is not a sequence variable, then f (t1 , . . . , tn ) is a term. 4.1

Sequences

We use a special kind of terms, here called sequence terms, for implementing sequences. Definition 43 A sequence term, t¯ is defined as follows: – is a sequence term that represents the empty sequence. – seq(t, s¯) is a sequence term if t is a term and s¯ is a sequence term. Definition 44 A sequence term in normal form is defined as:

– is in normal form – seq(t1 , t2 ) is in normal form if t1 is not of the form seq(t3 , t4 ) and t2 is in normal form. Example 41 Given the function symbol f , the variable X and the constants a and b: seq(f (seq(a, )), seq(b, seq(X, ))) is a sequence term in normal form. Sequence terms in normal form are our internal notation for sequences. For example, seq(a, seq(b, )) represent sequence a,b. In our implementation, a term f (t1 , t2 , . . . , tn ), where f has flexible arity, is internally represented as f (seq(t1 , seq(t2 , . . . , seq(tn , ), . . .)), that is, arguments of functions of flexible arity are always represented as elements of a sequence term in normal form.

5

Types

In this section we present the type language starting with a brief description of Regular Types and then their extension to type sequences of terms. 5.1

Regular Types

The next definitions and examples introduce briefly the notion of Regular Types along the lines presented in [8]. Definition 51 Assuming an infinite set of type symbols, a type term is defined as follows: 1. A constant symbol ( we use a, b, c, etc.) is a type term. 2. A type symbol ( we use α, β, etc.) is a type term. 3. If f is an n-ary function symbol and each τi is a type term, f (τ1 , ..., τn ) is a type term. Definition 52 A type rule is an expression of the form α → Υ where α is a type symbol and Υ is a finite set of type terms. Example 51 Let α and β be type symbols, α → {a, b} and β → {nil, tree(β, α, β)} are type rules. Definition 53 A type symbol α is defined by a set of type rules T if there exists a type rule α → Υ ∈ T . We make some assumptions: 1. The constant symbols are partitioned in non-empty subsets, called base types. Some examples are, string, int, and float. 2. The existence of µ, the universal type, and φ representing the empty type. 3. Each type symbol occurring in a set of type rules T is either µ, φ, a base type symbol, or a type symbol defined in T , and each type symbol defined in T has exactly one defining rule in T . Regular types are the class of types that can be defined by finite sets of type rules. In XCentric a type rule α → {τ1 , . . . , τn } is represented by the declaration, :-type α - - -> τ1 ; . . . ; τn .

5.2

Sequence types

We now use the functor seq defined in section 4.1 to extend the previous type language to type sequences of terms: a sequence type is or seq(τ, τ¯) where τ is a type and τ¯ is a sequence type. Example 52 Given the type symbol αseq , functor f , constant c and type rule αseq → {, seq(f (seq(c, )), αseq )}, αseq describes sequences of terms of the form f (c) such as f (c), f (c), f (c), . . . (as we shall see later, αseq corresponds exactly to the XML type f (c)∗). We add sequence types to our type language. Thus, from now on, when we refer to a type we mean a regular type or a sequence type. 5.3

Types as programs

It is well known that regular types can be associated with unary logic programs (see [25, 24, 9]). For every type symbol α, there is a predicate definition α, such that α(c) is true if and only if c is a term with type α (note that we are using the type symbol as the predicate symbol). The universal type µ is defined by the predicate µ(x) ←. For every other type symbol, α, where α → Υ and τ ∈ Υ , the program has a clause, α(t) ← β(x1 ) ∧ . . . ∧ β(xn ) where t is τ with each type symbol βi replaced by a new variable xi . Example 53 Let αlist be a type symbol with type rule, αlist → {nil, .(a, αlist )}. αlist can be associated with the program: αlist (nil) ← αlist (.(a, X)) ← αlist (X) Let α1 , . . . , αn be type symbols defined by the rules in T . ΦT denotes the program that defines µ (universal type) and α1 , . . . , αn . Example 54 Let T = {α → {, a}, β → {seq(α, γ)}, γ → {, seq(d, γ)}} then, ΦT is: α() ← α(a) ← β(seq(X1 , X2 )) ← α(X1 ), γ(X2 ) γ() ← γ(seq(d, X2 )) ← γ(X2 ) The type associated with a type symbol α in a set of type rules T is the set of terms occurring as arguments to the unary predicate α in the minimal model MΦT .

Definition 54 Let T be a set of type rules. The type associated with the pure type term τ with respect to T is given by the recursive definition:  if τ is constant symbol c  {c} if τ is type symbol α [τ ]T = {t|α(t) ∈ MΦT }  {f (t1 , . . . , tn )|ti ∈ [τi ]T , 1 ≤ i ≤ n} if τ is pure type term f (τ1 , . . . , τn ) Types that can be described by [τ ]T , where T is a set of type rules are called regular types. Informally, [τ ]T is the set of terms that can be derived from τ by repeated application of rules in T . Example 55 Let T be the set of type rules {α → {, seq(a, α)}}, then [α]T = {, seq(a, ), seq(a, seq(a, )), . . .} 5.4

Regular Expression Types

We now define regular expression types, which describe sequences of values, a* (sequence of zero or more a’s), a+ (sequence of one or more a’s), a? (zero or one a), a|b (a or b) and a,b (a followed by b). We translate regular expression types to our internal sequence notation: a∗ ⇒ α∗ → {, seq(a, α∗ )} a+ ⇒ α+ → {a, seq(a, α+ )} a? ⇒ α? → {, seq(a, )} a|b ⇒ α| → {a, b} a, b ⇒ αseq → {seq(a, seq(b, ))} Note that DTDs (Document Type Definition) [20] can be trivially translated to regular expression types. Example 56 The DTD,
a b c d e l

(#PCDATA)> (c+)> (#PCDATA)> (#PCDATA)> (#PCDATA)> (a*, b, e?, d)>

corresponds to the regular expression type: α → l(a(string)∗, b(c(string)+), e(string)?, d(string)) We add types to our programming language defining the notions of type declaration and typed unification. XCentric also has some XML Schema [18] support: – Basic types: string, integer, float and boolean. – Occurrences of sequences: The programmer can declare the minimum and maximum number of occurrences of a sequence. The example presented on the introduction, for the type of a sequence of two or more authors, can be written as:

type type_a ---> author(string){2,unbounded}.

Type type a represents every sequence of two or more authors. – Orderless sequences: The programmer can declare a sequence of elements which can appear in any given order. Consider the following type: type mix ---> rec({name(string) & address(string) & email(string)?}).

Type mix represents a record with three elements where their order doesn’t matter.

6

Types in the unification process

In this section we explain the role of types in the unification process. Definition 61 A type declaration for a term t with respect to a set of type rules T is a pair t :: α where α is a type symbol defined in T . Note that in any term t, it is only necessary to declare types for variables since this is enough to reconstruct the type for t. When a term t is typed by µ (the universal type) we will omit the type declaration. We now define the notion of typed unification. Definition 62 If t1 and t2 are terms and α1 and α2 are types, then t1 :: α1 = ∗ = t2 :: α2 denotes unification of typed terms with flexible arity symbols. Equality is the only relation between trees. Equality between trees is defined in the standard way: two trees are equal if and only if their root functor are the same and their corresponding subtrees, if any, are equal. An equation t1 :: α1 = ∗ = t2 :: α2 is solvable if and only if there is an assignment of sequences or ground terms, respectively, to variables therein such that the equation evaluates to true, i.e. such that after that assignment the terms become equal and [α1 ]T ∩ [α2 ]T 6= ∅. Example 61 Consider the equation a(X, b, Y ) :: αa = ∗ = a(a, b, b, b) :: µ, where αa is defined by the type rules: αa −→ {a(αx , b, αy )} αx −→ {µ} αy −→ {b, (b, αy )} then this unification gives only two results: 1. X = a and Y = b, b 2. X = a, b and Y = b Implementation: An equation of the form t1 :: α1 = ∗ = t2 :: α2 is translated to the following query: ? - t1 = ∗ = t2 , α1 (t1 ), α2 (t2 ).

and the respective predicate definitions for α1 and α2 (as described in section 53). Correctness of t1 :: α1 (t1 ) = ∗ = t2 :: α2 (t2 ) comes for free from the correctness of the untyped version of = ∗ = (presented in [5]) and noticing that if α1 (t1 ), α2 (t2 ) succeeds then [α1 ]T ∩ [α2 ]T 6= ∅. Example 62 Consider the equation: a(X, b, Y :: αy ) = ∗ = a(a, b, b, b), where αy −→ {b, (b, αy )}. The query has the following answers: X = a, Y =< b, b >; X =< a, b >, Y = b; The equation is internally translated to following query: seq(a(seq(X, seq(b, seq(Y, )))), ) = ∗ = seq(a(seq(a, seq(b, seq(b, seq(b, ))))), ), αy (Y ) and the type declaration to the predicate definition: αy (seq(b, )) ← αy (seq(b, s)) ← αy (s) The results of running the previous program will be: X = seq(a, ) and Y = seq(b, seq(b, )), αy (seq(b, seq(b, ))). Succeeds X = seq(a, seq(b, )) and Y = seq(b, ), αy (seq(b, )). Succeeds X = seq(a, seq(b, seq(b, ))) and Y = , αy (). Fails

7

Searching term in depth

XCentric has several built-in predicates useful for XML processing. Here we sketch the implementation of the deep predicates presented before. The deep predicate is only used to query a ground XML document, thus it uses pattern matching, not unification to match subterms. The pattern matching operator =˜= is also a built-in operator in XCentric, where in t =˜= s, t is a term and s is a ground term. The deep predicate is itself implemented in XCentric (note the use of the notation < t1 ,. . . , tn > for a sequence t1 ,. . . , tn . Definition 71 Given the sequences s1 and s¯, where s¯ has no variables: deep(S1 , s¯)

: − deep1 (< S1 , X >, s¯).

deep1 (S1 , s¯) : − S1 =˜= s¯. deep1 (S1 , < f (S¯2 ), s¯ >) : − deep1 (S1 , S¯2 )∨ deep1 (S1 , s¯). deep1 (S1 , < S2 , s¯ >) : − deep1 (S1 , s¯). The other built-ins that search a given occurrence of a sequence of terms, deepc or count the number of occurrences of a sequence of terms, deepp, are simply a generalizations of the previous definition using counters.

8

Conclusion and ongoing work

XCentric is an extension of Prolog with a richer form of unification and regular types, designed specifically for XML processing in logic programming. It is important to notice that this is obtained by using a very small core of well studied key features such as unification of terms with flexible arity [5, 12] and regular types for logic programming [25, 9]. It was already successful used in practice for website verification [6]. As for the language, we have just finished implementing the type module and we are presently testing some XML Schema validation and improving some implementation issues, such as direct compilation of key features to low-level code (WAM or C). Besides this, future plans include the study of type inference for XCentric. Previous work on type inference for Prolog in the context of XML processing [4] gives us a starting point for this subject. Acknowledgements We thank Temur Kutsia and Nelma Moreira for all the help provided. The work presented in this paper has been partially supported by funds granted to LIACC through the Programa de Financiamento Plurianual, Funda¸c˜ ao para a Ciência e Tecnologia and Programa POSI.

References 1. Véronique Benzaken, Giuseppe Castagna, and Alain Frisch. CDuce: an XMLcentric general-purpose language. In Proceedings of the eighth ACM SIGPLAN International Conference on Functional Programming, Uppsala, Sweden, 2003. 2. H. Boley. Relationships between logic programming and XML. In Proc. 14th Workshop Logische Programmierung, 2000. 3. F. Bry and S. Schaffert. The XML Query Language Xcerpt: Design Principles, Examples, and Semantics. In 2nd Annual International Workshop Web and Databases, volume 2593 of LNCS, 2002. 4. Jorge Coelho and M´ ario Florido. Type-based XML Processing in Logic Programming. In Practical Aspects of Declarative Languages, volume 2562 of LNCS, New Orleans, USA, 2003. 5. Jorge Coelho and M´ ario Florido. CLP(Flex): Constraint Logic Programming Applied to XML Processing. In Ontologies, Databases and Applications of SEmantics (ODBASE), volume 3291 of LNCS. Springer Verlag, 2004. 6. Jorge Coelho and M´ ario Florido. VeriFLog: Constraint Logic Programming Applied to Verification of Website Content. In Int. Workshop XML Research and Applications (XRA’06), volume 3842 of LNCS, Harbin China, 2006. (to appear). 7. A. Colmerauer. Logic Programming, chapter Prolog and Infinite Trees. Academic Press, 1982. 8. P. Dart and J. Zobel. A regular type language for logic programs. In Frank Pfenning, editor, Types in Logic Programming. The MIT Press, 1992. 9. M. Florido and L. Damas. Types as theories. In Proc. of post-conference workshop on Proofs and Types, Joint International Conference and Symposium on Logic Programming, 1992.

10. Haruo Hosoya and Benjamin Pierce. XDuce: A typed XML processing language. In Third International Workshop on the Web and Databases (WebDB2000), volume 1997 of LNCS, 2000. 11. Oleg Kiselyov and Shriram Krishnamurthi. SXSLT: Manipulation Language for XML. In Practical Aspects of Declarative Languages, volume 2562 of LNCS, New Orleans, USA, 2003. 12. T. Kutsia. Unification with sequence variables and flexible arity symbols and its extension with pattern-terms. In Joint AICS’2002 - Calculemus’2002 conference, LNAI, 2002. 13. Temur Kutsia. Context sequence matching for xml. In Proceedings of the 1th International Workshop on Automated Specification and Verification of Web Sites, Valencia, Spain, 2005. 14. J. W. Lloyd. Foundations of Logic Programming. Springer-Verlag, second edition, 1987. 15. Erik Meijer and Mark Shields. XMλ: A functional language for constructing and manipulating XML documents. (Draft), 1999. 16. Pillow: Programming in (Constraint) Logic Languages on the Web. http://clip.dia.fi.upm.es/Software/pillow/pillow.html, 2001. 17. SWI Prolog. http://www.swi-prolog.org/. 18. XML Schema. http://www.w3.org/XML/Schema/, 2000. 19. Artur Wilk and Wlodzimierz Drabent. On types for xml query language xcerpt. In Principles and Practice of Semantic Web Reasoning, volume 2901 of LNCS, 2003. 20. Extensible Markup Language (XML). http://www.w3.org/XML/, 2003. 21. XQuery Use Cases. http://www.w3.org/TR/xquery-use-cases/, 2005. 22. XSL Transformations (XSLT). http://www.w3.org/TR/xslt/, 1999. 23. Xtatic. http://www.cis.upenn.edu/˜bcpierce/xtatic/, 2004. 24. E. Yardeni and E. Shapiro. A type system for logic programs. In The Journal of Logic Programming, 1990. 25. Justin Zobel. Derivation of polymorphic types for prolog programs. In Proc. of the 1987 International Conference on Logic Programming. MIT Press, 1987.

XCentric: A logic-programming language for XML ... - CiteSeerX

XCentric: A logic-programming language for XML ... - CiteSeerX

Suggest Documents

biXid: A Bidirectional Transformation Language for XML - CiteSeerX

A Generic XML Language for Characterising Objects to ... - CiteSeerX

A Visual Language for XML - Semantic Scholar

DSD: A Schema Language for XML - brics

XML-QL: A Query Language for XML Table of Contents - CiteSeerX

XML Transformation Language Based on Monadic ... - CiteSeerX

Compatibility of XML Language Versions - CiteSeerX

XML: eXtensible Markup Language

WSIPL: An XML Scripting Language for Integrating Web ... - CiteSeerX

Medical Markup Language (MML) for XML-based Hospital ... - CiteSeerX

WSIPL: An XML Scripting Language for Integrating Web ... - CiteSeerX

An XML description language for web-based network ... - CiteSeerX

Experimenting with the Circus Language for XML ... - CiteSeerX

DMML: An XML Language for Interacting with Multi-modal ... - CiteSeerX

XDuce: A Statically Typed XML Processing Language - CiteSeerX

Logics For XML - CiteSeerX

XML for Visualization - CiteSeerX

Xing: A Visual XML Query Language - CiteSeer

SKAML An XML Markup Language for Abstract

WSIPL: An XML Scripting Language for Integrating

Language Models for XML Element Retrieval

CXQuery: A Novel XML Query Language

Towards a Semantics for XML Markup - CiteSeerX

A XML tagged Spanish Learner Oral Corpus for Learner Language ...