XML Information Retrieval Using SQL2XQuery

Francisco Javier Cartujano Escobar, Enrique David Espinosa
Departamento de Computación, Tecnológico de Monterrey, Campus Ciudad de México, México D.F., México
e-mail: [email protected] [email protected]

Rafael Lozano
Departamento de Computación, Tecnológico de Monterrey, Campus Ciudad de México, México D.F., México
Email: [email protected]

ABSTRACT
XML (eXtensible Markup Language) has recently emerged as the universal format for publishing and interchanging data on the World Wide Web. As a result, data sources, including relational databases, face a new requirement: users and applications want to deal directly with XML data instead of being forced to deal with the formats and access protocols of a particular data source. This paper presents a scheme in which data stored in a relational DBMS is exposed to the Web through an XML view. These XML views are queried by the user via SQL but processed by an XQuery engine (XQuery is a language for querying XML documents). To make this possible, a tool is provided that transforms a SQL expression into an equivalent XQuery expression. The result is returned in XML format. This article focuses mostly on the analysis of this transformation and presents a general algorithm for it.

Keywords
Information retrieval, Internet-based technologies, knowledge reuse, interchange and interoperability.

1. INTRODUCTION
XML has recently emerged as a very promising technology for data interchange on the Internet. The basic idea of XML is simple: instead of tags that define data format (as in HTML), tags that define the meaning of data are established. In addition, relationships among data are expressed through nesting levels or references, which yields a tree or a network of nodes that makes up the XML document. The idea behind XML becomes clearer if the markup tags are thought of as database fields [1]. In other words, an XML document stores information as a database does, but with the benefit of structuring the information as needed. This notion of an XML document provides a universal format for almost every type of data, and it is being rapidly adopted as a substitute for proprietary representations in several applications [2].

In recent years, different languages for querying XML have been proposed, for instance XPath [3], XSLT [4], XQL [5], XML-QL [6], Lorel [7], Quilt [8], XML-GL [9] and XQuery [10]. Some of them have more querying capacity than others, but all of them aim to provide an access mechanism to XML repositories. The W3C XML Query Working Group has established a series of requirements that a language must fulfill in order to query XML documents; these requirements are specified in the "XML Query 1.0 Requirements" document [11]. The XQuery 1.0 language [10] is the current W3C working draft that tries to fulfill all these requirements. On the other hand, SQL is well known as the standard query language for relational databases.

This research proposes a Web architecture in which data repositories are established in XML format so as to represent a relational data view. Web users retrieve information from those XML views via SQL, and the views are processed by an XQuery engine. This requires a mechanism that transforms SQL expressions into XQuery expressions, so that the resulting XQuery expression is executed over the XML data repository.

This article focuses mostly on the transformation process and includes four additional sections. Section 2 describes the background of this project. Section 3 describes the transformation of SQL expressions into XQuery expressions. Section 4 presents our conclusions and, finally, Section 5 outlines possible extensions to this research project.

2. BACKGROUND
This research project has been previously documented and justified in [12]. What follows is an introduction to the problem statement.

2.1 Problem Description
Consider an enterprise that wishes to expose on the Web the data it stores in a relational DBMS. Two alternatives exist for this enterprise:

1) To let clients access the relational tables through HTML forms that trigger script execution on the server side, or through mediator software (such as JDBC) that establishes a bridge between the client application and the DBMS. In both cases a SQL query is sent to the relational DBMS for execution. The result of the query is sent back to the client's browser either in HTML format, to be shown directly, or as a series of tuples (a record set) that the client application manipulates in a proprietary way. This scheme is limited because the user is forced to deal with the proprietary formats and access protocols of a particular data source. As a consequence, the relational data lacks direct accessibility, portability, operability and scalability.

2) To create materialized XML views of the relational data and to let users query them through a query language for XML. The result of the query is returned in XML format. This alternative directly solves all the limitations of the first option, since information in XML format is easier to access, to transport and to scale on the Web. Its main limitation is that users who are used to dealing with relational data via SQL must now query it by means of some query language for XML documents, such as XQuery. The problem for the user is to learn and to gain experience with XQuery.

Our proposal to solve this problem is to combine the best of both alternatives: to publish relational data on the Web by means of XML views, to query these XML views through SQL and to return the result in XML format. Because SQL queries cannot be executed directly over XML data sources, we introduce a bridge that allows such queries to be executed over XML.

2.2 Proposed Solution
This research work proposes an architecture in which Web users retrieve information from XML data views that materialize data stored in a relational DBMS. These XML views are queried by the user through SQL but processed by an XQuery engine. The result is returned in XML format, so that it can be easily transported, manipulated and integrated inside the Web environment.

Given the problem described in Section 2.1, we need a mechanism that transforms a given SQL query into an equivalent XQuery expression, which is the one executed by the XQuery engine over the XML data repository. The program that transforms SQL into XQuery is called SQL2XQuery. This program is innovative and constitutes the main product of our research. The proposed architecture is shown in Figure 1.

2.3 Justification
This scheme is reasonable for the following reasons:
• Because the relational model and SQL are currently the best known and most widely mastered data model and query language respectively, it is convenient and practical for users to work as if they were querying relational data even though the materialized data view is an XML document. In this way the client's knowledge of a language is reused, and the client does not have to learn something new, which in some cases would take several weeks or months to master to a level of experience similar to the one already held in SQL.
• On the Web, an XML data view is much more accessible than data stored in a relational DBMS. Likewise, it is easier to integrate the query result into an XML document than into a relational structure (such as a record set) [13][14].
• XML views can integrate data from different operational sources in federated database environments; in many cases, querying such views is easier and faster than querying the original data in the operational sources [15].

2.4 Representation of relational data in XML format

The XML views in the proposed scheme are structured documents that map directly onto tables of the relational model. They are based on the specifications in the document "XML Representation of a Relational Database" [16]. Figure 2 shows an example of a relational database schema and its equivalent schema in XML.

[Diagram: the Application sends a SQL expression to Sql2XQuery (transformation from SQL to XQuery), which passes an XQuery expression to the XQuery engine; the engine accesses the XML data repository on the Web, and the result is returned to the Application as an XML document.]

Figure 1. Access architecture to XML repositories through a SQL-XQuery
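The mapping from relational tables to an XML view described in Section 2.4 can be illustrated with a short Python sketch. It is not the authors' tool. It assumes the nesting suggested by the paths used in the queries of Section 3 (a table_X element containing tuple_X elements, each holding one child element per column). The root element name "db", the helper name relational_to_xml and the sample rows are hypothetical.

```python
import xml.etree.ElementTree as ET

def relational_to_xml(tables, root_name="db"):
    """Build an XML view: one table_<name> element per table, one
    tuple_<name> element per row, one child element per column."""
    root = ET.Element(root_name)  # root element name is an assumption
    for table_name, rows in tables.items():
        table_el = ET.SubElement(root, f"table_{table_name}")
        for row in rows:
            tuple_el = ET.SubElement(table_el, f"tuple_{table_name}")
            for column, value in row.items():
                ET.SubElement(tuple_el, column).text = str(value)
    return root

# Sample rows are invented for illustration only.
sample = {
    "s": [{"id_s": "s1", "name": "Smith", "status": 20, "city": "London"}],
    "sp": [{"id_s": "s1", "id_p": "p2", "quantity": 300}],
}
view = relational_to_xml(sample)
xml_text = ET.tostring(view, encoding="unicode")
```

With this layout, a path such as table_s/tuple_s/city addresses the city column of each supplier row, which is the style of path the queries in Section 3 rely on.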

Table_S (suppliers): id_s, name, status, city
Table_P (parts): id_p, name, color, weight
Table_SP (shipments): id_s, id_p, quantity

a) Relational schema of a suppliers, parts and shipments database.

b) XML schema equivalent to the relational schema in a) [XML listing not reproduced]

Figure 2. Equivalent schemas of a relational model and an XML model.

2.5 XQuery

At the time of writing, the XQuery specification [10] is a working draft under revision by the W3C. It is very likely that XQuery will be released as a standard of this organization, a view shared by several researchers [17] [18] and by software companies. For instance, Microsoft plans to implement XQuery in products such as SQL Server [19], and IBM considers that the XQuery specification could produce a language used much more widely than SQL [20].

3 TRANSFORMING SQL EXPRESSIONS TO XQUERY
This section analyzes how to transform queries from SQL to XQuery. The development strategy consists of (1) explaining the basic transformation scheme from SQL to XQuery, (2) giving examples of equivalent SQL and XQuery queries, each showing particular aspects of the equivalence, and (3) specifying a general algorithm to perform the transformation. All the examples refer to the data models shown in Figure 2.

3.1 Basic Scheme that transforms from SQL to XQuery
As a starting point for this analysis, the SELECT-FROM-WHERE clauses of SQL are transformed respectively into the RETURN-FOR-WHERE clauses of a FLWR expression in XQuery. Owing to this correspondence, every SQL query is transformed into a FLWR expression with the following basic structure.

SQL
SELECT attributes
FROM table
WHERE condition

XQuery
{FOR $t IN document("db.xml")/table/tuple
 WHERE condition
 RETURN {$t/attributes} }

In addition, to be consistent with the representation of relational data in XML format, the XQuery query result has a nested structure: an enclosing element contains one element per returned tuple, and each of these contains one <attribute_name> element per selected attribute.

The pattern established for the XQuery query specifies a FOR clause that retrieves each of the tuples of the queried table (the table named in the SQL FROM clause). In each iteration of the FOR, one tuple is retrieved and bound to the variable $t. The WHERE clause verifies that the attributes of that tuple (bound to $t) satisfy the given condition. The RETURN clause specifies an expression that builds an element nesting the selected attributes of the variable $t (those named in the SQL SELECT clause). Every tuple generated by the RETURN clause is part of the query result and is nested inside the enclosing result element. We now continue the analysis with a series of examples.
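The FOR-WHERE-RETURN pattern just described can be mimicked outside XQuery. The following Python sketch is only an analogy built on the standard xml.etree.ElementTree module, not an XQuery engine. The helper name flwr, the wrapper element names "result" and "tuple", the root element "db" and the sample data are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample view; element names follow the paths used in the queries.
DB_XML = """
<db>
  <table_s>
    <tuple_s><id_s>s1</id_s><status>30</status><city>Paris</city></tuple_s>
    <tuple_s><id_s>s2</id_s><status>10</status><city>Paris</city></tuple_s>
    <tuple_s><id_s>s3</id_s><status>40</status><city>London</city></tuple_s>
  </table_s>
</db>
"""

def flwr(root, path, condition, attributes):
    """Mimic FOR $t IN path WHERE condition RETURN {$t/attributes}."""
    result = ET.Element("result")                  # hypothetical wrapper name
    for t in root.findall(path):                   # FOR: bind each tuple to t
        if condition(t):                           # WHERE: test the bound tuple
            out = ET.SubElement(result, "tuple")   # RETURN: build one element
            for name in attributes:
                ET.SubElement(out, name).text = t.findtext(name)
    return result

root = ET.fromstring(DB_XML.strip())
res = flwr(
    root,
    "table_s/tuple_s",
    lambda t: t.findtext("city") == "Paris" and int(t.findtext("status")) > 20,
    ["id_s", "status"],
)
ids = [e.findtext("id_s") for e in res]
```

Each loop iteration plays the role of one binding of $t, and only tuples that pass the condition contribute an element to the result.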

3.2 Examples
Example 1: Obtain the id_s and the status of suppliers who live in Paris and whose status is greater than 20. The result is sorted in descending order by status.
---SQL Expression---
SELECT id_s, status
FROM s
WHERE (city='Paris') AND (status>20)
ORDER BY status DESC
---XQuery Expression---
{FOR $s IN document("db.xml")/table_s/tuple_s
 WHERE ($s/city='Paris') AND ($s/status>20)
 RETURN {$s/id_s} {$s/status}
 SORT BY(status DESCENDING)}

The XQuery expression returns a group of elements, each of which nests the id_s element and the status element with their respective values. At the end, the SORT BY ... DESCENDING clause of XQuery is specified so as to order the result by status.

Example 2: Get the names of suppliers that supply part 'p2'.
---SQL Expression---
SELECT DISTINCT name
FROM sp, s
WHERE (id_p='p2') AND (s.id_s=sp.id_s)
---XQuery Expression---
{distinct(FOR $s IN document("db.xml")/table_s/tuple_s
          FOR $sp IN document("db.xml")/table_sp/tuple_sp
          WHERE ($s/id_s=$sp/id_s) AND ($sp/id_p='p2')
          RETURN {$s/name} )}

This query implements a join between two tables. The join operator relates tables in the relational data model. In XQuery, one FOR clause is established for each table that participates in the join, which produces a cartesian product of the tables involved. In this example, the nested FORs combine every tuple_s with every tuple_sp. From this cartesian product, only the combinations that have the same value in the joining fields ($s/id_s=$sp/id_s) and that satisfy the condition ($sp/id_p='p2') are selected. In addition, the distinct function is used to eliminate duplicate tuples from the result; this function is applied to the result returned by the FLWR expression.

Example 3: Obtain the names of suppliers that supply at least one red part.

---SQL Expression---
SELECT name
FROM s
WHERE id_s IN (SELECT DISTINCT id_s
               FROM sp
               WHERE id_p IN (SELECT id_p
                              FROM p
                              WHERE color='red'))
---XQuery Expression---
{LET $table := (distinct(
     LET $table := (FOR $p IN document("db.xml")//tuple_p
                    WHERE $p/color='red'
                    RETURN {$p/id_p})
     FOR $sp IN document("db.xml")//tuple_sp
     WHERE $sp/id_p = $table
     RETURN {$sp/id_s} ))
 FOR $s IN document("db.xml")//tuple_s
 WHERE $s/id_s = $table
 RETURN {$s/name}}

The requested information is obtained from the nested queries. In the XQuery query, the innermost expression retrieves the id_p values of the red parts. The result of this query is assigned to the $table variable by means of the inner LET. This assignment is used by the middle query to retrieve the id_s values of the shipments whose supplied part is equal to some value in $table. All the id_s values retrieved by this query are assigned again to $table by means of the outer LET (the previous content of $table is overwritten). The outermost query is executed in a similar way. Note that in this query the LETs are placed before the FOR clause because the nested queries are not correlated (see the following example for correlated queries).

Example 4: Obtain the names of parts that are supplied by supplier 's2'. Use the EXISTS quantifier.
---SQL Expression---
SELECT name
FROM p
WHERE EXISTS (SELECT *
              FROM sp
              WHERE sp.id_p=p.id_p AND id_s='s2');
---XQuery Expression---
{FOR $p IN document("db.xml")//tuple_p
 LET $table := (FOR $sp IN document("db.xml")//tuple_sp
                WHERE $sp//id_p=$p//id_p AND $sp//id_s='s2'
                RETURN {$sp//*} )
 WHERE not(empty($table))
 RETURN {$p//name} }

The SQL query has two particular features: correlated nested queries and the use of the EXISTS quantifier. Queries are correlated when an inner query refers to a table attribute that is defined in an outer query. In the inner SQL query, this occurs in the WHERE clause. In XQuery, correlated queries are expressed by specifying that the inner query is evaluated for each tuple obtained from the FOR clause and that each of these results is assigned to a variable by means of a LET. The LET is therefore placed inside the FOR (in contrast to non-correlated queries, where the LET is placed outside the FOR). In this example, once the results of the inner query for a particular tuple of the FOR $p clause have been assigned to the $table variable, the WHERE clause verifies that the tuple satisfies the given condition. Here the EMPTY function of XQuery is negated so as to obtain the equivalent of EXISTS in SQL. The EMPTY function returns true when its argument has no elements.

Example 5: Obtain the tuples of suppliers that do not supply 'p2'. Use the ALL operator of SQL.
---SQL Expression---
SELECT *
FROM s
WHERE 'p2' <> ALL (SELECT id_p
                   FROM sp
                   WHERE id_s=s.id_s)
---XQuery Expression---
{FOR $s IN document("db.xml")//tuple_s
 LET $table := (FOR $sp IN document("db.xml")//tuple_sp
                WHERE $sp/id_s = $s/id_s
                RETURN {$sp/id_p} )
 WHERE EVERY $tuple IN $table SATISFIES 'p2'!=$tuple
 RETURN {$s/*} }
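The two quantifier mappings in Examples 4 and 5 (EXISTS as a negated emptiness test, and <> ALL as a universal quantifier) can be checked on small in-memory data. The Python sketch below is only an analogy for the semantics, with the correlated inner query re-evaluated inside the loop exactly where the LET sits. The function names and the sample rows are hypothetical.

```python
# Rows are invented sample data; column names follow Figure 2.
suppliers = [{"id_s": "s1"}, {"id_s": "s2"}, {"id_s": "s3"}]
parts = [{"id_p": "p1", "name": "nut"}, {"id_p": "p2", "name": "bolt"}]
shipments = [
    {"id_s": "s1", "id_p": "p1"},
    {"id_s": "s2", "id_p": "p1"},
    {"id_s": "s2", "id_p": "p2"},
]

def parts_supplied_by(supplier_id):
    """EXISTS: keep a part when the correlated inner result is not empty."""
    names = []
    for p in parts:                                    # FOR $p
        table = [sp for sp in shipments                # LET $table := inner query
                 if sp["id_p"] == p["id_p"] and sp["id_s"] == supplier_id]
        if len(table) > 0:                             # WHERE not(empty($table))
            names.append(p["name"])
    return names

def suppliers_not_supplying(part_id):
    """<> ALL: keep a supplier when every inner value differs from part_id."""
    ids = []
    for s in suppliers:                                # FOR $s
        table = [sp["id_p"] for sp in shipments        # LET $table := inner query
                 if sp["id_s"] == s["id_s"]]
        if all(part_id != value for value in table):   # EVERY ... SATISFIES
            ids.append(s["id_s"])
    return ids

exists_result = parts_supplied_by("s2")
all_result = suppliers_not_supplying("p2")
```

In this sample, supplier s3 has no shipments at all, so the universal test holds vacuously and s3 is kept, which matches the behaviour of <> ALL over an empty subquery result in SQL.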

The ALL operator can be implemented by means of the universal EVERY-SATISFIES quantifier of XQuery. In the example, when every $tuple in $table fulfills the SATISFIES condition, the elements nested in the variable $s are returned as part of the query result.

Example 6: Obtain the total supplied quantity of 'p2' and the total number of suppliers that supply it.
---SQL Expression---
SELECT sum(quantity), count(distinct id_s)
FROM sp
WHERE id_p='p2'
---XQuery Expression---
{LET $table := (FOR $sp IN document("db.xml")//tuple_sp
                WHERE $sp/id_p='p2'
                RETURN {$sp/*})
 RETURN {sum($table/quantity)} {count(distinct($table/id_s))} }

XQuery defines the aggregate functions sum, count, average, max and min, which have the same functionality as their counterparts in SQL. For this type of query, our transformation scheme first builds a table ($table) that is the result of a FLWR expression corresponding to the FROM and WHERE parts of the SQL query. Once the table is built, the aggregate functions can be applied to the desired columns.

Example 7: Obtain the part numbers that are supplied by more than one supplier.

---SQL Expression---
SELECT id_p
FROM sp
GROUP BY id_p
HAVING COUNT(id_p)>1
---XQuery Expression---
{LET $table := (FOR $sp IN document("db.xml")//tuple_sp
                RETURN {$sp/*} )
 FOR $value IN distinct(FOR $tuple IN $table
                        RETURN {$tuple/id_p})
 LET $gp := (FOR $tuple IN $table
             WHERE $value/id_p=$tuple/id_p
             RETURN $tuple)
 WHERE NOT(empty($gp)) AND count($gp/id_p)>1
 RETURN {$value/id_p} }

This is a grouping example. As with queries that use aggregate functions, we first build $table so as to obtain the tuples of the related tables that satisfy the WHERE condition. From $table, we obtain the distinct values of the grouping attribute (simple or compound). Each of these distinct values is assigned in turn to the variable $value, which is used to build the corresponding group. In the example, the group that corresponds to a given value is assigned to the $gp variable. Then it is verified that the group is not empty and, when the SQL HAVING clause is used, the corresponding condition is added. Finally, the selected information is returned for each group.
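The grouping strategy described above (collect the distinct values of the grouping attribute, rebuild one group per value, then apply the HAVING condition) can be sketched in Python. This is an illustration of the idea on invented rows, not the generated XQuery. The helper name group_having is hypothetical.

```python
# Invented sample rows for Table_SP.
shipments = [
    {"id_s": "s1", "id_p": "p1"},
    {"id_s": "s1", "id_p": "p2"},
    {"id_s": "s2", "id_p": "p2"},
    {"id_s": "s3", "id_p": "p3"},
]

def group_having(table, key, having):
    """Return the grouping values whose group satisfies the HAVING test."""
    distinct_values = []
    for row in table:                      # distinct(FOR $tuple IN $table ...)
        if row[key] not in distinct_values:
            distinct_values.append(row[key])
    result = []
    for value in distinct_values:          # FOR $value IN distinct(...)
        gp = [row for row in table if row[key] == value]   # LET $gp := ...
        if gp and having(gp):              # WHERE not(empty($gp)) AND HAVING
            result.append(value)
    return result

multi = group_having(shipments, "id_p", lambda gp: len(gp) > 1)
```

Here only p2 is shipped by more than one supplier, so it is the only value returned. The table is scanned once per distinct value, which mirrors the structure of the XQuery expression rather than an efficient grouping method.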

3.3 General transformation algorithm
This section presents a general algorithm to perform the mapping from SQL to XQuery.

TRANSFORM (query)
1. Initialization
   RETURN = FROM = ""; WHERE = "WHERE"
   GROUPBY = XQUERY = LET[] = ""; i = 0
2. Analyze the SQL SELECT
   RETURN = "RETURN " + the SELECT attributes
3. Analyze the SQL FROM
   FROM = "FOR in document()" + for each table of FROM
4. Analyze the SQL WHERE
   Iterate over each SQL condition
   if condition with nested query {
      LET[++i] = "LET $table_" + i + ":=" + TRANSFORM(nested query)
      WHERE = WHERE + related condition to $table_i }
   else
      WHERE = WHERE + simple condition
5. if aggregate function || grouping is used
      FROM = "LET $table:= " + FROM + WHERE
   else
      FROM = FROM + WHERE
6. Analyze SQL GROUP BY-HAVING
   if GROUP BY is used
      GROUPBY = "FOR $value IN distinct .." + ..$table.. + "LET $gp:= FOR $tuple IN $table.." + "WHERE !empty($gp) " + HAVING condition
7. if correlated queries
      XQUERY = FROM + LETs of correlated queries + GROUPBY + RETURN
   else
      XQUERY = LETs of non-correlated queries + FROM + GROUPBY + RETURN
8. if ORDER BY is used
      XQUERY = XQUERY + "SORT BY" + ordering attributes
9. Return XQUERY
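A small part of this algorithm can be sketched in Python for flat queries only. The sketch covers steps 2, 3, 4 (simple conditions), 8 and 9; nested queries, aggregates and GROUP BY are deliberately left out. It is not the SQL2XQuery implementation. The function name transform, its parameters and the already-parsed input format are assumptions, the conditions are assumed to be rewritten with their $variable prefixes beforehand, and the element constructors around the returned attributes are omitted.

```python
def transform(select, tables, conditions=(), order_by=None, doc="db.xml"):
    """Build an XQuery-style FLWR string for a flat SELECT-FROM-WHERE query.

    select:     list of (table, attribute) pairs from the SQL SELECT
    tables:     table names from the SQL FROM
    conditions: simple conditions, already written against $table variables
    order_by:   optional ordering expression, e.g. "status DESCENDING"
    """
    # Step 3: one FOR clause per table in the SQL FROM list.
    fors = [f'FOR ${t} IN document("{doc}")/table_{t}/tuple_{t}' for t in tables]
    # Step 4 (simple conditions only): join the conditions with AND.
    where = ("WHERE " + " AND ".join(f"({c})" for c in conditions)) if conditions else ""
    # Step 2: the RETURN clause lists the selected attributes.
    ret = "RETURN " + " ".join("{$%s/%s}" % (t, a) for t, a in select)
    clauses = fors + ([where] if where else []) + [ret]
    # Step 8: ORDER BY becomes SORT BY.
    if order_by:
        clauses.append(f"SORT BY({order_by})")
    # Step 9: return the assembled expression.
    return "{" + " ".join(clauses) + "}"

example_1 = transform(
    select=[("s", "id_s"), ("s", "status")],
    tables=["s"],
    conditions=["$s/city='Paris'", "$s/status>20"],
    order_by="status DESCENDING",
)
```

Passing two tables produces two FOR clauses, which corresponds to the cartesian product used for the join in Example 2.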

3.4 Implementation
The SQL to XQuery translator, called SQL2XQuery, has been developed using the following tools:
• JFlex 1.3.2 and CUP 0.10 were used to implement the grammar of the SQL SELECT statement. The transformation is performed while the SQL expression is being parsed.
• A graphical interface implemented in Java lets the user transform the desired query.
• The user can execute both the SQL expression and the generated XQuery expression, so as to verify the equivalence of the two expressions.
• For the execution of the SQL query, JDBC is used to access Oracle Lite 8i. For the execution of the XQuery query, QuiP 1.6 is used.

4. CONCLUSIONS
The contributions of this research work are the following:
• The proposed architecture has practical utility. Its justification is the reuse of the user's experience and knowledge of the relational data model and of SQL as a query language. Users can work with relational data on the Web, with the benefit that the data, being in XML format, can be easily transported, operated on and integrated by applications.
• An algorithm to transform SQL expressions into XQuery expressions has been established. The analysis shows that a SQL query can be transformed into an equivalent XQuery expression.
• The research work is innovative. To the authors' knowledge, no architecture with the mentioned characteristics has been implemented on the Web.
• A pragmatic bridge between SQL and XQuery has been implemented.

5. FUTURE PROJECTS
The results of this research can lay the basis for the following future projects:
• Querying XML documents with a less rigid structure via SQL. One of the restrictions of the proposed architecture is that XML documents must be structured in such a way that they correspond directly to relational tables [16]. One research path would consist of analyzing the smallest possible modifications to the SQL syntax that would allow querying XML documents with a less rigid structure.
• Optimizing the generated query. Because XML query engines are not very fast, it is very important to generate optimized expressions that improve execution time.
• Natural language interfaces for querying XML repositories. The next step is to make queries not via SQL but via a natural language; in other words, to transform natural language into XQuery.
• Visual languages for querying XML repositories. Similarly to the previous idea, a visual specification would be transformed into an XQuery expression.

References
[1] STANDEFER Robert. Enterprise XML: Clearly Explained. Academic Press, San Diego, California, USA, 2001.
[2] IVES Z. G. and Y. Lu. XML Query Languages in Practice: An Evaluation. First International Conference, Web-Age Information Management 2000, June 2000.
[3] CLARK James, DeRose S. XML Path Language (XPath) Version 1.0. W3C Recommendation, November 1999. (http://www.w3.org/TR/xpath)
[4] CLARK James. XSL Transformations (XSLT) Version 1.0. W3C Recommendation, November 1999. (http://www.w3.org/TR/xslt)
[5] ROBIE J. XQL (XML Query Language). XQL'99 Proposal, August 1999. (http://www.ibiblio.org/xql/xql-proposal.html)
[6] DEUTSCH A., M. Fernandez, D. Florescu, A. Levy, D. Suciu. XML-QL: A Query Language for XML. Submission to the World Wide Web Consortium, August 1998.
[7] ABITEBOUL S., D. Quass, J. McHugh, J. Widom, and J. Wiener. The Lorel Query Language for Semistructured Data. International Journal on Digital Libraries, April 1997. (http://www-db.stanford.edu/lore/pubs/index.html)
[8] CHAMBERLIN D., J. Robie and D. Florescu. Quilt: an XML Query Language for Heterogeneous Data Sources. In Lecture Notes in Computer Science, Springer-Verlag, December 2000. (http://www.almaden.ibm.com/cs/people/chamberlin/quilt_lncs.pdf)
[9] CERI S., S. Comai, E. Damiani, P. Fraternali, S. Paraboschi and L. Tanca. XML-GL: a Graphical Language for Querying and Restructuring XML Documents. In 8th International World Wide Web Conference, WWW8, Toronto, Canada, May 1999. (http://www8.org/fullpaper.html)
[10] CHAMBERLIN D., J. Clark, D. Florescu, J. Robie, J. Siméon, M. Stefanescu. XQuery 1.0: An XML Query Language. W3C Working Draft, June 2001. (http://www.w3.org/TR/xquery)
[11] CHAMBERLIN D., P. Fankhauser, M. Marchiori and J. Robie. XML Query Requirements: W3C Working Draft. February 2001. (http://www.w3.org/TR/xmlquery-req)
[12] CARTUJANO J., Espinosa E. and Lozano R. SQL2XQuery. Memories of International Congress on Computer Sciences and Information, CICCI 2002, Durango, México, March 2002.
[13] CONRAD A. A Survey of Microsoft SQL Server 2000 XML Features. Microsoft Corporation, MSDN Library, July 2001. (http://msdn.microsoft.com/library/enus/dnexxml/html/xml07162001.asp)
[14] SCHMELZER Ronald. The Pros and Cons of XML. ZapThink Industry Analysts, September 2001. (http://www.zapthink.com/reports/proscons.html)
[15] ABITEBOUL S., R. Goldman, J. McHugh, V. Vassalos, Y. Zhuge. Views for Semistructured Data. Stanford University, 1997. (http://dbpubs.stanford.edu:8090/aux/index-en.html)
[16] World Wide Web Consortium. XML Representation of a Relational Database. (http://www.w3.org/XML/RDB.html)
[17] BOURRET R. XML and Databases. June 2001. (http://www.rpbourret.com/xml/XMLAndDatabases.htm)
[18] MANOLESCU I., D. Florescu and D. Kossmann. Answering XML Queries Over Heterogeneous Data Sources. Accepted for publication in the VLDB conference, January 2001. (http://www-caravel.inria.fr/~ioana/AGORA/MFK01.pdf)
[19] BURT J. Microsoft Debuts Demo 2 of XML Query Tool. August 2001. (http://techupdate.zdnet.com/techupdate/stories/main/0,14179,2804344,00.html)
[20] BABCOCK C. IBM Experiments With XML. March 2001. (http://www.zdnet.com/zdnn/stories/news/0,4586,2697554,00.html)