Flexible Queries to Semi-structured Datasources: the WG-Log Approach
S. Comai (1), E. Damiani (2), L. Tanca (1)
e-mail: [email protected] [email protected], [email protected]
(1) Politecnico di Milano, Dipartimento di Elettronica e Informazione
(2) Università di Milano, Polo di Crema
Abstract
A line of research is presented aimed at specifying both logical and navigational aspects of semi-structured data sources such as Web sites through the unifying notion of schema. Gracefully supporting schemata that are huge or subject to change, the WG-Log language allows for a uniform representation of queries and views, the latter expressing customized access structures to site information. A survey of related work and some directions for future research involving fuzzy query techniques are also outlined.
1 Introduction and Motivations

Modern network-oriented information systems often have to deal with data that are semi-structured, i.e. lack the strict, regular, and complete structure required by traditional database management systems (see [Abi97] and [Suc97] for surveys on semi-structured data and related research). Information is also semi-structured when the structure of data varies w.r.t. time, rather than w.r.t. space: even if data are fairly well structured, such structure may evolve rapidly. A particularly interesting example of semi-structured information is the data stored on the World Wide Web. Web site data may sometimes be fully unstructured, consisting of loose collections of images, sound and text. At other times WWW data are extracted from traditional relational or object-oriented databases, whose completely specified semantics is mirrored by the site structure. Often, however, WWW information lies somewhere in between these extremes.

Our approach deals with semi-structured data through a schema-based representation, introducing WG-Log, a graph-oriented language supporting the representation of logical as well as navigation and presentation aspects of hypermedia. This data description and manipulation language for the Web gracefully supports schemata that are huge or subject to change. Moreover, it retains the representation of semantics created during the design process, allowing Web users to exploit this semantics by means of a database-like querying facility. WG-Log has its formal basis in the graph-oriented language G-Log [Par95]: in G-Log, directed labeled graphs are used to specify and represent database schemata, instances and queries. While expressing Web site semantics, WG-Log schemata also take into account the semi-structured nature of WWW information, providing graceful tolerance of data dynamics, i.e. an easy mechanism for schema updates resulting from instance evolution over time. This allows for efficient checking of instance correctness w.r.t. a given schema and efficient checking (i.e. at the schema level) of query applicability to a certain instance.

The advantages of this approach are many, the first being the availability of a uniform mechanism for query and view formulation. Uniform graph-based representations of schemata and instances are used for querying, while Web site data remain in their original, semi-structured form. Thus, WG-Log does not require Web site conversion to a fully structured format to allow for database-style querying. Schema availability supports adaptability by allowing users to build queries that are partially specified w.r.t. the schema; the expressive power of WG-Log, which is fully equipped with recursion, is another powerful resource to this effect. Adaptability is also supported by WG-Log's ability to build custom views over semi-structured information. When a view over a site is created, clients can formulate queries that add further information to the result and restructure it substantially. In fact, this mechanism can be exploited to reuse existing sites' content as well as schemata.

In this paper, after a review of related work (Section 2), we introduce WG-Log (Section 3), presenting its operational semantics based on bisimulation [Mil90]. In Section 3.1, some theoretical results are presented that allow WG-Log schemata to evolve gracefully together with the instance. This material is based on our recent work on the subject, namely on [Dam97], [Com98], [Cor98] and [Dam98-2]. In Section 4 WG-Log query support is described, together with the design of a Web Query System based on WG-Log. Finally, Section 5 contains some hints at future research on this subject.
2 Related Work

In this section we provide an overview of related work (see also [Flo98]), organized according to how the various approaches deal with the representation of semantics.

Free text indexing - No representation of semantics
Early approaches to Web indexing tried to collect and index title-like information about every reachable page of data on the WWW and then support Boolean keyword searches over the resulting index. Many current Web search engines like Altavista [Alt98] are partially based on this approach, where search results are flat lists of HTML pages. Some keyword-based indexes, like WebCrawler [Pin97], complement keyword indexing by taking into account the HTML document structure in order to make educated guesses about semantics.

Representation of semantics via taxonomies
Several search engines exploit a taxonomy representing WWW content. Yahoo [Yah98] relies on a hierarchical classification of subjects, not unlike the one used by the Library of Congress. Yahoo's success has spawned similar tools, based on the idea of providing large, monolithic servers holding indexes of site contents (Point [Lyc98-2], Magellan [McK98], and others). In order to exploit taxonomies together with free text searching, meta-search services like SavvySearch [Dre97] use free-text indexes like Altavista as subroutines, querying all available services in parallel and then aggregating the results.

Structural representation of sites
A considerable amount of research has been done on how to complement keyword-based searching with database-style support for querying the Web. Three main WWW query languages have been proposed so far: W3QL [Kon95], WebSQL [Men96] and WebLog [Lak96]. The first two languages are modelled after the standard SQL used for RDBMSs, while the third retains the flavour of the Datalog language. However, these three languages give only a fraction of the power of the original query languages they are based on, since they explicitly refrain from addressing semantics representation issues.
W3QL and WebSQL offer a standard relational representation of Web pages, such as Document(url, title, text, type, length), which can be easily constructed from HTML tagging. The user can present SQL-like queries to Web sites based on that relational representation. Content-related queries (for instance: Document.text = "Italy") are mapped into free-text searches using a conventional search engine. In addition to similar query capabilities, W3QL offers a graph pattern search facility over the navigational structure of Web sites. Finally, WebLog proposes an object-oriented instance representation technique which leads to a deductive query language with recursion. The Dublin Core [DC98] is a fifteen-element metadata set developed to improve resource discovery on the Web. Dublin Core elements describe document-like objects, specifying meta-information such as Title, Subject, Description, Format and others. Site semantics is conveyed by natural language descriptions without any formal representation. A rich Web object model is also described in [Man98], but again it lacks a representation of data semantics.
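As a rough illustration of the relational view described above, the following Python sketch builds Document(url, title, text, type, length) tuples from raw HTML and answers a content condition with a plain substring test. The sample pages, the helper names `to_document` and `_PageParser`, and the substring test standing in for a conventional search engine are all assumptions made for this example; none of them is taken from the query languages discussed here.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class Document:
    """One row of the relational view of a Web page."""
    url: str
    title: str
    text: str
    type: str
    length: int


class _PageParser(HTMLParser):
    """Collects the <title> content and the remaining text of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif data.strip():
            self.chunks.append(data.strip())


def to_document(url, html, mime_type="text/html"):
    """Derive a Document tuple from the HTML tagging of one page."""
    parser = _PageParser()
    parser.feed(html)
    parser.close()
    return Document(url, parser.title.strip(), " ".join(parser.chunks),
                    mime_type, len(html))


# Hypothetical pages, used only to exercise the sketch.
pages = {
    "http://example.org/it": "<html><head><title>Travel</title></head>"
                             "<body><p>Visit Italy in spring.</p></body></html>",
    "http://example.org/fr": "<html><head><title>Food</title></head>"
                             "<body><p>Cheese from France.</p></body></html>",
}
documents = [to_document(url, html) for url, html in pages.items()]

# A content-related condition on Document.text, answered by substring search.
matches = [d.url for d in documents if "Italy" in d.text]
```

The point of the sketch is that such a relation captures only page-level structure: nothing in a Document tuple says what the page is about, which is the limitation noted in the text.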
Instance-based semantics representation
A well-known technique for instance-based representation of semantics is semantic tagging, i.e. the use of HTML or XML tags to denote semantic information. Early proposals were based on the E-R model: semantic tags were used to relate the data stored in a Web page to an entity and to denote relationships as semantic links which could be used for querying purposes. Several variations of this idea have been proposed, leading to the assertion-based model of RDF [Ber98]. RDF has been developed as a part of the W3C metadata activity in order to provide a generic metadata architecture for Web sites that can be expressed in XML. RDF assertions are triples made up of a resource, a propertyType and a value. Though propertyTypes can be thought of as attributes in traditional attribute-value pairs, RDF's choice of using the URI namespace for resources makes the assertion-based model very different from the standard entity-relationship model. In fact, in RDF any syntactically valid assertion can be made about any resource, whereas in E-R models each entity has its own set of attributes. Though the current RDF proposal attaches metadata to single instances, the W3C RDF Schema Working Group is currently developing a Schema Definition Language [RDF98] to define metadata systems based on RDF.

Schema-based semantics representation
With the partial exception of RDF, all the approaches described above lack an explicit notion of schema. This may be due to the fact that, while the advantages of schema-aware query formulation are widely recognized in the database context, it has been considered unfeasible on the WWW because no schema information is associated with Web sites. However, an increasing number of sites are being designed using design methodologies such as HDM [Gar95], RMM [Isa95] and the like. When such a methodology is used to design a Web site, some notion of site schema is present during the site design process.
Indeed, many commercial authoring environments for Web sites hint at the idea of a navigational schema to be chosen by the user as the basis of automatic site generation. Other approaches address the problem of Web querying in the more general framework of dealing with semi-structured data. For instance, the Tsimmis system [GaM95] proposes the OEM object model to represent semi-structured information, together with a powerful query language, Lorel. For each Web site, the user defines OEM classes to be used in its Tsimmis representation. Then a textual filter is applied, initializing objects from the data in the Web pages. Tsimmis' additional DataGuide facility identifies regularities in the extracted instance representation in order to produce a site schema. A representation of semantics based on an E-R schema is used in the STRUDEL approach [Fer97]. STRUDEL provides a single graph data model, not unlike OEM, in which all data sources are uniformly modeled, and a query language for data integration. In the Araneus project [Atz97], Web site crawling is employed to induce schemata of Web pages. These fine-grained page schemata are later combined into a site-wide schema, and a special-purpose language, Ulixes, is used to build relational views over it. The resulting relational views can be queried using standard SQL.
3 An Informal Introduction to WG-Log

In this section we introduce the data model and language of WG-Log, a graph-oriented language supporting representation of both data model and structural entities. Rather than presenting the native semantics of WG-Log, we recall its operational semantics based on bisimulation [Mil90], which brings some interesting properties of WG-Log to light. Formal presentations of this material can be found in previous papers on G-Log [Par95] and on WG-Log [Dam97, Cor98].

In WG-Log, directed labeled graphs are used to specify and represent Web site schemata, instances, views (allowing customized access paths and structures) and queries. The nodes of the graphs stand for objects, while edges indicate logical or navigational relationships between objects. A WG-Log graph contains the following node types: Entities, depicted as rectangles, represent abstract objects possibly linked to each other in different ways; Slots (or concrete nodes), depicted as ellipses, indicate objects with an atomic value such as strings, soundtracks or movie frames; Entry points, depicted as triangles, represent the unique page that gives access to a portion of a site, for instance the site home page; Collections, represented by a rectangle containing a set of horizontal lines, indicate collections or aggregates of nodes.

Three types of arcs are allowed in WG-Log graphs: structural edges, representing navigational links between pages (all navigational edges have the same dummy label); logical labeled edges, denoting relationships between entities; and double labeled edges, denoting a coupled navigational step and logical relationship between entities. Double arcs are not, at least in principle, a shorthand for the presence of both a navigational step and a relationship between two entities, since they carry the additional meaning of coupling, i.e. associating the navigational link with a specific relationship. At the schema level, each kind of edge may have a single or double arrow, the latter indicating that the associated link is multi-valued.

[Figure 1: A sample WG-Log schema. Entry point FRODOS; entities E_RESTAREA, E_RESTAURANT, E_ENTREE and E_OPINION; String slots; edge labels L_restarea, L_restaurant, L_name, L_type, L_numphone, L_numhome, L_numfax, L_price, L_entree, L_rating, L_labelnumber and L_opinion.]

[Figure 2: A sample WG-Log query. Two rules over the nodes rest_list, E_RESTAURANT, E_ENTREE, OPINION and a String slot with value "great", with edges L_entree, L_opinion and L_rating.]

WG-Log graphs are colored: each element of a graph is assigned one color among red solid, green, red dashed and black. Instances and schemata are completely black graphs, while queries contain red solid (RS), red dashed (RD) and green (G) nodes and edges. A WG-Log instance (representing the actual semi-structured information), elsewhere called a concrete WG-Log graph [Cor98], is a black WG-Log graph whose slot nodes are all "instantiated", i.e. carry a value. A WG-Log schema, elsewhere called an abstract WG-Log graph [Cor98], is a black WG-Log graph whose slots have no values. As can be expected, a WG-Log concrete graph is an instance of a WG-Log schema when they show the same structure, i.e. they are bisimilar: two WG-Log graphs are bisimilar iff they contain the same paths. Note that bisimulation is an equivalence relation and, though it is implied by graph isomorphism (i.e. two isomorphic graphs are also bisimilar), the converse does not hold.
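The instance-of check described above can be sketched in Python as a greatest-fixpoint computation of the largest bisimulation between two labeled graphs. This is a minimal sketch under stated assumptions, not the formal definition of [Cor98]: graphs are encoded as a pair (node-to-label dictionary, set of (source, edge label, target) triples), slot values are ignored, two graphs count as bisimilar when every node on each side is related to some node on the other, and the function names are hypothetical. The labels are borrowed from Fig. 1.

```python
def successors(edges, node):
    """Outgoing (edge label, target) pairs of a node."""
    return {(label, dst) for (src, label, dst) in edges if src == node}


def maximal_bisimulation(g1, g2):
    """Largest relation pairing equally labeled nodes with matching moves."""
    (labels1, edges1), (labels2, edges2) = g1, g2
    rel = {(a, b) for a in labels1 for b in labels2
           if labels1[a] == labels2[b]}
    changed = True
    while changed:
        changed = False
        for a, b in list(rel):
            out_a, out_b = successors(edges1, a), successors(edges2, b)
            # Every edge leaving a must be matched by an edge leaving b ...
            forth = all(any(l == m and (x, y) in rel for m, y in out_b)
                        for l, x in out_a)
            # ... and every edge leaving b by an edge leaving a.
            back = all(any(l == m and (x, y) in rel for l, x in out_a)
                       for m, y in out_b)
            if not (forth and back):
                rel.discard((a, b))
                changed = True
    return rel


def bisimilar(g1, g2):
    """True when the maximal bisimulation covers all nodes of both graphs."""
    rel = maximal_bisimulation(g1, g2)
    return ({a for a, _ in rel} == set(g1[0])
            and {b for _, b in rel} == set(g2[0]))


# Schema fragment: a restaurant offers entrees, each carrying a rating.
schema = (
    {"R": "E_RESTAURANT", "E": "E_ENTREE", "S": "String"},
    {("R", "L_entree", "E"), ("E", "L_rating", "S")},
)

# Instance with two restaurants sharing one entree; values are omitted.
instance = (
    {"r1": "E_RESTAURANT", "r2": "E_RESTAURANT",
     "e1": "E_ENTREE", "e2": "E_ENTREE", "s1": "String", "s2": "String"},
    {("r1", "L_entree", "e1"), ("r1", "L_entree", "e2"),
     ("r2", "L_entree", "e2"),
     ("e1", "L_rating", "s1"), ("e2", "L_rating", "s2")},
)

# Same instance plus a restaurant with no entree: structure now differs.
broken = (dict(instance[0], r3="E_RESTAURANT"), set(instance[1]))

is_instance = bisimilar(schema, instance)
is_broken_instance = bisimilar(schema, broken)
```

The instance has six nodes and the schema three, so the two graphs are clearly not isomorphic, yet they are bisimilar; this is the sense in which bisimulation is weaker than isomorphism.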
A sample WG-Log schema is reported in Fig. 1, representing the Frodos restaurant chain. A WG-Log rule is a red-and-green WG-Log graph. A WG-Log program (or query) is a sequence of sets of WG-Log rules. The query of Fig. 2 aims at selecting all the restaurants which have an Entree rated `Great'. Two rules are needed to capture both rating formats: the query node rest_list is linked to the E_RESTAURANT entities of both rules, meaning that a single result list is requested. If instead we wanted to select only one of the possible rating formats, we would use only the rule pertaining to it. Negative requirements can also be introduced in a query, by means of dashed edges and nodes. WG-Log also allows goals, which are simply WG-Log schemata used to reduce the size of the result of a query.

Rule applicability [Cor98] is also a key concept in the definition of WG-Log semantics: in practice, a rule is applicable to an instance (respectively to a schema) when its red solid part is bisimilar to a subgraph of the instance (resp. schema), and thus it can be applied to deduce new information. The form of such new information is dictated by the green part of the rule, which must be bisimilar to the part of the resulting instance containing the subgraph where the rule was applied. A rule is basically a logical implication, and dictates that, wherever a subgraph bisimilar to the red part is found, a subgraph bisimilar to the green part must be added (if not already present).

The semantics of a WG-Log rule $R$ is a binary relation on WG-Log instances containing all the pairs $\langle G, G' \rangle$ such that $R$ is applicable in $G$, $G'$ satisfies $R$, i.e. contains all the "green extensions" of the subgraphs bisimilar to the red part of $R$, and moreover $G'$ is minimal w.r.t. these properties. The semantics of a WG-Log program is obtained by extending the rule semantics in the intuitive way. In the same way the semantics of WG-Log can be applied to any black WG-Log graph: indeed, the pairs $\langle G, G' \rangle$ above might also be pairs of schemata. This defines the abstract semantics of WG-Log, which is currently under study and which gives rise to a number of very interesting properties, immediately applicable to the use we make of WG-Log for modeling semi-structured information; these are presented in the next subsection.
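To make the "red implies green" reading of a rule concrete, here is a deliberately simplified Python sketch. It departs from the semantics above in several labeled ways: the red part is matched by exhaustive label-preserving pattern matching rather than by bisimulation, red dashed (negative) parts are not handled, minimality is not enforced, and each green node is created once and shared by all matches. The dictionary layout of a rule, the function names and the edge label L_member are hypothetical; the other labels come from Fig. 1 and Fig. 2.

```python
from itertools import product


def matches(red_nodes, red_edges, nodes, edges):
    """Yield every mapping of red nodes onto instance nodes that preserves
    labels, requested slot values and red edges."""
    names = list(red_nodes)
    candidates = [
        [n for n, (label, value) in nodes.items()
         if label == red_nodes[p][0] and red_nodes[p][1] in (None, value)]
        for p in names
    ]
    for combo in product(*candidates):
        mapping = dict(zip(names, combo))
        if all((mapping[s], l, mapping[t]) in edges for s, l, t in red_edges):
            yield mapping


def apply_rule(rule, nodes, edges):
    """Return a copy of the instance extended with the green part of the
    rule for every match of its red part."""
    new_nodes, new_edges = dict(nodes), set(edges)
    for mapping in list(matches(rule["red_nodes"], rule["red_edges"],
                                nodes, edges)):
        for name, content in rule["green_nodes"].items():
            new_nodes.setdefault(name, content)  # one shared green node
        for s, l, t in rule["green_edges"]:
            # Red endpoints are replaced by their match, green ones kept.
            new_edges.add((mapping.get(s, s), l, mapping.get(t, t)))
    return new_nodes, new_edges


# Instance: node id -> (label, slot value or None).
nodes = {
    "r1": ("E_RESTAURANT", None), "r2": ("E_RESTAURANT", None),
    "e1": ("E_ENTREE", None), "e2": ("E_ENTREE", None),
    "s1": ("String", "great"), "s2": ("String", "poor"),
}
edges = {
    ("r1", "L_entree", "e1"), ("e1", "L_rating", "s1"),
    ("r2", "L_entree", "e2"), ("e2", "L_rating", "s2"),
}

# One of the two rules of the query: entree rated "great" directly.
rule = {
    "red_nodes": {"R": ("E_RESTAURANT", None), "E": ("E_ENTREE", None),
                  "S": ("String", "great")},
    "red_edges": [("R", "L_entree", "E"), ("E", "L_rating", "S")],
    "green_nodes": {"rest_list": ("rest_list", None)},
    "green_edges": [("rest_list", "L_member", "R")],
}

new_nodes, new_edges = apply_rule(rule, nodes, edges)
```

A second rule going through an opinion node would have the same green part, so applying both in sequence would fill the same rest_list node, which mirrors the single result list requested by the query.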
3.1 The Safe Transfer Theorem
All the results presented in this subsection are proved in [Cor98]. As suggested by intuition, we use WG-Log schemata in order to represent sets of instances having the same structure. The first property we introduce is the Abstraction Theorem: given a set of WG-Log instances, a minimal WG-Log schema representing these instances can always be found; such a schema is unique up to bisimulation. Indeed, the process of schema derivation can be proved to be $O(n \log n)$, where $n$ is the number of nodes in the instance. The significance of the above property becomes apparent when one thinks of automatically deducing WG-Log schemata from WG-Log instances, particularly when these represent Web sites.

Other interesting properties relate to the function, written $\gamma$ here, which assigns to each WG-Log schema the set of its WG-Log instances: $\gamma$ is monotonic, i.e. for any pair $\langle S, S' \rangle$ of schemata, if $S$ is contained in $S'$ we have that $\gamma(S) \subseteq \gamma(S')$; $\gamma$ is injective, i.e. non-bisimilar schemata have distinct sets of instances.

The most important property that we derive from the study of abstract interpretation is however the following safe transfer theorem: let $S, S'$ be two WG-Log schemata, and $R$ be a rule without slots. Then, the following diagram commutes:

$$
\begin{array}{ccc}
G & \xrightarrow{\;R\;} & G' \\
\downarrow & & \downarrow \\
I & \xrightarrow{\;R\;} & I'
\end{array}
$$
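The idea of deriving a minimal schema from an instance can be illustrated with a naive partition-refinement loop: nodes start in a single block and are repeatedly split according to their label and the (edge label, target block) pairs they can reach, until nothing changes; the blocks then become the schema nodes. This Python sketch is an assumption-laden illustration, not the procedure of [Cor98]: it ignores slot values, it does not reach the $O(n \log n)$ bound quoted above (each pass rescans all edges), and the function name and graph encoding are hypothetical.

```python
def derive_schema(labels, edges):
    """Collapse structurally equivalent instance nodes into schema nodes.

    labels: node id -> label; edges: set of (source, edge label, target).
    Returns (block of each node, schema labels, schema edges).
    """
    block = {n: 0 for n in labels}
    while True:
        ids = {}
        refined = {}
        for n in labels:
            signature = (labels[n],
                         frozenset((l, block[t])
                                   for (s, l, t) in edges if s == n))
            refined[n] = ids.setdefault(signature, len(ids))
        # Each pass can only split blocks, so an unchanged count means
        # the partition is stable.
        stable = len(ids) == len(set(block.values()))
        block = refined
        if stable:
            break
    schema_labels = {block[n]: labels[n] for n in labels}
    schema_edges = {(block[s], l, block[t]) for (s, l, t) in edges}
    return block, schema_labels, schema_edges


# Instance: r3 serves an entree that, unlike the others, has no rating.
labels = {
    "r1": "E_RESTAURANT", "r2": "E_RESTAURANT", "r3": "E_RESTAURANT",
    "e1": "E_ENTREE", "e2": "E_ENTREE", "e3": "E_ENTREE",
    "s1": "String", "s2": "String",
}
edges = {
    ("r1", "L_entree", "e1"), ("r2", "L_entree", "e2"),
    ("r3", "L_entree", "e3"),
    ("e1", "L_rating", "s1"), ("e2", "L_rating", "s2"),
}

block, schema_labels, schema_edges = derive_schema(labels, edges)
```

In this example the eight instance nodes collapse into five schema nodes: the unrated entree and the restaurant serving it stay separate from their rated counterparts, which shows how irregular structure in the instance surfaces in the derived schema.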
The above property guarantees the correctness of abstract computations: the application of a rule to an abstract graph safely represents the application of the same rule to any of its instances. The practical impact of this result is quite interesting. First of all, consider a user who issues a query towards a remote site: if the rule is not applicable to the site schema, we can immediately conclude that the same rule is not applicable to any instance of that schema. Therefore, we may try complex queries locally, on site schemata, and then issue them to the target site only once we have checked that they are applicable. Moreover, suppose we use WG-Log rules to specify site instance evolution during the site life. Then, the application of the same rule to the site schema automatically returns the schema corresponding to the new site instance (see footnote 1).

[Figure 3: The WG-Log architecture. Client Area: User, Query (graphical) Interface, Local Query Manager. Schema Search Area (Robot): Schema Robot, Schema Repository, Thesaurus. Remote Site Area: Remote Query Manager, Instance & Schema. Labeled flows: Keywords, Answer Schemata (+ remote site info), Query, partial answer, Answer, Instance presentation issues.]
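The "check locally, then ship" strategy can be sketched as follows in Python. Applicability is approximated here by a label-preserving embedding of the red part into the schema, which is a simplifying assumption in place of the bisimulation-based definition; `send_to_remote_site` is a hypothetical callable standing for the network round trip, and the edge label L_wine is invented to obtain a query the schema cannot satisfy.

```python
from itertools import product


def applicable(red_nodes, red_edges, labels, edges):
    """True if the red part embeds into the graph (labels and edges kept)."""
    names = list(red_nodes)
    candidates = [[n for n in labels if labels[n] == red_nodes[p]]
                  for p in names]
    for combo in product(*candidates):
        mapping = dict(zip(names, combo))
        if all((mapping[s], l, mapping[t]) in edges for s, l, t in red_edges):
            return True
    return False


def query_site(red_part, schema, send_to_remote_site):
    """Ship the query only if its red part is applicable to the schema."""
    red_nodes, red_edges = red_part
    labels, edges = schema
    if not applicable(red_nodes, red_edges, labels, edges):
        return None  # rejected locally, no remote call is made
    return send_to_remote_site(red_part)


# Locally cached schema of the remote site (fragment of Fig. 1).
schema = (
    {"R": "E_RESTAURANT", "E": "E_ENTREE", "S": "String"},
    {("R", "L_entree", "E"), ("E", "L_rating", "S")},
)

remote_calls = []


def send_to_remote_site(red_part):
    """Stand-in for the remote round trip; it only records the call."""
    remote_calls.append(red_part)
    return "partial answer"


good_query = ({"R": "E_RESTAURANT", "E": "E_ENTREE"},
              [("R", "L_entree", "E")])
bad_query = ({"R": "E_RESTAURANT", "S": "String"},
             [("R", "L_wine", "S")])

good_result = query_site(good_query, schema, send_to_remote_site)
bad_result = query_site(bad_query, schema, send_to_remote_site)
```

Only the first query reaches the remote site; the second is discarded against the schema alone, which is the saving the safe transfer theorem justifies.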
4 Query Execution in WG-Log

Our approach to Web querying involves six basic steps: schema identification, schema retrieval, query formulation, instance retrieval, results restructuring, and presentation of results. An outline of our system architecture is shown in Figure 3. The schema identification and retrieval steps involve a dialogue between the client module and a distributed object-oriented server called Schema Robot. Interacting with Schema Robots, users identify Web servers holding the information they need on the basis of their schema, i.e. on the semantics of their content. Besides helping the user in schema identification, clients provide facilities for query formulation, through a graph-like representation of the hypertext structure of Web sites.

Effective support for query execution and instance retrieval is essential in the development of Web-based applications. In our approach, queries specify both the information to be extracted from a site instance, and the integration and results restructuring operations to be executed on it. Queries are delivered to object-oriented servers called Remote Query Managers, running at the target Web sites. Remote Query Managers use an internal lightweight instance representation to execute queries and to return results in the format selected by the user. The partial result computed by the Remote Query Manager is processed at the client site by the Local Query Manager module, which has access to the additional information requested by the user, and produces the complete query result. The Presentation Manager provides facilities for the presentation of results. This mechanism allows clients to require customized presentation styles on the basis, for instance, of constraints on the resources available at the user location.
Footnote 1: Of course, in this context we are interested in those site updates that would affect the schema, since schema-invariant updates do not need to be traced at the schema level.
5 Conclusion and Future Work

Though several features of semi-structured information are addressed by our approach, many problems remain to be solved. As an example, consider the case of blind queries, i.e. WG-Log queries written without consulting the site schema, or using an invalid version of it. In this case, it is important to avoid stonewalling, i.e. the remote site giving no answer (or an unsatisfactory one) even to basic blind queries. Sometimes, stonewalling is a result of data coercion, that is, some information being modeled as an attribute in the blind query while it is a separate entity in the instance, or vice versa. This problem can be effectively dealt with by "blinding" the bisimulation checking w.r.t. lexical differences between terminal nodes of paths.

In order to deal with blind queries in general, we plan to adopt a looser notion of bisimulation in query execution, allowing for a fuzzy query language based on WG-Log. In this approach, the degree of bisimilarity of two graphs can be informally defined as follows: given two graphs $A$ and $B$, $\beta(A, B)$ is the ratio between the number of paths in common between $A$ and $B$ and the total number of distinct paths in both graphs. We have that $\beta$ takes values in $[0, 1]$, $\beta(A, A) = 1$ and $\beta(A, B) = \beta(B, A)$. Bisimilarity between a query graph and (sub)graphs of a WG-Log instance could thus be used to compute an answer satisfying the formula associated with the query "up to a point". This may alleviate the stonewalling problem. We plan to explore this subject in a future paper.
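The path-ratio measure sketched above can be tried out with a few lines of Python. Several choices here are assumptions, since the text leaves them open: a path is taken to be an alternating sequence of node and edge labels starting at any node, paths are cut off at `max_edges` edges so that cyclic graphs stay finite, and two empty graphs are given similarity 1. The function names and the graph encoding are hypothetical. The example reproduces a data-coercion case: the query expects the rating as an attribute of the entree, while the instance stores it on a separate opinion entity.

```python
def label_paths(labels, edges, max_edges=3):
    """All label sequences of paths with at most max_edges edges."""
    paths = {(labels[n],) for n in labels}
    frontier = {((labels[n],), n) for n in labels}
    for _ in range(max_edges):
        extended = set()
        for path, node in frontier:
            for s, l, t in edges:
                if s == node:
                    extended.add((path + (l, labels[t]), t))
        paths |= {path for path, _ in extended}
        frontier = extended
    return paths


def similarity(a, b, max_edges=3):
    """Common paths divided by distinct paths of both graphs."""
    paths_a = label_paths(a[0], a[1], max_edges)
    paths_b = label_paths(b[0], b[1], max_edges)
    union = paths_a | paths_b
    return len(paths_a & paths_b) / len(union) if union else 1.0


# Blind query: the rating is expected directly on the entree.
query = (
    {"R": "E_RESTAURANT", "E": "E_ENTREE", "S": "String"},
    {("R", "L_entree", "E"), ("E", "L_rating", "S")},
)

# Instance fragment: the rating hangs off a separate opinion entity.
fragment = (
    {"r": "E_RESTAURANT", "e": "E_ENTREE", "o": "E_OPINION", "s": "String"},
    {("r", "L_entree", "e"), ("e", "L_opinion", "o"), ("o", "L_rating", "s")},
)

score = similarity(query, fragment)
```

Under these assumptions the query has 6 paths, the fragment 10, and they share 4, so the score is 4/12: a partial match that strict bisimulation would have rejected outright.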
References

[Abi97] Abiteboul S., "Querying Semi-Structured Data", ICDT '97, 6th Intl. Conf. on Database Theory, Vol. 1186, Springer (1997)
[Alt98] Altavista, Inc., "AltaVista Search Index", available from http://www.altavista.digital.com
[Atz97] Atzeni P., Mecca G., Merialdo P., "Semistructured and Structured Data in the Web: Going Back and Forth", available from http://www.research.att.com/ suciu/workshop-papers.html
[Ber98] Berners-Lee T., "Introduction to RDF MetaData", available from http://www.w3c.org
[Com98] Comai S., Damiani E., Posenato R., Tanca L., "A Schema-based Approach to Modeling and Querying WWW Data", Proc. of FQAS '98, Roskilde, May 1998, LNAI 1495
[Cor98] Cortesi A., Dovier A., Quintarelli E., Tanca L., "Operational and Abstract Semantics of a Query Language for Semi-Structured Information", Proc. of the Intl. Workshop on Deductive Logic Programming (DDLP '98)
[Dam97] Damiani E., Tanca L., "Semantic Approaches to Structuring and Querying Web Sites", Proc. of 7th IFIP Work. Conf. on Database Semantics (DS-97), Chapman & Hall (1997)
[DC98] "Dublin Core Metadata Element Set", available from http://www.purl.oclc.org/metadata/dublin core
[Dam98-2] Damiani E., Oliboni B., Tanca L., Veronese D., "Using WG-Log Schemata to Represent Semistructured Data", Proc. of 8th IFIP Work. Conf. on Database Semantics (DS-98), Kluwer (to appear)
[Dre97] Dreilinger D., "SavySearch Home Page", available from http://www.lycos.com
[Fer97] Fernandez M., Florescu D., Kang J., Levy A., Suciu D., "STRUDEL: A Web Site Management System", Proc. of the 7th IFIP Conf. on Database Semantics (1997)
[Flo98] Florescu D., Levy A., Mendelzon A., "Database Techniques for the WWW: A Survey", SIGMOD Record, Sept. 1998
[GaM95] Garcia-Molina H., Hammer J., "Integrating and Accessing Heterogeneous Information Sources in Tsimmis", Proc. of ADBIS 97, St. Petersburg (Russia) (1997)
[Gar95] Garzotto F., Mainetti L., Paolini P., "Hypermedia Design Languages Evaluation Issues", Comm. of the ACM 38(8) (1995)
[Gys94] Gyssens M., Paredaens J., Van der Bussche J., Van Gucht D., "A Graph-oriented Object Database Model", IEEE Trans. on Knowledge and Data Eng., 6(4) (1994)
[Hu96] Hu J., Nicholson D., Mungall C., Hillyard A., Archibald A., "WebinTool: A Generic Web to Database Interface Building Tool", Proc. DEXA Workshop, pp. 285-290 (1996)
[Isa95] Isakowitz T., Stohr Edward A. D., Balasubramanian P., "RMM: a Language for Structured Hypermedia Design", Comm. of the ACM 38(8) (1995)
[Kon95] Konopnicki D., Shmueli O., "W3QL: A Query System for the World Wide Web", Proc. of the 21st Intl. Conf. on Very Large Databases, Zurich (1995)
[Lak96] Lakshmanan L., Sadri F., Subramanian I., "A Declarative Language for Querying and Restructuring the Web", RIDE-NDS, IEEE Computer Soc. Press (1996)
[Lyc98-2] Lycos, Inc., "Point", available from http://www.pointcom.com
[Man98] Manola F., "Towards a Richer Web Object Model", SIGMOD Record, Vol. 27 n. 1, March 1998
[McK98] The McKinley Group, Inc., "Magellan Internet Guide", available from http://www.cs.colostate.edu/dreiling/smartform.html/
[Men96] Mendelzon A., Mihaila G., Milo T., "Querying the World Wide Web", Proc. of the Conf. on Parallel and Distributed Information Systems, Toronto (Canada) (1996)
[Mil90] Milner R., "Operational and Algebraic Semantics of Concurrent Processes", in J. van Leeuwen, editor, Handbook of Theoretical Computer Science, Chapter 19, Elsevier Science Publishers B.V. (1990)
[Par95] Paredaens J., Peelman P., Tanca L., "G-Log: A Declarative Graph-based Language", IEEE Trans. on Knowledge and Data Eng., (7), pp. 436-453 (1995)
[Pin97] Pinkerton B., "Finding What People Want: Experiences with WebCrawler", available from ftp://www.biotech.washington.edu/pub/WebCrawler.ps.gz
[RDF98] "W3C RDF Schemas (Working Draft)", available from http://www.w3.org/TR/WD-rdf-schema/
[Suc97] Suciu D., "Management of Semi-Structured Information", SIGMOD Record, Vol. 26 n. 4, Dec. 1997
[Yah98] Yahoo, Inc., "Yahoo!", available from http://www.yahoo.com