XML data integration with identification
Antonella Poggi 1,2 and Serge Abiteboul 1
1 INRIA Futurs, France
2 Università di Roma "La Sapienza", Italy
Abstract. Data integration is the problem of combining data residing at different sources and providing the user with a virtual view, called the global schema, that is independent from the sources. Our goal is to address some of the issues related to XML data integration. In particular, we highlight two major issues that emerge in this context: (i) the global schema may be characterized by a set of constraints, expressed by a DTD and XML integrity constraints; (ii) the concept of node identity requires introducing semantic criteria to identify nodes coming from different sources. We propose a formal framework for XML data integration systems based on an expressive global schema, a set of XML data sources and a set of mappings specified by means of a simple tree language. Then, we define an identification function that aims at globally identifying nodes coming from different sources. Finally, we address query answering and propose algorithms to answer queries under different assumptions for the mappings.
1 Introduction

Data integration is the problem of combining data residing at different sources and providing the user with a virtual view, called the global schema, which is independent from the sources and can be queried by users. Whereas many data integration systems [5, 7] and theoretical works [6, 3, 9] have been proposed for relational data, XML data integration has not yet received much attention. Our goal is to address some of the issues it raises. In particular, we highlight two major issues that emerge in the XML context: (i) the global schema may be characterized by a set of constraints, expressed by a DTD and XML integrity constraints; (ii) the concept of node identity requires introducing semantic criteria to identify nodes coming from different sources.

Our main contributions are as follows.

– First, we propose a formal framework for XML data integration systems based on (i) a global schema specified by means of a (simplified) DTD and a set of XML integrity constraints as defined in [2], (ii) a source schema specified by means of DTDs, and (iii) a set of Local-As-View (LAV) mappings, which characterize the content of each source as a view over the global schema, specified by means of prefix-selection queries [1]. If a source provides a subset of the data accessible from the global schema through the corresponding view, we say that the mapping is sound. If the source provides exactly the view, the mapping is exact.
– Second, we define an identification function, whose role is to identify nodes coming from different sources.
– Finally, we address the query answering problem in the XML data integration setting. In particular, we propose an approach that is reminiscent of query answering with incomplete information. We provide three algorithms to answer queries under the assumptions of sound, exact and mixed mappings, and study their complexity.

This paper is an abstract. Definitions and proofs may be found in [8].
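The difference between sound and exact mappings can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the result of a view over some candidate global database and the content of a source are both flattened into sets of tuples; the function names `is_sound` and `is_exact` are hypothetical and are not part of the framework of [8].

```python
def is_sound(source, view_result):
    """Sound mapping: the source holds a subset of what the view returns."""
    return set(source) <= set(view_result)


def is_exact(source, view_result):
    """Exact mapping: the source holds exactly what the view returns."""
    return set(source) == set(view_result)


# Hypothetical view result over one candidate global database: (name, SSN) pairs.
view_result = {("Parker", "55577"), ("Rossi", "20903")}

# A source that knows only one of the two patients.
partial_source = {("Parker", "55577")}
```

Here `partial_source` is compatible with the candidate global database under a sound mapping but not under an exact one, which is why the exact assumption leaves fewer candidate global databases to consider.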
2 XML data integration by example

Suppose that a hospital offers access to information about patients and their treatments. Information is stored in XML documents managed by different offices of the hospital, whereas users (e.g. statisticians), for privacy and security reasons, have access to a global DTD SG of the following form:

SG :
Following the usual approach for XML, we consider documents as trees whose nodes are labeled with element names. The above DTD says that the document contains data about patients and hospital treatments, where a cure is nothing but a treatment id. Moreover, a set of key and foreign key constraints is specified over the global schema. In particular, we know that two patients cannot have the same social security number SSN, that two treatments cannot have the same number trID, and that all the prescribed cures have to appear among the treatments of the hospital. These constraints correspond respectively to two key constraints and one foreign key constraint. Finally, the sources consist of the following two documents, D1 and D2, with DTDs S1 and S2 respectively. The mappings tell us that they contain, respectively, patients with a name and a social security number, and patients that were prescribed cures.
D1 : patients (name, SSN): (Parker, 55577), (Rossi, 20903)
D2 : patient (SSN, cures): (55577; 32, 11)
S1 :
S2 :
Suppose now that the user poses the following queries:

1. Find the name, the social security number and the prescribed cures for all patients having an insurance number and at least one prescribed cure.
2. Find all dangerous treatments (we assume these have numbers smaller than 25).

Typically, in data integration systems, the goal is to find the certain answers, i.e. the answers returned by all data trees satisfying the global schema and conforming to the data at the sources. By adapting data integration terminology [6], we call such trees legal data trees. A crucial point here is that legal data trees can be constructed from a merge of the source trees. We therefore need to identify nodes that should be merged, using the constraints of the global schema.

Note that the data retrieved may not satisfy these constraints. In particular, there are two kinds of constraint violations. Data may be incomplete, i.e. it may violate constraints by not providing all the data required by the schema. Or the data retrieved may be inconsistent, i.e. it may violate key constraints by providing two elements that are "semantically" the same but cannot be merged without violating key constraints. In this paper, we focus on the incompleteness issue, and detect inconsistencies without aiming at providing answers when they occur.

Coming back to the example, it is easy to see that the sources are consistent. Thus, the global schema constraints allow us to answer Query 1 by returning the patient with name "Parker", social security number "55577" and cure numbers "32" and "11". Note that Query 2 can also be answered with certainty, thanks to the foreign key constraint, by returning the cures that have been prescribed to patients from the second data source.

We conclude the section by highlighting the impact of the sound/exact mapping assumption. Suppose that no constraints were expressed over the global schema. Under exact mappings, there is only one way to merge the data sources and satisfy the schema constraints. Indeed, since every patient has a name and an SSN, we deduce that all patients in D2 must belong to D1. Therefore Query 1 is answered by returning the patient "Parker", with SSN "55577" and cure numbers "32" and "11". Note that under sound mappings, the same query cannot be answered with certainty without keys.
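The merge discussed in this example can be sketched in Python. This is only an illustration of how the SSN key lets us identify patients across sources; it is not the identification function or the query answering algorithms of [8]. The dictionary encoding of D1 and D2 and the helper `merge_on_key` are assumptions made for the sketch.

```python
def merge_on_key(sources, key):
    """Merge records from several sources, identifying records that agree on `key`."""
    merged = {}
    for records in sources:
        for rec in records:
            target = merged.setdefault(rec[key], {key: rec[key]})
            for field, value in rec.items():
                if field == key:
                    continue
                if isinstance(value, list):
                    # Multi-valued field (e.g. cures): accumulate without duplicates.
                    existing = target.setdefault(field, [])
                    for v in value:
                        if v not in existing:
                            existing.append(v)
                elif target.setdefault(field, value) != value:
                    # Same key but conflicting single value: sources are inconsistent.
                    raise ValueError(f"inconsistent {field} for {key}={rec[key]}")
    return merged


# Assumed encoding of the two source documents of the example.
D1 = [{"SSN": "55577", "name": "Parker"}, {"SSN": "20903", "name": "Rossi"}]
D2 = [{"SSN": "55577", "cure": ["32", "11"]}]

patients = merge_on_key([D1, D2], "SSN")

# Query 1: patients with a name, an SSN and at least one prescribed cure.
query1 = [p for p in patients.values() if "name" in p and p.get("cure")]

# Query 2: treatments numbered below 25. By the foreign key, every prescribed
# cure must appear among the hospital treatments.
query2 = sorted(
    {c for p in patients.values() for c in p.get("cure", []) if int(c) < 25}
)
```

In this sketch an inconsistency (two different names for the same SSN) is only detected and reported, in line with the choice above to detect inconsistencies without trying to answer queries when they occur.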
3 Data model and query language

In this section we introduce our data model and query language, inspired by [1]. We refer to [8] for most of the definitions.

Data trees, paths, prefixes, tree type

The (unordered) data tree $T = \langle t, d, \lambda, \nu \rangle$ of Figure 1(a) represents an XML document about the patients and treatments of a hospital, where the data node values, assigned by the function $\nu$, are circled. Note that there are two different paths for trID. Moreover, two prefixes of $T$ are shown in Fig. 2(b) and 2(d). In Fig. 1(b), we show the tree type $\langle \Sigma, r, \mu \rangle$ that corresponds to the global DTD SG from Section 2, where r = hospital and $\mu$ can be specified as follows:

hospital → patient+ treatment+
patient → SSN name cure* bill?
treatment → trID procedure?
Note that patient and treatment are both elements of the same collection hospital. Moreover, $T$ satisfies the tree type $S_G$, denoted $S_G \models T$.
Fig. 1. Data Model: (a) Data tree; (b) Tree type
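As an illustration of what it means for a data tree to satisfy a tree type, the following Python sketch checks child multiplicities against the rules of Fig. 1(b). The encoding is an assumption: trees are nested `(label, children)` pairs, the patient identifier is written SSN as in the key constraints, labels without a rule are treated as data nodes, and the example tree is only in the spirit of Fig. 1(a). The formal definition is in [8].

```python
from collections import Counter

# Assumed encoding of the tree type: element -> {child label: multiplicity}.
MU = {
    "hospital": {"patient": "+", "treatment": "+"},
    "patient": {"SSN": "1", "name": "1", "cure": "*", "bill": "?"},
    "treatment": {"trID": "1", "procedure": "?"},
}
BOUNDS = {"1": (1, 1), "?": (0, 1), "+": (1, None), "*": (0, None)}


def satisfies(node, mu=MU):
    """Check that a (label, children) tree respects the multiplicities in `mu`."""
    label, children = node
    if label not in mu:
        # Labels without a rule are treated as data nodes holding a plain value.
        return not isinstance(children, list)
    if not isinstance(children, list):
        return False
    counts = Counter(child[0] for child in children)
    if any(child_label not in mu[label] for child_label in counts):
        return False
    for child_label, mult in mu[label].items():
        low, high = BOUNDS[mult]
        n = counts[child_label]
        if n < low or (high is not None and n > high):
            return False
    return all(satisfies(child, mu) for child in children)


# Illustrative data tree; the treatment values are assumptions.
tree = ("hospital", [
    ("patient", [("SSN", "55577"), ("name", "Parker"), ("cure", "32"), ("cure", "11")]),
    ("patient", [("SSN", "20903"), ("name", "Rossi")]),
    ("treatment", [("trID", "32")]),
    ("treatment", [("trID", "11")]),
])
```

A tree with a patient that lacks an SSN, or a hospital without any treatment, would be rejected by `satisfies`.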
Unary keys and foreign keys

Given a tree type $S = \langle \Sigma, r, \mu \rangle$, we recall and adapt to our framework the definition of (absolute) unary keys and foreign keys from [2]. In particular, foreign keys are assertions of the form $a.k_a \subseteq b.k_b$, where $k_b$ is a key for $b$, $k_a^{\omega} \in \mu(a)$ for some $\omega$, $k_b^{1} \in \mu(b)$, and $\mu(k_a)$ and $\mu(k_b)$ both contain S. This indicates that, for every node $n$ labeled $a$, the value of a data node $m$ labeled $k_a$, child of $n$, references the value of a data node $m'$ labeled $k_b$, child of a node $n'$ labeled $b$. In this paper, we consider only fully specified foreign keys, such that $path(b) = \{p\}$, where $p = (l_0, l_1, \ldots, l_s)$, $l_0 = r$, $l_s = b$, and for each label $l_i$, $l_i^{\omega_i} \in \mu(l_{i-1})$ where $\omega_i \in \{?, 1\}$ and $S \notin \mu(l_i)$, $i = 1, \ldots, s$. Note that this restriction is very reasonable in practice.

Example 1. Given the tree type of Fig. 1(b), the constraints from Section 2 are:

patient.SSN → patient
treatment.trID → treatment
patient.cure ⊆ treatment.trID
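The three constraints of Example 1 can be checked on a flattened view of a data tree, as in the following Python sketch. The record encoding and the helper names `key_violations` and `dangling_references` are assumptions made for illustration; they are not definitions from [2] or [8].

```python
from collections import Counter


def key_violations(records, key):
    """Return the values of `key` that occur in more than one record."""
    counts = Counter(rec[key] for rec in records if key in rec)
    return sorted(v for v, n in counts.items() if n > 1)


def dangling_references(referencing, ref_field, referenced, key):
    """Return values of `ref_field` with no matching `key` value (foreign key check)."""
    known = {rec[key] for rec in referenced if key in rec}
    return sorted(
        {v for rec in referencing for v in rec.get(ref_field, []) if v not in known}
    )


# Assumed flattened encoding of patients and treatments.
patients = [
    {"SSN": "55577", "name": "Parker", "cure": ["32", "11"]},
    {"SSN": "20903", "name": "Rossi", "cure": []},
]
treatments = [{"trID": "32"}]

ssn_clashes = key_violations(patients, "SSN")            # patient.SSN → patient
trid_clashes = key_violations(treatments, "trID")        # treatment.trID → treatment
missing = dangling_references(patients, "cure", treatments, "trID")  # patient.cure ⊆ treatment.trID
```

With these illustrative records both keys hold, while cure 11 has no matching treatment. This is an instance of incompleteness rather than inconsistency: a legal data tree has to contain a treatment with trID 11.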
Prefix-selection queries

We illustrate the ps-query language with queries Q1 and Q2, shown in Fig. 2(a) and Fig. 2(c) respectively. They select (Q1) the name and SSN of patients in the input tree of Fig. 1(a) having SSN smaller than 10000, and (Q2) the SSN and cures of patients that were prescribed at least one cure. We say that Q1 and Q2 are posed in terms of the tree type of Figure 1(b). The answers to the queries are given in Figure 2(b) and 2(d). Note that these are the data trees representing documents D1 and D2 from Section 2.