Full Outer Join Optimization Techniques in Integration Information Systems

Domenico Beneventano, Sonia Bergamaschi, Carlos Rodrigue Nana Mbinkeu

Dipartimento di Ingegneria dell'Informazione, Università di Modena e Reggio Emilia, Via Vignolese 905, 41100 Modena, Italy
[email protected]

Abstract

In Integration Information Systems, data fusion, i.e., the merging of multiple tuples about the same real-world object coming from different sources into a consistent and homogeneous set, is often ignored [5]. The full outer join operator is a good candidate for performing data fusion in today's integration information systems [1, 6]. The problem is that full outer join queries are very expensive, especially in a distributed environment such as that of mediator/integration systems. Database optimizers take full advantage of the associativity and commutativity properties of join to implement efficient and powerful optimizations on join queries; however, only limited optimization is performed on full outer join [6]. This paper reports on work in progress on query optimization techniques for full outer join queries, developed in the context of the MOMIS/SEWASIE Integration System.

1 Introduction

In today's integration information systems, data fusion, i.e., the merging of multiple tuples about the same real-world object into a single tuple, is often ignored [5].

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.

Proceedings of the 32nd VLDB Conference, Seoul, Korea, 2006

The MOMIS/SEWASIE Integration System ($IS$) [1, 2] performs data fusion from multiple sources into a consistent and homogeneous set. It is characterized by a classical wrapper/mediator architecture based on a Global Virtual View ($GVV$) and a set of local data sources. The data sources contain the real data, while the $GVV$ provides a reconciled, integrated, and virtual view of the underlying sources. Formally, $IS = \langle GVV, N, M \rangle$ consists of:

• A Global Virtual View ($GVV$).
• A set $N$ of local data sources.
• A set $M$ of GAV (Global-As-View) mapping assertions between $GVV$ and $N$, where each assertion associates with a class $G$ in $GVV$ a query $q_G$ over the schemas of a set of local sources in $N$.

More precisely, for each global class $G \in GVV$ we define a set of local classes, denoted by $L(G)$, belonging to the local sources in $N$, and a query $q_G$ over $L(G)$ based on the full outer join operator. Both the $GVV$ and the local source schemata are expressed in the ODL$_{I^3}$ [3] language. However, for the scope of this paper, we consider both the $GVV$ and the local source schemata as relational schemata.

In several papers (see [1, 2]) we described the MOMIS/SEWASIE approach to building the $GVV$. In this paper we face a new problem, namely how to define and optimize the query $q_G$ associated with a global class $G$ to perform data fusion.

Figure 1 shows the Mapping Table $MT$ of the global class $G$ with $L(G) = \{L_1, L_2, L_3\}$ (we suppose that these local classes are in three different local sources). The columns of $MT$ represent the local classes $L(G)$ belonging to $G$, and the rows represent the global attributes of $G$ (the schema of $G$ is then $S(G) = \{A, B, C, D, E, F, ID\}$). An element $MT[A][L]$ represents the set of local attributes of $L$ which are mapped onto the global attribute $G.A$. In this example, for the sake of simplicity, $MT[A][L] = L.A$.

        L1    L2    L3
A       A     −     −
B       B     B     −
C       −     C     C
D       −     D     −
E       E     −     −
F       −     −     F
ID      ID    ID    ID

Figure 1: Mapping Table of the global class G

The query $q_G$ is defined in such a way that:

1. it contains a unique tuple resulting from the merge of all the different tuples representing the same real-world object;
2. it solves data conflicts of common local attribute values and merges common attributes into one.

As in [5], we assume object identity, which means that it is possible to distinguish between different real-world objects by a globally unique and consistent identifier. In most domains, such an identifier is already present or can easily be created, e.g., by duplicate detection methods.

[5] analyzed standard and advanced relational operators that perform some form of data integration and discussed their ability to achieve complete and concise results. The join operators are not well suited to fuse tables, because the result contains only objects present in both source tables. This disadvantage is removed by the use of the outer join operations, which also retain all tuples of either one or both relations (left, right and full outer join). If the join attribute is a globally consistent ID, each real-world object is identified by that ID and appears only once in the result of the full outer join (this property is called extensional conciseness in [5]).

In MOMIS/SEWASIE we assume that the join attribute is a globally consistent ID in the Mapping Table, i.e., all the local classes of $G$ have an attribute ID, $L_i.ID$, which is mapped onto the same global attribute $G.ID$ (see Figure 1), and we use the full outer join to define the query $q_G$, as shown in the following.

If the local classes $L_i$ and $L_j$ are extensionally overlapping, a join condition $p_{ij}: L_i.ID = L_j.ID$ holds. If all the local classes are extensionally overlapping (the default case), the resulting query graph is fully connected. We associate the following full outer join query with the global class $G$ (⟗ denotes the full outer join):

$$FJ_G = (L_1 ⟗_{p_{12}} L_2) ⟗_{p_{13} \lor p_{23}} L_3 \equiv (L_2 ⟗_{p_{23}} L_3) ⟗_{p_{13} \lor p_{12}} L_1 \equiv (L_1 ⟗_{p_{13}} L_3) ⟗_{p_{23} \lor p_{12}} L_2 \qquad (1)$$

These equivalences are an immediate consequence of the definitions of $p_{ij}$.
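To make the equivalences in (1) concrete, the following minimal Python sketch evaluates the three join orders on toy data. It assumes a globally consistent ID, so each relation can be modeled as a dictionary keyed by ID. The helper names (`qualify`, `full_outer_join`) and the sample values are illustrative assumptions and do not come from MOMIS/SEWASIE.

```python
def qualify(name, rows):
    """Prefix attribute names with the local class name, e.g. 'B' -> 'L1.B'."""
    return {oid: {f"{name}.{attr}": val for attr, val in row.items()}
            for oid, row in rows.items()}


def full_outer_join(left, right):
    """Full outer join on the global ID.

    Both inputs map ID -> {qualified attribute: value}. An attribute that is
    absent from a result tuple is read as null.
    """
    return {oid: {**left.get(oid, {}), **right.get(oid, {})}
            for oid in left.keys() | right.keys()}


# Hypothetical extensions of the local classes of Figure 1, keyed by ID.
L1 = qualify("L1", {1: {"A": "a1", "B": 10, "E": "e1"},
                    2: {"A": "a2", "B": 20, "E": "e2"}})
L2 = qualify("L2", {2: {"B": 22, "C": "c2", "D": "d2"},
                    3: {"B": 30, "C": "c3", "D": "d3"}})
L3 = qualify("L3", {3: {"C": "c3", "F": "f3"},
                    4: {"C": "c4", "F": "f4"}})

# The three evaluation orders of equation (1).
order_12_3 = full_outer_join(full_outer_join(L1, L2), L3)
order_23_1 = full_outer_join(full_outer_join(L2, L3), L1)
order_13_2 = full_outer_join(full_outer_join(L1, L3), L2)
```

Because every tuple is identified by the same global ID, the three orders produce the same relation, with exactly one tuple per real-world object.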

If two local classes, say $L_1$ and $L_3$, are disjoint, then $p_{13} = false$ and only the first two expressions in (1) are valid to compute $FJ_G$. Note that in this case the resulting query graph is acyclic; the only condition we require is that the query graph is connected.

The query $q_G$ is defined starting from $FJ_G$ by solving data conflicts of common local attribute values and merging common attributes into one. We use conflict resolution strategies based on Resolution Functions, as introduced in [7, 5]: for global attributes mapped onto more than one local attribute, the designer defines, in the Mapping Table, Resolution Functions to solve data conflicts of local attribute values. For example, if $L_1.B$ and $L_2.B$ are numerical attributes, we can define $G.B = avg(L_1.B, L_2.B)$.

If the designer knows that there are no data conflicts for a global attribute mapped onto more than one source (that is, the instances of the same real-world object in different local classes have the same value for this common attribute), the designer can define this attribute as a Homogeneous Attribute; of course, for homogeneous attributes resolution functions are not necessary. A global attribute mapped onto only one local class is a particular case of a homogeneous attribute.

We consider that all Resolution Functions are applied by means of a unary operator $\Phi$, which is introduced informally in the following definition (for a definition of similar concepts see [5, 4]):

Definition 1 (Query $q_G$)

$$q_G = \Phi(FJ_G) \qquad (2)$$

where $\Phi$ is a unary relational operator which returns a relation with schema $S(G)$ such that

• the value of $G.ID$ is a non-null value of $L_i.ID$;
• the value of a homogeneous attribute $G.A$ is a non-null value of $L_i.A$ (if no such value exists, the value is null);
• the value of a non-homogeneous attribute $G.A$ is computed by applying the related Resolution Function.
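The behavior of $\Phi$ on a single tuple of $FJ_G$ can be sketched as follows. This is a minimal illustration, not the MOMIS/SEWASIE implementation: the function name `phi`, the dictionary encoding of the Mapping Table, and the choice to apply a Resolution Function only to the non-null local values are assumptions made here for the example.

```python
# Mapping Table of Figure 1: global attribute -> local classes providing it.
MAPPING_TABLE = {
    "A": ["L1"],
    "B": ["L1", "L2"],
    "C": ["L2", "L3"],
    "D": ["L2"],
    "E": ["L1"],
    "F": ["L3"],
}


def avg(values):
    """Resolution Function used in the text for G.B."""
    return sum(values) / len(values)


# Hypothetical designer choice: B is conflicting and resolved by averaging;
# every other attribute is treated as homogeneous.
RESOLUTION_FUNCTIONS = {"B": avg}


def phi(object_id, fj_tuple):
    """Fuse one tuple of FJ_G into a tuple with schema S(G)."""
    fused = {"ID": object_id}
    for attr, classes in MAPPING_TABLE.items():
        # Collect the non-null local values mapped onto this global attribute.
        values = [fj_tuple[f"{c}.{attr}"] for c in classes
                  if fj_tuple.get(f"{c}.{attr}") is not None]
        if not values:
            fused[attr] = None                      # no local value: null
        elif attr in RESOLUTION_FUNCTIONS:
            fused[attr] = RESOLUTION_FUNCTIONS[attr](values)
        else:
            fused[attr] = values[0]                 # homogeneous: any non-null value
    return fused


# A tuple of FJ_G for an object present in L1 and L2 but not in L3.
fj_tuple = {"L1.A": "a2", "L1.B": 20, "L1.E": "e2",
            "L2.B": 22, "L2.C": "c2", "L2.D": "d2"}
fused = phi(2, fj_tuple)
```

Here `fused` carries one value per global attribute: `B` is the average of the two conflicting local values, and `F` is null because the object does not appear in $L_3$.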

2 Full Outer Join Optimization Techniques

We consider a query $Q$ over the global class $G$ of the $GVV$:

$$Q = \Pi_A[\sigma_P(G)] \qquad (3)$$

where $A$ are the projection attributes and $P$ is a conjunction of positive atomic constraints ($G.A$ $Op$ $value$). In [1] we completely described the rewriting method; here we focus on query optimization. We consider only queries over a single global class, because a generic query posed over more than one global class of the $GVV$ is expanded into subqueries, each referring to a single global class. The query $Q$ must be rewritten over the local classes of $G$ by taking into account the mapping defined by $q_G$, i.e., we must consider the following expression:

$$\Pi_A[\sigma_P(\Phi(FJ_G))] \qquad (4)$$

i.e.,

$$\Pi_A[\sigma_P(\Phi((L_1 ⟗_{p_{12}} L_2) ⟗_{p_{13} \lor p_{23}} L_3))] \qquad (5)$$

This expression is rewritten as:

$$\Pi_A[\sigma_{P_r}(\Phi(\underbrace{(\bar{L}_1 ⟗_{p_{12}} \bar{L}_2) ⟗_{p_{13} \lor p_{23}} \bar{L}_3}_{FJ_G^Q}))] \qquad (6)$$

where $P_r$ are the residual constraints, i.e., the atomic constraints of $P$ that are not mapped into all local queries $\bar{L}_i$, and $\bar{L}_i = \Pi_{A_i}(\sigma_{P_i}(L_i))$ is defined as follows:

• $A_i$ is the union of the attributes in $A$ and in $P_r$ mapped into $L_i$, plus the attribute ID.
• $P_i$ is computed by performing an atomic constraint mapping: an atomic constraint ($G.A_i$ $Op$ $value$) is rewritten on the basis of the resolution functions defined for $G.A_i$. For example, if $G.B = avg(L_1.B, L_2.B)$, the constraint ($G.B = value$) cannot be pushed to the local level, because $avg$ has to be calculated at the global level. In this case, the constraint is mapped as $true$ in both local classes. On the other hand, if $G.B$ is a homogeneous attribute, the constraint can be pushed to the local classes $L_1$ and $L_2$, as shown in the following example:

$Q_1 = \Pi_{A_1}[\sigma_{P_1}(G)]$ with $A_1 = \{A, B, C, D, E, F\}$ and $P_1 = (A = value_1) \land (B = value_2) \land (C = value_3)$. In $FJ_G^{Q_1}$ we have:

$$\bar{L}_1 = \Pi_{ID,A,B,E}(\sigma_{(A = value_1) \land (B = value_2)}(L_1))$$
$$\bar{L}_2 = \Pi_{ID,B,C,D}(\sigma_{(B = value_2) \land (C = value_3)}(L_2))$$
$$\bar{L}_3 = \Pi_{ID,C,F}(\sigma_{(C = value_3)}(L_3))$$

No atomic constraint of $P_1$ is mapped into all local queries $\bar{L}_i$; hence all constraints are residual and $P_r = P_1$.

How can the full outer join expression $FJ_G^Q$ be optimized? Conventional database optimizers take full advantage of the associativity and commutativity properties of join to implement efficient and powerful optimizations on select/project/join queries; however, only limited optimization is performed on other binary operators such as full outer join [6]. In [6], the authors presented the theory and algorithms needed to generate alternative evaluation orders for the optimization of queries containing outer joins. Their results are very interesting and constitute a good basis for our proposal, but they are not sufficient in our framework. The main reason is that our query graphs are in general (in the default case) cyclic. Moreover, [6] focused on the problem of changing the order of evaluation of queries containing both joins and outer joins, whereas we want to optimize queries with only full outer joins (see (6)).

In the following we describe, informally and by examples, the following significant simplifications of the full outer join expression $FJ_G^Q$:

1. Substitution of a full outer join with a left/right outer join or a join.
2. Reduction of the local projection attributes in $\bar{L}_i = \Pi_{A_i}(\sigma_{P_i}(L_i))$.
3. Elimination of local queries $\bar{L}_i$ and thus elimination of local classes from the full outer join query execution.

All these simplifications constitute a relevant query optimization result, since they are independent of the query execution cost model and are effective in a distributed environment.

To describe our method, let us consider $Q = \Pi_A[\sigma_P(G)]$ where $A = A_1 \ldots A_n$, and $P = (A_1\ Op_1\ value_1) \land \ldots \land (A_m\ Op_m\ value_m)$ is denoted by $P = A_1 \ldots A_m$.

Simplification 1: Substitution of a full outer join with a left/right outer join or a join. Example: $Q_1 = \Pi_{A_1}[\sigma_{P_1}(G)]$ with $A_1 = ABCDEF$ and $P_1 = ABC$. If the attributes included in $P_1$ are homogeneous attributes, the possible simplifications of $FJ_G^{Q_1}$ are (⟕ denotes the left outer join and ⋈ the join):

1. $(\bar{L}_1 ⟗_{p_{12}} \bar{L}_2) ⟕_{p_{13} \lor p_{23}} \bar{L}_3$
2. $(\bar{L}_2 ⟗_{p_{23}} \bar{L}_3) ⋈_{p_{12} \lor p_{13}} \bar{L}_1$
3. $(\bar{L}_1 ⟗_{p_{13}} \bar{L}_3) ⟕_{p_{12} \lor p_{23}} \bar{L}_2$

The first simplification is justified by the following observation (see Figure 2.a): the set $L_1$ and $L_2$ supports the predicate $P$ (since $S(L_1) \cup S(L_2)$ includes the attributes of $P$), whereas $L_3$ alone does not support $P$; i.e., we can avoid the execution of the right outer join on $L_3$, so a left outer join suffices. The justification of the second and third simplifications is similar (Figure 2.b-2.c).
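The local queries $\bar{L}_i$ and the three simplifications of $FJ_G^{Q_1}$ can be checked on toy data with the following Python sketch. It is only an illustration under stated assumptions: all attributes are homogeneous (so merged tuples can share unqualified attribute names), relations are dictionaries keyed by the global ID, and the sample rows and helper names (`local_query`, `join`, `select`) are made up for this example. Since $A_1$ contains every attribute, the local projections are omitted.

```python
def local_query(relation, predicate):
    """sigma_{P_i}(L_i): push down the constraints on attributes that L_i has."""
    return {oid: row for oid, row in relation.items()
            if all(row[a] == v for a, v in predicate.items() if a in row)}


def join(left, right, how):
    """Join on the global ID; `how` is 'full', 'left' or 'inner'."""
    if how == "full":
        keys = left.keys() | right.keys()
    elif how == "left":
        keys = left.keys()
    elif how == "inner":
        keys = left.keys() & right.keys()
    else:
        raise ValueError(how)
    # Homogeneous attributes agree across classes, so merging is conflict-free.
    return {k: {**left.get(k, {}), **right.get(k, {})} for k in keys}


def select(relation, predicate):
    """sigma_{P_r}: a missing (null) attribute never satisfies a constraint."""
    return {k: row for k, row in relation.items()
            if all(row.get(a) == v for a, v in predicate.items())}


# Hypothetical local extensions consistent with the homogeneity assumption.
L1 = {1: {"A": "a", "B": "b", "E": "e1"}, 2: {"A": "a", "B": "b", "E": "e2"},
      3: {"A": "x", "B": "b", "E": "e3"}, 5: {"A": "a", "B": "b", "E": "e5"}}
L2 = {1: {"B": "b", "C": "c", "D": "d1"}, 3: {"B": "b", "C": "c", "D": "d3"},
      4: {"B": "b", "C": "c", "D": "d4"}, 6: {"B": "b", "C": "z", "D": "d6"}}
L3 = {1: {"C": "c", "F": "f1"}, 2: {"C": "c", "F": "f2"},
      4: {"C": "c", "F": "f4"}, 7: {"C": "c", "F": "f7"}}

# P1 = (A = value1) and (B = value2) and (C = value3); here P_r = P1.
P1 = {"A": "a", "B": "b", "C": "c"}
l1, l2, l3 = (local_query(r, P1) for r in (L1, L2, L3))

full = select(join(join(l1, l2, "full"), l3, "full"), P1)    # expression (6)
simp1 = select(join(join(l1, l2, "full"), l3, "left"), P1)   # simplification 1
simp2 = select(join(join(l2, l3, "full"), l1, "inner"), P1)  # simplification 2
simp3 = select(join(join(l1, l3, "full"), l2, "left"), P1)   # simplification 3
```

On this data all four expressions return the same two objects (IDs 1 and 2), which is the behavior the simplifications rely on; the sketch does not prove the equivalence in general.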

Figure 2: Query Graph and Simplifications (the ID attribute is not shown)

The maximum simplification is obtained when $P$ involves attributes that each appear in a single class, as shown in the following example: $Q_2 = \Pi_{A_2}[\sigma_{P_2}(G)]$ with $A_2 = A_1$ and $P_2 = DEF$. A possible simplification of $FJ_G^{Q_2}$ is:

$$(\bar{L}_1 ⋈_{p_{12}} \bar{L}_2) ⋈_{p_{13} \lor p_{23}} \bar{L}_3$$

In the previous examples, we only require that the attributes involved in the predicate $P$ be homogeneous. If the projection attributes are also homogeneous, further simplifications may be obtained.
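One way to read the examples $Q_1$ and $Q_2$ is that a local class becomes mandatory for every answer object whenever some predicate attribute is mapped only onto that class, because a positive atomic constraint cannot be satisfied by a null value. The following Python sketch encodes this reading; it is an interpretation of the examples under the homogeneity assumption, not an algorithm stated in the paper, and the function name `mandatory_classes` is hypothetical.

```python
# Mapping Table of Figure 1: global attribute -> local classes providing it.
MAPPING_TABLE = {
    "A": {"L1"},
    "B": {"L1", "L2"},
    "C": {"L2", "L3"},
    "D": {"L2"},
    "E": {"L1"},
    "F": {"L3"},
}


def mandatory_classes(predicate_attrs):
    """Local classes that every object satisfying the predicate must belong to.

    An attribute provided by a single class forces that class: if the object
    were missing from it, the attribute would be null and the constraint
    on it would fail.
    """
    required = set()
    for attr in predicate_attrs:
        sources = MAPPING_TABLE[attr]
        if len(sources) == 1:
            required |= sources
    return required


q1_required = mandatory_classes("ABC")  # P1 = ABC
q2_required = mandatory_classes("DEF")  # P2 = DEF
```

For $P_1 = ABC$ only $L_1$ is mandatory, which matches the join with $\bar{L}_1$ in the second simplification of $Q_1$. For $P_2 = DEF$ all three classes are mandatory, which matches the replacement of both full outer joins by joins.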

Simplification 2: Reduction of the local projection attributes, i.e., reduction of $A_i$ in $\bar{L}_i = \Pi_{A_i}(\sigma_{P_i}(L_i))$. Example: we consider $Q_2$ with a different set of projection attributes: $Q_3 = \Pi_{A_3}[\sigma_{P_3}(G)]$ with $A_3 = BCE$ and $P_3 = DEF$. As for $Q_2$, a possible simplification of $FJ_G^{Q_3}$ is:

$$(\bar{L}_1 ⋈_{p_{12}} \bar{L}_2) ⋈_{p_{13} \lor p_{23}} \bar{L}_3$$

Moreover, due to the hypothesis that the projection attributes are homogeneous, an attribute of $A_3$ can be selected from any local class, and thus we can perform the following reduction (other combinations are possible):

Local query    Before reduction          After reduction
L̄1             A1 = {ID, B, E}           A1 = {ID, B, E}
L̄2             A2 = {ID, B, C, D}        A2 = {ID, D}
L̄3             A3 = {ID, C, F}           A3 = {ID, C, F}

By combining substitution and reduction we can eliminate local classes from $FJ_G^Q$, as shown in the following.

Simplification 3: Elimination of local classes from the full outer join query execution. Example: $Q_4 = \Pi_{A_4}[\sigma_{P_4}(G)]$ with $A_4 = BF$ and $P_4 = CD$. In this case $L_1$ can be eliminated from $FJ_G^{Q_4}$, obtaining $(\bar{L}_2 ⟕_{p_{23}} \bar{L}_3)$, where ⟕ denotes the left outer join.

3 Conclusion and Future Work

In this paper we described, in an informal way, our work in progress on full outer join optimization techniques in Integration Information Systems. The proposed optimization techniques are valid in a context of homogeneous attributes, i.e., when there are no data conflicts. The challenge of our future work is to extend these optimization techniques to a context of conflicting data.

References

[1] D. Beneventano and S. Bergamaschi. Semantic search engines based on data integration systems. In Semantic Web Services: Theory, Tools and Applications. Idea Group, 2006. To appear. Available at http://dbgroup.unimo.it/SSE/SSE.pdf.

[2] D. Beneventano, S. Bergamaschi, F. Guerra, and M. Vincini. Synthesizing an integrated ontology. IEEE Internet Computing, 7(5):42–51, 2003.

[3] S. Bergamaschi, S. Castano, M. Vincini, and D. Beneventano. Semantic integration of heterogeneous information sources. Data Knowl. Eng., 36(3):215–249, 2001.

[4] J. Bleiholder. A relational operator approach to data fusion. In VLDB 2005 PhD Workshop, 2005.

[5] J. Bleiholder and F. Naumann. Declarative data fusion - syntax, semantics, and implementation. In Advances in Databases and Information Systems (ADBIS), pages 58–73, 2005.

[6] C. A. Galindo-Legaria and A. Rosenthal. Outerjoin simplification and reordering for query optimization. ACM Trans. Database Syst., 22(1):43–73, 1997.

[7] F. Naumann and M. Häussler. Declarative data merging with conflict resolution. In C. Fisher and B. N. Davidson, editors, IQ, pages 212–224. MIT, 2002.
