A Relational Operator Approach to Data Fusion
Jens Bleiholder
Humboldt-Universität zu Berlin
Unter den Linden 6, 10099 Berlin, Germany
[email protected]
Abstract
Integrated information systems provide users with a single unified view of heterogeneous data sources. While the resolution of schema level conflicts and the detection of fuzzy duplicates have been studied more comprehensively, the problem of resolving data level conflicts still remains. We propose a relational data fusion operator, which fuses tuples representing the same real world entity by resolving conflicts in the attributes. Syntax and semantics of the operator are given, as well as an extension of Sql. Furthermore, we address optimization issues: transformations of logical query plans involving fusion, and cost based optimization for fusion. An implementation of the operator as part of a research prototype is under way.
1 Data Fusion
Integrated (relational) information systems provide users with a single unified view of heterogeneous data sources. The integration system queries the underlying data sources, combines the results, and presents them to the user. The example in Figure 1 consists of three tables, the first two each representing the data of one data source. In the example we assume a real-world domain in which students are unambiguously and globally consistently identified by their first name. The two source tables overlap intensionally (attributes with the same meaning in both tables) as well as extensionally (tuples with the same meaning, i.e., the same students, in both tables). There is information about Alice, Bob, and Charly in both source tables, but the tables also complement each other: column Car is contained only in table EE Students, column Phone only in table CS Students.

EE Students
Name    Age  Student  Car
Peter   ⊥    no       Ford
Alice   22   yes      ⊥
Bob     ⊥    yes      VW
Charly  25   yes      Pontiac
Paul    26   yes      Chevy
Paul    ⊥    yes      Chevy

CS Students
Name    Age  Student  Phone
Alice   ⊥    yes      555 1234
Bob     27   ⊥        555 4321
Charly  24   yes      ⊥
Alice   21   no       555 9876
Mary    24   yes      ⊥
Mary    24   yes      ⊥

Data fusion result of EE and CS Students
Name    Age  Student  Car      Phone
Peter   ⊥    no       Ford     ⊥
Alice   22   yes      ⊥        555 9876
Bob     27   yes      VW       555 4321
Charly  25   yes      Pontiac  ⊥
Paul    26   yes      Chevy    ⊥
Mary    24   yes      ⊥        ⊥

Figure 1: Two data sources with conflicting data and the result of data fusion; ⊥ denotes null values

When integrating data from both tables, inter-source conflicts (e.g., Bob and Charly) can occur on the data level, as well as intra-source conflicts (e.g., Paul). We distinguish between two kinds of conflicts: a) 'uncertainty' about the value, caused by missing information, i.e., null values in the table, and b) 'contradictions'. An example of the former is the age of Bob, which is missing in table EE Students; an example of the latter is the age of Charly, which is 25 in EE Students and 24 in CS Students. The challenge in fusing data from the two source tables into the third table of Figure 1 is how to resolve these conflicts efficiently and in a meaningful way. This problem was first mentioned by Dayal [4]. Since then, approaches and techniques have emerged, many of them trying to "avoid" the conflicts or only eliminating the uncertainty of missing values. However, there is still no easy way to specify a fused result table such as the one given in the example. Therefore we propose a new data fusion operator, which not only resolves uncertainties but also fuses tables by resolving occurring data conflicts. The operator comes with an extension of Sql, the Fuse By statement.

Data fusion is part of the integration process and follows the steps of schema matching, which resolves conflicts at the schema level, and duplicate detection, which solves the problem of identifying the different representations (tuples in the relational model) of one and the same real world entity. Dirty data is present in many scenarios, and data fusion operations are used in virtual as well as in materialized integration systems. Therefore an efficient execution of data fusion will be beneficial in a wide variety of applications.

Outline of the thesis work. The PhD work presented in this paper deals with the entire data fusion step in the context of the information integration process and will cover all major facets. Data fusion is realized as a relational operator. After evaluating previous attempts at relational data fusion along with a classification (Section 2), Section 3 proposes syntax and semantics of the data fusion operator and shows how to express a data fusion operation in Sql using the Fuse By extension. Section 4 deals with the transformation of logical query plans involving such a data fusion operator. Before summarizing, Section 5 presents the challenges in implementing data fusion and conflict resolution and in generating physical query plans.

Current status. As syntax and semantics have already been defined, we are currently working on an implementation of the operator and the conflict resolution as part of our research integration system HumMer [1]. First ideas on how to transform logical query plans have emerged and will lead to the next step of actually implementing logical and physical query plan optimization.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.
Proceedings of the 31st VLDB Conference, Trondheim, Norway, 2005
2 Classification of Data Fusion

A classification of previous attempts at data fusion according to completeness and conciseness is given in [2]. Nevertheless, we want to point out some important insights to emphasize the need for a specific data fusion operation. To a certain degree, fusing relational data can be done by standard relational operators. In comparing the different operators, the two notions of completeness and conciseness are helpful. In integrating data from different sources we generally want to increase completeness (more sources, more attributes, more tuples) but also conciseness (no duplicate tuples, no attributes meaning the same). The goal of data fusion is to produce a complete and concise result. As we will see, no operator or system to date fully achieves this.

Union-style operators generally increase extensional completeness (more tuples), whereas join-style operators generally increase intensional completeness (more attributes) and, in the case of outer joins (⟕, ⟖, and ⟗), also extensional completeness. Conciseness is high if the join attribute is a key; in unions, high conciseness is reached by removing exact duplicates (union ∪, outer union ⊎) and subsumed tuples (minimum union ⊕) [7, 15]. The match join operator used in the AURORA system [17], as well as the merge (⊠) and prioritized merge (◁) operators [8], are combinations of union and join operators. Some relational operators are able to remove uncertainties to a certain degree, but none resolves all present contradictions.

Grouping and aggregation is a way of achieving conciseness, but cannot readily be used for data fusion: a globally consistent ID needs to be provided and, more importantly, the standard aggregate functions are not sufficient to resolve the type of conflicts that arise in data fusion. Projects that tackle but do not fully solve data fusion with grouping and aggregation include AXL [16] and FraQL [13].

Besides the purely relational techniques, the three integrating information systems TSIMMIS [11], Hermes [14], and Fusionplex [10] achieve data fusion to a certain degree and explicitly mention the presence of conflicts and the need for resolving them. Both TSIMMIS and Hermes integrate semi-structured data from multiple sources using pre-defined rules in the mediator; Fusionplex operates on relational data. Although the description of Hermes is the most detailed concerning conflict resolution strategies, all three systems only perform conflict avoidance: the value to be returned is generally not created by taking into account all present conflicting values, but by choosing one value according to a preference measure on the sources (TSIMMIS, Hermes) or some quality metadata of the sources (Hermes, Fusionplex), such as recentness, trust, or accuracy.

3 Describing Data Fusion

To accomplish data fusion in relational systems we propose a new operator in the relational algebra, the data fusion operator φ. This n-ary operator fuses n input relations R = r_1, ..., r_n as described below. It takes as additional parameters a list of m attributes F = f_1, ..., f_m that determine same real world entities, a list of k conflict resolutions CR = cr_1, ..., cr_k, and a list of l attributes S = s_1, ..., s_l, the intra-group sort key. We write the operator as φ_{F,CR,S}(R). The data fusion operator represents a simple way of expressing queries that fuse multiple tuples describing the same object into one tuple while resolving uncertainties and contradictions, and it can be used with other relational algebra operators such as π, ⋈, or σ. In order to implement such an operator in an Sql engine, syntax and semantics first need to be defined.

The following description of the Fuse By extension and its semantics to support this data fusion operator is also part of [2], but should not be omitted here, as it is an integral part of the thesis work.

3.1 Syntax
The data fusion operator can easily be embedded into Sql by means of the Fuse By statement. This statement is based on the standard Sql syntax and resembles the Group By statement in both syntax and semantics. Figure 2 shows its syntax diagram.

    SELECT        { colref | RESOLVE(colref) | RESOLVE(colref, function) | * } [, ...]
    [FUSE] FROM   tableref [, ...]
    [where-clause]
    FUSE BY       ( colref [, ...] )
    [ON ORDER     colref [, ...]]
    [having-clause]
    [order-by-clause]
Figure 2: Syntax diagram of the Fuse By statement

Consider the example sources from the introduction and the simple data fusion operation φ_{Name, max(Age)}(EE Students, CS Students). This fuses the data on EE and CS students, leaving just one tuple per student. Students are identified by their name (F = Name) and conflicts in the age of the students are resolved by taking the higher age (CR = max(Age)). There is no order specified (S = ε). The corresponding Fuse By statement is:

    SELECT Name, RESOLVE(Age, max)
    FUSE FROM EE_Students, CS_Students
    FUSE BY (Name)

The statement specifies the attributes in the result (Name, Age) in the SELECT clause, together with the conflict resolution for the Age attribute (max). The tables being fused are given in the FUSE FROM clause; the attribute identifying real world entities is part of the FUSE BY clause. As there is no order specified, the ON ORDER clause is missing.

3.2 Semantics
The following description explains how such a data fusion operation is evaluated. Data fusion consists of two phases: First, all tuples from the sources are combined to form a single table (Step 1), thus increasing completeness. Second, conciseness is increased by grouping tuples representing the same real world object and resolving conflicts (Steps 2 through 4). We consider φ_{f_1,...,f_m, cr_1,...,cr_k, s_1,...,s_l}(r_1, ..., r_n).
Step 1: Increasing completeness. Tuples going into the fusion process are determined by evaluating the relations r_1, ..., r_n. All relations are given in the FUSE FROM clause and are combined by outer union. Please note that this is not the usual behaviour of a FROM clause (cross product). If the relations are to be combined using cross product or join (φ_{F,CR,S}(r_1 ⋈_{c_1} ... ⋈_{c_n} r_n)), FROM is used instead of FUSE FROM, together with the join condition(s) in the WHERE clause.

Step 2: Identifying tuples to be fused. The attributes given in F = f_1, ..., f_m are specified in the FUSE BY clause and determine which tuples are considered the same real world entity. Tuples are grouped based on these attributes. We hereby assume a globally unique and consistent identifier that we can use to do the grouping. This identifier may be produced by detecting duplicates and assigning equal keys to representatives of the same real world objects, or by using multiple columns as key. For this reason duplicate detection needs to be done prior to the fusion process. Additional WHERE conditions further limit the tuples going into the fusion process, but are separate selections before φ in the algebra representation. The tuples in the groups are ordered according to the sort key given in S = s_1, ..., s_l.

Step 3: Increasing conciseness. Next, exact duplicates and subsumed tuples are removed in every group. The removal of subsumed tuples is not a standard operation of the relational algebra, nor does a specific Sql statement exist for it. Rao et al. nevertheless show how subsumed tuples can be removed from a single table using Sql and its data warehouse extensions [12] and introduce the best match operator β, which removes subsumed tuples. All the remaining tuples of each group are then fused into a single remaining tuple, simultaneously resolving inconsistencies and data conflicts. This is done by applying conflict resolution functions to the attributes as given in CR = cr_1, ..., cr_k and in the RESOLVE parts of the SELECT clause. Different domains require different conflict resolution functions and strategies, which thus encapsulate the expert knowledge for fusing data in a domain. Nevertheless, there are many conflict resolution functions that are applicable in a wide variety of domains. More details on conflict resolution follow in Sec. 4.

Step 4: Shaping the result. Finally, only the attributes given in F and CR can form the final result (projection may apply). An additional selection and ordering on this result (e.g., τ(σ_C(φ_{F,CR,S}(R)))) is specified in the HAVING and ORDER BY clauses in the spirit of Group By.

To show the feasibility and usefulness of the data fusion operator, it is necessary to include it in a DBMS. As a first step we will enhance our own research prototype to support data fusion. This work is currently under way (see Section 5). In the long run, we plan on demonstrating how data fusion can be achieved with a commercially available DBMS.
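The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the operator implementation discussed in this paper: relations are lists of dictionaries, None stands in for ⊥, and the names fuse, outer_union, subsumes, maximum, vote, and coalesce are hypothetical. The text specifies only max for Age; the functions chosen here for Student, Car, and Phone are assumptions that happen to reproduce the result table of Figure 1.

```python
from collections import Counter, defaultdict

# Source tables of Figure 1; None stands in for the null value.
EE_STUDENTS = [
    {"Name": "Peter", "Age": None, "Student": "no", "Car": "Ford"},
    {"Name": "Alice", "Age": 22, "Student": "yes", "Car": None},
    {"Name": "Bob", "Age": None, "Student": "yes", "Car": "VW"},
    {"Name": "Charly", "Age": 25, "Student": "yes", "Car": "Pontiac"},
    {"Name": "Paul", "Age": 26, "Student": "yes", "Car": "Chevy"},
    {"Name": "Paul", "Age": None, "Student": "yes", "Car": "Chevy"},
]
CS_STUDENTS = [
    {"Name": "Alice", "Age": None, "Student": "yes", "Phone": "555 1234"},
    {"Name": "Bob", "Age": 27, "Student": None, "Phone": "555 4321"},
    {"Name": "Charly", "Age": 24, "Student": "yes", "Phone": None},
    {"Name": "Alice", "Age": 21, "Student": "no", "Phone": "555 9876"},
    {"Name": "Mary", "Age": 24, "Student": "yes", "Phone": None},
    {"Name": "Mary", "Age": 24, "Student": "yes", "Phone": None},
]


def outer_union(relations):
    """Step 1: union over the combined schema, padding missing columns with None."""
    schema = []
    for relation in relations:
        for row in relation:
            for column in row:
                if column not in schema:
                    schema.append(column)
    return [{c: row.get(c) for c in schema} for relation in relations for row in relation]


def subsumes(t, u):
    """True if t agrees with u on every non-null value of u and has more non-null values."""
    agrees = all(u[c] is None or u[c] == t[c] for c in u)
    filled_t = sum(v is not None for v in t.values())
    filled_u = sum(v is not None for v in u.values())
    return agrees and filled_t > filled_u


def non_null(values):
    return [v for v in values if v is not None]


def maximum(values):
    return max(non_null(values), default=None)


def vote(values):
    counts = Counter(non_null(values))
    return counts.most_common(1)[0][0] if counts else None


def coalesce(values):
    return next(iter(non_null(values)), None)


def fuse(relations, fuse_by, resolutions, sort_key=None):
    groups = defaultdict(list)
    for row in outer_union(relations):                        # Step 1
        groups[tuple(row[f] for f in fuse_by)].append(row)    # Step 2: group on F
    result = []
    for key, group in groups.items():
        if sort_key is not None:
            group = sorted(group, key=sort_key)               # intra-group order S
        # Step 3: drop exact duplicates, then subsumed tuples, then resolve conflicts.
        distinct = [r for i, r in enumerate(group) if r not in group[:i]]
        kept = [t for t in distinct if not any(subsumes(u, t) for u in distinct)]
        fused = dict(zip(fuse_by, key))
        for column, function in resolutions.items():
            fused[column] = function([t[column] for t in kept])
        result.append(fused)                                  # Step 4: only F and CR columns
    return result


fused = fuse(
    [EE_STUDENTS, CS_STUDENTS],
    fuse_by=["Name"],
    resolutions={"Age": maximum, "Student": vote, "Car": coalesce, "Phone": maximum},
)
for row in fused:
    print(row)
```

In this sketch the second Paul tuple is removed as subsumed and the two Mary tuples collapse as exact duplicates before any conflict resolution function is applied, which matters for duplicate sensitive functions such as vote.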
4 Logical Query Plans
The execution of data fusion operations also involves the optimization of logical as well as physical query plans. Like any other relational operator, φ may be part of a logical query plan. A first step in query optimization is the query rewrite phase, in which algebraic transformations are applied to the logical plan to produce equivalent, but (hopefully) more efficient plans.

Transforming Queries with Data Fusion. Improving logical query plans requires algebraic transformations involving data fusion. One very simple such transformation is σ_C(φ_{F,CR,S}(R)) = φ_{F,CR,S}(σ_C(R)), which holds as long as the selection condition C involves only attributes that are part of F. Based on the optimization of queries involving Group By [3, 9] we plan on determining such transformations for data fusion. This involves transformations for pushing fusion operators down a tree, performing early and partial fusion, moving selections and projections around fusion operators, and interchanging the order of joins and fusion. We plan to compile a comprehensive list of these transformations. One interesting question is in which special cases data fusion can be replaced by Group By, thus reusing already existing transformations and other functionality present in standard query optimizers. Nevertheless, in most cases the existing transformations involving Group By need to be enhanced, which is due to the intrinsic nature of data fusion, which allows for ordered groups and the removal of subsumed tuples. Similar to Group By optimization, the conflict resolution functions used play an important role in applying transformations. Therefore, an important part of this thesis work is to look at the implementation of conflict resolution, the different properties of conflict resolution functions, and how they influence query optimization.

Conflict resolution functions. Unlike standard aggregation, conflict resolution functions can also use the information given by the entire query context.
This query context consists not only of the conflicting values themselves, but also of the corresponding tuples, all remaining column values, and other metadata (e.g., column or table name). This enables many different and powerful ways to resolve conflicts. Building a catalog of conflict resolutions is one step towards useful data fusion and is currently under way. Besides the standard aggregation functions, interesting conflict resolution functions include: Vote, which takes the most frequently occurring value; Coalesce, which takes the first non-null value; Most Recent, which takes the most recent value; and Most specific ancestor, which, using a taxonomy, returns the first common ancestor of the values.
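The four functions named above can be sketched as follows. This is an illustration under assumptions rather than the catalog under construction: every function takes the conflicting values plus a context dictionary standing in for the query context, and the keys "timestamps" and "parent", as well as the small vehicle taxonomy, are hypothetical.

```python
from collections import Counter
from datetime import date


def vote(values, context=None):
    """Most frequently occurring non-null value."""
    counts = Counter(v for v in values if v is not None)
    return counts.most_common(1)[0][0] if counts else None


def coalesce(values, context=None):
    """First non-null value in the given order."""
    return next((v for v in values if v is not None), None)


def most_recent(values, context):
    """Non-null value whose tuple carries the latest timestamp (taken from the context)."""
    dated = [(ts, v) for v, ts in zip(values, context["timestamps"]) if v is not None]
    return max(dated, key=lambda pair: pair[0])[1] if dated else None


def path_to_root(value, parent):
    """The value followed by all of its ancestors in the taxonomy."""
    path = [value]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def most_specific_ancestor(values, context):
    """First node shared by the root paths of all non-null values."""
    present = [v for v in values if v is not None]
    if not present:
        return None
    paths = [path_to_root(v, context["parent"]) for v in present]
    for node in paths[0]:
        if all(node in path for path in paths[1:]):
            return node
    return None


# Hypothetical taxonomy, given as a child -> parent map.
TAXONOMY = {"Sedan": "Car", "Pickup": "Car", "Car": "Vehicle", "Bike": "Vehicle"}

print(vote(["yes", "yes", "no"]))
print(coalesce([None, 27, 24]))
print(most_recent(["555 1234", "555 9876"],
                  {"timestamps": [date(2005, 3, 1), date(2004, 7, 15)]}))
print(most_specific_ancestor(["Sedan", "Pickup"], {"parent": TAXONOMY}))
```

Vote and Coalesce need only the values, whereas Most Recent and Most specific ancestor cannot be computed without the additional context, which is what separates them from standard aggregates.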
Although most RDBMS do not explicitly support user defined aggregation, there are techniques like the one described in [16] that enable their use. Still, how to use information other than the column values when aggregating values in that case remains an open issue.

Properties of conflict resolution. Like other functions, conflict resolution functions possess different properties that may influence how the data fusion operator is executed and how query plans involving fusion are transformed. Functions may be commutative or associative. Like aggregation functions, they may be order dependent, which means that the order of the values influences the aggregated value. The Coalesce function is an example of an order dependent function, whereas Max and Count are not. Another important property is duplicate sensitivity, which means that the presence of exact duplicates influences the aggregated value. The Count function is duplicate sensitive whereas Max is not. Properties of aggregation functions have been studied in the fuzzy systems community [6], but mainly for numerical data. As data fusion involves non-numerical data as well, properties for these functions are also needed.
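Order dependence and duplicate sensitivity can be probed empirically on a sample of values. The sketch below is an assumption-laden illustration: the helper names looks_order_dependent and looks_duplicate_sensitive are hypothetical, and a check on one sample can only reveal that a property holds for that sample; it is not a proof about the function in general.

```python
from itertools import permutations


def coalesce(values):
    return next((v for v in values if v is not None), None)


def maximum(values):
    return max((v for v in values if v is not None), default=None)


def count(values):
    return sum(v is not None for v in values)


def looks_order_dependent(function, sample):
    """True if some reordering of the sample changes the result."""
    return len({function(list(p)) for p in permutations(sample)}) > 1


def looks_duplicate_sensitive(function, sample):
    """True if adding an exact copy of every value changes the result."""
    return function(sample) != function(sample + sample)


SAMPLE = [24, None, 25]
report = {
    f.__name__: (looks_order_dependent(f, SAMPLE), looks_duplicate_sensitive(f, SAMPLE))
    for f in (coalesce, maximum, count)
}
for name, (order_dependent, duplicate_sensitive) in report.items():
    print(name, order_dependent, duplicate_sensitive)
```

On this sample the checks agree with the classification in the text: Coalesce is order dependent, Count is duplicate sensitive, and Max is neither.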
5 Physical Query Plans
In transforming logical into physical query plans, specific implementations of the data fusion operator and the conflict resolution functions need to be chosen. Alternative plans also need to be evaluated based on a cost model.

5.1 Implementing Data Fusion
To support data fusion queries we implemented the data fusion operator in the context of our integrating information system HumMer [1]. A first step involved writing a parser that understands Sql including the Fuse By statement. During query execution the parse tree is converted into a logical query plan involving selections, projections, etc., and our fusion operator. After optimization the logical query plan is converted into a physical one that contains executable operators. We implemented the data fusion and additional operators based on the XXL library [5], which can be used to implement a database management system providing basic relational operators and a basic cost based optimizer. XXL uses the cursor concept to implement relational algebra operators. Our first implementation of data fusion is based on the grouping functionality already present within XXL. But as with the implementation of grouping and aggregation, there are different techniques and strategies for implementing a fusion cursor, all of which need to be considered and compared when creating physical query plans. One big difference between grouping and fusion influences the algorithm: fusion generally has a smaller number of tuples to aggregate than standard grouping.
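Building fusion on top of existing grouping functionality can also be tried against a stock SQL engine. The following is a minimal sketch under assumptions, unrelated to the XXL-based implementation in HumMer: it uses SQLite through Python's sqlite3 module, emulates the outer union by padding missing columns with NULL, and registers a hypothetical Vote class as a user defined aggregate. UNION removes exact duplicates, but subsumed tuples are not removed, and the built-in MAX (which ignores NULL values) is used for all columns other than Student.

```python
import sqlite3
from collections import Counter


class Vote:
    """User defined aggregate: most frequently occurring non-null value."""

    def __init__(self):
        self.counts = Counter()

    def step(self, value):
        if value is not None:
            self.counts[value] += 1

    def finalize(self):
        return self.counts.most_common(1)[0][0] if self.counts else None


conn = sqlite3.connect(":memory:")
conn.create_aggregate("vote", 1, Vote)
conn.executescript("""
    CREATE TABLE EE_Students (Name TEXT, Age INTEGER, Student TEXT, Car TEXT);
    CREATE TABLE CS_Students (Name TEXT, Age INTEGER, Student TEXT, Phone TEXT);
    INSERT INTO EE_Students VALUES
        ('Peter', NULL, 'no', 'Ford'), ('Alice', 22, 'yes', NULL),
        ('Bob', NULL, 'yes', 'VW'), ('Charly', 25, 'yes', 'Pontiac'),
        ('Paul', 26, 'yes', 'Chevy'), ('Paul', NULL, 'yes', 'Chevy');
    INSERT INTO CS_Students VALUES
        ('Alice', NULL, 'yes', '555 1234'), ('Bob', 27, NULL, '555 4321'),
        ('Charly', 24, 'yes', NULL), ('Alice', 21, 'no', '555 9876'),
        ('Mary', 24, 'yes', NULL), ('Mary', 24, 'yes', NULL);
""")

rows = conn.execute("""
    SELECT Name, MAX(Age), vote(Student), MAX(Car), MAX(Phone)
    FROM (
        -- outer union: pad the column each source lacks with NULL
        SELECT Name, Age, Student, Car, NULL AS Phone FROM EE_Students
        UNION
        SELECT Name, Age, Student, NULL AS Car, Phone FROM CS_Students
    )
    GROUP BY Name
    ORDER BY Name
""").fetchall()

for row in rows:
    print(row)
```

The aggregate only ever sees the values of a single column, so conflict resolution functions that need the rest of the tuple or other metadata cannot be expressed this way without further machinery.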
Just as the data fusion operation itself is inspired by grouping, the conflict resolution part is inspired by aggregation. Implementing conflict resolution functions therefore means implementing (user defined) aggregation functions. First experiments with several movie data sources and some basic conflict resolution functions showed good scalability of the data fusion operator. In enhancing an existing DBMS to support the data fusion operation, techniques such as table functions and user defined aggregates/functions become important.

5.2 Optimizing Data Fusion
Cost based optimization is the standard relational technique for generating physical query plans. Enhancing the standard cost model with estimates and rules for data fusion operators will enable cost based optimizers to also handle data fusion operations. One interesting issue will be the interesting orders produced or needed by data fusion, or its influence on possible join orders, as with Group By optimization [3]. The characteristic of fusion of dealing mainly with many small groups will influence not only the actual implementation but also the choice among different plans, as disk I/O may be less important because these small groups fit entirely into main memory. Another important question is how different index structures influence optimization.
6 Conclusion and Outlook
Relational data fusion is a problem that has not received much attention in recent years. Therefore, many interesting problems remain. We propose a relational data fusion operator and show how it can be embedded in a query language like Sql. Using the proposed Fuse By statement, a user may easily specify how to fuse data from different data sources and resolve the existing data conflicts. A prototypical implementation using XXL is part of our research integration system HumMer. We also plan to show how data fusion can be achieved using a standard RDBMS. In addition to a more efficient implementation of the underlying operations, next steps are finding transformations for reformulating query plans involving fusion and finding heuristics to support cost based query plan generation. The goal is to efficiently support the relational data fusion operation in relational systems. From there the road may lead to applying the knowledge gained to XML data fusion or to changing query plan generation to support quality based optimization.

Acknowledgment. This research was supported by the German Research Society (DFG grant no. NA 432). Thanks to my thesis advisor Felix Naumann for helpful discussions.
References
[1] A. Bilke, J. Bleiholder, C. Böhm, K. Draba, F. Naumann, and M. Weis. Automatic data fusion with HumMer. In Proc. of VLDB, Trondheim, Norway, 2005.
[2] J. Bleiholder and F. Naumann. Declarative data fusion - syntax, semantics and implementation. In Proc. of ADBIS, Tallinn, Estonia, 2005. To appear.
[3] S. Chaudhuri and K. Shim. Including group-by in query optimization. In Proc. of VLDB, pages 354–366, Santiago de Chile, Chile, 1994.
[4] U. Dayal. Processing queries over generalization hierarchies in a multidatabase system. In Proc. of VLDB, pages 342–353, Florence, Italy, 1983.
[5] J. V. den Bercken, B. Blohsfeld, J.-P. Dittrich, J. Krämer, T. Schäfer, M. Schneider, and B. Seeger. XXL - a library approach to supporting efficient implementations of advanced database queries. In Proc. of VLDB, pages 39–48, 2001.
[6] M. Detyniecki. Numerical aggregation operators: State of the art, July 2001.
[7] C. A. Galindo-Legaria. Outerjoins as disjunctions. In Proc. of SIGMOD, pages 348–358, Minneapolis, Minnesota, 1994.
[8] S. Greco, L. Pontieri, and E. Zumpano. Integrating and managing conflicting data. In Revised Papers from the 4th International Andrei Ershov Memorial Conference on Perspectives of System Informatics, pages 349–362, 2001.
[9] A. Gupta, V. Harinarayan, and D. Quass. Aggregate-query processing in data warehousing environments. In Proc. of VLDB, pages 358–369, 1995.
[10] A. Motro, J. Berlin, and P. Anokhin. Multiplex, Fusionplex, and Autoplex - three generations of information integration. SIGMOD Record, 33(4):51–57, 2004.
[11] Y. Papakonstantinou, S. Abiteboul, and H. Garcia-Molina. Object fusion in mediator systems. In Proc. of SIGMOD, pages 413–424, 1996.
[12] J. Rao, H. Pirahesh, and C. Zuzarte. Canonical abstraction for outerjoin optimization. In Proc. of SIGMOD, pages 671–682, Paris, France, 2004.
[13] K. Sattler, S. Conrad, and G. Saake. Adding conflict resolution features to a query language for database federations. In Proc. of EFIS, pages 41–52, Dublin, Ireland, 2000.
[14] V. S. Subrahmanian, S. Adali, A. Brink, R. Emery, J. Lu, A. Rajput, T. Rogers, R. Ross, and C. Ward. Hermes: A heterogeneous reasoning and mediator system. Technical report, University of Maryland, 1995.
[15] J. D. Ullman, H. Garcia-Molina, and J. Widom. Database Systems: The Complete Book. Prentice Hall PTR, 2001.
[16] H. Wang and C. Zaniolo. Using SQL to build new aggregates and extenders for object-relational systems. In Proc. of VLDB, pages 166–175, Cairo, Egypt, 2000.
[17] L. L. Yan and M. T. Özsu. Conflict tolerant queries in AURORA. In Proceedings of the Fourth IFCIS International Conference on Cooperative Information Systems, page 279, 1999.