QoM: Qualitative and Quantitative Schema Match Measure Naiyana Tansalarak and Kajal Claypool Department of Computer Science, University of Massachusetts - Lowell {ntansala,kajal}@cs.uml.edu http://www.cs.uml.edu/dsl/index.html
Abstract. Integration of multiple heterogeneous data sources continues to be a critical problem for many application domains and a challenge for researchers world-wide. Schema matching, a fundamental aspect of integration, has been a well-studied problem. However researchers have, for the most part, concentrated on the development of different schema matching algorithms, and their performance with respect to the number of matches produced. To the best of our knowledge, current research in schema matching does not address the issue of quality of matching. We believe that quality of match is an important measure that can not only provide a basis for comparing multiple matches, but can also be used as a metric to compare as well as optimize existing match algorithms. In this paper, we define the Quality of Match (QoM) metric, and provide qualitative and quantitative analysis techniques to evaluate the QoM of two given schemata. In particular, we introduce a taxonomy of schema matches as a qualitative analysis technique, and a weight-based match model that in concert with the taxonomy provides a quantitative measure of the QoM. We show, via examples, how QoM can be used to distinguish the “goodness” of one match in comparison with other matches.
Keywords: Schema Matching, Schema Integration, Quality of Matching
1 Introduction Integration of heterogeneous data sources continues to be a critical problem for many application domains and a challenge for researchers world-wide. Today there is a broad spectrum of information that is available in interconnected digital environments such as the Web, each with its own concepts, semantics, data formats, and access methods. Currently, the burden falls on the user to resolve conflicts, integrate the data, and interpret the results, a process that can take on the order of hours and days to accomplish, often leaving data under-exploited and under-utilized. An integral problem underlying this entire process is that of data integration, and in particular that of matching schema entities in an automated/semi-automated manner. Schema matching is the task of finding semantic correspondences between elements of two schemas [DR02]. Various systems and algorithms have been proposed over the years to automate this process of schema matching. While most approaches address
the problem for specific domains [BHP94,BCVB01,BM01], there have been a few approaches that tackle the problem independent of the domain [HMN+99,MBR01,DR02]. The proposed approaches exploit various types of schema information such as element names, data types, structural properties, ontologies, domain knowledge as well as characteristics of data instances. Typically, two schemas are provided as input and matches between the schemas, denoting correspondences between the entities of the two schemas, are produced as output by the match algorithm. Two entities are said to match if their similarity value is above a certain threshold. Calculation of the similarity value is largely dependent on the type of match algorithm used. For example, Madhavan et al. [MBR01] define the similarity value for structural matching as the fraction of leaves in the two subtrees that have at least one strong link to some leaf in the other subtree. A link in their case is said to be strong if its similarity value exceeds a pre-set threshold. Thresholds, on the other hand, are typically set in an ad-hoc manner. Similarity value and threshold together provide a measure of the quality of match that is produced by a system. Unfortunately, in current systems, calculation of the similarity value and the setting of the threshold value are tightly coupled to the individual algorithms, with no metric that can compare matches across the different algorithms. There has been no concerted effort to provide a metric that compares (1) the quality of match across different match algorithms (horizontal comparison); or (2) multiple matches that may be discovered for a given source entity (vertical comparison) via the same match algorithm. In this paper we define a Quality of Match (QoM) metric, and provide qualitative and quantitative analysis techniques to evaluate the QoM of two given schemata, independent of the actual match algorithm used.
In particular, we propose a first-of-its-kind match taxonomy, a qualitative analysis technique that categorizes the structural overlap and hence the information capacity of the given schemata. Our match taxonomy uses UML as its unifying data model, thereby broadening its applicability to relational, XML and OO schemas, which can all be expressed in the UML model. However, we find that while the match taxonomy provides a categorization of the matches at a high level, it cannot distinguish between matches within a given category. To enable distinction of matches within a category, we propose a quantitative measurement of QoM via a weight-based match model. The match model, based on the structural and informational aspects of a schema, quantitatively evaluates the quality of match, assigning it an absolute numeric value. Roadmap: The rest of the paper is organized as follows. Section 2 presents a formalization of the UML model. Section 3 presents the match taxonomy, while Section 4 describes the weight-based match model. Section 5 presents related work and we conclude in Section 6.
2 Background: The UML Model Today, much of the information exists in heterogeneous sources such as relational tables, objects, or XML documents. To integrate information from these heterogeneous sources and to reason over them, we must necessarily unify them in one common data model [RR87,MIR93]. Given the universal acceptance of the UML model and its suitability for modeling all aspects of the relational, object-oriented, and XML data models [CSF00], we have chosen the UML model as our common data model. In this section, we present a brief overview of the UML model, and provide some basic definitions that are used in the later sections of this paper. UML, the Unified Modeling Language, as defined by Rumbaugh, Jacobson, and Booch [Boo94], is a general purpose visual modeling language that can be used to specify, visualize, construct, and document the artifacts of a software system. While UML can model both the static structure and the dynamic behavior of systems, we are primarily interested in capturing schema structures in the static view [Boo94]. Classes, the cornerstone of the UML static view, consist of a class name, attributes, and methods. An attribute, identified by its label, has associated with it a set of properties that define its domain type, its scope, and possibly a set of initial values. A method, on the other hand, is identified by its signature, comprising its scope, return value, and a set of input parameters (possibly empty). In addition, a method has a set of pre- and post-conditions which define its behavior. Note that in the case of XML and relational models, no methods are defined for the classes. Formally, we define an attribute and a method as follows. Definition 1. An attribute a is defined as a 5-tuple a = <L, A, T, N, I> where L represents the label, A the set of applicable modifiers, T the domain type, either primitive or user-defined, N the cardinality, and I the list of possible initial value(s) of the attribute. Definition 2. A method m is defined as a 5-tuple m = <A, O, I, pre, post> where A is the set of applicable modifiers, O the return data type, I a finite set of input parameter data types, pre the precondition, and post the postcondition.
In Definitions 1 and 2, the applicable modifiers are the set of modifiers permissible in UML, namely, the access modifiers (private, public, and protected), the class-wide modifiers (static), and the constant modifier (final). Note that not all data models, XML and relational models for example, have a direct mapping to the modifiers. Default values are used when translations are done from these data models. The UML data model also defines a set of possible relationships (association, aggregation, and generalization and specialization) between classes. In our work, we translate all relationships into attributes of the given class based on the conversion rules [San95]. For example, an association with one-to-many cardinality between class A and class B is given by an attribute of the type A on class B. Similarly, a specialized class is represented by the set of all local and inherited attributes and methods. A class is thus defined as follows. Definition 3. A class c is defined as a 2-tuple c = <E, F> where E is a finite set of all attributes (local and inherited), and F is a finite set of all methods (local and inherited). Lastly, while the UML model does not specifically define the concept of a schema, we find it useful to define a schema simply as a collection of classes. Definition 4. A schema, S, is defined as a finite set of classes, <C>.
Footnote 1 (to Definition 2): The first three properties represent a method signature, while the last two properties denote a method specification.
Notation. We use the following notations throughout the rest of the paper. We use the notation Ss and St to represent the source and the target schemas respectively. In addition, C(Ss ) denotes a set of valid classes of Ss , M(Cs ) a set of valid attributes and methods of a class Cs ∈ Ss , |Ss | the number of classes (cardinality) of Ss , and |Cs | the number of attributes and methods (cardinality) of Cs . Similar notation is used for the target schema.
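As an illustration only, Definitions 1 to 4 and the notation above can be sketched as plain Python data types. The field names, the string encoding of types, cardinalities, and conditions, and the sample attribute values are assumptions made for this sketch; the paper itself fixes only the tuple structure.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Attribute:
    """Definition 1: a = <L, A, T, N, I>."""
    label: str                      # L
    modifiers: FrozenSet[str]       # A, e.g. {"private", "static"}
    domain_type: str                # T, primitive or user-defined
    cardinality: str                # N (string encoding is an assumption)
    initial_values: Tuple = ()      # I


@dataclass(frozen=True)
class Method:
    """Definition 2: m = <A, O, I, pre, post>."""
    modifiers: FrozenSet[str]       # A
    return_type: str                # O
    param_types: FrozenSet[str]     # I, a finite set of parameter types
    pre: str = "true"               # precondition (textual, an assumption)
    post: str = "true"              # postcondition (textual, an assumption)


@dataclass(frozen=True)
class UmlClass:
    """Definition 3: c = <E, F>, local and inherited members flattened."""
    attributes: FrozenSet[Attribute]
    methods: FrozenSet[Method] = frozenset()

    def members(self):
        """M(C): the set of attributes and methods of the class."""
        return self.attributes | self.methods

    def size(self):
        """|C|: the number of attributes and methods."""
        return len(self.members())


# Definition 4: a schema is a finite set of classes.
# The modifier, type, and cardinality below are hypothetical sample values.
name = Attribute("name", frozenset({"private"}), "String", "1")
recipe = UmlClass(frozenset({name}))
schema = frozenset({recipe})
```

Frozen dataclasses are used here so that attributes, methods, and classes are hashable and can be placed in sets, mirroring the set-based definitions.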
3 Qualitative Analysis: Taxonomy of Schema Matching We define the quality of match (QoM) metric as the measure of “goodness” of a given match. In this section, we focus on defining a qualitative measure of this goodness via a well-defined match taxonomy. Schema matching is typically based on the inherent hierarchy present in the schema structure, resulting in the comparison of attributes (and methods for OO schemas) at the lowest level, the comparison of containers (relations, classes, and elements), and the comparison of the schemas themselves. Each level of the comparison is tightly coupled to, and hence heavily dependent on, the level below it. Based on this hierarchy we now define a match taxonomy that categorizes the matches at the attribute (or method), class, and schema levels. We classify these matches as micro, sub-macro, and macro matches respectively.
Fig. 1. The Recipe Schema
Fig. 2. The Dish Schema
3.1 Micro Match In existing match algorithms [MBR01,DR02,BM01,BHP94,HMN+99], a match between two attributes is typically determined by the similarity (via linguistic matching) of their labels. In addition to label similarity, some algorithms [MBR01,DR02] also consider the domain type of the attributes to determine a match. A match between two methods, on the other hand, is typically done via a matching of the method signatures and/or the matching of method specifications as given by their pre- and post-conditions [ZW95,ZW97,JC95]. Most method matching algorithms do not take the label of the method into account [ZW95,ZW97,JC95]. Based on this existing work and the UML model presented in Section 2, we now define a match between attributes (or methods), termed a micro match, as a match of all properties of the attributes (or methods). The quality of match (QoM) for a micro match is categorized as either exact or relaxed. A micro match is said to be exact if all properties of the two attributes (or methods) as per Definition 1 (Definition 2) are either (a) identical or (b) equivalent. Assuming labels of the attributes are compared using linguistic match algorithms, identical labels imply that the two labels are either exactly the same or are synonymous. For all other properties, identical implies “exactly the same” semantics. As an example, consider schema Recipe and schema Dish given in Figures 1 and 2 respectively. Here, the attribute name of class Recipe is an exact match to the attribute name of class Dish as all properties including the label name are identical. On the other hand, the attribute name of class Recipe is not considered an exact match of the attribute qty of class Ingredient as their labels are neither identical nor synonymous (even though the other properties are identical). An equivalent match, used for method specifications, implies the logical equivalence of two method specifications. For example, the precondition count = count + 1 is equivalent to the precondition num = num + 1. Formally, an exact match for two attributes or two methods can be defined as follows. Definition 5.
A match between two attributes as ∈ Cs and at ∈ Ct is said to be exact, as =E at, if <Ls, As, Ts, Ns, Is> = <Lt, At, Tt, Nt, It>, where = denotes either an identical or a synonymous match between Ls and Lt, and identical matches for all other properties. Definition 6. A match between two methods ms ∈ Cs and mt ∈ Ct is said to be exact, ms =E mt, if (1) <As, Os, Is> = <At, Ot, It>, where = denotes an identical match; and (2) <pres, posts> ⇔ <pret, postt>, where ⇔ denotes an equivalent match. A micro match is said to be relaxed if (a) the labels of the attributes are related but not identical (approximate), for example, they may be hyponyms or may have the same stem; (b) the properties of one attribute (or method) are a generalization or a specialization of those of the other, for example, the access modifier public is considered to be a generalization of the modifier protected; or (c) for a method match, the pre- and post-conditions of the source method imply the pre- and post-conditions of the target method, or vice versa. For example, the precondition count > 10 implies the precondition count > 5. Consider again the schemas given in Figures 1 and 2. The attribute step of class Instruction has a relaxed match with the attribute direction of class Step as their labels are in the same word hierarchy (hyponym), but are not identical or synonymous. Formally, we define a relaxed match between two attributes (or methods) as follows.
Definition 7. A match between two attributes as ∈ Cs and at ∈ Ct is relaxed, as =R at, if <Ls, As, Ts, Ns, Is> ≈ <Lt, At, Tt, Nt, It>, where ≈ denotes an approximate match between the labels Ls and Lt, and either a generalized match (>) or a specialized match (<) between the remaining properties. Definition 8. A match between two methods ms ∈ Cs and mt ∈ Ct is relaxed, ms =R mt, if (1) <As, Os, Is> ≈sig <At, Ot, It>, where ≈sig denotes either a generalized match (>) or a specialized match (<); or (2) <pres, posts> ≈spec <pret, postt>, where ≈spec = {⇒ or ⇐} denotes that one set of pre- and post-conditions can imply the other set of pre- and post-conditions. 3.2 Sub-Macro Match A class, as per Definition 3, is defined as a set of attributes and methods. Thus, a match between two classes, referred to as the sub-macro match, can be compared on the basis of (1) the number of matched attributes (or methods); and (2) the quality of the micro matches. Based on the number of matched attributes (methods) between the source and target classes, the quality of match (QoM) at the sub-macro level is given as either a total or a partial match. In a total match, all attributes (or methods) of the source class match some or all attributes (or methods) of the target class one-to-one, while in a partial match some (but not all) attributes (or methods) of the source class match those in the target class. For example, the class Recipe in Figure 1 has total coverage in the class Dish in Figure 2, while the class Ingredient in Figure 1 has only partial coverage with respect to the class Item in Figure 2. Definition 9. A total match between two classes Cs ∈ Ss and Ct ∈ St, denoted as Cs =T Ct, is defined as a total and injective function over attributes and methods f : M(Cs) → M(Ct). Definition 10. A partial match between two classes Cs ∈ Ss and Ct ∈ St, denoted as Cs =P Ct, is defined as an injective function over attributes and methods f : M(Cs) → M(Ct).
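Returning to the attribute-level micro match of Section 3.1, the sketch below shows one possible reading of the exact and relaxed categories for attributes. It is not the authors' implementation. The synonym and related-label tables stand in for a linguistic matcher and are hypothetical, each attribute carries a single access modifier instead of a modifier set, the placement of private in the generality ordering is an assumption (the text states only that public generalizes protected), and all sample types and cardinalities are invented.

```python
from typing import NamedTuple, Optional


class Attr(NamedTuple):
    label: str
    modifier: str        # one access modifier; a simplification of the set A
    domain_type: str
    cardinality: str
    initial: tuple = ()


# Hypothetical lexical tables standing in for a linguistic match algorithm.
SYNONYMS = {frozenset({"qty", "quantity"})}
RELATED = {frozenset({"step", "direction"})}   # same word hierarchy
# Modifiers that can generalize or specialize one another (assumed set).
ORDERED_MODIFIERS = {"public", "protected", "private"}


def label_relation(ls: str, lt: str) -> Optional[str]:
    """Return 'E' for identical or synonymous labels, 'R' for related ones."""
    if ls == lt or frozenset({ls, lt}) in SYNONYMS:
        return "E"
    if frozenset({ls, lt}) in RELATED:
        return "R"
    return None


def modifier_relation(ms: str, mt: str) -> Optional[str]:
    """Return 'E' for identical modifiers, 'R' for generalized/specialized."""
    if ms == mt:
        return "E"
    if ms in ORDERED_MODIFIERS and mt in ORDERED_MODIFIERS:
        return "R"
    return None


def micro_match(a_s: Attr, a_t: Attr) -> Optional[str]:
    """Classify an attribute pair as exact ('E'), relaxed ('R'), or no match."""
    # In this sketch the remaining properties must be identical.
    if (a_s.domain_type, a_s.cardinality, a_s.initial) != (
        a_t.domain_type, a_t.cardinality, a_t.initial
    ):
        return None
    parts = [
        label_relation(a_s.label, a_t.label),
        modifier_relation(a_s.modifier, a_t.modifier),
    ]
    if None in parts:
        return None
    return "E" if all(p == "E" for p in parts) else "R"


name = Attr("name", "private", "String", "1")
print(micro_match(name, name))
print(micro_match(Attr("step", "private", "String", "1"),
                  Attr("direction", "private", "String", "1")))
```

With these tables, the name/name pair is classified as exact and the step/direction pair as relaxed, in line with the examples given in Section 3.1, while name against qty yields no match.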
Fig. 3. The QoM for Sub-Macro Matches.
Combining the two criteria, number of matches and the quality of micro match, we define four classifications for the QoM at the sub-macro level: (1) total exact, wherein all attributes and methods of the source class match exactly (exact micro match) the attributes and methods of the target class; (2) total relaxed, wherein all attributes and methods of the source class have either a relaxed micro match, or some combination of exact and relaxed micro matches in the target class; (3) partial exact, wherein some of the source attributes and methods have an exact micro match in the target class; and (4) partial relaxed, wherein some of the source attributes and methods have either a relaxed micro match, or some combination of exact and relaxed micro matches in the target class. Figure 3 diagrammatically depicts the possible QoMs for a sub-macro match. Here E and R denote an exact and a relaxed micro match respectively, while TE, TR, PE, PR denote total exact, total relaxed, partial exact, and partial relaxed matches respectively. Definition 11. A total exact match between two classes Cs ∈ Ss and Ct ∈ St , denoted as Cs =TE Ct , is a total and injective function over attributes and methods f : M(Cs ) → M(Ct ) where ∀ms ∈ M(Cs ) ∧ ∃mt ∈ M(Ct ) | (ms =E mt ). Definition 12. A total relaxed match between two classes Cs ∈ Ss and Ct ∈ St , denoted as Cs =TR Ct , is a total and injective function over attributes and methods f : M(Cs ) → M(Ct ) where ∀ms ∈ M(Cs )∧∃mt ∈ M(Ct ) | (ms =R mt ∨ms =E mt ) with at least one relaxed micro match (ms =R mt ). Definition 13. A partial exact match between two classes Cs ∈ Ss and Ct ∈ St , denoted as Cs =PE Ct , is an injective function over attributes and methods f : M(Cs ) → M(Ct ) where ∃ms ∈ M(Cs ) ∧ ∃mt ∈ M(Ct ) | (ms =E mt ). Definition 14. 
A partial relaxed match between two classes Cs ∈ Ss and Ct ∈ St , denoted as Cs =PR Ct , is an injective function over attributes and methods f : M(Cs ) → M(Ct ) where ∃ms ∈ M(Cs ) ∧ ∃mt ∈ M(Ct ) | (ms =R mt ∨ ms =E mt ) with at least one relaxed micro match (ms =R mt ). As an example consider once again the schemas in Figure 1 and 2. Here the class Recipe (Figure 1) has a total exact match with the class Dish (Figure 2). On the other hand, the class Ingredient in Figure 1 has only a partial exact match with class Item in Figure 2 as there is no match for the attribute id of class Ingredient. Similarly, the class Instruction in Figure 1 has a partial relaxed match with class Step in Figure 2 as the class Instruction provides only partial coverage for its attributes, and the micro matches are relaxed. 3.3 Macro Match As per Definition 4 (Section 2), a schema is defined as a collection of classes. Thus, a match between two schemas, referred to as the macro match, is dependent on (1) the number of matched classes in the schema; and (2) the quality of the sub-macro matches. Similar to the classification at the sub-macro level, we categorize the quality of match (QoM) at the macro level as either a total or a partial match based on the number of
sub-macro matches between the source and the target schemata. A match is total if all classes in the source schema match some or all classes in the target schema. Note that this is not a one-to-one correspondence as it is possible that two or more source classes map to one target class. A match is partial if some (but not all) of the source classes match some of the target classes. Definition 15. A total match between two schemas Ss and St , denoted as Ss =T St , is defined as a total function over classes f : C(Ss ) → C(St ). Definition 16. A partial match between two schemas Ss and St , denoted as Ss =P St , is defined as a function over classes f : C(Ss ) → C(St ).
Fig. 4. The QoM for Macro Matches.
Combining the two criteria given above, we now classify the QoM at the macro level as (1) total exact, wherein all classes of the source have a total exact sub-macro match in the target schema; (2) total relaxed, wherein all source classes have a total match in the target schema, with at least one total relaxed sub-macro match; (3) partial exact, wherein either all source classes have an exact sub-macro match in the target schema with at least one partial sub-macro match, or some (but not all) of the source classes have either partial or total exact sub-macro matches in the target schema; and (4) partial relaxed, wherein either all source classes have a sub-macro match in the target schema with at least one partial sub-macro match and one relaxed sub-macro match, or some (but not all) source classes have a sub-macro match in the target schema with at least one total relaxed sub-macro match or one partial relaxed sub-macro match. Figure 4 diagrammatically depicts the possible QoMs for a macro match. Definition 17. A total exact match between two schemas Ss and St, denoted as Ss =TE St, is defined as a total function over classes f : C(Ss) → C(St) where ∀Cs ∈ C(Ss) ∧ ∃Ct ∈ C(St) | (Cs =TE Ct).
Definition 18. A total relaxed match between two schemas Ss and St, denoted as Ss =TR St, is defined as a total function over classes f : C(Ss) → C(St) where ∀Cs ∈ C(Ss) ∧ ∃Ct ∈ C(St) | (Cs =TR Ct ∨ Cs =TE Ct) with at least one total relaxed sub-macro match (Cs =TR Ct). Definition 19. A partial exact match between two schemas Ss and St, denoted as Ss =PE St, is defined as either (1) a total function over classes f : C(Ss) → C(St) where ∀Cs ∈ C(Ss) ∧ ∃Ct ∈ C(St) | (Cs =PE Ct ∨ Cs =TE Ct) with at least one partial exact class match (Cs =PE Ct), or (2) a function over classes f : C(Ss) → C(St) where ∃Cs ∈ C(Ss) ∧ ∃Ct ∈ C(St) | (Cs =PE Ct ∨ Cs =TE Ct). Definition 20. A partial relaxed match between two schemas Ss and St, denoted as Ss =PR St, is defined as either (1) a total function over classes f : C(Ss) → C(St) where ∀Cs ∈ C(Ss) ∧ ∃Ct ∈ C(St) | ((Cs =PR Ct) ∨ (Cs =TE Ct) ∨ (Cs =PE Ct) ∨ (Cs =TR Ct)) with at least one partial relaxed sub-macro match (Cs =PR Ct), or (2) a function over classes f : C(Ss) → C(St) where ∃Cs ∈ C(Ss) ∧ ∃Ct ∈ C(St) | ((Cs =PR Ct) ∨ (Cs =TE Ct) ∨ (Cs =PE Ct) ∨ (Cs =TR Ct)) with at least one partial relaxed or total relaxed sub-macro match. Consider the schemas given in Figures 1 and 2. The schema Recipe has a partial relaxed match with the Dish schema. However, there is a total relaxed match between the Dish schema and the Recipe schema if the Dish schema is considered to be the source schema.
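The sub-macro and macro categories of Sections 3.2 and 3.3 can be sketched as two small functions, assuming the micro matches (and, one level up, the sub-macro labels) have already been computed by some match algorithm. The function names, the dictionary encoding of a match, and the member names in the usage example are assumptions for illustration. Where the prose classification and Definition 20 differ slightly (a total function with one partial and one relaxed sub-macro match), the sketch follows the prose.

```python
from typing import Dict, Hashable, Iterable, Optional, Tuple


def sub_macro(source_members: Iterable[Hashable],
              matches: Dict[Hashable, Tuple[Hashable, str]]) -> Optional[str]:
    """Classify a class-level match as 'TE', 'TR', 'PE', or 'PR'.

    matches maps a source member to (target member, 'E' or 'R').
    Returns None when no member is matched at all.
    """
    source = set(source_members)
    if not matches:
        return None
    if not set(matches) <= source:
        raise ValueError("match refers to a member outside the source class")
    targets = [target for target, _ in matches.values()]
    if len(set(targets)) != len(targets):
        raise ValueError("match is not injective (Definitions 9 and 10)")
    coverage = "T" if set(matches) == source else "P"
    quality = "E" if all(q == "E" for _, q in matches.values()) else "R"
    return coverage + quality


def macro(source_classes: Iterable[Hashable],
          sub: Dict[Hashable, str]) -> Optional[str]:
    """Classify a schema-level match from per-class sub-macro labels.

    sub maps each matched source class to 'TE', 'TR', 'PE', or 'PR'.
    Several source classes may share one target class, so injectivity
    is not required at this level.
    """
    if not sub:
        return None
    all_matched = set(sub) == set(source_classes)
    labels = list(sub.values())
    partial = (not all_matched) or any(l.startswith("P") for l in labels)
    relaxed = any(l.endswith("R") for l in labels)
    return ("P" if partial else "T") + ("R" if relaxed else "E")


# Ingredient vs. Item: attribute id has no match (target name is hypothetical).
print(sub_macro(["id", "qty"], {"qty": ("quantity", "E")}))
# Recipe schema vs. Dish schema, using the class-level results from the text.
print(macro(["Recipe", "Ingredient", "Instruction"],
            {"Recipe": "TE", "Ingredient": "PE", "Instruction": "PR"}))
```

For the class-level labels reported in the running example (Recipe total exact, Ingredient partial exact, Instruction partial relaxed), the schema-level result is partial relaxed, which agrees with the classification stated in the text for the Recipe schema against the Dish schema.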
4 Quantitative Analysis: Weight-Based Match Model In Section 3, we have presented a qualitative technique for evaluating the quality of match (QoM) between two schemata. Based on the qualitative analysis it can be observed that the QoM for an exact match is typically better than the QoM for a relaxed match. Similarly, the QoM for a total match is generally better than the QoM for a partial match. Moreover, we can observe that in general the quality of match is guaranteed to be better if the match is total exact. However, we find that qualitative analysis alone cannot accurately determine the distinction between a total relaxed, a partial exact, or a partial relaxed match. To address this, in this section we provide a weight-based match model that quantitatively determines and ranks the QoM. We define this quantitative model at each level of the match taxonomy. 4.1 Micro Match Model Micro matches (refer to Section 3.1) are classified as either exact or relaxed based on the matches between the properties of the two attributes (or methods). Recall that each property of a source attribute is compared to the corresponding property of the target attribute, and is determined to be either identical (=), equivalent (⇔), or relaxed (≈), where ≈ = {>,