Towards Evaluating Ontology-Based Data Matching Strategies: Matching Strategies, Evaluation Methodology and Results
Yan Tang
VUB STARLab, Department of Computer Science, Free University of Brussels, Brussels, Belgium
[email protected]

Ellen Leenarts
Offices Minerva & Mercurius, British Telecom, Herikerbergweg 2, Amsterdam Zuid-Oost, The Netherlands
[email protected]

Robert Meersman
VUB STARLab, Department of Computer Science, Free University of Brussels, Brussels, Belgium
[email protected]

Ioana-Georgiana Ciuciu
VUB STARLab, Department of Computer Science, Free University of Brussels, Brussels, Belgium
[email protected]

Kevin Pudney
The Social Care Institute for Excellence (SCIE), Goldings House, 2 Hay's Lane, London SE1 2HB
[email protected]

Abstract—In the EC FP6 Prolix project, a generic ontology-based data matching framework (ODMF) has been developed to enhance competency management and e-learning processes. In this paper, we focus on the ODMF matching strategies and on a methodology for evaluating them. We discuss the evaluation principles and evaluation criteria in general. This evaluation methodology has been applied in enterprise settings and validated. The evaluation results and the main conclusions concerning the evaluation methodology are explained.

Keywords-ontology; ontology-based data matching; evaluation methodology; human resource management; e-learning
I. INTRODUCTION
In the EC FP6 Prolix project, we have developed a generic ontology-based data matching framework (ODMF). With regard to evaluation schemas or benchmarks, no generic evaluation approach exists for ontology-based data matching or ontology-based data schema matching. Together with the ODMF itself, the evaluation methodology is therefore the motivation of this paper and our main contribution to the domain. The ODMF contains matching algorithms originally intended for: 1) matching strings, such as the ones from SecondString [5], in particular UnsmoothedJS [9][10][22], JaroWinklerTFIDF [9][10][22] and TFIDF (term frequency–inverse document frequency, [17]); 2) matching lexical information, for example using WordNet [6]; and 3) matching concepts in an ontology graph. There are several ontology-based data matching strategies in the ODMF; each strategy contains at least one graph algorithm.
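To make the string-level matching concrete, the sketch below shows a toy TF-IDF cosine similarity between two short strings. It is a minimal stand-in written for this paper, not the SecondString implementation used in the ODMF; the tokenization, the IDF smoothing and the example corpus are our own assumptions.

import math
from collections import Counter

def tfidf_cosine(a, b, corpus):
    """Cosine similarity of two strings under TF-IDF term weights.

    A toy stand-in for the TFIDF matcher from SecondString: terms are
    whitespace tokens and document frequencies come from `corpus`.
    """
    docs = [doc.lower().split() for doc in corpus]
    n_docs = len(docs)

    def idf(term):
        df = sum(1 for doc in docs if term in doc)
        return math.log((n_docs + 1) / (df + 1)) + 1.0  # smoothed IDF

    def vector(s):
        tf = Counter(s.lower().split())
        return {t: freq * idf(t) for t, freq in tf.items()}

    va, vb = vector(a), vector(b)
    dot = sum(w * vb.get(t, 0.0) for t, w in va.items())
    norm_a = math.sqrt(sum(w * w for w in va.values()))
    norm_b = math.sqrt(sum(w * w for w in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Example with an invented background corpus of competency descriptions.
corpus = ["communicate with individuals",
          "complete records for individuals",
          "coaching for performance"]
print(tfidf_cosine("coaching for performance", "performance coaching", corpus))
# high similarity, since "coaching" and "performance" are shared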
The use case of the ODMF is created with the Prolix test beds – British Telecom (BT, http://www.bt.com) and the Social Care Institute for Excellence (SCIE, http://www.scie.org.uk). In addition, the evaluation methodology discussed in this paper is implemented and tested at BT. This paper focuses on the ODMF matching strategies and on an evaluation methodology for evaluating the different ODMF matching strategies, and demonstrates the results. The paper is organized as follows: chapter II presents the related work. Chapter III gives the paper background. All the ODMF matching strategies that are evaluated with this methodology are illustrated in chapter IV. Chapter V covers the evaluation methodology. We illustrate the evaluation results in chapter VI. Chapter VII presents the conclusion and future work.
II. RELATED WORK
A general evaluation methodology for ontology-based data matching does not exist; existing evaluation methods are ad hoc and application specific. Related work on the types of evaluation methods is described below. Program evaluation is the systematic collection of information about the activities, characteristics and outcomes of programs to make judgments about the program, improve program effectiveness, and inform decisions about future programming [15]. Another definition of program evaluation is that it is "the systematic assessment of the operation and/or outcomes of a program or policy, compared to a set of explicit or implicit standards, as a means of contributing to the
improvement of program or policy". Examples of such evaluation methods are [4] and [12]. Utilization-focused evaluation [19] is a comprehensive approach to doing evaluations that are useful, practical, ethical and accurate. Examples of such methods are the evaluation methods for non-experimental data [3], which show how to use non-experimental methods to evaluate social programs. Purpose-oriented evaluation methodologies [2] contain three kinds of evaluation – formative evaluation, pretraining evaluation and summative evaluation. Formative evaluation focuses on the process. Pretraining evaluation focuses on judging the value before the implementation. Summative evaluation focuses on the outcome. Other evaluation types, which are not directly linked to our work, are product evaluation, personnel evaluation, self evaluation, advocacy evaluation, policy evaluation, organizational evaluation and cluster evaluation. We refer to [8][14][16] for examples of product evaluation, self evaluation and cluster evaluation methodologies.
III. BACKGROUND

The ontology in this paper follows the paradigm of Developing Ontology Grounded Methodology and Applications (DOGMA, [13][18]). In DOGMA, an ontology contains a set of lexons and commitments.

A lexon is defined as a quintuple ‹γ, t1, r1, r2, t2› representing a fact type. t1 and t2 are two terms. γ is a context identifier that points to a context where t1 and t2 are originally defined and disambiguated. r1 and r2 are two roles that t1 and t2 can play. For example, ‹school, teacher, teaches, is taught by, student› represents the fact that "in the context of school, a teacher teaches a student and a student is taught by a teacher". A commitment contains a constraint on a (set of) lexon(s). For instance, we can apply the mandatory constraint to the above lexon – "each teacher teaches at least one student". A commitment needs to be specified in a commitment language, such as OWL1 or the SDRule language [20].

1 http://www.w3.org/TR/owl-ref/
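A lexon and a simple mandatory commitment can be represented directly as data structures. The sketch below is a minimal illustration in Python; the class and field names are ours, not part of DOGMA or the ODMF.

from dataclasses import dataclass

@dataclass(frozen=True)
class Lexon:
    """A DOGMA lexon <context, term1, role, co-role, term2> representing a fact type."""
    context: str   # gamma: points to where term1/term2 are defined and disambiguated
    term1: str
    role: str      # role that term1 plays with term2
    co_role: str   # role that term2 plays with term1
    term2: str

@dataclass(frozen=True)
class MandatoryCommitment:
    """A simple constraint over one lexon: each term1 must play `role` with at least one term2."""
    lexon: Lexon

# The running example from the text:
teaches = Lexon("school", "teacher", "teaches", "is taught by", "student")
# "Each teacher teaches at least one student."
commitment = MandatoryCommitment(teaches)
print(teaches, commitment, sep="\n")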
IV. ODMF MATCHING STRATEGIES

The ODMF is a collection of matching algorithms and matching strategies. In this paper, we focus on evaluating four ODMF matching strategies – the Lexon Matching Strategy (LeMaSt), Ontology-based Graph Matching (OntoGraM) version 1, OntoGraM version 2, and Controlled Fully Automated Ontology Assisted Matching (C-FOAM). The compared objects are two competency objects, which can be annotated with a competency ontology [21].

A. LeMaSt

LeMaSt is a composition of a string matching algorithm and a lexical matching algorithm for calculating the similarity of two lexon sets.
S = Σi (Ci × Si)    (1)

The final similarity score S is calculated with formula (1), where Si is the score at matching level i and Ci is the contribution weight for Si. There are three matching levels in LeMaSt: 1) two lexons have the same terms and the same roles; 2) two lexons have the same terms and different roles; 3) two lexons have the same roles and different terms.
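As an illustration of formula (1), the sketch below scores two small lexon sets at the three matching levels and combines the level scores with contribution weights. It is our own minimal reading of the formula; the weights and the normalization are placeholders, not the ODMF defaults.

from collections import namedtuple

# Same five fields as a DOGMA lexon: <context, term1, role, co-role, term2>.
Lexon = namedtuple("Lexon", "context term1 role co_role term2")

def lemast_similarity(set_a, set_b, weights=(0.5, 0.3, 0.2)):
    """Toy reading of formula (1): S = sum_i Ci * Si over the three matching levels.

    Level 1: same terms and same roles; level 2: same terms, different roles;
    level 3: same roles, different terms. `weights` are placeholder values for Ci.
    """
    same_terms = lambda a, b: {a.term1, a.term2} == {b.term1, b.term2}
    same_roles = lambda a, b: {a.role, a.co_role} == {b.role, b.co_role}

    def level_score(match):
        # Fraction of lexon pairs that match at this level (a crude normalization).
        hits = sum(1 for a in set_a for b in set_b if match(a, b))
        return hits / max(len(set_a) * len(set_b), 1)

    s1 = level_score(lambda a, b: same_terms(a, b) and same_roles(a, b))
    s2 = level_score(lambda a, b: same_terms(a, b) and not same_roles(a, b))
    s3 = level_score(lambda a, b: same_roles(a, b) and not same_terms(a, b))
    return sum(c * s for c, s in zip(weights, (s1, s2, s3)))

a = {Lexon("school", "teacher", "teaches", "is taught by", "student")}
b = {Lexon("school", "teacher", "instructs", "is instructed by", "student")}
print(lemast_similarity(a, b))  # only level 2 (same terms, different roles) contributes: 0.3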
B. OntoGraM (Version 1 and Version 2)

The OntoGraM strategy calculates the similarity of two competency objects based on their semantic relations, such as classification, properties, and WordNet relations.
Figure 1. Ontology model for matching function of OntoGraM
We use the approaches of model-based reasoning [11], rule-based reasoning [7] and the competency ontology [21] in OntoGraM. Figure 1 shows the model that describes the kernel concepts for the matching components in OntoGraM. These concepts are further defined in the competency ontology. For instance, we can compare a person and a task by calculating the similarity between the set of all the competencies required for the task and the set of competencies of the person. The set of competencies of the person is the total of all the competencies for his functions and tasks, plus the end competencies of his qualifications. Lexical resources, such as WordNet and a user-defined dictionary, are used to assist this comparison process. There are two versions of the OntoGraM strategy. In version 1, we use basic semantic relations, such as those shown in Figure 1 and the holonym-meronym WordNet relationship. In version 2, extra semantic relations, such as "is possibly related to" and "is very similar to", are interpreted into similarity scores.

C. C-FOAM

The C-FOAM strategy contains two important modules: the interpreter and the comparator. The interpreter module makes use of the lexical dictionary (WordNet), the domain ontology and string matching algorithms to interpret end users' input. Given a term that denotes either (a) a concept in the domain ontology, or (b) an instance in the ontology, the interpreter will return the
correct concept(s) defined in the ontology or in the lexical dictionary, together with an annotation set of this concept. When users provide input recursively, the interpretation process is executed automatically and recursively.
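The interpreter step can be pictured as a two-stage lookup: fuzzy string matching against the terms defined in the ontology, then a lexical (dictionary/WordNet-style) fallback. The sketch below uses difflib from the Python standard library as a stand-in for JaroWinklerTFIDF and a hand-made synonym dictionary as a stand-in for WordNet; the term lists, names and thresholds are invented for illustration.

import difflib

ONTOLOGY_TERMS = {"heart", "helpful", "trustworthy", "straightforward"}            # toy domain ontology
LEXICAL_DICT = {"hearty": "heart", "warmhearted": "heart", "obliging": "helpful"}  # toy WordNet stand-in

def interpret(user_input, string_threshold=0.8):
    """Map free user input onto a defined ontology concept, or None if nothing matches."""
    token = user_input.strip().lower()
    # Stage 1: fuzzy string matching against ontology terms (stand-in for JaroWinklerTFIDF).
    best = max(ONTOLOGY_TERMS, key=lambda t: difflib.SequenceMatcher(None, token, t).ratio())
    if difflib.SequenceMatcher(None, token, best).ratio() >= string_threshold:
        return best
    # Stage 2: lexical fallback (stand-in for the WordNet / user dictionary lookup).
    return LEXICAL_DICT.get(token)

print(interpret("hearty"))        # 'heart' via the string matching stage
print(interpret("warmhearted"))   # 'heart' via the lexical dictionary fallback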
Figure 2. C-FOAM model: pre-processing by the interpreter (matching at string level and at lexical level, each producing a score + penalty), followed by the ontology-based comparator (graph matching, producing the similarity score)

Figure 3. Annotation of "heart" in ODMatcher2
There are two penalty values in the interpreter module (Figure 2). The first one is the threshold for the internal output of the string matching; the filtered terms become the input for the lexical searching components. The second penalty value is used to filter the output of the lexical searching components. For instance, when a user enters the string "hearty" or "warmhearted", C-FOAM finds the defined concept "heart" in the domain ontology and its annotation (Figure 3) using JaroWinklerTFIDF (the string matching algorithm) and WordNet (the lexical dictionary). In our test setting, OntoGraM is used in the comparator. The comparator can also use any combination of the different graph algorithms to produce a composite score. For example, if the user wants to calculate the score between two competences using three graph algorithms (OntoGraM version 1, OntoGraM version 2 and LeMaSt), he should specify a positive percentage for each of these algorithms, say 30%, 35% and 35%. Each algorithm with a positive percentage set will be called by the comparator. That is, the comparator starts a separate thread for each of these algorithms and monitors these threads within a predefined maximum period, or until all results are available. If a thread does not provide a result within the maximum period, that thread is stopped. For the total score, only the returned scores are taken into account. If, for example, OntoGraM version 2 does not return a result within the maximum period, then the total score is calculated as (0.3 x ScoreGraph1 + 0.35 x ScoreLeMaSt) / (0.3 + 0.35).
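The comparator behaviour described above – run each selected algorithm in its own thread, wait up to a maximum period, and renormalize over the algorithms that returned – can be sketched as follows. This is our own illustration of the described logic, not the ODMF code; the algorithm callables, weights and timeout are placeholders, and the per-future timeout is a simplification of a single overall deadline.

from concurrent.futures import ThreadPoolExecutor, TimeoutError

def composite_score(algorithms, obj_a, obj_b, max_seconds=1.0):
    """Weighted combination of per-algorithm scores, renormalized over the ones
    that return within `max_seconds`, e.g. (0.3*s1 + 0.35*s2) / (0.3 + 0.35).

    `algorithms` maps a name to a (weight, scoring_function) pair.
    """
    with ThreadPoolExecutor(max_workers=len(algorithms)) as pool:
        futures = {name: pool.submit(fn, obj_a, obj_b)
                   for name, (weight, fn) in algorithms.items()}
        weighted_sum, weight_sum = 0.0, 0.0
        for name, future in futures.items():
            weight, _ = algorithms[name]
            try:
                score = future.result(timeout=max_seconds)
            except TimeoutError:
                continue  # this algorithm is ignored in the total score
            weighted_sum += weight * score
            weight_sum += weight
    return weighted_sum / weight_sum if weight_sum else 0.0

# Placeholder scoring functions standing in for OntoGraM v1, OntoGraM v2 and LeMaSt.
algorithms = {
    "ontogram_v1": (0.30, lambda a, b: 0.6),
    "ontogram_v2": (0.35, lambda a, b: 0.4),
    "lemast":      (0.35, lambda a, b: 0.5),
}
print(composite_score(algorithms, "competence A", "competence B"))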
V. THE EVALUATION METHODOLOGY
The ODMF evaluation methodology adapts the principles of the methodologies for program evaluation [15] and purpose-oriented evaluation [1][2]. The principle in the program evaluation methodologies is that the evaluation methodology needs to help a system to improve its services and/or functions, and also to ensure that the system is delivering the right services and/or functions. In our problem setting, the ODMF evaluation methodology needs to help the ODMF to improve the matching results, and to ensure that the ODMF is delivering the correct matching scores.

The principles in the purpose-oriented evaluation methodologies are: 1) the evaluation methodology must be able to determine what information in the process of a system is important, so that the engineers can analyze the processing information; 2) the evaluation methodology needs to test and collect continuous feedback in order to revise the process; 3) the evaluation methodology must include a precondition analysis and a post-condition analysis of the evaluated systems; 4) the end users must be able to judge the outcome of a system based on the evaluation methodology. In our problem setting, the ODMF evaluation methodology needs to determine the information produced during the process of the ODMF, with which we analyze which matching strategy performs best within certain contexts. The ODMF evaluation methodology needs to continuously analyze the comparison between users' expected similarity scores and the similarity scores calculated by the ODMF. The ODMF evaluation methodology needs to include a precondition analysis and a post-condition analysis of the ODMF and its use case. The ODMF evaluation methodology needs to provide a mechanism with which the end users of the ODMF can justify the matching results.

The above evaluation principles give an overview of evaluating the ODMF at a general level (macro judgment). They are the fundamental issues for judging the quality of the ODMF.
2 ODMatcher is developed to support evaluating the ODMF matching strategies.
In addition, we draw evaluation criteria for the ODMF matching strategies, as shown in TABLE I. These criteria are used to evaluate the ODMF at a detailed level (micro judgment). The reasons why we chose these criteria are explained in the column "Motivation/Evaluation principle(s)".

TABLE I. EVALUATION CRITERIA FOR THE ODMF MATCHING STRATEGIES

Evaluation criteria | Explanation | Motivation/Evaluation principle(s)
Difficulty of managing the required knowledge resource | To check whether it is difficult to manage the knowledge base of a strategy. | To evaluate whether this specific strategy is useful and easy to use.
Difficulty of using the strategy | To check whether it is difficult to adjust the parameters of a strategy. | To evaluate whether this specific strategy is useful and easy to use.
Results of the matching strategy | To check whether the similarity scores match users' expectations. | To continuously analyze (with different parameters) the comparison between users' expected similarity scores and the similarity scores calculated by this strategy.
What affects the matching score | To find the factors with which this strategy delivers the right services and good functional results. | To evaluate whether this strategy is delivering the right services and good functional results; to help a system that uses this strategy to improve its services and/or functions.
Advantage and disadvantage | To explain the situations in which this strategy is applicable and inapplicable. | To evaluate whether this strategy is delivering the right services and good functional results; to help a system that uses this strategy to improve its services and/or functions.
Performance analysis | To check whether it is expensive to run a strategy. | To evaluate whether this strategy is delivering the right services and good functional results; in Prolix, ODMF-CA is required to provide a score within 1 second.

Based on the above discussions, the ODMF evaluation methodology contains the following steps, in line with the evaluation principles:

• Step 1 (preparation step): design the general use case – scope the problem. Clear requirements for a viable use case for the ODMF need to be initialized. The design terms are gathered and analyzed from the test beds' materials. The output of this activity is a report of a general use case.

• Step 2 (preparation step): design a detailed use case – specify the problem. During this activity, the types of information used by the ODMF are specified. The goal of this activity is to further specify the design terms from step 1. These terms include the process information of the ODMF. Examples of specified design terms are actors3 and triggers4. Note that precondition and post-condition analysis need to be performed as well. The output of this activity is a report of a detailed use case.

• Step 3: design test and evaluation data. This activity is to design the test data in order to scope the test problem. The data is used by the ODMF, not by the end users, and is used in step 6. The output of this activity is a report of the listed test and evaluation data.

• Step 4: design a test suite. This activity is to design a test suite, the data of which is provided by the end users and used by the knowledge engineers in step 5. The test suite is designed based on the results from step 1 and step 2. The output of this activity is a report of a test suite.

• Step 5: compare the results from the ODMF with the end users' expected results. The output of this activity is a report of a comparison (ODMF similarity scores vs. end user expected similarity scores).

• Step 6: analyze and conclude. This activity is to analyze the comparison report from step 5 and draw valuable conclusions. The output of this activity is a report of comparison analysis and conclusions.
In the following subsections, we discuss the output of each activity in this evaluation methodology.

A. Step 1 and Step 2 – Design the Use Case for the Evaluation

The use case is developed as a story (TABLE II). Note that company information is hidden for privacy reasons. A code sketch of the gap-analysis episodes follows the table.
TABLE II. A DETAILED USE CASE STORY

Title: ODMF for learning material recommendation
ID: OMMR_V1.0
Scope: e-learning, training, material recommendation, ontology-based gap analysis
Purpose: This story describes the use case of using the ontology-based gap analysis framework for the recommendation of learning materials for the company.
Settings:
S1 An employee has competencies, which can be evaluated by reviewers and stored in a Development & Performance Review (DPR).
S2 Learning materials contain the methods of improving the skills of BT employees. The formats of these learning materials vary from documents to multimedia resources.
Characters:
C1 Every employee at BT has one function. His function (level) gets raised when he gets a good evaluation.
C2 The reviewer evaluates the employee. The evaluation result, which is stored in a DPR, is the input of the recommendation process.
C3 The trainer is responsible for training BT employees with appropriate materials.
Episodes: Episode I – use case that involves a graph matching algorithm
EI-1 An employee gets the DPR from a reviewer. The employee's actual performance rating of the values is recorded in the DPR.
EI-2 The employee provides input. (annotated)
EI-2.1 (mandatory) The employee provides the actual competency scores from the DPR ♣
EI-2.2 (optional) The employee provides the expected competency scores as input 2.
EI-3 The ODMF performs the calculation.
EI-3.1 If input 2 is not provided, then capacities with the value NI are collected as the capacity set that needs to be improved.
EI-3.2 If input 2 is provided, then capacities with a value that is lower than expected are collected as the capacities that need to be improved.
EI-3.3 The framework collects all the capacities that do not need to be improved.
EI-3.4 The framework compares the gap between the capacities in the set of existing capacities and the capacities in the set of capacities that need to be improved.
EI-3.4.1 The framework generates two networks (two graphs) in the ontology.
EI-3.4.2 The framework combines the graphs of the capacities in the first set into one graph. Suppose this graph is graph 1. ♣
EI-3.4.3 The framework combines the graphs of the capacities in the second set into one graph. Suppose this graph is graph 2. ♣
EI-3.4.4 The framework compares graph 1 and graph 2. ♥♣
EI-3.4.5 The framework finds the difference between graph 1 and graph 2; an internal competency gap set is generated. Suppose this gap set is gap 1. ♣
EI-3.4.6 The framework finds the learning materials that are annotated with the concepts in gap 1. ♣
EI-4 The matching framework generates the output.
EI-4.1 Output 1: a set of recommended learning materials.
EI-4.2 Output 2: reasons for the recommendation, e.g. each learning material is illustrated with relevant competency concepts in the ontology, and each capacity is also illustrated with relevant competency concepts in the ontology. ♣
EI-4.3 Output 3: others, e.g. the steps of graph matching (log) ♥

3 An actor is a person or other entity, external to the ODMF being specified, who interacts with the ODMF and performs test cases or use cases to accomplish test tasks.
4 A trigger is an identifier of the event that initiates the use case or test case.
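The gap-analysis episodes EI-3.1 through EI-4.2 amount to selecting the capacities that need improvement, taking their difference with the existing capacities, and then looking up learning materials annotated with the missing concepts. The sketch below is our own simplification: plain sets stand in for the ontology graphs, and the capacity names and annotation data are invented.

def recommend_materials(dpr_scores, expected_scores, annotations):
    """Toy version of episodes EI-3.x / EI-4.x with plain sets instead of ontology graphs.

    dpr_scores: actual capacity ratings from the DPR (numbers, or the string "NI").
    expected_scores: expected ratings (input 2), or None if not provided.
    annotations: learning material id -> set of capacity concepts it is annotated with.
    """
    if expected_scores is None:
        # EI-3.1: capacities rated "NI" need to be improved.
        to_improve = {c for c, v in dpr_scores.items() if v == "NI"}
    else:
        # EI-3.2: capacities rated lower than expected need to be improved.
        to_improve = {c for c, v in dpr_scores.items()
                      if v != "NI" and v < expected_scores.get(c, v)}
    # EI-3.4.x: in the ODMF the two capacity sets become two ontology graphs whose
    # difference yields the gap; in this simplification the gap is the set itself.
    gap = to_improve
    # EI-3.4.6 / EI-4: recommend materials annotated with concepts in the gap,
    # keeping the overlapping concepts as the reason for the recommendation.
    return {m: concepts & gap for m, concepts in annotations.items() if concepts & gap}

dpr = {"communication": "NI", "team work": 3, "coaching": 3}
materials = {"ITIL1": {"communication"}, "SKCUST0154": {"coaching"}}
print(recommend_materials(dpr, None, materials))   # {'ITIL1': {'communication'}}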
The preconditions of the above story are listed as follows:
• There exists a competency ontology in the matching framework.
• There exist several matching strategies in the matching framework.
• There is at least one good example from the test bed.
The post-condition of this story is that the ODMF needs to illustrate:
• the definition of a term and the relations of concepts in the ontology base;
• the explanation of the matching result with different strategies.
B. Step 3 – Design Test and Evaluation Data

There are two test beds involved. In the first test bed – British Telecom (BT, http://www.bt.com) – the test data set contains 26 learning materials, which are categorized into 10 soft skills. There are in total 10 company values. The ontology contains 1365 lexons, which cover 382 different concepts and 208 different role pairs. For the second test bed – the Social Care Institute for Excellence (SCIE, http://www.scie.org.uk) – there are 1054 lexons in total, used to annotate 161 competences in the NVQ
HSC standards5 and 72 organizational specific competences. TABLE III shows an example of the test data in lexons.

TABLE III. TEST DATA USED FOR SCIE

R5: Communication Skills
Head term | Role | Co-role | Tail term
Agent | interacts with | interacts with | Person
Agent | performs | is performed by | Action
Agent | acts on | is acted on | Object
Action | subsumes | is a | Communicate
Object | subsumes | is a | Communicate

HSC21: Communicate with, and complete records for individuals
Head term | Role | Co-role | Tail term
Individual | is a | subsumes | Person
Agent | interacts with | interacts with | Person
Agent | performs | is performed by | Action
Agent | acts on | is acted on | Object
Action | subsumes | is a | Communicate
Object | subsumes | is a | Record
Action | subsumes | is a | Complete record
C. Step 4 – Design a User Test Suite

The user test suite is demonstrated in TABLE IV. The level of relevance can be 1, 2, 3, 4, or 5. Level 1 means that G1 (e.g. a company value) and G2 (e.g. a learning material) are completely irrelevant. Level 2 means "not very relevant (or I don't know)". Level 3 means "relevant". Level 4 means "very relevant" and level 5 means "100% relevant".
TABLE IV. TEST SUITE USED FOR EVALUATING THE ODMF
G1 | G2 | Level of relevance
Heart | ITIL1 | 3
Helpful | ITIL8 | 2
Straightforward | PD0236 | 4
Inspiring | SKpd_04_a05 | 5
Trustworthy | BTAMG001 | 1
Customer connected | HMM23 | 2
Team work | FS-POSTA01 | 4
Coaching for performance | SKCUST0154 | 5
… | … | …
TABLE IV needs to be filled in by the test beds, who did not contribute to modeling the domain ontology. We will use this test suite to measure whether the ODMF gives a satisfactory similarity score between G1 and G2.
VI. EVALUATION RESULTS
We have implemented a tool called ODMatcher to support evaluating the different matching strategies (Figure 4).
5 http://www.direct.gov.uk/en/EducationAndLearning/QualificationsExplained/DG_10039029
Figure 4. Different similarity scores using different ODMF strategies (the justification view in the ODMatcher, screenshot)
Given a competence object described as a string in English – "hearty" – ODMatcher finds annotated learning materials: "coaching for performance", "courtesy towards customers" and "knowledge for BT organization". In the justification view (Figure 4), users can find the scores generated by each strategy. In this example, "hearty" is first translated into "heart" using string matching algorithms, WordNet and the user dictionary. Then, all the matching strategies are executed. At the end, ODMatcher provides the most relevant learning materials. We use the following criteria to evaluate the ODMF matching strategies:
• The performance. For instance, the maximum cost of running LeMaSt is 3765 milliseconds and the average cost is about 562 milliseconds. Figure 5 shows the complexity of using LeMaSt; the complexity is linear.

Figure 5. Complexity of LeMaSt

• The factors that affect the similarity score. For instance, there are two factors that affect the final score provided by LeMaSt. The first factor is the contribution weights. The second factor is the annotation sets.

• The advantages and disadvantages. For instance, the advantage of LeMaSt is that the similarity of two objects that belong to different object types can be calculated. One disadvantage is that the object descriptions have to be annotated with lexons. The other disadvantage is that the types of ontological commitments are very limited: LeMaSt only deals with WordNet semantic relations and subtyping. It does not deal with constraints like exclusive-or, inclusive-or, uniqueness and mandatory, nor with operators like negation, implication, conjunction, disjunction and sequence.

• The difficulty level of using a strategy and managing the knowledge base for this strategy.

• Satisfactory rate. This rate is calculated based on the similarity scores generated by the ODMF strategies and the users' expected relevance levels in the test suite.
In order to calculate the satisfactory rate, we need to correctly interpret the relevance levels provided by the end users. For instance, for LeMaSt, the average score is 0.19, the maximum score is 0.3225 and the minimum score is 0. Therefore, the scale of the similarity scores is [0, 0.3225], which needs to be split into 5 score ranges. We split it equally, as shown below; a small code sketch of this mapping follows the list.
• Relevance level 5 – similarity score > 0.258
• Relevance level 4 – similarity score > 0.1935 and ≤ 0.258
• Relevance level 3 – similarity score > 0.129 and ≤ 0.1935
• Relevance level 2 – similarity score > 0.0645 and ≤ 0.129
• Relevance level 1 – similarity score ≤ 0.0645
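The equal split of the [0, max] score scale into five ranges, and the mapping of an ODMF similarity score back onto a 1–5 relevance level, can be written down directly. The sketch below reproduces the LeMaSt boundaries quoted above; exact boundary handling is subject to floating-point rounding and is only illustrative.

def relevance_level(score, max_score=0.3225, levels=5):
    """Map a similarity score in [0, max_score] onto relevance levels 1..levels
    by splitting the scale into equal ranges (0.0645 wide for LeMaSt)."""
    width = max_score / levels
    for level in range(1, levels + 1):
        if score <= level * width:
            return level
    return levels  # scores above max_score map to the top level

for s in (0.0, 0.10, 0.20, 0.26, 0.3225):
    print(s, "->", relevance_level(s))
# 0.0 -> 1, 0.10 -> 2, 0.20 -> 4, 0.26 -> 5, 0.3225 -> 5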