A Self-Configuring Semantic Decision Table For Parameterizing an Ontology-based Data Matching Strategy

Yan Tang VUB STARLab, Department of Computer Science Free University of Brussels, Brussels, Belgium [email protected]

Jan Demey VUB STARLab, Department of Computer Science Free University of Brussels, Brussels, Belgium [email protected]

Robert Meersman VUB STARLab, Department of Computer Science, Free University of Brussels, Brussels, Belgium [email protected]

Abstract— A semantic decision table (SDT), which is a decision table annotated with an ontology (or ontologies), is a means to ensure the completeness and correctness of a decision table. It can be used to store the parameters of a matching strategy. In principle, SDTs can be used to configure any kind of strategy, function or algorithm. In this paper, we use an ontology-based data matching strategy, which has been developed and used for competency matching in the fields of human resource management and eLearning/training, to demonstrate how SDTs can be used. In particular, we focus on how to make an SDT self-configuring based on the feedback from an end user while evaluating this strategy. We design an algorithm called the Semantic Decision Table Self-Configuration Algorithm (SDT-SCA) to find the best parameters, which can be stored as action entries in an SDT. We discuss the design, implementation and industrial experiments concerning SDT-SCA.

Keywords—ontology; ontology-based data matching; human resource management; eLearning and training; semantic decision table; LeMaSt

I. INTRODUCTION

We have developed a generic ontology-based data matching framework (ODMF [2]), which has been used for matching competencies across application domains. ODMF contains matching algorithms originally designed for: 1) matching strings, such as the ones from SecondString [1]; 2) matching lexical information, such as using WordNet [8]; and 3) matching concepts in an ontology graph, such as using Dijkstra's shortest path algorithm [4]. Each ontology-based data matching strategy in ODMF contains at least one graph algorithm. For example, the Graph-Aided Similarity Calculation algorithm (GRASIM, [18][20]) is a composition of a lexical matching algorithm using WordNet and Dijkstra's algorithm.

A semantic decision table (SDT, [16]) is a decision table properly annotated with an ontology (or ontologies). It uses modern ontological technologies to ensure the correctness of the set of decision rules in a decision table. The methodology in chapter 6 (pp. 139-153) of [16] shows how a group of decision makers can

build an SDT. With this methodology, SDTs are also considered a means to support group decision making. We use SDTs for storing user-specific decision rules and the parameters of a matching strategy in ODMF, such as in [18]. We have designed an evaluation framework to evaluate the ODMF matching strategies [19]. During the evaluation processes, we encountered the problem of finding the best parameters. We were asked to find a solution to automatically update an SDT based on the feedback of an end user. In other words, the SDT needs to be self-configuring. In this paper, we focus on an algorithm called the Semantic Decision Table Self-Configuration Algorithm (SDT-SCA) to automatically update the parameters stored in an SDT. We take a matching strategy in ODMF, called the Lexon Matching Strategy (LeMaSt), as an example to demonstrate SDT-SCA.

The paper is organized as follows: section II discusses related work. Section III presents the paper's background. LeMaSt using an SDT is illustrated in section IV. Section V covers the design of SDT-SCA. We illustrate our industrial experiments in section VI. Section VII presents the conclusion and future work.

II. RELATED WORK

The goal of SDT-SCA is to adjust the content of a (semantic) decision table in order to improve the output of a strategy (or function, algorithm, etc.) that uses this (semantic) decision table. It can be considered a learning algorithm that finds the best parameters. In principle, we can use any machine learning approach. The field of machine learning is usually categorized into three cases: supervised, unsupervised and reinforcement learning [14]. Supervised learning (e.g., learning decision trees [7]) learns a function from examples (also called instances or cases) of its inputs and outputs. Unsupervised learning (e.g., Bayesian learning or Bayes nets [13]) deals with learning patterns in the input when no specific output values are provided. Reinforcement learning (e.g., adaptive control theory [21]) learns to maximize cumulative rewards.

SDT-SCA is in the category of unsupervised learning. The underlying principle of computational learning theory is that any hypothesis that is consistent with a sufficiently large training set must be approximately correct [14]. In this paper, we follow this principle while designing SDT-SCA. Besides SDT-SCA, other unsupervised machine learning methods can as well be used for finding the best parameters, for instance, statistical learning methods like naïve Bayesian learning [6][5], Expectation-Maximization (the EM algorithm [3][11]) and the nearest-neighbor model [9]. Naïve Bayesian learning helps agents act under uncertainty. It is based on Bayesian models, one of the oldest decision support methods. If the environment of an agent is fully analyzed and the whole truth is discovered, then the agent is guaranteed to work. In practice, however, an agent "almost never has access to the whole truth about its environment" (p. 462, [14]). That is why Bayesian models have become popular in various fields, especially in the realm of machine learning. The EM algorithm addresses the problem of learning probability models with hidden variables and missing data. It is sometimes also called an "EM schema" and can be combined with other learning algorithms. For example, Lauritzen [10] illustrates an EM algorithm for learning Bayes nets with hidden variables. The nearest-neighbor model is often used in pattern recognition, which is considered a branch of machine learning. We can use it to analyze data, calculate data distances and cluster data sets. This section has covered the related work. We will illustrate the background in the next section.

III. BACKGROUND

The ontology in this paper is modeled in the paradigm of Developing Ontology Grounded Methodology and Applications (DOGMA, [12][15]). In DOGMA, an ontology Ω contains a set of lexons and commitments. A lexon is defined as a quintuple ⟨γ, t1, r1, r2, t2⟩ representing a binary fact type, where t1 and t2 are the two terms representing two concepts; γ is a context identifier that points to the context where the concepts represented by t1 and t2 are originally defined and disambiguated; r1 and r2 are the two roles that t1 and t2 can possibly play with each other. For example, ⟨γ1, teacher, teaches, is taught by, student⟩ is a lexon, which expresses that "in the context identified by γ1, a teacher teaches a student and a student is taught by a teacher". A commitment contains a constraint on a (set of) lexon(s). For instance, we can apply the mandatory constraint on the above lexon: "EACH teacher teaches AT LEAST ONE student". The commitment needs to be specified in a language with a well-defined syntax, such as the Decision Commitment Language (DECOL, see chapter 3 in [16]). For instance, we write the above mandatory constraint as P1 = [teacher, teaches, is taught by, student]: MANDATORY(P1(teaches)).
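To make the lexon and commitment structures concrete, the following minimal Java sketch models a lexon as a quintuple. The record, its field names and the DECOL string are illustrative only; they do not reflect an actual STARLab API.

import java.util.List;

// A lexon <context, headTerm, role, coRole, tailTerm> as a plain value type.
public record Lexon(String context, String headTerm, String role,
                    String coRole, String tailTerm) {

    @Override
    public String toString() {
        return "<" + context + ", " + headTerm + ", " + role + ", "
                + coRole + ", " + tailTerm + ">";
    }

    public static void main(String[] args) {
        // "In the context gamma1, a teacher teaches a student and a
        //  student is taught by a teacher."
        Lexon teaching = new Lexon("gamma1", "teacher", "teaches",
                                   "is taught by", "student");
        // A commitment constrains a (set of) lexon(s); here we only keep
        // its DECOL-like surface form as a string.
        List<String> commitments = List.of(
            "P1 = [teacher, teaches, is taught by, student]: MANDATORY(P1(teaches)).");
        System.out.println(teaching);
        commitments.forEach(System.out::println);
    }
}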

The ontology modeled in DOGMA can be further published and stored in W3C recommended ontology languages, such as RDF(S)1 and OWL2. A semantic decision table [16] is a decision table properly annotated with an ontology (or ontologies). It contains a decision table, a set of lexons and a set of commitments. We often use it to audit the decision rules in a decision table. An example of a semantic decision table (SDT) is shown in TABLE I. It decides whether to hire, fire or train an employee depending on the title of this person.

A decision table contains three basic elements: conditions, actions and decision rules. A condition is a pair that contains a condition stub and a condition entry. A condition stub is a label that indicates a condition variable. A condition entry is a value. For instance in TABLE I, "Teacher" is a condition stub and "Yes" is a condition entry. The pair ⟨Teacher, Yes⟩ is a condition. An action is a pair that contains an action stub and an action entry. An action stub is a label that indicates an action variable. An action entry is a Boolean value that shows whether this action is executed or not. For example in TABLE I, "Hire" is an action stub. The presence or absence of "*" is an action entry. The pair ⟨Hire, *⟩ is an action. A decision rule is defined as a mapping from a complete composition of conditions to a complete composition of actions. In a decision table, it is viewed as a decision column, e.g., columns 1-4 in TABLE I.

TABLE I. AN EXAMPLE OF SDT (SEE P. 89 IN [16])

Condition   1    2    3    4
Teacher     Yes  No   Yes  No
Student     Yes  Yes  No   No
Action
Hire        *         *
Fire                       *
Train            *
SDT Lexons
Teacher: ⟨γ, Teacher, is a, supertype of, Person⟩
Student: ⟨γ, Student, is a, supertype of, Person⟩
SDT Commitments
Commitment 1: (P1 = [Teacher, is a, supertype of, Person], P2 = [Student, is a, supertype of, Person]): EXCLUSIVE_OR (P1 (supertype of), P2 (supertype of)).

A decision table needs to be complete and correct. As a subtype of decision tables, an SDT also needs to ensure completeness and correctness.

A. Completeness of a semantic decision table
A (semantic) decision table is complete iff it contains all the possible condition combinations. For instance, TABLE I is complete. If we removed any of its decision columns, it would no longer be complete.
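A small sketch of this completeness check, assuming binary (Yes/No) condition entries; the data layout is our own illustration, not the representation used by the SDT tools.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CompletenessCheck {

    // Each decision column is the list of condition entries ("Yes"/"No"),
    // one per condition stub, in a fixed stub order.
    static boolean isComplete(List<List<String>> columns, int nConditions) {
        Set<List<String>> distinct = new HashSet<>(columns);
        // A complete table shows every one of the 2^n combinations.
        return distinct.size() == (1 << nConditions);
    }

    public static void main(String[] args) {
        // The four decision columns of TABLE I over (Teacher, Student).
        List<List<String>> columns = List.of(
                List.of("Yes", "Yes"), List.of("No", "Yes"),
                List.of("Yes", "No"), List.of("No", "No"));
        System.out.println(isComplete(columns, 2));               // true
        System.out.println(isComplete(columns.subList(0, 3), 2)); // false
    }
}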

1 http://www.w3.org/TR/rdf-schema/
2 http://www.w3.org/TR/owl-ref/

B. Correctness of a semantic decision table
A decision table is correct iff it meets all the requirements at a meta-level. As the rules at the meta-level of an SDT are modeled in the form of commitments, we consider an SDT correct iff all the constraints in its commitments are satisfied. In TABLE I, we have a commitment that contains an exclusive-or constraint, which disallows the following two facts to hold at the same time: Tom is a teacher and Tom is also a student. Column 1 does not satisfy this constraint. Therefore, TABLE I is not considered correct. When we audit this SDT, this column will be highlighted and we need to update it, e.g., remove the action entry "*" for the action "Hire".

Figure 1. The matching problem: matching the data sets from HRM with the ones from eLearning and training.
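The audit step can be sketched as follows for the EXCLUSIVE_OR commitment of TABLE I; the hard-coded stub positions keep the example short and are, of course, an illustration.

import java.util.List;

public class CorrectnessAudit {

    // Flag decision columns where both condition stubs are "Yes",
    // i.e., where EXCLUSIVE_OR(Teacher, Student) is violated.
    static void auditExclusiveOr(List<List<String>> columns) {
        for (int i = 0; i < columns.size(); i++) {
            List<String> column = columns.get(i);
            if ("Yes".equals(column.get(0)) && "Yes".equals(column.get(1))) {
                System.out.println("Column " + (i + 1)
                        + " violates the EXCLUSIVE_OR commitment;"
                        + " its action entries must be revised.");
            }
        }
    }

    public static void main(String[] args) {
        auditExclusiveOr(List.of(
                List.of("Yes", "Yes"), List.of("No", "Yes"),
                List.of("Yes", "No"), List.of("No", "No"))); // flags column 1
    }
}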

We have discussed the background concerning the DOGMA ontology paradigm and semantic decision tables in this section. In the next section, we will illustrate a lexon matching strategy using a semantic decision table.

IV. LEXON MATCHING STRATEGY USING A SEMANTIC DECISION TABLE

The ODMF [2] is a collection of matching algorithms and matching strategies that are used for matching data sets across domains. The Lexon Matching Strategy (LeMaSt) is one of these matching strategies. In this section, we will first discuss an ontology-based matching problem (subsection IV.A). Then, we will illustrate the design of LeMaSt using an SDT (subsection IV.B).

A. Ontology-based Data Matching Problem
We give an example of a matching problem as follows. In a company, every employee is evaluated by the human resource (HR) department twice a year. Based on the evaluation report, the training department sends the employees to different courses or provides them with different learning materials. There are two domains in this problem: one is human resource management (HRM) and the other is eLearning and training. Each domain has its own data sets. The data set in the HRM domain consists of the evaluation reports; the one in the domain of eLearning and training consists of the learning materials and courses. Each evaluation report has a group of evaluation items. The commonality between these two data sets is that both contain competencies. Note that competencies are also contained in other domains, such as sales and marketing (see Figure 1), where they may take the form of business models and business process models. We could use the same principle as illustrated in this paper to match business process models with the data sets from the domains of HRM and eLearning and training, if necessary.

We use a competency ontology [17] as the domain ontology to annotate the data from each data set. For example, we annotate an evaluation item called "Heart" in an evaluation report and a learning material called "Decision Making (HAVARD)" as shown in TABLE II.

TABLE II. A LEXON TABLE THAT CONTAINS ANNOTATIONS FOR "HEART" AND "DECISION MAKING (HAVARD)"

Head term | Role | Co-role | Tail term
Evaluation item: Heart
Person | interact with | interact with | Person
Person | respect | is respected by | Context
Person | empathize | is empathized by | Person
Person | manage | is managed by | Emotion
Person | discourage | is discouraged by | Person
Learning material: Decision Making (HAVARD)
Person | identify | is identified by | Issue
Issue | is about | is of | Decision
Person | generate | is generated by | Alternative
Person | evaluate | is evaluated by | Alternative
Person | communicate with | communicate with | Person
Person | implement | is implemented by | Decision
Person | identify | is identified by | Problem

If the evaluation report of an employee shows that he needs to improve "Heart", then the matching engine loads a learning material (or a course). Suppose it loads "Decision Making (HAVARD)". Then, it compares the annotations of "Heart" and "Decision Making (HAVARD)". If the matching score is larger than a threshold, then "Decision Making (HAVARD)" will be proposed to this employee for learning and training. The matching engine will repeat this matching process until all the relevant learning materials and courses are found for improving "Heart". We design LeMaSt to calculate the matching score, which will be discussed in the next subsection.

B. LeMaSt and Semantic Decision Tables
LeMaSt is an ontology-based data matching strategy that uses a simple graph matching algorithm to produce matching scores. It transforms a graph matching problem into a lexon matching problem. Before using LeMaSt, we need to make sure the following two assumptions are satisfied.

Assumption 1: if term t1 represents concept c1, term t2 represents concept c2, and t1 ≠ t2, then c1 ≠ c2. For example, if t1 = "apple" and t2 = "pear", t1 represents c1 and t2 represents c2, then we say c1 ≠ c2 because t1 ≠ t2.

Assumption 2: each role pair ⟨r1, r2⟩ is uniquely defined in the ontology Ω. For example, all the "is-a" subsumption relations, such as ⟨is a, supertype of⟩ or ⟨is type of, has type⟩, should be unified into, e.g., the role pair ⟨is a, supertype of⟩.

If these two assumptions are not satisfied, we should run a string algorithm (e.g., [1]) or a lexical matching algorithm (e.g., the ones that use WordNet [8]) to clean the lexons, map the lexon terms correctly to the right concepts, and reorganize the lexon roles into the right relations. For example, a lexon whose terms or roles are spelled inconsistently can be cleaned using simple string management processes. Or, we can use the SynSet relation from WordNet to transform a lexon such as ⟨γ, tutor, teaches, is taught by, student⟩ into ⟨γ, teacher, teaches, is taught by, student⟩. Figure 2 explains this mapping.

Figure 2. Clean a lexon using simple string management processes.

Given two annotation sets A1 and A2 (e.g., A1 corresponds to "Heart" and A2 corresponds to "Decision Making (HAVARD)" in TABLE II) and two lexons l1 = ⟨γ1, t1, r1, r2, t2⟩ and l2 = ⟨γ2, t3, r3, r4, t4⟩ (l1 ∈ A1, l2 ∈ A2), we find five matching situations in LeMaSt and illustrate them as follows.

1) Perfect match, which we denote as l1 = l2. It can be derived using the formula:

(t1 = t3) ∩ (t2 = t4) ∩ (r1 = r3) ∩ (r2 = r4) ∪
(t1 = t4) ∩ (t2 = t3) ∩ (r1 = r4) ∩ (r2 = r3) → l1 = l2.

2) Two terms one role match, which we denote as l1 ≈ l2 and can be derived using the formula:

(t1 = t3) ∩ (t2 = t4) ∩ (r1 = r3) ∩ (r2 ≠ r4) ∪
(t1 = t3) ∩ (t2 = t4) ∩ (r1 ≠ r3) ∩ (r2 = r4) ∪
(t1 = t4) ∩ (t2 = t3) ∩ (r1 = r4) ∩ (r2 ≠ r3) ∪
(t1 = t4) ∩ (t2 = t3) ∩ (r1 ≠ r4) ∩ (r2 = r3) → l1 ≈ l2.

3) Two terms match, which we denote as l1 ≅ l2. It can be derived using the formula:

(t1 = t3) ∩ (t2 = t4) ∩ (r1 ≠ r3) ∩ (r2 ≠ r4) ∪
(t1 = t4) ∩ (t2 = t3) ∩ (r1 ≠ r4) ∩ (r2 ≠ r3) → l1 ≅ l2.

4) One term one role match, which we denote as l1 ≃ l2 and use the following formula to describe:

(t1 = t3) ∩ (t2 ≠ t4) ∩ (r1 = r3) ∩ (r2 ≠ r4) ∪
(t1 ≠ t3) ∩ (t2 = t4) ∩ (r1 ≠ r3) ∩ (r2 = r4) ∪
(t1 = t4) ∩ (t2 ≠ t3) ∩ (r1 = r4) ∩ (r2 ≠ r3) ∪
(t1 ≠ t4) ∩ (t2 = t3) ∩ (r1 ≠ r4) ∩ (r2 = r3) → l1 ≃ l2.

5) One term match, which we denote as l1 ~ l2. We check this matching situation using the formula:

(t1 = t3) ∩ (t2 ≠ t4) ∩ (r1 ≠ r3) ∩ (r2 ≠ r4) ∪
(t1 ≠ t3) ∩ (t2 = t4) ∩ (r1 ≠ r3) ∩ (r2 ≠ r4) → l1 ~ l2.

LeMaSt uses the following equation to calculate the matching score:

score = (Σs ws × ns) / min{|A1|, |A2|}, 0 ≤ score ≤ 1.

The value ns is the number of matched lexons for the matching situation s. |A1| and |A2| are the total lexon numbers in A1 and A2; min{|A1|, |A2|} is the minimum value in the set {|A1|, |A2|}; ws is a weight that is decided by the semantic decision table shown in TABLE III.
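Before turning to the SDT that supplies these weights, the following Java sketch shows one way to implement the five matching situations and the score equation. The classification heuristic (comparing the straight and the crossed term pairings and keeping the better one) is our simplified reading of the formulas above, and letting each lexon of A1 contribute only its best situation to the sum is an assumption, not a statement of the actual ODMF code.

import java.util.List;
import java.util.Map;

public class LeMaSt {

    // Context identifiers are omitted for brevity.
    record Lexon(String t1, String r1, String r2, String t2) {}

    // Ordered from best to worst, so a smaller ordinal is a better match.
    enum Situation { PERFECT, TWO_TERMS_ONE_ROLE, TWO_TERMS,
                     ONE_TERM_ONE_ROLE, ONE_TERM, NONE }

    static int eq(String x, String y) { return x.equals(y) ? 1 : 0; }

    static Situation classify(Lexon a, Lexon b) {
        // Try both the straight pairing (t1-t3, t2-t4) and the crossed
        // pairing (t1-t4, t2-t3); keep the one with more term matches.
        int straightTerms = eq(a.t1(), b.t1()) + eq(a.t2(), b.t2());
        int crossedTerms  = eq(a.t1(), b.t2()) + eq(a.t2(), b.t1());
        boolean straight  = straightTerms >= crossedTerms;
        int terms = straight ? straightTerms : crossedTerms;
        int roles = straight
                ? eq(a.r1(), b.r1()) + eq(a.r2(), b.r2())
                : eq(a.r1(), b.r2()) + eq(a.r2(), b.r1());
        if (terms == 2 && roles == 2) return Situation.PERFECT;
        if (terms == 2 && roles == 1) return Situation.TWO_TERMS_ONE_ROLE;
        if (terms == 2)               return Situation.TWO_TERMS;
        if (terms == 1 && roles == 1) return Situation.ONE_TERM_ONE_ROLE;
        if (terms == 1)               return Situation.ONE_TERM;
        return Situation.NONE;
    }

    // score = (sum over situations of w_s * n_s) / min(|A1|, |A2|).
    static double score(List<Lexon> a1, List<Lexon> a2,
                        Map<Situation, Double> weights) {
        double sum = 0;
        for (Lexon l1 : a1) {
            Situation best = Situation.NONE;
            for (Lexon l2 : a2) {
                Situation s = classify(l1, l2);
                if (s.ordinal() < best.ordinal()) best = s;
            }
            sum += weights.getOrDefault(best, 0.0);
        }
        return Math.min(1.0, sum / Math.min(a1.size(), a2.size()));
    }

    public static void main(String[] args) {
        var a1 = List.of(new Lexon("Person", "manage", "is managed by", "Emotion"));
        var a2 = List.of(new Lexon("Person", "manage", "is regulated by", "Issue"),
                         new Lexon("Person", "identify", "is identified by", "Issue"));
        var initial = Map.of(Situation.PERFECT, 1.0,
                Situation.TWO_TERMS_ONE_ROLE, 0.9, Situation.ONE_TERM_ONE_ROLE, 0.7,
                Situation.TWO_TERMS, 0.5, Situation.ONE_TERM, 0.1);
        System.out.println(score(a1, a2, initial)); // 0.7 (one term, one role)
    }
}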

TABLE III. AN SDT THAT DECIDES THE WEIGHTS ws

Condition                 1        2         3    4
Profile                   Initial  Balanced  Max  Perfect
Action
PERFECT_MATCH             1        1         1    1
TWO_TERMS_ONE_ROLE_MATCH  0.9      0.8       1    0
ONE_TERM_ONE_ROLE_MATCH   0.7      0.6       1    0
TWO_TERMS_MATCH           0.5      0.4       1    0
ONE_TERM_MATCH            0.1      0.2       1    0
SDT Lexons
⟨γ, Weight, has, is of, Value⟩, ⟨γ, Weight, has, is of, Label⟩, …
SDT Commitments
Commitment 1: (P1 = [Weight, has, is of, Value], P2 = [Weight, has, is of, Label]): IMPLIES (P2 (Label) = 'PERFECT_MATCH', P1 (Value) = 1).
Commitment 2: P1 = [Weight, has, is of, Value]: 0 ≤ P1 (Value) ≤ 1.

The first commitment in TABLE III expresses that the weight of "PERFECT_MATCH" must be 1 in all cases. The second commitment contains a rule meaning that the weight value must be less than or equal to 1 and greater than or equal to 0.

Let us use LeMaSt to compare the HRM evaluation item "Heart" with the learning material "Decision Making (HAVARD)" (the annotations are illustrated in TABLE II). If our profile is "Initial", "Balanced", "Max" or "Perfect", then LeMaSt generates the matching scores 0.14, 0.22, 1 or 0 respectively. The matching score for the profile "Max" is 1 because every lexon from the smaller lexon set (which is "Heart") has at least the situation of "ONE_TERM_MATCH" (l1 ~ l2) with the other lexon set. The score for "Perfect" is 0 because there is no perfect match between these two lexon sets.

We want LeMaSt to produce the "best" matching scores using the "best" parameters in TABLE III, which are the action entries in the SDT. The problem is indeed to find the most suitable action entries within their allowed value ranges. We design a semantic decision table self-configuration algorithm (SDT-SCA) to solve this problem, which we will discuss in the next section.
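Putting section IV together: a hedged sketch of the matching engine loop described in subsection IV.A, which proposes every learning material whose LeMaSt score against an evaluation item exceeds a threshold. The lemastScore stub below stands in for the strategy above; all names are illustrative.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class MatchingEngine {

    // Stand-in for the LeMaSt score of subsection IV.B; lexons are
    // flattened to strings so the sketch stays self-contained.
    static double lemastScore(List<String> a1, List<String> a2) {
        long matched = a1.stream().filter(a2::contains).count();
        return (double) matched / Math.min(a1.size(), a2.size());
    }

    static List<String> propose(List<String> itemAnnotation,
                                Map<String, List<String>> materials,
                                double threshold) {
        List<String> proposals = new ArrayList<>();
        for (var material : materials.entrySet()) {
            // Repeat the comparison for every learning material and keep
            // those whose matching score is larger than the threshold.
            if (lemastScore(itemAnnotation, material.getValue()) > threshold) {
                proposals.add(material.getKey());
            }
        }
        return proposals;
    }

    public static void main(String[] args) {
        var heart = List.of("Person interact with Person",
                            "Person manage Emotion");
        var materials = Map.of("Decision Making (HAVARD)",
                List.of("Person identify Issue", "Person manage Emotion"));
        System.out.println(propose(heart, materials, 0.3));
    }
}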

V. SEMANTIC DECISION TABLE SELF-CONFIGURATION ALGORITHM: SDT-SCA

In our methodology, we involve an expert who knows both domains of HRM and eLearning and training. The processes of the semantic decision table self-configuration algorithm are illustrated in Figure 3.

Figure 3. SDT-SCA flowchart.

A. Step 1 – Preparation phase
In this step, we ask the expert to fill in a test suite as illustrated in TABLE IV. We define a relevance level set R = {ri | i ∈ ℤ, i ≥ 1}. It is a finite set of relevance levels. The number of its members (also called the size of the set) is denoted as |R|. For example, we can have five levels illustrated as below.
• Level 5: 100% relevant
• Level 4: very relevant
• Level 3: relevant
• Level 2: not very relevant
• Level 1: completely irrelevant

TABLE IV. A TEST SUITE FILLED BY AN EXPERT

HRM Item    | Learning material ID | Learning material title               | Relevance
Trustworthy | SOFT1                | Problem solving and decision making   | 3
Trustworthy | SOFT2                | Flexibility in changing circumstances | 2
Trustworthy | SOFT3                | Communication                         | 3
Trustworthy | SOFT4                | Networking                            | 2
…           | …                    | …                                     | …
Heart       | SOFT1                | Problem solving and decision making   | 2
Heart       | SOFT2                | Flexibility in changing circumstances | 3
Heart       | SOFT3                | Communication                         | 4
Heart       | SOFT4                | Networking                            | 2
…           | …                    | …                                     | …

B. Step 2 – Run a complete test and calculate score ranges
In this step, we first run a complete test of comparing all the HRM evaluation items with the learning materials. Suppose the maximum matching score is smax, which means that the scale is [0, smax]. Then, we equally split this scale into |R| score ranges. We denote the score range for the relevance level ri as Ri. It is calculated using the following formulas:

R|R| = (smax/|R| × (|R| − 1), smax];
R|R|−1 = (smax/|R| × (|R| − 2), smax/|R| × (|R| − 1)];
…
R1 = [0, smax/|R|].

For example, if smax = 0.3225 and |R| = 5, then we have 5 score ranges shown as below.
• R5 = (0.258, 0.3225]
• R4 = (0.1935, 0.258]
• R3 = (0.129, 0.1935]
• R2 = (0.0645, 0.129]
• R1 = [0, 0.0645]
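A small sketch of this range computation, assuming the boundaries are stored as an array; it reproduces the example boundaries above.

public class ScoreRanges {

    // Boundaries b[0..levels]; range i (1-based) is (b[i-1], b[i]],
    // except range 1, which is closed at 0: [0, b[1]].
    static double[] boundaries(double sMax, int levels) {
        double width = sMax / levels;
        double[] b = new double[levels + 1];
        for (int i = 0; i <= levels; i++) b[i] = width * i;
        return b;
    }

    public static void main(String[] args) {
        double[] b = boundaries(0.3225, 5); // the example above
        for (int i = 5; i >= 1; i--) {
            System.out.printf("R%d = (%.4f, %.4f]%n", i, b[i - 1], b[i]);
        }
    }
}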

C. Step 3 – Calculate bias
In this step, we use the score ranges produced in the previous step to check whether the matching scores are good or not. If a similarity score falls in its target range, then we say this similarity score is "completely satisfied". If it does not fall in the range, then we need to calculate the bias. Let us use lbi to indicate the low boundary of the range Ri (e.g., lb4 is 0.1935) and hbi to denote the high boundary (e.g., hb4 is 0.258).

Allow us to use biaslb(Ri, s) to designate the low boundary bias of a score s. If s < lbi, then biaslb(Ri, s) = lbi − s; otherwise, biaslb(Ri, s) = s − lbi. For example, biaslb(R4, 0.19) = 0.0035.

Let us use biashb(Ri, s) to denote the high boundary bias. If s < hbi, then biashb(Ri, s) = hbi − s; otherwise, biashb(Ri, s) = s − hbi. For example, biashb(R4, 0.19) = 0.068.

We use bias(Ri, s) = min(biaslb(Ri, s), biashb(Ri, s)) to indicate the bias. For instance, bias(R4, 0.19) = 0.0035.
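The bias computation in code form, reproducing the example values (up to floating-point noise); the method names are illustrative.

public class Bias {

    static double lowBoundaryBias(double s, double lb)  { return Math.abs(s - lb); }
    static double highBoundaryBias(double s, double hb) { return Math.abs(s - hb); }

    // The bias of score s against the range (lb, hb] is the smaller of
    // the two boundary biases.
    static double bias(double s, double lb, double hb) {
        return Math.min(lowBoundaryBias(s, lb), highBoundaryBias(s, hb));
    }

    public static void main(String[] args) {
        // R4 = (0.1935, 0.258], s = 0.19, as in the example above.
        System.out.println(lowBoundaryBias(0.19, 0.1935)); // ~0.0035
        System.out.println(highBoundaryBias(0.19, 0.258)); // ~0.068
        System.out.println(bias(0.19, 0.1935, 0.258));     // ~0.0035
    }
}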

D. Step 4 – Cluster scores with satisfaction rates
In this step, we need to check whether or not the expert is satisfied with a matching score s before modifying the SDT. Let Ri be the target range given by the expert's relevance level. We say s is
• completely satisfied if s ∈ Ri;
• satisfied if s ∉ Ri and bias(Ri, s) ≤ smax/|R|;
• not really satisfied if s ∉ Ri, bias(Ri, s) ≰ smax/|R| and bias(Ri, s) ≤ 2 × smax/|R|;
• completely unsatisfied if s does not meet the requirements of the above situations.
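A sketch of this clustering step; the two tier thresholds (smax/|R| and 2 × smax/|R|) are our reconstruction of the garbled original and should be checked against [19], though they do reproduce the four examples that follow.

public class SatisfactionCluster {

    enum Level { COMPLETELY_SATISFIED, SATISFIED,
                 NOT_REALLY_SATISFIED, COMPLETELY_UNSATISFIED }

    static Level cluster(double s, double lb, double hb,
                         double sMax, int levels) {
        if (s > lb && s <= hb) return Level.COMPLETELY_SATISFIED;
        double bias = Math.min(Math.abs(s - lb), Math.abs(s - hb));
        if (bias <= sMax / levels)     return Level.SATISFIED;
        if (bias <= 2 * sMax / levels) return Level.NOT_REALLY_SATISFIED;
        return Level.COMPLETELY_UNSATISFIED;
    }

    public static void main(String[] args) {
        double lb = 0.1935, hb = 0.258, sMax = 0.3225; // target range R4
        for (double s : new double[] {0.2, 0.193, 0.12, 0.05}) {
            System.out.println(s + " -> " + cluster(s, lb, hb, sMax, 5));
        }
    }
}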

For example (with smax = 0.3225, |R| = 5 and the target range R4), if s = 0.2, then it is completely satisfied; if s = 0.193, then it is satisfied; if s = 0.12, then it is not really satisfied; if s = 0.05, then it is completely unsatisfied.

E. Step 5 – Increase/decrease LeMaSt parameters in the SDT
In this step, we increase/decrease the parameters by assigning any possible floats that are accurate to a certain number of decimal places. For example, the action entries for the changeable actions may be one value in the set {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9} if we use one decimal place and allow them to vary from 0.1 to 0.9. In this example, the calibration is 0.1.

F. Step 6 – Collect tests and select the best parameters
We repeat step 2 through step 5 until we cannot increase or decrease the parameters anymore. Suppose we have n changeable actions in an SDT, and we allow the action entries to vary from vmin to vmax (with a calibration of c). A complete test of SDT-SCA contains a loop that executes exactly ((vmax − vmin)/c + 1)^n times. For instance, if we allow all the changeable actions from TABLE III to vary from 0.1 to 0.9 (with a calibration of 0.1), then we will get 9^4 different combinations (to remind, PERFECT_MATCH always needs to be 1 and cannot be changed; therefore, the combination number is not 9^5). We collect the satisfaction rates for all the possible parameter combinations, and select the one that has the best satisfaction rate.

G. Future learning
With SDT-SCA, our ODMF matching engine learns as much as possible from what it perceives. The initial evaluation suite and the first result that contains the best parameters (action entries) reflect the prior knowledge of the learning environment. Here we want to emphasize that our environment is not completely known a priori because it evolves all the time. For example, the knowledge base of ODMF may change when the following situations happen:
• The annotation of an existing learning material or evaluation item is modified.
• A user introduces a new learning material or HRM evaluation item.
• The test suite is updated.

When the annotation of an existing learning material or evaluation item is modified, we can skip the preparation phase (step 1 in Figure 3). Instead of running a complete test, we only compare this modified annotation set with the sets from the other domain. For instance, if the annotation of a learning material is updated, then we compare this learning material with all the HRM evaluation items. After we calculate the bias (step 3 in Figure 3) and cluster the scores with satisfaction rates (step 4 in Figure 3), if the satisfaction rate is better, then we keep the SDT unchanged. If it is worse, then the system suggests not updating the annotation. If we insist on updating it, then SDT-SCA will provide a new set of best parameters for the SDT.

When a user introduces a new learning material, we require a domain expert to annotate this new learning material so that it can be included in the search space of ODMF. If the relevance score in the test suite is provided, then we can run SDT-SCA to find the best parameters in the SDT. Otherwise, we use a simple voting mechanism (involving the whole community of end users) to suggest relevance scores, and then run SDT-SCA.

When a new HRM evaluation item is introduced, our solution is similar to the one for a new learning material. We first need to annotate this new HRM evaluation item. Then we use either the updated test suite or a voting mechanism to suggest the relevance scores. At the end, we run SDT-SCA to find the best parameters.

When the test suite is updated, we need to know the reasons why it gets updated. It is very important to verify this update action because the training set needs to be consistent and approximately correct. Otherwise, the "best" parameters found by SDT-SCA are actually not the best ones. As a consequence, LeMaSt using this SDT will not yield a good matching score.

H. Practical issues
Note that in real case scenarios, we need to run SDT-SCA offline as it consumes considerable time and resources. Let m′1 be the maximum number of lexons in an annotation set from the HRM domain, m′2 the maximum number of lexons in an annotation set from the domain of eLearning and training, g′1 the total number of the graphs from the HRM domain, and g′2 the total number of the graphs from the domain of eLearning and training. We need to run up to m′1 × m′2 × g′1 × g′2 lexon comparisons for a complete test. As discussed, the loop in SDT-SCA is executed ((vmax − vmin)/c + 1)^n times (to remind, n is the number of the changeable actions in an SDT; every action entry varies from vmin to vmax). In every execution of the loop, a complete set of lexon comparisons is executed exactly twice: once for getting the maximum score in order to calculate the score ranges, and once for clustering the scores with satisfaction rates. Hence SDT-SCA calls, in total, up to 2 × ((vmax − vmin)/c + 1)^n × m′1 × m′2 × g′1 × g′2 lexon comparisons.

We use N′ to denote the largest number in the set {m′1, m′2, g′1, g′2, (vmax − vmin)/c}. The time complexity is therefore bounded by O(N′^(n+4)). Hence, we need to run SDT-SCA offline, as we do not want it to consume too much at the front end. End users may spontaneously create new annotations or introduce new learning materials in the ODMF framework. The SDT-SCA component will not be activated until needed.
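To make the combinatorics of the complete test concrete, the following sketch enumerates all ((vmax − vmin)/c + 1)^n parameter combinations with an "odometer" loop; the callback stands in for one SDT-SCA loop and is, of course, an illustration.

import java.util.function.Consumer;

public class ParameterSweep {

    static void sweep(int n, double vMin, double vMax, double c,
                      Consumer<double[]> runOneLoop) {
        int steps = (int) Math.round((vMax - vMin) / c) + 1;
        int[] idx = new int[n]; // an "odometer" over steps^n combinations
        while (true) {
            double[] params = new double[n];
            for (int i = 0; i < n; i++) params[i] = vMin + idx[i] * c;
            runOneLoop.accept(params); // one SDT-SCA loop execution
            int i = n - 1;
            while (i >= 0 && ++idx[i] == steps) idx[i--] = 0;
            if (i < 0) break; // the odometer rolled over: sweep finished
        }
    }

    public static void main(String[] args) {
        long[] count = {0};
        sweep(4, 0.1, 0.9, 0.1, params -> count[0]++);
        System.out.println(count[0]); // 6561 = 9^4 combinations
    }
}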

In this section, we have illustrated how SDT-SCA is designed. We will explain our test case and experiments in the next section.

VI. EXPERIMENTS

The use case is created with a test bed from the EC FP6 Prolix project3: British Telecom (BT, http://www.bt.com). The test data set contains 26 learning materials and 10 HRM evaluation items. The ontology contains 1365 lexons, which cover 382 different concepts and 208 different role pairs. We have implemented SDT-SCA in Java SE 1.6 using Eclipse SDK 3.4.0. We use MySQL 5.0 as the database server for storing the lexons (in the ontologies and the annotations of learning materials and HRM evaluation items).

As discussed in the previous section, the complexity of SDT-SCA is rather high. Our execution environment is described as follows: OS – MS XP Professional v2002, SP3; CPU – Intel Core 2 Duo P9400, 2.4 GHz; 3.45 GB of RAM. The maximum cost of comparing a learning material with an HRM evaluation item is 23657 milliseconds. The average cost is 13213 milliseconds. Therefore, we need to find a solution to implement SDT-SCA in a more efficient way. We implement the loop in SDT-SCA (see Figure 3) using an "execution-in-parallel" solution, which is visualized in Figure 4.

Figure 4. UML activity diagram for visualizing "execution-in-parallel" of SDT-SCA.

Each state in Figure 4 represents one loop in SDT-SCA. Its label – SDT_SCA(x1, x2, x3, x4) – indicates the parameters for LeMaSt, where x1, x2, x3 and x4 are the values of the following items:
• x1 – "ONE_TERM_MATCH"
• x2 – "TWO_TERMS_MATCH"
• x3 – "ONE_TERM_ONE_ROLE_MATCH"
• x4 – "TWO_TERMS_ONE_ROLE_MATCH"

In the example shown in Figure 4, each synchronized task block contains 9 SDT-SCA loops. There are in total 729 synchronized task blocks. By doing so, we can run SDT-SCA more efficiently.

3 http://www.prolixproject.org/
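A hedged sketch of this execution-in-parallel idea: one task per synchronized block (the 9 values of x1 for a fixed x2, x3, x4 combination), run on a thread pool whose size is our assumption; runOneLoop is a placeholder for one SDT-SCA loop, not the actual implementation.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSweep {

    record Result(double[] params, double satisfactionRate) {}

    // Placeholder for one SDT-SCA loop (complete test plus clustering).
    static double runOneLoop(double[] params) {
        return Math.random();
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        List<Future<Result>> blocks = new ArrayList<>();
        // One task per synchronized block: fixed (x2, x3, x4), 9 values
        // of x1 inside, hence 9 loops per block and 9^3 = 729 blocks.
        for (double x2 = 0.1; x2 <= 0.91; x2 += 0.1)
            for (double x3 = 0.1; x3 <= 0.91; x3 += 0.1)
                for (double x4 = 0.1; x4 <= 0.91; x4 += 0.1) {
                    final double f2 = x2, f3 = x3, f4 = x4;
                    blocks.add(pool.submit(() -> {
                        Result best = new Result(null, -1);
                        for (double x1 = 0.1; x1 <= 0.91; x1 += 0.1) {
                            double[] p = {x1, f2, f3, f4};
                            double rate = runOneLoop(p);
                            if (rate > best.satisfactionRate())
                                best = new Result(p, rate);
                        }
                        return best;
                    }));
                }
        Result best = new Result(null, -1);
        for (Future<Result> block : blocks) {
            Result r = block.get();
            if (r.satisfactionRate() > best.satisfactionRate()) best = r;
        }
        pool.shutdown();
        System.out.println("blocks: " + blocks.size()); // 729
        System.out.println("best satisfaction rate: " + best.satisfactionRate());
    }
}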

We have run a complete test of SDT-SCA. A part of the result is illustrated in TABLE V.

TABLE V. PART OF THE TEST RESULT

ID | 1   | 2   | 3   | 4   | 5  | 6   | 7  | 8  | 9
1  | 0.1 | 0.1 | 0.1 | 0.1 | 43 | 65  | 72 | 54 | 0.462
2  | 0.1 | 0.1 | 0.1 | 0.2 | 43 | 65  | 72 | 54 | 0.462
3  | 0.1 | 0.1 | 0.1 | 0.3 | 43 | 65  | 72 | 54 | 0.462
4  | 0.1 | 0.1 | 0.1 | 0.4 | 43 | 65  | 72 | 54 | 0.462
5  | 0.1 | 0.1 | 0.1 | 0.5 | 44 | 65  | 69 | 56 | 0.464
6  | 0.1 | 0.1 | 0.1 | 0.6 | 43 | 65  | 71 | 54 | 0.463
7  | 0.1 | 0.1 | 0.1 | 0.7 | 43 | 65  | 72 | 54 | 0.461
8  | 0.1 | 0.1 | 0.1 | 0.8 | 43 | 65  | 71 | 54 | 0.463
9  | 0.1 | 0.1 | 0.1 | 0.9 | 43 | 65  | 72 | 54 | 0.462
10 | 0.1 | 0.1 | 0.2 | 0.1 | 44 | 80  | 73 | 36 | 0.533
…  | …   | …   | …   | …   | …  | …   | …  | …  | …
11 | 0.1 | 0.1 | 0.8 | 0.6 | 65 | 109 | 53 | 6  | 0.747
12 | 0.1 | 0.1 | 0.8 | 0.7 | 66 | 110 | 52 | 6  | 0.752
13 | 0.1 | 0.1 | 0.8 | 0.8 | 66 | 111 | 51 | 6  | 0.756
14 | 0.1 | 0.1 | 0.8 | 0.9 | 66 | 112 | 51 | 6  | 0.759
15 | 0.1 | 0.1 | 0.9 | 0.1 | 67 | 119 | 43 | 5  | 0.796
16 | 0.1 | 0.1 | 0.9 | 0.2 | 67 | 120 | 43 | 5  | 0.7959
17 | 0.1 | 0.1 | 0.9 | 0.3 | 67 | 119 | 43 | 5  | 0.795
18 | 0.1 | 0.1 | 0.9 | 0.4 | 67 | 119 | 43 | 5  | 0.795
19 | 0.1 | 0.1 | 0.9 | 0.5 | 66 | 112 | 51 | 6  | 0.757
20 | 0.1 | 0.1 | 0.9 | 0.6 | 66 | 111 | 51 | 6  | 0.756
…  | …   | …   | …   | …   | …  | …   | …  | …  | …

In what follows, we explain the columns in TABLE V.
• 1: parameter for one term match (x1)
• 2: parameter for two terms match (x2)
• 3: parameter for one term one role match (x3)
• 4: parameter for two terms one role match (x4)
• 5: number of the scores that are completely satisfied. We use ncs to denote this value.
• 6: number of the scores that are satisfied. We use ns to denote this value.
• 7: number of the scores that are not really satisfied. We use nnrs to denote this value.
• 8: number of the scores that are completely unsatisfied. We use ncu to denote this value.

• 9: satisfaction rate. It is calculated using the formula (ncs + ns) / (ncs + ns + nnrs + ncu).

The test that has the highest satisfaction rate is the one with ID 15 in TABLE V. If we take the parameter tuple (0.1, 0.1, 0.1, 0.1), which is in the test with ID 1, then we will get a satisfaction rate of 0.462. The detailed satisfaction rates (see the definitions in section V.D) are illustrated in Figure 5. The detailed satisfaction rates for the test that has the highest satisfaction rate (ID 15) are visualized in Figure 6.

Figure 5. Detailed satisfaction rates for the scores that are calculated with the parameter tuple (0.1, 0.1, 0.1, 0.1): 18% completely satisfied, 28% satisfied, 31% not really satisfied, 23% completely unsatisfied.

Figure 6. Detailed satisfaction rates for the scores that are calculated with the parameter tuple (0.1, 0.1, 0.9, 0.1): 29% completely satisfied, 51% satisfied, 18% not really satisfied, 2% completely unsatisfied.

By comparing Figure 5 and Figure 6, we can see that the matching result using the "best" parameter tuple is dramatically improved. Once we find the best parameters, we can store them in an extra decision column of the SDT (see TABLE VI, column 5).

TABLE VI. RESULTANT SDT

Condition                 1        2         3    4        5
Profile                   Initial  Balanced  Max  Perfect  SCA
Action
PERFECT_MATCH             1        1         1    1        1
TWO_TERMS_ONE_ROLE_MATCH  0.9      0.8       1    0        0.1
ONE_TERM_ONE_ROLE_MATCH   0.7      0.6       1    0        0.9
TWO_TERMS_MATCH           0.5      0.4       1    0        0.1
ONE_TERM_MATCH            0.1      0.2       1    0        0.1
SDT Lexons …
SDT Commitments …

In this section, we have explained how SDT-SCA is implemented and tested. In the next section, we will conclude.

VII. CONCLUSION AND FUTURE WORK

In this paper, we have illustrated how to make a semantic decision table self-configuring in order to get the best output of an ontology-based matching strategy. Our main contributions are listed as follows.
• LeMaSt using SDT. LeMaSt is an ontology-based data matching strategy that uses a simple graph matching algorithm to produce matching scores. It transforms the graph matching problem into a lexon (binary fact type) matching problem. We use an SDT to store the weights for the different matching situations while comparing two lexons.
• SDT-SCA. We store the parameters of a matching strategy in an SDT as actions. SDT-SCA is an algorithm to find the best action entries, which guarantee the best matching result of this matching strategy.
• Test of SDT-SCA for matching competencies in the fields of human resource management and eLearning/training. In order to justify SDT-SCA, we have executed an industrial experiment with the help of British Telecom.

Note that LeMaSt calculates the matching scores only based on fact types4 (concepts and relations) from the ontology (or ontologies). The axioms and constraints on the fact types are not taken into account. LeMaSt is more suitable for lightweight ontologies where not a lot of constraints are defined. When we have heavyweight ontologies, we need to use other matching strategies in ODMF, such as GRASIM [18][20]. The scores generated by GRASIM are based on both fact types and axioms/constraints. Since there are many matching strategies in ODMF, an interesting future work could be to design an evaluation-driven framework in order to automatically select the "best" matching strategies.

4 We call a fact type a lexon in this paper.

ACKNOWLEDGMENT
The data is taken from the EC FP6 Prolix project. This work has also been supported by the EU ITEA-2 Project 2008005 "Do-it-Yourself Smart Experiences" (DIY-SE), funded by IWT. It is the authors' pleasure to thank our ex-colleagues, Dr. Gang Zhao and Peter De Baer, for the discussions concerning the initial idea of ODMF.

REFERENCES

[1] Cohen, W. W., and Ravikumar, P. (2003): SecondString: An open source Java toolkit of approximate string-matching techniques. Project web page, http://secondstring.sourceforge.net
[2] De Baer, P., Tang, Y., and De Leenheer, P. (2009): An Ontology-based Data Matching Framework: Case Study for Competency-based HRM. In Proc. of the 4th International ISWC Workshop on Ontology Matching (OM 2009), CEUR
[3] Dempster, A. P., Laird, N., and Rubin, D. (1977): Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, 39 (Series B), 1-38
[4] Dijkstra, E. W. (1959): A note on two problems in connexion with graphs. Numerische Mathematik 1, 269-271
[5] Domingos, P., and Pazzani, M. (1997): On the optimality of the simple Bayesian classifier under zero-one loss, Machine Learning, 29, 103-130
[6] Elkan, C. (1997): Boosting and naïve Bayesian learning. Tech. rep., Department of Computer Science and Engineering, Univ. of California
[7] Feigenbaum, E. A. (1961): The Simulation of Verbal Learning Behavior, in Proc. of the Western Joint Computer Conference, 19, 121-131
[8] Fellbaum, C. (1999): WordNet: An Electronic Lexical Database, Massachusetts Institute of Technology, ISBN 0-262-06197
[9] Fix, E., and Hodges, J. L. (1951): Discriminatory analysis - nonparametric discrimination: consistency properties. Tech. rep. 21-49004, USAF School of Aviation Medicine, Randolph Field, Texas
[10] Lauritzen, S. (1995): The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19, 191-201
[11] McLachlan, G. J., and Krishnan, T. (1997): The EM Algorithm and Extensions, Wiley, New York
[12] Meersman, R. (2001): Ontologies and Databases: More than a Fleeting Resemblance, in Proc. of the OES/SEO 2001 Rome Workshop, d'Atri, A. & Missikoff, M. (eds.), Luiss Publications
[13] O'Hagan, A. (1994): Bayesian Inference, Volume 2B, in Kendall's Advanced Theory of Statistics, ISBN 0-340-52922-9
[14] Russell, S., and Norvig, P. (2002): Artificial Intelligence: A Modern Approach, chapter 18 - Learning from Observations, Prentice Hall, 2nd edition, ISBN-10: 0137903952, ISBN-13: 978-0137903955, December 30, 2002
[15] Spyns, P., Tang, Y., & Meersman, R. (2008): An ontology engineering methodology for DOGMA, Journal of Applied Ontology, Volume 3, Issue 1-2, pp. 13-39
[16] Tang, Y. (2010): Semantic Decision Tables - A New, Promising and Practical Way of Organizing Your Business Semantics with Existing Decision Making Tools, ISBN 978-3-8383-3791-3, LAP LAMBERT Academic Publishing AG & Co. KG, Saarbrücken, Germany
[17] Tang, Y., De Baer, P., Zhao, G., Meersman, R., and Pudney, K. (2009): Towards a Pattern-Driven Topical Ontology Modeling Methodology in Elderly Care Homes, International OntoContent'09 Workshop, On the Move to Meaningful Internet Systems: OTM 2009 Workshops, Springer, Heidelberg, LNCS 5872, ISBN 978-3-642-05289-7, pp. 514-523, Vilamoura, Portugal, Nov. 1-6, 2009
[18] Tang, Y., Zhao, G., De Baer, P., and Meersman, R. (2010): Towards Freely and Correctly Adjusted Dijkstra's Algorithm with Semantic Decision Tables for Ontology Based Data Matching, in Proc. of the 2nd International Conference on Computer and Automation Engineering (ICCAE 2010), V. Mahadevan, J. Zhou (eds.), IEEE (Category number: CFP1096F-ART, CFP1096F-PRT), EI (Compendex), Thomson ISI Proceedings (ISTP), ISBN: 978-1-4244-5586-7, 978-1-4244-5585-0, Suntec City, Singapore, February 26-28, 2010
[19] Tang, Y., Meersman, R., Ciuciu, I., Leenarts, E., and Pudney, K. (2010): Towards Evaluating Ontology Based Data Matching Strategies, in Proc. of the Fourth IEEE Research Challenges in Information Science (RCIS'10), Peri Loucopoulos and Jean Louis Cavarero (eds.), pp. 137-146, ISSN: 2151-1349, E-ISBN: 978-1-4244-4840-1, ISBN: 978-1-4244-4839-5, DOI: 10.1109/RCIS.2010.5507373, Nice, France, May 19-21, 2010
[20] Tang, Y. (2010): Towards Evaluating GRASIM for Ontology-based Data Matching, in Proc. of the 9th International Conference on Ontologies, Databases, and Applications for Semantics (ODBASE 2010), Springer Verlag, LNCS 6427, p. 1009 ff, Hersonissou, Crete, Greece, Oct. 26-28, 2010
[21] Widrow, B., and Hoff, M. E. (1960): Adaptive switching circuits, in 1960 IRE WESCON Convention Record, pp. 96-104, New York