3rd International Conference on Software, Knowledge, Information Management and Applications
Schema Matching Using Thesaurus
Thabit Sulaiman Sabbah (1), Rashid Jayousi (2), Yousef Abuzir (3)
(1, 2) Al-Quds University, Computer Science Department, Jerusalem, Palestine.
(3) Al-Quds Open University, Ramallah, Palestine.
Abstract: Schema matching is a basic problem in the data integration process between different or similar data resources, within a certain domain or across domains. Many individual and hybrid approaches based on several techniques have been introduced to solve this problem automatically. In this paper we present a new schema matching approach whose core is the use of a thesaurus to carry out the match process. An application was implemented to test our approach. The initial results indicate that the use of a thesaurus can lead to more accurate results than other approaches.
Index Terms: Element Mapping, Linguistic Analysis, Schema Matching, Searching, Thesaurus.
I. INTRODUCTION
Data exchange between different systems is one of the most challenging problems in computer science, since these systems are usually developed separately and for different applications. As applications grow and development continues, the exchange of data becomes a necessity. The XML language was introduced as a solution to this problem, but XML data integration remains an increasingly important problem, and many methods have been developed (Dong Ce, Bailey James). Approaches to XML data integration can be classified into several categories according to the technique used: manual mapping, query languages, DTD mapping, semi-automatic mapping based on XSLT, and schema matching. An XML schema is the definition (both in terms of its organization and its data types) of a specific XML structure (Connolly Thomas, Begg Carolyn). The problem is that the tags (field names) and data types are usually not unified, so some kind of mapping is needed to make it possible to integrate data held in different XML structures. Many techniques have been used to solve this problem. One of these approaches is based on data types, structure, and element descriptions. Element descriptions, also called annotations, are part of the XML schema data model (http://www.w3.org); textual description similarity in that approach was addressed through an Information Retrieval technique based on term frequency vectors (the vector space model). Many other IR techniques have also been applied to schema matching; one technique that has not yet been investigated effectively in this area is the use of a thesaurus.

A thesaurus is a list of every important term (single-word or multi-word) in a given domain of knowledge, together with a set of related terms for each term in the list (http://en.wikipedia.org). It is used for indexing, classifying, searching, and text mining (Losee M. Robert, 2007). Terms in a thesaurus are listed alphabetically, and some are arranged hierarchically; this hierarchy indicates the relations between terms: the broader term (BT) represents the superclass of a term, while the narrower term (NT) represents its subclass(es). Some thesauri also have the USE and UF (used for) relations to indicate alternative terms. In this research we investigate the effectiveness of using thesaurus techniques to solve the problem of textual description similarity within automatic schema matching, based on the annotation section of schemas.

The remainder of this paper is organized as follows: Section II gives background on schema matching surveys and approaches, Section III explains our methodology of using a thesaurus in schema matching, Section IV describes our implemented application, Section V shows initial results of our methodology, and Section VI presents our conclusion and future work.
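The BT/NT/USE/UF relations described above can be modelled with a small data structure. The following is a minimal sketch, assuming a simple in-memory model (the paper's thesaurus is stored in a database); the agricultural entries are hypothetical and for illustration only.

```python
# Minimal sketch of a thesaurus with BT/NT/USE/UF relations.
class Thesaurus:
    def __init__(self):
        self.broader = {}    # term -> broader term (BT)
        self.use = {}        # non-preferred term -> preferred term (USE)

    def add_bt(self, term, broader_term):
        """Record that `broader_term` is the BT (superclass) of `term`."""
        self.broader[term] = broader_term

    def add_use(self, non_preferred, preferred):
        """USE/UF pair: `non_preferred` should be replaced by `preferred`."""
        self.use[non_preferred] = preferred

    def preferred(self, term):
        """Resolve a term to its preferred form via the USE relation."""
        return self.use.get(term, term)

    def narrower(self, term):
        """NT relation, derived here as the inverse of BT."""
        return [t for t, bt in self.broader.items() if bt == term]

    def broader_chain(self, term):
        """Walk BT links from a term up to the top of the hierarchy."""
        chain, t = [], self.preferred(term)
        while t in self.broader:
            t = self.broader[t]
            chain.append(t)
        return chain

# Toy agricultural fragment (hypothetical entries)
th = Thesaurus()
th.add_bt("wheat", "cereals")
th.add_bt("maize", "cereals")
th.add_bt("cereals", "crops")
th.add_use("corn", "maize")
```

For example, `th.preferred("corn")` returns `"maize"`, and `th.broader_chain("wheat")` returns `["cereals", "crops"]`, i.e. the chain of superclasses of the term.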
II. BACKGROUND

Schema matching is the process of identifying a semantic correspondence or equivalence between two or more schemas. During the past 20 years, schema matching has been an active research area (Blake, 2007); a main reason was, and still is, the need for effective and complete automatic matching. This need is becoming more and more urgent because of the rapid expansion of application areas in which schema matching forms the first step toward data integration. Several surveys have been conducted to review and classify schema matching approaches and
ISBN: 9781851432516
197
algorithms. An early study was Batini and Lenzerini (1986); its aims were to provide a unified framework for the schema integration problem and to conduct a comparative analysis of existing methodologies in the field of database schema integration. The study analysed and compared twelve approaches according to common criteria such as use, completeness, and detailed specification. Batini and Lenzerini (1986) classified the analysed integration approaches into two simple classes: view integration, in which a global conceptual description of a proposed database is produced, and database integration, where a global schema of a collection of databases is produced.

Another survey of automatic schema matching approaches was Rahm and Bernstein (2001). In addition to surveying past automatic schema matching approaches, the authors presented a taxonomy that explains the common features of these approaches. Schema matching approaches were classified into two main categories, individual matchers and combining matchers, both divided into subdivisions through several levels, as shown in FIG. 1.
FIG.1 Schema Matching Approaches Classification, Rahm and Bernstein (2001)
Another classification of schema matching approaches was introduced by Shvaiko and Euzenat (2005). This classification is based on three layers: basic properties of matching techniques as the inner layer, granularity/interpretation of input as the top layer, and the kind of input as the bottom layer, as shown in FIG. 2.
FIG. 2 Schema Matching Approaches Classification, Shvaiko and Euzenat (2005)
In the inner layer, matching techniques were categorized into strings, language, linguistics, constraints, alignment reuse, formal ontology, graphs, taxonomy, structure repositories, and models, whereas the top layer adopted the levels of granularity proposed by
Rahm and Bernstein (2001). In the bottom layer, the kinds of input were grouped into three classes: terminological, structural, and semantic.

Most schema matching systems emerged from the context of a specific application, but few of them try to address the schema matching problem in a generic way that is suitable for different applications and schema languages (Do et al., 2002). The following briefly describes some systems of both types in timeline order.

SEMINT (2000): The theoretical background and implementation of the SEMantic INTegrator, a tool based on neural networks to identify attribute correspondences in heterogeneous databases. SEMINT uses three levels of metadata to determine attribute correspondences: attribute names (the dictionary level), schema information (the field specification level), and data contents and statistics (the data content level). Neural networks were used to learn how this metadata characterizes the semantics of the attributes; the knowledge of how to determine matching data elements is discovered from the metadata directly (Li and Clifton, 2000).

Cupid (2001): A generic schema matching algorithm to discover mappings between schema elements based on their names, data types, constraints, and schema structure (element-based and structure-based). Cupid integrates linguistic and structural matching, is biased toward similarity of atomic elements, and exploits internal structure, keys, referential constraints, and views (Madhavan et al., 2001).

Similarity Flooding (2002): A generic matching algorithm based on a fixpoint computation that is usable across different scenarios. The algorithm produces a mapping between corresponding nodes of the two input graphs (schemas, catalogs, or other data structures). The accuracy of the algorithm is evaluated by counting the number of user adjustments needed after the algorithm runs (Melnik et al., 2002).
LSD (2003): A multi-strategy learning approach to find schema mappings automatically. LSD applies multiple learner modules, where each learner exploits a different type of information, either in the schemas of the sources or in their data. Learner modules employ a variety of techniques, ranging from Naive Bayes and nearest-neighbor classification to entity recognition and information retrieval. Finally, the predictions of these modules are combined using a meta-learner. To further improve matching accuracy, LSD exploits domain integrity constraints, user feedback, and nested structures in XML data (Doan et al., 2003).

SCROL (2004): The Semantic Conflict Resolution Ontology (SCROL) is a formal structure of a common ontology that provides a systematic method for automatically detecting and resolving various semantic conflicts in heterogeneous databases. A strong point of SCROL is its ability to recognize matches between differently named elements that are semantically similar but require conversions in order to implement the match. SCROL is formally defined to provide a dynamic mechanism for comparing and manipulating the contextual knowledge of each information source, which is useful for semantic interoperability among heterogeneous databases (Ram and Park, 2004).

Corpus (2005): A schema matching method that leverages previous matching experience (knowledge) to carry out new schema matching. Knowledge is extracted from a mapping knowledge base (corpus) of known schemas and mappings (Madhavan et al., 2005).

Element clustering (2006): A technique that introduces clustering as an intermediate step into existing schema matching algorithms to improve the efficiency of large-scale schema matching. Clustering is used to partition schemas, which reduces the overall matching load and creates a possibility to trade efficiency against effectiveness (Smiljanic et al., 2006).

III. SCHEMA MATCHING USING THESAURUS

The matching process is divided into three phases: data extraction; data analysis and element matching; and result viewing. Each of these phases contains its own processes and data structures that prepare data for the next phase. The XML schema format was used in the application of our methodology since it is the most commonly used schema format, and because it follows the rules of the XML language it can be verified easily. This three-phase model gives the advantage that parts of the model can be changed without affecting the whole design.

Phase One: Data Extraction

• Check Input Validity: In this phase the matcher analyses the input files (S1, S2) and checks that the schema files are well-formed. The input of this phase is two XML Schema Definition files (.XSD files).

• Parsing Input Files: Both files are parsed to extract the names of elements and their annotations. The parsing output is a two-dimensional list that contains element names and element descriptions (annotations).

• Filtering Output: To get rid of words that have no significant meaning, the element descriptions are filtered
against a stop word list. Stop words are loaded into a hash table from a specified file.
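Phase One can be sketched in a few lines. The following is a minimal illustration, not the paper's actual implementation; it assumes each schema element carries an `xs:annotation`/`xs:documentation` block, and the stop word list and sample schema are hypothetical.

```python
# Sketch of Phase One: parse an XSD, extract (element name, annotation),
# and filter the annotation tokens against a stop word list.
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"
STOP_WORDS = {"a", "an", "the", "of", "and", "to", "in", "is", "for"}

def extract_annotations(xsd_text):
    """Return a list of (element name, filtered description tokens)."""
    root = ET.fromstring(xsd_text)   # raises ParseError if not well-formed
    rows = []
    for elem in root.iter(XS + "element"):
        doc = elem.find(f"{XS}annotation/{XS}documentation")
        text = (doc.text or "") if doc is not None else ""
        tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
        rows.append((elem.get("name"), tokens))
    return rows

# Hypothetical course schema fragment
schema = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="AGR231">
    <xs:annotation>
      <xs:documentation>Principles of plant production and soil science</xs:documentation>
    </xs:annotation>
  </xs:element>
</xs:schema>"""
```

Running `extract_annotations(schema)` yields `[("AGR231", ["principles", "plant", "production", "soil", "science"])]`: the stop words "of" and "and" have been filtered out, and what remains feeds the thesaurus lookup of Phase Two.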
The dataflow of this phase is shown in FIG.3.
FIG. 3 Dataflow in the first phase of schema matching (Schema Loader/Editor → validity check → DOM parser extracts field names and annotations → filter against stop word list → output).
Phase Two: Data Analysis and Element Matching

The core of the matching process is carried out in this phase.

• Applying Thesaurus: The first step in this phase is to apply the thesaurus to each element's description; this process requires searching for each term of the text in the thesaurus.

• Searching the Thesaurus: A small enhancement of the search technique makes it possible to perform the search process efficiently. This enhancement depends on adding two extra tables to the thesaurus database. Searching terms in the database is then performed by traversing the text tokens just once; in this pass, each token of the text is used to query one of the extra tables, and the result of that query determines whether two further queries are needed. Experiments with this enhanced search algorithm show significant improvements in the time required to complete the process. The process is repeated for each element in both schemas.

• Similarity Matrix Construction: The next step of this phase is to determine the similarity matrix between all elements of both schemas. Based on the terms resulting from the first step, the similarity between any two elements is calculated; the result of this step is the normalized similarity matrix.

• Final Mapping between Elements: The final step of this phase is to determine the matches with the best similarity between the different elements of both schemas. Starting with the maximum similarity value in the similarity matrix, we consider a matching (mapping) between the element in the header of the row and the element in the header of the column whose intersection holds the maximum value. Then all values in both that row and that column are set to zero. This process is repeated until all similarity values in the matrix are zero or below the threshold value.

Phase Three: Result Viewing

This phase contains no complications; the user can browse the final mapping result. The result can be viewed as a table of matched elements together with the similarity value between each pair of elements.

In our approach we propose a new schema matching technique. The mapping of elements from the first schema (S1) to elements of another schema (S2) is based on an Information Retrieval (IR) technique, namely the thesaurus, which we use in the linguistic analysis of the elements' textual descriptions. The standard XML schema format is taken as the input schema format, and each of the schemas S1 and S2 is required to contain clear and full definitions (descriptions) of all fields (tags) in the schema's annotation section. The use of a thesaurus should lead to more accurate results than other individual approaches, because applying the thesaurus extracts the concept(s) that a description carries, disregarding other attributes considered by earlier matching approaches (e.g. name, constraints, pronunciation, etc.). This approach is classified as individual, schema-based, element-level, linguistic-based, with 1:1 cardinality, using a thesaurus to carry out the matching process; thus it fits in the Rahm and Bernstein (2001) classification as shown by the bold path in FIG. 4. Our approach also fits in the Shvaiko and Euzenat (2005) classification as shown by the bold path in FIG. 5.
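The similarity matrix and the greedy "take the maximum, zero out its row and column" mapping step can be sketched as follows. The paper does not specify the similarity measure used on the extracted concept sets, so the Dice coefficient below is an assumption, and the course elements are hypothetical.

```python
# Sketch of Phase Two: similarity matrix over concept sets, then greedy
# 1:1 mapping by repeatedly taking the global maximum cell.
def dice(a, b):
    """Normalized similarity between two concept sets (0.0 .. 1.0).
    Assumed measure; the paper leaves the exact formula unspecified."""
    if not a or not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

def similarity_matrix(s1, s2):
    """s1, s2: dicts mapping element name -> set of thesaurus concepts."""
    return {(e1, e2): dice(c1, c2)
            for e1, c1 in s1.items() for e2, c2 in s2.items()}

def final_mapping(matrix, threshold=0.5):
    """Greedy mapping: take the maximum cell, then discard its whole row
    and column (the paper's 'set to zero'), until below the threshold."""
    matrix = dict(matrix)
    mapping = []
    while matrix:
        (e1, e2), best = max(matrix.items(), key=lambda kv: kv[1])
        if best < threshold:
            break
        mapping.append((e1, e2, best))
        matrix = {k: v for k, v in matrix.items()
                  if k[0] != e1 and k[1] != e2}
    return mapping

# Hypothetical elements with already-extracted concept sets
s1 = {"AGR231": {"soil", "plant", "production"},
      "AGR115": {"animal", "nutrition"}}
s2 = {"AGRI101": {"soil", "plant", "science"},
      "AGRI118": {"animal", "nutrition", "feed"}}
```

Here `final_mapping(similarity_matrix(s1, s2))` first pairs AGR115 with AGRI118 (similarity 0.8, the global maximum), removes that row and column, and then pairs AGR231 with AGRI101; discarding the row and column enforces the 1:1 cardinality the approach is classified under.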
FIG. 4 Our approach with respect to the Rahm and Bernstein (2001) classification
FIG. 5 Our approach with respect to the Shvaiko and Euzenat (2005) classification
As a generalization, the architecture of this approach allows adaptation to deal with other types of schema, provided that clear and full definitions (descriptions) of all fields exist in both schemas S1 and S2.
IV. IMPLEMENTATION

To test our methodology a full application was implemented. The application consists of two main parts: first the database, implemented using an Oracle 9i database, and second the user interface, implemented in Java. The thesaurus database was designed to satisfy the needs of our enhanced search methodology, and the implemented GUI allows the configuration of the matching process to be set up easily.

V. EXPERIMENTS AND INITIAL RESULTS

Our methodology was tested in the area of finding equivalent courses based on course descriptions. The data for our experiments were collected from the Internet in two different
knowledge domains: agriculture and computer science. In our first experiment we tested two sets of courses in the agriculture domain; each set consists of 22 courses offered by two different universities. The course descriptions of both sets were processed to find the equivalent courses among them. Our initial results show that more than 45% of the compared courses matched each other, with similarity ratios ranging from 50% up to 95%. The following table shows the final matching of our first experiment with similarity ratio greater than 50%:

Elements from Schema 1      Elements from Schema 2      Normalized
No. in Schema   Name        No. in Schema   Name        Similarity Ratio
15              AGR231      2               AGRI101     1.0
2               AGR115      5               AGRI118     0.992
18              AGR241      1               AGRI100     0.957
10              AGR138      12              AGRI145     0.932
19              AGR242      17              AGRI205     0.781
14              AGR191      14              AGRI151     0.766
4               AGR117      3               AGRI114     0.735
1               AGR102      8               AGRI123     0.717
23              AGR253      13              AGRI146     0.539
17              AGR239      9               AGRI126     0.526

In the computer science domain we tested two sets of courses from two different universities; each contains 12 courses. 33% of the compared courses were matched with a similarity ratio greater than 50%; the following table shows the initial results of this experiment.

Elements from Schema 1      Elements from Schema 2      Normalized
No. in Schema   Name        No. in Schema   Name        Similarity Ratio
7               CPSC336     1               CIS102      1.0
9               CPSC431     9               CIS225      0.884
6               CPSC329     5               CIS136      0.814
10              CPSC433     11              CIS236      0.526

VI. CONCLUSION AND FUTURE WORK

The above results are promising, and they show that we can depend on a thesaurus to find the similarity between two texts, and that it can be used efficiently in schema matching systems to find the mapping between two different schemas with different tag names by exploring the similarity of their tags' descriptions. Currently we are developing a tool that uses a thesaurus as a schema matcher; the tool depends on additional factors that can be extracted from the schema, such as elements' attributes, data types, etc. A matcher using a thesaurus can also be integrated with other schema matching systems that depend on different criteria.

REFERENCES

[1] Dong, C. and Bailey, J. "A Framework for Integrating XML Transformations."
[2] Connolly, T. and Begg, C. Database Systems: A Practical Approach to Design, Implementation, and Management, 4th ed., p. 1091.
[3] http://www.w3.org/TR/xmlschema11-1/#concepts-data-model. Accessed Sept. 2009.
[4] http://en.wikipedia.org/wiki/Thesaurus. Accessed Sept. 2009.
[5] Losee, R. M. (2007) "Decisions in Thesaurus Construction and Use," Information Processing & Management, 43(4), pp. 958-968.
[6] Batini, C., Lenzerini, M. and Navathe, S.B. (1986) "A Comparative Analysis of Methodologies for Database Schema Integration," ACM Computing Surveys, 18(4), pp. 323-364.
[7] Berlin, J. and Motro, A. (2002) "Database Schema
Matching Using Machine Learning with Feature Selection," Advanced Information Systems Engineering: 14th International Conference, CAiSE 2002, Toronto, Canada.
[8] Blake, R. (2007) UMBC MWP 1031, September 2007, unpublished research (http://www.management.umb.edu/faculty/workingpaper/blake_roger/Blake_Schema_Matching_Research.pdf). Accessed Sept. 2009.
[9] Chua, C., Chiang, R., and Lim, E. (2003) "Instance-based Attribute Identification in Database Integration," The VLDB Journal, (12), pp. 228-243.
[10] Claypool, K. and Hegde, V. (2005) "QMatch - A Hybrid Match Algorithm for XML Schemas," Proceedings of the 21st International Conference on Data Engineering.
[11] Dhamankar, R., Lee, Y., Doan, A., Halevy, A., and Domingos, P. (2004) "iMAP: Discovering Complex Semantic Matches between Database Schemas," Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, pp. 383-394.
[12] Do, H., Melnik, S., and Rahm, E. (2002) "Comparison of Schema Matching Evaluations," Proceedings of the 2nd Int. Workshop on Web Databases.
[13] Doan, A. and Halevy, A. (2005) "Semantic-Integration Research in the Database Community," AI Magazine, Spring 2005, pp. 83-94.
[14] Doan, A., Domingos, P., and Halevy, A. (2003) "Learning to Match the Schemas of Data Sources: A Multistrategy Approach," Machine Learning, (50), pp. 279-301.
[15] Ehrig, M. and Staab, S. (2004) "QOM: Quick Ontology Mapping," Proceedings of the International Semantic Web Conference (ISWC), pp. 683-697.
[16] Li, W. and Clifton, C. (2000) "SEMINT: A Tool for Identifying Attribute Correspondences in Heterogeneous Databases Using Neural Networks," Data and Knowledge Engineering, 33(1), pp. 49-84.
[17] Madhavan, J., Bernstein, P., and Rahm, E. (2001) "Generic Schema Matching with Cupid," Proceedings of the 27th International Conference on Very Large Data Bases, pp. 49-58.
[18] Madhavan, J., Bernstein, P., Doan, A., and Halevy, A. (2005) "Corpus-based Schema Matching," Proceedings of the 21st International Conference on Data Engineering.
[19] Melnik, S., Garcia-Molina, H., and Rahm, E. (2002) "Similarity Flooding: A Versatile Graph Matching Algorithm," Proceedings of the 18th International Conference on Data Engineering, p. 117.
[20] Rahm, E. and Bernstein, P. (2001) "A Survey of Approaches to Automatic Schema Matching," The VLDB Journal, 10, pp. 335-350.
[21] Ram, S. and Park, J. (2004) "Semantic Conflict Resolution Ontology (SCROL): An Ontology for Detecting and Resolving Data and Schema-Level Semantic Conflicts," IEEE Transactions on Knowledge and Data Engineering, 16(2), pp. 189-202.
[22] Shvaiko, P. and Euzenat, J. (2005) "A Survey of Schema-based Matching Approaches," Journal on Data Semantics, 4, pp. 146-171.
[23] Smiljanic, M., van Keulen, M., and Jonker, W. (2006) "Using Element Clustering to Increase the Efficiency of XML Schema Matching."
[24] Sung, S. and McLeod, D. (2006) "Ontology-Driven Semantic Matches between Database Schemas," Proceedings of the 22nd International Conference on Data Engineering.
[25] Chevallet, J.-P. ( ) "Building Thesaurus from Manual Sources and Automatic Scanned Texts."
[26] American National Standards Institute (2005) ANSI/NISO Z39.19-2005.
[27] Aitchison, J. et al. (1997) Thesaurus Construction and Use: A Practical Manual, 3rd ed., Aslib.