SEMANTIC INTEGRATION THROUGH APPLICATION ANALYSIS
By OGUZHAN TOPSAKAL
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2007
© 2007 Oguzhan Topsakal
To all who are patient, supportive, just and loving to others regardless of time, location and status
ACKNOWLEDGMENTS

I thought quite a lot about the time when I would finish my dissertation and write this acknowledgments section. Finally, the time has come. This is the place where I can remember all the good memories and thank everyone who helped along the way. However, I feel words are not enough to show my gratitude to those who were there with me all along the road to my Ph.D.

First of all, I give thanks to God for giving me the patience, strength, and commitment to come all this way. I would like to give my sincere thanks to my dissertation advisor, Dr. Joachim Hammer, who has been so kind and supportive to me. He was the perfect person for me to work with. I would also like to thank my other committee members, Dr. Tuba Yavuz-Kahveci, Dr. Christopher M. Jermaine, Dr. Herman Lam, and Dr. Raymond Issa, for serving on my committee. Thanks to Umut Sargut, Zeynep Sargut, Can Ozturk, Fatih Buyukserin and Fatih Gordu for making Gainesville a better place to live.

I am grateful to my parents, H. Nedret Topsakal and Sabahatdin Topsakal; to my brother, Metehan Topsakal; and to my sister-in-law, Sibel Topsakal. They were always there for me when I needed them, and they have always supported me in whatever I do. My wife, Elif, and I joined our lives during the most hectic times of my Ph.D. studies, and she supported me in every aspect. She is my treasure.
TABLE OF CONTENTS

ACKNOWLEDGMENTS ..... 4
LIST OF TABLES ..... 8
LIST OF FIGURES ..... 9
ABSTRACT ..... 11

CHAPTER

1 INTRODUCTION ..... 12
  1.1 Problem Definition ..... 12
  1.2 Overview of the Approach ..... 14
  1.3 Contributions ..... 16
  1.4 Organization of the Dissertation ..... 17

2 RELATED CONCEPTS AND RESEARCH ..... 18
  2.1 Legacy Systems ..... 19
  2.2 Data, Information, Semantics ..... 20
  2.3 Semantic Extraction ..... 20
  2.4 Reverse Engineering ..... 22
  2.5 Program Understanding Techniques ..... 24
    2.5.1 Textual Analysis ..... 25
    2.5.2 Syntactic Analysis ..... 25
    2.5.3 Program Slicing ..... 25
    2.5.4 Program Representation Techniques ..... 26
    2.5.5 Call Graph Analysis ..... 26
    2.5.6 Data Flow Analysis ..... 26
    2.5.7 Variable Dependency Graph ..... 26
    2.5.8 System Dependence Graph ..... 27
    2.5.9 Dynamic Analysis ..... 27
  2.6 Visitor Design Patterns ..... 27
  2.7 Ontology ..... 28
  2.8 Web Ontology Language (OWL) ..... 28
  2.9 WordNet ..... 30
  2.10 Similarity ..... 31
  2.11 Semantic Similarity Measures of Words ..... 32
    2.11.1 Resnik Similarity Measure ..... 32
    2.11.2 Jiang-Conrath Similarity Measure ..... 34
    2.11.3 Lin Similarity Measure ..... 34
    2.11.4 Intrinsic IC Measure in WordNet ..... 34
    2.11.5 Leacock-Chodorow Similarity Measure ..... 35
    2.11.6 Hirst-St.Onge Similarity Measure ..... 36
    2.11.7 Wu and Palmer Similarity Measure ..... 36
    2.11.8 Lesk Similarity Measure ..... 36
    2.11.9 Extended Gloss Overlaps Similarity Measure ..... 37
  2.12 Evaluation of WordNet-Based Similarity Measures ..... 37
  2.13 Similarity Measures for Text Data ..... 37
  2.14 Similarity Measures for Ontologies ..... 39
  2.15 Evaluation Methods for Similarity Measures ..... 41
  2.16 Schema Matching ..... 43
    2.16.1 Schema Matching Surveys ..... 43
    2.16.2 Evaluations of Schema Matching Approaches ..... 45
    2.16.3 Examples of Schema Matching Approaches ..... 46
  2.17 Ontology Mapping ..... 48
  2.18 Schema Matching vs. Ontology Mapping ..... 48

3 APPROACH ..... 49
  3.1 Semantic Analysis ..... 50
    3.1.1 Illustrative Examples ..... 51
    3.1.2 Conceptual Architecture of Semantic Analyzer ..... 53
      3.1.2.1 Abstract syntax tree generator (ASTG) ..... 53
      3.1.2.2 Report template parser (RTP) ..... 55
      3.1.2.3 Information extractor (IEx) ..... 55
      3.1.2.4 Report ontology writer (ROW) ..... 58
    3.1.3 Extensibility and Flexibility of Semantic Analyzer ..... 58
    3.1.4 Application of Program Understanding Techniques in SA ..... 60
    3.1.5 Heuristics Used for Information Extraction ..... 62
  3.2 Schema Matching ..... 67
    3.2.1 Motivating Example ..... 68
    3.2.2 Schema Matching Approach ..... 73
    3.2.3 Creating an Instance of a Report Ontology ..... 75
    3.2.4 Computing Similarity Scores ..... 76
    3.2.5 Forming a Similarity Matrix ..... 81
    3.2.6 From Matching Ontologies to Schemas ..... 81
    3.2.7 Merging Results ..... 82

4 PROTOTYPE IMPLEMENTATION ..... 84
  4.1 Semantic Analyzer (SA) Prototype ..... 84
    4.1.1 Using JavaCC to generate parsers ..... 84
    4.1.2 Execution steps of the information extractor ..... 86
  4.2 Schema Matching by Analyzing ReporTs (SMART) Prototype ..... 88

5 TEST HARNESS FOR THE ASSESSMENT OF LEGACY INFORMATION INTEGRATION APPROACHES (THALIA) ..... 89
  5.1 THALIA Website and Downloadable Test Package ..... 89
  5.2 DataExtractor (HTMLtoXML) Opensource Package ..... 91
  5.3 Classification of Heterogeneities ..... 92
  5.4 Web Interface to Upload and Compare Scores ..... 94
  5.5 Usage of THALIA ..... 95

6 EVALUATION ..... 96
  6.1 Test Data ..... 96
    6.1.1 Test Data Set from THALIA testbed ..... 96
    6.1.2 Test Data Set from University of Florida ..... 98
  6.2 Determining Weights ..... 102
  6.3 Experimental Evaluation ..... 105
    6.3.1 Running Experiments with THALIA Data ..... 107
    6.3.2 Running Experiments with UF Data ..... 110

7 CONCLUSION ..... 117
  7.1 Contributions ..... 119
  7.2 Future Directions ..... 121

REFERENCES ..... 123
BIOGRAPHICAL SKETCH ..... 131
LIST OF TABLES

Table  page

2-1 List of relations used to connect senses in WordNet. ..... 31
2-2 Absolute values of the coefficients of correlation between human ratings of similarity and the five computational measures. ..... 37
3-1 Semantic Analyzer can transfer information from one method to another through variables and can use this information to discover semantics of a schema element. ..... 62
3-2 Output string gives clues about the semantics of the variable following it. ..... 63
3-3 Output string and the variable may not be in the same statement. ..... 64
3-4 Output strings before the slicing variable should be concatenated. ..... 64
3-5 Tracing back the output text and associating it with the corresponding column of a table. ..... 64
3-6 Associating the output text with the corresponding column in the where-clause. ..... 65
3-7 Column header describes the data in that column. ..... 65
3-8 Column on the left describes the data items listed to its immediate right. ..... 65
3-9 Column on the left and the header immediately above describe the same set of data items. ..... 66
3-10 Set of data items can be described by two different headers. ..... 66
3-11 Header can be processed before being associated with the data on a column. ..... 66
4-1 Subpackages in the sa package and their functionality. ..... 85
6-1 The 10 university catalogs selected for evaluation and size of their schemas. ..... 98
6-2 Portion of a table description from the College of Engineering, the Bridges Project and the Business School schemas. ..... 100
6-3 Names of tables in the College of Engineering, the Bridges Office, and the Business School schemas and number of schema elements that each table has. ..... 101
6-4 Weights found by analytical method for different similarity functions with THALIA test data. ..... 104
6-5 Confusion matrix. ..... 106
LIST OF FIGURES

Figure  page

1-1 Scalable Extraction of Enterprise Knowledge (SEEK) Architecture. ..... 14
3-1 Scalable Extraction of Enterprise Knowledge (SEEK) Architecture. ..... 50
3-2 Schema used by an application. ..... 52
3-3 Schema used by a report. ..... 53
3-4 Conceptual view of the Data Reverse Engineering (DRE) module of the Scalable Extraction of Enterprise Knowledge (SEEK) prototype. ..... 54
3-5 Conceptual view of Semantic Analyzer (SA) component. ..... 54
3-6 Report design template example. ..... 55
3-7 Report generated when the above template was run. ..... 56
3-8 Java Servlet generated HTML report showing course listings of CALTECH. ..... 56
3-9 Annotated HTML page generated by analyzing a Java Servlet. ..... 57
3-10 Inter-procedural call graph of a program source code. ..... 61
3-11 Schemas of two data sources that collaborate for a new online degree program. ..... 69
3-12 Reports from two sample universities listing courses. ..... 70
3-13 Reports from two sample universities listing instructor offices. ..... 71
3-14 Similarity scores of schema elements of two data sources. ..... 73
3-15 Five steps of Schema Matching by Analyzing ReporTs (SMART) algorithm. ..... 74
3-16 Unified Modeling Language (UML) diagram of the Schema Matching by Analyzing ReporTs (SMART) report ontology. ..... 76
3-17 Example for a similarity matrix. ..... 81
3-18 Similarity scores after matching report pairs about course listings. ..... 82
3-19 Similarity scores after matching report pairs about instructor offices. ..... 82
4-1 Java code size distribution of the Semantic Analyzer (SA) and Schema Matching by Analyzing ReporTs (SMART) packages. ..... 84
4-2 Using JavaCC to generate parsers. ..... 86
5-1 Snapshot of Test Harness for the Assessment of Legacy information Integration Approaches (THALIA) website. ..... 90
5-2 Snapshot of the computer science course catalog of Boston University. ..... 91
5-3 Extensible Markup Language (XML) representation of Boston University's course catalog and corresponding schema file. ..... 92
5-4 Scores uploaded to Test Harness for the Assessment of Legacy information Integration Approaches (THALIA) benchmark for Integration Wizard (IWiz) Project at the University of Florida. ..... 94
6-1 Report design practice where all the descriptive texts are headers of the data. ..... 97
6-2 Report design practice where all the descriptive texts are on the left hand side of the data. ..... 97
6-3 Architecture of the databases in the College of Engineering. ..... 99
6-4 Results of the SMART with Jiang-Conrath (JCN), Lin and Levenstein metrics. ..... 107
6-5 Results of COmbination of MAtching algorithms (COMA++) with All Context and Filtered Context combined matchers and comparison of SMART and COMA++ results. ..... 108
6-6 Receiver Operating Characteristics (ROC) curves of SMART and COMA++ for THALIA test data. ..... 110
6-7 Results of the SMART with different report pair similarity thresholds for UF test data. ..... 112
6-8 F-Measure results of SMART and COMA++ for UF test data when report pair similarity is set to 0.7. ..... 113
6-9 Receiver Operating Characteristics (ROC) curves of the SMART for UF test data. ..... 114
6-10 Comparison of the ROC curves of the SMART and COMA++ for UF test data. ..... 115
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

SEMANTIC INTEGRATION THROUGH APPLICATION ANALYSIS

By Oguzhan Topsakal

May 2007

Chair: Joachim Hammer
Major: Computer Engineering

Organizations increasingly need to participate in rapid collaborations with other organizations to be successful, and they need to integrate their data sources to share and exchange data in such collaborations. One of the problems that must be solved when integrating different data sources is finding semantic correspondences between elements of the schemas of disparate data sources (a.k.a. schema matching). Schemas, even those from the same domain, show many semantic heterogeneities. Resolving these heterogeneities is mostly done manually, which is tedious, time consuming, and expensive. Current approaches to automating the process mainly use the schemas and the data as input to discover semantic heterogeneities. However, the schemas and the data are not sufficient sources of semantics.

In contrast, we analyze a valuable source of semantics, namely application source code and report design templates, to improve schema matching for information integration. Specifically, we analyze application source code that generates reports presenting the data of the organization in a user-friendly way. We trace the descriptive information on a report back to the corresponding schema element(s) through reverse engineering of the application source code or report design templates and store the descriptive text, data, and the corresponding schema elements in a report ontology instance. We utilize the information we have discovered for schema matching. Our experiments using a fully functional prototype system show that our approach produces more accurate results than current techniques.
CHAPTER 1
INTRODUCTION

1.1 Problem Definition
The success of many organizations largely depends on their ability to participate in rapid, flexible, limited-time collaborations. The need to collaborate is not limited to business but also applies to government and non-profit organizations in areas such as the military, emergency management, health care, and rescue. The success of a business organization depends on its ability to rapidly customize its products, adapt to continuously changing demands, and reduce costs as much as possible. Government organizations, such as the Department of Homeland Security, need to collaborate and exchange intelligence to maintain the security of the nation's borders or to protect critical infrastructure, such as energy supply and telecommunications. Non-profit organizations, such as the American Red Cross, need to collaborate on matters related to public health in catastrophic events, such as hurricanes. The collaboration of organizations produces a synergy to achieve a common goal that would not be possible otherwise.

Organizations participating in a rapid, flexible collaboration environment need to share and exchange data. In order to share and exchange data, organizations need to integrate their information systems and resolve heterogeneities among their data sources. The heterogeneities exist at different levels. There exist physical heterogeneities at the system level because of differences between various internal data storage, retrieval, and representation methods. For example, some organizations might use professional database management systems while others might use simple flat files to store and represent their data. In addition, there exist structural (syntax-level) heterogeneities because of differences at the schema level. Finally, there exist semantic-level heterogeneities because of differences in the use of data items that correspond to the same real-world objects [47]. We face a broad range of semantic heterogeneities in information systems because of
different viewpoints of the designers of these information systems. Semantic heterogeneity is simply a consequence of the independent creation of the information systems [44]. To resolve semantic heterogeneities, organizations must first identify the semantics of the data elements in their data sources. Discovering the semantics of data automatically has been an important area of research in the database community [22, 36]. However, the process of resolving semantic heterogeneity of data sources is still mostly done manually. Resolving heterogeneities manually is a tedious, error-prone, time-consuming, non-scalable, and expensive task. The time and investment needed to integrate data sources become a significant barrier to information integration among collaborating organizations. In this research, we develop an integrated, novel approach that automates the process of semantic discovery in data sources to overcome this barrier and to support rapid, flexible collaboration among organizations. As mentioned above, we are aware that there exist physical heterogeneities among information sources, but to keep the dissertation focused, we assume that data storage, retrieval, and representation methods are the same among the information systems to be integrated.

Based on our experience as software developers in the information technology departments of several banks and software companies, application source code that generates reports encapsulates valuable information about the semantics of the data to be integrated. Reports present data from the data source in a way that is easily comprehensible by the user and can be a rich source of semantics. We analyze application source code to discover semantics and thereby facilitate the integration of information systems. We outline the approach in Section 1.2 below and provide a more detailed explanation in Sections 3.1 and 3.2. The research described in this dissertation is part of the NSF-funded¹ SEEK (Scalable Extraction of Enterprise Knowledge) project, which also serves as a testbed.
¹ The SEEK project is supported by the National Science Foundation under grant numbers CMS-0075407 and CMS-0122193.
1.2 Overview of the Approach
The results described in this dissertation are based on the work we have done on the SEEK project. The SEEK project is directed at overcoming the problems of integrating legacy data and knowledge across the participants of a collaboration network [45]. The goal of the SEEK project is to develop methods and theory to enable rapid integration of legacy sources for the purpose of data sharing. We apply these methodologies in the SEEK toolkit which allows users to develop SEEK wrappers. A wrapper translates queries from an application to the data source schema at run-time. SEEK wrappers act as an intermediary between the legacy source and decision support tools which require access to the organization’s knowledge.
Figure 1-1: Scalable Extraction of Enterprise Knowledge (SEEK) Architecture.
In general, SEEK [45, 46] works in three steps: Data Reverse Engineering (DRE), Schema Matching (SM), and Wrapper Generation (WG). In the first step, the Data Reverse Engineering (DRE) component of SEEK generates a detailed description of the legacy source. DRE has two sub-components, the Schema Extractor (SE) and the Semantic Analyzer (SA). SE extracts the conceptual schema of the data source. SA analyzes electronically available information sources, such as application code, and discovers the semantics of the schema elements of the data source. In other words, SA discovers mappings between data items stored in an information system and the real-world objects they represent by using the pieces of evidence that it extracts from the application code. SA enhances the schema of the data source with the discovered semantics, and we refer to the semantically enhanced schema as the knowledgebase of the organization. In the second step, the Schema Matching (SM) component maps the knowledgebase of one organization to the knowledgebase of another organization. In the third step, the extracted legacy schema and the mapping rules provide the input to the Wrapper Generator (WG), which produces the source wrapper. These three steps of SEEK are build-time processes. At run-time, the source wrapper translates queries from the application domain model to the legacy source schema. A high-level schematic view outlining the SEEK components and their interactions is shown in Figure 1-1.

In this research, our focus is on the Semantic Analysis (SA) and Schema Matching (SM) methodology. We first describe how SA extracts semantically rich outputs from the application source code and then relates them to the schema knowledge extracted by the Schema Extractor (SE). We show that we can gather significant semantic information from the application source code with the methodology we have developed. We then focus on our Schema Matching (SM) methodology. We describe how we utilize the semantic information discovered by SA to find mappings between two data sources. The extracted semantic information and the mappings can then be used by the subsequent
wrapper generation step to facilitate the development of legacy source translators and other tools during information integration, which is not the focus of this dissertation.

1.3 Contributions
In this research, we introduce novel approaches for semantic analysis of application source code and for matching of related but disparate schemas. In this section, we list the contributions of this work; we describe these contributions in detail in Chapter 7 while concluding the dissertation.

External information sources such as corpora of schemas and past matches have been used for schema matching, but application source code has not yet been used as an external information source [25, 28, 78]. In this research, we focus on this well-known but not yet addressed challenge of analyzing application source code for the purpose of semantic extraction for schema matching.

The accuracy of current schema matching approaches is not sufficient for fully automating the process of schema matching [26]. The approach we present in this dissertation provides better accuracy for the purpose of automatic schema matching.

Schema matching approaches so far have mostly used lexical similarity functions or look-up tables to determine the similarities of two schema element properties (for example, the names and types of schema elements). There have been suggestions to utilize semantic similarity measures between words [7], but they have not been realized. We utilize state-of-the-art semantic similarity measures between words to determine similarities and show their effect on the results.

Another important contribution is the introduction of a generic similarity function for matching classes of ontologies. We also describe how we determine the weights of our similarity function. Our similarity function, along with the methodology to determine its weights, can be applied in many domains to determine similarities between different entities.
Integration based on user reports eases the communication between business and information technology (IT) specialists, who often have difficulty understanding each other. Business and IT specialists can discuss the data presented on reports rather than incomprehensible database schemas. Analyzing reports for data integration and sharing therefore helps business and IT specialists communicate better.

Another contribution is the functional extensibility of our semantic analysis methodology. Our information extraction framework lets researchers add new functionality as they develop new heuristics and algorithms for the source code being analyzed. Our current information extraction techniques provide improved performance because they require fewer passes over the source code, and improved accuracy because they eliminate unused code fragments (i.e., methods, procedures).

While conducting this research, we saw that there is a need for available test data of sufficient richness and volume to allow meaningful and fair evaluations of different information integration approaches. To address this need, we developed the THALIA² (Test Harness for the Assessment of Legacy information Integration Approaches) benchmark, which provides researchers with a collection of over 40 downloadable data sources representing university course catalogs, a set of twelve benchmark queries, and a scoring function for ranking the performance of an integration system [47, 48].

1.4 Organization of the Dissertation
The rest of the dissertation is organized as follows. We introduce important concepts of the work and summarize research in Chapter 2. Chapter 3 describes our semantic analysis approach and schema matching approach. Chapter 4 describes the implementation details of our prototype. Before we describe the experimental evaluation of our approach in Chapter 6, we describe the THALIA test bed in Chapter 5. Chapter 7 concludes the dissertation and summarizes the contributions of this work.
² THALIA website: http://www.cise.ufl.edu/project/thalia.html
CHAPTER 2
RELATED CONCEPTS AND RESEARCH

In the context of this work, we have explored a broad range of research areas. These include, but are not limited to, data semantics, semantic discovery, semantic extraction, legacy system understanding, reverse engineering of application code, information extraction from application code, semantic similarity measures, schema matching, ontology extraction, and ontology mapping. While developing our approach, we leverage these research areas. In this chapter, we introduce important concepts and related research that are essential for understanding the contributions of this work. Whenever necessary, we provide our interpretations of definitions and commonly accepted standards and conventions in this field of study. We also present the state of the art in the related research areas.

We first introduce what a legacy system is. Then we state the difference between the frequently used terms data, information, and semantics in Section 2.2. We point out some of the research on semantic extraction in Section 2.3. Since we extract semantics through reverse engineering of application source code, we provide the definitions of reverse engineering of source code and database reverse engineering in Section 2.4 and present techniques for program understanding in Section 2.5. We represent the information extracted from the application source code of different legacy systems in ontologies and utilize these ontologies to find semantic similarities between them. For this reason, semantic similarity measures are also important for us. We have explored the research on semantic similarity measures and present these works in Section 2.11 after giving the definition of similarity in Section 2.10. We aim to leverage the research on assessing similarity scores between texts and between ontologies; we present these techniques in Sections 2.13 and 2.14. We then present the ontology concept and the Web Ontology Language (OWL). Finally, we present ontology mapping and schema mapping
and conclude the chapter by presenting some outstanding techniques of schema matching in the literature.

2.1 Legacy Systems
Our approaches for semantic analysis of application source code and schema matching have been developed as part of the SEEK project, which aims to help with the understanding of legacy systems. We analyze the application source code of a legacy system to understand its semantics and apply the gained knowledge to solve the schema matching problem of data integration. In this section, we first give a broad definition of a legacy system and highlight its importance, and then provide its definition in the context of this work.

Legacy systems are generally known as inflexible, nonextensible, undocumented, old, and large software systems which are essential for the organization's business [12, 14, 75]. They significantly resist modifications and changes. Legacy systems are very valuable because they are the repository of corporate knowledge collected over a long time and they also encapsulate the logic of the organization's business processes [49]. A legacy system is generally developed and maintained by many different people with many different programming styles. Mostly, the original programmers have left, and the existing team is not expert in all aspects of the system [49]. Even if documentation about the design and specification of the legacy system once existed, the original specification and design have changed over the years of development and maintenance while the documentation was not updated. Thus, understanding is lost, and the only reliable documentation of the system is the application source code running on the legacy system [75].

In the context of this work, we define a legacy system as any information system with poor or nonexistent documentation about the underlying data or the application code that is using the data. Despite the fact that legacy systems are often interpreted as old
systems, for us, an information system is not required to be old in order to be considered as legacy.

2.2 Data, Information, Semantics
In this section, we give definitions of data, information, and semantics before we explore some research on semantic extraction in the following section. According to a simplistic definition, data is the raw, unprocessed input to an information system that produces information as an output. A commonly accepted definition states that data is a representation of facts, concepts, or instructions in a formalized manner suitable for communication, interpretation, or processing by humans or by automatic means [2, 18]. Data mostly consists of disconnected numbers, words, symbols, etc., and results from measurable events or objects. Data has value when it is processed, changed into a usable form, and placed in a context [2]. When data has a context and has been interpreted, it becomes information; then it can be used purposefully as information [1]. Semantics is the meaning and the use of data. Semantics can be viewed as a mapping between an object stored in an information system and the real-world object it represents [87].

2.3 Semantic Extraction
In this section, we first state the importance of semantic extraction and of application source code as a rich source for semantic extraction, and then point out several related research efforts in this area. Sheth et al. [87] stated that data semantics does not seem to have a purely mathematical or formal model and cannot be discovered completely and fully automatically. Therefore, the process of semantic discovery requires human involvement. Besides being human-dependent, semantic extraction is a time-consuming and hence expensive task [36]. Although it cannot be fully automated, discovering even a limited amount of useful semantics can tremendously reduce the cost of understanding a system.
Semantics can be found in knowledge representation schemas, communication protocols, and applications that use the data [87]. Throughout the discussions and research on semantic extraction, application source code has been proposed as a rich source of information [30, 36, 87]. Moreover, researchers agree that the extraction of semantics from application source code is essential for the identification and resolution of semantic heterogeneity. We use the semantics discovered from application source code to find correspondences between schemas of disparate data sources automatically. In this context, discovering semantics means gathering information about the data so that a computer can identify mappings (paths) between corresponding schema elements in different data sources.

Ning et al. worked on extracting semantics from application source code but with a slightly different aim: they developed an approach to identify and recover reusable code components [67]. They investigated conditional statements, as we do, to find business rules, stating that conditional statements are potential business rules. They also gave importance to input and output statements for highlighting semantics inside the code, and stated that meaningful business functions normally process input values and produce results. Ning et al. referred to investigating input variables as forward slicing and to investigating output statements as backward slicing. The drawback of their approach was being very language-specific (Cobol) [67].

Ashish et al. worked on extracting semantics from internet information sources to enable semi-automatic wrapper generation [5]. They used several heuristics to identify important tokens and structures of HTML pages in order to create the specification for a parser. Similar to our approach, they benefited from parser generation tools, namely YACC [53] and LEX [59], for semantic extraction.

There is also related work in information extraction from text that deals with tables and with ontology extraction from tables. The most relevant work, on information extraction from HTML pages with the help of heuristics, was done by Wang and Lochovsky
[94]. They aimed to form the schema of the data extracted from an HTML page by using the labels of a table on that page. The heuristics that they use to relate labels to the data and to separate data found in a table cell into several attributes are very similar to ours. For example, they assume that if several attributes are encoded into one text string, then there should be some special symbol(s) in the string as a separator to visually help users distinguish the attributes. Buttler et al. [17] and Embley et al. [32] also developed heuristic-based approaches for information extraction from HTML pages; however, their aim was to identify the boundaries of data on an HTML page. Embley et al. [33] also worked on table recognition from documents and suggested a table ontology which is very similar to our report ontology. In a related work, Tijerino et al. [90] introduced an information extraction system called TANGO, which recognizes tables based on a set of heuristics, forms mini-ontologies, and then merges these ontologies to form a larger application ontology.

2.4 Reverse Engineering
Without an understanding of the system, in other words without accurate documentation of the system, it is not possible to maintain, extend, and integrate the system with other systems [76, 89, 95]. The methodology to reconstruct this missing documentation is reverse engineering. In this section, we first give the definition of reverse engineering in general and then give definitions of program reverse engineering and database reverse engineering. We also state the importance of these tasks.

Reverse engineering is the process of analyzing a technology to learn how it was designed or how it works. Chikofsky and Cross [19] defined reverse engineering as the process of analyzing a subject system to identify the system's components and their interrelationships and as the process of creating representations of the system in another form or at a higher level of abstraction. Reverse engineering is an action to understand the subject system and does not include its modification. The reverse of reverse
engineering is forward engineering. Forward engineering is the traditional process of moving from high-level abstractions and logical, implementation-independent designs to the physical implementation of a system [19]. While reverse engineering starts from the subject system and aims to identify the high-level abstraction of the system, forward engineering starts from the specification and aims to implement the subject system.

Program (software) reverse engineering is recovering the specifications of the software from source code [49]. The recovered specifications can be represented in forms such as data flow diagrams, flow charts, specifications, hierarchy charts, call graphs, etc. [75]. The purpose of program reverse engineering is to enhance our understanding of the software of the system in order to reengineer, restructure, maintain, extend, or integrate the system [49, 75].

Database Reverse Engineering (DBRE) is defined as identifying the possible specification of a database implementation [22]. It mainly deals with schema extraction, analysis, and transformation [49]. Chikofsky and Cross [19] defined DBRE as a process that aims to determine the structure, function, and meaning of the data of an organization. Hainaut [41] defined DBRE as the process of recovering the schema(s) of the database of an application from the data dictionary and the program source code that uses the data. The objective of DBRE is to recover the technical and conceptual descriptions of the database. It is a prerequisite for several activities such as maintenance, reengineering, extension, migration, and integration. DBRE can produce an almost complete abstract specification of an operational database, while program reverse engineering can only produce partial abstractions that can help better understand a program [22, 42].

Many data structures and constraints are embedded inside the source code of data-oriented applications. If a construct or a constraint has not been declared explicitly in the database schema, it is implemented in the source code of the application that updates or queries the database. The data in the database is a result of the execution of the applications of the organization [49]. Even though the data satisfies the constraints of the database, it is verified with the validation mechanisms inside the source code before it
is written to the database to ensure that it does not violate the constraints. We can discover some constraints, such as referential constraints, by analyzing the application source code, even if the application program only queries the data but does not modify it. For instance, if there exists a referential constraint (foreign key relation) between the entity named E1 and the entity named E2, this constraint is used to join the data of these two entities in a query. We can discover this referential constraint by analyzing the query [50], as illustrated in the sketch at the end of this section.

Since program source code is a very useful source of information in which we can discover many implicit constructs and constraints, we use it as an information source for DBRE. It is well known that the analysis of program source code is a complex and tedious task. However, we do not need to recover the complete specification of the program for DBRE; we are looking for information to enhance the schema and to find the undeclared constraints of the database. In this process, we benefit from several program understanding techniques to extract information effectively. We provide the definitions of program understanding and its techniques in the following section.
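To make the referential-constraint discussion above concrete, here is a minimal sketch of the kind of report-generating code a semantic analyzer might encounter; the class, table, and column names are hypothetical and the fragment is illustrative rather than taken from the SEEK prototype. The join predicate in the embedded query is the evidence from which a foreign-key relationship between the two tables could be inferred.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

public class CourseReport {
    // Hypothetical embedded query: the join predicate c.instructor_id = i.id
    // suggests an undeclared referential constraint (foreign key) from
    // course.instructor_id to instructor.id, even though the program only reads data.
    static final String QUERY =
            "SELECT c.code, c.title, i.name "
          + "FROM course c, instructor i "
          + "WHERE c.instructor_id = i.id";

    void print(Connection conn) throws Exception {
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(QUERY)) {
            while (rs.next()) {
                // Descriptive output text ("Course", "taught by") is the kind of
                // clue that can be traced back to schema elements.
                System.out.println("Course " + rs.getString("code")
                        + " taught by " + rs.getString("name"));
            }
        }
    }
}
```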
2.5 Program Understanding Techniques
In this section, we introduce the concept of program understanding and its techniques. We have implemented these techniques to analyze application source code and extract semantic information effectively. Program understanding (a.k.a. program comprehension) is the process of acquiring knowledge about an existing, generally undocumented, computer program. The knowledge acquired about the business processes through the analysis of the source code is accurate and up-to-date because the source code is used to generate the application that the organization uses. Basic actions that can be taken to understand a program are to read the documentation about it, to ask for assistance from its users, to read its source code, or to run the program to see what it outputs for specific inputs [50]. Besides these actions, there
are several techniques that we can apply to understand a program. These techniques help the analyst extract high-level information from low-level code and come to a better understanding of the program. These techniques are mostly performed manually. However, we apply them in our semantic analyzer module to automatically extract information from data-oriented applications. We show how we apply these techniques in our semantic analyzer in Section 3.1.5. We describe the main program understanding techniques in the following subsections.

2.5.1 Textual Analysis
One simple way to analyze a program is to search for a specific string in the program source code. This searched string can be a pattern or a cliché. The program understanding technique that searches for a pattern or a cliché is known as pattern matching or cliché recognition. A pattern can include wildcards and character ranges and can be based on other defined patterns. A cliché is a commonly used programming pattern. Examples of clichés are algorithmic computations, such as list enumeration and binary search, and common data structures, such as priority queue and hash table [49, 97].
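As a small illustration of textual analysis (not part of the original text), the sketch below scans a source file for one simple pattern, statements that print output, using Java regular expressions; the pattern and the command-line argument are assumptions made for the example.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.regex.Pattern;

public class PatternScanner {
    // A simple "pattern" in the sense of textual analysis: any statement that
    // prints output, e.g. System.out.print(...) or out.println(...).
    private static final Pattern OUTPUT_CALL =
            Pattern.compile("\\b(System\\.out|out)\\.print(ln)?\\s*\\(");

    public static void main(String[] args) throws Exception {
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        for (int i = 0; i < lines.size(); i++) {
            if (OUTPUT_CALL.matcher(lines.get(i)).find()) {
                // Report the line number and the matching statement.
                System.out.println((i + 1) + ": " + lines.get(i).trim());
            }
        }
    }
}
```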
2.5.2 Syntactic Analysis
Syntactic analysis is performed by a parser that decomposes a program into expressions and statements. The result of the parser is stored in a structure called an abstract syntax tree (AST). An AST is a representation of source code that facilitates the use of tree traversal algorithms, and it is the basis of most sophisticated program analysis tools [49].
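The following fragment is an illustrative sketch, not the prototype's actual class hierarchy, of how a parser's output can be stored as AST nodes that tree-traversal algorithms can walk.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal AST node: each node keeps its grammar symbol, an optional token image,
// and its children, so analyses can be written as simple tree traversals.
public class AstNode {
    final String symbol;          // e.g. "MethodDeclaration", "StringLiteral"
    final String image;           // token text, if any
    final List<AstNode> children = new ArrayList<>();

    AstNode(String symbol, String image) {
        this.symbol = symbol;
        this.image = image;
    }

    void add(AstNode child) { children.add(child); }

    // Depth-first traversal, the basic operation most analyses build on.
    void dump(String indent) {
        System.out.println(indent + symbol + (image != null ? " '" + image + "'" : ""));
        for (AstNode c : children) c.dump(indent + "  ");
    }
}
```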
2.5.3 Program Slicing
Program slicing is a technique to extract the statements from a program relevant to a particular computation, specific behavior, or interest such as a business rule [75]. The slice of a program with respect to program point p and variable V consists of all statements and predicates of the program that might affect the value of V at point p [96]. Program slicing is used to reduce the scope of program analysis [49, 83]. The slice that affects the
value of V at point p is computed by gathering statements and control predicates by way of a backward traversal of the program, starting at the point p. This kind of slice is also known as a backward slice. When we retrieve the statements that can potentially be affected by the variable V starting from a point p, we call it forward slicing. Forward and backward slicing are both types of static slicing because they use only statically available information (the source code) for the computation.
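A small, hypothetical example of a backward slice: for the slicing criterion consisting of the print statement and the variable total, only the statements that might affect total are part of the slice (marked in the comments).

```java
public class SliceExample {
    static void report(int[] prices) {
        int total = 0;                            // in slice: defines total
        int count = 0;                            // not in slice
        for (int p : prices) {                    // in slice: controls the update of total
            total += p;                           // in slice: updates total
            count++;                              // not in slice
        }
        System.out.println("Total: " + total);    // slicing criterion: (this point, total)
    }
}
```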
2.5.4 Program Representation Techniques
Program source code, even reduced through program slicing, is often too difficult to understand because the program can be huge, poorly structured, and based on poor naming conventions. It is useful to represent the program in different abstract views such as the call graph, data flow graph, etc. [49]. Most program reverse engineering tools provide these kinds of visualization facilities. In the following sections, we present several program representation techniques.

2.5.5 Call Graph Analysis
Call graph analysis is the analysis of the execution order of the program units or statements. If it determines the order of the statements within a program then it is called intra-procedural analysis. If it determines the calling relationship among the program units, it is called inter-procedural analysis [49, 83].
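The sketch below, with invented method names, shows one simple way to represent an inter-procedural call graph as an adjacency map from each program unit to the units it calls.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Inter-procedural call graph as an adjacency map:
// each method name maps to the set of methods it calls.
public class CallGraph {
    private final Map<String, Set<String>> edges = new LinkedHashMap<>();

    public void addCall(String caller, String callee) {
        edges.computeIfAbsent(caller, k -> new LinkedHashSet<>()).add(callee);
    }

    public Set<String> callees(String caller) {
        return edges.getOrDefault(caller, new LinkedHashSet<>());
    }

    public static void main(String[] args) {
        CallGraph g = new CallGraph();
        // Hypothetical report program: main calls buildQuery and printReport,
        // and printReport calls formatRow.
        g.addCall("main", "buildQuery");
        g.addCall("main", "printReport");
        g.addCall("printReport", "formatRow");
        System.out.println(g.callees("main"));   // [buildQuery, printReport]
    }
}
```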
2.5.6 Data Flow Analysis
Data flow analysis is the analysis of the flow of the values from variables to variables between the instructions of a program. The variables defined and the variables referenced by each instruction, such as declaration, assignment and conditional, are analyzed to compute the data flow [49, 83].

2.5.7 Variable Dependency Graph
Variable dependency graph is a type of data flow graph where a node represents a variable and an arc represents a relation (assignment, comparison, etc.) between two variables. If there is a path from variable v1 to variable v2 in the graph, then there is a
sequence of statements such that the value of v1 is in relation with the value of v2. If the relation is an assignment statement then the arc in the diagram is directed. If the relation is a comparison statement then the arc is not directed [49, 83].
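A tiny, hypothetical fragment and the variable dependency graph it induces: the assignment yields a directed arc, while the comparison yields an undirected one.

```java
public class VdgExample {
    static String classify(int score, int passMark) {
        int threshold = passMark;        // assignment: directed arc passMark -> threshold
        if (score >= threshold) {        // comparison: undirected arc score -- threshold
            return "pass";
        }
        return "fail";
    }
    // Variable dependency graph for this fragment:
    //   passMark -> threshold   (directed, from the assignment)
    //   score -- threshold      (undirected, from the comparison)
}
```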
2.5.8 System Dependence Graph
System dependence graph is a type of data flow graph that also handles procedures and procedure calls. A system dependence graph represents the passing of values between procedures. When procedure P calls procedure Q, values of parameters are transferred from P to Q and when Q returns, the return value is transferred back to P [49].

2.5.9 Dynamic Analysis
The program understanding techniques described so far are performed on the source code of the program and are forms of static analysis. Dynamic analysis, in contrast, is the process of gaining increased understanding of a program by systematically executing it [83].

2.6 Visitor Design Patterns
We applied the above program understanding techniques in our semantic analyzer program. We implemented our semantic analyzer by using visitor patterns. In this section, we explain what a visitor pattern is and the rationale for using it. A Visitor Design Pattern is a behavioral design pattern [38], which is used to encapsulate the functionality that we desire to perform on the elements of a data structure. It gives the flexibility to change the operation being performed on a structure without the need to change the classes of the elements on which the operation is performed. Our goal is to build semantic information extraction techniques that can be applied to any source code and can be extended with new algorithms. The visitor design pattern technique is the key object oriented technique to reach this goal. New operations over the object structure can be defined simply by adding a new visitor. Visitor classes localize related behavior in the same visitor and unrelated sets of behavior are partitioned in their own visitor subclasses. If the classes defining the object structure, in our case the grammar production rules of the programming language, rarely change, but
new operations over the structure are often defined, a visitor design pattern is the perfect choice [13, 71].
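The following self-contained sketch illustrates the visitor idiom over AST-like node classes; it is a simplified illustration, not the Semantic Analyzer's actual implementation. Adding a new analysis means adding a new Visitor class, while the node classes stay untouched.

```java
// Element hierarchy stays fixed; each new analysis is a new Visitor implementation.
interface Visitor {
    void visit(StringLiteral node);
    void visit(MethodCall node);
}

interface Node {
    void accept(Visitor v);
}

class StringLiteral implements Node {
    final String text;
    StringLiteral(String text) { this.text = text; }
    public void accept(Visitor v) { v.visit(this); }
}

class MethodCall implements Node {
    final String name;
    MethodCall(String name) { this.name = name; }
    public void accept(Visitor v) { v.visit(this); }
}

// One concrete analysis: collect string literals (e.g., descriptive text on a report).
class OutputTextCollector implements Visitor {
    public void visit(StringLiteral node) { System.out.println("text: " + node.text); }
    public void visit(MethodCall node)    { /* not relevant for this analysis */ }
}

public class VisitorDemo {
    public static void main(String[] args) {
        Node[] nodes = { new MethodCall("println"), new StringLiteral("Course Code:") };
        Visitor analysis = new OutputTextCollector();
        for (Node n : nodes) n.accept(analysis);
    }
}
```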
2.7 Ontology
An ontology represents a common vocabulary describing the concepts and relationships for researchers who need to share information in a domain [40, 69]. It includes machine interpretable definitions of basic concepts in the domain and relations among them. Ontologies enable the definition and sharing of domain-specific vocabularies. They are developed to share common understanding of the structure of information among people or software agents, to enable reuse of domain knowledge, and to analyze domain knowledge [69]. According to a commonly quoted definition, an ontology is a formal, explicit specification of a shared conceptualization [40]. For a better understanding, Michael Uschold et al. define the terms in this definition as follows [92]: A conceptualization is an abstract model of how people think about things in the world. An explicit specification means the concepts and relations in the abstract model are given explicit names and definitions. Formal means that the meaning specification is encoded in a language whose formal properties are well understood. Shared means that the main purpose of an ontology is generally to be used and reused across different applications.

2.8 Web Ontology Language (OWL)
The Web Ontology Language (OWL) is a semantic markup language for publishing and sharing ontologies on the World Wide Web [64]. OWL is derived from the DAML+OIL Web Ontology Language. DAML+OIL was developed as a joint effort of researchers who initially developed DAML (DARPA Agent Markup Language) and OIL (Ontology Inference Layer or Ontology Interchange Language) separately. OWL is designed for processing and reasoning about information by computers instead of just presenting it on the Web. OWL supports more machine interpretability than XML (Extensible Markup Language), RDF (the Resource Description Framework),
and RDF-S (RDF Schema) by providing additional vocabulary along with a formal semantics. The formal semantics allows us to reason about the knowledge: we may reason about class membership, equivalence of classes, and consistency of the ontology (to detect unintended relationships between classes), and we may classify instances into classes. RDF and RDF-S can be used to represent ontological knowledge. However, it is not possible to use all reasoning mechanisms with RDF and RDF-S because of missing features such as disjointness of classes, boolean combinations of classes, cardinality restrictions, etc. [4]. When all these features are added to RDF and RDF-S to form an ontology language, the language becomes very expressive; however, reasoning becomes inefficient.

For this reason, OWL comes in three different flavors: OWL Lite, OWL DL, and OWL Full. The entire language is called OWL Full and uses all the OWL language primitives; it also allows these primitives to be combined in arbitrary ways with RDF and RDF-S. Despite its expressiveness, computations over OWL Full can be undecidable. OWL DL (OWL Description Logic) is a sublanguage of OWL Full. It includes all OWL language constructs but restricts the ways in which these constructors from OWL and RDF can be used. This makes the computations in OWL DL complete (all conclusions are guaranteed to be computable) and decidable (all computations will finish in finite time). Therefore, OWL DL supports efficient reasoning. OWL Lite limits OWL DL to a subset of the constructors (for example, OWL Lite excludes enumerated classes, disjointness statements, and arbitrary cardinality), making it less expressive. However, it may be a good choice for hierarchies needing simple constraints [4, 64].

OWL provides an infrastructure that allows a machine to make the same sorts of simple inferences that human beings do. A set of OWL statements by itself (and the OWL spec) can allow you to conclude another OWL statement, whereas a set of XML statements by itself (and the XML spec) does not allow you to conclude any other XML statements. Given the statements (motherOf subProperty parentOf) and (Nedret
motherOf Oguzhan), when stated in OWL, allows you to conclude (Nedret parentOf Oguzhan) based on the logical definition of subProperty as given in the OWL spec. Another advantage of using OWL ontologies is the availability of tools such as Racer, FaCT, and Pellet that can reason about them. A reasoner can also help us to check whether we have accurately extracted data and description elements from a report. For instance, we can define a rule such as 'No data or description elements can overlap' and check the OWL ontology with a reasoner to determine whether this rule is satisfied.
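For illustration, the motherOf/parentOf inference above can be reproduced programmatically. The sketch below uses the Apache Jena library and its RDFS reasoner, which is our own choice for the example rather than a tool prescribed by this dissertation, and the namespace is hypothetical.

```java
import org.apache.jena.rdf.model.InfModel;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.RDFS;

public class SubPropertyInference {
    public static void main(String[] args) {
        String ns = "http://example.org/family#";   // hypothetical namespace
        Model m = ModelFactory.createDefaultModel();
        Property motherOf = m.createProperty(ns, "motherOf");
        Property parentOf = m.createProperty(ns, "parentOf");
        Resource nedret = m.createResource(ns + "Nedret");
        Resource oguzhan = m.createResource(ns + "Oguzhan");

        m.add(motherOf, RDFS.subPropertyOf, parentOf);  // motherOf subProperty parentOf
        m.add(nedret, motherOf, oguzhan);               // Nedret motherOf Oguzhan

        // An RDFS reasoner is enough to derive the subProperty entailment.
        InfModel inf = ModelFactory.createRDFSModel(m);
        System.out.println(inf.contains(nedret, parentOf, oguzhan));  // true
    }
}
```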
2.9 WordNet
WordNet is an online database which aims to model the lexical knowledge of a native speaker of English.¹ It is designed to be used by computer programs. WordNet links nouns, verbs, adjectives, and adverbs to sets of synonyms [66]. A set of synonyms represents the same concept and is known as a synset in WordNet terminology. For example, the concept of a 'child' may be represented by the set of words 'kid', 'youngster', 'tiddler', 'tike'. A synset also has a short definition or description of the real-world concept, known as a 'gloss', and has semantic pointers that describe relationships between the current synset and other synsets. The semantic pointers can be of a number of different types, including hyponym / hypernym (is-a / has-a), meronym / holonym (part-of / has-part), etc. A list of semantic pointers is given in Table 2-1.² WordNet can also be seen as a large graph or semantic network. Each node of the graph represents a synset and each edge of the graph represents a relation between synsets. Many of the approaches for measuring the similarity of words use the graphical structure of WordNet [15, 72, 79, 80]. Since the development of WordNet for English by researchers at Princeton University, many WordNets for other languages have been developed, such as Danish (DanNet), Persian (PersiaNet), and Italian (ItalWordNet).
¹ WordNet 2.1 defines 155,327 words of English.

² Table is adapted from [72].
Table 2-1. List of relations used to connect senses in WordNet.

Relation        Meaning                              Example
Hypernym        is a generalization of               furniture is a hypernym of chair
Hyponym         is a kind of                         chair is a hyponym of furniture
Troponym        is a way to                          amble is a troponym of walk
Meronym         is part / substance / member of      wheel is a (part) meronym of a bicycle
Holonym         contains part                        bicycle is a holonym of a wheel
Antonym         opposite of                          ascend is an antonym of descend
Attribute       attribute of                         heavy is an attribute of weight
Entailment      entails                              ploughing entails digging
Cause           cause to                             to offend causes to resent
Also see        related verb                         to lodge is related to reside
Similar to      similar to                           dead is similar to assassinated
Participle of   is participle of                     stored (adj) is the participle of to store
Pertainym of    pertains to                          radial pertains to radius
align WordNets of different languages. For instance, EuroWordNet [93] is a multilingual lexical knowledgebase that links WordNets of different languages (e.g., Dutch, Italian, Spanish, German, French, Czech and Estonian). In EuroWordNet, the WordNets are linked to an Inter-Lingual-Index which interconnects the languages so that we can go from the synsets in one language to the corresponding synsets in other languages. While WordNet is a database which aims to model a person's knowledge about a language, another research effort, Cyc [57] (derived from En-cyc-lopedia), aims to model a person's everyday common sense. Cyc formalizes common sense knowledge (e.g., 'You cannot remember events that have not happened yet', 'You have to be awake to eat') in the form of a massive database of axioms.
2.10 Similarity
Similarity is an important subject in many fields such as philosophy, psychology, and artificial intelligence. Measures of similarity or relatedness are used in various applications such as word sense disambiguation, text summarization and annotation, information extraction and retrieval, automatic correction of word errors in text, and text classification [15, 21]. Understanding how humans assess similarity is important to solve many of the problems of cognitive science such as problem solving, categorization, memory retrieval, inductive reasoning, etc. [39].
The similarity of two concepts refers to how many features they have in common and how many they do not share. Lin [60] provides an information-theoretic definition of similarity by clarifying the intuitions and assumptions about it. According to Lin, the similarity between A and B is related to their commonality and their difference. Lin assumes that the commonality between A and B can be measured according to the information they contain in common (I(common(A, B))). In information theory, the information contained in a statement is measured by the negative logarithm of the probability of the statement (I(common(A, B)) = −log P(A ∩ B)). Lin also assumes that if we know the descriptions of A and B, we can measure the difference by subtracting the commonality of A and B from the description of A and B. Hence, Lin states that the similarity between A and B, sim(A, B), is a function of their commonalities and descriptions. That is, sim(A, B) = f(I(common(A, B)), I(description(A, B))). We also come across the term 'semantic relatedness' while dealing with similarity. Semantic relatedness is a more general concept than similarity and refers to the degree to which two concepts are related [72]. Similarity is one aspect of semantic relatedness. Two concepts are similar if they are related in terms of their likeness (e.g., child - kid). However, two concepts can be related in terms of functionality or frequent association even though they are not similar (e.g., instructor - student, christmas - gift).
2.11 Semantic Similarity Measures of Words
In this section, we review semantic similarity measures of words from the literature. This review is not meant to be a complete list of similarity measures but covers most of the prominent ones. Most of the measures below use the hierarchical structure of WordNet.
2.11.1 Resnik Similarity Measure
Resnik [79] provided a similarity measure based on the is-a hierarchy of WordNet and statistical information gathered from a large corpus of text. Resnik used the statistical information from the corpus to measure information content.
According to information theory, the information content of a concept c can be quantified as − log P(c), where P(c) is the probability of encountering concept c. This formula tells us that as probability increases, informativeness decreases; so the more abstract a concept, the lower its information content. In order to calculate the probability of a concept, Resnik first computed the frequency of occurrence of every concept in a large corpus of text. Every occurrence of a concept in the corpus adds to the frequency of the concept and to the frequency of every concept subsuming the concept encountered. Based on this computation, the formula for the information content is:
P(c) = freq(c) / freq(r)

ic(c) = − log P(c) = − log( freq(c) / freq(r) )

where r is the root node of the taxonomy and c is the concept. According to Resnik, the more information two concepts have in common, the more similar they are. The information shared by two concepts is indicated by the information content of the concepts that subsume them in the taxonomy. The formula of the Resnik similarity measure is:
simRES(c1, c2) = max [ − log P(c) ]

where the maximum is taken over every concept c that subsumes both c1 and c2. One of the drawbacks of the Resnik measure is that it depends entirely on the information content of the concept that subsumes the two concepts whose similarity we measure; it does not take the two concepts themselves into account. For this reason, different pairs of concepts that have the same subsumer receive the same similarity value.
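As a concrete illustration, the sketch below computes the Resnik score from a precomputed information-content table and a precomputed set of common subsumers. Both the IC values and the subsumer set are invented placeholders; in practice they would come from a corpus and from the WordNet is-a hierarchy.

import java.util.Map;
import java.util.Set;

public class ResnikSimilarity {
    // simRES is the information content of the most informative concept
    // that subsumes both input concepts. The IC table and the set of
    // common subsumers are hypothetical placeholders here.
    static double simRes(Set<String> commonSubsumers, Map<String, Double> ic) {
        double max = 0.0;
        for (String c : commonSubsumers) {
            max = Math.max(max, ic.getOrDefault(c, 0.0));
        }
        return max;
    }

    public static void main(String[] args) {
        Map<String, Double> ic = Map.of("entity", 0.0, "vehicle", 3.2, "car", 6.1, "bicycle", 5.8);
        // Hypothetical: 'vehicle' and 'entity' subsume both 'car' and 'bicycle'.
        System.out.println(simRes(Set.of("vehicle", "entity"), ic)); // 3.2
    }
}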
2.11.2 Jiang-Conrath Similarity Measure
Jiang and Conrath [52] address the limitations of the Resnik measure. Their measure uses the information content of the two concepts, along with the information content of their lowest common subsumer, to compute the similarity of the two concepts. The measure is a distance measure that specifies the extent of unrelatedness of two concepts. The formula of the Jiang and Conrath measure is:
distanceJCN(c1, c2) = ic(c1) + ic(c2) − 2 ∗ ic(LCS(c1, c2))

where ic determines the information content of a concept, and LCS determines the lowest common subsuming concept of two given concepts. However, this measure works only with WordNet nouns.
2.11.3 Lin Similarity Measure
Lin [60] introduced a similarity measure between concepts based on his theory of similarity between arbitrary objects. To measure the similarity, Lin uses the information content of the two concepts being measured and the information content of their lowest common subsumer. The formula of the Lin measure is:

simLIN(c1, c2) = 2 ∗ log P(c0) / ( log P(c1) + log P(c2) )
where c0 is the lowest common concept that subsumes both c1 and c2.
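Both measures reduce to simple arithmetic once information-content values and the lowest common subsumer are known. The sketch below assumes such IC values are already available (the numbers in main are invented) and merely transcribes the two formulas.

public class ICBasedMeasures {
    // distanceJCN(c1, c2) = ic(c1) + ic(c2) - 2 * ic(LCS(c1, c2))
    static double jcnDistance(double icC1, double icC2, double icLcs) {
        return icC1 + icC2 - 2.0 * icLcs;
    }

    // simLIN(c1, c2) = 2 * log P(c0) / (log P(c1) + log P(c2)),
    // equivalent to 2 * ic(LCS) / (ic(c1) + ic(c2)) since ic(c) = -log P(c).
    static double linSimilarity(double icC1, double icC2, double icLcs) {
        return (2.0 * icLcs) / (icC1 + icC2);
    }

    public static void main(String[] args) {
        // Hypothetical IC values for two concepts and their lowest common subsumer.
        System.out.println(jcnDistance(6.1, 5.8, 3.2));   // 5.5
        System.out.println(linSimilarity(6.1, 5.8, 3.2)); // ~0.538
    }
}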
2.11.4 Intrinsic IC Measure in WordNet
Seco et al. [85] advocate that WordNet can also be used as a statistical resource, with no need for external corpora, to compute the information content of a concept.
They assume that the taxonomic structure of WordNet is organized in a meaningful and principled way, where concepts with many hyponyms3 convey less information than concepts that are leaves. They provide the formula for information content as follows:
icWN(c) = 1 − log( hypo(c) + 1 ) / log( maxwn )
In this formula, the function hypo returns the number of hyponyms of a given concept and maxwn is the maximum number of concepts that exist in the taxonomy.
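The formula translates directly into code. In the sketch below the hyponym count and the taxonomy size are passed in as plain numbers rather than read from WordNet, and the value used for maxwn is only a hypothetical example.

public class IntrinsicIC {
    // ic_WN(c) = 1 - log(hypo(c) + 1) / log(maxwn)
    static double icWordNet(long hypoCount, long maxWn) {
        return 1.0 - Math.log(hypoCount + 1) / Math.log(maxWn);
    }

    public static void main(String[] args) {
        long maxWn = 82_115;                          // hypothetical number of concepts in the taxonomy
        System.out.println(icWordNet(0, maxWn));      // a leaf concept: IC = 1.0
        System.out.println(icWordNet(5_000, maxWn));  // a fairly abstract concept: lower IC
    }
}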
2.11.5 Leacock-Chodorow Similarity Measure
Rada et al. [77] were the first to measure semantic relatedness based on the length of the path between two concepts in a taxonomy. They measured the semantic relatedness of medical terms using a medical taxonomy called MeSH. According to this measurement, given the tree-like structure of a taxonomy, the number of links between two concepts is counted, and the concepts are considered more related if the path between them is shorter. Leacock and Chodorow [56] applied this approach to measure the semantic relatedness of two concepts using WordNet. The measure takes the shortest path between two concepts in the taxonomy and scales it by the depth of the taxonomy:
relatedLCH(c1, c2) = − log( shortestpath(c1, c2) / (2 ∗ D) )
In the formula, c1 and c2 represent the two concepts and D is the maximum depth of the taxonomy.4 One weakness of the measure is that it assumes the size, or weight, of every link to be equal. However, concept pairs that are a single link apart lower down in the hierarchy are more closely related
3 hyponym: a word that is more specific than a given word.
4 For WordNet 1.7.1, the value of D is 19.
than such pairs higher up in the hierarchy. Another limitation of the measure is that it restricts its attention to is-a links, and only noun hierarchies are considered.
2.11.6 Hirst-St.Onge Similarity Measure
Hirst and St.Onge's [51] measure of semantic relatedness is based on the idea that two concepts are semantically close if their WordNet synsets are connected by a path that is not too long and that does not change direction too often [15, 72]. The Hirst-St.Onge measure considers all the relations defined in WordNet. All links in WordNet are classified as Upward (e.g., part-of), Downward (e.g., subclass) or Horizontal (e.g., opposite-meaning). They also describe three types of relations between words: extra-strong, strong, and medium-strong. The strength of the relationship is given by:
relHS(c1, c2) = C − pathlength − k ∗ d

where d is the number of changes of direction in the path, and C and k are constants. If no such path exists, the strength of the relationship is zero and the concepts are considered unrelated.
2.11.7 Wu and Palmer Similarity Measure
The Wu and Palmer measure [98] computes the similarity in terms of the depths of the two concepts in the WordNet taxonomy and the depth of their lowest common subsumer (LCS):

simWUP(c1, c2) = 2 ∗ depth(LCS) / ( depth(c1) + depth(c2) )
2.11.8 Lesk Similarity Measure
Lesk [58] defines relatedness as a function of dictionary definition overlaps of concepts. He describes an algorithm that disambiguates words based on the extent of overlaps of their dictionary definitions with those of words in the context. The sense of the target word with the maximum overlaps is selected as the assigned sense of the word.
Table 2-2. Absolute values of the coefficients of correlation between human ratings of similarity and the five computational measures.
Measure                 Miller & Charles   Rubenstein & Goodenough
Hirst and St-Onge       .744               .786
Jiang and Conrath       .850               .781
Leacock and Chodorow    .816               .838
Lin                     .829               .819
Resnik                  .774               .779
2.11.9 Extended Gloss Overlaps Similarity Measure
Banerjee and Pedersen [9, 72] provided a measure by adapting Lesk's measure to WordNet. Their measure is called 'the extended gloss overlaps measure' and takes into account not only the two concepts being measured but also the concepts related to them through WordNet relations. An extended gloss of a concept c1 is prepared by adding the glosses of concepts that are related to c1 through a WordNet relation r. The measure for two concepts c1 and c2 is then based on the overlaps of the extended glosses of the two concepts.
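A highly simplified sketch of the gloss-overlap idea follows: the gloss of a concept is pooled with the glosses of its related concepts, and the score is the number of distinct words the two pools share. The real measure scores phrasal (multi-word) overlaps differently; the glosses below are invented and the code is only meant to convey the mechanics.

import java.util.*;

public class GlossOverlap {
    // Pool the gloss of a concept with the glosses of its related concepts (the extended gloss),
    // then count the distinct words the two extended glosses share.
    static int overlapScore(List<String> extendedGloss1, List<String> extendedGloss2) {
        Set<String> words1 = new HashSet<>();
        for (String gloss : extendedGloss1) {
            words1.addAll(Arrays.asList(gloss.toLowerCase().split("\\W+")));
        }
        int score = 0;
        Set<String> seen = new HashSet<>();
        for (String gloss : extendedGloss2) {
            for (String w : gloss.toLowerCase().split("\\W+")) {
                if (words1.contains(w) && seen.add(w)) {
                    score++;
                }
            }
        }
        return score;
    }

    public static void main(String[] args) {
        List<String> g1 = List.of("a motor vehicle with four wheels", "a wheeled vehicle");
        List<String> g2 = List.of("a vehicle with two wheels", "a frame with wheels");
        System.out.println(overlapScore(g1, g2)); // counts shared words such as "vehicle" and "wheels"
    }
}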
2.12 Evaluation of WordNet-Based Similarity Measures
Budanitsky and Hirst [16] evaluated six different metrics using WordNet and listed the coefficients of correlation between the metrics and human ratings according to the experiments conducted by Miller & Charles [65] and Rubenstein & Goodenough [82]. We present the results of Budanitsky & Hirst's experiments in Table 2-2. According to this evaluation, the Jiang and Conrath metric [52] as well as the Lin metric [60] are among the best measures. As a result, we use the Jiang and Conrath measure as well as the Lin measure to assign similarity scores between text strings.
2.13 Similarity Measures for Text Data
Several approaches have been used to assign a similarity score to a pair of texts. One of the simplest methods is to assign a similarity score based on the number of lexical units that occur in both text segments. Several techniques, such as stemming, stop-word removal, longest-subsequence matching, and weighting factors, can be applied to this method
for improvement. However, these lexical matching methods are not enough to identify the semantic similarity of texts. One of the attempts to identify semantic similarity between texts is the latent semantic analysis (LSA) method5 [55], which aims to measure similarity between texts by including additional related words. LSA is successful to some extent but has not been used on a large scale due to the complexity and computational cost of its algorithm. Corley and Mihalcea [21] introduced a metric for text-to-text semantic similarity by combining word-to-word similarity metrics. To assess a similarity score for a text pair, they first create separate sets of nouns, verbs, adjectives, adverbs, and cardinals for each text. Then they determine pairs of similar words across the sets in the two text segments. For nouns and verbs, they use a semantic similarity metric based on WordNet, and for other word classes they use lexical matching techniques. Finally, they sum up the similarity scores of the similar word pairs. This bag-of-words approach improves significantly over traditional lexical matching metrics. However, as they acknowledge, a metric of text semantic similarity should take into account the relations between words in a text. In another approach to measuring semantic similarity between documents, Aslam and Frost [6] assume that a text is composed of a set of independent term features and employ Lin's [60] metric for measuring the similarity of objects that can be described by a set of independent features. The similarity of two documents a and b in a collection of documents can be calculated by the following formula:

SimIT(a, b) = 2 ∗ Σt min(Pa:t, Pb:t) log P(t) / ( Σt (Pa:t) log P(t) + Σt (Pb:t) log P(t) )

where the probability P(t) is the fraction of corpus documents containing term t, Pb:t is the fractional occurrence of term t in document b (Σt Pb:t = 1), and two documents a
5 URL of LSA: http://lsa.colorado.edu/
and b share min(Pa:t, Pb:t) amount of term t in common, while they contain Pa:t and Pb:t amount of term t individually. Another approach, by Oleshchuk and Pedersen [70], uses ontologies as a filter before assigning similarity scores to texts. They interpret a text based on an ontology and find out how many of the terms (concepts) of the ontology exist in the text. They assign a similarity score for texts t1 and t2 after comparing the ontology o1 extracted from t1 based on an ontology O with the ontology o2 extracted from t2 based on the same ontology O. The base ontology acts as a context filter for the texts; depending on the base ontology used, the texts may or may not be similar.
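Returning to Aslam and Frost's SimIT formula above, the sketch below evaluates it for two small term distributions. The distributions and the corpus probabilities are invented for illustration; they are not taken from any real collection.

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class InformationTheoreticDocSimilarity {
    // SimIT(a, b) = 2 * sum_t min(Pa:t, Pb:t) * log P(t)
    //               / ( sum_t Pa:t * log P(t) + sum_t Pb:t * log P(t) )
    static double simIT(Map<String, Double> pa, Map<String, Double> pb, Map<String, Double> corpusP) {
        double shared = 0.0, selfA = 0.0, selfB = 0.0;
        Set<String> terms = new HashSet<>(pa.keySet());
        terms.addAll(pb.keySet());
        for (String t : terms) {
            double logP = Math.log(corpusP.getOrDefault(t, 1.0)); // unseen terms contribute nothing
            double a = pa.getOrDefault(t, 0.0);
            double b = pb.getOrDefault(t, 0.0);
            shared += Math.min(a, b) * logP;
            selfA += a * logP;
            selfB += b * logP;
        }
        return 2.0 * shared / (selfA + selfB);
    }

    public static void main(String[] args) {
        // Hypothetical fractional term occurrences (each distribution sums to 1).
        Map<String, Double> docA = Map.of("database", 0.5, "schema", 0.3, "query", 0.2);
        Map<String, Double> docB = Map.of("database", 0.4, "ontology", 0.4, "query", 0.2);
        // Hypothetical fraction of corpus documents containing each term.
        Map<String, Double> corpusP = Map.of("database", 0.2, "schema", 0.1,
                                             "query", 0.3, "ontology", 0.05);
        System.out.println(simIT(docA, docB, corpusP));
    }
}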
2.14 Similarity Measures for Ontologies
Rodriguez and Egenhofer [81] suggested assessing semantic similarity among entity classes from different ontologies through a matching process that uses information about the common and distinguishing characteristic features of the ontologies as given in their specifications. The similarity score of two entities from different ontologies is the weighted sum of the similarity scores of the components of the compared entities. Similarity scores are independently measured for three components of an entity. These components are the 'set of synonyms', 'set of semantic relations', and 'set of distinguishing features' of the entity. They further suggest classifying the distinguishing features into 'functions', 'parts', and 'attributes', where 'functions' represent what is done to or with an instance of that entity, 'parts' are structural elements of an entity such as the leg or head of a human body, and 'attributes' are additional characteristics of an entity such as the age or hair color of a person. Rodriguez and Egenhofer point out that if compared entities are related to the same entities, they may be semantically similar. Thus, they interpret comparing semantic
relations as comparing the semantic neighborhoods of entities.6 The formula of the overall similarity between entity a of ontology p and entity b of ontology q is as follows:
S(ap, bq) = ww ∗ Sw(ap, bq) + wu ∗ Su(ap, bq) + wn ∗ Sn(ap, bq)

where Sw, Su, and Sn are the similarities between synonym sets, features, and semantic neighborhoods, and ww, wu, and wn are the respective weights, which add up to 1.0. While calculating a similarity score for each component of an entity, they also take non-common characteristics into account. The similarity of a component is measured by the following formula:
S(a, b) = |A ∩ B| / ( |A ∩ B| + α(a, b) ∗ |A/B| + (1 − α(a, b)) ∗ |B/A| )
where α is a function that defines the relative importance of the non-common characteristics. They calculate α in terms of the depth of the entities in their ontologies. Maedche and Staab [63] suggest measuring the similarity of ontologies on two levels: lexical and conceptual. On the lexical level, they use an edit-distance measure to find the similarity between the two sets of terms (concepts or relations) that form the ontologies. When measuring similarity on the conceptual level, they take all super- and sub-concepts of the two concepts from the two different ontologies into account. According to Ehrig et al. [31], comparing ontologies should go far beyond comparing the representations of the entities of the ontologies and should take their relation to real-world entities into account. For this, Ehrig et al. suggested a general framework for measuring the similarity of ontologies which consists of four layers: the data, ontology, context, and domain knowledge layers. In the data layer, they compare data values by
6 The semantic neighborhood of an entity class is the set of entity classes whose distance to the entity class is less than or equal to a non-negative integer.
using generic similarity functions such as edit distance for strings. In the ontology layer, they consider semantic relations by using the graph structure of the ontology. In the context layer, they compare the usage patterns of entities in ontology-based applications; according to Ehrig et al., if two entities are used in the same (related) context then they are similar. They also propose integrating the domain knowledge layer into any of the three layers as needed. Finally, they arrive at an overall similarity function which incorporates all layers of similarity. Euzenat and Valtchev [34, 35] proposed a similarity measure for OWL-Lite ontologies. Before measuring similarity, they first transform the OWL-Lite ontologies into an OL-graph structure. Then, they define similarity between nodes of the OL-graphs depending on the category and the features (e.g., relations) of the nodes. They combine the similarities of features by a weighted-sum approach. Similar work by Bach and Dieng-Kuntz [8] proposes a measure for comparing OWL-DL ontologies. Different from Euzenat and Valtchev's work, Bach and Dieng-Kuntz adjust the manually assigned feature weights of an OWL-DL entity dynamically in case they do not exist in the definition of the entity.
2.15 Evaluation Methods for Similarity Measures
There are three kinds of approaches for evaluating similarity measures [15]: evaluation by theoretical examination (e.g., Lin [60]), evaluation by comparison with human judgments, and evaluation of the performance within a particular application. Evaluation by comparison with human judgments has been used by many researchers such as Resnik [79] and Jiang and Conrath [52]. Because of the expense and difficulty of arranging such an experiment, most researchers refer to the same human-judgment experiment to evaluate their performance. This experiment was conducted by Rubenstein and Goodenough [82], and a later replication of it was done by Miller and Charles [65]. Rubenstein and Goodenough had human subjects assign degrees of synonymy, on a scale from 0 to 4, to 65 pairs of carefully chosen words. Miller and Charles
repeated the experiment on a subset of 30 word pairs of the 65 pairs used by Rubenstein and Goodenough. Rubenstein and Goodenough used 15 subjects for scoring the word pairs, and the average of these scores was reported. Miller and Charles used 38 subjects in their experiments. Rodriguez and Egenhofer also used human judgments to evaluate the quality of their similarity measure for comparing different ontologies [81]. They used the Spatial Data Transfer Standard (SDTS) ontology, the WordNet ontology, the WS ontology (created from the combination of WordNet and SDTS), and subsets of these ontologies. They conducted two experiments. In the first experiment, they compared different combinations of ontologies to obtain a diverse range of similarity between ontologies. These combinations include identical ontologies (WordNet to WordNet), ontology and sub-ontology (WordNet to WordNet's subset), overlapping ontologies (WordNet to WS), and different ontologies (WordNet to SDTS). In the second experiment, they asked human subjects to rank the similarity of an entity to other selected entities based on the definitions in the WS ontology. Then, they compared the average of the human rankings with the rankings based on their similarity measure using different combinations of ontologies. Evaluating the performance within a particular application is another approach for the evaluation of similarity metrics. Budanitsky and Hirst [15] used this approach to evaluate the performance of their metric within an NLP application, the detection of malapropisms.7 Patwardhan [72] also used this approach to evaluate his metric within a word sense disambiguation8 application.
7 Malapropism: the unintentional misuse of a word by confusion with one that sounds similar.
8 Word sense disambiguation: the problem of selecting the most appropriate meaning or sense of a word, based on the context in which it occurs.
2.16 Schema Matching
Schema matching is the process of producing a mapping between elements of two schemas that correspond to each other [78]. When we match two schemas S and T, we decide whether any element or elements of S refer to the same real-world concept as any element or elements of T [28]. The match operation over two schemas produces a mapping. A mapping is a set of mapping elements. Each mapping element indicates that certain element(s) in S are mapped to certain element(s) in T. A mapping element can have a mapping expression which specifies how the schema elements are related. A mapping element can be defined as a 5-tuple: (id, e, e', n, R), where id is a unique identifier, e and e' are schema elements of the matching schemas, n is a confidence measure (usually in the [0,1] range) between the schema elements e and e', and R is a relation (e.g., equivalence, mismatch, overlapping) [88]. Schema matching has many application areas, such as data integration, data warehousing, semantic query processing, agent communication, web services integration, catalog matching, and P2P databases [78, 88]. The match operation is still mostly done manually. Manually generating the mapping is a tedious, time-consuming, error-prone, and expensive process, so there is a need to automate the match operation. This would be possible if we could discover the semantics of schemas, make the implicit semantics explicit, and represent them in a machine-processable way.
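The 5-tuple above maps naturally onto a small data type. The rendering below is only a sketch of one possible representation and is not taken from any existing matching system; the element names used in main are invented.

public class MappingExample {
    // The kinds of relations mentioned above; the set is open-ended in practice.
    enum Relation { EQUIVALENCE, MISMATCH, OVERLAPPING }

    // A mapping element: (id, e, e', n, R)
    record MappingElement(String id,
                          String sourceElement,   // element e of schema S
                          String targetElement,   // element e' of schema T
                          double confidence,      // n, usually in [0, 1]
                          Relation relation) { }

    public static void main(String[] args) {
        MappingElement m = new MappingElement(
                "m1", "CourseInst.Name", "Instructor.FullName", 0.87, Relation.EQUIVALENCE);
        System.out.println(m);
    }
}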
2.16.1 Schema Matching Surveys
Schema matching is a very well-researched topic in the database community. Erhard Rahm and Philip Bernstein provide an excellent survey of schema matching approaches by reviewing previous work in the context of schema translation and integration, knowledge representation, machine learning, and information retrieval [78]. In their survey, they clarify terms such as match operation, mapping, mapping element, and mapping expression in the context of schema matching. They also introduce application areas of schema matching such as schema integration, data warehouses, message translation, and query processing.
The most significant contribution of their survey is the classification of schema matching approaches, which aids understanding of the schema matching problem. They consider a wide range of classification criteria such as instance-level vs. schema-level matching, element vs. structure matching, linguistic vs. constraint-based matching, matching cardinality, use of auxiliary data (e.g., dictionaries, previous mappings), and ways of combining different matchers (e.g., hybrid, composite). However, it is very rare that one approach falls under only one leaf of the classification tree presented in that survey. A schema matching approach needs to exploit all the possible inputs to achieve the best possible result, and it needs to combine matchers either in a hybrid way or in a composite way. For this reason, most approaches use more than one technique and fall under more than one leaf of the classification tree. For example, our approach uses auxiliary data (i.e., application source code) as well as linguistic similarity techniques (e.g., on names and descriptions) and constraint-based techniques (e.g., on the type of the related schema element). A recent survey by Anhai Doan and Alon Halevy [28] classifies matching techniques under two main groups: rule-based and learning-based solutions. Our approach falls under the rule-based group, which is relatively inexpensive and does not require training. Doan and Halevy also describe the challenges of schema matching. They point out that, since data sources become legacy (poorly documented), schema elements are typically matched based on the schema and the data. However, the clues gathered by processing the schema and data are often unreliable, incomplete, and not sufficient to determine the relationships among schema elements. Our approach aims to overcome this fundamental challenge by analyzing reports for more reliable, complete, and sufficient clues. Doan and Halevy also state that schema matching becomes more challenging because matching approaches must consider all the possible matching combinations between schemas to make sure there is no better mapping. Considering all the possible combinations increases the cost of the matching process. Our approach
helps us overcome this challenge by focusing on the subset of schema elements that are used in a report pair. Another challenge that Doan and Halevy state is the subjectivity of the matching: the mapping depends on the application and may change across applications even though the underlying schemas are the same. By analyzing report-generating application source code, we believe we produce more objective results. Doan and Halevy's survey also adds two application areas of schema matching, peer data management and model management, to those mentioned in Rahm and Bernstein's survey. A more recent survey by Pavel Shvaiko and Jérôme Euzenat [88] points out new application areas of schema matching such as agent communication, web service integration, and catalog matching. In their survey, Shvaiko and Euzenat consider only schema-based approaches, not instance-based ones, and provide a new classification by building on the previous work of Rahm and Bernstein. They interpret Rahm and Bernstein's classification and provide two new classification trees, based on granularity and on kinds of input, with nodes added to the original classification tree. Finally, Hong-Hai Do summarizes recent advances in the field in his dissertation [25].
2.16.2 Evaluations of Schema Matching Approaches
Approaches to the schema matching problem evaluate their systems using a variety of methodologies, metrics, and data sets which are usually not publicly available. This makes it hard to compare the approaches. However, there has been work to benchmark the effectiveness of a set of schema matching approaches [26, 99]. Hong Hai Do et al. [26] specify four comparison criteria: kind of input (e.g., schema information, data instances, dictionaries, and mapping rules), match results (e.g., matching between schema elements, nodes, or paths), quality measures (metrics such as recall, precision, and f-measure), and effort (e.g., pre- and post-match
efforts for training of learners, dictionary preparation, and correction). Mikalai Yatskevich [99] compares the approaches based on the criteria stated in [26] and adds time measures as a fifth criterion. Hong Hai Do et al. only use the information available in the publications describing the approaches and their evaluations. In contrast, Yatskevich provides real-time evaluations of matching prototypes rather than reviewing the results presented in the papers. Yatskevich compares only three approaches (COMA [24], Cupid [62], and Similarity Flooding (SF) [86]) and concludes that COMA performs best on large schemas and Cupid is best for small schemas. Hong Hai Do et al. provide a broader comparison by reviewing six approaches (Automatch [10], COMA [24], Cupid [62], LSD [27], Similarity Flooding (SF) [86], and SemInt).
2.16.3 Examples of Schema Matching Approaches
In the rest of this section, we review some of the significant approaches to schema matching and describe their similarities to, and differences from, our approach. We review the LSD, Corpus-based, COMA, and Cupid approaches below. The LSD (Learning Source Descriptions) approach [27] uses machine-learning techniques to match data sources to a global schema. The idea of LSD is that, after a training period in which mappings between data sources and the global schema are determined manually, the system should learn from the previous mappings and successfully propose mappings for new data sources. The LSD system is a composite matcher; that is, it combines the results of several independently executed matchers. LSD consists of several learners (matchers). Each learner can exploit different characteristics of the input data, such as name similarities, formats, and frequencies. The predictions of the different learners are then combined. The LSD system is extensible since it has independently working learners (matchers); when new learners are developed, they can be added to the system to enhance accuracy. The extensibility of the LSD system is similar to the extensibility of our system because we can also add new visitor patterns to our system to
extract more information and enhance accuracy. The LSD approach is also similar to ours in that it comes to a final decision by combining several results from different learners; we likewise combine several results that come from matching the ontologies of report pairs to reach a final decision. However, LSD is a learning-based solution and requires training, which makes it relatively expensive because of the initial manual effort, whereas our approach needs no initial effort other than collecting the relevant report-generating source code. One of the distinctive approaches that uses external evidence is the Corpus-based Schema Matching approach [43, 61]. Our approach is similar to Corpus-based Schema Matching in the sense that we also utilize external data rather than depending solely on the matching schemas and their data. The Corpus-based schema matching approach constructs a knowledge base by gathering relevant knowledge from a large corpus of database schemas and previously validated mappings. The approach identifies interesting concepts and patterns in the corpus of schemas and uses this information to match two unseen schemas. However, learning from the corpus and extracting patterns is a challenging task. This approach also requires initial effort to create a corpus of interest and then tuning effort to eliminate useless schemas and to add useful ones. The COMA (COmbination of MAtching algorithms) approach [24] is a composite schema matching approach. It develops a platform to combine multiple matchers in a flexible way, providing an extensible library of matching algorithms and a framework to combine the obtained results. The COMA approach has been superior to other systems in evaluations [26, 99]. The COMA++ approach [7] improves on COMA by supporting schemas and ontologies written in different languages (i.e., SQL, W3C XSD, and OWL) and by introducing new match strategies such as fragment-based matching and reuse-oriented matching. The fragment-based approach follows the divide-and-conquer idea and decomposes a large schema into smaller subsets, aiming to achieve better match quality and execution time on the reduced problem size, and then merges the results of
matching the fragments into a global match result. Our approach also considers matching small subsets of a schema that are covered by reports and then merging these match results into a global match result, as described in Chapter 3. The Cupid approach [62] combines linguistic and structural matchers in a hybrid way. It is both element- and structure-based, and it also uses dictionaries as auxiliary data. It aims to provide a generic solution across data models and uses XML and relational examples. The structural matcher of Cupid transforms the input into a tree structure and assesses a similarity value for a node based on the node's linguistic similarity value and its leaves' similarity values.
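Composite matchers such as LSD and COMA combine the scores of independently executed matchers. The sketch below shows one common aggregation strategy, a weighted average, with invented matchers and weights; it is not the combination method of any particular system.

import java.util.List;

public class CompositeMatcher {
    // An individual matcher assigns a similarity score in [0, 1] to a pair of schema elements.
    interface Matcher {
        double score(String sourceElement, String targetElement);
    }

    // Combine matcher scores by a weighted average (one of several possible aggregation strategies).
    static double combinedScore(String s, String t, List<Matcher> matchers, List<Double> weights) {
        double sum = 0.0, weightSum = 0.0;
        for (int i = 0; i < matchers.size(); i++) {
            sum += weights.get(i) * matchers.get(i).score(s, t);
            weightSum += weights.get(i);
        }
        return sum / weightSum;
    }

    public static void main(String[] args) {
        // Two toy matchers: exact-name equality and a crude prefix heuristic.
        Matcher nameMatcher = (s, t) -> s.equalsIgnoreCase(t) ? 1.0 : 0.0;
        Matcher prefixMatcher = (s, t) -> s.regionMatches(true, 0, t, 0, 3) ? 0.8 : 0.0;
        double score = combinedScore("InstName", "instName",
                List.of(nameMatcher, prefixMatcher), List.of(0.6, 0.4));
        System.out.println(score); // 0.6 * 1.0 + 0.4 * 0.8 = 0.92
    }
}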
2.17 Ontology Mapping
Ontology mapping is the task of determining which concepts and properties of two ontologies represent similar notions [68]. Several other terms are relevant to ontology mapping and are sometimes used interchangeably with the term mapping: alignment, merging, articulation, fusion, and integration [54]. The result of ontology mapping is used in application domains similar to those of schema matching, such as data transformation, query answering, and web services integration [68].
2.18 Schema Matching vs. Ontology Mapping
Schema matching and ontology mapping are similar problems [29]. However, ontology mapping generally aims to match richer structures: ontologies typically have more constraints on their concepts and more relations among those concepts. Another difference is that a schema often does not provide explicit semantics for its data, while an ontology is a system that itself contains semantics, either intuitively or formally [88]. The database community deals with the schema matching problem and the AI community deals with the ontology mapping problem. Our work can perhaps help fill the gap between these similar but separately studied subjects.
CHAPTER 3
APPROACH
In Chapter 1, we stated the need for rapid, flexible, limited-time collaborations among organizations. We also underlined that organizations need to integrate their information sources to exchange data in order to collaborate effectively. However, integrating information sources is currently a labor-intensive activity because of nonexistent or outdated machine-processable documentation of the data source. We defined legacy systems as information systems with poor or nonexistent documentation in Section 2.1. Integrating legacy systems is tedious, time-consuming, and expensive because the process is mostly manual. To automate the process, we need to develop methodologies to automatically discover semantics from the electronically available information sources of the underlying legacy systems. In this chapter, we state our approach for extracting semantics from legacy systems and for using these semantics in the schema matching process of information source integration. We developed our approach in the context of the SEEK (Scalable Extraction of Enterprise Knowledge) project. As we show in Figure 3-1, the Semantic Analyzer (SA) takes as input the output of the Schema Extractor (SE), i.e., the schema of the data source, and the application source code or report templates. After the semantic analysis process, SA stores its output, the extracted semantic information, in a repository which we call the knowledgebase of the organization. The Schema Matcher (SM) then uses this knowledgebase as input and produces mapping rules as output. Finally, these mapping rules are the input to the Wrapper Generator (WG), which produces source wrappers. In Section 3.1, we first state our approach to semantic extraction using SA. Then, in Section 3.2, we show how we utilize the semantics discovered by SA in the subsequent schema matching phase. The schema matching phase is followed by the wrapper generation phase, which is not described in this dissertation.
Figure 3-1. Scalable Extraction of Enterprise Knowledge (SEEK) Architecture.
3.1 Semantic Analysis
Our approach to semantic analysis is based on the observation that application source code can be a rich source of semantic information about the data source it is accessing. Specifically, semantic knowledge extracted from application source code frequently contains information about the domain-specific meanings of the data or of the underlying schema elements. For example, application code usually has embedded queries, and the data retrieved or manipulated by the queries is stored in variables and displayed to the end user in output statements. Many of these output
statements contain additional semantic information, usually in the form of descriptive text or markup [36, 84, 87]. These output statements become semantically valuable when they are used to communicate with the end user in a formatted way. One way of communicating with the end user is producing reports. Reports and other user-oriented output, which are typically generated by report generators or application source code, do not use the names of schema elements directly but rather provide more descriptive names for the data to make the output more comprehensible to the users. We claim that these descriptive names, together with their formatting instructions, can be extracted from the application code generating the report and can be related to the underlying schema elements in the data source. We can trace the variables used in output statements throughout the application code and relate the output with the query that retrieves data from the data source, and indirectly with the schema elements. Such descriptive text and formatting instructions are valuable information that helps us discover the semantics of the schema elements. In the next subsection, we explain this idea using illustrative examples.
3.1.1 Illustrative Examples
In this section, we illustrate our idea of semantic extraction with two simple examples. On the left hand side of Figure 3-2, we see a relation and its attributes from a relational database schema. By looking at the names of the relation and its attributes, it is hard to understand what kind of information this relation and its attributes store. For example, this relation could be used for storing information about 'courses' or 'instructors'. The attribute Name could hold information about 'course names' or 'instructor names'. Without any further knowledge of the schema, we would probably not be able to understand the full meaning of these schema items in the relation 'CourseInst'. However, we can gather information about the semantics of these schema items by analyzing the application source code that uses them.
Figure 3-2. Schema used by an application.

Let us assume we have access to the application source code that outputs the search screen shown on the right hand side of Figure 3-2. Upon investigation of the code, the semantic analyzer (SA) encounters output statements of the form 'Instructor Name' and 'Course Code'. SA also encounters input statements that expect input from the user next to the output texts. Using program understanding techniques, SA finds out that the inputs are used with certain schema elements in a where-clause to form a query that returns the desired tuples from the database. SA first relates the output statements containing descriptive text (e.g., 'Instructor Name') with the input statements located next to the output statements on the search screen shown in Figure 3-2. SA then traces the input statements back to the where-clause and finds their corresponding schema elements in the database. Hence, SA relates the descriptive text with the schema elements. For example, if SA relates the output statement 'Instructor Name' to the 'Name' schema element of relation 'CourseInst', then we can conclude that the 'Name' schema element of the relation 'CourseInst' stores information about instructor names. Let us look at another example. Figure 3-3 shows a report R1 using the schema elements from the schema S1. Let us assume that we have access to the application source code that generates the report shown in Figure 3-3. The schema element names in S1 are non-descriptive. However, our semantic analyzer can gather valuable semantic information by analyzing the source code. SA first traces the descriptive column header texts back to the schema elements that fill in the data of that column. Then, SA relates the descriptive
column header texts with the schema elements (shown as red arrows in Figure 3-3). After that, we can draw conclusions about the semantics of the schema elements. For example, we can conclude that the Name schema element of the relation CourseInst stores information about instructors.

Figure 3-3. Schema used by a report.
3.1.2 Conceptual Architecture of Semantic Analyzer
SA is embedded in the Data Reverse Engineering (DRE) module of the SEEK prototype together with the Schema Extractor (SE) component. As Figure 3-4 illustrates, the SE component in the DRE connects to the data source with a call-level interface (e.g., JDBC) and extracts the schema of the data source. The SA component enhances this schema with pieces of evidence about the semantics of the schema elements found in the application source code or in the report design templates.
3.1.2.1 Abstract syntax tree generator (ASTG)
We show the components of Semantic Analyzer (SA) in Figure 3-5. The Abstract Syntax Tree Generator (ASTG) accepts application source code to be analyzed, parses
it and produces the abstract syntax tree of the source code. An Abstract Syntax Tree (AST) is an alternative representation of the source code that allows more efficient processing. Currently, the ASTG is configured to parse application source code written in Java. The ASTG can also parse SQL statements embedded in the Java source code and HTML code extracted from Java Servlet source code. However, we aim to parse and extract semantic information from source code written in any programming language. To reach this aim, we use a state-of-the-art parser generation tool, JavaCC, to build the ASTG. We explain how we build the ASTG so that it is easily extensible to other programming languages in Section 3.1.3.

Figure 3-4. Conceptual view of the Data Reverse Engineering (DRE) module of the Scalable Extraction of Enterprise Knowledge (SEEK) prototype.
Figure 3-5. Conceptual view of Semantic Analyzer (SA) component.
3.1.2.2 Report template parser (RTP)
We also extract semantic information from another electronically available information source, namely report design templates. A report design template includes information about the design of a report and is typically represented in XML. When a report generation tool, such as Eclipse BIRT or JasperReports, runs a report design template, it retrieves data from the data source and presents it to the end user according to the specification in the report design template. When report design templates are parsed, valuable semantic information about the schema elements can be gathered from them. The Report Template Parser (RTP) component of SA is used to parse report design templates. Our current semantic analyzer is configured to parse report templates designed with Eclipse BIRT.1 We show an example of a report design template in Figure 3-6 and the report that results when this template is run in Figure 3-7.
Figure 3-6. Report design template example.
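Since a report design template is an XML document, the RTP can be built on a standard XML parser. The sketch below uses the Java DOM API; the file name, the element names ('label', 'column-binding'), and the attribute name ('expression') are placeholders chosen for illustration and do not follow the actual BIRT report design schema.

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class ReportTemplateSketch {
    public static void main(String[] args) throws Exception {
        // Parse a report design template (an XML file) with the standard DOM API.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse("course_report.rptdesign");   // hypothetical file name

        // Collect the static label texts (descriptive text shown to the user).
        NodeList labels = doc.getElementsByTagName("label");   // placeholder element name
        for (int i = 0; i < labels.getLength(); i++) {
            System.out.println("description text: " + labels.item(i).getTextContent().trim());
        }

        // Collect the data bindings that tie report columns to schema elements.
        NodeList bindings = doc.getElementsByTagName("column-binding");   // placeholder element name
        for (int i = 0; i < bindings.getLength(); i++) {
            Element b = (Element) bindings.item(i);
            System.out.println("data element bound to: " + b.getAttribute("expression"));
        }
    }
}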
3.1.2.3 Information extractor (IEx)
The outputs of ASTG and RTP are the inputs for the Information Extractor (IEx) component of SA. The IEx, shown in Figure 3-5, is the component where we apply several heuristics to relate descriptive text in application source code with the schema elements in
1 http://www.eclipse.org/birt/
the database by using program understanding techniques. Specifically, the IEx first identifies the output statements. Then, it identifies the texts in the output statements and the variables related to these output texts. The IEx relates the output texts with the variables with the help of several heuristics described in Section 3.1.5. The IEx then traces the variables related to the output texts back to the schema elements from which they retrieve data.

Figure 3-7. Report generated when the above template was run.
Figure 3-8. Java Servlet-generated HTML report showing course listings of CALTECH.

The IEx can extract information from Java application source code that communicates with the user through the console. The IEx can also extract information from Java Servlet
application source code. A Servlet is a Java application that runs on a Web server and responds to client requests by generating HTML pages dynamically. A Servlet generates an HTML page through the output statements embedded inside the Java code. When the IEx analyzes a Java Servlet, it identifies the output statements that output HTML code. It also identifies the schema elements from which the data on the HTML page is retrieved. As an intermediate step, the IEx produces the HTML page that the Servlet would produce, with the schema element names in place of the data. An example of the output HTML page generated by the IEx after analyzing a Java Servlet is shown in Figure 3-9. The Java Servlet output that was analyzed by the IEx is shown in Figure 3-8. This example is taken from the THALIA integration benchmark and shows the course offerings of the Computer Science department of the California Institute of Technology (CALTECH). The reader will notice that the data on the report in Figure 3-8 is replaced in Figure 3-9 with the names of the schema elements from which the data is retrieved. Next, the IEx analyzes this annotated HTML page shown in Figure 3-9 and extracts semantic information from it.
Figure 3-9. Annotated HTML page generated by analyzing a Java Servlet.

The IEx has been implemented using visitor design pattern classes. We explain the benefits of using visitor design patterns in Section 3.1.3. The IEx applies several program understanding techniques such as program slicing, data flow analysis, and call graph
analysis [49] in its visitor design pattern classes. We describe these techniques in Section 3.1.4. The IEx also extracts semantic information from report design templates; it uses heuristics seven through eleven, described in Section 3.1.5, while analyzing them. Extracting information from report design templates is easier than extracting information from application source code because report design templates are represented in XML and are more structured.
3.1.2.4 Report ontology writer (ROW)
The Report Ontology Writer (ROW) component of SA writes the gathered semantic information into report ontology instances represented in OWL. We explain the design details of the report ontology in Section 3.2.3. These report ontology instances form the knowledgebase of the data source being analyzed.
3.1.3 Extensibility and Flexibility of Semantic Analyzer
Our current semantic analyzer is configured to extract information from application source code written in Java. We chose the Java programming language because it is one of the dominant programming languages in enterprise information systems. However, we want our semantic analyzer to be able to process source code written in any programming language to extract semantic information about the data of the legacy system. For this reason, we need to develop our semantic analyzer in a way that is easily extensible to other programming languages. To reach this aim, we leverage state-of-the-art techniques and recent research on code reverse engineering, abstract syntax tree generation, and object-oriented programming to develop a novel approach for semantic extraction from source code. We describe our extensible semantic analysis approach in detail in this section. To analyze application source code, we need a parser for the grammar of the programming language of the source code. This parser is used to generate the Abstract Syntax Tree (AST) of the source code. An AST is a type of representation of source code that
facilitates the use of tree traversal algorithms. For programmers, writing a parser for the grammar of a programming language has always been a complex, time-consuming, and error-prone task. Writing a parser becomes more complex as the number of production rules of the grammar increases. It is not easy to write a robust parser for Java, which has many production rules [91].2 We focus on extracting semantic information from a legacy system's source code, not on writing a parser. For this reason, we chose a state-of-the-art parser generation tool to produce our Java parser. We use JavaCC3 to automatically generate a parser from the specification files in the JavaCC repository.4 JavaCC can be used to generate parsers for any grammar. We also utilize JavaCC to generate a parser for SQL statements that are embedded inside the Java source code and for HTML code that is embedded inside Java Servlet code. By using JavaCC, we can extend SA to make it capable of parsing other programming languages with little effort. The Information Extractor (IEx) component of SA is composed of several visitor design patterns. Visitor design patterns give the flexibility to change the operation being performed on a structure without the need to change the classes of the elements on which the operation is performed [38]. Our goal is to build semantic information extraction techniques that can be applied to any source code and can be extended with new algorithms. By using visitor design patterns [71], we do not embed the functionality of the information extraction inside the classes of the Abstract Syntax Tree Generator (ASTG). This separation lets us focus on the information extraction algorithms and maintain the operations being performed whenever necessary. Moreover, new operations over the data structure can be defined simply by adding a new visitor [13].
2 There are over 80 production rules in the Java language according to the Java grammar that we obtained from the JavaCC repository.
3 JavaCC: https://javacc.dev.java.net/
4 JavaCC repository: http://www.cobase.cs.ucla.edu/pub/javacc/
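To make the visitor-based extension mechanism described above concrete, the sketch below uses a drastically simplified AST with only two node types. The node classes and the visitor interface are invented for illustration and do not reproduce the JavaCC-generated classes of our prototype; a new analysis is added by writing a new visitor, without touching the node classes.

import java.util.ArrayList;
import java.util.List;

public class VisitorSketch {
    // A drastically simplified AST: only the parts needed to show the pattern.
    interface Node { void accept(Visitor v); }

    record OutputStatement(String formatString, String variable) implements Node {
        public void accept(Visitor v) { v.visit(this); }
    }
    record Assignment(String variable, String expression) implements Node {
        public void accept(Visitor v) { v.visit(this); }
    }

    interface Visitor {
        void visit(OutputStatement s);
        void visit(Assignment a);
    }

    // A new analysis is added by writing a new visitor, without touching the node classes.
    static class OutputTextCollector implements Visitor {
        final List<String> descriptions = new ArrayList<>();
        public void visit(OutputStatement s) { descriptions.add(s.formatString()); }
        public void visit(Assignment a) { /* not relevant for this analysis */ }
    }

    public static void main(String[] args) {
        List<Node> ast = List.of(
                new Assignment("V", "R.getString(1)"),
                new OutputStatement("Course code:", "V"));
        OutputTextCollector collector = new OutputTextCollector();
        ast.forEach(n -> n.accept(collector));
        System.out.println(collector.descriptions); // [Course code:]
    }
}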
3.1.4 Application of Program Understanding Techniques in SA
We have introduced program understanding techniques in Section 2.5. In this section, we present how we apply these techniques in SA. SA has two components, as shown in Figure 3-5. The input of the Information Extractor (IEx) component is an abstract syntax tree (AST). The AST is the output of our Abstract Syntax Tree Generator (ASTG), which is actually a parser. As mentioned in Section 2.5, processing the source code with a parser to produce an AST is one of the program understanding techniques, known as Syntactic Analysis [49]. We perform the rest of the program understanding techniques on the AST by using the visitor design pattern classes of the IEx. One of the program understanding techniques we apply is Pattern Matching [49]. We wrote a visitor class that looks for certain patterns inside the source code. The patterns, such as input/output statements, are stored in a class structure, and new patterns can simply be added to this class structure as needed. The visitor class that searches for these patterns identifies the variables in the input/output statements as slicing variables. For instance, the variable V in Table 3-5 is identified as a slicing variable since it is used in an output statement. Program Slicing [75] is another program understanding technique mentioned in Section 2.5. We analyze all the statements affecting a variable that is used in an output statement. This technique is also known as backward slicing. SA also applies the Call Graph Analysis technique [83]. SA produces the inter-procedural call graph of the source code and analyzes only the methods that exist in this graph. Starting from a specific method (e.g., the main method of a stand-alone Java class or the doGet method of a Java Servlet), SA traverses all methods that can possibly be executed at run time. In this way, SA avoids analyzing unused methods. Such methods can reflect old functionality of the system, and analyzing them can lead to incorrect, misleading information. An example of an inter-procedural call graph of a program source code is shown in Figure 3-10. SA does not analyze method1 of Class1, method1 of Class2, and method3 of Class3 since they are never called from inside other methods.
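The pruning of unused methods amounts to a reachability computation over the call graph. The sketch below performs a breadth-first traversal from the entry point over an invented call graph; it is a simplification of what SA does on the real inter-procedural call graph.

import java.util.*;

public class CallGraphReachability {
    // Return the set of methods reachable from the entry point (e.g., main or doGet),
    // using a breadth-first traversal of the call graph.
    static Set<String> reachable(Map<String, List<String>> callGraph, String entry) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> work = new ArrayDeque<>(List.of(entry));
        while (!work.isEmpty()) {
            String m = work.poll();
            if (visited.add(m)) {
                work.addAll(callGraph.getOrDefault(m, List.of()));
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // Invented call graph: only methods reachable from main are analyzed.
        Map<String, List<String>> callGraph = Map.of(
                "Main.main", List.of("Class1.method2", "Class2.method2"),
                "Class1.method2", List.of("Class3.method1"),
                "Class2.method2", List.of(),
                "Class1.method1", List.of("Class3.method3"));   // never called from main
        System.out.println(reachable(callGraph, "Main.main"));
        // [Main.main, Class1.method2, Class2.method2, Class3.method1]
    }
}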
Figure 3-10. Inter-procedural call graph of a program source code.

Data Flow Analysis [83] is another program understanding technique that we implemented in the IEx with visitor design patterns. As mentioned in Section 2.5, Data Flow Analysis is the analysis of the flow of the values of variables into other variables. SA analyzes the data flow in variable dependency graphs (i.e., the flow of data between variables): it analyzes assignment statements and makes the necessary changes in the values stored in the symbol table of the class being analyzed. SA also analyzes the data flow in system dependency graphs (i.e., the flow of data between methods): it analyzes method calls, initializes the values of method variables with the actual parameters of the method call, and transfers back the value of the return variable at
Table 3-1. Semantic Analyzer can transfer information from one method to another through variables and can use this information to discover semantics of a schema element.

public ResultSet returnList() {
    ResultSet rs = null;
    try {
        String query = "SELECT Code, Time, Day, Pl, Inst FROM Course";
        rs = sqlStatement.executeQuery(query);
    } catch (Exception ex) {
        researchErr = ex.getMessage();
    }
    return rs;
}

ResultSet rsList = returnList();
String dataOut = "";
while (rsList.next()) {
    dataOut = rsList.getString(4);
    ...
    System.out.println("Class is held in room number:" + dataOut);
}
the end of the method. SA can thus transfer information from one method to another through variables and can use this information to discover the semantics of a schema element. The code fragment in Table 3-1 is given as an example of this capability. Inside the method, the value of the variable query is transferred to the variable rs. At the end of the method, the value of rs is transferred to the variable rsList. The value of the fourth field of the query result is then stored in a variable and printed out. When we relate the text in the output statement with the fourth field of the query, we can conclude that the Pl field of table Course corresponds to 'Class is held in room number'.
3.1.5 Heuristics Used for Information Extraction
A heuristic is any method found through observation which produces correct or sufficiently exact results when applied in commonly occurring conditions. We have developed several guidelines (heuristics) through observation to extract semantics from application source code and report design templates. These heuristics relate semantically rich descriptive texts to schema elements. They are based mainly on the layout and format (e.g., font size, face, color, and type) of the data and description texts that are
used to communicate with users through the console with input/output statements or through a report. We introduce these heuristics below. The first six heuristics in this section were developed to extract information from the source code of applications that communicate with users through the console with input/output statements. Please note that the code fragments in the first six heuristics contain Java-specific input, output, and database-related statements that use syntax based on the Java API. We parameterized these statements in our SA prototype. Therefore, it is theoretically straightforward to add new input, output, and database-related statement names or to switch to another language if necessary. We developed the rest of the heuristics to extract semantics from reports. We use these heuristics to extract semantic information either from reports generated by Java Servlets or from report design templates.
Heuristic 1. Application code generally has input-output statements that display the results of queries executed on the underlying database. Typically, output statements display one or more variables and/or contain one or more format strings. Table 3-2 shows a format string "\n Course code:\t" followed by a variable V.

Table 3-2. Output string gives clues about the semantics of the variable following it.

System.out.println("\n Course code:\t" + V);
Heuristic 2. The format string in an input-output statement describes the displayed slicing variable that comes after it. The format string '\n Course code:\t' describes the variable V in Table 3-2.
Heuristic 3. The format string that carries the semantic information and the variable may not be in the same statement and may be separated by an arbitrary number of statements, as shown in Table 3-3.

Table 3-3. Output string and the variable may not be in the same statement.
System.out.println("\n Course code:");
...
...
System.out.print(V);

Heuristic 4. There may be an arbitrary number of format strings carrying semantics in different statements, and they may be separated by an arbitrary number of statements, before we encounter an output of the slicing variable. The concatenation of the format strings preceding the slicing variable gives more clues about the variable's semantics. An example is shown in Table 3-4.

Table 3-4. Output strings before the slicing variable should be concatenated.
System.out.print("\n Course");
System.out.println("\t code:");
System.out.print(V);
Heuristic 5. An output text in an output statement and a following variable in the same or a following output statement are semantically related. The output text can be considered the variable's possible semantics. We can trace the variable back through backward slicing and identify the schema element in the data source that assigns a value to it. We can conclude that this schema element and the variable are related. We can then relate the output text with the schema element. The Java code sample with an embedded SQL query in Table 3-5 illustrates our point.

Table 3-5. Tracing back the output text and associating it with the corresponding column of a table.
Q = "SELECT C FROM T";
R = S.executeQuery(Q);
V = R.getString(1);
System.out.println("Course code: " + V);
In Table 3-5, the variable V is associated with the text 'Course code'. It is also associated with the first column of the query result in R, which is called C. Hence the column C can be associated with the text 'Course code'.
Heuristic 6. If the variable V is compared with column C of table T in the where-clause of the query Q, and if one can associate a text string from an input/output statement denoting the meaning of variable V, then we can associate this meaning of V with column C of table T. The Java code sample with an embedded SQL query in Table 3-6 illustrates our point.

Table 3-6. Associating the output text with the corresponding column in the where-clause.
Q = "SELECT * FROM T WHERE C = '" + V + "'";
R = S.executeQuery(Q);
System.out.println("Course code: " + V);
In Table 3-6, the variable V is associated with the text 'Course code:'. It is also associated with column C of table T. Hence the schema element C can be associated with the text 'Course code'.

Table 3-7. Column header describes the data in that column.
College   Course   Title            Instructor
CAS       CS101    Intro Comp.      Dr. Long
GRS       CS640    Artificial Int.  Dr. Betke
Heuristic 7. A column header H (i.e., a description text) in a table on a report describes the value of a data element D in that column. We can associate the header H with the data D presented in the same column. For example, the header "Instructor" in the fourth column describes the value "Dr. Long" in Table 3-7.

Table 3-8. Column on the left describes the data items listed to its immediate right.
Course        CSE103 Introduction to Databases
Credits       3
Description   Core concepts in databases
Heuristic 8. A descriptive text T on a row of a table on a report describes the value of a data element D to its right on the same row of the table. We can associate the text T with the data D presented in the same row. For example, the text "Description" on the third row describes the value "Core concepts in databases" in Table 3-8.

Table 3-9. Column on the left and the header immediately above describe the same set of data items.
Core Courses
Course        CSE103 Introduction to Databases
Credits       3
Description   Core concepts in databases
Elective Courses
Course        CSE131 Problem Solving
Credits       3
Description   Use of Comp. for problem solving

Heuristic 9. Heuristics 7 and 8 can be combined: both the header above a data item in the same column and the text on the left hand side in the same row describe the data. For example, both the text "Course" on the left hand side and the header "Elective Courses" above the data "CSE131 Problem Solving" describe that data in Table 3-9.

Table 3-10. Set of data items can be described by two different headers.
Course             Instructor
Code      Room     Name           Room
CIS4301   E221     Dr. Hammer     E452
COP6726   E112     Dr. Jermaine   E456

Heuristic 10. If more than one header corresponds to a data item on a report, all of those headers describe the data. For example, both the header "Instructor" and the header "Room" describe the value "E452" in Table 3-10.

Table 3-11. Header can be processed before being associated with the data in a column.
Course   Title (Credits)            Instructor
CS105    Comp. Concepts ( 3.0 )     Dr. Krentel
CS110    Java Intro Prog. ( 4.0 )   Dr. Bolker
Heuristic 11. The data value presented in a column can be retrieved from more than one data item in the schema. In that case, the format of the column header gives clues about how we need to parse the header and associate it with the data items. For example, the data in the second column of Table 3-11 is retrieved from two data items in the data source. The format of the header "Title (Credits)" tells us that we need to take the parentheses into account while parsing the header and associating the data items in the column with the header.
In this section, we have introduced the Semantic Analyzer (SA). SA extracts information about the semantics of schema elements from application source code. This information is an essential input for the Schema Matching (SM) component. In the following section, we introduce our schema matching approach and how we use SA to discover semantics for SM.
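As an illustration only (not the SA implementation itself), the following minimal Java sketch shows how Heuristics 1 through 4 could be applied mechanically: output statements are scanned, format strings are buffered and concatenated, and the accumulated text is associated with the next printed (slicing) variable. The Statement type and the describe method are hypothetical placeholders.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Minimal sketch of Heuristics 1, 2 and 4: collect format strings that appear in
 *  output statements and attach their concatenation to the next printed (slicing)
 *  variable. The Statement type and its fields are hypothetical simplifications. */
class OutputHeuristicsSketch {

    /** A simplified view of a parsed statement (hypothetical structure). */
    static class Statement {
        boolean isOutput;        // e.g., a System.out.print/println call
        String formatString;     // literal text, if any
        String printedVariable;  // variable name, if any
        Statement(boolean isOutput, String formatString, String printedVariable) {
            this.isOutput = isOutput;
            this.formatString = formatString;
            this.printedVariable = printedVariable;
        }
    }

    /** Returns a map from variable name to the descriptive text preceding it. */
    static Map<String, String> describe(List<Statement> statements) {
        Map<String, String> descriptions = new HashMap<>();
        StringBuilder pending = new StringBuilder();    // concatenated format strings (Heuristic 4)
        for (Statement s : statements) {
            if (!s.isOutput) continue;                  // Heuristic 3: other statements may intervene
            if (s.formatString != null) pending.append(s.formatString).append(' ');
            if (s.printedVariable != null) {
                descriptions.put(s.printedVariable, pending.toString().trim());  // Heuristic 2
                pending.setLength(0);                   // start collecting for the next variable
            }
        }
        return descriptions;
    }

    public static void main(String[] args) {
        List<Statement> code = new ArrayList<>();
        code.add(new Statement(true, "\n Course", null));
        code.add(new Statement(true, "\t code:", null));
        code.add(new Statement(true, null, "V"));
        System.out.println(describe(code));             // maps V to the concatenated text describing it
    }
}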
3.2 Schema Matching
Schema matching aims at discovering semantic correspondences between schema elements of disparate but related data sources. To match schemas, we need to identify the semantics of the schema elements. When done manually, this is a tedious, time-consuming, and error-prone task. Much research has been carried out to automate schema matching; see, for example, [25, 28, 78]. However, despite the ongoing efforts, current schema matching approaches, which use the schemas themselves as the main input for their algorithms, still rely heavily on manual input [26]. This dependence on human involvement is due to the well-known fact that schemas represent semantics poorly. Hence, we believe that improving current schema matching approaches requires improving the way we discover semantics. Discovering semantics means gathering information about the data so that, after processing it, a computer can decide how to use the data in the way a person would. In the context of schema matching, we are interested in finding information that leads us to find a path from schema elements in one data source to the corresponding
schema elements in the other. Therefore, we define discovering semantics for schema matching as discovering paths between corresponding schema elements in different data sources. We reduce the level of difficulty of the schema matching problem by abstracting it to the matching of automatically generated documents, such as reports, that are semantically richer than the schemas to which they correspond. Reports and other user-oriented output, which are typically generated by report generators, do not use the names of schema elements directly but rather provide more descriptive names to make the output more comprehensible to the users. These descriptions, together with their formatting instructions and their relationships to the underlying schema elements in the data source, can be extracted from the application code generating the report. These semantically rich descriptions, which can be linked to the schema elements in the source, can be used to discover relationships between data sources and hence between the underlying schemas. Moreover, reports use more domain terminology than schemas; therefore, domain dictionaries are particularly helpful here, more so than when they are applied directly to schema element names in schema matching algorithms. One can argue that the reports of an information system may not cover the entire schema and hence that our approach may not find matches for all schema elements. It is important to note that we do not have to match all the schema elements of two data sources in order to have two organizations collaborate. We believe the reports typically present the most important data of the information system, which is also likely to be the set of elements that is important for the ensuing data integration scenario. Thus, starting the schema matching process from reports can help focus on the important data and eliminate any effort spent matching unnecessary schema elements.
3.2.1 Motivating Example
We present a motivating example to show how analyzing report-generating application source code and report design templates can help us better understand the semantics of schema elements. We choose our motivating example reports from the university domain
because the university domain is well known and easy to understand. To create our motivating example, we use the THALIA5 testbed and benchmark which provides a collection of over 40 downloadable data sources representing university course catalogs from computer science departments worldwide [47].
Figure 3-11. Schemas of two data sources that collaborate for a new online degree program.
We motivate the need for schema matching across the two data sources of computer science departments with a scenario. Let us assume that the computer science departments of two universities, A and B, start to collaborate on a new online degree program. Unless one is content to query each source separately, one can imagine the existence of a course schedule mediator capable of providing integrated access to the different course sites. The mediator enables us to query the data of both universities and presents the results in a uniform way. Such a mediator requires finding relationships across the source schemas S1 and S2 of universities A and B shown in Figure 3-11. This is a challenging task when limited to the information provided by the data source alone. By just using the schema names in the figure, one can match schema elements of the two schemas in various ways. For instance, one can match the schema element Name in relation Offerings of schema S2 with the schema element Name in relation Schedule of schema S1 or with the schema element Name in relation CourseInt of schema S1. Both mappings seem reasonable when we only consider the available schema information. However, when we consider the reports generated by application source code using these data source schemas, we can decide on the mappings more accurately. Information system applications that generate reports retrieve data from the data source, format the data, and present it to users. To make the data more comprehensible to the user, these applications generally do not use the names of schema elements but assign more descriptive names (i.e., titles) to the data, using domain-specific terms when applicable.

5 THALIA Website: http://www.cise.ufl.edu/project/thalia.html
Figure 3-12. Reports from two sample universities listing courses.
For our motivating example, university A has reports R1 and R3 and university B has reports R2 and R4 presenting data from their data sources. Reports R1 and R2 present course listings, and reports R3 and R4 present instructor office information from the corresponding universities. We show these simplified sample reports (R1, R2, R3, and R4) and the schemas (S1 and S2) in Figures 3-12 and 3-13. The reader can easily match the column headers (blue dotted arrows in Figures 3-12 and 3-13). On the other hand, it is hard to match the schema elements of the data sources correctly by considering only their names. However, it again becomes straightforward to identify semantically related schema elements if we know the links between column headers and schema elements (red arrows in Figures 3-12 and 3-13).
Figure 3-13. Reports from two sample universities listing instructor offices.
Our idea is to find mappings between descriptive texts on reports (blue dotted arrows) by using semantic similarity functions and to find the links between these texts and schema elements (red arrows) by analyzing the application source code and report design templates. For this purpose, we first analyze the application source code or the report design template generating each report. For each report, we store our findings, such as descriptive texts (e.g., column headers), schema elements, and the relations between the descriptive texts and the schema elements, in an instance of the report ontology. We give the details of the report ontology in Section 3.2.3. We pair report ontology instances, one from the first data source and one from the second data source, and then compute the similarities between all possible report ontology instance pairs. For our example, the four possible report pairs when we select one report from DS1 and the other from DS2 are [R1-R2], [R1-R4], [R2-R3], and [R3-R4]. We calculate the similarity scores between the descriptive texts on the reports of each report pair by using WordNet-based semantic similarity functions, which we describe in Section 3.2.4. We then transfer the similarity scores between descriptive texts of reports to scores between schema elements by using the previously discovered relations between descriptive texts and schema elements. Finally, we merge the similarity scores of schema elements computed for each report pair and form a final matrix holding the similarity scores between schema elements that are higher than a threshold. We address the details of each step of our approach in Section 3.2.2. When we apply our schema matching approach to the example schemas and reports described above, we obtain a precision value of 0.86 and a recall value of 1.00. We show the similarity scores between schema elements of data sources DS1 and DS2 that are greater than the threshold (0.5) in Figure 3-14. These results are better than the results obtained when matching the above schemas with the COMA++ (COmbination of MAtching algorithms) framework.6 COMA++ [7] is a well known and well respected schema matching framework providing a downloadable prototype. This example suggests that our approach promises better accuracy for schema matching than existing approaches. We provide a detailed evaluation of the approach in Chapter 6. In the next section, we describe the steps of our schema matching approach.

6 We use the default COMA++ All Context combined matcher.
Figure 3-14. Similarity scores of schema elements of two data sources.
3.2.2 Schema Matching Approach
The main idea behind our approach is that user-oriented outputs, such as reports, encapsulate valuable information about the semantics of data which can be used to facilitate schema matching. Applying the well-known program understanding techniques described in Section 3.1.4, we can extract semantically rich textual descriptions and relate them to the data presented on reports using the heuristics described in Section 3.1.5. We can trace the data back to the corresponding schema elements in the data source and match the corresponding schema elements in the two data sources. Below, we outline the steps of our schema matching approach, which we call Schema Matching by Analyzing ReporTs (SMART); a code-level sketch of the overall pipeline follows the list. In the next sections, we provide detailed descriptions of these steps, which are shown in Figure 3-15.
• Creating an Instance of a Report Ontology
• Computing Similarity Scores
• Forming a Similarity Matrix
• From Matching Ontologies to Schemas
• Merging Results
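As a rough illustration of how these five steps fit together, the following self-contained Java sketch runs the pipeline on toy data. The ReportOnt record, the string-based similarity, and the scores are simplified stand-ins for the prototype's real components and are not part of the SMART implementation; the merge here averages over the pairs that contribute a score, a simplification of the merging step described in Section 3.2.7.

import java.util.*;

/** Runnable toy sketch of the five SMART steps; every type and score here is a
 *  simplified stand-in for the prototype's real components. */
class SmartPipelineSketch {

    /** Step 1 stand-in: a "report ontology instance" reduced to
     *  descriptive-text -> schema-element links plus a report title. */
    record ReportOnt(String title, Map<String, String> descToSchemaElement) {}

    /** Toy text similarity (the real system uses WordNet-based measures). */
    static double sim(String a, String b) {
        return a.equalsIgnoreCase(b) ? 1.0
                : (a.toLowerCase().contains(b.toLowerCase())
                   || b.toLowerCase().contains(a.toLowerCase()) ? 0.7 : 0.0);
    }

    static Map<String, Double> match(List<ReportOnt> ds1, List<ReportOnt> ds2, double threshold) {
        Map<String, Double> sum = new HashMap<>();
        Map<String, Integer> count = new HashMap<>();
        for (ReportOnt a : ds1) for (ReportOnt b : ds2) {
            double reportScore = sim(a.title(), b.title());       // Step 2: report-level filter
            if (reportScore < threshold) continue;
            for (var da : a.descToSchemaElement().entrySet())     // Steps 3-4: per-pair scores
                for (var db : b.descToSchemaElement().entrySet()) {
                    double s = sim(da.getKey(), db.getKey()) * reportScore;
                    String key = da.getValue() + " ~ " + db.getValue();
                    sum.merge(key, s, Double::sum);               // Step 5: merge across pairs
                    count.merge(key, 1, Integer::sum);
                }
        }
        Map<String, Double> merged = new HashMap<>();
        sum.forEach((k, v) -> { double avg = v / count.get(k); if (avg >= threshold) merged.put(k, avg); });
        return merged;
    }

    public static void main(String[] args) {
        ReportOnt r1 = new ReportOnt("Course Schedule", Map.of("Course Code", "Schedule.Code"));
        ReportOnt r2 = new ReportOnt("Course Schedule", Map.of("Course Code", "Offerings.CNum"));
        System.out.println(match(List.of(r1), List.of(r2), 0.5)); // {Schedule.Code ~ Offerings.CNum=1.0}
    }
}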
Figure 3-15. Five steps of Schema Matching by Analyzing ReporTs (SMART) algorithm.
3.2.3 Creating an Instance of a Report Ontology
In the first step, we analyze the application source code that generates a report. We described the details of the semantic analysis process in Section 3.1. The semantic information extracted from the source code or from a report design template is stored in an instance of the report ontology. We developed an ontology for reports after analyzing some of the most widely used open source report generation tools such as Eclipse BIRT,7 JasperReport8 and DataVision.9 We designed the report ontology using the Protege Ontology Editor10 and represented it in OWL (Web Ontology Language). The UML diagram of the report ontology depicted in Figure 3-16 shows the concepts, their properties, and their relations with other concepts. We store information about the descriptive texts on a report (e.g., column headers) and information about the source of the data (i.e., schema elements) presented on a report in an instance of the report ontology. The descriptive text and schema element properties are stored in the description element and data element concepts of the report ontology, respectively. The data element concept has properties such as attribute, table (the table of the attribute in the relational database), and type (the type of the data stored in the attribute). We identify the relation between a description element concept and a data element concept with the help of a set of heuristics, which are based on the location and format information described in Section 3.1.5, and store this information in the hasDescription relation property of the description element concept.
7 Eclipse BIRT: http://www.eclipse.org/birt/
8 JasperReport: http://jasperreports.sourceforge.net/
9 DataVision: http://datavision.sourceforge.net/
10 Protege tool: http://protege.stanford.edu/
Figure 3-16. Unified Modeling Language (UML) diagram of the Schema Matching by Analyzing ReporTs (SMART) report ontology.
The design of the report ontology does not change from one report to another, but the information stored in an instance of the report ontology changes based on the report being analyzed. We placed the data element concept in the center of the report ontology, as shown in Figure 3-16. This design is appropriate for the calculation of similarity scores between data element concepts according to the formula described in Section 3.2.4.
3.2.4 Computing Similarity Scores
We compute similarity scores between all possible data element concept pairs, consisting of a data element concept from an instance of the report ontology of the first data source and another data element concept from an instance of the report ontology of the second data source. This means that if there are m reports having n data element concepts on average for data source DS1 and k reports having l data element concepts on average
for data source DS2, we compute similarity scores for m·n·k·l pairs of data element concepts. However, computing similarity scores for all possible report ontology instance pairs may be unnecessary. For example, unrelated report pairs, such as a report describing employee payments paired with one describing student grades at a university, are unlikely to have semantically related schema elements; therefore, we may not find any semantic correspondences by computing similarity scores between concepts of unrelated report ontology instance pairs. To save computation time, we filter out pairs of semantically unrelated reports. To determine whether two reports are semantically related, we first extract the texts (i.e., titles, footers, and data headers) of the two reports and calculate similarity scores between these texts. If the similarity score between the texts of a report pair is below a predetermined threshold, we assume that the pair presents semantically unrelated data and do not compute similarity scores for its data element pairs. The similarity of two objects depends on the similarities of the components that form the objects. An ontology concept is formed by the properties and the relations it has. Each relation of an ontology concept connects the concept to a neighbor concept. Therefore, the similarity of two concepts depends on the similarities of the properties of the concepts and the similarities of their neighbor concepts. For example, the similarity of two data element concepts from different instances of the report ontology depends on the similarity of their properties attribute, table, and type and the similarities of their neighbor concepts DescriptionElement, Header, Footer, etc. Our similarity function between concepts of instances of an ontology is similar to the function proposed by Rodriguez and Egenhofer [81], who also consider sets of features (properties) and semantic relations (neighbors) among concepts while assessing similarity scores among entity classes from different ontologies. While their similarity function aims to find similarity scores between concepts from different
ontologies, our similarity function finds similarity scores between concepts of different instances of the same ontology. We formulate the similarity of two concepts in different instances of an ontology as follows:

sim_c(c_1, c_2) = w_p \cdot sim_p(c_1, c_2) + w_n \cdot sim_n(c_1, c_2)    (3–1)

where c_1 is a concept in an instance of the ontology, c_2 is the same type of concept in another instance of the ontology, w_p is the weight of the total similarity of the properties of that concept, and w_n is the weight of the total similarity of the neighbor concepts that can be reached from that concept by a relation. sim_p(c_1, c_2) and sim_n(c_1, c_2) are the formulas that calculate the similarities of the properties and of the neighbors, respectively. We can formulate sim_p(c_1, c_2) as follows:

sim_p(c_1, c_2) = \sum_{i=1}^{k} w_{p_i} \cdot SimFunc(c_1p_i, c_2p_i)    (3–2)
where k is the number of properties of that concept, w_{p_i} is the weight of the ith property, c_1p_i is the ith property of the concept in the first report ontology instance, and c_2p_i is the corresponding property of the concept in the second report ontology instance. SimFunc is the function that we use to assess a similarity score between the values of the properties of two concepts. For description elements, SimFunc is a semantic similarity function between texts, similar to the text-to-text similarity function of Corley and Mihalcea [21]. To calculate the similarity score between two text strings T1 and T2, we first eliminate stop words (e.g., a, and, but, to, by). We then find, for each word in T1, the word in T2 with the maximum similarity score. The similarity score between two words, one from T1 and the other from T2, is obtained from a WordNet-based semantic similarity function such as the Jiang and Conrath metric [52]. We sum up the maximum scores and divide the sum by the word count of T1. The result is the measure of similarity between T1 and T2 for the direction from T1 to T2. We repeat the process for the reverse direction (i.e., from T2 to T1) and then compute the average of the two scores to obtain a bidirectional similarity score. We use different similarity functions for different properties. If the property that we are comparing contains text data, such as the description property, we use one of the word semantic similarity functions that we introduced in Section 2.11. By using a semantic similarity measure instead of a lexical similarity measure such as edit distance, we can detect similarities between words that are lexically far apart but semantically close, such as 'lecturer' and 'instructor', and we can also rule out words that are lexically close but semantically far apart, such as 'tower' and 'power'. Besides the description property of the description element concept, we also use semantic similarity measures to compute similarity scores for the footernote property of the footer concept, the headernote property of the header concept, and the title property of the report concept. If the property that we are comparing is the attribute or table property of the data element concept, we assess a similarity score based on the Levenshtein edit similarity measure. Besides the attribute property of the data element concept, we also use the edit similarity measure to compute similarity scores for the query property of the report concept. In the following formula, which calculates the similarity between the neighbors of two concepts, l is the number of relations of the concepts we are comparing, w_{n_i} is the weight of the ith relation, and c_1n_i (c_2n_i) is the neighbor concept of the first (second) concept that we reach by following the ith relation.
sim_n(c_1, c_2) = \sum_{i=1}^{l} w_{n_i} \cdot sim_c(c_1n_i, c_2n_i)    (3–3)
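To make the bidirectional text-to-text measure described above concrete, the following sketch computes it for two strings. The wordSimilarity method is a placeholder for a WordNet-based word measure such as Jiang-Conrath (the prototype obtains such scores through the WordNet::Similarity library), and the stop-word list is abbreviated.

import java.util.*;

/** Sketch of the directional/bidirectional text similarity described above. */
class TextSimilaritySketch {

    static final Set<String> STOP_WORDS = Set.of("a", "and", "but", "to", "by", "the", "of");

    /** Placeholder word measure; a real implementation would call a WordNet-based metric. */
    static double wordSimilarity(String w1, String w2) {
        return w1.equalsIgnoreCase(w2) ? 1.0 : 0.0;
    }

    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        for (String t : text.toLowerCase().split("\\W+"))
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) result.add(t);
        return result;
    }

    /** Directional score: average, over the words of t1, of the best match found in t2. */
    static double directional(List<String> t1, List<String> t2) {
        if (t1.isEmpty() || t2.isEmpty()) return 0.0;
        double sum = 0.0;
        for (String w1 : t1) {
            double best = 0.0;
            for (String w2 : t2) best = Math.max(best, wordSimilarity(w1, w2));
            sum += best;
        }
        return sum / t1.size();
    }

    /** Bidirectional score: average of the two directional scores. */
    static double similarity(String text1, String text2) {
        List<String> t1 = tokens(text1), t2 = tokens(text2);
        return (directional(t1, t2) + directional(t2, t1)) / 2.0;
    }

    public static void main(String[] args) {
        System.out.println(similarity("Course code", "Code of the course")); // 1.0 with the toy word measure
    }
}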
Note that our similarity function is generic and can be used to calculate similarity scores between concepts of instances of any ontology. Even though the formulas in Equations 3–1, 3–2, and 3–3 are recursive in nature, we do not encounter recursive behavior when we apply them to compute similarity scores between data elements of report ontologies. That is because there is no path back to the data element concept through
relations from the neighbors of the data element concept. In other words, the neighbor concepts of the data element concept do not have the data element concept as a neighbor. We apply the above formulas to calculate similarity scores between data element concepts of two different report ontology instances. The data element concept has the properties attribute, table, and type and the neighbor concepts description element, report, header, and footer. The similarity score between two data element concepts can be formulated as follows:

sim_{DataElement}(DataElement_1, DataElement_2) =
      w_1 \cdot SimFunc(Attribute_1, Attribute_2)
    + w_2 \cdot SimFunc(Table_1, Table_2)
    + w_3 \cdot SimFunc(Type_1, Type_2)
    + w_4 \cdot sim_{DescriptionElement}(DescriptionElement_1, DescriptionElement_2)
    + w_5 \cdot sim_{Report}(Report_1, Report_2)
    + w_6 \cdot sim_{Header}(Header_1, Header_2)
    + w_7 \cdot sim_{Footer}(Footer_1, Footer_2)    (3–4)

We explain how we determine the weights w_1 to w_7 in Section 6.2. The similarity scores between two description element, report, header, and footer concepts can be computed by the following formulas:

sim_{DescriptionElement}(DescriptionElement_1, DescriptionElement_2) = SimFunc(Description_1, Description_2)    (3–5)

sim_{Report}(Report_1, Report_2) = SimFunc(Query_1, Query_2) + SimFunc(Title_1, Title_2)    (3–6)

sim_{Header}(Header_1, Header_2) = SimFunc(HeaderNote_1, HeaderNote_2)    (3–7)

sim_{Footer}(Footer_1, Footer_2) = SimFunc(FooterNote_1, FooterNote_2)    (3–8)
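A minimal code rendering of Equation 3-4 is shown below. The DataElement record, the exact-match simFunc, and the weight values are illustrative placeholders; in the actual approach the property-specific measures and the learned weights of Section 6.2 are used.

/** Sketch of Equation 3-4: a weighted combination of property and neighbor similarities. */
class DataElementSimilaritySketch {

    record DataElement(String attribute, String table, String type,
                       String description, String query, String title,
                       String headerNote, String footerNote) {}

    /** Placeholder for the property-specific similarity measures. */
    static double simFunc(String a, String b) {
        return (a != null && a.equalsIgnoreCase(b)) ? 1.0 : 0.0;
    }

    /** Weights w1..w7; illustrative values only (the dissertation learns them by regression). */
    static final double[] W = {0.30, 0.05, 0.01, 0.50, 0.05, 0.02, 0.07};

    static double similarity(DataElement d1, DataElement d2) {
        return W[0] * simFunc(d1.attribute(), d2.attribute())
             + W[1] * simFunc(d1.table(), d2.table())
             + W[2] * simFunc(d1.type(), d2.type())
             + W[3] * simFunc(d1.description(), d2.description())   // sim_DescriptionElement
             + W[4] * (simFunc(d1.query(), d2.query())
                       + simFunc(d1.title(), d2.title()))           // sim_Report
             + W[5] * simFunc(d1.headerNote(), d2.headerNote())     // sim_Header
             + W[6] * simFunc(d1.footerNote(), d2.footerNote());    // sim_Footer
    }

    public static void main(String[] args) {
        DataElement a = new DataElement("Code", "Schedule", "varchar", "Course code", null, null, null, null);
        DataElement b = new DataElement("CNum", "Offerings", "varchar", "Course code", null, null, null, null);
        System.out.println(similarity(a, b)); // 0.51 with these toy weights and the exact-match simFunc
    }
}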
3.2.5 Forming a Similarity Matrix
To form a similarity matrix, we connect to the underlying data sources using a call-level interface (e.g., JDBC) and extract the schemas of the two data sources to be integrated. A similarity matrix is a table storing similarity scores for two schemas such that the elements of the first schema form the column headers and the elements of the second schema form the row headers. The similarity scores are in the range [0,1]. The similarity matrix given as an example in Figure 3-17 contains schema elements from the motivating example in Section 3.2.1; the similarity scores between the schema elements are fictitious.
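As an illustration of the schema-extraction step, the following hedged sketch lists tables and columns through the standard JDBC DatabaseMetaData interface; the connection URL and credentials are placeholders.

import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;
import java.sql.ResultSet;

/** Sketch of extracting a relational schema over JDBC, as needed to label the
 *  rows and columns of a similarity matrix. URL and credentials are placeholders. */
class SchemaExtractorSketch {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/thalia", "user", "password")) {
            DatabaseMetaData meta = con.getMetaData();
            try (ResultSet tables = meta.getTables(null, null, "%", new String[] {"TABLE"})) {
                while (tables.next()) {
                    String table = tables.getString("TABLE_NAME");
                    try (ResultSet cols = meta.getColumns(null, null, table, "%")) {
                        while (cols.next()) {
                            // Each table.column pair becomes one schema element of the matrix.
                            System.out.println(table + "." + cols.getString("COLUMN_NAME")
                                    + " : " + cols.getString("TYPE_NAME"));
                        }
                    }
                }
            }
        }
    }
}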
Figure 3-17. Example for a similarity matrix.
3.2.6 From Matching Ontologies to Schemas
In the first step, we traced a data element to its corresponding schema element(s). We use this information to convert the inter-ontology matching scores into scores between schema elements. Using the converted scores, we then fill in a similarity matrix for each report pair. Note that we find similarity scores only for the subset of the schema that is used in the reports. We believe the reports typically present the most important data of the information system, which is likely to be the set of elements that is important for the ensuing data integration scenario. Even though the reports of an information system may not cover the entire schema, our approach can help focus on the important data, thus eliminating efforts
to match unnecessary schema elements. Note that each similarity matrix can be sparse, with only a small subset of its cells filled in, as shown in Figures 3-18 and 3-19.
Figure 3-18. Similarity scores after matching report pairs about course listings.
Figure 3-19. Similarity scores after matching report pairs about instructor offices.
3.2.7 Merging Results
After generating a similarity matrix for each report pair, we need to merge them into a final similarity matrix. If we have more than one score for a schema element pair, we need to merge the scores. In Section 3.2.4, we described how we compute overall similarity scores for report pairs to avoid unnecessary computations between unrelated reports. We use these overall similarity scores between report pairs while merging the similarity scores: we multiply the similarity score of a schema element pair by the overall similarity score of the corresponding report pair and sum up the resulting scores.
We then divide the final score by the number of report pairs. For instance, if the similarity score between schema elements A and B is 0.9 in the first report pair, which has an overall similarity score of 0.7, and 0.5 in the second report pair, which has an overall similarity score of 0.6, then we conclude that the similarity score between schema elements A and B is (0.9 · 0.7 + 0.5 · 0.6)/2 = 0.465. Finally, we eliminate the combined scores that fall below a (user-defined) threshold.
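The merge just described can be sketched as follows; the types are illustrative, and the example in main reproduces the 0.465 computation from the text.

import java.util.*;

/** Sketch of the merging step: per-pair schema-element scores are weighted by the
 *  overall report-pair similarity, averaged, and thresholded. Types are illustrative. */
class MergeResultsSketch {

    /** One similarity matrix produced for a single report pair. */
    record PairMatrix(double reportPairScore, Map<String, Double> elementScores) {}

    static Map<String, Double> merge(List<PairMatrix> matrices, int pairCount, double threshold) {
        Map<String, Double> summed = new HashMap<>();
        for (PairMatrix m : matrices)
            m.elementScores().forEach((pair, score) ->
                    summed.merge(pair, score * m.reportPairScore(), Double::sum));
        Map<String, Double> merged = new HashMap<>();
        summed.forEach((pair, total) -> {
            double finalScore = total / pairCount;
            if (finalScore >= threshold) merged.put(pair, finalScore);
        });
        return merged;
    }

    public static void main(String[] args) {
        // The example from the text: 0.9 in a pair with overall score 0.7, 0.5 in a pair with 0.6.
        PairMatrix p1 = new PairMatrix(0.7, Map.of("A~B", 0.9));
        PairMatrix p2 = new PairMatrix(0.6, Map.of("A~B", 0.5));
        System.out.println(merge(List.of(p1, p2), 2, 0.3)); // {A~B=0.465} (up to floating-point rounding)
    }
}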
CHAPTER 4
PROTOTYPE IMPLEMENTATION
We implemented both the semantic analyzer (SA) component of SEEK and the SMART schema matcher in the Java programming language. As shown in Figure 4-1, we have written 1,350 KB of Java code (approximately 27,000 lines of code) for our prototype implementation. In addition, we have utilized 1,150 KB of Java code (approximately 23,000 lines of code) that was automatically generated by JavaCC. In the following sections, we first explain the SA prototype and then the SMART prototype.
Figure 4-1. Java code size distribution of the Semantic Analyzer (SA) and Schema Matching by Analyzing ReporTs (SMART) packages.
4.1 Semantic Analyzer (SA) Prototype
We have implemented the semantic analyzer (SA) prototype in Java. The SEEK prototype source code is placed in the seek package, and its functionality is divided into several packages. The source code of the semantic analyzer (SA) component resides in the sa package. The Java classes in the sa package are further divided into subpackages according to their functionality. The subpackages of the sa package are listed in Table 4-1.
4.1.1 Using JavaCC to generate parsers
The classes inside the packages syntaxtree, visitor, and parsers are automatically created by the JavaCC tool. JavaCC is a tool that reads a grammar specification and converts it into a Java program that can recognize matches to the grammar according to that specification. As shown in Figure 4-2, JavaCC processes the grammar specification file and outputs the Java files that contain the code of the parser. The parser can process languages that conform to the grammar in the specification file. The parsers generated in this way form the ASTG component of the SA. Grammar specification files for grammars such as Java, C++, C, SQL, XML, HTML, Visual Basic, and XQuery can be found at the JavaCC grammar repository Web site.1 These specification files have been tested and corrected by many JavaCC implementers, which implies that parsers produced from these specifications should be reasonably reliable in producing correct ASTs. The classes generated by JavaCC form the abstract syntax tree generator (ASTG) of the SA, which was described in Section 3.1.2.1. For the SA component of the SEEK prototype, we created parsers for three different grammars: Java, SQL, and HTML. We placed these parsers, the related syntax tree classes, and the generic visitor classes into the parsers, syntaxtree, and visitor packages, respectively. Each Java class inside the syntaxtree package has an accept method to be used by visitors. Visitor classes have visit methods, each of which corresponds to a Java class inside the syntaxtree package. The syntaxtree, visitor, and parsers packages have 142, 15, and 14 classes, respectively. The classes inside these packages remain the same as long as the Java, SQL, and HTML grammars do not change.

Table 4-1. Subpackages in the sa package and their functionality.
package name     classes in the package
visitor          default visitor classes
parsers          classes to parse application source code written in the Java, HTML, and SQL grammars
seekstructures   supplementary classes to analyze application source code
seekvisitors     visitor classes to analyze source code written in the Java, HTML grammars
1 JavaCC repository: http://www.cobase.cs.ucla.edu/pub/javacc/
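As an illustration of the accept/visit structure just described, here is a minimal, self-contained sketch of the visitor pattern over two toy AST node types; the node and visitor names are simplified placeholders and are far smaller than the JavaCC-generated classes.

/** Minimal sketch of the visitor pattern used by the JavaCC-generated classes:
 *  every AST node exposes accept(), every visitor exposes one visit() per node type. */
interface Node { void accept(Visitor v); }

class MethodCallNode implements Node {
    String methodName;
    MethodCallNode(String methodName) { this.methodName = methodName; }
    public void accept(Visitor v) { v.visit(this); }
}

class StringLiteralNode implements Node {
    String value;
    StringLiteralNode(String value) { this.value = value; }
    public void accept(Visitor v) { v.visit(this); }
}

interface Visitor {
    void visit(MethodCallNode n);
    void visit(StringLiteralNode n);
}

/** A toy analysis visitor: reports output-related method calls and string literals. */
class OutputStatementVisitor implements Visitor {
    public void visit(MethodCallNode n) {
        if (n.methodName.contains("println")) System.out.println("output call: " + n.methodName);
    }
    public void visit(StringLiteralNode n) {
        System.out.println("format string: " + n.value);
    }
}

class VisitorSketch {
    public static void main(String[] args) {
        Node[] ast = { new MethodCallNode("System.out.println"), new StringLiteralNode("Course code:") };
        Visitor v = new OutputStatementVisitor();
        for (Node n : ast) n.accept(v);   // double dispatch selects the matching visit()
    }
}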
Figure 4-2. Using JavaCC to generate parsers.
The classes inside the packages seekstructures and seekvisitors are written to fulfill the goals of the SA. The seekstructures and seekvisitors packages have 25 and 10 classes, respectively, and are subject to change as we add new functionality to the SA module. The classes inside these packages form the Information Extractor (IEx) of the SA, which was described in Section 3.1.2.3. The IEx consists of several visitor design patterns. The execution steps of the IEx and the functionality of some selected visitors are described in the next section.
4.1.2 Execution steps of the information extractor
The semantic analysis process has two main steps. In the first step, SA makes the preparations necessary for analyzing the source code and forms the control flow graph. The SA driver accepts the name of a stand-alone Java class file (with the main method) as an argument. Starting from this Java class file, SA finds all the user-defined Java classes to be analyzed in the class path and forms the control flow graph. Next, SA parses all the Java classes in the control flow graph and produces ASTs of these classes. Then, the visitor class ObjectSymbolTable gathers variable declaration information for each class to be analyzed and stores this information in the SymbolTable classes. The SymbolTable classes are passed to each visitor class and are filled with new information as the SA process continues. In the second step, SA identifies the variables used in input, output, and SQL statements. SA uses the ObjectSlicingVars visitor class to identify slicing variables. The list of all input, output, and database-related statements, which are language (Java,
JDBC) specific, is stored in the InputOutputStatements and SqlStatements classes. To analyze additional statements, or to switch to another language, all we need to do is add or update statement names in these classes. When a visitor class encounters a method statement while traversing the AST, it checks this list to find out whether the method is an input, output, or database-related statement. SA finds and parses SQL strings embedded inside the source code. SA uses the ObjectSQLStatement visitor class to find and parse SQL statements. While this visitor traverses the AST, it constructs the values of variables that are of type String. When a String variable or a string literal is passed as a parameter to an SQL execute method (e.g., executeQuery(queryStr)), this visitor class parses the string and constructs the AST of the SQL string. The visitor class ObjectSQLStatement then uses the visitor class ObjectSQLParse to extract information from the SQL string and stores this information in a class named ResultsetSQL. The information gathered from SQL statements and input/output methods is used to construct relations between a database schema element and the text denoting the possible meaning of that schema element. Besides analyzing application source code written in Java, SA can also analyze report design templates represented in XML. The Report Template Parser (RTP) component of the SA uses the Simple API for XML2 (SAX) to parse report templates. The outcome of the IEx is written into report ontology instances represented in OWL. The Report Ontology Writer (ROW) uses the OWL API3 to write the semantic information into OWL ontologies.
2 Simple XML API: http://www.saxproject.org/
3 OWL API: http://owl.man.ac.uk/api.shtml
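To illustrate how a report design template can be parsed with SAX, the following sketch collects the text of header elements from a small XML string; the element name "label" and the sample template are hypothetical and do not reflect the template format of any particular reporting tool.

import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Sketch of SAX-based template parsing: collect descriptive texts (column headers)
 *  from an XML report template. The element name "label" and the sample template
 *  are hypothetical, not the format of any specific report tool. */
class ReportTemplateParserSketch {
    public static void main(String[] args) throws Exception {
        String template = "<report><header><label>Course Code</label>"
                        + "<label>Instructor</label></header></report>";
        List<String> headers = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inLabel = false;
            private final StringBuilder text = new StringBuilder();
            @Override public void startElement(String uri, String local, String qName, Attributes atts) {
                if (qName.equals("label")) { inLabel = true; text.setLength(0); }
            }
            @Override public void characters(char[] ch, int start, int length) {
                if (inLabel) text.append(ch, start, length);
            }
            @Override public void endElement(String uri, String local, String qName) {
                if (qName.equals("label")) { headers.add(text.toString()); inLabel = false; }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(template)), handler);
        System.out.println(headers);   // [Course Code, Instructor]
    }
}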
4.2 Schema Matching by Analyzing ReporTs (SMART) Prototype
We have implemented the SMART schema matcher prototype in Java. There are 46 classes in five different packages; the total size of the Java classes is 500 KB (approximately 10,000 lines). We also wrote a Perl program to find similarity scores between word pairs by using the WordNet similarity library4 [73]. To assess similarity scores between texts, we first eliminate stop words (e.g., a, and, but, to, by) and convert plural words to singular form.5 We convert plural words to singular form because the WordNet similarity functions return similarity scores between singular words. The SMART prototype also uses the Simple API for XML (SAX) library to parse XML files and the OWL API to read OWL report ontology instances into internal Java structures. The COMA++ framework enables external matchers to be included through an interface, and we have integrated our SMART matcher into the COMA++ framework as an external matcher.
4 WordNet Semantic Similarity Library: http://search.cpan.org/dist/WordNet-Similarity/
5 We are using the PlingStemmer library written by Fabian M. Suchanek: http://www.mpi-inf.mpg.de/~suchanek/
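The preprocessing described above (stop-word removal and plural-to-singular conversion before the WordNet lookup) can be sketched as follows; the singularize method is only a naive placeholder for the PlingStemmer library that the prototype actually uses.

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Sketch of the text preprocessing performed before WordNet similarity lookups:
 *  stop-word removal and plural-to-singular conversion. The singularize() method is
 *  a naive placeholder for the PlingStemmer library used by the prototype. */
class TextPreprocessingSketch {

    static final Set<String> STOP_WORDS = Set.of("a", "and", "but", "to", "by", "of", "the");

    /** Naive pluralization rule; PlingStemmer handles irregular forms correctly. */
    static String singularize(String word) {
        if (word.endsWith("ies")) return word.substring(0, word.length() - 3) + "y";
        if (word.endsWith("s") && !word.endsWith("ss")) return word.substring(0, word.length() - 1);
        return word;
    }

    static List<String> preprocess(String text) {
        List<String> result = new ArrayList<>();
        for (String token : text.toLowerCase().split("\\W+"))
            if (!token.isEmpty() && !STOP_WORDS.contains(token))
                result.add(singularize(token));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(preprocess("Names of the Instructors and Courses"));
        // [name, instructor, course]
    }
}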
CHAPTER 5
TEST HARNESS FOR THE ASSESSMENT OF LEGACY INFORMATION INTEGRATION APPROACHES (THALIA)
Information integration refers to the unification of related, heterogeneous data from disparate sources, for example, to enable collaboration across domains and enterprises. Information integration has been an active area of research since the early 1980s and has produced a plethora of techniques and approaches for integrating heterogeneous information. Determining the quality and applicability of an information integration technique has been a challenging task because of the lack of available test data of sufficient richness and volume to allow meaningful and fair evaluations. Researchers generally use their own test data and evaluation techniques, which are tailored to the strengths of the approach and often hide any existing weaknesses.
5.1 THALIA Website and Downloadable Test Package
While working on this research, we saw the need for a testbed and benchmark providing test data of sufficient richness and volume to allow meaningful and fair evaluations of information integration approaches. To answer this need, we developed the THALIA1 (Test Harness for the Assessment of Legacy information Integration Approaches) benchmark. We show a snapshot of the THALIA website in Figure 5-1. THALIA provides researchers with a collection of over 40 downloadable data sources representing university course catalogs, a set of twelve benchmark queries, and a scoring function for ranking the performance of an integration system [47, 48]. The THALIA website also hosts cached web pages of university course catalogs. The downloadable packages contain data extracted from these websites. Figure 5-2 shows an example cached course catalog of Boston University hosted on the THALIA website. On the THALIA website, we also provide the ability to navigate between the extracted data and the corresponding schema files that are in the downloadable packages. Figure 5-3 shows the XML representation of Boston University's course catalog and the corresponding schema file. The downloadable university course catalogs are represented using well-formed and valid XML according to the schema extracted for each course catalog. Extraction and translation from the original representation was done using a source-specific wrapper that preserves the structural and semantic heterogeneities that exist among the different course catalogs.

Figure 5-1. Snapshot of Test Harness for the Assessment of Legacy information Integration Approaches (THALIA) website.

1 URL of the THALIA website: http://www.cise.ufl.edu/project/thalia.html
Figure 5-2. Snapshot of the computer science course catalog of Boston University.
5.2 DataExtractor (HTMLtoXML) Opensource Package
To extract the source data provided in the THALIA benchmark, we enhanced and used the Telegraph Screen Scraper (TESS)2 source wrapper developed at UC Berkeley. The enhanced version of TESS, DataExtractor (HTMLtoXML), can be obtained from the SourceForge website3 along with the 46 examples used to extract the data provided in THALIA. The DataExtractor (HTMLtoXML) tool provides added functionality over the TESS wrapper, including the capability of extracting data from nested structures. It extracts data from an HTML page according to a configuration file and puts the data into an XML file according to a specified structure.
2 TESS: http://telegraph.cs.berkeley.edu/tess/
3 URL of DataExtractor (HTMLtoXML) is http://sourceforge.net/projects/dataextractor
Figure 5-3. Extensible Markup Language (XML) representation of Boston University's course catalog and corresponding schema file.
5.3 Classification of Heterogeneities
Our benchmark focuses on syntactic and semantic heterogeneities since we believe they pose the greatest technical challenges to the research community. We have chosen course information as our domain of discourse because it is well known and easy to understand. Furthermore, there is an abundance of publicly available data sources, which allowed us to develop a testbed exhibiting all of the syntactic and semantic heterogeneities that we have identified in our classification. We list our classification of heterogeneities below. We have also listed these classifications in [48] and in the downloadable package on the THALIA website, along with examples from the THALIA benchmark and corresponding queries.
1. Synonyms: Attributes with different names that convey the same meaning. For example, 'instructor' vs. 'lecturer'.
2. Simple Mapping: Related attributes in different schemas differ by a mathematical transformation of their values. For example, time values using a 24-hour vs. a 12-hour clock.
3. Union Types: Attributes in different schemas use different data types to represent the same information. For example, a course description as a single string vs. a complex data type composed of strings and links (URLs) to external data.
4. Complex Mappings: Related attributes differ by a complex transformation of their values. The transformation may not always be computable from first principles. For example, the attribute 'Units' represents the number of lectures per week vs. a textual description of the expected workload in the field 'credits'.
5. Language Expression: Names or values of identical attributes are expressed in different languages. For example, the English term 'database' is called 'Datenbank' in German.
6. Nulls: The attribute (value) does not exist. For example, some courses do not have a textbook field, or the value for the textbook field is empty.
7. Virtual Columns: Information that is explicitly provided in one schema is only implicitly available in the other and must be inferred from one or more values. For example, course prerequisites are provided as an attribute in one schema but exist only in comment form as part of a different attribute in another schema.
8. Semantic incompatibility: A real-world concept that is modeled by an attribute does not exist in the other schema. For example, the concept of student classification ('freshman', 'sophomore', etc.) at American universities does not exist at German universities.
9. Same attribute in different structure: The same or related attribute may be located in different positions in different schemas. For example, the attribute Room is an attribute of Course in one schema, while it is an attribute of Section, which in turn is an attribute of Course, in another schema.
10. Handling sets: A set of values is represented using a single, set-valued attribute in one schema vs. a collection of single-valued attributes organized in a hierarchy in another schema. For example, a course with multiple instructors can have a single attribute instructors or multiple section-instructor attribute pairs.
11. Attribute name does not define semantics: The name of the attribute does not adequately describe the meaning of the value that is stored there.
12. Attribute composition: The same information can be represented either by a single attribute (e.g., as a composite value) or by a set of attributes, possibly organized in a hierarchical manner.
5.4 Web Interface to Upload and Compare Scores
The THALIA website offers a web interface for researchers to upload their results for each heterogeneity listed above. The web interface accepts data on several aspects of the solution, such as the size of the specification, the number of mouse clicks, and the size of the program code, to evaluate the effort the approach spends to resolve the heterogeneity. The uploaded scores can be viewed by anybody visiting the THALIA benchmark website, which helps researchers compare their approaches with others. Figure 5-4 shows the scores uploaded to the THALIA benchmark for the Integration Wizard (IWiz) project at the University of Florida.
Figure 5-4. Scores uploaded to Test Harness for the Assessment of Legacy information Integration Approaches (THALIA) benchmark for Integration Wizard (IWiz) Project at the University of Florida.
While THALIA is not the only data integration benchmark,4 what distinguishes THALIA is the fact that it combines rich test data with a set of benchmark queries and an associated scoring function to enable the objective evaluation and comparison of integration systems.
5.5 Usage of THALIA
We believe that THALIA not only simplifies the evaluation of existing integration technologies but also helps researchers improve the accuracy and quality of future approaches by enabling more thorough and more focused testing. We have used THALIA test data for the evaluation of our SM approach, as described in Section 6.1.1. We are also pleased to see it being used as a source of test data and as a benchmark by researchers [11, 74, 100] and graduate courses.5
4 A list of Data Integration Benchmarks and Test Suites can be found at http://mars.csci.unt.edu/dbgroup/benchmarks.html
5 URL of the graduate course at the University of Toronto using THALIA: http://www.cs.toronto.edu/~miller/cs2525/
CHAPTER 6
EVALUATION
We evaluate our approach using the prototype described in Chapter 4. In the following sections, we first describe our test data sets and our experiments. We then compare our results with those of other techniques and present a discussion of the results.
6.1 Test Data
The test data sets have two main components: the schema of the data source and reports presenting data from the data source. We used two test data sets. The first test data set is from the THALIA data integration testbed. This data set has 10 schemas. Each schema in the THALIA test data set has one report, and the report covers all schema elements of the corresponding schema. The second test data set is from the University of Florida Registrar's Office. This data set has three schemas. Each schema in the UF registrar test data set has 10 reports, and the reports do not cover all schema elements of the corresponding schema. The first test data set, from THALIA, is used to see how the SMART approach performs when the entire schema is covered by reports; the second test data set, from UF, is used to see how the SMART approach performs when the entire schema is not covered by reports. The test data set from UF also enables us to see the effect of having multiple reports for one schema. In the following subsections, we give detailed descriptions of the schemas and reports of these test data sets.
6.1.1 Test Data Set from THALIA testbed
The first test data set is from the THALIA testbed [48]. THALIA offers 44+ different university course catalogs from computer science departments worldwide. Each catalog page is represented in HTML. THALIA also offers the data and schema of each catalog page. We explained the details of the THALIA testbed in Chapter 5. For the scope of this evaluation, we treat each catalog page (in HTML) as a sample report from the corresponding university. We selected 10 university catalogs (reports) from THALIA that represent different report design practices.
We give two examples of these report design practices in Figures 6-1 and 6-2: Figure 6-1 shows the course scheduling report of Boston University, and Figure 6-2 shows the course scheduling report of Michigan State University.
Figure 6-1. Report design practice where all the descriptive texts are headers of the data.
Figure 6-2. Report design practice where all the descriptive texts are on the left hand side of the data.
The sizes of the schemas in the THALIA test data set vary between 5 and 13, as listed in Table 6-1. We stored the data and schemas for each selected university in a MySQL 4.1 database. When we pair the 10 schemas, we have 45 different pairs of schemas to match. These 45 schema pairs have 2576 possible combinations of schema elements. We manually determined that 215 of these possible combinations are real matches, and we use these manual mappings to evaluate our results.
Table 6-1. The 10 university catalogs selected for evaluation and the sizes of their schemas.
University Name                            # of Schema Elements
University of Arizona                      5
Brown University                           7
Boston University                          7
California Institute of Technology         5
Carnegie Mellon University                 9
Florida State University                   13
Michigan State University                  8
New York University                        7
University of Massachusetts Boston         8
University of New South Wales, Sydney      7

We recreated each report (catalog page) from THALIA by using two methods: one method uses Java Servlets and the other uses the Eclipse Business Intelligence and Reporting Tool (BIRT).1 The Java Servlet applications corresponding to a course catalog fetch the relevant data from the repository and produce the HTML report. Report templates designed with the BIRT tool also fetch the relevant data from the repository and produce the HTML report. When the SMART prototype is run, it analyzes the Java Servlet code and the report templates to extract semantic information.
6.1.2 Test Data Set from University of Florida
The second test data set contains student registry information from the University of Florida. We contacted several offices at the University of Florida to obtain test data sets.2 We first contacted the College of Engineering. After several meetings and discussions, the College of Engineering agreed to give us the schemas and the report design templates without any data. In fact, we were not after the data, because our approach works without needing the data. The College of Engineering forms and uses the data set that we obtained after several months as follows.3 The College of Engineering runs a batch program on the first day of every week and downloads data from the legacy DB2 database of the Registrar's Office. The DB2 database of the Registrar's Office is a hierarchical database. The College of Engineering stores the data in relational MS Access databases. It extracts a subset of the Registrar's Office database and uses the same attribute and table names in the MS Access database as in the Registrar's Office database. The College of Engineering then creates subsets of this MS Access database and runs its reports on them. Figure 6-3 shows a conceptual view of the architecture of the databases in the College of Engineering.

1 Eclipse BIRT: http://www.eclipse.org/birt/
2 I would like to thank Dr. Joachim Hammer for his extensive efforts in reaching out to several departments and organizing meetings with staff to gather the test data sets.
Figure 6-3. Architecture of the databases in the College of Engineering.
We also contacted the UF Bridges Office. Bridges is a project to replace the university's business computer systems (called legacy systems) with new web-based, integrated systems that provide real-time information and improve university business processes.4 The UF Bridges project also redesigned the legacy DB2 database of the Registrar's Office for MS SQL Server. We obtained the schemas but, again, could not access the associated data because of privacy issues.5 Finally, we reached the Business School.6 The Business School stores its data in MS SQL Server databases. Its schema is based on the Bridges Office schema; however, it uses different naming conventions and adds new structures into the schemas when needed.

3 I would like to thank James Ogles from the College of Engineering for his time in preparing the test data and for answering our questions regarding the data set.

Table 6-2. Portion of a table description from the College of Engineering, the Bridges Office, and the Business School schemas.
The College of Eng.       The Bridges Office               The Business School
trans2                    PS UF CREC COURSE                t CREC
UUID VARCHAR(9)           UF UUID VARCHAR(9)               UFID varchar(9)
CNum VARCHAR(4)           UF AUTO INDEX INTEGER            Term varchar(6)
Sect VARCHAR(4)           UF TERM CD VARCHAR(5)            CourseType varchar(1)
CT CHAR                   UF TYPE DESC VARCHAR(40)         Section varchar(4)
The schemas from the College of Engineering, the Bridges Office, and the Business School are semantically related; however, they exhibit different syntactic features. The naming conventions and the sizes of the schemas are different. The College of Engineering uses the same names for schema elements as in the Registrar's database. These schema element names often contain abbreviations that are mostly impossible to guess. The Bridges Office uses a more descriptive naming convention for schema elements. The schema elements (i.e., column names) in the schema of the Business School have the most descriptive names. However, the table names in the schema of the Business School use non-descriptive names similar to those in the Registrar's database.

4 http://www.bridges.ufl.edu/about/overview.html
5 I would like to acknowledge the help of Mr. Warren Curry from the Bridges Office in obtaining the schemas.
6 I would like to acknowledge the help of Mr. John C. Holmes from the Business School in obtaining the schemas.
To give an example of the different naming conventions in the schemas, we present a portion of a table description from the College of Engineering, the Bridges Office, and the Business School schemas in Table 6-2. The College of Engineering, the Bridges Office, and the Business School schemas have 135, 175, and 114 attributes, respectively, in six tables each. We present the table names in these three schemas and the number of schema elements in each table in Table 6-3. In a data integration scenario, we pair the schemas that are to be integrated. When we pair the three schemas from the College of Engineering (COE), the Bridges Office (BO), and the Business School (BS), shown in Table 6-3, we have three different schema pairs to match: (COE-BO), (COE-BS), and (BO-BS). We manually determined that the (COE-BO) pair has 88, the (COE-BS) pair has 91, and the (BO-BS) pair has 110 mappings, and we use these manual mappings to evaluate the results of the SMART (Schema Matching by Analyzing ReporTs) and COMA++ (COmbination of MAtching algorithms) approaches, as described in Section 6.3.2. We recreated the corresponding reports by using the Eclipse BIRT tool; we have 10 reports for each schema.

Table 6-3. Names of tables in the College of Engineering, the Bridges Office, and the Business School schemas and number of schema elements that each table has.
The College of Engineering     The Bridges Office            The Business School
colleges, 8                    ps uf colleges, 12            t coll, 5
deptx1, 14                     ps uf departments, 23         t dept, 8
fice, 32                       ps uf fice, 33                t fice, 30
honors, 56                     ps uf honors, 56              t honors, 40
majors, 10                     ps uf majors, 17              t majo, 9
trans2, 15                     ps uf crec course, 34         t crec, 22
total: 135                     total: 175                    total: 114

6.2 Determining Weights
We explained the formulas to compute similarity scores between concepts of two ontologies and showed how we apply these formulas to compute similarity scores between data element concepts of two report ontology instances in Section 3.2.4. In this section, we
show how we determine the weights in these formulas. Selecting the weights of the similarity function correctly is important: the weights directly affect the similarity scores and hence the results and accuracy of the SMART approach. The formula for computing the similarity score between two data elements is shown below; our goal is to determine the weights w_1 through w_8.
\[
\begin{aligned}
sim_{DataElement}(DataElement_1, DataElement_2) = {} & w_1 \cdot SimFunc(Attribute_1, Attribute_2) \\
& + w_2 \cdot SimFunc(Table_1, Table_2) \\
& + w_3 \cdot SimFunc(Type_1, Type_2) \\
& + w_4 \cdot SimFunc(Description_1, Description_2) \\
& + w_5 \cdot SimFunc(Query_1, Query_2) \\
& + w_6 \cdot SimFunc(Title_1, Title_2) \\
& + w_7 \cdot SimFunc(HeaderNote_1, HeaderNote_2) \\
& + w_8 \cdot SimFunc(FooterNote_1, FooterNote_2) \qquad \text{(6-1)}
\end{aligned}
\]
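To make the role of the weights concrete, the following is a minimal Java sketch of the weighted combination in Equation 6-1. It is an illustration, not the SMART prototype's actual code: the property map, the SimFunc callback, and the sample values are assumptions, and the weights shown are the JCN weights reported later in Table 6-4.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiFunction;

// Minimal sketch of the weighted similarity in Equation 6-1.
// Property names and the SimFunc callback are illustrative assumptions,
// not the actual SMART prototype API.
public class DataElementSimilarity {

    /** One data element concept, reduced to its textual properties. */
    public record DataElement(Map<String, String> properties) {}

    /**
     * Computes sim(DataElement1, DataElement2) as the weighted sum of
     * per-property similarity scores.
     *
     * @param weights one weight per property (w1..w8)
     * @param simFunc similarity measure between two texts (e.g., JCN, Lin, Levenstein)
     */
    public static double similarity(DataElement a, DataElement b,
                                    Map<String, Double> weights,
                                    BiFunction<String, String, Double> simFunc) {
        double score = 0.0;
        for (Map.Entry<String, Double> w : weights.entrySet()) {
            String property = w.getKey();
            score += w.getValue() * simFunc.apply(
                    a.properties().getOrDefault(property, ""),
                    b.properties().getOrDefault(property, ""));
        }
        return score;
    }

    public static void main(String[] args) {
        Map<String, Double> weights = new LinkedHashMap<>();
        weights.put("Attribute", 0.30);  weights.put("Table", -0.05);
        weights.put("Type", 0.01);       weights.put("Description", 0.79);
        weights.put("Query", -0.06);     weights.put("Title", 0.005);
        weights.put("HeaderNote", 0.05); weights.put("FooterNote", 0.09);

        DataElement e1 = new DataElement(Map.of("Attribute", "instructor",
                "Description", "Name of the course instructor"));
        DataElement e2 = new DataElement(Map.of("Attribute", "lecturer",
                "Description", "Lecturer of the course"));

        // Placeholder measure; SMART would plug in JCN, Lin, or Levenstein here.
        // Empty (missing) properties contribute a similarity of 0.
        BiFunction<String, String, Double> dummySim =
                (s, t) -> (s.isEmpty() || t.isEmpty()) ? 0.0
                        : (s.equalsIgnoreCase(t) ? 1.0 : 0.5);
        System.out.println(similarity(e1, e2, weights, dummySim));
    }
}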
We use the multiple linear regression method to determine the weights of the similarity function. In multiple linear regression [3], some of the variables are treated as explanatory variables and the remaining ones as dependent variables. In our problem, the explanatory variables are the similarity scores of the properties of the concepts; for the data element formula above, these are the similarity scores of the Attribute, Table, Type, Description, Query, Title, HeaderNote, and FooterNote properties. The dependent variable is the overall similarity of the two data element concepts. Linear regression models the relationship between the explanatory variables and the dependent variable by fitting a linear equation to observed data, i.e., a set of sample vectors for the explanatory variables together with the desired value of the dependent variable corresponding to the sample vectors.
Our prototype computes the sample vectors for the explanatory variables using the SimFunc functions that compute similarity scores between the properties of concepts. We manually enter the desired value of the dependent variable (i.e., the similarity score of the two data element concepts) corresponding to each sample vector. Let us denote the explanatory variables as a column vector (the feature vector):
\[
\mathbf{x}(n) = [x_1(n), x_2(n), \ldots, x_N(n)]^T \qquad \text{(6-2)}
\]
where $T$ denotes the transpose operator and $n$ is the index of the sample data. The observed data also contains a dependent (desired) variable
\[
\mathbf{d}(n) = [d_1(n), d_2(n), \ldots, d_L(n)]^T \qquad \text{(6-3)}
\]
corresponding to the feature vector $\mathbf{x}(n)$. Note that $L = 1$ for our problem. We can combine the feature vectors into an $N \times P$ matrix
\[
\mathbf{x} = [\mathbf{x}(1), \mathbf{x}(2), \ldots, \mathbf{x}(P)] \qquad \text{(6-4)}
\]
where $P$ is the number of data points in our observed data. Similarly, the desired data can be combined into an $L \times P$ matrix
\[
\mathbf{d} = [\mathbf{d}(1), \mathbf{d}(2), \ldots, \mathbf{d}(P)] \qquad \text{(6-5)}
\]
In linear regression, the goal is to model $d(n)$ as a linear function of $\mathbf{x}(n)$, i.e.,
\[
d(n) = \mathbf{w}^T \mathbf{x}(n) \qquad \text{(6-6)}
\]
where
\[
\mathbf{w} = [w_1, w_2, \ldots, w_N]^T \qquad \text{(6-7)}
\]
is called the weight matrix. The most common approach for finding $\mathbf{w}$ is the method of least squares, which calculates the optimal $\mathbf{w}$ for the observed data by minimizing a cost function, the mean of the squared errors (MSE):
\[
MSE = \sum_{n=1}^{P} \left( d(n) - \mathbf{w}^T \mathbf{x}(n) \right)^2 \qquad \text{(6-8)}
\]
Using the MSE, the weight matrix $\mathbf{w}$ can be found analytically or in an iterative fashion. To find the analytical solution, we take the derivative of the MSE with respect to $\mathbf{w}$ and set it to zero. The resulting equation for the optimal value of $\mathbf{w}$, denoted by $\bar{\mathbf{w}}$, is
\[
\bar{\mathbf{w}} = \left( \frac{1}{P} \sum_{n=1}^{P} \mathbf{x}(n)\,\mathbf{x}(n)^T \right)^{-1} \left( \frac{1}{P} \sum_{n=1}^{P} \mathbf{x}(n)\, d(n) \right) \qquad \text{(6-9)}
\]
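As a concrete illustration of Equation 6-9, the sketch below accumulates the two averaged sums and solves the resulting linear system with Gaussian elimination. It is a simplified stand-in for the computation, not the prototype's implementation; it assumes the training pairs are already available as plain arrays and that the accumulated matrix is non-singular.

// Minimal sketch of the analytical least-squares solution in Equation 6-9.
public class LeastSquaresWeights {

    /**
     * @param x feature vectors, x[n][i] = i-th explanatory similarity of sample n (P x N)
     * @param d desired overall similarity score for each sample (length P)
     * @return the weight vector w_1..w_N
     */
    public static double[] solve(double[][] x, double[] d) {
        int p = x.length, n = x[0].length;
        double[][] a = new double[n][n];   // (1/P) * sum of x(n) x(n)^T
        double[] b = new double[n];        // (1/P) * sum of x(n) d(n)
        for (int s = 0; s < p; s++) {
            for (int i = 0; i < n; i++) {
                b[i] += x[s][i] * d[s] / p;
                for (int j = 0; j < n; j++) {
                    a[i][j] += x[s][i] * x[s][j] / p;
                }
            }
        }
        return gaussianSolve(a, b);
    }

    /** Solves a w = b by Gaussian elimination with partial pivoting (assumes a is non-singular). */
    private static double[] gaussianSolve(double[][] a, double[] b) {
        int n = b.length;
        for (int col = 0; col < n; col++) {
            int pivot = col;
            for (int row = col + 1; row < n; row++) {
                if (Math.abs(a[row][col]) > Math.abs(a[pivot][col])) pivot = row;
            }
            double[] tmpRow = a[col]; a[col] = a[pivot]; a[pivot] = tmpRow;
            double tmpVal = b[col];   b[col] = b[pivot]; b[pivot] = tmpVal;
            for (int row = col + 1; row < n; row++) {
                double factor = a[row][col] / a[col][col];
                b[row] -= factor * b[col];
                for (int k = col; k < n; k++) a[row][k] -= factor * a[col][k];
            }
        }
        double[] w = new double[n];
        for (int row = n - 1; row >= 0; row--) {
            double sum = b[row];
            for (int k = row + 1; k < n; k++) sum -= a[row][k] * w[k];
            w[row] = sum / a[row][row];
        }
        return w;
    }
}

In practice, a numerically robust solver or a library routine would be preferable when the property similarities are strongly correlated.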
We determined the weights using our test data from the THALIA test bed. The THALIA test data has 10 schemas and one report per schema. We extracted and created an instance of the report ontology for each report; each report ontology instance has 5 to 9 data element concepts. In total, we have 2576 data element concept pairs across 45 report instance combinations, and we use 1500 of these pairs as the training data set for determining the weights of the similarity function. The eight weights found are shown in Table 6-4. We ran the weight-determination experiments with three different similarity measures for computing similarity scores between texts: JCN [52], Lin [60], and Levenstein [20].
Table 6-4. Weights found by the analytical method for different similarity functions with THALIA test data.
SimFunc     Attribute  Table  Type  Description  Query  Title  Header  Footer
JCN         0.30       -0.05  0.01  0.79         -0.06  0.005  0.05    0.09
LIN         0.32       -0.08  0.03  0.60         0.003  -0.03  -0.15   0.08
Levenstein  0.32       -0.13  0.02  0.72         0.01   -0.03  -0.04   0.14
Alternatively, $\bar{\mathbf{w}}$ can be found iteratively using the update equation
\[
\mathbf{w}(n+1) = \mathbf{w}(n) + \eta\, e(n)\, \mathbf{x}(n) \qquad \text{(6-10)}
\]
where $\eta$ is the step size and $e(n) = d(n) - y(n)$ is the error value.
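A minimal sketch of the iterative update in Equation 6-10 is shown below; the epoch loop, the zero initialization, and the choice of step size are illustrative assumptions.

// Iterative weight update of Equation 6-10; the prediction is y(n) = w^T x(n)
// and the error is e(n) = d(n) - y(n), as defined in the text above.
public class IterativeWeights {

    public static double[] train(double[][] x, double[] d, double stepSize, int epochs) {
        int n = x[0].length;
        double[] w = new double[n];              // start from w = 0
        for (int epoch = 0; epoch < epochs; epoch++) {
            for (int s = 0; s < x.length; s++) {
                double y = 0.0;                  // y(n) = w^T x(n)
                for (int i = 0; i < n; i++) y += w[i] * x[s][i];
                double e = d[s] - y;             // e(n) = d(n) - y(n)
                for (int i = 0; i < n; i++) {
                    w[i] += stepSize * e * x[s][i];   // w(n+1) = w(n) + eta * e(n) * x(n)
                }
            }
        }
        return w;
    }
}

For a suitably small step size, the iterative estimate approaches the analytical solution of Equation 6-9 on the same training pairs.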
6.3 Experimental Evaluation
In the following subsections, we explain the results gathered by running the SMART prototype on the test data sets described in Section 6.1. We use the f-measure metric and Receiver Operating Characteristic (ROC) curves to evaluate the accuracy of our results; both are described below. The f-measure has been the most widely used metric for evaluating schema matching approaches [26]. It is the harmonic mean of Precision (P) and Recall (R): Precision is the percentage of correct results among all returned results, and Recall is the percentage of correct results among all true results. Table 6-5 shows the confusion matrix; each column represents the instances of a predicted class, while each row represents the instances of an actual class. According to Table 6-5, Precision is $P = TP/(TP+FP)$ and Recall is $R = TP/(TP+FN)$. We do not use P or R alone because neither can accurately assess the match quality by itself [25]: R can easily be maximized at the expense of a poor P by returning as many correspondences as possible (for example, the cross product of the two input schemas), while a high P can be achieved at the expense of a poor R by returning only a few, but correct, correspondences. We therefore calculate the F-Measure, giving Recall and Precision equal weight:
\[
F\text{-}Measure = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall} \qquad \text{(6-11)}
\]
Table 6-5. Confusion matrix.
                    Predicted Positive     Predicted Negative
Positive Examples   True Positives (TP)    False Negatives (FN)
Negative Examples   False Positives (FP)   True Negatives (TN)
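The sketch below collects the four counts of Table 6-5 and derives the metrics used in this chapter (Precision, Recall, the F-Measure of Equation 6-11, and the true/false positive rates needed for the ROC curves discussed next); it is illustrative evaluation code, not the prototype's.

// Confusion-matrix counts and the derived metrics; guards avoid division by zero.
public record ConfusionMatrix(int tp, int fp, int fn, int tn) {

    public double precision() { return (tp + fp) == 0 ? 0.0 : tp / (double) (tp + fp); }

    public double recall()    { return (tp + fn) == 0 ? 0.0 : tp / (double) (tp + fn); } // also the true positive rate

    public double fMeasure() {
        double p = precision(), r = recall();
        return (p + r) == 0 ? 0.0 : 2 * p * r / (p + r);   // Equation 6-11
    }

    public double falsePositiveRate() { return (fp + tn) == 0 ? 0.0 : fp / (double) (fp + tn); }
}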
Receiver Operating Characteristic (ROC) analysis originated in signal detection theory and has been used widely in medical data analysis to study the effect of varying the threshold on the numerical outcome of a diagnostic test; it was introduced to machine learning and data mining relatively recently. ROC analysis shows the performance of a classifier as a trade-off between the detection rate and the false alarm rate. To analyze this trade-off, a ROC curve is plotted: a graphical plot of the true positive (a.k.a. hit or detection) rate versus the false positive (a.k.a. false alarm) rate as the threshold parameter of a binary classifier is varied. According to Table 6-5, the true positive rate is $TPR = TP/(TP+FN)$ and the false positive rate is $FPR = FP/(FP+TN)$. The ROC curve always passes through the points (0,0) and (1,1). At (0,0) the classifier raises no alarms: it gets every negative case right but every positive case wrong. At (1,1) everything is classified as positive: the classifier gets every positive case right but every negative case wrong. The best possible prediction method would yield a point in the upper left corner (0,1), where all true positives are found and no false positives are produced. The closer the curve follows the line from (0,0) to (0,1) and then from (0,1) to (1,1), the more accurate the classifier. The area under the ROC curve is a convenient way of comparing classifiers: a random classifier has an area of 0.5, while an ideal one has an area of 1, and the larger the area, the better the performance. However, the area under the curve can be misleading; at a chosen threshold, the classifier with the larger area may not be the one with the better performance. The best place to operate the classifier (the best threshold) is the point on its ROC curve that lies on a 45 degree line closest to the upper left corner (0,1) of the ROC plot.7 We run our experiments with different similarity measures (e.g., Lin, JCN, and lexical measures) and then compare them with the results of the COMA++ (COmbination of MAtching algorithms) [7] schema matcher framework.
7 We assume that the costs of detection and false alarm are equal.
Figure 6-4. Results of the SMART with Jiang-Conrath (JCN), Lin and Levenstein metrics.
We chose to compare our results with those of COMA++ because COMA++ performed best in the experiments evaluating existing schema matching approaches [26, 99]. In addition, it provides a downloadable prototype, which enables us to create reproducible results, and it supports combining different schema matching algorithms. We used the All Context and Filtered Context combined matchers in the COMA++ framework; both are combinations of five matchers: name, path, leaves, parents, and siblings.
6.3.1 Running Experiments with THALIA Data
To evaluate the SMART approach, we use data sources and cached HTML pages from the THALIA data integration benchmark [48]. THALIA offers 44+ different university course catalogs from computer science departments worldwide; the catalogs and their schemas can be downloaded from the THALIA web site.8 For the scope of this evaluation, we consider each catalog page (in HTML) to be a sample report from the corresponding university. We selected 10 university catalogs (reports) from THALIA that represent different report design practices and paired their reports, resulting in 45 different pairs of reports to match.
8 http://www.cise.ufl.edu/project/thalia.html
Figure 6-5. Results of COmbination of MAtching algorithms (COMA++) with All Context and Filtered Context combined matchers and comparison of SMART and COMA++ results.
We tacitly assume that course information is stored in a database and that each report is produced by the Eclipse BIRT tool, which fetches the relevant data from the repository and produces the HTML report. The data presented on the reports is stored in a MySQL 4.1 database. The SMART prototype, written in Java, extracts information from the report design templates and stores the extracted information in instances of the report ontology. It then computes similarity scores using the weights described in Section 6.2. The prototype uses three different similarity measures to find similarity scores between texts: the JCN and Lin semantic measures and the Levenstein edit-distance measure. We use the semantic similarity measures to compute similarity scores between descriptive texts such as column headers and report headers and footers. Figure 6-4 shows the precision, recall, and f-measure results of SMART with the JCN, Lin, and Levenstein measures as the threshold changes. Note that the JCN and Lin semantic similarity measures perform better than the Levenstein lexical similarity measure.
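The precision/recall/f-measure curves in Figures 6-4 and 6-5 are obtained by sweeping a threshold over the pairwise similarity scores. A minimal sketch of that evaluation loop follows; the candidate-pair and gold-mapping representations are illustrative assumptions.

import java.util.List;
import java.util.Set;

// Sketch of the threshold sweep behind the precision/recall/f-measure curves.
public class ThresholdSweep {

    /** A candidate correspondence between two schema elements and its similarity score. */
    public record ScoredPair(String source, String target, double score) {}

    public static void sweep(List<ScoredPair> candidates, Set<String> goldMappings) {
        for (int step = 0; step <= 20; step++) {
            double threshold = step / 20.0;
            int tp = 0, fp = 0;
            for (ScoredPair pair : candidates) {
                if (pair.score() >= threshold) {
                    // A predicted correspondence is correct if it appears in the manual gold mappings.
                    if (goldMappings.contains(pair.source() + "=" + pair.target())) tp++;
                    else fp++;
                }
            }
            int fn = goldMappings.size() - tp;
            double precision = (tp + fp) == 0 ? 0.0 : tp / (double) (tp + fp);
            double recall = (tp + fn) == 0 ? 0.0 : tp / (double) (tp + fn);
            double f = (precision + recall) == 0 ? 0.0
                    : 2 * precision * recall / (precision + recall);
            System.out.printf("threshold=%.2f precision=%.2f recall=%.2f f-measure=%.2f%n",
                    threshold, precision, recall, f);
        }
    }
}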
Figure 6-6. Receiver Operating Characteristics (ROC) curves of SMART and COMA++ for THALIA test data.
This was expected, because semantic similarity measures help us identify similarities between words that are semantically close but lexically far apart. In Figure 6-5, we show the COMA++ results for the THALIA test data with the All Context and Filtered Context combined matchers. As in the other figures, the results start with low precision but high recall at lower thresholds; precision increases and recall decreases as the threshold increases. On the right-hand side of Figure 6-5, we present the comparison between the SMART and COMA++ results. SMART performs better at all thresholds with the JCN semantic similarity measure, and the second-best result is also achieved by SMART with the Levenstein (edit-distance) lexical similarity measure. Even the results with the lexical similarity measure are better than the COMA++ results. This is because SMART uses the descriptive texts extracted from reports to compute similarity scores between schema elements, and these texts tend to be lexically closer than the schema element names that COMA++ uses for the same purpose. In Figure 6-6, we show the ROC curves of the SMART and COMA++ approaches for the THALIA test data. As stated before, the closer a curve follows the left-hand border and then the top border of the ROC space, the more accurate the approach; the ROC curves show that the results of SMART are much more accurate than the results of COMA++.
The best threshold at which to run the matchers can be found with the help of the ROC curves: the threshold that generates the point closest to the upper left corner (0,1) of the ROC plot along a 45 degree line is the best threshold.9 For example, the point on the ROC curve of SMART with the Lin similarity measure closest to the upper left corner (0,1) is (0.05, 0.8); SMART produces a 0.05 false alarm rate and a 0.8 detection rate when operated at a threshold of 0.3.10 The corresponding point for COMA++ with the All Context matcher is (0.25, 0.55); COMA++ produces a 0.25 false alarm rate and a 0.55 detection rate when operated at a threshold of 0.4. This shows that SMART and COMA++ achieve their best performance at different thresholds. Since the ROC curve of SMART is always closer to the upper left corner (0,1), SMART performs better than COMA++ at any threshold. This can also be seen in Figure 6-5, where the f-measure results of SMART and COMA++ are presented for different thresholds.
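One way to operationalize this "45 degree line closest to (0,1)" rule is to pick the ROC point that maximizes TPR - FPR (the intercept of the 45 degree line through that point, also known as Youden's J); a small sketch of this selection, with an assumed RocPoint type, is shown below.

import java.util.List;

// Picks the operating point whose 45-degree line is closest to (0,1),
// i.e., the point maximizing TPR - FPR.
public class BestThreshold {

    public record RocPoint(double threshold, double truePositiveRate, double falsePositiveRate) {}

    public static RocPoint pick(List<RocPoint> rocPoints) {
        RocPoint best = rocPoints.get(0);
        for (RocPoint p : rocPoints) {
            double j = p.truePositiveRate() - p.falsePositiveRate();           // Youden's J
            double bestJ = best.truePositiveRate() - best.falsePositiveRate();
            if (j > bestJ) best = p;
        }
        return best;
    }
}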
6.3.2 Running Experiments with UF Data
In this section, we present our experimental results on our second test data set, which has three schemas, each with 10 reports. We obtained the three schemas from the College of Engineering, the Business School, and the Bridges Office; the details of this data set are described in Section 6.1.2. The differences from the first test data set are that the second data set has more reports per schema and that its reports do not cover the entire schema. The schemas from the College of Engineering (COE), the Bridges Office (BO), and the Business School (BS) have 135, 175, and 114 schema elements (i.e., attributes), respectively.
9 The assumption here is that the cost of a false alarm and a detection are equal.
10 The thresholds for different detection/false alarm rate combinations are not seen on the figure.
Figure 6-7. Results of the SMART with different report pair similarity thresholds for UF test data.
We manually determined that the (COE-BO) pair has 88, the (COE-BS) pair has 91, and the (BO-BS) pair has 110 mappings. However, the reports cover only 90% of these mappings, which means SMART can reach at most a 0.9 recall value even if it determines every mapping covered by the reports. Our experiments show that even with this disadvantage, SMART performs better than COMA++. Since we have more than one report per schema, we need to merge the results from the report pair combinations into a final similarity matrix. In Section 3.2.7, we described how we merge the scores when we have more than one score for a schema element pair: in short, we compute the weighted average of the similarity scores between schema element pairs, using the similarity scores between report pairs (computed as described in Section 3.2.4) as the weights. The similarity scores between report pairs are in the range [0,1]. To eliminate unrelated report pairs, we set a threshold on the report pair similarity score and consider only schema element similarity scores that come from report pairs whose similarity score exceeds this threshold.
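A minimal sketch of this merging step is shown below. The record type and the string key for a schema element pair are illustrative assumptions; the logic follows the weighted average described above, with the report pair similarity acting both as the filter and as the weight.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Merges schema-element scores from several report pairs into a final similarity
// matrix by a weighted average, weighting each score by its report-pair similarity
// and skipping report pairs below the threshold.
public class ScoreMerger {

    /** One schema-element score produced by comparing two reports. */
    public record ElementScore(String element1, String element2,
                               double score, double reportPairSimilarity) {}

    public static Map<String, Double> merge(List<ElementScore> scores, double reportPairThreshold) {
        Map<String, Double> weightedSum = new HashMap<>();
        Map<String, Double> weightTotal = new HashMap<>();
        for (ElementScore s : scores) {
            if (s.reportPairSimilarity() < reportPairThreshold) continue;  // unrelated report pair
            String key = s.element1() + "=" + s.element2();
            weightedSum.merge(key, s.reportPairSimilarity() * s.score(), Double::sum);
            weightTotal.merge(key, s.reportPairSimilarity(), Double::sum);
        }
        Map<String, Double> finalMatrix = new HashMap<>();
        weightedSum.forEach((key, sum) -> finalMatrix.put(key, sum / weightTotal.get(key)));
        return finalMatrix;
    }
}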
Figure 6-7 shows the accuracy results of SMART on the UF data set according to the f-measure metric. On the left-hand side of Figure 6-7, we show the results of SMART for the Business-Bridges schema pair when the report pair similarity threshold is set to 0.6, 0.7, and 0.8. The report pairs cover 90% of the actual mappings, so the recall value is always below 0.9. There are 19, 13, and 10 report pairs with a similarity score higher than 0.6, 0.7, and 0.8, respectively. The results are more accurate when the report similarity threshold is set to 0.7 or 0.8, because the report pairs whose similarity score exceeds 0.7 are more similar to each other, which yields more accurate results. In the middle of Figure 6-7, we show the results of SMART for the Business-College of Engineering schema pair at thresholds of 0.6, 0.7, and 0.8. There are 16, 12, and 8 report pairs with a similarity score higher than 0.6, 0.7, and 0.8, respectively. The results are slightly better when the report similarity threshold is set to 0.6 or 0.7, because setting the threshold to 0.8 eliminates some very similar report pairs.11 On the right-hand side of Figure 6-7, we show the results of SMART for the College of Engineering-Bridges schema pair at thresholds of 0.6, 0.7, and 0.8. There are 21, 13, and 8 report pairs with a similarity score higher than 0.6, 0.7, and 0.8, respectively. The results are slightly better when the report similarity threshold is set to 0.7: at 0.6 we include some unrelated report pairs in the computation, which hurts accuracy, while at 0.8 we eliminate some very similar report pairs. The changes in accuracy under different report pair similarity thresholds show that determining this threshold correctly is important for the SMART approach: the chosen report similarity threshold should not eliminate similar report pairs, but should filter out the unrelated ones.
11 The schemas have 10 very similar report pairs.
Figure 6-8. F-Measure results of SMART and COMA++ for UF test data when report pair similarity is set to 0.7.
In Figure 6-8, we compare the performance of SMART with that of COMA++ based on the f-measure metric. The SMART results were obtained with the JCN semantic similarity measure and a report pair similarity threshold of 0.7; the COMA++ results were obtained with the All Context and Filtered Context combined matchers. SMART produces higher f-measure accuracy than COMA++, although the two approaches reach their best results at different thresholds. The COMA++ results are very close to the SMART results for the Business School - Bridges Project schema pair, because this schema pair has very similar naming conventions, as described in Section 6.1.2. When similar naming conventions are used, lexical similarity measures perform better, and the COMA++ matchers are based on lexical similarity measures; hence COMA++ performs better for the Business School - Bridges Project pair than for the other schema pairs. SMART, on the other hand, does not rely on lexical similarity measures alone: it combines lexical and semantic similarity measures, does not depend on the lexical closeness of schema element names, and utilizes the more descriptive texts extracted from reports. Therefore, the results of SMART are not affected by changes in the descriptiveness or lexical closeness of the schema element names.
Figure 6-9. Receiver Operating Characteristics (ROC) curves of the SMART for UF test data.
Note also that SMART and COMA++ reach their best results at different schema element similarity score thresholds: for the UF test data, SMART performs best around a threshold of 0.25 and COMA++ around 0.5. On the left-hand side of Figure 6-9, we present the ROC curves of SMART for the UF Business-Bridges schema pair. Each schema has 10 reports, which yields 100 report pairs, 10 of which are very similar; each report pair has a similarity score in the range [0,1]. There are 31, 19, 13, 10, and 7 report pairs with a report similarity score higher than the 0.5, 0.6, 0.7, 0.8, and 0.9 thresholds, respectively. When the report pair similarity threshold is set to 0.9, some of the very similar report pairs are eliminated, so the performance of SMART at that threshold is low. On the right-hand side of Figure 6-9, we present the ROC curves of SMART for the UF Business-Engineering schema pair. Again, ten of the possible 100 report pairs are very similar, and there are 28, 16, 12, 8, and 4 report pairs with a report similarity score higher than the 0.5, 0.6, 0.7, 0.8, and 0.9 thresholds, respectively. When the report pair similarity threshold is set to 0.8 or 0.9, some of the very similar report pairs are eliminated, and the performance of SMART decreases at these thresholds.
Figure 6-10. Comparison of the ROC curves of the SMART and COMA++ for UF test data.
As we lower the report pair similarity threshold, the performance of SMART increases slightly. However, more report pairs then pass the threshold, which requires extra processing time, and after some point the performance gain from lowering the threshold (and hence increasing the number of report pairs and the amount of computation) becomes negligible. Therefore, we do not consider report pairs with a similarity score below 0.5 in Figure 6-9. In Figure 6-10, we compare the performance of SMART with that of COMA++ based on the ROC curves. For the Business-Bridges schema pair, COMA++ performs better than SMART: as stated before, the naming conventions of these schemas are very close, which helps COMA++, and SMART has the disadvantage that not all mappings are covered by the available reports. For the Business-Engineering schema pair, SMART performs very close to COMA++ even though not all mappings are covered by the reports.
CHAPTER 7
CONCLUSION
Schema matching is a fundamental problem that occurs when information systems share, exchange, or integrate data for the purposes of data warehousing, query processing, message translation, etc. Despite extensive efforts, solutions for schema matching are still mostly manual and depend on significant human input, which makes schema matching a time-consuming and error-prone task. Schema elements are typically matched based on the schema and the data. However, the clues gathered by processing the schema and data are often unreliable, incomplete, and insufficient to determine the relationships among schema elements [28]. Moreover, the mapping depends on the application and may change from one application to another even though the underlying schemas remain the same. Several automatic approaches exist, but their accuracy depends heavily on the descriptiveness of the schemas alone. We have developed a new approach called Schema Matching by Analyzing ReporTs (SMART), which extracts important semantic information about the schemas and their relationships from report-generating application source code and report design templates. Specifically, in SMART we reverse engineer the application source code and report templates associated with the schemas that are to be matched. From the source code and report templates, which use the schema and produce reports or other user-friendly output, we extract semantically rich descriptive texts. We identify the relationships of these descriptive texts with the data presented on the report with the help of a set of heuristics, and we trace the data on the report back to the corresponding schema elements in the data source. We store all the information gathered from a report, including the descriptive texts (e.g., column headers and report title) and the properties of the data presented (e.g., schema element name and data type), in an instance of the report ontology. We compute similarity scores between instances of the report ontology and then convert these inter-ontology matching scores into scores between schema elements.
Our experimental results show that SMART provides more reliable and accurate results than current approaches that rely on the information contained in the schemas and data instances alone. For example, the highest accuracy (based on the f-measure metric) of SMART for our first test data set, in which the reports cover all schema elements, is 0.73, while the highest accuracy of COMA++ (the best schema matching approach according to the evaluations [26]) is 0.5. The results of SMART are also better than or very close to the COMA++ results for our second test data set, in which the reports cover 90% of the mappings: the highest f-measure accuracies of SMART are 0.55, 0.68, and 0.57, while the highest accuracies of COMA++ are 0.5, 0.5, and 0.4 for the respective schema pairs. We also analyzed our results with receiver operating characteristic (ROC) curves and saw that SMART performs better at every threshold for our first test data set and performs very close to COMA++ for the second data set. Our approach shows that valuable semantic information can be extracted from report-generating application source code. Reverse engineering source code to extract semantic information is a very challenging task; to ease the process of semantic extraction, we introduced a methodology and framework that utilizes state-of-the-art tools and design patterns. Our approach also shows that report templates (represented in XML) are a valuable source of semantics from which semantic information can be easily extracted. Moreover, we show how the information extracted from database schemas and report application source code can be stored in ontologies, and we explained in detail how we apply the multiple linear regression method to determine the weights of the different kinds of information to reach the most accurate results. We believe our approach represents an important step towards more accurate and reliable tools for schema matching. Better automatic schema matching saves effort, time, and investment, and the decreased cost of schema matching, and hence of data integration, enables more and more organizations to collaborate.
The synergy gained from effective, rapid, and flexible collaborations among organizations boosts the economy and thus enhances the quality of our daily lives.
7.1 Contributions
Researchers advocate that the gain from capturing even a limited amount of useful semantics can be tremendous [87], and they motivate utilizing any kind of information source to improve our understanding of data. Researchers also point out that application source code encapsulates important semantic information about its application domain and can be used for schema matching for data integration [78]. External information sources such as corpora of schemas and past matches have been used for schema matching, but application source code has not yet been used as an external information source [25, 28, 78]. In this research, we focus on this well-known but not yet addressed challenge of analyzing application source code for the purpose of semantic extraction for schema matching. We present a novel approach to schema matching that utilizes semantically rich texts extracted from application source code and show that it provides better accuracy for automatic schema matching. During the semantic analysis of application source code, we create an instance of the report ontology from each report generated by the application source code and use this ontology instance for schema matching. While (semi-)automatic extraction of ontologies (a.k.a. ontology learning) from text, relational schemata, and knowledge bases is well studied in the literature [23, 37], to the best of our knowledge there has been no study aimed at extracting an ontology from application source code. Another important contribution is the introduction of a generic function for computing similarity scores between concepts of ontologies, together with a description of how we determine the weights of the similarity function. The similarity function, along with the methodology for determining its weights, can be applied in any domain to determine similarities between concepts of ontologies.
Schema matching approaches so far have used lexical similarity functions or look-up tables to determine the similarity scores of two schema elements. There have been suggestions to utilize semantic similarity measures between words [7], but they have not been realized: names of schema elements are mostly abbreviations and concatenations of words, and such names cannot be found in the dictionaries that semantic similarity measures use to compute similarity scores between two words. Therefore, utilizing semantic similarity measures between words was not possible. We extract descriptive texts from reports and relate them to the schema elements, which lets us utilize state-of-the-art semantic similarity measures to determine similarities. By using a semantic similarity measure instead of a lexical similarity measure such as edit distance, we can detect the similarity of words that are lexically far apart but semantically close, such as 'lecturer' and 'instructor', and we can also eliminate words that are lexically close but semantically far apart, such as 'tower' and 'power'. Another important contribution is that integration based on user reports eases the communication between business and Information Technology (IT) specialists, who often have difficulty understanding each other. Business and IT specialists can discuss the data presented on reports rather than database schemas, and a business specialist can request that the data seen on specific reports be integrated or shared. Analyzing reports for data integration and sharing thus helps business and IT specialists communicate better. While conducting this research, we saw that there is a need for available test data of sufficient richness and volume to allow meaningful and fair evaluations of different information integration approaches. To address this need, we developed the THALIA1 (Test Harness for the Assessment of Legacy information Integration Approaches) benchmark, which provides researchers with a collection of over 40 downloadable data sources representing university course catalogs, a set of twelve benchmark queries, and a scoring function for ranking the performance of an integration system [47, 48].
1 THALIA website: http://www.cise.ufl.edu/project/thalia.html
We are happy to see that it is being used as a source of test data and as a benchmark by researchers [11, 74, 100] and in graduate courses2. In the semantic analysis part of our work, we introduce a new extensible and flexible methodology for semantic extraction from application source code. We integrate and utilize state-of-the-art techniques in object-oriented programming and parser generation, and we leverage the research in code reverse engineering and program understanding. One of the main contributions of our semantic analysis methodology is its functional extensibility: our information extraction framework lets researchers add new functionality as they develop new heuristics and algorithms for the source code being analyzed. Our current information extraction technique also provides improved accuracy, as it eliminates unused code fragments (i.e., methods and procedures).
7.2 Future Directions
This research can be continued in the following directions:
Extending the semantic analyzer (SA). The SA can be extended to extract information from web query interfaces, which contain potentially valuable semantic information for matching schema elements; the information gathered from query interfaces can lead to better schema matching results. The SA can also be extended to extract other possibly important information about the data on a report (e.g., format and location), and new heuristics can be added to relate data and descriptive texts on a report.
Enhancing the report ontology. Our report ontology is represented in OWL (Web Ontology Language). We can benefit from the capabilities of OWL to relate data and description elements in the ontology: in OWL, a set of statements can allow us to conclude another OWL statement.
2 The graduate course at the University of Toronto using THALIA is 'Research Topics in Data Management': http://www.cs.toronto.edu/~miller/cs2525/
For example, the statements (motherOf subPropertyOf parentOf) and (Nedret motherOf Oguzhan), when stated in OWL, allow us to conclude (Nedret parentOf Oguzhan) based on the logical definition of subPropertyOf given in the OWL specification. Similarly, we can define an isDescriptionOf relation between the data element concept and the description element concept, so that OWL can conclude the isDescriptionOf relation by looking at the location information of both concepts on a report. Another advantage of using OWL ontologies is the availability of tools such as Racer, FaCT, and Pellet that can reason about them. A reasoner can also help us verify whether we have accurately extracted data and description elements from a report; for instance, we can define a rule such as "no data or description elements can overlap" and have a reasoner check the OWL ontology to make sure this rule is satisfied.
Extending the schema matcher (SMART). We currently evaluate the similarity scores produced by SMART to determine 1-to-1 mappings. We can work on interpreting the results of SMART to determine 1-n and m-n mappings as well.
Continuing research on similarity. Assessing similarity scores between objects is an important research topic. We introduced a generic similarity function to determine similarities between concepts of ontologies, explained how we determine its weights, and applied it to report ontology instances. Real-world objects can be modeled using ontologies, and our similarity function can be used to find similarities between them; for example, it is appropriate for finding semantic similarity scores between two web pages or between two sentences. Current approaches for determining the similarity of two sentences do not consider the position of a word in a sentence or the relations between the words in a sentence. We can model an ontology specifying the relations of the words in a sentence and use our similarity function to assess similarity scores between sentences.
REFERENCES
[1] A. Aamodt, M. Nygard, Different roles and mutual dependencies of data, information, and knowledge: An AI perspective on their integration, Data Knowl. Eng. 16 (3) (1995) 191–222.
[2] P. M. Alexander, Towards reconstructing meaning when text is communicated electronically, Ph.D. thesis, University of Pretoria, South Africa (2002).
[3] M. P. Allen, Understanding Regression Analysis, New York Plenum Press, 1997.
[4] G. Antoniou, F. van Harmelen, Web ontology language: OWL, in: S. Staab, R. Studer (eds.), Handbook on Ontologies, International Handbooks on Information Systems, Springer, 2004, pp. 67–92.
[5] N. Ashish, C. A. Knoblock, Semi-automatic wrapper generation for internet information sources, in: COOPIS '97: Proceedings of the Second IFCIS International Conference on Cooperative Information Systems, IEEE Computer Society, Washington, DC, USA, 1997.
[6] J. A. Aslam, M. Frost, An information-theoretic measure for document similarity, in: SIGIR '03: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, ACM Press, New York, NY, USA, 2003.
[7] D. Aumueller, H.-H. Do, S. Massmann, E. Rahm, Schema and ontology matching with COMA++, in: Proceedings of SIGMOD 2005 (Software Demonstration), Baltimore, 2005.
[8] T.-L. Bach, R. Dieng-Kuntz, Measuring similarity of elements in OWL DL ontologies, in: Context and Ontologies: Theory, Practice and Applications, Pittsburgh, Pennsylvania, USA, 2005.
[9] S. Banerjee, T. Pedersen, An adapted Lesk algorithm for word sense disambiguation using WordNet, in: Proceedings of the Third International Conference on Intelligent Text Processing and Computational Linguistics, Mexico City, 2002.
[10] J. Berlin, A. Motro, Database schema matching using machine learning with feature selection, in: CAiSE '02: Proceedings of the 14th International Conference on Advanced Information Systems Engineering, Springer-Verlag, London, UK, 2002.
[11] A. Bilke, J. Bleiholder, F. Naumann, C. Bohm, K. Draba, M. Weis, Automatic data fusion with HumMer, in: VLDB '05: Proceedings of the 31st international conference on Very large data bases, VLDB Endowment, 2005.
[12] J. Bisbal, D. Lawless, B. Wu, J. Grimson, Legacy information systems: Issues and directions, IEEE Softw. 16 (5) (1999) 103–111.
[13] M. Bravenboer, E. Visser, Guiding visitors: Separating navigation from computation, Tech. Rep. UU-CS-2001-42, Institute of Information and Computing Sciences, Utrecht University, The Netherlands (November 2001).
[14] M. L. Brodie, The promise of distributed computing and the challenges of legacy information systems, in: Proceedings of the IFIP WG 2.6 Database Semantics Conference on Interoperable Database Systems (DS-5), North-Holland, 1993.
[15] A. Budanitsky, G. Hirst, Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures, in: NAACL 2001 WordNet and Other Lexical Resources Workshop, Pittsburgh, 2001.
[16] A. Budanitsky, G. Hirst, Evaluating WordNet-based measures of semantic distance, Computational Linguistics 32 (1) (2006) 13–47.
[17] D. Buttler, L. Liu, C. Pu, A fully automated object extraction system for the world wide web, in: ICDCS, 2001.
[18] P. Checkland, S. Holwell, Information, Systems and Information Systems - making sense of the field, John Wiley and Sons, Inc., Hoboken, NJ, USA, 1998.
[19] E. J. Chikofsky, J. H. Cross II, Reverse engineering and design recovery: A taxonomy, IEEE Softw. 7 (1) (1990) 13–17.
[20] W. W. Cohen, P. Ravikumar, S. E. Fienberg, A comparison of string distance metrics for name-matching tasks, in: S. Kambhampati, C. A. Knoblock (eds.), IIWeb, 2003.
[21] C. Corley, R. Mihalcea, Measuring the semantic similarity of texts, in: Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment, Ann Arbor, Michigan, 2005.
[22] K. H. Davis, P. H. Aiken, Data reverse engineering: A historical survey, in: Working Conference on Reverse Engineering (WCRE), 2000.
[23] Y. Ding, S. Foo, Ontology research and development. Part I - a review of ontology generation, Journal of Information Science 28 (2) (2002) 123–136.
[24] H.-H. Do, E. Rahm, COMA - a system for flexible combination of schema matching approaches, in: Proc. 28th Intl. Conference on Very Large Databases (VLDB), Hong Kong, 2002.
[25] H.-H. Do, Schema matching and mapping-based data integration, Dissertation, Department of Computer Science, Universität Leipzig, Germany (January 2006).
[26] H. H. Do, S. Melnik, E. Rahm, Comparison of schema matching evaluations, in: Revised Papers from the NODe 2002 Web and Database-Related Workshops on Web, Web-Services, and Database Systems, Springer-Verlag, London, UK, 2003.
[27] A. Doan, P. Domingos, A. Y. Levy, Learning source description for data integration, in: WebDB (Informal Proceedings), 2000.
[28] A. Doan, A. Halevy, Semantic integration research in the database community: A brief survey, AI Magazine, Special Issue on Semantic Integration 26 (1) (2005) 83–94.
[29] A. Doan, N. F. Noy, A. Y. Halevy, Introduction to the special issue on semantic integration, SIGMOD Record 33 (4) (2004) 11–13.
[30] P. Drew, R. King, D. McLeod, M. Rusinkiewicz, A. Silberschatz, Report of the workshop on semantic heterogeneity and interpolation in multidatabase systems, SIGMOD Rec. 22 (3) (1993) 47–56.
[31] M. Ehrig, P. Haase, N. Stojanovic, M. Hefke, Similarity for ontologies - a comprehensive framework, in: 13th European Conference on Information Systems, Regensburg, 2005, ISBN: 3937195092.
[32] D. W. Embley, Y. S. Jiang, Y.-K. Ng, Record-boundary discovery in web documents, in: A. Delis, C. Faloutsos, S. Ghandeharizadeh (eds.), SIGMOD Conference, ACM Press, 1999.
[33] D. W. Embley, D. P. Lopresti, G. Nagy, Notes on contemporary table recognition, in: H. Bunke, A. L. Spitz (eds.), Document Analysis Systems, vol. 3872 of Lecture Notes in Computer Science, Springer, 2006.
[34] J. Euzenat, P. Valtchev, An integrative proximity measure for ontology alignment, in: ISWC-2003 workshop on semantic information integration, Sanibel Island (FL US), 2003.
[35] J. Euzenat, P. Valtchev, Similarity-based ontology alignment in OWL-Lite, in: 15th European Conference on Artificial Intelligence (ECAI), Valencia, 2004.
[36] W. J. Frawley, G. Piatetsky-Shapiro, C. J. Matheus, Knowledge discovery in databases: An overview, AI Magazine 13 (3) (1992) 57–70.
[37] A. Gal, G. A. Modica, H. M. Jamil, OntoBuilder: Fully automatic extraction and consolidation of ontologies from web sources, in: International Conference on Data Engineering (ICDE), IEEE Computer Society, 2004.
[38] E. Gamma, R. Helm, R. E. Johnson, J. M. Vlissides, Design patterns: Abstraction and reuse of object-oriented design, in: ECOOP '93: Proceedings of the 7th European Conference on Object-Oriented Programming, Springer-Verlag, London, UK, 1993.
[39] R. L. Goldstone, Similarity, in: R. Wilson, F. C. Keil (eds.), MIT Encyclopedia of the Cognitive Sciences, MIT Press, Cambridge, MA, 1999, pp. 763–765.
[40] T. R. Gruber, A translation approach to portable ontology specifications, Knowl. Acquis. 5 (2) (1993) 199–220.
[41] J.-L. Hainaut, M. Chandelon, C. Tonneau, M. Joris, Contribution to a theory of database reverse engineering, in: WCRE, 1993.
[42] J.-L. Hainaut, J. Henrard, A general meta-model for data-centered application reengineering, in: Dagstuhl workshop on Interoperability of Reengineering Tools, 2001.
[43] A. Y. Halevy, J. Madhavan, P. A. Bernstein, Discovering structure in a corpus of schemas, IEEE Data Eng. Bull. 26 (3) (2003) 26–33.
[44] J. Hammer, Resolving semantic heterogeneity in a federation of autonomous, heterogeneous database systems, Ph.D. thesis, University of Southern California (August 1994).
[45] J. Hammer, W. O'Brien, M. S. Schmalz, Scalable knowledge extraction from legacy sources with SEEK, in: H. Chen, R. Miranda, D. D. Zeng, C. C. Demchak, J. Schroeder, T. Madhusudan (eds.), Intelligence and Security Informatics (ISI), vol. 2665 of Lecture Notes in Computer Science, Springer, 2003.
[46] J. Hammer, M. Schmalz, W. O'Brien, S. Shekar, N. Haldavnekar, SEEKing knowledge in legacy information systems to support interoperability, in: ECAI-02 Workshop on Ontologies and Semantic Interoperability, 2002.
[47] J. Hammer, M. Stonebraker, O. Topsakal, THALIA: Test harness for the assessment of legacy information integration approaches, in: International Conference on Data Engineering (ICDE), IEEE Computer Society, 2005.
[48] J. Hammer, M. Stonebraker, O. Topsakal, THALIA: Test harness for the assessment of legacy information integration approaches, Tech. Rep. tr05-001, University of Florida, Computer and Information Science and Engineering (2005).
[49] J. Henrard, Program understanding in database reverse engineering, Ph.D. thesis, University of Notre-Dame (2003).
[50] J. Henrard, V. Englebert, J.-M. Hick, D. Roland, J.-L. Hainaut, Program understanding in databases reverse engineering, in: G. Quirchmayr, E. Schweighofer, T. J. M. Bench-Capon (eds.), DEXA, vol. 1460 of Lecture Notes in Computer Science, Springer, 1998.
[51] G. Hirst, D. St-Onge, Lexical chains as representations of context for the detection and correction of malapropisms, in: C. Fellbaum (ed.), WordNet: An electronic lexical database, MIT Press, 1998.
[52] J. J. Jiang, D. W. Conrath, Semantic similarity based on corpus statistics and lexical taxonomy, in: Proceedings of ROCLING X, Taiwan, 1997.
[53] S. C. Johnson, YACC: Yet another compiler compiler, Tech. Rep. CSTR 32, AT&T Bell Laboratories (1978).
[54] Y. Kalfoglou, M. Schorlemmer, Ontology mapping: The state of the art, The Knowledge Engineering Review Journal 18 (1) (2003) 1–31.
[55] T. K. Landauer, P. W. Foltz, D. Laham, Introduction to latent semantic analysis, Discourse Processes 25 (1998) 259–284.
[56] C. Leacock, M. Chodorow, Combining local context and WordNet similarity for word sense identification, in: C. Fellbaum (ed.), WordNet: An electronic lexical database, MIT Press, 1998.
[57] D. B. Lenat, Cyc: a large-scale investment in knowledge infrastructure, Communications of the ACM 38 (11) (1995) 33–38.
[58] M. Lesk, Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone, in: SIGDOC '86, 1986.
[59] M. E. Lesk, Lex - a lexical analyzer generator, Tech. Rep. CSTR 39, AT&T Bell Laboratories, New Jersey (1975).
[60] D. Lin, An information-theoretic definition of similarity, in: ICML '98: Proceedings of the Fifteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1998.
[61] J. Madhavan, P. A. Bernstein, A. Doan, A. Halevy, Corpus-based schema matching, in: ICDE '05: Proceedings of the 21st International Conference on Data Engineering (ICDE'05), IEEE Computer Society, Washington, DC, USA, 2005.
[62] J. Madhavan, P. A. Bernstein, E. Rahm, Generic schema matching with Cupid, in: VLDB '01: Proceedings of the 27th International Conference on Very Large Data Bases, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2001.
[63] A. Maedche, S. Staab, Measuring similarity between ontologies, in: EKAW '02: Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management. Ontologies and the Semantic Web, Springer-Verlag, London, UK, 2002.
[64] D. L. McGuinness, F. van Harmelen, OWL Web Ontology Language overview, http://www.w3.org/TR/owl-features/, W3C Recommendation (February 2004).
[65] G. Miller, W. Charles, Contextual correlates of semantic similarity, Language and Cognitive Processes 6 (1) (1991) 1–28.
[66] G. A. Miller, WordNet: a lexical database for English, Commun. ACM 38 (11) (1995) 39–41.
[67] J. Q. Ning, A. Engberts, W. V. Kozaczynski, Automated support for legacy code understanding, Commun. ACM 37 (5) (1994) 50–57.
[68] N. F. Noy, Semantic integration: A survey of ontology-based approaches, SIGMOD Record 33 (4) (2004) 65–70.
[69] N. F. Noy, D. L. McGuinness, Ontology development 101: A guide to creating your first ontology, Technical Report KSL-01-05, Stanford Knowledge Systems Laboratory (2003).
[70] V. Oleshchuk, A. Pedersen, Ontology based semantic similarity comparison of documents, in: 14th International Workshop on Database and Expert Systems Applications (DEXA'03), IEEE, 2003.
[71] J. Palsberg, C. B. Jay, The essence of the visitor pattern, in: Computer Software and Applications Conference (COMPSAC), IEEE Computer Society, 1998.
[72] S. Patwardhan, Incorporating dictionary and corpus information into a context vector measure of semantic relatedness, Master's thesis, University of Minnesota (2003).
[73] T. Pedersen, S. Patwardhan, J. Michelizzi, WordNet::Similarity - measuring the relatedness of concepts, in: Proceedings of the Nineteenth National Conference on Artificial Intelligence (AAAI-04), San Jose, CA, 2004.
[74] P. Bailey, D. Hawking, A. Krumpholz, Toward meaningful test collections for information integration benchmarking, in: IIWeb 2006, Workshop on Information Integration on the Web in conjunction with WWW 2006, Edinburgh, Scotland, 2006.
[75] M. Postema, H. W. Schmidt, Reverse engineering and abstraction of legacy systems, Informatica: International Journal of Computing and Informatics 22 (3).
[76] A. Quilici, Reverse engineering of legacy systems: A path toward success, in: ICSE, 1995.
[77] R. Rada, H. Mili, E. Bicknell, M. Blettner, Development and application of a metric on semantic nets, IEEE Transactions on Systems, Man, and Cybernetics 19 (1) (1989) 17–30.
[78] E. Rahm, P. A. Bernstein, A survey of approaches to automatic schema matching, Very Large Data Bases (VLDB) J. 10 (4) (2001) 334–350.
[79] P. Resnik, Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language, J. Artif. Intell. Res. (JAIR) 11 (1999) 95–130.
[80] R. Richardson, A. F. Smeaton, Using WordNet as a knowledge base for measuring semantic similarity between words, Tech. Rep. CA-1294, School of Computer Applications, Dublin City University, Dublin, Ireland (1994).
[81] M. A. Rodriguez, M. J. Egenhofer, Determining semantic similarity among entity classes from different ontologies, IEEE Transactions on Knowledge and Data Engineering 15 (2) (2003) 442–456.
[82] H. Rubenstein, J. Goodenough, Contextual correlates of synonymy, Computational Linguistics 8 (1965) 627–633.
[83] S. Rugaber, Program comprehension, Encyclopedia of Computer Science and Technology 35 (20) (1995) 341–368, Marcel Dekker, Inc., New York.
[84] S. Sangeetha, J. Hammer, M. Schmalz, O. Topsakal, Extracting meaning from legacy code through pattern matching, Technical Report TR-03-003, University of Florida, Gainesville (2003).
[85] N. Seco, T. Veale, J. Hayes, An intrinsic information content metric for semantic similarity in WordNet, in: Proceedings of ECAI'2004, the 16th European Conference on Artificial Intelligence, Valencia, Spain, 2004.
[86] S. Melnik, H. Garcia-Molina, E. Rahm, Similarity flooding: A versatile graph matching algorithm and its application to schema matching, in: ICDE '02: Proceedings of the 18th International Conference on Data Engineering (ICDE'02), IEEE Computer Society, Washington, DC, USA, 2002.
[87] A. P. Sheth, Data semantics: What, where and how, in: Proceedings of the 6th IFIP Working Conference on Data Semantics, 1995.
[88] P. Shvaiko, J. Euzenat, A survey of schema-based matching approaches, Journal on Data Semantics (JoDS) IV (2005) 146–171.
[89] E. Stroulia, M. El-Ramly, L. Kong, P. G. Sorenson, B. Matichuk, Reverse engineering legacy interfaces: An interaction-driven approach, in: 6th Working Conference on Reverse Engineering (WCRE'99), 1999.
[90] Y. A. Tijerino, D. W. Embley, D. W. Lonsdale, Y. Ding, G. Nagy, Towards ontology generation from tables, World Wide Web 8 (3) (2005) 261–285.
[91] O. Topsakal, Extracting semantics from legacy sources using reverse engineering of Java code with the help of visitor patterns, Master's thesis, Department of Computer and Information Science and Engineering, University of Florida (2003).
[92] M. Uschold, M. Gruninger, Ontologies and semantics for seamless connectivity, SIGMOD Rec. 33 (4) (2004) 58–64.
[93] P. Vossen, EuroWordNet: a multilingual database for information retrieval, in: Proceedings of the DELOS workshop on Cross-language Information Retrieval, Zurich, 1997.
[94] J. Wang, F. H. Lochovsky, Data extraction and label assignment for web databases, in: WWW '03: Proceedings of the 12th international conference on World Wide Web, ACM Press, New York, NY, USA, 2003.
[95] B. W. Weide, W. D. Heym, J. E. Hollingsworth, Reverse engineering of legacy code exposed, in: ICSE, 1995.
[96] M. Weiser, Program slicing, in: ICSE '81: Proceedings of the 5th international conference on Software engineering, IEEE Press, Piscataway, NJ, USA, 1981.
[97] L. M. Wills, Using attributed flow graph parsing to recognize clichés in programs, in: Selected papers from the 5th International Workshop on Graph Grammars and Their Application to Computer Science, Springer-Verlag, London, UK, 1996.
[98] Z. Wu, M. Palmer, Verb semantics and lexical selection, in: Proceedings of the 32nd annual meeting on Association for Computational Linguistics, Association for Computational Linguistics, Morristown, NJ, USA, 1994.
[99] M. Yatskevich, Preliminary evaluation of schema matching systems, Tech. Rep. DIT-03-028, University of Trento (2003).
[100] B. Yu, L. Liu, B. C. Ooi, K. L. Tan, Keyword join: Realizing keyword search for information integration, in: Computer Science (CS), DSpace at MIT, 2006.
BIOGRAPHICAL SKETCH
Oguzhan Topsakal, a native of Turkey, received his Bachelor of Science degree from the Computer and Control Engineering Department of Istanbul Technical University in June 1996. He worked in the information technology departments of several companies before pursuing a graduate degree in the U.S.A. After receiving his Master of Science degree in computer engineering from the University of Florida in August 2003, he continued on to his Ph.D. degree. In the 2004-2005 academic year, he was a study-abroad student at the University of Bremen, Germany. During his Ph.D. studies, he worked as a teaching assistant for programming language and database related courses at the University of Florida and for a data warehousing and data mining course at the University of Hong Kong. He also worked as a research assistant on the scalable extraction of enterprise knowledge (SEEK) and test harness for the assessment of legacy information integration approaches (THALIA) projects. His research interests include semantic analysis, machine learning, natural language processing, knowledge management, information retrieval, and data integration. He believes in continued learning and education to better understand and contribute to society.