Scalable Knowledge Extraction from Legacy Sources with SEEK

Joachim Hammer†, William O'Brien‡, and Mark Schmalz†
University of Florida, Gainesville, FL 32611-6120, U.S.A.

1. Introduction

The continuous growth in the amount of electronically available information from multiple (legacy) sources, the heterogeneity of those sources in both operation and information representation, and the need to access and analyze this wealth of information for decision support (e.g., intelligence, law enforcement, enterprise coordination) have led to research and development of advanced information technologies such as data exchange (e.g., XML-based protocols) and integration architectures (e.g., data warehouses, integration middleware based on CORBA or SOAP). However, given the number of available sources as well as the heterogeneities that exist at various levels of abstraction (see, for example, [7]), the time and investment needed to establish integrated access to multiple sources, especially legacy sources, severely limit the scalability and maintainability of these technologies. For example, current wrapper development tools [4, 5, 9] require significant programmatic set-up with limited reusability of code. Other efforts are under way to develop languages and tools for describing resources in a machine-processable way (e.g., the Semantic Web [2]); however, they do not address how to discover and collect the available information, or how to maintain it efficiently for a continuously increasing number of legacy sources. The SEEK project1 (Scalable Extraction of Enterprise Knowledge) at the University of Florida is directed at developing scalable data access and extraction technology for overcoming some of the problems of assembling and integrating knowledge resident in numerous legacy information systems [8] and making that knowledge available for analysis and decision support.
Development of theory and knowledge in this area is relevant to many applications that depend on integrated access to heterogeneous information, including tactical situation analysis in complex, data-rich environments and extended enterprises in manufacturing and project environments.

2. Overview of the SEEK Approach

A high-level view of the core SEEK architecture is shown in Figure 1. SEEK follows established integration methodologies (e.g., TSIMMIS [3], InfoSleuth [1]) and provides a modular middleware layer that bridges the gap between legacy information sources and decision makers/support tools. Unlike existing approaches, however, it provides tools for extracting knowledge from the legacy source to support configuration of the mediators and wrappers.

SEEK works as follows. At run-time, the analysis module processes queries from end users (e.g., decision support tools and analysts) and performs knowledge composition, including basic mediation tasks and post-processing of the extracted data. Data communication between the analysis module and the legacy sources is provided by the wrapper component. The wrapper translates SEEK queries into access commands understood by the source and converts native source results into SEEK's internal language.

Prior to answering queries, SEEK must be configured. This is accomplished semi-automatically by the knowledge extraction module, which directs wrapper and analysis module configuration at build-time. The wrapper must be configured with information regarding communication protocols between SEEK and legacy sources, access mechanisms, and underlying source schemas. The analysis module must be configured with information about source capabilities, available knowledge, and its representation. To produce a SEEK-specific representation of the operational knowledge in the legacy source (for configuration of the analysis module) as well as a wrapper for source access, we have developed a three-step knowledge extraction approach. Our approach to knowledge extraction is based on the assumption that information is stored in some form of data repository which may be accessed by application code.

† Dept. of Computer & Information Science & Engineering. Contact e-mail: {jhammer,mssz}@cise.ufl.edu
‡ Rinker School of Building Construction. Contact e-mail: [email protected]
1 Supported by the National Science Foundation under grant numbers CMS-0075407 and CMS-0122193.
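The run-time role of the wrapper described above can be sketched in a few lines. This is a minimal illustration, not the actual SEEK implementation; the class and function names (`LegacyWrapper`, `translate_query`, and so on) are assumptions made for the example.

```python
# Illustrative sketch of a SEEK-style run-time wrapper: it translates a
# query from SEEK's internal form into a source-native command, runs it,
# and converts the native results back into SEEK's representation.
# All names here are hypothetical, not SEEK's real API.

class LegacyWrapper:
    """Mediates between SEEK's internal query language and one legacy source."""

    def __init__(self, translate_query, execute_native, convert_results):
        self.translate_query = translate_query    # SEEK query -> native command
        self.execute_native = execute_native      # run command on the source
        self.convert_results = convert_results    # native rows -> SEEK representation

    def answer(self, seek_query):
        native_cmd = self.translate_query(seek_query)
        native_rows = self.execute_native(native_cmd)
        return self.convert_results(native_rows)

# Toy usage: a "legacy source" that only understands uppercase commands.
source_db = {"SELECT * FROM PROJ": [("p1", "bridge"), ("p2", "tunnel")]}

wrapper = LegacyWrapper(
    translate_query=lambda q: q.upper(),
    execute_native=lambda cmd: source_db.get(cmd, []),
    convert_results=lambda rows: [{"id": r[0], "name": r[1]} for r in rows],
)

print(wrapper.answer("select * from proj"))
# [{'id': 'p1', 'name': 'bridge'}, {'id': 'p2', 'name': 'tunnel'}]
```

The point of the pattern is that translation, access, and result conversion are separable concerns, which is what makes the wrapper a configurable, per-source component rather than hand-written glue code.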

[Figure 1 components: End Users and Decision Support (applications/decision support tools); connection tools; SEEK components (Analysis Module, Knowledge Extraction Module, Wrapper); legacy data and systems. A source expert and a commander & situation analyst interact with the system. Legend distinguishes run-time/operational data flow from build-time/set-up/tuning data flow.]

Figure 1: Schematic diagram of SEEK logical architecture.

1. Schema reverse engineering and semantic analysis: SEEK generates a detailed description of the data repository, including entities, relationships, and encoded business rules, using a Data Reverse Engineering (DRE) algorithm. This information is augmented with semantics extracted by the Semantic Analyzer from the application-specific meanings of those concepts found in the application code.

2. Domain model mapping: The semantically enhanced legacy source schema is mapped onto the domain model used by the application(s) that access the legacy source. This is done using a schema mapping process that produces the mapping rules between the legacy source schema and the application domain model.

3. Wrapper generation: The extracted legacy schema and the mapping rules provide the input to the wrapper generation toolkit [5] for fast, scalable, and efficient implementation of the source wrapper. At run-time, the source wrapper translates queries from the application domain model to the legacy source schema.

The above three steps are carried out under the guidance of a source expert, who can also extend the capabilities of the initial, automatic configuration directed by the knowledge extraction module. Use of domain experts in knowledge extraction and mapping rule generation is particularly necessary for the poorly formed database specifications often found in older legacy systems. Furthermore, the knowledge extraction module enables step-wise refinement of the wrapper configuration to improve extraction capabilities.

It is important to note that SEEK is NOT a general-purpose toolkit. Rather, it allows extraction of knowledge required by specific types of decision support applications. Thus SEEK enables scalable implementation of computerized decision and negotiation support across a network of sources. SEEK represents a departure from research and development in shared data standards. Instead, SEEK embraces heterogeneity in information systems, providing the ability to extract and compose knowledge resident in sources that vary in the way data is represented and how it can be queried and accessed.
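The three build-time steps can be pictured, very roughly, as a pipeline in which each stage feeds the next. The sketch below is a deliberately toy rendering under assumed data shapes; the function names and representations are illustrative only, not SEEK's actual DRE, semantic analysis, or wrapper-generation components.

```python
# Toy pipeline mirroring the three build-time steps (all names and data
# shapes are hypothetical illustrations, not SEEK's implementation).

def reverse_engineer_schema(source_metadata):
    """Step 1a (DRE): recover entities/relationships from the repository."""
    return {"entities": sorted(source_metadata["tables"]), "comments": []}

def semantic_analysis(schema, application_code):
    """Step 1b (SA): augment the schema with meaning mined from app code."""
    schema["comments"] = [ln for ln in application_code if "business rule" in ln]
    return schema

def map_to_domain_model(schema, domain_model):
    """Step 2: produce mapping rules (legacy entity -> domain concept)."""
    return {e: domain_model.get(e, e) for e in schema["entities"]}

def generate_wrapper(mapping_rules):
    """Step 3: emit a (toy) wrapper translating domain-model queries back
    into legacy-schema terms, token by token."""
    inverse = {v: k for k, v in mapping_rules.items()}
    return lambda query: " ".join(inverse.get(tok, tok) for tok in query.split())

# Toy run of the pipeline.
meta = {"tables": ["PROJ", "EMP"]}
code = ["-- business rule: budget must be positive"]
domain = {"PROJ": "Project", "EMP": "Employee"}

schema = semantic_analysis(reverse_engineer_schema(meta), code)
rules = map_to_domain_model(schema, domain)
wrapper = generate_wrapper(rules)
print(wrapper("select from Project"))  # prints: select from PROJ
```

Note the direction of the generated wrapper: the mapping rules are produced legacy-to-domain, but at run-time queries arrive in domain-model terms, so the wrapper applies the inverse mapping.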


3. Work-to-Date and Future Research

To date, we have built an initial prototype to demonstrate the feasibility of our approach. Specifically, we have developed novel DRE and SA algorithms and validated their usefulness in experimental studies. Our DRE algorithm [6] is capable of producing the schema and constraints of relational databases, including key constraints and inclusion dependencies, with significantly less human input than current approaches. Our SA algorithm [10] examines application code written in either C or Java using a combination of program comprehension techniques and extracts semantic information about the repository accessed by the code. Examples of semantics that can be extracted are primary and foreign key constraints, the English meanings of table and attribute names in the database, and business rules. This knowledge can be used for schema matching and wrapper generation, code improvement, code documentation, etc.

Plans for the SEEK toolkit are to develop a matching tool capable of producing mappings between two semantically related yet structurally different schemas. Currently, schema matching is performed manually, which is a tedious, error-prone, and expensive process. We are also in the process of integrating SEEK with a wrapper development toolkit to determine whether the extracted knowledge is sufficiently rich semantically to support compilation of legacy source wrappers for various domains.

The eventual system concept is that of a large, nearly automatic system that can (1) acquire large amounts of knowledge from multiple legacy systems, (2) extend and enhance its on-board knowledge representation and characterization capabilities through ontology-based learning, and (3) thus make each successive acquisition of knowledge from a legacy system easier and more accessible to the SEEK user community.
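To give a flavor of the kind of pattern matching over application code that semantic analysis involves, the sketch below scans source text for embedded SQL strings and recovers table names and equality predicates that hint at key relationships. This is an assumed, much-simplified illustration, not the published SA algorithm [10]; the regular expressions and the `analyze` function are invented for the example and handle only trivial cases.

```python
import re

# Hypothetical mini "semantic analyzer": mine embedded SQL strings in
# application code for table names and join-style equality predicates.
SQL_STRING = re.compile(r'"(SELECT\b[^"]*)"', re.IGNORECASE)
FROM_CLAUSE = re.compile(r'\bFROM\s+(.*?)(?:\s+WHERE|$)', re.IGNORECASE)
JOIN_PREDICATE = re.compile(r'(\w+\.\w+)\s*=\s*(\w+\.\w+)')

def analyze(source_code):
    facts = {"tables": set(), "join_hints": []}
    for stmt in SQL_STRING.findall(source_code):
        clause = FROM_CLAUSE.search(stmt)
        if clause:
            # "emp e, dept d" -> table names "emp", "dept" (aliases dropped)
            for part in clause.group(1).split(","):
                facts["tables"].add(part.split()[0].upper())
        # equality predicates like e.dept_id = d.id hint at FK relationships
        facts["join_hints"].extend(JOIN_PREDICATE.findall(stmt))
    return facts

c_snippet = '''
/* legacy C application code */
char *q = "SELECT e.name FROM emp e, dept d WHERE e.dept_id = d.id";
'''

facts = analyze(c_snippet)
print(sorted(facts["tables"]))   # ['DEPT', 'EMP']
print(facts["join_hints"])       # [('e.dept_id', 'd.id')]
```

Even this crude matcher shows why such extracted facts are useful downstream: the recovered predicate is exactly the kind of evidence a schema-matching or wrapper-generation step can exploit as a foreign-key candidate.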

References

[1] R. Bayardo, W. Bohrer, R. Brice, A. Cichocki, G. Fowler, A. Helal, V. Kashyap, T. Ksiezyk, G. Martin, M. Nodine, M. Rashid, M. Rusinkiewicz, R. Shea, C. Unnikrishnan, A. Unruh, and D. Woelk, "Semantic Integration of Information in Open and Dynamic Environments," MCC Technical Report MCC-INSL-088-96, October 1996.
[2] T. Berners-Lee, J. Hendler, and O. Lassila, "The Semantic Web," Scientific American, 2001.
[3] S. Chawathe, H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J. Ullman, and J. Widom, "The TSIMMIS Project: Integration of Heterogeneous Information Sources," Proc. of the Tenth Anniversary Meeting of the Information Processing Society of Japan, Tokyo, Japan, 1994.
[4] J.-R. Gruser, L. Raschid, M. E. Vidal, and L. Bright, "Wrapper Generation for Web Accessible Data Sources," Proc. of the 3rd IFCIS International Conference on Cooperative Information Systems, New York City, New York, USA, 1998.
[5] J. Hammer, H. Garcia-Molina, S. Nestorov, R. Yerneni, M. Breunig, and V. Vassalos, "Template-Based Wrappers in the TSIMMIS System," SIGMOD Record (ACM Special Interest Group on Management of Data), vol. 26, pp. 532-535, 1997.
[6] J. Hammer, M. Schmalz, W. O'Brien, S. Shekar, and N. Haldavnekar, "SEEKing Knowledge in Legacy Information Systems to Support Interoperability," Proc. of the ECAI-02 International Workshop on Ontologies and Semantic Interoperability, Lyon, France, 2002.
[7] W. Kent, "The Many Forms of a Single Fact," Proc. of the IEEE Spring Compcon, 1989.
[8] W. O'Brien, R. R. Issa, J. Hammer, M. S. Schmalz, J. Geunes, and S. X. Bai, "SEEK: Accomplishing Enterprise Information Integration Across Heterogeneous Sources," ITCON - Journal of Information Technology in Construction, vol. 7, pp. 101-124, 2002.
[9] L. Raschid, University of Maryland Wrapper Generation Project, www.umiacs.edu/labs/CLIP/DARPA/ww97.html.
[10] S. Shekar, J. Hammer, and M. Schmalz, "Extracting Meaning from Legacy Code through Pattern Matching," Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL 32611-6120, Technical Report TR03-003, January 2003.

