Towards a Theory of Refinement for Data Migration

Bernhard Thalheim1 and Qing Wang2

1 Department of Computer Science, Christian-Albrechts-University Kiel, Germany
[email protected]
2 Department of Information Science, University of Otago, Dunedin, New Zealand
[email protected]
Abstract. We develop a theoretical framework for refining the transformations that occur in the process of data migration. A legacy kernel, which consolidates the heterogeneous data sources of a legacy system, can be discovered at a high level of abstraction. We then show that migration transformations can be specified, at flexible levels of abstraction, as the composition of two subclasses of transformations: property-preserving transformations and property-enhancing transformations. By defining a refinement scheme with notions of correct refinement for property-preserving and property-enhancing transformations, we are able to stepwise refine migration transformations and to prove the correctness of these refinements. The results of this paper lay down a formal foundation for investigating data migration.
1 Introduction
Modernising legacy systems is one of the most challenging problems in engineering information systems [2, 3, 10, 12]. With new technologies emerging and application domains evolving, legacy systems need to be migrated into new systems at some point to support enhanced functionality and reengineered business models. Data migration, a fundamental aspect of legacy system modernisation projects, has been recognised as a difficult task that may cause projects to fail as a whole [8, 19]. Industry survey results [8] reveal that the data migration market is growing rapidly and that companies invest billions of dollars annually in data migration tasks (e.g., over 5bn from the top 2000 global companies in 2007); nevertheless, only 16% of projects accomplish their data migration tasks successfully, i.e., deliver them on time and on budget. One of the main reasons for time and budget overruns is the lack of a well-defined methodology that can help handle the complexity of data migration tasks.

Data migration moves data sources from legacy systems into new systems in which the data have different structures. Several issues can considerably complicate this process. First, legacy systems may have a number of heterogeneous data sources that are interconnected but were designed with different data modelling tools or are interpreted under different semantics. Second,
legacy systems may contain inaccurate, incomplete, duplicate or inconsistent data, and new systems may impose additional semantic constraints on the data after migration. Thus, bringing the quality of data up to the standard of the new systems can be costly and time-consuming. Third, many data migration tasks such as data profiling, discovery, validation and cleansing need to be executed iteratively in a project, and specification changes frequently occur in order to repair detected problems. It is estimated [1] that 90% of the initial specifications change, and over 25% of the specifications change more than once, during the life of a data migration project.

These issues highlight the importance of a methodology for data migration, in particular the need for a refinement theory that links the conceptual modelling of specifications to executable code in a way that lets practitioners verify properties of data migration. This paper aims at establishing a theoretical framework of refinement for data migration that allows us to refine a high-level specification of migration transformations into one at an implementation level. This framework provides answers to the following questions arising from data migration in practice.

– How can we react to specification changes while keeping track of all relevant aspects the changes may impact, such as inconsistencies between specifications, interrelated data and the correctness of the implementation?
– How can we compare legacy data sources with the migrated data in the new systems to ensure that the data was migrated properly with respect to the desired data semantics and integrity?

From a traditional perspective, the process of data migration involves three stages: Extract, Transform, and Load (ETL). However, unlike conventional ETL for data warehousing, which deals with analytic data, ETL in data migration is more complicated because operational data has to be handled. Our first contribution is the formal development of the ETL processes for data migration, as described below.

– A legacy kernel, which consolidates the heterogeneous data sources of a legacy system, is first "extracted" at a high level of abstraction. The links between abstract and more concrete models can be further exploited by using Galois connections in abstract interpretation [7, 16].
– We then "transform" the legacy kernel into a new kernel by specifying migration transformations in which data cleansing methods are described for "dirty" data, and "clean" data is mapped into an abstract model. A specific migration strategy [2, 3, 10] can be chosen and applied at this stage.
– As "loading" an abstract model (i.e., the new kernel) into a more concrete model has been well studied (e.g., [13, 17, 20]), we omit the discussion of this stage in this paper.

Our second contribution is a refinement scheme specifying the refinement of migration transformations in terms of two types of refinement correctness. Using our refinement scheme, the above ETL processes can be stepwise refined into a
real-life implementation. As illustrated in Fig. 1, the models of an abstract computation (e.g., M_{legacy kernel} and M_{new kernel}) can be refined into more concrete models (e.g., M*_{legacy kernel} and M*_{new kernel}) of the corresponding computation at a concrete level; similarly, the computation segments of interest (e.g., extract, transform and load) at an abstract level are refined into the corresponding computation segments (e.g., extract*, transform* and load*) at a concrete level. With our notions of refinement correctness, the generic proof method proposed by Schellhorn [14] can be used to verify the correctness of these refinements.
[Figure: the legacy models are consolidated by extract into M_{legacy kernel}, which transform maps to M_{new kernel}, which load maps to the new models; the same pipeline appears at the abstract level and, with starred models and segments, at the concrete level.]
Fig. 1. ETL in data migrations
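To make this picture concrete, the following is a minimal sketch, not taken from the paper and with all names and types merely illustrative, of how a migration at either level can be viewed as the composition of extract, transform and load segments, and of the commuting condition suggested by Fig. 1: abstracting the result of the concrete pipeline should agree with running the abstract pipeline on the abstracted model. The abstraction function plays the role of the abstraction side of a Galois connection between the two levels.

```haskell
-- Illustrative sketch only: the models, the segments and the refinement
-- check are simplified stand-ins for the notions developed in the paper.

-- An abstraction function mapping a concrete model to its abstract counterpart.
type Abstraction c a = c -> a

-- An ETL pipeline over models of type m: extract, transform, load.
data ETL m = ETL
  { extract   :: m -> m
  , transform :: m -> m
  , load      :: m -> m
  }

-- A migration transformation is the composition of the three segments.
migrate :: ETL m -> m -> m
migrate (ETL e t l) = l . t . e

-- A concrete pipeline is a candidate refinement of an abstract one if the
-- square of Fig. 1 commutes: abstracting after migrating concretely equals
-- migrating abstractly after abstracting.
refines :: Eq a => Abstraction c a -> ETL c -> ETL a -> c -> Bool
refines alpha concretePipeline abstractPipeline m =
  alpha (migrate concretePipeline m) == migrate abstractPipeline (alpha m)
```

In a full treatment one would instantiate the model types with the schemata and object sets of Section 2 and replace the simple equality check by the refinement correctness notions developed in Section 5.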
We present basic definitions for schemata and models in Section 2. In Section 3 we discuss an approach to discovering a legacy kernel. Section 4 gives the formal definitions for migration transformations and two important subclasses in data migration. We further investigate the refinement correctness of migration transformations and their subclasses in Section 5, along with a general discussion on how to verify the correctness of refinements for migration transformations. In Section 6 we briefly conclude the paper.
2 Schemata, Models and Levels of Abstraction
For simplicity, we take an object-based view on models in this paper. In data migration, the data sources of legacy and new systems may be designed using different data modelling approaches. Nevertheless, the components supported by many data modelling approaches can be viewed as objects, for example, entities and relationships in the entity-relationship model, tuples in the relational data model, elements in XML, etc. This simplified but generic view gives us the flexibility to relate models residing at different levels of abstraction to each other.

Let us fix a family D of basic data types and a set C of constructors (e.g., record, list, set, multiset, array, etc.). Then a set of object types over (D, C) can be defined inductively by applying constructors in C to basic data types in D. Similarly, given an object type τ, the set of objects of type τ can be defined inductively according to the types and constructors that constitute τ. In order to
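As an illustration of this inductive definition, the following sketch assumes one concrete choice of D and C (the paper keeps both parametric); object types and their objects are defined as mirrored algebraic data types, and membership of an object in the set of objects of a type follows the same inductive structure.

```haskell
-- Illustrative only: a possible choice of basic data types D and constructors C.

-- The family D of basic data types.
data BaseType = IntT | StringT | BoolT
  deriving (Eq, Show)

-- Object types over (D, C), built inductively with the constructors in C
-- (here: record, set and list).
data ObjectType
  = Base BaseType
  | Record [(String, ObjectType)]   -- named components
  | SetOf ObjectType
  | ListOf ObjectType
  deriving (Eq, Show)

-- Objects, built inductively along the same structure.
data Object
  = IntO Int
  | StringO String
  | BoolO Bool
  | RecordO [(String, Object)]
  | SetO [Object]
  | ListO [Object]
  deriving (Eq, Show)

-- Whether an object belongs to the set of objects of a given object type.
hasType :: Object -> ObjectType -> Bool
hasType (IntO _)     (Base IntT)    = True
hasType (StringO _)  (Base StringT) = True
hasType (BoolO _)    (Base BoolT)   = True
hasType (RecordO fs) (Record ts)    =
  map fst fs == map fst ts && and (zipWith hasType (map snd fs) (map snd ts))
hasType (SetO os)    (SetOf t)      = all (`hasType` t) os
hasType (ListO os)   (ListOf t)     = all (`hasType` t) os
hasType _            _              = False

-- Example: an object type for a simple customer entity and one of its objects.
customerType :: ObjectType
customerType = Record [("name", Base StringT), ("orders", SetOf (Base IntT))]

customer :: Object
customer = RecordO [("name", StringO "Smith"), ("orders", SetO [IntO 1, IntO 2])]
```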