Towards a Theory of Refinement for Data Migration

Bernhard Thalheim1 and Qing Wang2

1 Department of Computer Science, Christian-Albrechts-University Kiel, Germany
[email protected]
2 Department of Information Science, University of Otago, Dunedin, New Zealand
[email protected]
Abstract. We develop a theoretical framework for refining the transformations that occur in the process of data migration. A legacy kernel, which consolidates the heterogeneous data sources of a legacy system, can be discovered at a high level of abstraction. We then show that migration transformations can be specified, at flexible levels of abstraction, as the composition of two subclasses of transformations: property-preserving transformations and property-enhancing transformations. By defining a refinement scheme with notions of correct refinement for property-preserving and property-enhancing transformations, we are able to stepwise refine migration transformations and to prove the correctness of these refinements. The results of this paper lay down a formal foundation for investigating data migration.
1 Introduction
Modernising legacy systems is one of the most challenging problems in engineering information systems [2, 3, 10, 12]. With new technologies emerging and application domains evolving, legacy systems need to be migrated into new systems at some point to support enhanced functionality and reengineered business models. Data migration, a fundamental aspect of legacy system modernisation projects, has been recognised as a difficult task that may cause projects to fail as a whole [8, 19]. Industry survey results [8] reveal that the data migration market is growing rapidly and that companies invest billions of dollars annually in data migration tasks (e.g., over 5bn from the top 2000 global companies in 2007); nevertheless, only 16% of projects accomplish their data migration tasks successfully, i.e., deliver them on time and on budget. One of the main reasons for time and budget overruns is the lack of a well-defined methodology that can help handle the complexity of data migration tasks.

Data migration moves data sources from legacy systems into new systems in which the data have different structures. Several issues can considerably complicate this process. First, legacy systems may have a number of heterogeneous data sources that are interconnected but were designed with different data modelling tools or are interpreted under different semantics. Second,
legacy systems may contain inaccurate, incomplete, duplicate or inconsistent data, and new systems may impose additional semantic constraints on the data after migration. Thus, bringing the quality of data up to the standard of the new systems can be costly and time-consuming. Third, many data migration tasks such as data profiling, discovery, validation and cleansing need to be executed iteratively in a project, and specification changes frequently occur in order to repair detected problems. It is estimated [1] that 90% of the initial specifications change, and over 25% of the specifications change more than once, during the life of a data migration project.

These issues highlight the importance of a methodology for data migration, in particular the need for a refinement theory that links the conceptual modelling of specifications to executable code in a way that lets practitioners verify properties of data migration. This paper aims at establishing a theoretical framework of refinement for data migration that allows us to refine a high-level specification of migration transformations into one at an implementation level. This framework provides answers to the following questions arising from data migration in practice.

– How can we react to specification changes while keeping track of all relevant aspects the changes may impact, such as inconsistencies between specifications, interrelated data and the correctness of the implementation?
– How can we compare legacy data sources with the migrated data in the new systems to ensure that the data was migrated properly with respect to the desired data semantics and integrity?

From a traditional perspective, the process of data migration involves three stages: Extract, Transform, and Load (ETL). However, unlike conventional ETL for data warehousing, which deals with analytic data, ETL in data migration is more complicated because operational data has to be handled. Our first contribution is the formal development of the ETL processes for data migration, as described below.

– A legacy kernel, which consolidates the heterogeneous data sources of a legacy system, is first "extracted" at a high level of abstraction. The links between abstract and more concrete models can be further exploited by using Galois connections in abstract interpretation [7, 16].
– We then "transform" the legacy kernel into a new kernel by specifying migration transformations in which data cleansing methods are described for "dirty" data, and "clean" data is mapped into an abstract model. A specific migration strategy [2, 3, 10] can be chosen and applied at this stage.
– As "loading" an abstract model (i.e., the new kernel) into a more concrete model has been well studied (e.g., [13, 17, 20]), we omit the discussion of this stage in this paper.

Our second contribution is a refinement scheme specifying the refinement of migration transformations in terms of two types of refinement correctness. Using our refinement scheme, the above ETL processes can be stepwise refined into a
real-life implementation. As illustrated in Fig. 1, the models of an abstract computation (e.g., M_{legacy kernel} and M_{new kernel}) can be refined into more concrete models (e.g., M*_{legacy kernel} and M*_{new kernel}) of the corresponding computation at a concrete level; similarly, the computation segments of interest (e.g., extract, transform and load) at an abstract level are refined into the corresponding computation segments (e.g., extract*, transform* and load*) at a concrete level. With our notions of refinement correctness, the generic proof method proposed by Schellhorn [14] can be used to verify the correctness of these refinements.
[Figure: the legacy models are consolidated by extract into M_{legacy kernel}, which transform maps to M_{new kernel}, which load maps to the new models; the same pipeline appears at the abstract level and, with starred models and segments, at the concrete level.]
Fig. 1. ETL in data migrations
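To make this picture concrete, the following is a minimal sketch, not taken from the paper and with all names and types merely illustrative, of how a migration at either level can be viewed as the composition of extract, transform and load segments, and of the commuting condition suggested by Fig. 1: abstracting the result of the concrete pipeline should agree with running the abstract pipeline on the abstracted model. The abstraction function plays the role of the abstraction side of a Galois connection between the two levels.

```haskell
-- Illustrative sketch only: the models, the segments and the refinement
-- check are simplified stand-ins for the notions developed in the paper.

-- An abstraction function mapping a concrete model to its abstract counterpart.
type Abstraction c a = c -> a

-- An ETL pipeline over models of type m: extract, transform, load.
data ETL m = ETL
  { extract   :: m -> m
  , transform :: m -> m
  , load      :: m -> m
  }

-- A migration transformation is the composition of the three segments.
migrate :: ETL m -> m -> m
migrate (ETL e t l) = l . t . e

-- A concrete pipeline is a candidate refinement of an abstract one if the
-- square of Fig. 1 commutes: abstracting after migrating concretely equals
-- migrating abstractly after abstracting.
refines :: Eq a => Abstraction c a -> ETL c -> ETL a -> c -> Bool
refines alpha concretePipeline abstractPipeline m =
  alpha (migrate concretePipeline m) == migrate abstractPipeline (alpha m)
```

In a full treatment one would instantiate the model types with the schemata and object sets of Section 2 and replace the simple equality check by the refinement correctness notions developed in Section 5.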
We present basic definitions for schemata and models in Section 2. In Section 3 we discuss an approach to discovering a legacy kernel. Section 4 gives the formal definitions for migration transformations and two important subclasses in data migration. We further investigate the refinement correctness of migration transformations and their subclasses in Section 5, along with a general discussion on how to verify the correctness of refinements for migration transformations. In Section 6 we briefly conclude the paper.
2 Schemata, Models and Levels of Abstraction
For simplicity, we take an object-based view on models in this paper. In data migration, the data sources of legacy and new systems may be designed using different data modelling approaches. Nevertheless, the components supported by many data modelling approaches can be viewed as objects, for example, entities and relationships in the entity-relationship model, tuples in the relational data model, elements in XML, etc. This simplified but generic view gives us the flexibility to relate models residing at different levels of abstraction to each other.

Let us fix a family D of basic data types and a set C of constructors (e.g., record, list, set, multiset, array, etc.). Then a set of object types over (D, C) can be defined inductively by applying constructors in C to basic data types in D. Similarly, given an object type τ, the set of objects of type τ can be defined inductively according to the types and constructors that constitute τ. In order to
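As an illustration of this inductive definition, the following sketch assumes one concrete choice of D and C (the paper keeps both parametric); object types and their objects are defined as mirrored algebraic data types, and membership of an object in the set of objects of a type follows the same inductive structure.

```haskell
-- Illustrative only: a possible choice of basic data types D and constructors C.

-- The family D of basic data types.
data BaseType = IntT | StringT | BoolT
  deriving (Eq, Show)

-- Object types over (D, C), built inductively with the constructors in C
-- (here: record, set and list).
data ObjectType
  = Base BaseType
  | Record [(String, ObjectType)]   -- named components
  | SetOf ObjectType
  | ListOf ObjectType
  deriving (Eq, Show)

-- Objects, built inductively along the same structure.
data Object
  = IntO Int
  | StringO String
  | BoolO Bool
  | RecordO [(String, Object)]
  | SetO [Object]
  | ListO [Object]
  deriving (Eq, Show)

-- Whether an object belongs to the set of objects of a given object type.
hasType :: Object -> ObjectType -> Bool
hasType (IntO _)     (Base IntT)    = True
hasType (StringO _)  (Base StringT) = True
hasType (BoolO _)    (Base BoolT)   = True
hasType (RecordO fs) (Record ts)    =
  map fst fs == map fst ts && and (zipWith hasType (map snd fs) (map snd ts))
hasType (SetO os)    (SetOf t)      = all (`hasType` t) os
hasType (ListO os)   (ListOf t)     = all (`hasType` t) os
hasType _            _              = False

-- Example: an object type for a simple customer entity and one of its objects.
customerType :: ObjectType
customerType = Record [("name", Base StringT), ("orders", SetOf (Base IntT))]

customer :: Object
customer = RecordO [("name", StringO "Smith"), ("orders", SetO [IntO 1, IntO 2])]
```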