A shared context approach for supporting experts in Data ETL (Extraction, Transformation and Loading) processes

Hassane Tahir
Laboratory of Computer Science-Paris 6 (LIP6), University Pierre and Marie Curie (UPMC), Paris, France
[email protected]

Patrick Brézillon
Laboratory of Computer Science-Paris 6 (LIP6), University Pierre and Marie Curie (UPMC), Paris, France
[email protected]

978-1-4577-1676-8/11/$26.00 ©2011 IEEE

Abstract— Many experts in data integration use standard procedures to extract data from existing source systems so that it can be transformed and loaded into new target systems such as a data warehouse or ERP (Enterprise Resource Planning) applications. This process is called the ETL (Extraction, Transformation and Loading) process. Different actors apply such a process in different ways because they do not have the same viewpoints on the contexts in which the process occurs. To create an effective strategy and minimize risks, actors need to devote their efforts to developing new practices that meet current business and IT needs, mainly by drawing on past expert experience. The paper presents how to contextualize ETL processes based on different expert viewpoints. We show how making shared context explicit can help to improve the ETL process and, thus, avoid conflicts between experts having different viewpoints. We illustrate our proposal by using a software tool called Contextual Graphs (CxGs). The paper is intended to provide the basis for the development of an experience base that will be used by a support system for data migration experts.

Keywords- Contextual Graph; Data Integration; ETL; Experience base; Procedure; Practice; Shared Context; Support system; Viewpoint.

I. INTRODUCTION

Today, data are continuously collected in large volumes in repositories such as data warehouses, using different ETL (Extraction, Transformation and Loading) techniques. The ETL process is an important phase in data migration, where data is moved (after transformation) from the existing sources to the new target systems. This is generally a critical mission because several risks may materialize if a bad data migration strategy is applied. These risks include financial risk (high cost), loss of important data, and breaches of data security and confidentiality, as in [8][9][15][16]. Therefore any ETL process must be well modeled and robustly designed. The objective of our work is to improve the ETL process in data migration based on our previous and current IT consulting projects. This work is in the framework of building an experience base using a shared context approach that can be used to design a support system for data migration experts in general and for ETL Developers in particular. In addition, this work will also enrich research in context modeling [4][5][6][7].

Research studies dealing with the modeling of ETL processes concern the following: conceptual modeling of ETL processes in DWs using the Unified Modeling Language [14][15]; conceptual and logical modeling of ETL processes [16][17]; and a formal logical model for ETL processes based on a general graphical structure to represent ETL activities [18].

The first category of problems in ETL processes is about the technical problems related to the design, configuration and installation of the ETL infrastructure, the overall performance of ETL processes when dealing with large volumes of data, scheduling constraints such as dependencies among ETL jobs, changes in source data structure, etc. We know the right solution for some of these problems. For the more complicated and unexpected errors, however, we have relied on technical support, online knowledge bases and forums to find solutions. Some of the research works to improve ETL processes concern conceptual modeling and implementation issues, as in [10][11][12][13][14][18][21].

The second category of problems is related to the lack of collaboration and context sharing between the different actors involved in the ETL process. This is mainly because an actor's viewpoint is generally implicit and is not easily taken into account by the actors interacting with him. This paper addresses this second category of problems: we propose support based on shared-context management and show how to overcome the difficulties of dealing with actors' viewpoints in the different phases of the ETL process. We use the Contextual-Graphs formalism [1] to illustrate how to implement the different actors' activities and actions according to the different contextual elements. The main advantage of Contextual Graphs is that not everything needs to be known before the system is used, because the system can be enriched incrementally with new knowledge and has a practice-learning capability [1][2]. Moreover, a contextual graph is a good communication tool that helps actors to easily exchange their experiences and viewpoints.

The paper starts by illustrating the problem in the ETL process that we faced in a data migration project, followed by a description of the shared context approach. Then we present how contextual graphs are used to model the ETL process and how easy it is to add practices developed by the different actors to represent their viewpoints and make their contextual elements explicit. We conclude with an evaluation of the work and a discussion of future perspectives.

II. ETL PROCESS IN DATA MIGRATION

This section describes one of the problems of ETL processes, drawn from one of our current projects: a data migration from existing source systems to a new target application called EasyEBill (a pseudonym used throughout this paper). EasyEBill (see Fig. 1) is a billing system that will be operated by an energy supplier to satisfy new regulations and standards set by the authorities. It should also ensure that customers pay the right amount for their energy, protect them from large unexpected bills, and give the energy supplier the incentive to get billing right every time.

In this migration we are interested in problems encountered in the ETL process when contextual elements are not shared between the migration team members. If, in a new context, some of the common steps in the ETL process must be updated or removed, all actors should be aware of this change. We can distinguish two parts. The first concerns the contextual elements relevant at a given time (e.g. memory size, hard drives). The second concerns the values of these contextual elements at that moment (memory size: 70%, full; hard drives: HP-1, IBM-23). The migration members involved in the process must share the same instances or values of the contextual elements.

The main actors involved in the project are: Migration Manager, Business experts, Data Analysts, ETL Developers, Database Administrators (DBAs), Data architects, ERP Consultants and Testing consultants. These actors have different roles and viewpoints about the different activities that they should carry out together. For example, in the extraction phase of the ETL design process, questions that may be asked by the migration members include:
- Are all source systems identified?
- Is there any business object model?
- Are there any conceptual, logical or physical data models for source systems?
- Is there any model (conceptual, logical or physical) for the target system?
- Does the version of the source application in the production environment match that of the existing models?
- Is there any mapping between source and target data?
- Do we anonymize data extracted for testing purposes?
- Where to put extracted source data (mail, server directory, USB key, download from a website, etc.)?
- How to provide the data extracted from sources (flat files, database dump files, XML files, a copy of the source application, a copy of the source databases)?
- What is the size of the required source data?

Figure 1. ETL Process in Data Migration
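The distinction made in this section between contextual elements and the values they take at a given moment can be sketched in a few lines of Python. This is only an illustrative sketch under our own assumptions: the names `ActorContext` and `unshared_elements` are hypothetical and are not part of the Contextual Graphs software. The sample values reuse the memory-size and hard-drive examples given in the text.

```python
from dataclasses import dataclass, field


@dataclass
class ActorContext:
    """Hypothetical container: the contextual elements one actor
    considers, mapped to the values that actor currently holds."""
    actor: str
    values: dict = field(default_factory=dict)


def unshared_elements(contexts):
    """Return the elements that are not instantiated identically by all actors.

    The result maps each such element to {actor: value}, where a value of
    None means that the actor has not instantiated the element at all.
    """
    all_elements = set()
    for ctx in contexts:
        all_elements.update(ctx.values)

    report = {}
    for element in sorted(all_elements):
        per_actor = {ctx.actor: ctx.values.get(element) for ctx in contexts}
        if len(set(per_actor.values())) > 1:
            report[element] = per_actor
    return report


# Illustrative instantiations (values taken from the examples in the text).
developer = ActorContext("ETL Developer",
                         {"memory size": "70%", "hard drives": "HP-1"})
dba = ActorContext("DBA",
                   {"memory size": "70%", "hard drives": "IBM-23",
                    "source systems identified": "yes"})

report = unshared_elements([developer, dba])
for element, per_actor in report.items():
    print(element, per_actor)
```

In this sketch an element counts as shared only when every actor holds the same value for it, which mirrors the requirement that migration members share the same instances of the contextual elements.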

The above questions correspond to different contextual elements in the data extraction phase of the ETL process (with their known values). The problem is that these contextual elements are not shared by all the data-migration actors because the actors' viewpoints are generally left implicit. It is therefore necessary to coordinate all viewpoints and make them closer to and compatible with each other to avoid bad design and system failures. In other words, each actor has to determine the most relevant contextual elements and communicate them to the other actors. Our objective is to develop a shared context in which we make explicit the contextual elements considered by each category of actors and accepted by all the other categories: “Make explicit what is implicit”.

III. SHARED CONTEXT

Group work requires managing the group context explicitly, not only individual contexts. However, group context is not simply the union or intersection of individual contexts. A group member needs to have some knowledge about other members, but also about the context in which this knowledge is operational. This allows each member to know about the others but also to interpret and extrapolate the others' behavior. Our approach is based on sharing context between actors in order to make their viewpoints compatible and closer to each other in their different interactions [19][20] and collaboration [3][17]. Shared context means that the context of each actor must intersect the contexts of the others.

In a collaborative-design process, the shared context is the context of validity (or use) of the focus of the design. The shared context is built from contextual elements coming from the different experts' contexts. The building of the shared context results from an incremental enrichment with contextual elements coming from the individual contexts of experts. Thus, a contextual element proposed by an expert will enter the shared context if it is accepted (validated) by the other experts. Each expert has a mental representation (i.e. his individual context) of the design focus and of its context of validity (the shared context). A contextual element provided by an expert must be integrated in the other experts' mental representations, i.e. each expert must find a translation of this contextual element in his mental representation. Thus, the collaborative-design process results in making the different views among experts compatible, not necessarily identical, because all mental representations are different. Therefore, it is necessary to extend this shared context to cover all actors. In the example of Fig. 2, the ETL developer shares context (blue area) with all other actors.

Context has an infinite dimension and is not clearly defined. To deal with a large amount of contextual information, Brézillon and Pomerol [2] distinguish, for a given focus of attention, between three types of context, namely, external knowledge, contextual knowledge, and proceduralized context. The external knowledge is the knowledge that has nothing to do with the current focus. The contextual knowledge is the knowledge that is more or less relevant for the current focus of attention. Still for a given focus, the actor selects a part of the contextual knowledge to be proceduralized. The proceduralized context is the part of the contextual knowledge that is invoked, organized, structured and situated to be used at a given step of the decision making according to this focus. Shared contexts contain elements of the contextual knowledge for the building of the proceduralized context in the focus of attention of the data migration team members. These elements of knowledge in the shared contexts are extracted from the contextual knowledge of each category of actor.

Context can be modeled using an approach based on Contextual Graphs (CxGs) [1][2], where the contextual elements are acquired incrementally when needed. The following section explains how the Contextual-Graphs formalism can be used to implement the shared context between actors involved in the ETL process of the data migration.

Figure 2. Shared Context in the ETL process.

2011 11th International Conference on Intelligent Systems Design and Applications

IV. CONTEXTUAL GRAPHS FOR MODELING ETL PROCESS

Contextual graphs were initially designed in an application for incident solving on a subway line [1][2]. A contextual graph is an acyclic directed graph with one input, one output, and a serial-parallel organization of nodes connected by oriented arcs. There are different types of nodes in a contextual graph: actions, contextual and recombination nodes, sub-graphs and parallel grouping. A sub-graph allows the modeling of actor activities, and thus contextual graphs give a representation of the reasoning that is directly understandable by data migration team members whatever the granularity. A path is an ordered sequence of elements (contextual and recombination nodes and actions) of the contextual graph from the input to the output. Each path (a sequence of actions and contextual elements) represents a practice developed by an actor. In this section we present how contextual graphs can be used:
(1) to represent the actors' activities during the three phases of the ETL process in the project of migrating source data into the EasyEBill system (the target);
(2) to represent each actor's viewpoints in order to determine the most relevant contextual elements that should be shared.
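The definition of a path as a practice can be made concrete with a small sketch. The Python below is not the CxG tool itself; it is a simplified model under our own assumptions, in which a graph is a list of nodes, the hypothetical classes `Action` and `Contextual` stand for action and contextual nodes, and the recombination node is implicit (all branches of a contextual node rejoin before the next node). Sub-graphs and parallel groupings are left out.

```python
from dataclasses import dataclass


@dataclass
class Action:
    """An action node (hypothetical simplified model)."""
    name: str


@dataclass
class Contextual:
    """A contextual node: one branch (a list of nodes) per possible value."""
    element: str
    branches: dict


def practices(sequence):
    """Enumerate every path from the input to the output of the graph.

    Each path is a list of steps, either ("context", element, value)
    or ("action", name), and corresponds to one practice.
    """
    paths = [[]]
    for node in sequence:
        if isinstance(node, Action):
            paths = [path + [("action", node.name)] for path in paths]
        else:
            extended = []
            for path in paths:
                for value, branch in node.branches.items():
                    for tail in practices(branch):
                        step = ("context", node.element, value)
                        extended.append(path + [step] + tail)
            # Branches rejoin here: this plays the role of the recombination node.
            paths = extended
    return paths


# A toy graph with one contextual element and two exclusive values.
graph = [
    Contextual("new extraction needed", {
        "Yes": [Action("extract from sources")],
        "No": [Action("reuse previous extract")],
    }),
    Action("load into staging database"),
]

all_practices = practices(graph)
for practice in all_practices:
    print(practice)
```

Each list printed by the loop is one practice: the contextual value that selected the branch, followed by the actions carried out along it.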


A. Representation of actor’s activities

Data-migration actors continuously have to perform many tests and trials to find the best procedure to migrate data from the existing source systems to a target system. We want to show how, on the one hand, the contextual-graphs (CxGs) tool is used to model the three phases of the ETL process as carried out in our data migration project (Fig. 1), and, on the other hand, how CxGs enable actors to add new practices to the original procedure to take into account the specificity of their new context. In each phase of the ETL process, actors perform a set of activities depending on their contextual elements. As shown in Fig. 3, activities are represented by the pink ovals numbered 2, 3, 4, 7, 9, 10, 12 and 13, and contextual elements are represented by blue circles numbered 1, 5, 6, 8 and 11 (or CE1, CE5, CE6, CE8 and CE11 in the text). Each contextual element in Fig. 3 has two exclusive values: Yes and No. An activity is composed of a list of contextual elements and actor’s actions (see hereafter).

In Phase 1, in the case of a new data extraction (Value(CE1)=Yes), an actor may start the extraction by accessing data sources (RDBMSs, ERP, flat files, XML files, etc.) in order to retrieve the required data (Activity 2) and load the result into a staging database (Activity 4). Otherwise (Value(CE1)=No), the actor may use the data extracted previously (Activity 3) and load the data into the staging database (Activity 4).

In Phase 2, after a successful loading of data into the staging database (Value(CE5)=Yes), actors frequently perform data cleansing (Activity 7) and transform data according to business rules (Activity 9) if the data require conversion (Value(CE6)=Yes and Value(CE8)=Yes). When the loading of the staging database fails, Activity 10 is performed to report all the problems so that technical or business actors can correct them (in the case of bad data quality).

The third phase of the ETL process is performed if and only if the data in the staging database are successfully cleansed and transformed (Value(CE11)=Yes). Actors can then perform the extraction from the staging database (Activity 12) and load the data into the target system (Activity 13).


Figure 3. Contextual graph for the ETL process.
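The three phases described above can be read as a procedure that selects activities from the values of the contextual elements. The Python sketch below is one possible reading of the text, not a transcription of Fig. 3: the function name `select_activities` is hypothetical, and we assume (from the node numbering only) that CE6 guards Activity 7 and CE8 guards Activity 9. A contextual element missing from the input is treated as "No".

```python
ACTIVITIES = {
    2: "extract data from the sources",
    3: "reuse previously extracted data",
    4: "load into the staging database",
    7: "cleanse data",
    9: "transform data according to business rules",
    10: "report loading problems",
    12: "extract from the staging database",
    13: "load into the target system",
}


def select_activities(context):
    """Return the activity numbers selected by the given context.

    `context` maps "CE1", "CE5", "CE6", "CE8", "CE11" to "Yes" or "No".
    """
    def yes(element):
        return context.get(element) == "Yes"

    steps = []

    # Phase 1: new extraction or reuse, then load into staging.
    steps.append(2 if yes("CE1") else 3)
    steps.append(4)

    # Phase 2: only after a successful staging load (CE5).
    if yes("CE5"):
        if yes("CE6"):   # assumption: CE6 guards cleansing
            steps.append(7)
        if yes("CE8"):   # assumption: CE8 guards transformation
            steps.append(9)
    else:
        steps.append(10)

    # Phase 3: only if cleansing and transformation succeeded (CE11).
    if yes("CE11"):
        steps.extend([12, 13])
    return steps


full_run = select_activities(
    {"CE1": "Yes", "CE5": "Yes", "CE6": "Yes", "CE8": "Yes", "CE11": "Yes"})
failed_load = select_activities({"CE1": "No", "CE5": "No"})

for number in full_run:
    print(number, ACTIVITIES[number])
```

Each distinct combination of values yields one path through the graph, that is, one practice in the sense of the formalism.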

The contextual graph in Fig. 3 contains the activities performed by the different actors involved in the ETL process of the data migration. Each team member has to carry out at least one activity. Note also that the same activity can be performed differently by different actors because actors do not have the same viewpoints. This is discussed in the following section.

B. Representing actor’s viewpoints

ETL processes do not always succeed. One reason for failure is that actors involved in common activities do not share the same contextual elements during their collaboration. Hence it is important for actors to make their viewpoints explicit and close to each other to avoid conflicts and system failure. Fig. 4 illustrates the actions performed by an ETL Developer in the activity of extracting data (activity number 2 in Fig. 3) from the source information systems (for example, from an Oracle database) and loading it into a staging database (actions are represented by square boxes). The developer takes into consideration a set of contextual elements that may differ from that of a DBA extracting other data from the same source or another database source. For example, if contextual element 2 (CE2, “Database Connexion Parameters”) is instantiated by a DBA to “Identified”, and this value is shared with the ETL Developer, both actors will be able to connect successfully to the database (Action 3 or A3).

Figure 4. CxG for extracting data by the ETL Developer (Developer’s viewpoint)

Activity number 2 in Fig. 3 is performed by the DBA as shown in Fig. 5. Note that the DBA shares some contextual elements with the ETL Developer, but the actions performed by the two actors are not identical for the same activity (i.e. activity number 2). The DBA is the technical expert who has detailed knowledge of the database sources, and he generally communicates all the information that other users, such as developers, need to perform their tasks efficiently. Fig. 6 shows the contextual elements shared between the DBA and the ETL Developer to make their viewpoints explicit. For example, the Developer shares contextual element 2 (CE2) with a DBA in order to know the parameters for connecting to the data source.

Figure 5. CxG for extracting data by a DBA (DBA’s viewpoint)

The main purpose of contextual graphs lies in the possibility of easily introducing new practices into existing graphs. A new practice generally corresponds to a known practice with a few changes introduced by contextual nodes. Thus, a system based on contextual graphs either knows a practice or acquires it when needed.
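The way a contextual element such as CE2 enters the shared context once it has been accepted by the other experts can be sketched as follows. This is a minimal illustration under our own assumptions: the class `SharedContext` and its methods are hypothetical, and we assume that acceptance must be unanimous, which the text does not specify.

```python
class SharedContext:
    """Hypothetical model of incremental shared-context building."""

    def __init__(self, experts):
        self.experts = set(experts)
        self.pending = {}  # element -> {"value": ..., "accepted_by": set}
        self.shared = {}   # element -> validated value

    def propose(self, expert, element, value):
        """An expert makes a contextual element and its value explicit."""
        if expert not in self.experts:
            raise ValueError(f"unknown expert: {expert}")
        self.pending[element] = {"value": value, "accepted_by": {expert}}
        self._promote(element)

    def accept(self, expert, element):
        """Another expert validates a proposed contextual element."""
        if expert not in self.experts:
            raise ValueError(f"unknown expert: {expert}")
        self.pending[element]["accepted_by"].add(expert)
        self._promote(element)

    def _promote(self, element):
        # Assumption: the element becomes shared once all experts accept it.
        entry = self.pending[element]
        if entry["accepted_by"] == self.experts:
            self.shared[element] = entry["value"]
            del self.pending[element]


context = SharedContext(["DBA", "ETL Developer"])
context.propose("DBA", "CE2: Database Connexion Parameters", "Identified")
shared_before = dict(context.shared)   # still empty: not yet validated

context.accept("ETL Developer", "CE2: Database Connexion Parameters")
shared_after = dict(context.shared)
print(shared_after)
```

Until the ETL Developer accepts it, the element stays in the DBA's individual context only; after acceptance both actors rely on the same value.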

Figure 6. Illustrating shared context

V. FUTURE WORK

The paper has shown that it is possible to use contextual graphs to model and represent ETL processes based on sharing context and viewpoints between experts involved in a data migration project. In the case studied, we have pointed out not only the technical contexts related to the ETL process but also the contexts of the different interactions between the ETL expert (Developer) and other actors. In our future work, we will continue our research by considering the following aspects:


1) Using contextual graphs, we are now able to represent an actor's viewpoint in the different parts of the ETL process. This can be extended to represent all the actors’ viewpoints in order to build a real experience base.
2) Explore the possible interactions between contextual graphs representing different viewpoints, and their consequences.
3) Design and implement a context-based intelligent assistant system (CBIAS) that uses an experience base to help ETL experts and data migration actors in general. The experience base should be developed in a uniform representation of knowledge, reasoning and context.
4) Generalize and extend the context-based intelligent assistant system to other application domains.

VI. CONCLUSION

In this paper we have shown how to contextualize an ETL process based on sharing context and viewpoints between experts involved in a data migration project. We have illustrated how easy it is to represent different actors’ viewpoints in the data extraction by using contextual graphs (CxGs). This study is based on the notion of shared context, which has been applied in many applications, particularly in collaborative work in software design. We have shown how making shared context explicit can help to improve the ETL process and, thus, avoid conflicts between experts having different viewpoints. Our work is in the framework of building an experience base that can be used to design an intelligent support system for ETL experts.

REFERENCES

[1] P. Brézillon, L. Pasquier and J.-C. Pomerol, Reasoning with contextual graphs. European Journal of Operational Research, 2002, 136(2): 290-298.
[2] P. Brézillon, Context-based modeling of operators’ practices by contextual graphs. Proceedings of HCP-2003, the 14th Mini-Euro Conference on Human Centered Processes, R. Bisdorf (Ed.), Fonds National de la Recherche, Luxembourg.
[3] P. Brézillon, Explaining for developing a shared context in collaborative design. Proceedings of the 2009 13th International Conference on Computer Supported Cooperative Work in Design (CSCWD-2009), M.R.S. Borges, W. Shen, J.A. Pino, J.-P. Barthès, J. Luo, S.F. Ochoa, J. Yong (Eds.), IEEE Catalog Number CFP09797-CDR, Santiago, Chile, April 22-24, 2009.
[4] P. Brézillon, J. Brézillon and J.-C. Pomerol, Context-based improvement of decision making: Application for car driving. International Journal of Decision Support Systems and Technology, Vol 1, N°3, pp. 1-20.
[5] P. Brézillon, From expert systems to context-based intelligent assistant systems: a testimony. The Knowledge Engineering Review, Vol 26, N°1, pp. 19-24, 2011.
[6] X. Fan, P. Brézillon, R. Zhang and L. Li, Making context explicit towards decision support for a flexible scientific workflow system. In 2011 Workshop on Human-Centered Processes, pp. 3-9, http://ftp.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol696/.
[7] X. Fan, P. Brézillon, R. Zhang and L. Li, A context-based framework for improving decision making in scientific workflow. In 2011 3rd International Conference on Computer Research and Development, Vol 2, pp. 15-19, IEEE Computer Society, Shanghai, China.
[8] Federal Student Aid (2007). Data Migration Roadmap: A Best Practice Summary, http://federalstudentaid.ed.gov/static/gw/docs/ciolibrary/ECONOPS_Docs/DataMigrationRoadmap.pdf
[9] IBM, Best practices for data migration. http://www-935.ibm.com/services/us/gts/pdf/softek-best-practicesdata-migration.pdf
[10] R. Krishna and Sreekanth, An Object Oriented Modeling and Implementation of Web Based ETL Process. IJCSNS 10(2), 2010.
[11] Z. Li, J. Sun, H. Yu and J. Zhang, Commoncube-based conceptual modeling of ETL processes. In International Conference on Control and Automation (ICCA2005), pp. 131-136, 2005.
[12] M. Mrunalini, T.V. Suresh Kumar, D. Evangelin Geetha and K. Rajanikanth, Modelling of Data Extraction in ETL Processes Using UML 2.0, Vol. 26, No. 5, September 2006, pp. 3-9.
[13] L. Muñoz, J. Mazón, J. Pardillo and J. Trujillo, Modelling ETL Processes of Data Warehouses with UML Activity Diagrams. Lecture Notes in Computer Science, Volume 5333, 2008, pp. 44-53.
[14] L. Muñoz, J. Mazón and J. Trujillo, Automatic Generation of ETL processes from Conceptual Models. DOLAP’09, November 6, 2009, Hong Kong, China, ACM, 2009.
[15] G. Pick, Data Migration Concepts & Challenges. http://www.aymgael.com/pdf%20reports/Data%20Migration%20Concepts%20&%20Challenges.pdf
[16] P. Russom, Best Practices in Data Migration. http://download.101com.com/pub/TDWI/Files/TDWI_Monograph_BPinDataMigration_April2006.pdf
[17] F. Santoro, P. Brézillon and R. Araujo, Management of shared context dynamics in software design. Proceedings of the 9th International Conference on CSCW in Design (CSCWD-2005), W. Shen, A. James, K.-M. Chao, M. Younas, Z. Lin and J.-P. Barthès (Eds.), Coventry University, IEEE, Vol. 1, pp. 134-139.
[18] A. Simitsis, P. Vassiliadis, M. Terrovitis and S. Skiadopoulos, Graph-based modeling of ETL activities with multi-level transformations and updates. In: A.M. Tjoa, J. Trujillo (Eds.), DaWaK 2005, LNCS, vol. 3589, pp. 43-52. Springer, Heidelberg (2005).
[19] H. Tahir and P. Brézillon, Improvement of database administration by procedure contextualization. Proceedings of HCP 2011, Fourth Workshop on Human Centered Processes, Genoa (Italy), February 10-11, 2011, pp. 17-24.
[20] H. Tahir and P. Brézillon, Procedure contextualization for collaborative database administration. Proceedings of the 2011 15th International Conference on Computer Supported Cooperative Work in Design, Lausanne, Switzerland, June 8-10, 2011.
[21] P. Vassiliadis, A. Simitsis, P. Georgantas and M. Terrovitis, A framework for the design of ETL Scenarios. In Proceedings of the 15th International Conference on Advanced Information Systems Engineering, Velden, Austria, 16 June 2003.
