Third International Conference on Informatics and Systems, Mar 19-22, 2005 Cairo University, Faculty of Computers and Information, Giza, Egypt
Entity Mapping Diagram for Modeling ETL Processes
Ali Hamed El Bastawesy, Mokhtar Boshra Riad, Abdeltawab M. A. Hendawi
Information Systems Dept., Faculty of Computers and Information, Cairo University, Egypt
a.elbastawesy, m.boshra, [email protected]
Abstract. Data warehousing (DW) is the process of moving data from the sources (relational databases, flat files, etc.) to the target data warehouse schema. During this process the incoming data passes through many reformatting and refining tasks known as Extraction, Transformation, and Loading (ETL) processes. Many researchers consider ETL processes the most important activities in any data warehouse project. Despite the importance of ETL processes, little research has been published in this area. In this paper, we suggest the Entity Mapping Diagram (EMD) as a graphical model for representing the ETL operations required to map data from sources to a target data warehouse or data mart. EMD is based on related work in this area, with some adaptation. EMD could save data warehouse designers time and effort. EMD covers two transformation levels, the attribute transformation level and the entity transformation level, to accomplish the task of data integration.
Keywords: Data Warehouse, ETL, Data Mapping
1 Introduction
Extraction-Transformation-Loading (ETL) tools are pieces of software responsible for the extraction of data from several sources and for its cleansing, customization, reformatting, integration, and insertion into a data warehouse. Building the ETL process is potentially one of the biggest tasks of building a warehouse; it is complex and time consuming. In many implementations, it can take the biggest part of the total warehouse implementation effort. Shilakes and Tylman in [5] mention that ETL and data cleaning tools are estimated to cost at least one third of the effort and expenses in the budget of the data warehouse, while Demarest in [6] and [7] mention that 70 to 80 percent of development time and effort in data warehouse projects is devoted to the ETL process. Inmon in [8] mentions that the ETL process costs 55% of the total costs of data warehouse runtime. In this research, we provide a complete framework for the modeling of ETL processes. To the best of our knowledge, this is the first complete attempt to propose EMD as an extension of the ERD model to capture the ETL activities used to build any DW schema. Briefly, our main contributions are:
- Proposing the EMD as a conceptual model for ETL processes.
- Proposing a palette of several constructs used to capture all activities of any ETL routine.
- Introducing two levels of transformations, entity level and attribute level, which are represented clearly in our graphical model.
- Applying the model to a simple example as a case study.
This paper is organized as follows. In section 2, we list the main functionalities of any ETL tool. In section 3, we present the related work. In section 4, we describe the EMD components and constructs. In section 5, we introduce a simple motivating example. Finally, in section 6, we conclude our work and outline future work.
2 Functionality of ETL tools
To give a general idea of the functionality of ETL tools, we mention their major tasks, which include:
- The identification of relevant information at the source side.
- The extraction of this information, taking into consideration the differences among DBMS technologies, operating systems, and hardware structures.
- Restructuring of keys, since operational input keys usually need to be restructured and converted before they are written out to the data warehouse.
- The customization, integration, and transformation of the information coming from multiple sources into a common format.
- The cleansing of the resulting data set on the basis of database and business rules, for example formatting verification and duplicate record elimination.
- Indexing the data.
- Renaming of data elements, and documenting those changes.
- The summarization and aggregation of the data.
- The propagation of the data to the data warehouse and/or data marts.
- Maintaining the metadata.
- Refreshing the warehouse with updated data.
More details about the functionality of ETL tools can be found in [1] and [9]. As shown in Figure 1, data is extracted from the various data sources to the Data Staging Area (DSA); then a set of transformations is applied to the incoming data to make it suitable for loading into the data warehouse schema. The DSA is usually used to store copies of the operational data sources and the temporary tables that hold intermediate results of the transformation and cleansing operations.
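The extract, stage, transform, and load flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the in-memory lists standing in for the sources, the DSA, and the warehouse, and the sample rows are all hypothetical and do not come from the paper.

```python
# Hypothetical in-memory stand-ins for two operational data sources.
SOURCE_A = [{"id": 1, "name": " alice "}, {"id": 2, "name": "BOB"}]
SOURCE_B = [{"id": 2, "name": "bob"}, {"id": 3, "name": "Carol"}]


def extract(sources):
    """Copy rows from every source into the Data Staging Area (DSA)."""
    staging_area = []
    for source in sources:
        staging_area.extend(dict(row) for row in source)
    return staging_area


def transform(staging_area):
    """Cleanse the staged rows: normalize formatting, then drop duplicates."""
    seen = set()
    cleaned = []
    for row in staging_area:
        name = row["name"].strip().title()   # formatting verification
        key = (row["id"], name)
        if key not in seen:                  # duplicate record elimination
            seen.add(key)
            cleaned.append({"id": row["id"], "name": name})
    return cleaned


def load(rows, warehouse):
    """Propagate the transformed rows to the (in-memory) warehouse table."""
    warehouse.extend(rows)
    return warehouse


warehouse = load(transform(extract([SOURCE_A, SOURCE_B])), [])
print(warehouse)
```

Keeping the staging step separate from the transformation step mirrors the role of the DSA: the sources are read once, and all cleansing works on the staged copies.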
Figure 1. A general framework for ETL processes (data sources such as files and RDBs; Extract into the DSA; Transform; Load into the DW)
3 Related Work
There is a relatively small body of research on the conceptual modeling of ETL processes. In [4] the authors attempt to provide a first model towards the conceptual modeling of DW ETL processes. They introduce a framework for the modeling of ETL activities. Their framework contains three layers. The lowest layer, the Schema layer, involves a specific ETL scenario. All the entities of the Schema layer are instances of the classes Concept, Attribute, Transformation, ETL Constraint, and Relationship. The second layer, the Metamodel layer, involves the aforementioned classes. The linkage between the Metamodel and the Schema layers is achieved through instantiation (“instanceOf”) relationships. The Metamodel layer implements the genericity desideratum: the five classes involved in the Metamodel layer are generic enough to model any ETL scenario through the appropriate instantiation. The last layer is the Template layer. The constructs in the Template layer are also meta-classes, but they are customized for the regular cases of ETL processes. Thus, the classes of the Template layer are specializations (i.e., subclasses) of the generic classes of the Metamodel layer (depicted as “IsA” relationships). After defining this framework, the authors present the graphical notation shown in figure 2 and the metamodel of their proposed graphical model. Then they detail and formally define all the entities of the metamodel.
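The three layers of [4] map naturally onto classes, subclasses, and instances. The sketch below is a loose illustration and not the authors' metamodel: the five generic class names come from the text, while the two template classes (SurrogateKeyAssignment, NotNullCheck), their shorthand labels, and the attribute names PKey and Cost are assumptions made for the example.

```python
# Metamodel layer: the five generic classes named in the text.
class Attribute:
    def __init__(self, name):
        self.name = name


class Concept:
    def __init__(self, name, attributes):
        self.name = name
        self.attributes = attributes


class Transformation:
    def __init__(self, label):
        self.label = label


class ETLConstraint:
    def __init__(self, text):
        self.text = text


class Relationship:
    def __init__(self, provider, consumer):
        self.provider = provider
        self.consumer = consumer


# Template layer: specializations ("IsA") of the generic classes.
# These two subclasses are hypothetical examples of frequent ETL cases.
class SurrogateKeyAssignment(Transformation):
    def __init__(self):
        super().__init__("SK")


class NotNullCheck(Transformation):
    def __init__(self):
        super().__init__("NN")


# Schema layer: one specific scenario, built by instantiation ("instanceOf").
s1_partsupp = Concept("S1.partsupp", [Attribute("PKey"), Attribute("Cost")])
dw_partsupp = Concept("DW.partsupp", [Attribute("PKey"), Attribute("Cost")])
sk = SurrogateKeyAssignment()
link = Relationship(provider=s1_partsupp, consumer=dw_partsupp)

print(isinstance(sk, Transformation), link.provider.name, link.consumer.name)
```

Subclassing expresses the Template layer's specialization of the generic classes, and creating objects expresses the Schema layer's instantiation of them.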
Figure 2. Notation for the conceptual modeling of ETL activities [4] (Concept, Attribute, Transformation, Note, ETL_Constraint, Active candidate, candidate, Part of, Provider 1:1, Provider N:M, Serial Composition)
The authors then use their graphical model to represent ETL processes in a motivating example. As shown in figure 3, two data sources (S1.partsupp and S2.partsupp) are used to build the DW concept (DW.partsupp).
Figure 3. Motivating example for conceptual model in [4] (S1.partsupp, "Recent Partsupp's", and S2.partsupp, "Annual PartSupp's", feed DW.partsupp; notes: "Duration < 4h", "Necessary providers S1 and S2", "Due to accuracy and small size"; transformation labels: U, PK, SK, F, V, NN, "$ to €", "American to European date", SysDate)
Notice that this diagram has no clear partitioning, that is, one part for the source side, another part for the destination side, and a third part for the mapping. The notation also makes no distinction between entity transformations and attribute transformations. User defined transformations are ignored in this model, which is another disadvantage. We try to handle these shortcomings in our EMD model. The conceptual model of [4] is complemented in [10, 14, 16] with the logical design of ETL processes as data-centric workflows. In [10] the authors model an ETL scenario as a graph, which they call the Architectural Graph, and they introduce some notations for this graph. They introduce importance metrics to measure the degree to which entities are bound to each other. In [14] the authors describe a framework for the declarative specification of ETL scenarios. They discuss the implementation issues and present a graphical tool, ARKTOS II, that facilitates the design of ETL scenarios based on their model. In [16] the author focuses on the optimization of ETL processes, in order to minimize the execution time of an ETL process. Regarding data mapping, in [11] the authors discuss issues related to data mapping in the integration of data. A set of mapping operators is introduced and a classification of possible mapping cases is presented. However, no graphical representation of data mapping scenarios is provided, so it is difficult to use in real world projects. In [12] a framework is proposed for supporting meta-data related applications in which models and mappings are manipulated. In [15] the authors compare the features and capabilities of some ETL tools, as shown in figure 4. They create an active ETL tool without proposing any graphical models. They use SQL queries to represent the mapping between the source and the target data. They show a case study based on real scale commercial data to verify their tool features.
Products compared: DTS, Oracle DW Builder, DataStage, Sagent
- Performance: DTS: average, high with SQL Server; Oracle DW Builder: high with Oracle; DataStage: depends on data source; Sagent: depends on data source.
- Maintenance: DTS: updating ActiveX programs; Oracle DW Builder: compiling modified programs; DataStage: modifying used objects; Sagent: modifying plan content.
- Reusability: DTS: defined functions; Oracle DW Builder: defined functions; DataStage: set of objects or plan; Sagent: reusable plan.
- Access to meta-data: DTS: none; Oracle DW Builder: none; DataStage: meta-data stage; Sagent: none.
- Plan for process extraction: DTS: plan with timer; Oracle DW Builder: plan with triggers; DataStage: event management; Sagent: Sagent automation.
- Data sources (data integration): DTS: Txt, via ODBC/OLE DB; Oracle DW Builder: flat files, Oracle, Sybase, via ODBC/OLE DB; DataStage: SQL Server, Oracle, DB2, Txt, Informix; Sagent: SQL Server, Oracle, DB2, Txt, Informix.
Figure 4. ETL Tools Capabilities
4 The proposed model (Entity Mapping Diagram)
To conceptualize the ETL processes used to map data from sources to the target data warehouse schema, we extended the Entity Relationship Diagram (ERD) by adding extra constructs that represent ETL tasks. We call the result the Entity Mapping Diagram, as it is used to map between an entity in the source and another entity in the destination DW.
4.1 Framework
Figure 5 shows the general parts of EMD.
- In the Data Source(s) part, we draw the participating data source tables and their participating attributes.
- In the DW schema part, we draw the data warehouse schema table (cube or dimension).
- In the Mapping part, we draw the ETL processes using our proposed constructs.
Notice that both the data sources and the data warehouse schemas should be known before starting to draw the EMD.
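One way to picture the three parts of an EMD is as a small data structure with a source part, a DW schema part, and a mapping part in between. The sketch below is an assumption made for illustration: the class names Entity and EntityMappingDiagram, the method add_mapping, and the rule that a mapping is rejected unless both of its ends are already declared are not defined in the paper. The rule only mirrors the remark that both schemas should be known before the diagram is drawn.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    name: str
    attributes: list


@dataclass
class EntityMappingDiagram:
    sources: list = field(default_factory=list)   # Data Source(s) part
    targets: list = field(default_factory=list)   # DW schema part
    mapping: list = field(default_factory=list)   # Mapping part

    @staticmethod
    def _qualified(entities):
        """Return names such as 'Products.ProductID' for all attributes."""
        return {f"{e.name}.{a}" for e in entities for a in e.attributes}

    def add_mapping(self, source_attr, target_attr, operation="loader"):
        """Record one mapping step; both ends must already be declared."""
        if source_attr not in self._qualified(self.sources):
            raise ValueError(f"unknown source attribute: {source_attr}")
        if target_attr not in self._qualified(self.targets):
            raise ValueError(f"unknown target attribute: {target_attr}")
        self.mapping.append((source_attr, operation, target_attr))


emd = EntityMappingDiagram(
    sources=[Entity("Products", ["ProductID", "ProductName"])],
    targets=[Entity("Dim_Products", ["ProductID", "ProductName"])],
)
emd.add_mapping("Products.ProductName", "Dim_Products.ProductName")
emd.add_mapping("Products.ProductID", "Dim_Products.ProductID", operation="TS")
print(emd.mapping)
```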
Figure 5. A block diagram for EMD (Data Source(s) Part: source tables and their attributes; Mapping Part: ETL processes; DW schema Part: DW cube or dimension)
4.2 Primitives of mapping notations
- Loader Relationship: used when the attribute data is moved directly, without any changes, from the source entity to the target one.
- Composite Loader Relationship: used when the data loaded into the output attribute is combined from two or more distinct input attributes.
- Optional Loader Relationship: used when the data loaded into the output attribute could be extracted from input attribute x or input attribute y.
- Entity Transformation Operations: this kind of transformation usually results in a temporary entity. Standard operators are used inside this construct; figure 6 (a) shows some of them.
- Built-in Attribute Transformation Operations: standard operations are used with this construct; figure 6 (b) shows a sample of them.
- User defined Attribute Transformation Operations: users can apply their own operations, so any number of transformations can be added, such as currency conversion functions, package (unit) conversions, and so on.
Notice that we put a symbol or shorthand of the operation inside the entity or attribute transformation construct.
The two levels of transformations are covered as follows:
- Entity level: covered using Entity Transformation Operations.
- Attribute level: covered using both Built-in Attribute Transformation Operations and User defined Attribute Transformation Operations.
(a) Entity transformation operations: Union (U), Intersection (∩), Join ( ), Difference (∆), Division (/), Multiply (*), Rename (RN), Duplicate Elimination (DE)
(b) Attribute transformation operations:
- Add (+), Subtract (-), Divide (/), Multiply (*)
- Rename (Ren), Concatenate (Con), Split (Sp)
- Data type conversion: ToNumeric (TN), ToString (TS), ToFloat (TF)
- Data format conversion: ToUpperCase (TUC), ToLowerCase (TLC), ToAmericanDate (TAD)
Figure 6. Some of entity and attribute transformation operations
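To make the two transformation levels concrete, the following sketch implements a handful of the operations from figure 6 in Python. It is an illustration under assumptions: rows are modeled as tuples, ToNumeric is taken to mean integer conversion, and Union uses set semantics (duplicates removed), none of which the paper specifies. All function and dictionary names are hypothetical.

```python
# Attribute level: built-in operations keyed by their shorthand.
ATTRIBUTE_OPS = {
    "TN": int,          # ToNumeric (assumed here to mean integer conversion)
    "TS": str,          # ToString
    "TF": float,        # ToFloat
    "TUC": str.upper,   # ToUpperCase
    "TLC": str.lower,   # ToLowerCase
}


def concatenate(*values, sep=" "):
    """Con: combine several attribute values into one string."""
    return sep.join(str(v) for v in values)


def split(value, sep=" "):
    """Sp: split one attribute value into several parts."""
    return value.split(sep)


# Entity level: operations over whole entities (lists of row tuples).
def duplicate_elimination(rows):
    """DE: keep the first occurrence of each row, preserving order."""
    seen = set()
    result = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            result.append(row)
    return result


def union(a, b):
    """U: rows that appear in either entity (set semantics assumed)."""
    return duplicate_elimination(a + b)


def intersection(a, b):
    """Rows that appear in both entities."""
    return [row for row in duplicate_elimination(a) if row in b]


def difference(a, b):
    """Rows of the first entity that do not appear in the second."""
    return [row for row in duplicate_elimination(a) if row not in b]


print(ATTRIBUTE_OPS["TS"](42), ATTRIBUTE_OPS["TUC"]("etl"))
print(union([(1,), (2,)], [(2,), (3,)]))
```

Entity operations return a new list each time, which matches the idea that an entity transformation usually produces a temporary entity.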
4.3 EMD Constructs
In figure 7, we show the basic constructs of the Entity Mapping Diagram.
Mapping constructs (shape name: what it represents)
- Cylinder: Schema
- Rectangle: Entity
- Oval: Attribute
- Solid arrow: Loader Relationship
- Dotted arrow: Composite Loader Relationship
- Connected arrows: Optional Loader Relationship
- Square with rounded edge: Built-in Attribute Transformation operation
- Square with triangle edge: User Defined Attribute Transformation operation
- Hexagon: Entity Transformation operation
- Rectangle with folded corner: User Note
Figure 7. Graphical constructs for EM-Diagram
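The three loader relationships among these constructs differ only in how the value of the output attribute is obtained. The sketch below shows one possible reading in Python. The function names are hypothetical, and the fallback rule used for the optional loader (take attribute x, and use attribute y only when x is missing) is an assumption, since the text says only that the value could come from either input.

```python
def loader(row, source_attr):
    """Loader relationship: move the value unchanged."""
    return row[source_attr]


def composite_loader(row, source_attrs, combine=" ".join):
    """Composite loader: build the output from two or more inputs."""
    return combine(str(row[attr]) for attr in source_attrs)


def optional_loader(row, first_attr, second_attr):
    """Optional loader: use the first input, else fall back to the second."""
    value = row.get(first_attr)
    return value if value is not None else row.get(second_attr)


# Hypothetical customer row shaped like DS1.Customer.
customer = {"CustomerID": 7, "FirstName": "Mona", "LastName": "Adel"}

print(loader(customer, "CustomerID"))
print(composite_loader(customer, ["FirstName", "LastName"]))
print(optional_loader({"x": None, "y": 5}, "x", "y"))
```

The composite case corresponds, for example, to filling a single CustomerName attribute in the warehouse from the FirstName and LastName attributes of a source.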
5 Motivating Example
To illustrate the usage of our proposed graphical model, we introduce a simple example. A company wants to build a data warehouse for monitoring the sales processes in its two branches. It has a relational data source described by schema DS1 for selling books (shown in figure 8), and another relational data source described by schema DS2 for selling general products (shown in figure 9). A relational data warehouse is designed to capture sales data from these two data sources. The star schema in figure 10 shows the design of the proposed data warehouse, which consists of one fact table and four dimension tables.
- Category (CategoryID, CategoryDescription)
- Author (AuthorID, AuthorName)
- Book (BookID, CategoryID, Title, ISBN, Year, Price, NoPages, BookDescription)
- Author-Book (AuthorID, BookID)
- Ordering (BookID, OrderID, Price)
- Book-Order (OrderID, Cus_CustomerID, CustomerID, OrderDate)
- Customer (CustomerID, FirstName, LastName, ZipCode, City, State)
Figure 8. Relational schema DS1 for books-orders database
- Customers (CustomerID, CompanyName, ContactName, City, State, PostalCode, PhoneNumber)
- Shipping-Methods (ShippingMethodID, ShippingMethod)
- Employees (EmployeeID, FirstName, LastName, Title, WorkPhne)
- Payment (PaymentMethodID, OrderID, paymenetID, PaymentAmount, PaymentDate, CreditCardNumber, CreditCardEXPDate, CardHoldersName)
- PaymentMethod (PaymentMethodID, PaymentMethod)
- Orders (OrderID, ShippingMethodID, EmployeeID, CustomerID, OrderDate, Quantity, UnitPrice, Discount, ShipDate)
- Products (ProductID, BrandID, ProductName)
- Order-Details (OrderID, ProductID, OrderDetailID, Quantity, UnitPrice, Discount)
- Brands (BrandID, BrandDescription)
Figure 9. Relational schema DS2 for products-orders database
- Dim_Products (ProductID, ProductName, CategoryID, CategoryName)
- Dim_Location (PostalCode, TerritoryID, TerritoryName, RegionID, RegionName)
- Fact_Sales (OredrID, OrderDetailID, ProductID, PostalCode, CustomerID, Date, Quantity, UnitPrice, Discount)
- Dim_Time (Date, Month, Quarter, Year)
- Dim_Customer (CustomerID, CustomerName)
Figure 10. Star schema for the proposed DW
Figure 11 depicts the Entity Mapping Diagram for building the products dimension from the desired data sources, passing through the required ETL activities. The explanation of this diagram is as follows: DS1: refers to the first data source (books-orders database). DS2: refers to the second data source (products-orders database). We may have more than two data sources.
There are two entities from each data source that participate in this diagram: Book (BookID, BookTitle, CategoryID) and Category (CategoryID, CategoryName) from the first data source, and Products (ProductID, ProductName, BrandID) and Brands (BrandID, CategoryName) from the second data source. DW1: refers to the data warehouse schema to which the data will be moved. We may have one or more DW schemas, one or more Data Mart (DM) schemas, or a combination of DW and DM. Dim_Products is a dimension entity found in DW1. In the middle of the diagram, the ETL processes are represented as a set of transformation steps, starting with a join operation between the Book and Category tables, then removing the redundant records by applying the duplicate elimination operation. A temporary entity (Temp1) is created to capture the intermediate results of the previous operations. Notice that the data of attribute Temp1.CategoryID could be loaded optionally from DS1.Book.CategoryID or DS1.Category.CategoryID. The same activities take place at the other site, which contains DS2, to produce the Temp2 table. After that, some attribute transformation operations take place before loading data into the target data warehouse. ++ is a user defined transformation operation applied to Temp1.ProductID to add 1000000 to each product code number as a user requirement. The ProductID and CategoryID data types are transformed to the string data type using the ToString (TS) operation. The Temp2 table is transferred to the site of DS1 using File Transfer Protocol (FTP), then a union operation (U) runs to combine the two tables. The loader relationships connected to the ProductName and CategoryName attributes mean that data is loaded from these two attributes to their corresponding attributes in the DW without any transformation. Figure 12 shows the SQL query for building the DW1.Dim_Products dimension table.
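The DS1 branch of this example (join, duplicate elimination, then the ++ and TS attribute transformations) can be traced step by step in plain Python. This is a sketch under assumptions: the sample rows and the function names are hypothetical, and the join is taken to be an inner equi-join on CategoryID, in which case Book.CategoryID and Category.CategoryID hold the same value and either one can feed Temp1.CategoryID.

```python
# Hypothetical sample rows for DS1.Book and DS1.Category.
books = [
    {"BookID": 1, "BookTitle": "Data Warehousing", "CategoryID": 10},
    {"BookID": 1, "BookTitle": "Data Warehousing", "CategoryID": 10},  # duplicate
    {"BookID": 2, "BookTitle": "Databases", "CategoryID": 20},
]
categories = [
    {"CategoryID": 10, "CategoryName": "DW"},
    {"CategoryID": 20, "CategoryName": "DB"},
]


def build_temp1(books, categories):
    """Join Book with Category, then apply duplicate elimination (DE)."""
    names = {c["CategoryID"]: c["CategoryName"] for c in categories}
    temp1, seen = [], set()
    for book in books:
        category_id = book["CategoryID"]
        if category_id not in names:      # inner join: skip unmatched books
            continue
        row = (book["BookID"], book["BookTitle"], category_id, names[category_id])
        if row not in seen:               # DE: drop redundant records
            seen.add(row)
            temp1.append(row)
    return temp1


def to_dimension_rows(temp1):
    """Apply ++ (add 1000000) and TS (ToString) before loading."""
    return [
        (str(product_id + 1_000_000), name, str(category_id), category_name)
        for product_id, name, category_id, category_name in temp1
    ]


temp1 = build_temp1(books, categories)
dim_rows = to_dimension_rows(temp1)
print(dim_rows)
```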
Figure 11. EM-Diagram for building Products dimension (DS1: Book and Category pass through DE into Temp1; DS2: Products and Brands pass through DE into Temp2; ++ and TS are applied to attributes, Temp2 is moved by FTP, and U combines both into DW1 DIM_Product)
FTP: File Transfer Protocol
++: Add 1000000 to each ProductID in Temp1
Create view DW1.DIM_Products (ProductID, ProductName, CategoryID, CategoryName) as
select str(Temp1.ProductID + 1000000), Temp1.ProductName, str(Temp1.CategoryID), Temp1.CategoryName
from Temp1
union all
select str(Temp2.ProductID), Temp2.ProductName, str(Temp2.CategoryID), Temp2.CategoryName
from Temp2
Figure 12. SQL query for the mapping from Temp1 and Temp2 to the products dimension
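For readers who want to run a query of this shape, the sketch below executes an equivalent view in an in-memory SQLite database from Python. It is an adaptation rather than the authors' script: SQLite has no str() function, so CAST(... AS TEXT) is substituted, the DW1 schema prefix is dropped, column names are given with AS aliases, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Temp1 (ProductID INTEGER, ProductName TEXT,
                        CategoryID INTEGER, CategoryName TEXT);
    CREATE TABLE Temp2 (ProductID INTEGER, ProductName TEXT,
                        CategoryID INTEGER, CategoryName TEXT);

    -- Hypothetical sample rows, one per temporary entity.
    INSERT INTO Temp1 VALUES (1, 'Data Warehousing', 10, 'DW');
    INSERT INTO Temp2 VALUES (501, 'Laptop', 7, 'Electronics');

    -- CAST(... AS TEXT) stands in for str(); + 1000000 is the ++ operation.
    CREATE VIEW DIM_Products AS
    SELECT CAST(ProductID + 1000000 AS TEXT) AS ProductID,
           ProductName,
           CAST(CategoryID AS TEXT) AS CategoryID,
           CategoryName
    FROM Temp1
    UNION ALL
    SELECT CAST(ProductID AS TEXT), ProductName,
           CAST(CategoryID AS TEXT), CategoryName
    FROM Temp2;
    """
)

rows = conn.execute("SELECT * FROM DIM_Products ORDER BY ProductID").fetchall()
for row in rows:
    print(row)
```

Because ProductID is text in the view, ORDER BY sorts it as a string, so '1000001' comes before '501'.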
6 Conclusions and Future Work
In our research, we have proposed a novel graphical model to represent the conceptual model of ETL processes. The new model, EMD, is an extension of ERD with its own constructs. Both entity and attribute levels of transformations can be modeled using the proposed model with more simplicity and flexibility than previous work in this field. Clearly, a lot of work remains to be done to complete our research approach. We intend to develop a tool based on the proposed EMD model to help draw the model and generate script code. A further challenge is to test the completeness of the proposed model on real world cases.
References
[1] W. H. Inmon, Building the Data Warehouse, John Wiley & Sons, Third edition, 2002.
[2] P. Vassiliadis, Data Warehouse Modeling and Quality Issues, Ph.D. thesis, June 2000. At http://citeseer.ist.psu.edu/vassiliadis00data.html.
[3] P. Vassiliadis, T. Sellis, A survey of logical models for OLAP databases, ACM SIGMOD Record, December 1999.
[4] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Conceptual Modeling for ETL Processes, In Proc. of the 5th ACM international workshop on Data Warehousing and OLAP, 2002, p. 14-21.
[5] C. Shilakes, J. Tylman, Enterprise Information Portals, Enterprise Software Team. http://www.sagemaker.com/company/downloads/eip/indepth.pdf.
[6] M. Demarest, The politics of data warehousing. http://www.hevanet.com/demarest/marc/dwpol.html.
[7] Oracle warehouse builder tutorial, at http://www.oracle.com.
[8] B. Inmon, The Data Warehouse Budget, DM Review Magazine, January 1997.
[9] A. Berson, S. J. Smith, Data Warehousing, Data Mining, & OLAP, McGraw-Hill, 1997.
[10] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Modeling ETL Activities as Graphs, Proc. of the 4th Intl. Workshop on the Design and Management of Data Warehouses (DMDW’02), Toronto, Canada (2002), p. 52-61.
[11] A. Dobre, F. Hakimpour, K. R. Dittrich, Operators and Classification for Data Mapping in Semantic Integration, In Proc. of the 22nd Intl. Conf. on Conceptual Modeling (ER’03), Volume 2813 of LNCS, Chicago, USA (2003), p. 534-547.
[12] P. Bernstein, E. Rahm, Data Warehouse Scenarios for Model Management, Proc. of the 19th Intl. Conf. on Conceptual Modeling (ER’00), Volume 1920 of LNCS, Salt Lake City, USA (2000), p. 1-15.
[13] P. Vassiliadis, Gulliver in the land of data warehousing: practical experiences and observations of a researcher, In Proc. of DMDW, p. 12.1-12.16, Stockholm, Sweden, 2000.
[14] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, A Framework for the Design of ETL Scenarios, In Proc. of the 15th CAiSE, Velden, Austria, June 16, 2003.
[15] R. Rifaieh, N. Aïcha Benharkat, Query-based data warehousing tool, In Proc. of the 5th ACM international workshop on Data Warehousing and OLAP, November 2002.
[16] A. Simitsis, Modeling and managing ETL processes, VLDB Ph.D. Workshop 2003.