A Model and a Toolkit for Supporting Incremental Data Warehouse Construction

Paolo Naggar1, Luigi Pontieri2, Mariella Pupo3, Giorgio Terracina4, and Emanuela Virardi1

1 CM Sistemi - Via Nazario Sauro 1, 00100 Roma, Italy
2 ISI-CNR - Via P. Bucci, 87036 Arcavacata di Rende (CS), Italy
3 CM Sistemi Sud - Via Galluppi, 87100 Cosenza, Italy
4 DIMET - Università degli Studi “Mediterranea” di Reggio Calabria, Via Graziella, Località Feo di Vito, 89100 Reggio Calabria, Italy
{paolo.naggar, mariella.pupo, emanuela.virardi}@gruppocm.it, [email protected], [email protected]

Abstract. The design of data warehouses is a highly relevant issue in supporting managerial decision processes and data analysis. Since the design and maintenance of data warehouses are difficult tasks, enterprise managements are increasingly asking for tools capable of supporting designers in all the activities involved in data warehouse construction. In this context it is essential to give designers the capability to define data warehouse components incrementally. In this paper we propose a conceptual model, called Multidimensional Fact Network (MFN), which allows data marts to be defined incrementally, and a toolkit, called AURORA, which is based on MFN and provides a comprehensive set of data warehouse design tools.

R. Cicchetti et al. (Eds.): DEXA 2002, LNCS 2453, pp. 123–132, 2002. © Springer-Verlag Berlin Heidelberg 2002

1 Introduction

In recent years data warehouses have been recognized by a large number of organizations as a solution for exploiting the large quantity of information stored in their operational systems and for improving their decision processes. A data warehouse collects data coming from various sources, integrated and restructured, in order to support On-Line Analytical Processing (OLAP), typically based on the multidimensional paradigm. The design and maintenance of data warehouses is a difficult task that raises challenging issues in database research. However, since the availability of a data warehouse has proved to be a key factor in improving both managerial decision processes and data analysis, enterprise managements are increasingly asking for tools capable of supporting designers in all the activities involved in data warehouse construction. The most important activities in this context are: (i) the integration of the information stored in the enterprise databases, (ii) the definition of the data marts and (iii) the application of knowledge discovery techniques and workflow tools to the constructed data marts. While (i) and (iii) may be carried out by applying fairly standardized techniques, the definition of the data marts is strictly related to both the enterprise decision processes and the kind of analysis to be performed on the available data. Moreover, the analysis targets may change quite frequently over time. These considerations make clear the need to provide designers with tools for defining data warehouse components incrementally. Indeed, this seems to be the only way to cope with frequent changes in the enterprise's data warehouse requirements.

In this paper we propose a conceptual model, called Multidimensional Fact Network (MFN), which allows designers both to define data marts incrementally and to produce networks of multidimensional data marts. Data marts in a network may be obtained either from enterprise data or as the result of (possibly complex) transformations on other, previously defined, data marts. MFN is the core of a toolkit, called AURORA [1,2], intended to support designers in all the activities related to the construction and exploitation of a data warehouse, i.e. enterprise data integration, data mart construction and knowledge discovery.

Decision support analysis in a data warehouse is typically carried out by exploiting the OLAP paradigm [4], where data are represented in a multidimensional perspective and are queried by means of easy-to-use interactive operations. Since the structure of the multidimensional space imposes strict limitations on the set of queries that may be expressed, the conceptual design of a multidimensional database is a critical task. Several proposals for handling multidimensional data have been presented in the literature. In particular, some of them [3,8,13] define suitable logical data models and query languages which are able to support various OLAP operations. We, instead, are mainly interested in the conceptual modelling task, which we regard as a central aspect of the design process.
In connection with this aspect, [5,11] extend the Entity/Relationship model, whereas [6] introduces a new model, called Dimensional Fact Model (DFM), which is capable of representing several aspects of multidimensional data. The latter work also defines a methodology for designing a data warehouse by deriving a DFM scheme from E/R (or relational) models of the data sources. An object-oriented approach is adopted by [12], which presents a specialization of UML capable of expressing both static and dynamic features of multidimensional data. Our approach, by contrast, is devoted to supporting the incremental construction of a data warehouse by introducing a framework for designing new data marts either from enterprise data or from other, previously defined, data marts. In particular, the model proposed in this paper is novel with respect to those described so far, mainly because it explicitly represents, at a conceptual level, (possibly complex) transformations among multidimensional data. The plan of the paper is as follows. In Section 2 the Multidimensional Fact Network model is presented. The AURORA toolkit is described in Section 3. In the Appendix, the application of the MFN model to a real case is shown.

A Model for Supporting Incremental Data Warehouse Construction


Fig. 1. Example of a DFM scheme

2 The Multidimensional Fact Network

In [6] the DFM conceptual model was proposed for representing multidimensional facts. DFM is a graphical model conceived to support data mart design. It can be considered a specialization of the multidimensional model for warehousing applications. One of the main features of DFM is the graphical representation of fact schemes. The basic elements of a fact scheme are (i) the facts, representing concepts of interest in the decision process, (ii) the measures, which are attributes quantitatively describing the fact from different points of view, and (iii) the dimensions, i.e. discrete-valued attributes determining the minimal granularity of the fact representation. An example of a DFM scheme is shown in Figure 1. While DFM is a powerful and user-friendly model for representing fact schemes, it does not support the designer in the incremental definition of multidimensional data marts. As pointed out in the Introduction, the incremental definition of multidimensional data marts allows the designer to reuse previously defined facts to derive more complex ones. The result of this activity is a network of facts related to each other by mapping functions. In this network, single elements are obtained by transforming, with various kinds of mappings, pre-existing facts, possibly integrated with information available in the Business Data Warehouse (BDW) or in master tables external to the data warehouse. This capability of supporting the definition of networks of multidimensional facts is particularly important when dealing with the design of a complex Business Information Warehouse (BIW). In this section we propose a model, called Multidimensional Fact Network (MFN for short), for defining complex networks of multidimensional facts. This model is well suited to integration with complex toolkits for data warehouse design, such as the one developed in the AURORA project.
A Multidimensional Fact Network can be represented as an acyclic graph in which each node is associated with a multidimensional fact, whereas each arc of the form $\langle F_1, F_2 \rangle$ indicates that the fact $F_1$ participates in the definition of the derived fact $F_2$. A fact $F$ is said to be derived if it is the result of (possibly complex) transformations on other facts, and basic otherwise. Each fact is associated with a level representing the maximum number of fact transformations necessary to obtain that fact from basic ones. Derived facts are obtained by transforming and composing dimensions and measures of one or more source facts, which can themselves be either basic or derived. Given a fact $F$, the bag of facts $[F_i]$ such that an arc $\langle F_i, F \rangle$ exists in the network constitutes the bag of source facts of $F$. A fact can participate more than once in the definition of a derived fact and, therefore, we refer to bags of facts instead of sets. A graphical representation of a generic Multidimensional Fact Network is shown in Figure 2, where the facts $F_1, \ldots, F_n$ at level 0 are basic facts, whereas the other ones are derived facts. Note that a fact at level $i > 0$ in the network can be obtained from facts of any level $j < i$. Obviously, the definition of a derived fact requires the formal definition of the transformations involving both the dimensions and the measures of its source facts. In the following section we formalize the representation of basic and derived multidimensional facts in the MFN model.

Fig. 2. Example of a generic Multidimensional Fact Network

2.1 Definition of Multidimensional Facts in MFN

As explained in the previous section, facts in MFN can be either basic or derived. Derived facts at level $i$ are obtained from a bag of either basic or derived facts of any level $j < i$ in the network. Moreover, the same fact can participate more than once in the definition of a derived fact. Dimensions and measures of the derived fact are obtained both by applying suitable transformations to the dimensions and measures of the source facts and by linking the fact with external master tables and the BDW. The set of transformations to be applied to source dimensions and measures to obtain derived dimensions and measures can be defined in terms of suitable algebras on the bag of source facts. Each multidimensional fact $f$ in MFN can be represented as a tuple $f = \langle S, FB, AD, AM, lev \rangle$, where:

– $S$ is the scheme definition of the fact $f$;
– $FB$ is the bag of facts $f$ is derived from (this bag is empty if $f$ is basic);
– $AD$ is the algebra for the transformation of the dimensions;
– $AM$ is the algebra for the transformation of the measures;
– $lev$ is the level of $f$ in the network.
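
The tuple and the notion of level can be made concrete with a small data structure. The following Python fragment is a minimal sketch under stated assumptions, not part of AURORA: the class name Fact, its field names and the example facts (sales, stock, turnover, trend) are hypothetical. The level is computed as 0 for a basic fact and as one more than the highest level among the source facts otherwise, which is one way to read "the maximum number of fact transformations necessary to obtain that fact from basic ones".

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Fact:
    """One MFN node; the fields mirror the tuple <S, FB, AD, AM, lev>."""
    name: str                     # hypothetical label, not part of the tuple
    scheme: dict                  # S, here a plain dict with keys "M", "D", "N", "R"
    sources: List["Fact"] = field(default_factory=list)   # FB: a bag, repeats allowed
    dim_algebra: list = field(default_factory=list)       # AD
    measure_algebra: list = field(default_factory=list)   # AM

    @property
    def is_basic(self) -> bool:
        # A fact is basic exactly when its bag of source facts is empty.
        return not self.sources

    @property
    def level(self) -> int:
        # lev: 0 for a basic fact, otherwise one more than the deepest source.
        if self.is_basic:
            return 0
        return 1 + max(src.level for src in self.sources)


# Hypothetical network: two basic facts and two derived ones.
sales = Fact("sales", {"M": ["quantity"], "D": ["product", "day"], "N": [], "R": []})
stock = Fact("stock", {"M": ["on_hand"], "D": ["product", "week"], "N": [], "R": []})
turnover = Fact("turnover", {"M": ["ratio"], "D": ["product", "week"], "N": [], "R": []},
                sources=[sales, stock])
# The same fact may appear twice in the bag, and sources may sit at any lower level.
trend = Fact("trend", {"M": ["delta"], "D": ["product"], "N": [], "R": []},
             sources=[turnover, turnover, sales])

levels = {f.name: f.level for f in (sales, stock, turnover, trend)}
```

Storing the sources in a list rather than a set keeps the bag semantics: a fact that participates twice in a derivation simply appears twice.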


Obviously, for basic facts, only $S$ is meaningful. In the following, we will often use the notation $f.x$ to indicate the generic component $x$ of the tuple representing $f$; as an example, $f.S$ indicates the fact scheme associated with $f$. In what follows we formalize each of those components, one per subsection.

Scheme Definition of the Fact. The scheme definition of the fact follows the formalization proposed in [6]. The scheme $S$ of a multidimensional fact $f$ is defined as a tuple $\langle M, D, N, R \rangle$, where:

– $M$ is the set of measures of $f$, each defined by a boolean or numerical expression;
– $D$ is the set of dimensional attributes of $f$, each characterized by a discrete domain;
– $N$ is a set of non-dimensional attributes;
– $R$ is a set of functional dependencies used to represent various aggregation levels along dimensional hierarchies.

As described above, we use the dot notation to indicate the various components of the facts. This also extends to sub-components. As an example, $f.S.M$ indicates the set of measures of the fact scheme associated with the multidimensional fact $f$.

Dimension Transformation Algebras. Given a fact $f$ derived from a bag of facts $FB$, each derived dimension of $f$ is obtained from the dimensions of the facts in $FB$ (we will also call these dimensions source dimensions). Generally, the source dimensions are transformed and composed to obtain the derived dimensions. Such manipulations can be described by suitable transformation operators which formally define the way the derived dimensions are obtained from the source ones. Each derived dimension has an associated transformation operator. The set of these operators constitutes the Dimension Transformation Algebra for the fact $f$ under consideration and describes the relationship between each dimension of $f$ and the dimensions of the facts in $FB$. Formally:

Definition 1. Given a multidimensional fact $f = \langle S, FB, AD, AM, lev \rangle$, the Dimension Transformation Algebra $AD$ of $f$ on the bag of source facts $FB = [f_1, \ldots, f_n]$ is defined as:

\[
AD = \{ \langle op_k, d_k, D_k \rangle \mid op_k \text{ is a well typed computable function},\ d_k \in f.S.D,\ \emptyset \subset D_k \subseteq \bigcup_{f_i \in f.FB} f_i.S.D,\ (\forall f_i \in f.FB)(|f_i.S.D \cap D_k| \le 1) \}
\]

i.e. the set of transformation operators to be applied to the source dimensions to obtain the derived dimensions of $f$. $AD$ contains one tuple for each dimension in $f.S.D$. ✷
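
The set-theoretic constraints of Definition 1 can be checked mechanically. The sketch below is an illustration under stated assumptions rather than the AURORA implementation: the names DimOp and check_dimension_algebra, the fact-qualified dimension names (such as sales.day) and the example operators are all hypothetical. It verifies that each $d_k$ belongs to the derived fact, that each $D_k$ is non-empty and drawn from the source dimensions, that $D_k$ takes at most one dimension from each source fact, and that there is one tuple per derived dimension. It does not attempt to check that $op_k$ is well typed.

```python
from typing import Callable, List, NamedTuple, Set


class DimOp(NamedTuple):
    op: Callable            # op_k, the transformation operator
    derived: str            # d_k, the derived dimension it produces
    sources: Set[str]       # D_k, the source dimensions it reads


def check_dimension_algebra(derived_dims: Set[str],
                            source_dims: List[Set[str]],
                            algebra: List[DimOp]) -> List[str]:
    """Return the violations of Definition 1 found in `algebra` (empty if none).

    `source_dims` holds one set of dimension names per fact in the bag FB.
    """
    errors = []
    all_source_dims = set().union(*source_dims)
    for t in algebra:
        if t.derived not in derived_dims:
            errors.append(f"{t.derived}: not a dimension of the derived fact")
        if not t.sources:
            errors.append(f"{t.derived}: D_k is empty")
        if not t.sources <= all_source_dims:
            errors.append(f"{t.derived}: D_k uses an unknown source dimension")
        for i, dims in enumerate(source_dims):
            if len(dims & t.sources) > 1:
                errors.append(f"{t.derived}: more than one dimension from source fact {i}")
    produced = [t.derived for t in algebra]
    for d in sorted(derived_dims):
        if produced.count(d) != 1:
            errors.append(f"{d}: expected exactly one tuple, found {produced.count(d)}")
    return errors


# Hypothetical source facts with fact-qualified dimension names.
sales_dims = {"sales.product", "sales.day"}
stock_dims = {"stock.product", "stock.week"}

good = [
    DimOp(lambda p_sales, p_stock: p_sales, "product", {"sales.product", "stock.product"}),
    DimOp(lambda day, week: week, "week", {"sales.day", "stock.week"}),
]
bad = [
    DimOp(lambda p, d: (p, d), "product", {"sales.product", "sales.day"}),
    DimOp(lambda week: week, "week", {"stock.week"}),
]

good_errors = check_dimension_algebra({"product", "week"}, [sales_dims, stock_dims], good)
bad_errors = check_dimension_algebra({"product", "week"}, [sales_dims, stock_dims], bad)
```

In this example the first algebra passes, while the second is rejected because its product operator reads two dimensions of the same source fact.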

In the definition above, $op_k$ indicates a generic transformation operator, $d_k$ is the derived dimension generated by $op_k$, and $D_k$ is the (non-empty) set of dimensions of the facts in $FB$ generating $d_k$. Note that, for each tuple $\langle op_k, d_k, D_k \rangle$, the following conditions must hold:

– $op_k$ must be a well typed computable function defined on the domains of the dimensions in $D_k$;
– each element in $D_k$ is one of the dimensions of the facts in $FB$;
– for each fact $f_i \in f.FB$, $D_k$ contains at most one of its dimensions.

Definition 1 allows some of the dimensions of the facts in $FB$ not to generate any derived dimension. This may happen either when some source dimensions are not of interest for the definition of the derived ones, or when some source dimensions are heterogeneous, so that it is not meaningful to consider them in the derived facts. These dimensions can be ignored.

Measure Transformation Algebras. Reasoning analogous to that followed for derived dimensions applies to derived measures; indeed, given a fact $f$ derived from a bag of facts $FB$, each measure of $f$ is obtained from suitable transformations applied to information stored in the facts of $FB$. However, in this case, the derivation of new measures is a more complex task than the derivation of new dimensions. Derived dimensions, indeed, can be obtained from transformations performed only on other dimensions, whereas derived measures can be obtained from transformations on measures and on either dimensional or non-dimensional attributes of the facts in $FB$; moreover, their derivation can exploit information available in external master tables. This makes it necessary to define more complex transformation operators but, in return, yields a powerful tool for the analysis of the data stored in the data warehouse. Obviously, for each derived measure there must be a transformation operator; the set of such operators constitutes the Measure Transformation Algebra. Formally:

Definition 2.
Given a multidimensional fact $f = \langle S, FB, AD, AM, lev \rangle$, the Measure Transformation Algebra $AM$ of $f$ on the bag of source facts $FB$ is defined as:

\[
AM = \{ \langle op^m_k, m_k, M_k, A_k, T_k \rangle \mid op^m_k \text{ is a well typed computable function},\ m_k \in f.S.M,\ \emptyset \subset M_k \subseteq \bigcup_{f_i \in f.FB} f_i.S.M,\ \emptyset \subseteq A_k \subseteq \bigcup_{f_i \in f.FB} (f_i.S.D \cup f_i.S.N) \}
\]

i.e. the transformation operators to be applied to (i) the source measures, (ii) the source dimensional and non-dimensional attributes and, possibly, (iii) the set of data taken from external master tables $T_k$ to obtain the derived measures of $f$. ✷

Each transformation operator $op^m_k$ must be defined on at least one measure of the facts in $FB$, but can also involve dimensional and non-dimensional attributes of the facts in $FB$ and data taken from external tables $T_k$. The generic transformation operator for a measure must be a well typed and computable function defined as:

\[
op^m_k : M_k \times A_k \times T_k \to (\Delta(f) \to m_k)
\]


where $\Delta(f)$ denotes the space of the derived dimensions of $f$, i.e. $\Delta(f) = \times_{d \in f.S.D} Dom(d) = Dom(d_1) \times \ldots \times Dom(d_{|f.S.D|})$. The definition of $op^m_k$ above indicates that each transformation operator produces, starting from $M_k$, $A_k$ and $T_k$, a suitable function defined on the values of the set of derived dimensions. Note that there always exists a relationship between dimensions and measures; transformations on the dimensions are made in order to obtain the new measures of interest. Therefore, the values of derived measures are obtained from the values of derived attributes, suitably composed with the values of the elements in $M_k$, $A_k$ and $T_k$. Note that, in the definition of the Measure Transformation Algebra, no constraint is set on the number and kind of either measures or attributes from which $m_k$ can be derived. This allows complex relationships to be defined between the source facts and the derived ones. The application of the MFN model to a real case is presented in the Appendix.
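
The two-stage shape of a measure operator, which first takes the source material and then returns a function over the derived dimensions, maps naturally onto a closure. The following sketch is only an illustration: the revenue measure, the quantity data, the unit_price table standing in for an external master table, and the derivation of a month from the first seven characters of an ISO date are all assumptions made for the example and do not come from the paper.

```python
from typing import Callable, Dict, Tuple

SourceCoords = Tuple[str, str]      # (product, day) in the source fact
DerivedCoords = Tuple[str, str]     # (product, month), a point of Delta(f)


def make_revenue_op(quantity: Dict[SourceCoords, float],
                    unit_price: Dict[str, float]) -> Callable[[DerivedCoords], float]:
    """Hypothetical measure operator: takes M_k (quantity) and T_k (unit_price)
    and returns a function from Delta(f) to the derived measure."""
    def revenue(coords: DerivedCoords) -> float:
        product, month = coords
        # A_k: the source dimensions product and day select the cells to aggregate;
        # the month is assumed to be the first seven characters of an ISO date.
        total = sum(q for (p, day), q in quantity.items()
                    if p == product and day[:7] == month)
        return total * unit_price[product]
    return revenue


quantity = {
    ("pen", "2002-01-03"): 10.0,
    ("pen", "2002-01-20"): 5.0,
    ("pen", "2002-02-01"): 7.0,
    ("ink", "2002-01-03"): 2.0,
}
unit_price = {"pen": 1.5, "ink": 4.0}     # stands in for an external master table

revenue = make_revenue_op(quantity, unit_price)
```

Calling revenue(("pen", "2002-01")) aggregates the two January cells for that product and scales the total by the price taken from the master table, so the derived measure is evaluated point by point over the derived dimensions.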

3 Tool Description

In this section we illustrate the AURORA toolkit [1,2], developed at CM-Sistemi Sud in collaboration with Università di Reggio Calabria, Università della Calabria, Università di Bologna and ISI-CNR. As pointed out in the Introduction, AURORA aims to provide a comprehensive set of data warehouse design tools supporting designers in all the activities related to the construction of a data warehouse, i.e. (i) the integration of the information stored in the enterprise databases, (ii) the definition of the data marts and (iii) the application of knowledge discovery techniques and workflow tools to the constructed data marts. As for the data warehouse implementation, AURORA exploits a three-level architecture which keeps the integration of information sources and the derivation of multidimensional data for OLAP applications as two independent tasks. The architecture of AURORA is depicted in Figure 3. The information owned by the enterprise is integrated by the Integration Module, which receives as input the schemes of the enterprise databases and performs the integration task in order to obtain the set of reconciled data from the operational ones. If necessary, the designer can edit existing E/R diagrams or create new ones through a graphical interface provided by the E/R Diagram Designer module. The construction of data marts is performed by the Data Mart Builder module, which is the core of the AURORA architecture. As pointed out in the Introduction, AURORA is based on the Multidimensional Fact Network model, defined in the previous section, to support data mart design. In particular, the Data Mart Builder first derives the attribute trees from the source schemes under consideration. On the basis of these attribute trees, the First Level Multidimensional Facts Builder module allows the user to build basic multidimensional facts (i.e., first level multidimensional facts in the MFN network).
This module is part of a more general one, namely the Multidimensional Fact Network Builder. Once first level multidimensional facts are derived, the Multidimensional Fact Network Builder allows the designer to define more complex multidimensional facts as the result of (possibly complex) transformations on other, previously defined, multidimensional facts. All the multidimensional facts produced, as well as the overall Multidimensional Fact Network, are stored in the Multidimensional Fact Network Repository. The Data Mart Population Control Module populates the data marts stored in the repository with the enterprise data, which are stored in the Enterprise Databases. Finally, Knowledge Discovery and Data Mining Tools as well as Workflow Tools can be applied to the derived data.

Fig. 3. AURORA architecture

A schematic representation of the AURORA implementation is shown in Figure 4. The AURORA user interface is implemented in standard HTML with JavaScript and adopts a client-server architecture, so that it can be used in Web-based environments such as enterprise Intranets. The various modules of the AURORA architecture described above interact by means of an extended metadata component. This allows AURORA to be easily upgraded if new, more advanced techniques become available for some of the phases of data warehouse construction and exploitation. In Figure 4, the item named Methodology represents a module that traces all the steps of the construction of the data warehouse for the automatic generation of the necessary documentation. The QDE E/R Design tool provides the E/R diagram design facilities, whereas DIKE [9,10] implements the source integration module. The MFN provides the tools for the construction of Data Marts as Multidimensional Fact Networks. jMINING [7] is responsible for the front-end to mining techniques. The item named Workflow represents the set of workflow tools used to overlay workflow-based interpretative schemes on derived data marts. The Prototyper implements the Data Mart Population Control Module and, finally, the DDL builder is responsible for the automatic generation of DDL scripts for a generic database. Figure 5 presents a screen-shot of AURORA; in particular, the figure shows how multidimensional data marts can be defined graphically.

Fig. 4. Schematic representation of AURORA implementation

Fig. 5. AURORA screen-shot
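
The paper does not describe how the DDL builder works internally. Purely as an illustration of what generating DDL from a fact scheme could look like, the sketch below turns the dimensions and measures of one fact into a single CREATE TABLE statement. The function name fact_table_ddl, the column types and the choice of a composite primary key over the dimensions are assumptions made for this example.

```python
from typing import List


def fact_table_ddl(name: str, dimensions: List[str], measures: List[str]) -> str:
    """Build a CREATE TABLE statement for one fact scheme.

    The column types and the composite primary key are assumptions made for
    this sketch; they are not taken from the AURORA DDL builder.
    """
    columns = [f"    {d} VARCHAR(64) NOT NULL" for d in dimensions]
    columns += [f"    {m} NUMERIC" for m in measures]
    columns.append(f"    PRIMARY KEY ({', '.join(dimensions)})")
    return f"CREATE TABLE {name} (\n" + ",\n".join(columns) + "\n);"


# Hypothetical fact with two dimensions and one measure.
ddl = fact_table_ddl("sales", ["product", "day"], ["quantity"])
```

Using the dimensions as the key reflects their role in determining the granularity of the fact; a real generator would also need to handle hierarchies and non-dimensional attributes.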

References

1. Extended metadata component - Specifiche, Document project number 1704-STEC01-00-10 in IMI-MURST project “Aurora: un ambiente unitario di realizzazione de sistemi informativi direzionali”.
2. ATS Aurora Toolset - ATS progetto, Document project number 1704-STECH-0104-02 in IMI-MURST project “Aurora: un ambiente unitario di realizzazione de sistemi informativi direzionali”.
3. L. Cabibbo, R. Torlone. A Logical Approach to Multidimensional Databases. In Proceedings of International Conference on Extending Data Base Technology (EDBT’98), Springer-Verlag, 1998.
4. S. Chaudhuri, U. Dayal. An Overview of Data Warehousing and OLAP Technology. ACM SIGMOD Record 26(1): 65–74, 1997.
5. E. Franconi, U. Sattler. A Data Warehouse Conceptual Data Model for Multidimensional Aggregation. In Proc. Int. Workshop on Design and Management of Data Warehouses, 1999.
6. M. Golfarelli, D. Maio, S. Rizzi. Conceptual Design of Data Warehouses from E/R Schemes. Proceedings of the Hawaii International Conference on System Sciences, Kona (Hawaii), USA, 1998.
7. S. Greco, E. Masciari, L. Pontieri. Combining inductive and deductive tools for data analysis. AI Communications, 14(2), pp. 69–82, 2001.
8. W. Lehner. Modeling Large Scale OLAP Scenarios. In Proc. of Int. Conf. on Extending Database Technology (EDBT), Valencia, Spain, 1998: 153–167.
9. L. Palopoli, G. Terracina and D. Ursino. The System DIKE: Towards the Semi-Automatic Synthesis of Cooperative Information Systems and Data Warehouses. Proc. of Challenges of Symposium on Advances in Databases and Information Systems (ADBIS-DASFAA 2000), Prague, Czech Republic, pp. 108–117, 2000, Matfyzpress.
10. L. Palopoli, L. Pontieri, G. Terracina and D. Ursino. Intensional and extensional integration and abstraction of heterogeneous databases. Data & Knowledge Engineering, 35(3), pp. 201–237, 2000.
11. C. Sapia, M. Blaschka, G. Höfling, B. Dinter. Extending the E/R Model for the Multidimensional Paradigm. In Proc. of ER Workshops, Singapore, 1998: 105–116.
12. J. C. Trujillo, M. Palomar, J. Gómez. Applying Object-Oriented Conceptual Modeling Techniques to the Design of Multidimensional Databases and OLAP Applications. In Proc. First Int. Conf. Web-Age Information Management (2000): 83–94.
13. P. Vassiliadis. Modeling multidimensional databases, cubes and cube operations. In Proc. of 10th Int. Conf. on Scientific and Statistical Database Management (SSDB), Capri, 1998.
