CONCEPTUAL DATA WAREHOUSE DESIGN: AN AUTOMATED APPROACH Department of Information Science, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia Opim Salim Sitompul & Shahrul Azman Mohd Noah 43600 UKM Bangi, Selangor, Malaysia Tel.: + (603) 89216653 Main author email :
[email protected]
CONCEPTUAL DATA WAREHOUSE DESIGN: AN AUTOMATED APPROACH Department of Information Science, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia Author Email(s):
[email protected],
[email protected] Abstract: Data warehouse is becoming a necessity for today’s enterprises as the need for complex data analysis in company decision support systems becoming more and more demanding. Although many data warehouse systems have been developed, little has been said about its conceptual design. Some research works related to the conceptual data warehouse design were also trying to automate the design process. In this paper, we present an approach and a tool in designing conceptual data warehouse model based on an existing ER model whereby the ER model is formulated into a language specification model, transformed into initial problem domain model, and then progressively translated into multidimensional model. As the translation process proceeds, the problem domain model is modified and new facts are added accordingly. The transformed multidimensional model will then undergo a refinement process whereby users are allowed to alter or further refine the model. This is achieved from a series of intelligent interactions between the tool and the user. Keywords: conceptual model, data warehouse, intelligent tool.
1. INTRODUCTION Data warehouse concept encloses a wide range of aspects such as application tools, architectures, information services, and communication infrastructures to synthesize information useful for decision making from distributed heterogeneous operational data sources (Golfarelli et al. 1998). This concept is supported with specific data model called multidimensional model by which users could view data from different dimension necessary for data analysis purposes. Multidimensional model is universally agreed as the basis of data warehouse implementation. In multidimensional model, data are represented in terms of fact and dimensions where each fact is associated to multiple dimensions. In this manner, facts are the focus of interest by which they are analyzed through the quantifying context stored in measures and the qualifying context determined through dimension levels (Hüsemann et al. 2000). Categorizing data along dimensions is a mean to organize them into hierarchical levels so that data can be viewed from their finer to coarser granularities (Agrawal et al. 1997). In this paper, we present our methodology on developing the multidimensional model for data warehouse design from the ER diagram of an operational database. The initial ER diagram is translated into a language specification model where it is subsequently parsed and formed into an initial problem domain model. The problem domain model is
progressively transformed into the multidimensional model. This approach is called the transformation-oriented methodology (Marotta & Ruggia 2002). The methodology is implemented by developing an intelligent tool to translate the ER model into the multidimensional model (Opim Salim & Shahrul Azman 2002, 2003). It is realized, however, that the automated tool would not provide full satisfaction to users as this tool merely transforms the ER model automatically based on the knowledge it obtained while the transformation progressed. To overcome this drawback, the automated tool is prepared with a refinement module which interacts with user through an intelligent interactive dialog. 2. RELATED RESEARCH Several research works have been conducted that propose different approaches to the design of multidimensional model for data warehouse (see for example Golfarelli et al. (1998), Tryfona et al. (1998), Hüsemann et al. (2000), Moody & Kortink (2000), Phipps & Davis (2002)). All these works are more or less using the transformation-oriented methodology. Golfarelli et al. (1998), after defining facts for the data warehouse design, started the transformation by building attribute trees from the attributes of the entities and the relationships of the ER model. Further processing of the attribute tree is done by removing unwanted attributes along with its descendants entirely from the tree (pruning), or removing only the unwanted attribute while still keeping its descendants (grafting). From the trimmed tree, the dimension, fact attributes (measures), and dimension hierarchies could be built. As in the work of Tryfona et al. (1998), the construction of their StarER model was initiated by a one-to-one translation of the ER model into the star schema. This model has several constructs, such as fact set, entity set, relationship set, and attributes. The fact set represents the process of generating data over time, each time an event related to that fact take place. The relationship set is the basis for constructing dimensions and hierarchies along a dimension. One important relationship is a fact and the time the fact is performed. This time dimension could be used to summarize the properties of a fact according to the time granularity. The entity set and attributes have the same meaning as in the ER model. Hüsemann et al. (2000) proposed their design with an assumption that the requirement analysis has been carried out and there is a global operational ER schema available that has been analyzed to determine measures, dimensions and initial OLAP queries. The design process is divided into three sequential phases: context definition of measures, dimensional hierarchy design, and definition of summarizability constraints. The design starts with determination of functional dependencies from dimensional level to measures and continued with developing dimension hierarchies for each dimension by determining all functional dependencies among dimensions. The last stage is the separation of meaningful aggregation of measures from meaningless ones according to aggregation functions, such as SUM, AVERAGE, and COUNTS.
Moody & Kortink (2000) developed their design in four steps, i.e. classifying entities, identifying hierarchies, producing dimensional models, and evaluating and refining the model. In the first step, all entities from the ER model are classified into three categories, namely transaction, component, and classification entities representing facts, dimensions, and dimension hierarchies, respectively. The second step is identifying any sequence of entities related by one-to-many relationships, which are aligned in the same direction. The third step is producing the dimensional model by applying two operators: collapse hierarchy and aggregation. The last step is the evaluation and refinement step such as combining facts and dimension tables, many-to-many relationships, and handling subtypes. Phipps & Davis (2002) generate the multidimensional schema based on ER schema (represented in form of table data structures) with numeric fields and many relationship representing measures and dimensions. The generation of the multidimensional schema is performed in five steps: finding entities with numeric fields, which becoming fact nodes of the fact schema; creating numeric attributes (measures) for each fact node from the numeric fields; creating date and/or time dimensions for each fact node; creating dimension from non-numeric, non-key, and non-date fields; and creating dimension hierarchies from relationships amongst entities. 3. DESIGN METHODOLOGY The design methodology for the data warehouse conceptual model consists of five stages: 1. 2. 3. 4. 5.
Creating the specification language from the ER model Creating an initial problem domain model Analyzing and adding more facts into the problem domain model Creating multidimensional model Refining multidimensional model
3.1 Creating the Specification Language from the ER Model The ER specification language records three main constructs of an ER diagram, namely entity, attribute, and relationship so that the tool will ‘understand’ its properties and semantic prior to processing. For example, a portion of ER diagram representing a student entity in a university database is specified in the language specification as follows: CLASS “STUDENT” ATTRIBUTE ((“Class”: Integer)) IDENTIFIER NIL SUBCLASS (“GRAD_STUDENT”) AGGREGATION NIL RELATIONSHIP ((“Minor” “DEPARTMENT” “NIL” “(1 1)” “(1 n)”)\ (“Major” “DEPARTMENT” “NIL” “(1 1)” “(1 n)”)\ (“Registered” “CURR_SECTION” “((“Count”: Integer))” “(1 n)” “(1 m)”)\ (“Transcript” “SECTION” “((“Grade”: Float))” “(1 n)” “(1 m)”)) End-Class
3.2 Creating Initial Problem Domain Model The initial problem domain model is created by parsing the language specification and composing elements of the entity class into compound terms (adapted from Luger & Stubblefield (1998)), which is a list with the first element is the property name and the remaining elements are the name of the entity and the value of the property. Using this representation, the facts acquired from the “STUDENT” entity are formulated as: HAS-ATTRIBUTE “STUDENT” ((“Class”: Integer)) HAS-SUBCLASS “STUDENT” (“GRAD_STUDENT”) HAS-RELATIONSHIP “STUDENT” ((“Minor” “DEPARTMENT” “NIL” “(1 1)” “(1 n)”) (“Major” “DEPARTMENT” “NIL” “(1 1)” “(1 n)”) (“Registered” “CURR_SECTION” “((“Count”: Integer))” “(1 n)” “(1 m)”) (“Transcript” “SECTION” “((“Grade”: Float))” “(1 n)” “(1 m)”))
3.3 Analyzing and Adding More Facts into the Problem Domain Model The initial problem domain model is the starting point of the transformation process in which each fact is progressively analyzed by applying production rules and translation rules. The analysis includes modifying format of the fact’s values into association list; adding inherited attribute and identifier; adding indirect subclass and direct/indirect superclass; adding new entities from relationship with numeric attribute; and classifying attributes into three categories: numeric attributes, temporal attributes, and other attributes. As a result of the analysis, more facts will be added into the problem domain model. Continuing with the STUDENT entity, facts related to this entity could be found in the problem domain model as follows: (Has-Attribute "STUDENT" (((Numeric-Att (Class . Integer)) (Date-Att) (Other-Att)) (Inherited-Attribute ((Numeric-Att) (Date-Att (Bdate . Date)) (Other-Att (Composite . Name) (Fname . String[15]) (MInit . String[3]) (Lname . String[20]) (Composite . End) (Ssn . String[12]) (Sex . String[1]) (Composite . Address) (No . String[4]) (Street . String[20]) (AptNo . String[4]) (City . String[15]) (State . String[2]) (Zip . String[5]) (Composite . End)))))) (Has-Identifier "STUDENT" (Inherited ("Ssn"))) (Has-Subclass "STUDENT" (("GRAD_STUDENT") ("PHD_STUDENT" "MASTERS_STUDENT"))) (Has-Superclass "STUDENT" ("PERSON")) (Has-Relationship "STUDENT" (((Name . of) (Participating-obj . TRANSCRIPT) (Rel-Attribute . NIL) (First-constraint . (1 n)) (Second-constraint . (1 1))) ((Name . of) (Participating-obj . REGISTERED) (RelAttribute . NIL) (First-constraint . (1 n)) (Second-constraint . (1 1))) ((Name . Major) (Participating-obj . DEPARTMENT) (Rel-Attribute . NIL) (First-constraint . (1 1)) (Second-constraint . (1 n))) ((Name . Minor) (Participating-obj . DEPARTMENT) (Rel-Attribute . NIL) (First-constraint . (1 1)) (Second-constraint . (1 n)))))
3.4 Creating Multidimensional Model Basic constructs of multidimensional model i.e. facts, fact attributes (measures), and dimensions (including dimension hierarchies) are created after the analysis of the problem domain has been completed. Each entity that has numeric attribute is potentially becoming a fact (fact entity). This fact, together with measures and dimensions forms a fact scheme. In this stage, the three attribute categories, namely numeric, temporal, and other attributes obtained while analyzing the problem domain model are transformed into measures, temporal dimension, and dimensions hierarchies, respectively. The multidimensional model could be represented as an attribute tree, where the identifier of the fact entity becomes the root of the attribute tree and attributes included in its other category grow into sub-trees. Sub-trees are also generated from attributes included in other category of entities that have a many-relationship to the fact entity. If there is another entity that has many-relationship to any of those entities, more sub-trees will be added recursively to the existing sub-tree (see Figure 1). As can be seen from Figure 1, sub-tree rooted at DeptName is generated from a DEPARTMENT entity that has a manyrelationship with the STUDENT entity and a COLLEGE entity that has a manyrelationship with the DEPARTMENT entity.
Figure 1 : Attribute tree for STUDENT entity In its implementation, sub-trees represent dimension hierarchies. To identify each dimension level in the dimension hierarchy, a number is added to the attribute, where 0 represents the root of the attribute tree. As there is a chance to get redundant attributes when inheriting them from superclasses, the system will only insert a new attribute into the dimension hierarchy if similar attribute has not yet exist. As a result of creating the multidimensional model for the STUDENT entity, the following facts are added into the problem domain model.
(Has-Measure "STUDENT" ((Class . Integer))) (Has-Dimension "STUDENT" ((Temporal Dimension ((Bdate . Date))) (Other Dimension ((Composite . Name) (Fname . String[15]) (MInit . String[3]) (Lname . String[20]) (Composite . End) (Ssn . String[12]) (Sex . String[1]) (Composite . Address) (No . String[4]) (Street . String[20]) (AptNo . String[4]) (City . String[15]) (State . String[2]) (Zip . String[5]) (Composite . End))))) (Has-Hierarchy "STUDENT" ((Ssn . 0) (Name . 1) (Fname . 2) (MInit . 2) (Lname . 2) (Sex . 1) (Address . 1) (No . 2) (Street . 2) (AptNo . 2) (City . 2) (State . 2) (Zip . 2) (DeptName . 1) (DeptPhone . 2) (Office . 2) (CollegeName . 2) (CollegeOffice . 3) (Dean . 3)))
3.5 Refining Multidimensional Model The automatic translation of ER model into multidimensional model could only provide users with the ‘first cut’ of the multidimensional model. User refinement is still needed in order to setup the final model that suits his requirement. The refinement stage of the multidimensional model creation process provides users with the possibility of modifying each constructs of the fact scheme i.e. measures, temporal dimension, and dimension hierarchies through a series of dialogs as demonstrated in the following sub-sections. 3.5.1
Refining measure(s)
Dialog with users to refine the measure(s) of the multidimensional model is illustrated as follows: Refining Measure(s) > You have chosen to modify measure(s) of the "STUDENT" fact scheme which is (Class). Do you wish to refine this measure(s)? [Y]es - [E]xplain >>> e > Measure is the focus of interest in data warehouse design. It is represented as attributes of a fact upon which operations such as count/sum/average/max/min are performed. Measures are always quantitative in nature. Continue? [Y]es - [D]one >>> y > Please enter each measure(s), separated by space Ex.: (number-of-sales integer) (qty-sold float) >>> (Number_of_Grad_Student integer) > You are providing the ((Number_of_Grad_Student integer)) measure(s). Continue? [O]K - [N]o >>> o > End of measure(s) refinement. >>>
In the dialog, user could directly refine the measures or ask for explanation first. Because measures are quantitative entities on which users focus their analysis, then to refine the measures they should enter the name of each measure and its numeric type (integer or float). 3.5.2
Refining temporal dimensions
The temporal dimension would be the dimension that always undergoes refinement, as time is very important in data warehousing. For this purpose, users could refine it as illustrated in the following dialog. Refining Temporal Dimension > You have chosen to modify temporal dimension of the "STUDENT" fact scheme, which is Bdate. Do you wish to refine this dimension? [Y]es - [E]xplain >>> y > Will you choose one of the available dimensions or provide your own dimension? [C]hoose - [P]rovide >>> c > OK. Then please choose one of the dimensions below: (1) day - week - month - year (2) month - semester - year (3) month - quarter - year Your Choice: >>> 2 > You choose to take (month semester year) option. Continue? [O]K - [N]o >>> o > End of temporal dimension refinement. >>>
As with measures, users could ask for explanation first or make the refinement directly. For this refinement, users could choose the dimension from several options available or provide their own by typing each dimension level separated by space. 3.5.3
Refining dimension hierarchies
The dimension hierarchies are represented as an attribute tree so that users could do the refinement by pruning and/or grafting. As with measures and temporal dimension, a brief explanation could be acquired by users on what pruning and grafting are. Users could do pruning or grafting on a single node of the tree or a set of nodes. However, users should be aware that pruning a node will remove that node as well as all its descendants, so the choice of pruning or grafting the trees should be done properly.
In addition to pruning and grafting the attribute tree, users could also perform aggregation in which two or more dimensions are combined into a separate dimension hierarchy and if necessary, adding more dimensions to formulate coarser or finer granularities. To illustrate this process, we will look at the refinement of the dimension hierarchies of the STUDENT entity as shown in the following dialog. Refining Hierarchical Dimension > You have chosen to modify hierarchical dimension of the "STUDENT" fact scheme. You can do pruning, grafting, or creating aggregation on the attribute tree. Do you wish to refine this dimension? [Y]es - [E]xplain >>> y > OK. Then please choose: (1) Pruning (2) Grafting (3) Aggregating Your Choice: >>> 1 > You choose to prune the dimensions. Which dimension do you wish to modify? (Choose among the dimension hierarchy) >>> name no street aptno zip > You are going to prune the (name no street aptno zip) dimension. [O]K - [N]o ? >>> o > Hierarchical dimension modified. Continue? [Y]es - [D]one >>> d > End of hierarchical dimension refinement. >>>
4. RESULTS AND DISCUSSION Some aspects of design approach taken in this work are worth mentioning. First of all is the formulation of a language specification model to capture semantics of the ER constructs with the intention that the transformation process could be done in a straightforward manner. The language specification model is easy to be formulated into a specific knowledge representation suitable for the subsequent transformation. Another aspect is the usage of association lists, which turn out to be a very convenient data structures. The association list is used in many parts of the codes to gather componentvalue pairs, such as attribute and its type, classification and its component, as well as dimension and its current level. This feature for instance, makes the grafting algorithm for the attribute tree becomes very simple: remove the unwanted dimension from the list and for each descendant, subtract the number representing the dimension level by one.
Formulation of the problem domain model makes the application of various rules for the transformation process is easy and maintainable, such as modifying existing facts, adding new facts, and maintaining the integrity of the problem domain model by removing redundant facts (see Opim Salim & Shahrul Azman 2003 for details). Lastly, users could obtain the final outputs of the transformation process in form of text files containing the fact scheme of all facts. A portion of the final output for the STUDENT fact scheme resulted after refining the multidimensional model can be seen in Table 1. Table 1: Result of the refinement for STUDENT fact scheme Automatically Generated Fact Scheme: "STUDENT" Measure(s): Class Temporal Dimension: Bdate Hierarchical Dimension: Name Fname MInit Lname Sex Address No Street AptNo City State Zip DeptName DeptPhone Office CollegeName CollegeOffice Dean
After Refinement Fact Scheme: "STUDENT" Measure(s): Number_of_Grad_Student Temporal Dimension: Month Semester Year Hierarchical Dimension: Sex City State Country DeptName CollegeName
5. CONCLUSIONS AND FURTHER RESEARCH In this paper we have presented an approach for the automatic conceptual data warehouse design process. The proposed methodology consists of five stages: creating the specification language from the ER model, creating initial problem domain model, analyzing and adding more facts into the problem domain model, creating multidimensional model, and refining multidimensional model. The last stage is necessary because interaction with user cannot be avoided in order to obtain a design that fully suit user requirement. The approach taken in this work has brought into perspective a preliminary step in applying some AI techniques in the realm of conceptual data warehouse design as has been
achieved within the context of database analysis and design (Noah & Williams 1998, 2000). As a design tool, the system developed is still in its early state, where the multidimensional model obtained is in form of plain text. It is our intention to enhance this tool by providing a graphical representation of the multidimensional model as well as applying further AI techniques. 6. REFERENCES Agrawal, R., Gupta, A., & Sarawagi, S. 1997. Modeling multidimensional databases. Proceedings of the 13th International Conference On Data Engineering (ICDE'97). Birmingham, U.K.: 232-243. Golfarelli, M., Maio, D. and Rizzi, S. 1998. Conceptual design of data warehouses from E/R schemes. Proceedings of 31st Hawaii International Conference on System Sciences VII. Kona, Hawaii: 334-343. Hüsemann, B., Lechtenbörger, J. & Vossen, G. 2000. Conceptual data warehouse design. Proceedings of the International Workshop on Design and Management of Data Warehouse (DMDW ‘2000). Stockholm, Sweden: 6-1 – 6-11. Luger, G. F. & Stubblefield, W. A. 1998. Artificial Intelligence: Structures and Strategies for Complex Problem Solving. 3rd Edition. Addison-Wesley: England. Marotta, A. & Ruggia, R. 2002. Data warehouse design: A schema-transformation approach. Proceedings of the XXII International Conference of the Chilean Computer Science Society (SCCC’02). Montevideo, Uruguay: 153- 161. Moody, D. and Kortink, M.A.R. 2000. From enterprise models to dimensional models: A methodology for data warehouse and data mart design. Proceedings of the International Workshop on Design and Management of Data Warehouses 2000 (DMDW’2000). Stockholm, Sweden: 5-1 – 5-12. Noah, S. A. and Williams, M. 1998. An evaluation of two approaches to exploiting realworld knowledge by intelligent database design tools. Proceedings of the 17th International Conference on Conceptual Modelling, Singapore. Springer-Verlag, Berlin: 197-210. Noah, S. A. and Williams, M. 2000. Exploring and validating the contributions of realworld knowledge to the diagnostic performance of automated database design tools. Proceedings of the Fifteenth IEEE International Conference on Automated Software Engineering (ASE 2000). Grenoble, France:177-185. Opim Salim Sitompul & Shahrul Azman Mohd. Noah. 2002. Translation of ER Model to multidimensional model for data warehouse – an automated approach. International Journal of IT. 3: 11-32.
Opim Salim Sitompul & Shahrul Azman Mohd. Noah. 2003. Rules for the automatic translation of ER model into multidimensional model. Proceedings of Advanced Technology Congress (ATC’2003). Putrajaya, Malaysia. To appear. Phipps, C. and Davis, K. C. 2002. Automating data warehouse conceptual schema design and evaluation. Proceedings of the 4th International Workshop on Design and Management of Data Warehouses 2002 (DMDW'2002). Toronto, Canada: 23-32. Tryfona, N., Busborg, F. and Christiansen, J. G. B. 1999. starER: A conceptual model for data warehouse design. Proceedings of the ACM 2nd International Workshop on Data Warehousing and OLAP. New York, N.Y.: 3-8.