Improving Expression Power in Modeling OLAP Hierarchies
E. Malinowski
Department of Computer and Information Sciences, University of Costa Rica
[email protected]
Abstract. Data warehouses and OLAP systems form an integral part of modern decision support systems. In order to exploit both systems to their full capabilities, hierarchies must be clearly defined. Hierarchies are important in analytical applications, since they provide users with the possibility to represent data at different abstraction levels. However, even though there are different kinds of hierarchies in real-world applications and some are already implemented in commercial tools, there is still a lack of a well-accepted conceptual model that allows decision-making users to express their analysis needs. In this paper, we show how a conceptual multidimensional model can be used to facilitate the representation of complex hierarchies in comparison to their representation in the relational model and in a commercial OLAP tool, using Microsoft Analysis Services as an example.
1 Introduction
Organizations today are facing increasingly complex challenges in terms of management and problem solving in order to achieve their operational goals. This situation compels managers to utilize analysis tools that will better support their decisions. Decision support systems (DSSs) provide assistance to managers at various organizational levels for analyzing strategic information. Since the early 1990s, data warehouses (DWs) have been developed as an integral part of modern DSSs. A DW provides an infrastructure that enables users to obtain efficient and accurate responses to complex queries. Various systems and tools can be used for accessing and analyzing the data contained in DWs, e.g., online analytical processing (OLAP) systems allow users to interactively query and automatically aggregate the data using the roll-up and drill-down operations. The former transforms detailed data into summarized data, e.g., daily sales into monthly sales; the latter does the contrary. The data for DW and OLAP systems is usually organized into fact tables linked to several dimension tables. A fact table (FactResellerSales in Fig.1) represents the focus of analysis (e.g., analysis of sales) and typically includes attributes called measures; they are usually numeric values (e.g., amount) that allow a quantitative evaluation of various aspects of an organization. Dimensions (DimTime in Fig.1) are used to see the measures from different perspectives, e.g.,
according to different periods of time. Dimensions typically include attributes that form hierarchies. Hierarchies are important in analytical applications, since they provide the users with the possibility to represent data at different abstraction levels and to automatically aggregate measures, e.g., moving in a hierarchy from a month to a year will yield aggregated values of sales for the various years. Hierarchies can be included in a flat table (e.g., City-StateProvince in the DimGeography table in Fig.1), forming the so-called star schema, or can use a normalized structure (e.g., DimProduct, DimProductSubcategory, and DimProductCategory in the figure), called the snowflake schema. However, in real-world situations, users must deal with different and often complex kinds of hierarchies that either cannot be represented using the current DW and OLAP systems or are represented as star or snowflake schemas without the possibility to capture the essential semantics of multidimensional applications. In this paper we refer to different kinds of hierarchies already classified in [7] that exist in real-world applications and that are required during the decision-making process. Many of these hierarchies can already be implemented in commercial tools, e.g., in Microsoft SQL Server Analysis Services (SSAS). However, these hierarchies cannot be distinguished either at the logical level (i.e., star or snowflake schemas) or in the OLAP cube designer. We will show the importance of using a conceptual model, such as the MultiDim model [7], to facilitate the process of understanding users’ requirements by distinguishing different kinds of hierarchies. This paper does not focus on the details described in [7] for representing different kinds of hierarchies.
The main objective is to show how many concepts already implemented in commercial tools and accepted by practitioners can be better understood and correctly specified if the practice of DW design changes and includes a representation at the conceptual level. We use the MultiDim model as an example of a conceptual model, without claiming that this model is the only one that responds to analytical needs. On the contrary, we leave to the designer the decision of which conceptual multidimensional model to use among the several already existing models, e.g., [1, 6]. We have chosen SSAS as an example of a commercial tool, since it provides different kinds of hierarchies that can be included in the cube without any programming effort, i.e., using a wizard or the click-and-drag mechanism. We will compare the representation of different kinds of hierarchies in the MultiDim model, in the relational model, and in the OLAP cube designer. Section 2 surveys works related to DW and OLAP hierarchies. Section 3 introduces a motivating example that is used throughout this paper. Section 4 briefly presents the main features of the MultiDim model. Section 5 refers to the conceptual representation and implementation of different kinds of hierarchies. Finally, the conclusions are given in Section 6.
2 Related work
The advantages of conceptual modeling for database design have been acknowledged for several decades and have been studied in many publications. However, the analysis presented in [11] shows that the research community has paid little attention to conceptual multidimensional modeling. Some proposals provide graphical representations based on the ER model (e.g., [12]) or on UML (e.g., [1, 6]), or propose new notations (e.g., [2, 5]), while other proposals do not refer to a graphical representation (e.g., [9, 10]). Very few models include a graphical representation for the different kinds of hierarchies that facilitates their distinction at the schema and instance levels (e.g., [6, 12]). Other models (e.g., [2, 12]) support only simple hierarchies. This situation is considered a shortcoming of existing models for DWs [3]. Current commercial OLAP tools do not allow conceptual modeling of hierarchies. They usually provide a logical-level representation limited to star or snowflake schemas. Some commercial products, such as SSAS, Oracle OLAP, or IBM Alphablox Analytics, can cope with some complex hierarchies.
3 Motivating example
In this section we briefly describe an example that we use throughout this paper in order to show the necessity of having a conceptual model for representing different kinds of hierarchies for DW and OLAP applications.
Fig. 1. An extract of the AdventureWorksDW schema.
The schema in Fig.1 shows an extract of the AdventureWorksDW database issued by Microsoft¹. The DW schema in the figure is used for analysis of sales by resellers (the fact table FactResellerSales). These sales are analyzed from different perspectives, i.e., dimensions. The Product dimension includes a hierarchy using the snowflake structure representing products, subcategories, and categories. The Time dimension includes attributes that allow users to analyze data considering calendar and fiscal periods of time. Another perspective of analysis is represented by the DimSalesTerritory table, which allows decision-making users to analyze measures considering the geographical distribution of a sales organization. The DimReseller table in Fig.1 includes stores that resell products and has an attached table (DimGeography) indicating the geographical distribution of these stores. In addition, this schema contains an employee dimension (the DimEmployee table in the figure) with an organizational hierarchy of supervisors and subordinates. We slightly modified the DimEmployee table and deleted the attribute DepartmentName. Instead, we created a new table that represents different departments. Since we assigned some employees to two different departments, we had to create an additional table (the EmplDepBridge table in Fig.1). This table represents all assignments of employees to their corresponding departments and, in addition, it includes an attribute called DistributingFactor that indicates how to distribute measures between different departments for employees that work in more than one department, e.g., assign 70% of sales to department 10 and 30% of sales to department 14. As can be seen in Fig.1, even though there are several hierarchies that users are interested in exploring, only the hierarchy represented as a snowflake schema (i.e., Product-Subcategory-Category) can be distinguished. We will see in the next section how this situation can be changed using a conceptual model.
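The role of the DistributingFactor attribute can be illustrated with a minimal sketch in Python. The employee identifiers, the sales amounts, and the function name sales_by_department are hypothetical and serve only as an illustration; the 70%/30% split between departments 10 and 14 follows the example given above. The sketch is not meant to reproduce how any particular tool performs this computation.

```python
from collections import defaultdict

# Hypothetical rows mirroring the EmplDepBridge table:
# (employee, department, DistributingFactor)
bridge = [
    ("E1", 10, 0.7),  # employee E1 works 70% for department 10
    ("E1", 14, 0.3),  # ... and 30% for department 14
    ("E2", 10, 1.0),  # employee E2 works only for department 10
]

# Hypothetical fact rows: (employee, SalesAmount)
sales = [("E1", 1000.0), ("E2", 500.0)]


def sales_by_department(sales, bridge):
    """Distribute each employee's sales over departments using the factors."""
    factors = defaultdict(list)
    for employee, department, factor in bridge:
        factors[employee].append((department, factor))

    totals = defaultdict(float)
    for employee, amount in sales:
        for department, factor in factors[employee]:
            totals[department] += amount * factor
    return dict(totals)


totals = sales_by_department(sales, bridge)
```

Because the factors of each employee sum to one, the department totals add up to the overall sales; without the factor, the sales of employee E1 would be counted in full for both departments.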
4 The MultiDim model
The MultiDim model [7] is a multidimensional model that allows designers to represent at the conceptual level all elements required in data warehouse and OLAP applications, i.e., dimensions, hierarchies, and facts with associated measures. In order to present a brief overview of the model², we use the example in Fig.2. The schema in this figure corresponds to the logical schema in Fig.1. We include in the schema only those hierarchies that are relevant for the article and we omit the attributes since they are the same as in Fig.1. A schema is composed of a set of dimensions and a set of fact relationships. A dimension is an abstract concept that groups data sharing a common semantic meaning within the domain being modeled. A dimension is composed of a level or a set of hierarchies. A level corresponds to an entity type in the ER model. It describes a set of real-world concepts that have similar characteristics, e.g., the Product level in Fig.2. Instances of a level are called members.
¹ We do not refer to the correctness of the AdventureWorksDW schema.
² The detailed model description and formalization can be found in [7].
[Figure 2 labels: levels Product, Subcategory, Category, Color, Region, Country, Group, Employee, Department, Reseller, City, StateProvince, CountryRegion, Date, Calendar month, Calendar quarter, Calendar year, Fiscal quarter, Fiscal year; hierarchies Product groups, Product by color, Sales territory, Supervision (roles subordinate and supervisor), Assignation, Location, Time; fact relationship Reseller sales with measures SalesAmount and OrderQuantity.]
Fig. 2. Conceptual representation of hierarchies using the MultiDim model.
A level has a set of attributes that describe the characteristics of its members and one or several keys that uniquely identify the members of a level. These attributes can be seen in Fig.3a. A hierarchy comprises several related levels. Given two related levels of a hierarchy, the lower level is called the child and the higher level is called the parent. The relationships between parent and child levels are characterized by cardinalities, indicating the minimum and the maximum number of members in one level that can be related to a member in another level. We use different symbols for indicating cardinalities: (0,1), (1,1), (0,n), and (1,n). Different cardinalities may exist between parent and child levels, leading to different kinds of hierarchies, to which we refer in more detail in the next sections. The level in a hierarchy that contains the most detailed data is called the leaf level; its name is used for defining the dimension’s name. The last level in a hierarchy, representing the most general data, is called the root level. The hierarchies in a dimension may express various structures used for analysis purposes; thus, we include an analysis criterion to differentiate them. For example, the Product dimension in Fig.2 includes two hierarchies: Product groups and Product by color. The former hierarchy comprises the levels Product, Subcategory, and Category, while the latter hierarchy includes the levels Product and Color. A fact relationship expresses a focus of analysis and represents an n-ary relationship between leaf levels, e.g., the Reseller sales fact relationship relates the Product, Region, Employee, Reseller, and Date levels in Fig.2. A fact relationship may contain attributes commonly called measures that usually contain numeric data, e.g., SalesAmount and OrderQuantity in Fig.2.
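To make these notions more concrete, the following Python sketch represents levels, parent-child relationships with cardinalities, and hierarchies with an analysis criterion as plain data structures. The class names (Level, ParentChild, Hierarchy) and the default cardinalities are assumptions introduced only for this illustration; they are not part of the MultiDim formalization given in [7].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    name: str


@dataclass(frozen=True)
class ParentChild:
    """Relationship between a child level and a parent level."""
    child: Level
    parent: Level
    # (min, max) cardinalities; the defaults below are assumed for illustration.
    parents_per_child: tuple = (1, 1)
    children_per_parent: tuple = (1, "n")


@dataclass(frozen=True)
class Hierarchy:
    criterion: str  # analysis criterion, e.g., "Product groups"
    links: tuple    # ParentChild relationships ordered from leaf to root

    @property
    def leaf(self):
        return self.links[0].child

    @property
    def root(self):
        return self.links[-1].parent

    @property
    def levels(self):
        return [self.links[0].child] + [link.parent for link in self.links]


product = Level("Product")
subcategory = Level("Subcategory")
category = Level("Category")
color = Level("Color")

# The two hierarchies of the Product dimension in Fig.2.
product_groups = Hierarchy(
    "Product groups",
    (ParentChild(product, subcategory), ParentChild(subcategory, category)),
)
product_by_color = Hierarchy("Product by color", (ParentChild(product, color),))
```

Both hierarchies share the leaf level Product, which gives the dimension its name, while their root levels (Category and Color) differ.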
5 Hierarchies: their representation and implementation
In this section, we present various kinds of hierarchies using the MultiDim model, which provides a clear distinction at the schema and instance levels. We also show that even though some commercial tools, such as SSAS, allow designers to include and manipulate different kinds of hierarchies, the distinction between them is difficult to make.
5.1 Balanced hierarchies
A balanced hierarchy has only one path at the schema level, e.g., the Product groups hierarchy in Fig.2 composed of the Product, Subcategory, and Category levels. At the instance level, the members form a tree where all the branches have the same length, since all parent members have at least one child member, and a child member belongs to only one parent member, e.g., all subcategories have at least one product assigned and a product belongs to only one subcategory. Notice that in Fig.2 we have another balanced hierarchy, Product by color, that could not be distinguished at the logical level in Fig.1 since it is included as an attribute in the DimProduct table. Balanced hierarchies are the most common kind of hierarchies. They are usually implemented as a star or a snowflake schema, as can be seen in Fig.1. On the other hand, SSAS uses the same representation for all kinds of hierarchies (except recursive ones, as we will see later), as shown for the Sales Territory hierarchy in Fig.4a.
5.2 Unbalanced hierarchies
An unbalanced hierarchy³ has only one path at the schema level and, as implied by the cardinalities, at the instance level some parent members may not have associated child members. Fig.3a shows a Sales territory hierarchy composed of Region, Country, and Group levels. However, the division in some countries does not include regions (e.g., Canada in Fig.3b). At the logical level this kind of hierarchy is represented as a star or a snowflake schema (the DimSalesTerritory table in Fig.1). At the instance level, the missing levels can include placeholders, i.e., the parent member name (e.g., Canada for the region name) or null values. SSAS represents an unbalanced hierarchy as shown in Fig.4a. For displaying instances, the designers can choose between two options: to display the repeated member (Fig.4b) or not to include this member at all (Fig.4c).
³ These hierarchies are also called heterogeneous [4] and non-onto [9].
[Figure 3 content: a) the Sales territory hierarchy with the levels Region (Region name, Area, ...), Country (Country name, Population, ...), and Group (Group name, Responsible, ...); b) instances including North America, United States, Canada, Central, and Southwest.]
Fig. 3. Unbalanced hierarchy: a) schema and b) examples of instances.
Fig. 4. Unbalanced hierarchy in SSAS: a) schema and b), c) instances.
To select one of these options, designers should modify the HideMemberIf property by indicating one of the following options: (1) OnlyChildWithParentName: when a level member is the only child of its parent and its name is the same as the name of its parent, (2) OnlyChildWithNoName: when a level member is the only child of its parent and its name is null or an empty string, (3) ParentName: when a level member has one or more child members and its name is the same as its parent’s name, or (4) NoName: when a level member has one or more child members and its name is a null value. Notice that not all of these options are appropriate: for unbalanced hierarchies, only the first or second option should be applied, i.e., the parent member will have at most one child member with the same name, e.g., the name Canada in Fig.3b will be repeated in the missing levels until the tree representing the instances is balanced.
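The effect of the four options can be mimicked with a small Python function. This is only a sketch of the rules as described above and not code that interacts with SSAS: the function should_hide and its parameters are hypothetical, and for the ParentName and NoName options the sketch tests only the member name, which is a simplifying assumption.

```python
def should_hide(option, name, parent_name, is_only_child):
    """Decide whether a member is hidden under a given HideMemberIf option.

    option        -- one of the four option names listed in the text
    name          -- name of the member (None or "" for a missing name)
    parent_name   -- name of the parent member
    is_only_child -- True if the member is the only child of its parent
    """
    no_name = name is None or name == ""
    same_as_parent = name == parent_name

    if option == "OnlyChildWithParentName":
        return is_only_child and same_as_parent
    if option == "OnlyChildWithNoName":
        return is_only_child and no_name
    if option == "ParentName":
        # Simplification: only the name is compared with the parent's name.
        return same_as_parent
    if option == "NoName":
        # Simplification: only the missing name is tested.
        return no_name
    raise ValueError(f"unknown option: {option}")


# Canada repeated as a placeholder in the missing Region level (Fig.3b).
hide_placeholder = should_hide("OnlyChildWithParentName", "Canada", "Canada", True)
# A real region of the United States is kept.
hide_region = should_hide("OnlyChildWithParentName", "Central", "United States", False)
```

With the first option, the placeholder Canada is hidden while the region Central remains visible, which corresponds to the display in which the repeated member is not included.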
5.3 Recursive hierarchies
Unbalanced hierarchies include a special case that we call recursive hierarchies⁴. In this kind of hierarchy the same level is linked by the two roles of a parent-child relationship. An example is given in Fig.2 for the Employee dimension, where the Supervision recursive hierarchy represents the employee-supervisor relationship. The subordinate and supervisor roles of the parent-child relationship are linked to the Employee level. Recursive hierarchies are mostly used when all hierarchy levels express the same semantics, e.g., where an employee has a supervisor who is also an employee. At the logical level this kind of hierarchy is represented by the inclusion of a foreign key in the same table that contains a primary key, as can be seen in Fig.1 for the DimEmployee table. This kind of hierarchy is not represented as a hierarchy in SSAS; only a hierarchy symbol is attached to the attribute that represents a parent key.
5.4 Non-covering hierarchies
A non-covering or ragged hierarchy contains multiple exclusive paths sharing at least the leaf level. Alternative paths are obtained by skipping one or several intermediate levels of other paths. All these paths represent one hierarchy and account for the same analysis criterion. At the instance level, each member of the hierarchy belongs to only one path. We use the symbol ⊗ to indicate that the paths are exclusive for every member. Fig.2 includes a Location non-covering hierarchy composed of the Reseller, City, StateProvince, and CountryRegion levels. However, as can be seen by the straight lower line and the cardinalities, some countries are not divided into states. Fig.5 shows some hypothetical instances that we use for this hierarchy⁵. Notice that the cities of Berlin and Eilenburg do not have any members assigned for the StateProvince level. This hierarchy is represented in the logical schema as a flat table, e.g., the DimGeography table in Fig.1, with corresponding attributes. At the instance level, similar to the unbalanced hierarchies, placeholders or null values can be included in the missing members. For representing these hierarchies, SSAS uses a display similar to the one shown for the Sales Territory hierarchy in Fig.4a. For displaying the instances, SSAS provides the same four options as described in Sec. 5.2. However, for non-covering hierarchies, the third or fourth option should be applied since, for our example in Fig.5, two children roll up to the Germany member included as a placeholder for the missing StateProvince level.
⁴ These are also called parent-child hierarchies [8].
⁵ We modify the instance of the AdventureWorksDW to represent this kind of hierarchy.
[Figure 5 labels: country Germany; state Bayern; cities Augsburg, Frankfurt, Berlin, and Eilenburg; resellers Rustic Bike Store, Global Bike Retailers, Riding Supplies, and Off-Price Bike Center.]
Fig. 5. Some instances of non-covering hierarchy.
Even though the unbalanced and non-covering hierarchies represent different situations and can be clearly distinguished using a conceptual model (Fig.2), SSAS considers implementation details that are very similar for both hierarchies and states that “it may be impossible for end users to distinguish between unbalanced and ragged hierarchies” [8]. These two hierarchies also differ in the process of measure aggregation. For an unbalanced hierarchy, the measure values are repeated from the parent member to the missing child members and cannot be aggregated during the roll-up operations. For the non-covering hierarchies, the measures should be aggregated for every placeholder represented at the parent level.
5.5 Non-strict hierarchies
For the hierarchies presented before, we assumed that each parent-child relationship has many-to-one cardinalities, i.e., a child member is related to at most one parent member and a parent member may be related to several child members. However, many-to-many relationships between parent and child levels are very common in real-life applications, e.g., an employee can work in several departments, or a mobile phone can be classified in different product categories. We call a hierarchy non-strict if at the schema level it has at least one many-to-many relationship. Fig.2 shows the Assignation hierarchy where an employee can belong to several departments. Since at the instance level a child member may have more than one parent member, the members form an acyclic graph. Non-strict hierarchies induce the problem of double-counting measures when a roll-up operation reaches a many-to-many relationship, e.g., if an employee belongs to two departments, his sales will be aggregated to both departments, giving incorrect results. To avoid this problem, one of the solutions⁶ is to indicate that measures should be distributed between several parent members. For that, we include an additional symbol, a circled ÷, called a distributing factor. The mapping to the relational model [7] will provide the same solution as presented in Fig.1: the DimEmployee, DimDepartment, and EmplDepBridge tables for representing the Employee and Department levels and the many-to-many cardinalities with the distributing factor attribute, respectively. However, with the bridge table we lose the meaning of a hierarchy that can be used for the roll-up and drill-down operations. This is not the case when using a conceptual model.
⁶ Several solutions can be used as explained in [7].
Fig. 6. Non-strict hierarchy in SSAS: a) dimension usage and b) representation of many-to-many relationship.
SSAS requires several different steps in order to use this hierarchy and to have correct results. First, the bridge table is considered as another fact table and the Employee and Department dimensions are handled as separate dimensions (Fig.6a) that later on can be combined to form a hierarchy in a cube data browser. In the next step, designers must define in the Dimension Usage that, in order to aggregate the measure from the Fact Reseller Sales table, many-to-many cardinalities must be considered (Fig.6a). Notice the SSAS representation of this cardinality in Fig.6b. Finally, in order to use a distributing factor from the bridge table, the Measure Expression property must be modified for every measure of the fact table, e.g., for the SalesAmount measure we include [SalesAmount]*[DistributingFactor].
5.6 Alternative hierarchies
Alternative hierarchies represent the situation where at the schema level there are several non-exclusive simple hierarchies sharing at least the leaf level and accounting for the same analysis criterion. The Time hierarchy in Fig.2 is an example of alternative hierarchies, where the Date dimension includes two hierarchies corresponding to the usual Gregorian calendar and to the fiscal calendar of an organization. Alternative hierarchies are needed when the user requires analyzing measures from a unique perspective (e.g., time) using alternative aggregation paths. Since the measures from the fact relationship will participate totally in each component hierarchy, measure aggregations can be performed as for simple hierarchies. However, in alternative hierarchies it is not semantically correct to combine the different component hierarchies simultaneously, since this produces meaningless intersections, such as Fiscal 2003 and Calendar 2001. The user must choose only one of the alternative aggregation paths for his analysis and switch to the other one if required. The logical schema does not clearly represent this hierarchy, since all attributes forming both paths of alternative hierarchies are included in the flat DimTime table (Fig.1). The current version of SSAS does not include this kind of hierarchy and the designers should define two different hierarchies, one corresponding to calendar and another to fiscal time periods, which allows combinations between the alternative paths and creates meaningless intersections with null values for measures.
5.7 Parallel hierarchies
Parallel hierarchies arise when a dimension has several associated hierarchies accounting for different analysis criteria, e.g., the Product dimension in Fig.2 with the Product by color and Product groups parallel hierarchies. Such hierarchies can be independent, where the component hierarchies do not share levels, or dependent otherwise. Notice that even though both alternative and parallel hierarchies may share some levels and may include several simple hierarchies, they represent different situations and should be clearly distinguishable. This is done by including only one (for alternative hierarchies) or several (for parallel dependent hierarchies) analysis criteria. In this way the user is aware that in alternative hierarchies it is not meaningful to combine levels from different component hierarchies, while this can be done for parallel hierarchies, e.g., for the Product dimension in Fig.2, the user can issue the query “what are the sales figures for products that belong to the bike category and are black”.
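The query mentioned above can be sketched in Python as follows. The product identifiers, subcategory names, colors, and amounts are hypothetical data introduced only for illustration; the point of the sketch is that levels from the two parallel hierarchies (Category reached through Subcategory, and Color) can be combined in one condition.

```python
# Hypothetical members of the Product level with their parents in both hierarchies.
products = {
    "P1": {"subcategory": "Mountain Bikes", "color": "Black"},
    "P2": {"subcategory": "Mountain Bikes", "color": "Red"},
    "P3": {"subcategory": "Helmets", "color": "Black"},
}

# Hypothetical roll-up from Subcategory to Category (Product groups hierarchy).
subcategory_to_category = {
    "Mountain Bikes": "Bikes",
    "Helmets": "Accessories",
}

# Hypothetical fact rows: (product, SalesAmount)
sales = [("P1", 100.0), ("P2", 50.0), ("P3", 20.0), ("P1", 40.0)]


def sales_for(category, color):
    """Sum sales of products in the given category that have the given color."""
    total = 0.0
    for product_id, amount in sales:
        product = products[product_id]
        in_category = subcategory_to_category[product["subcategory"]] == category
        if in_category and product["color"] == color:
            total += amount
    return total


black_bikes = sales_for("Bikes", "Black")
```

Only product P1 satisfies both conditions in this data, so the result combines its two sales rows; the red bike P2 and the black helmet P3 are excluded.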
6 Conclusions
DWs are defined using a multidimensional view of data, which is based on the concepts of facts, measures, dimensions, and hierarchies. OLAP systems allow users to interactively query DW data using operations such as drill-down and roll-up, and these operations require the definition of hierarchies for aggregating measures. A hierarchy represents some organizational, geographic, or other type of structure that is important for analysis purposes. However, there is still a lack of a well-accepted conceptual multidimensional model that is able to represent the different kinds of hierarchies existing in real-world applications. As a consequence, even though some commercial tools are able to implement and manage different kinds of hierarchies, users and designers have difficulties in distinguishing them. Therefore, users cannot clearly express their analysis requirements, and designers as well as implementers cannot satisfy users’ needs.
References
1. A. Abelló, J. Samos, and F. Saltor. YAM2 (yet another multidimensional model): An extension of UML. Information Systems, 32(6):541–567, 2006.
2. M. Golfarelli and S. Rizzi. A methodological framework for data warehouse design. In Proc. of the 1st ACM Int. Workshop on Data Warehousing and OLAP, pages 3–9, 1998.
3. W. Hümmer, W. Lehner, A. Bauer, and L. Schlesinger. A decathlon in multidimensional modeling: Open issues and some solutions. In Proc. of the 4th Int. Conf. on Data Warehousing and Knowledge Discovery, pages 275–285, 2002.
4. C. Hurtado and C. Gutierrez. Handling structural heterogeneity in OLAP. In R. Wrembel and C. Koncilia, editors, Data Warehouses and OLAP: Concepts, Architectures and Solutions, chapter 2, pages 27–57. IRM Press, 2007.
5. B. Hüsemann, J. Lechtenbörger, and G. Vossen. Conceptual data warehouse design. In Proc. of the Int. Workshop on Design and Management of Data Warehouses, page 6, 2000.
6. S. Luján-Mora, J. Trujillo, and I. Song. A UML profile for multidimensional modeling in data warehouses. Data & Knowledge Engineering, 59(3):725–769, 2006.
7. E. Malinowski and E. Zimányi. Advanced Data Warehouse Design: From Conventional to Spatial and Temporal Applications. Springer, 2008.
8. Microsoft Corporation. SQL Server 2005. Books Online. http://technet.microsoft.com/en-us/sqlserver/bb895969.aspx, 2003.
9. T. Pedersen, C. Jensen, and C. Dyreson. A foundation for capturing and querying complex multidimensional data. Information Systems, 26(5):383–423, 2001.
10. E. Pourabbas and M. Rafanelli. Hierarchies. In M. Rafanelli, editor, Multidimensional Databases: Problems and Solutions, pages 91–115. Idea Group Publishing, 2003.
11. S. Rizzi. Open problems in data warehousing: 8 years later. In Proc. of the 5th Int. Workshop on Design and Management of Data Warehouses, 2003.
12. C. Sapia, M. Blaschka, G. Höfling, and B. Dinter. Extending the E/R model for multidimensional paradigm. In Proc. of the 17th Int. Conf. on Conceptual Modeling, pages 105–116, 1998.