MMDW: A Multi-dimensional and Multi-granular Schema for Data ...

Advances in Knowledge-Based and Intelligent Information and Engineering Systems M. Graña et al. (Eds.) IOS Press, 2012 © 2012 The authors and IOS Press. All rights reserved. doi: 10.3233/978-1-61499-105-2-1211


MMDW: A Multi-dimensional and Multi-granular Schema for Data Warehousing

Nadeem Iftikhar
Technology & Business, University College of Northern Denmark

Abstract. With the emergence of modern database technologies, the concept of multi-granular data warehousing, which stores data at multiple levels of granularity, has become important. Existing data warehousing schemas, such as star and snowflake, cannot store multi-granular data because they are designed to store data at a single level of granularity. In this paper, we present a multi-dimensional and multi-granular data warehousing schema (MMDW). MMDW is based on Relational OLAP (ROLAP) and uses an RDBMS to manage new and old (aggregated) data by means of an extended star schema. In contrast to traditional ROLAP, MMDW also allows pre-computation and storage of data at multiple levels of granularity. Furthermore, MMDW is evaluated on a real-world case study, and the results show that MMDW performs well in terms of aggregation query speed, aggregation query complexity and storage used, and compares favorably with the standard star schema.

1 Introduction

Most data warehouses use a star or snowflake schema to represent the multi-dimensional model. The four key inputs to the dimensional model are: business process, granularity, dimensions and facts [1]. After identifying the business process, the single most important design issue confronting data warehouse developers is determining the suitable level of granularity [2]. Once the level of granularity is correctly determined, the remaining two inputs, the dimensions and facts of the dimensional model, can be resolved with less effort. Unfortunately, neither the star nor the snowflake schema [3] is able to store data at different levels of granularity, because both schemas only support storing data at a single level of granularity. This paper presents the multi-dimensional and multi-granular schema MMDW, which has the ability to store new and old (aggregated) data at different levels of granularity. MMDW is an extension of the standard star schema. The extension is based on the following three motivations. First, the proposed schema is based on ROLAP, and ROLAP often uses a star schema [4]. Second, users are spared the complexity of a completely new schema. Last, the proposed schema can be implemented using standard relational technologies. This paper extends our previous work [5], [6] and [7] on managing application specific multi-granular data


N. Iftikhar / A Multi-Dimensional and Multi-Granular Schema for Data Warehousing

(single dimension based only) by proposing a generic multi-dimensional schema. More significantly, this paper suggests an effective mechanism to aggregate multi-dimensional and multi-granular data for analysis and reporting purposes. The rest of this paper is structured as follows. Section 2 presents the real-world farming case study and explains the motivation behind the proposed schema. Section 3 provides an overview of the MMDW schema. Section 4 describes MMDW based data aggregation. The evaluation of MMDW is presented in Section 5. Section 6 discusses the related work. Finally, Section 7 concludes and points to future work.

2 Case Study

This section presents a real-world case study based on the farming business. The case study is the result of the LandIT project [8]. LandIT was an industrial collaboration project about developing technologies for integration, aggregation and exchange of data, both for operational and business intelligence purposes. We use the case study to illustrate the kind of challenges faced when storing and aggregating data at different levels of granularity, which are addressed by this paper. The case study concerns spray related data in the fields that has to be logged in order to comply with environmental regulations. The logged data is initially kept in detailed format in the Fact table, which consists of the following attributes: Fact (Timeid, Locationid, Taskid, Parameterid, Value). The Timeid specifies a recording of a time event at different levels of granularity. The Locationid indicates the name of a specific area that is being treated with a chemical for a given task. The Taskid represents the business model to separate tasks that are carried out by a contractor for a farmer in a particular field of a farm. The Parameterid represents a variable code for which a data value is recorded. For example, parameterid 247 represents the amount of chemical sprayed in liters; parameterid 248 represents the distance covered by a tractor in km; parameterid 1 represents tractor speed in km/h; parameterid 41 represents the area sprayed in hectares and so on. Moreover, each parameter has a different data logging frequency: for example, parameterid 1 and 247 are logged every second, whereas parameterid 41 and 248 are logged every 60 seconds. Finally, the Value is simply a measure. To list some example data, we used data produced by the farm equipment, which consists of 10 minutes of data at 2 different locations for 1 task and 2 parameters.
In Table 1, the granularity of the detailed data is initially at the second by treatmentzone by task level (Rowid 1-4). For instance, row number 1 reads as follows: Tid=1 (represents: time at the second granularity level, 12.04.2012 16:00:01), Lid=100 (represents: treatmentzone that is being treated with chemical), Taid=10 (represents: unique id for a task), Pid=247 (represents: amount of chemical used) and Value=0.21 (represents: current value of chemical used in liters). The Rowid is an abstract attribute used only for row identification purposes. The first level aggregation aggregates data by 10 minutes


by partfield by task level (Rowid 5-8). Similarly, row number 7 reads as follows: Tid=601 (represents: time at 10 minutes granularity level 12.04.2012 16:00), Lid=1000 (represents: partfield that is being treated with chemical), Taid=10 (represents: unique id for a task), Pid=247 (represents: amount of chemical used) and Value=24.13 (represents: total value of chemical used in liters). Furthermore, the data could be aggregated from the 10 minutes to the hour level and from the partfield to the field level and so on. Hence, in Table 1 the data is at multiple levels of granularity, Rowid 1-4 are at second by treatmentzone by task level and Rowid 5-8 are at 10 minutes by partfield by task level.

Table 1. Multi-dimensional and multi-granular fact table

Rowid  Timeid  Locationid  Taskid  Parameterid  Value
1      1       100         10      247          0.21
2      2       100         10      1            0.26
.      .       .           .       .            .
3      301     101         10      247          0.19
4      302     101         10      1            0.28
.      .       .           .       .            .
5      601     1000        10      247          25.50
6      601     1000        10      1            9.46
7      601     1000        10      247          24.13
8      601     1000        10      1            10.20
.      .       .           .       .            .

The star schema [3], the leading approach to storing data in a data warehouse, has been selected for analysis. The star schema is unable to represent data at multiple levels of granularity because it requires all the dimension attributes to be given concrete values; in other words, all the dimension levels are mandatory. The star schema requires a single suitable level of granularity to be determined at design time, and this level cannot be changed later. In conclusion, based on the above case study, the standard star schema is not appropriate for handling multi-granular data.
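The limitation described above can be illustrated with a small sketch. It assumes a hypothetical single-grain time dimension in which every level is declared NOT NULL (one way of modelling "all the dimension levels are mandatory"); the table layout and the use of SQLite through Python are illustrative choices, not part of the case study.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical single-grain time dimension: every level is mandatory.
conn.execute("""
    CREATE TABLE Time_Dimension (
        Timeid INTEGER PRIMARY KEY,
        Year   INTEGER NOT NULL,
        Month  INTEGER NOT NULL,
        Day    INTEGER NOT NULL
    )
""")

# A day-level row fits the fixed grain.
conn.execute("INSERT INTO Time_Dimension VALUES (1, 2012, 3, 1)")

# A month-level row has no Day value, so the fixed-grain dimension rejects it.
try:
    conn.execute("INSERT INTO Time_Dimension VALUES (2, 2012, 3, NULL)")
    month_row_accepted = True
except sqlite3.IntegrityError:
    month_row_accepted = False

print(month_row_accepted)
```

Under these assumptions only the day-level row is stored; a coarser row can be represented only by inventing a placeholder value for the missing level.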

3 Overview of MMDW Schema

This section informally describes the general idea of creating a generic data warehousing schema for multi-dimensional and multi-granular data. The description also provides an insight into the working of MMDW by introducing a simple example. As mentioned in Section 1, the standard star schema is unable to manage multi-granular data; for that reason, an extended version of the star schema is proposed. The proposed schema, shown in Fig. 1, comprises two major enhancements to the standard star schema: a Granularity attribute and an MG notation, each of which represents the multi-granular nature of data. The


Granularity attribute in all dimensions represents the level of granularity of each row stored in the fact table. The MG notation indicates that multi-granular relationships exist between the fact table and the dimensions. Thus, with the inclusion of a Granularity column in each dimension, it is possible to store data in the fact table at multiple levels of granularity. In addition, with the proposed schema, end users can write SQL queries either to calculate aggregates on-the-fly or to pre-compute and store the aggregates at any desired combination of dimensions and granularities for analysis and reporting purposes. In the latter case, there are two further options. The first option is to delete all the rows that have been aggregated from the fact table, together with the dimension rows that no longer have fact rows pointing to them. The second option is to keep both the detailed and the aggregated data in the fact table. To save disk space and to process queries efficiently, the first option seems more suitable, because detailed data slowly loses its value as it grows older. The detailed description of calculating aggregates is presented in Section 4.

Fig. 1. The MMDW Schema

Fig. 2. Logs database

Example. We now introduce a simple example that will be used as a running example throughout the paper. Fig. 2 illustrates a Logs database designed as a multi-star schema. In the multi-star schema, the composite primary key of the fact table consists of both a set of foreign keys, which reference dimensions, and one or more prime columns in order to provide a unique identifier for each row. In the fact table Logs, the primary key is composed of three foreign keys, Timeid, Locationid and Taskid, each of which references a primary key in a dimension, as well as a prime attribute, Parameterid. Many-to-one relationships exist between the fact table and the dimensions. The Time dimension defines the times. Each row in the dimension represents a distinct time with a unique time identifier and a level of granularity. The time identifier can occur multiple times in the Logs table, representing logs at that level of granularity for each location and task. For instance, when the granularity is at Day level, the row in the Time table appears as (T1, Day, 2012, 03, 01), where “T1” represents the unique time identifier,


“Day” indicates the level of granularity of that row and “2012, 03, 01” stand for the calendar year, month and day. Similarly, when the granularities are at Month and Year levels, the rows in the Time table look like (T2, Month, 2012, 03, NULL) and (T3, Year, 2012, NULL, NULL), respectively. Further, the Location and Task dimensions define the locations and tasks, respectively. For example, when the granularity is at Treatmentzone level, the row in the Location table appears as (L1, Treatmentzone, Red Field, North, Medium Rate), where “L1” represents the unique location identifier, “Treatmentzone” indicates the level of granularity of that row and “Red Field, North, Medium Rate” stand for the Field, Partfield and Treatmentzone. Similarly, when the granularity is at Partfield level, the row in the Location table looks like (L2, Partfield, Red Field, North, NULL). Furthermore, in the case of the Task dimension, when the granularity is at Task level, the row in the Task table appears as (TA1, Task, Niels, Town Farm, Initial Spray), where “TA1” represents the unique task identifier, “Task” indicates the level of granularity of that row and “Niels, Town Farm, Initial Spray” stand for the Customer, Farm and Task. Each dimension in MMDW may have numerous trailing NULL columns due to the presence of data at multiple levels of granularity; however, major DBMSs such as Oracle do not use space for trailing NULL values [9]. Similarly, in MySQL ([10] chapter 13.2.10.5), a NULL value reserves zero bytes in the data part of the record if stored in a variable length column. Moreover, the multi-granular rows in the fact table (Logs) appear as (T1, L1, TA1, 247, 5), (T2, L2, TA1, 247, 60) and so on. “T1, L1 and TA1” represent the unique fact identifier at Day, Treatmentzone and Task granularities and “T2, L2 and TA1” represent the unique fact identifier at Month, Partfield and Task granularities. In addition, “247 and 5” as well as “247 and 60” represent the parameter (amount of chemical sprayed in liters) and the actual value, namely 5 and 60 liters.
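The running example can be written down as a minimal sketch. The table names follow the aggregation query in Section 4 (Logs, Time_Dimension, Location_Dimension, Task_Dimension); the column types and the choice of SQLite through Python are assumptions made for illustration, since Fig. 2 does not survive in this text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed column types; each dimension carries a Granularity column and
# leaves the levels below its own granularity as NULL.
conn.executescript("""
CREATE TABLE Time_Dimension (
    Timeid TEXT PRIMARY KEY, Granularity TEXT NOT NULL,
    Year INTEGER, Month INTEGER, Day INTEGER);
CREATE TABLE Location_Dimension (
    Locationid TEXT PRIMARY KEY, Granularity TEXT NOT NULL,
    Field TEXT, Partfield TEXT, Treatmentzone TEXT);
CREATE TABLE Task_Dimension (
    Taskid TEXT PRIMARY KEY, Granularity TEXT NOT NULL,
    Customer TEXT, Farm TEXT, Task TEXT);
CREATE TABLE Logs (
    Timeid TEXT REFERENCES Time_Dimension(Timeid),
    Locationid TEXT REFERENCES Location_Dimension(Locationid),
    Taskid TEXT REFERENCES Task_Dimension(Taskid),
    Parameterid INTEGER, Value REAL,
    PRIMARY KEY (Timeid, Locationid, Taskid, Parameterid));
""")

# Dimension rows from the running example, one per granularity level.
conn.executemany("INSERT INTO Time_Dimension VALUES (?, ?, ?, ?, ?)", [
    ("T1", "Day", 2012, 3, 1),
    ("T2", "Month", 2012, 3, None),
    ("T3", "Year", 2012, None, None),
])
conn.executemany("INSERT INTO Location_Dimension VALUES (?, ?, ?, ?, ?)", [
    ("L1", "Treatmentzone", "Red Field", "North", "Medium Rate"),
    ("L2", "Partfield", "Red Field", "North", None),
])
conn.execute("INSERT INTO Task_Dimension VALUES (?, ?, ?, ?, ?)",
             ("TA1", "Task", "Niels", "Town Farm", "Initial Spray"))

# Two fact rows at different granularities stored in the same fact table.
conn.executemany("INSERT INTO Logs VALUES (?, ?, ?, ?, ?)", [
    ("T1", "L1", "TA1", 247, 5.0),
    ("T2", "L2", "TA1", 247, 60.0),
])

# Report the granularity combination of every fact row.
rows = conn.execute("""
    SELECT TD.Granularity, LD.Granularity, TAD.Granularity,
           L.Parameterid, L.Value
    FROM Logs L
    JOIN Time_Dimension TD ON L.Timeid = TD.Timeid
    JOIN Location_Dimension LD ON L.Locationid = LD.Locationid
    JOIN Task_Dimension TAD ON L.Taskid = TAD.Taskid
    ORDER BY L.Value
""").fetchall()
for row in rows:
    print(row)
```

The join shows that the granularity of a fact row is not stored in the fact table itself; it is read from the Granularity column of the dimension rows the fact row points to.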

4 MMDW based Data Aggregation

This section first gives a brief description of how the multi-dimensional lattice is generated. After that, it describes MMDW based data aggregation. The multi-dimensional and multi-granular view of data in MMDW emphasizes aggregation of measures by various combinations of dimensions and granularities. The lattice in Fig. 3 shows a snapshot of multi-dimensional hierarchies. It is based on the following hierarchies: Time hierarchy (All→Year→Month), Location hierarchy (All→Field→Partfield) and Task hierarchy (All→Customer→Farm). As seen in Fig. 3, the bottom node (Month, Partfield, Farm) represents the lowest granularity and the top node (All, All, All) represents the highest granularity. In total there are 27 nodes in the lattice ((granularities of Time+1)*(granularities of Location+1)*(granularities of Task+1)=(2+1)*(2+1)*(2+1)=27), where the +1 accounts for the All level. The lower granularities can be aggregated into various combinations of higher granularities as shown in Fig. 3, such as: (Month, Field, Farm); (Year, Field, Customer); (All, Field, All) and so on. The complete procedure of multi-dimensional and multi-granular data aggregation (based on pre-computing and storing the aggregates)


is outlined in Algorithm 1. It works as follows. First, it aggregates data and stores it in multiple cursors. Multiple cursors are needed because parameters are aggregated differently: some are aggregated using AVG, some using MAX and so on. Second, it iterates through each cursor and generates new rows in the selected dimension tables (if not already existing) to correspond to the higher granularity rows of the cursor. Third, it inserts the aggregated rows in the fact table along with updated references to the selected dimension tables. Fourth, it deletes the rows that have been aggregated from the fact table. Last, it deletes the rows from the selected dimension tables with no fact rows pointing to them.
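The five steps above can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the paper: it uses SQLite through Python, aggregates only the Time and Location dimensions (Day to Month, Treatmentzone to Partfield), and the mapping of parameters to SUM and AVG, the sample rows and the helper names (AGG_FUNCS, get_or_create, aggregate_day_zone_to_month_partfield) are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Time_Dimension (Timeid INTEGER PRIMARY KEY, Granularity TEXT,
                             Year INTEGER, Month INTEGER, Day INTEGER);
CREATE TABLE Location_Dimension (Locationid INTEGER PRIMARY KEY, Granularity TEXT,
                                 Field TEXT, Partfield TEXT, Treatmentzone TEXT);
CREATE TABLE Logs (Timeid INTEGER, Locationid INTEGER, Taskid INTEGER,
                   Parameterid INTEGER, Value REAL);
INSERT INTO Time_Dimension VALUES (1, 'Day', 2012, 3, 1), (2, 'Day', 2012, 3, 2);
INSERT INTO Location_Dimension VALUES
    (1, 'Treatmentzone', 'Red Field', 'North', 'Medium Rate'),
    (2, 'Treatmentzone', 'Red Field', 'North', 'High Rate');
INSERT INTO Logs VALUES (1, 1, 10, 247, 5.0), (2, 2, 10, 247, 7.0),
                        (1, 1, 10, 1, 8.0), (2, 2, 10, 1, 10.0);
""")

# Assumed mapping: which parameters are aggregated with which function.
AGG_FUNCS = {"SUM": [247], "AVG": [1]}


def get_or_create(table, key, granularity, cols, values):
    """Return the id of the dimension row at this granularity, creating it if missing."""
    where = " AND ".join(f"{c} = ?" for c in cols)
    row = conn.execute(
        f"SELECT {key} FROM {table} WHERE Granularity = ? AND {where}",
        [granularity, *values]).fetchone()
    if row:
        return row[0]
    col_list = ", ".join(cols)
    marks = ", ".join("?" for _ in cols)
    cur = conn.execute(
        f"INSERT INTO {table} (Granularity, {col_list}) VALUES (?, {marks})",
        [granularity, *values])
    return cur.lastrowid


def aggregate_day_zone_to_month_partfield():
    new_facts = []
    # Step 1: one result set ("cursor") per aggregate function.
    for func, params in AGG_FUNCS.items():
        marks = ", ".join("?" for _ in params)
        cursor_rows = conn.execute(f"""
            SELECT TD.Year, TD.Month, LD.Field, LD.Partfield,
                   L.Taskid, L.Parameterid, {func}(L.Value)
            FROM Logs L
            JOIN Time_Dimension TD ON L.Timeid = TD.Timeid
            JOIN Location_Dimension LD ON L.Locationid = LD.Locationid
            WHERE TD.Granularity = 'Day' AND LD.Granularity = 'Treatmentzone'
              AND L.Parameterid IN ({marks})
            GROUP BY TD.Year, TD.Month, LD.Field, LD.Partfield,
                     L.Taskid, L.Parameterid""", params).fetchall()
        # Step 2: find or create the higher-granularity dimension rows.
        for year, month, field, partfield, taskid, pid, value in cursor_rows:
            tid = get_or_create("Time_Dimension", "Timeid", "Month",
                                ["Year", "Month"], [year, month])
            lid = get_or_create("Location_Dimension", "Locationid", "Partfield",
                                ["Field", "Partfield"], [field, partfield])
            new_facts.append((tid, lid, taskid, pid, value))
    # Step 3: insert the aggregated fact rows.
    conn.executemany("INSERT INTO Logs VALUES (?, ?, ?, ?, ?)", new_facts)
    # Step 4: delete the detailed fact rows that were just aggregated.
    all_params = [p for params in AGG_FUNCS.values() for p in params]
    marks = ", ".join("?" for _ in all_params)
    conn.execute(f"""
        DELETE FROM Logs
        WHERE Parameterid IN ({marks})
          AND Timeid IN (SELECT Timeid FROM Time_Dimension
                         WHERE Granularity = 'Day')
          AND Locationid IN (SELECT Locationid FROM Location_Dimension
                             WHERE Granularity = 'Treatmentzone')""", all_params)
    # Step 5: delete dimension rows that no fact row points to any more.
    conn.execute("DELETE FROM Time_Dimension "
                 "WHERE Timeid NOT IN (SELECT Timeid FROM Logs)")
    conn.execute("DELETE FROM Location_Dimension "
                 "WHERE Locationid NOT IN (SELECT Locationid FROM Logs)")


aggregate_day_zone_to_month_partfield()
facts = conn.execute("SELECT * FROM Logs ORDER BY Parameterid").fetchall()
print(facts)
```

In this sketch the four day-level fact rows collapse into two month-level rows (one averaged, one summed), and the day-level and treatmentzone-level dimension rows disappear because nothing references them afterwards.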

Fig. 3. The Lattice
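As a cross-check on the lattice, the nodes can be enumerated directly from the three hierarchies named in the text. The sketch below is illustrative; the function name is_coarser_or_equal is hypothetical, and each hierarchy is listed from its coarsest level (All) to its finest.

```python
from itertools import product

# Hierarchies from the text, ordered from coarsest to finest level.
time_levels = ["All", "Year", "Month"]
location_levels = ["All", "Field", "Partfield"]
task_levels = ["All", "Customer", "Farm"]
hierarchies = (time_levels, location_levels, task_levels)

# Every lattice node is one level per dimension.
lattice = list(product(time_levels, location_levels, task_levels))


def is_coarser_or_equal(a, b):
    """True if node a is at least as coarse as node b in every dimension,
    which means a can be computed by aggregating b."""
    return all(h.index(x) <= h.index(y) for h, x, y in zip(hierarchies, a, b))


bottom = ("Month", "Partfield", "Farm")
reachable_from_bottom = [n for n in lattice if is_coarser_or_equal(n, bottom)]

print(len(lattice))
print(len(reachable_from_bottom))
```

Three levels per dimension give 3*3*3 = 27 nodes, and every node can be reached by aggregating the bottom node (Month, Partfield, Farm).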

1 SELECT L.Timeid, L.Locationid, L.Taskid, L.Parameterid,
2 SUM(L.Value)
3 FROM Logs L, Time_Dimension TD, Location_Dimension LD,
4 Task_Dimension TAD
5 WHERE (L.Parameterid = 247 OR L.Parameterid = 248)
6 AND L.Timeid = TD.Timeid
7 AND L.Locationid = LD.Locationid
8 AND L.Taskid = TAD.Taskid
9 GROUP BY L.Taskid, L.Parameterid, TD.Year, LD.Field,
10 TAD.Customer, TAD.Farm;

The methods that perform the desired data aggregation are already presented in [5]. In this section, we present the data aggregation query (Lines 1-10) that


Algorithm 1 Multi-dimensional and Multi-granular Data Aggregation
– Aggregate the existing rows based on single or different levels of granularity and store them in different cursors (a separate cursor for each aggregate function).
– For all cursors do:
  • For each row in the cursor do:
    ∗ Generate new rows in the selected dimension tables (if not already existing) to point to the higher granularity rows in the cursor.
      · If rows with the required granularity levels have not been found in the selected dimension tables then generate new rows with the required level of granularity. Else find existing rows in the selected dimension tables that match the required granularity.
    ∗ Insert the aggregated rows in the fact table with a reference to the newly created or retrieved rows in the selected dimension tables.
  • Close the cursor.
– Close all cursors.
– Delete all the rows from the fact table that have just been aggregated.
– Delete all the rows from the selected dimension tables with no fact rows pointing to them.

aggregates multi-dimensional data at different levels of granularity. The most significant part of the SQL statement is the GROUP BY clause (Lines 9-10). The GROUP BY clause aggregates several different lower levels of granularity into higher levels of granularity. Similarly, if we consider all the levels in the dimension hierarchies, described in MMDW schema (Fig. 2), in that case the GROUP BY clause (Lines 11-13) appears as:

11 GROUP BY L.Taskid, L.Parameterid, TD.Year, TD.Month, TD.Day,
12 LD.Field, LD.Partfield, LD.Treatmentzone, TAD.Customer,
13 TAD.Farm, TAD.Task;

Moreover, it is possible to aggregate across multiple dimensions and multiple granularities. The following GROUP BY clause (Lines 14-15) aggregates across all three dimensions (Time, Location and Task). It aggregates data from Day and Month granularities into Year granularity, from Treatmentzone and Partfield granularities into Field granularity and from Task granularity into Farm granularity.

14 GROUP BY L.Taskid, L.Parameterid, TD.Year, LD.Field,
15 TAD.Customer, TAD.Farm;
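To see the effect of the GROUP BY clause in Lines 14-15, the following sketch runs an on-the-fly aggregation over the running example of Section 3. It rests on several assumptions: SQLite through Python stands in for the RDBMS, the column types are guessed, an extra fact row for parameter 1 is added to show the filter, the select list returns the grouping columns rather than the row identifiers, and detailed rows that were already aggregated are taken to have been deleted, so that the Day-level and Month-level values do not overlap.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Time_Dimension (Timeid TEXT PRIMARY KEY, Granularity TEXT,
                             Year INTEGER, Month INTEGER, Day INTEGER);
CREATE TABLE Location_Dimension (Locationid TEXT PRIMARY KEY, Granularity TEXT,
                                 Field TEXT, Partfield TEXT, Treatmentzone TEXT);
CREATE TABLE Task_Dimension (Taskid TEXT PRIMARY KEY, Granularity TEXT,
                             Customer TEXT, Farm TEXT, Task TEXT);
CREATE TABLE Logs (Timeid TEXT, Locationid TEXT, Taskid TEXT,
                   Parameterid INTEGER, Value REAL);
INSERT INTO Time_Dimension VALUES
    ('T1', 'Day', 2012, 3, 1), ('T2', 'Month', 2012, 3, NULL);
INSERT INTO Location_Dimension VALUES
    ('L1', 'Treatmentzone', 'Red Field', 'North', 'Medium Rate'),
    ('L2', 'Partfield', 'Red Field', 'North', NULL);
INSERT INTO Task_Dimension VALUES
    ('TA1', 'Task', 'Niels', 'Town Farm', 'Initial Spray');
INSERT INTO Logs VALUES
    ('T1', 'L1', 'TA1', 247, 5.0),
    ('T2', 'L2', 'TA1', 247, 60.0),
    ('T1', 'L1', 'TA1', 1, 8.0);
""")

# Rows at Day and Month granularity share the same Year value, and rows at
# Treatmentzone and Partfield granularity share the same Field value, so
# grouping on the higher levels merges them into one result row.
query = """
SELECT TD.Year, LD.Field, TAD.Customer, TAD.Farm, L.Parameterid, SUM(L.Value)
FROM Logs L, Time_Dimension TD, Location_Dimension LD, Task_Dimension TAD
WHERE (L.Parameterid = 247 OR L.Parameterid = 248)
  AND L.Timeid = TD.Timeid
  AND L.Locationid = LD.Locationid
  AND L.Taskid = TAD.Taskid
GROUP BY L.Taskid, L.Parameterid, TD.Year, LD.Field, TAD.Customer, TAD.Farm
"""
result = conn.execute(query).fetchall()
print(result)
```

With this data the two rows for parameter 247 (5 liters at Day by Treatmentzone level and 60 liters at Month by Partfield level) are combined into a single Year by Field by Farm row of 65 liters, while the row for parameter 1 is filtered out.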

5 Evaluation

This section evaluates MMDW based on the running example, presented in Section 3. MMDW has been evaluated based on a realistically sized dataset (10,000,000 (fact)+1,000 (time)+100 (location)+100 (task)=10,001,200 rows). The tests were designed to calculate aggregates on-the-fly and to measure the


aggregation query speed in seconds, aggregation query complexity with respect to number of lines of code and storage used in MB. The tests were performed on a 2.0 GHz Intel Core Duo with 512 MB RAM, running Ubuntu 8.04 (hardy) and MySQL 5.0.5. Every test was performed 5 times. The maximum and minimum values were discarded and an average was calculated using the middle three values. Table 2 shows the comparison (of the aggregation query speed (QS) in seconds, aggregation query complexity (QC) with respect to number of lines of code and storage used (SU) in MB) between MMDW and the standard star schema (SSS).

Table 2. Comparison between MMDW and SSS (10,001,200 rows)

QS (MMDW)       QC (MMDW)  SU (MMDW)   QS (SSS)        QC (SSS)  SU (SSS)
140.28 seconds  10 Lines   421.146 MB  139.17 seconds  10 Lines  421.139 MB
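The averaging rule used for the timings (five runs, discard the maximum and the minimum, average the middle three) can be written as a small helper. The function name trimmed_mean and the sample timings are made up for illustration and are not measurements from the paper.

```python
def trimmed_mean(times):
    """Average the values that remain after dropping one minimum and one maximum."""
    if len(times) < 3:
        raise ValueError("need at least three measurements")
    ordered = sorted(times)
    middle = ordered[1:-1]  # drop the smallest and the largest value
    return sum(middle) / len(middle)


# Made-up timings in seconds for five runs of one test.
runs = [10.0, 12.0, 11.0, 30.0, 5.0]
average = trimmed_mean(runs)
print(average)
```

Dropping the extremes keeps a single unusually slow or fast run, such as the 30.0 above, from distorting the reported average.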

It is important to note that the fact table used in both schemas (MMDW and SSS), which usually takes 90-95% or more of the storage space, is exactly the same. The dimensions are slightly different due to the presence of the Granularity attribute; however, the dimensions take only 5-10% of the storage. When calculating aggregates on-the-fly, MMDW is within about 1% of the standard star schema in aggregation query speed (140.28 versus 139.17 seconds), needs the same number of lines of query code (10) and uses only 0.007 MB more storage, so it compares favorably with the standard star schema. In addition to calculating aggregates on-the-fly, MMDW also allows aggregates to be pre-computed and stored. Furthermore, schema design can play an important role in the DW development process, and it needs to be flexible, easy-to-use and easy-to-maintain. In general, schemas become less useful if they reduce understandability to end users and increase complexity of queries [11]. MMDW neither adds unnecessary complexity, nor considerably reduces query performance, nor significantly increases storage space.

6 Related Work

Other studies on multi-granular schema design and data aggregation have been carried out previously. In the context of multi-granular schema design, the foundation is provided by [7]. The work provides various alternatives to design a multi-granular schema using a fact table and a time dimension. It presents extended versions of the star, snowflake and shrunken schemas; however, the work is based on a single dimension and it lacks a strategy for multi-granular data aggregation. Another work [6] presents a dimension-less (without any dimensions) multi-granular fact table and a few abstract data aggregation strategies. In contrast to both these works, the current paper presents a generic multi-dimensional and multi-granular schema as well as a concrete data aggregation strategy. Furthermore, the work in [12] proposes a ROLAP based multi-dimensional schema to store detailed data (single granularity) and extends the OLAP algebra by introducing an operator


for multi-granular analysis. The work allows data to be aggregated on-the-fly at various levels of granularity; however, it does not have the ability to store multi-granular (detailed) data or to pre-compute and store the aggregated data at various levels of granularity, as suggested in the current paper. In the context of multi-granular data aggregation, a practical approach is provided by [5]. The approach is mainly focused on single dimension based data aggregation in relational systems. It consists of the following steps: finding the tasks that are due for aggregation; aggregating the existing rows; generating the new rows in the Time dimension in order to point to the higher granularity fact rows; inserting the newly aggregated rows in the fact table; deleting the previous rows from the fact table. In contrast to this approach, the current paper presents a multi-dimensional data aggregation strategy, which is outlined in Algorithm 1. Moreover, work has been reported on multi-dimensional data stream summarization. The work presented in [13], [14] and [15] aims at developing materialized data cubes for on-line, multi-granular and multi-dimensional aggregation of data streams. The work presented in [13] is based on the tilted time frame model, whereas in [14] the tilted time frame model is extended by incorporating user preferences and precision functions to materialize only the useful part of the historical stream data. The work presented in [15] is based on a graph as well as the tilted time frame model, with a particular focus on continuously-changing data. The main purpose of these approaches is to build compact data cubes for on-line analysis of stream data. In comparison, the current paper provides a complete solution for the management and aggregation of multi-dimensional and multi-granular data with a focus on static (old) data. Another work, presented in [16], uses a hierarchy-based approach to address the task of data aggregation in OLAP systems. The main focus of this work is on computing OLAP queries on-the-fly over multi-dimensional data streams. In contrast to this work, the current paper supports both on-the-fly aggregation and pre-computing and storing aggregates. Finally, the approaches presented in [13], [14], [15] and [16] are main memory (MOLAP) based; in contrast, our solution is relational (ROLAP) based.

7 Conclusion

To manage multi-dimensional and multi-granular data, this paper presents a multi-dimensional and multi-granular schema (MMDW) for data warehousing. MMDW is an extension of the standard star schema and is based on ROLAP. MMDW has the ability to store multi-granular (detailed) data and to aggregate data on-the-fly or to pre-compute and store the aggregated data at any desired combination of dimensions and granularities for analysis and reporting purposes. MMDW has been evaluated on a realistically sized dataset. The experiments show that MMDW performs well in terms of query speed, query complexity and storage used, and compares favorably with the standard star schema. The proposed schema is general and can be used in any industry in which significant amounts of multi-dimensional and multi-granular data are generated and analyzed. In the future, more extensive evaluation of MMDW should


be done. It should also be investigated how the model can be efficiently implemented using standard data warehousing technologies (MOLAP) and how it could be applied to data streams.

References

1. Kimball, R., Ross, M.: The Data Warehouse Toolkit, Second Edition. Wiley Computer Publishing, New York (2005)
2. Inmon, W.H.: Building the Data Warehouse, Fourth Edition. Wiley Publishing, Inc., Indianapolis (2002)
3. Kimball, R.: The Data Warehousing Toolkit. Wiley Computer Publishing, New York (2002)
4. Bidgoli, H.: The Handbook of Technology Management: Supply Chain Management, Marketing and Advertising, and Global Management, Volume 2. John Wiley & Sons (2010)
5. Iftikhar, N., Pedersen, T.B.: Using a Time Granularity Table for Gradual Granular Data Aggregation. In: Catania, B., Ivanovic, M., Thalheim, B. (eds.) ADBIS 2010. LNCS, vol. 6295, pp. 219–233. Springer, Heidelberg (2010)
6. Iftikhar, N., Pedersen, T.B.: Gradual Data Aggregation in Multi-granular Fact Tables on Resource-constrained Systems. In: Setchi, R., Jordanov, I., Howlett, R.J., Jain, L.C. (eds.) KES 2010. LNCS, vol. 6278, pp. 349–358. Springer, Heidelberg (2010)
7. Iftikhar, N., Pedersen, T.B.: Schema Design Alternatives for Multi-granular Data Warehousing. In: Bringas, P.G., Hameurlain, A., Quirchmayr, A. (eds.) DEXA 2010. LNCS, vol. 6262, pp. 111–125. Springer, Heidelberg (2010)
8. LandIT, http://www.tekkva.dk/page326.aspx
9. Osborne, K., Johnson, R., Põder, T.: Expert Oracle Exadata. Apress, New York (2011)
10. MySQL 5.0 Manual, http://dev.mysql.com/doc/refman/5.0/en/innodb-physical-record.html
11. Moody, D., Kortink, M.: From ER Models to Dimensional Models II: Advanced Design Issues. Business Intelligence Journal 8(4) (2003)
12. Hubert, G., Teste, O.: Multigranular Manipulations for OLAP Querying. In: Guillet, F., Ritschard, G., Zighed, D.A., Briand, H. (eds.) AKDM 2010. SCI, vol. 292, pp. 97–112. Springer, Heidelberg (2010)
13. Han, J., Chen, Y., Dong, G., Pei, J., Wah, B.W., Wang, J., Cai, Y.D.: Stream Cube: An Architecture for Multi-dimensional Analysis of Data Streams. Distributed and Parallel Databases 18(2), 173–197 (2005)
14. Pitarch, Y., Laurent, A., Plantevit, M., Poncelet, P.: Multidimensional Data Stream Summarization using Extended Tilted-time Windows. In: IEEE International Conference on Advanced Information Networking and Applications Workshops, pp. 250–254. IEEE Press, New York (2009)
15. Pitarch, Y., Laurent, A., Poncelet, P.: Summarizing Multidimensional Data Streams: A Hierarchy-graph-based Approach. In: Zaki, M.J., Yu, J.X., Ravindran, B., Pudi, V. (eds.) PAKDD 2010. LNCS, vol. 6119, pp. 335–342. Springer, Heidelberg (2010)
16. Cuzzocrea, A.: Cams: OLAPing Multidimensional Data Streams Efficiently. In: Pedersen, T.B., Mohania, M.K., Tjoa, A.M. (eds.) DaWaK 2009. LNCS, vol. 5691, pp. 48–62. Springer, Heidelberg (2009)
