
Multidimensional Arrays for Warehousing Data on Clouds

Laurent d’Orazio1 and Sandro Bimonte2

1 Blaise Pascal University - LIMOS, France
[email protected]
http://www.isima.fr/limos/
2 Cemagref, France
[email protected]
http://www.cemagref.fr

Abstract. Data warehouses and OLAP systems are business intelligence technologies. They allow decision-makers to analyze, on the fly, huge volumes of data represented according to the multidimensional model. Cloud computing, driven by major ICT companies such as Google, Microsoft and Amazon, has recently attracted considerable attention. OLAP querying and data warehousing in such a context is a major issue. Indeed, the problems to be tackled are the basic ones of large-scale distributed OLAP systems (querying large amounts of data, semantic and structural heterogeneity), seen from a new point of view that takes into account the specific features of these architectures (pay-as-you-go rule, elasticity, and user-friendliness). In this paper we address the pay-as-you-go rule for the storage of warehoused data. We propose to use multidimensional array storage techniques on clouds. First experiments validate our proposal.

1 Introduction

Data warehouses and OLAP systems are business intelligence technologies that aim at the analysis of huge volumes of data modeled according to the multidimensional model [13]. In typical architectures, OLAP systems are deployed using relational DBMSs to store and analyze data. This approach is suitable for sparse data warehouses. When data is dense, the MOLAP approach can be used [28]. It stores data using a multidimensional data structure, such as multidimensional arrays, in order to reduce the size of stored data.

High Performance Computing architectures aim at meeting the increasing computing and storage needs of both scientific and industrial applications [6]. Among these architectures, cloud computing, driven by companies like Google, Microsoft and Amazon, attracts particular interest due to its low costs and because it offers good out-of-the-box solutions, even if its performance is below that of current parallel DBMSs [23].

Data warehouses and OLAP systems on clouds raise several problems related to storage and query computation performance. In particular, the problems to consider include the basic ones of large-scale distributed systems (querying large amounts of data, semantic and structural heterogeneity) from a new point of view, regarding the specific behaviors of these architectures: pay-as-you-go model, elasticity and user-friendliness [6]. Some works support complex queries, like spatial and OLAP queries, on clouds [17], [25], [27]. However, to the best of our knowledge, no work defines a particular data model to store multidimensional data on clouds while trying to respect the pay-as-you-go model.

Therefore, in this paper we provide the first step towards the implementation of a multidimensional array-based architecture on clouds, in order to reduce data storage costs. In particular, we present an algorithm that transforms data stored using multidimensional arrays into Pig data [17]. This allows us to perform OLAP queries using the MapReduce paradigm [9] and to save storage costs. The second contribution of this paper is the description of open research issues in using cloud databases for OLAP analysis.

This paper is organized as follows. Section 2 presents the context of our work. Section 3 introduces our proposition of multidimensional array storage on clouds. Section 4 validates our approach. Section 5 lists research opportunities. Finally, section 6 concludes this paper.

A. Hameurlain, F. Morvan, and A. Min Tjoa (Eds.): Globe 2010, LNCS 6265, pp. 26–37, 2010. © Springer-Verlag Berlin Heidelberg 2010

2 Context and Research Motivation

This section briefly presents a case study that will be used as an illustration in subsection 2.1, data warehouses and OLAP in subsection 2.2, data management in clouds in subsection 2.3, then introduces our research motivation in subsection 2.4.

2.1 Case Study

In order to present our work, we introduce a simulated case study that concerns the OLAP analysis of the sales of the stores of a supply chain, which are located in each French department. It has two dimensions, a spatial dimension that groups departments into regions and a temporal dimension (day < month < year), and its measure is the profit. An example of data is shown in table 1.

Table 1. Case study data

Year  Month  Day  Country  Region       Department   Profit
2010  04     01   France   Auvergne     Puy-de-Dôme  2000
2010  04     01   France   Auvergne     Allier       500
...   ...    ...  ...      ...          ...          ...
2005  11     29   France   Rhône-Alpes  Isère        2500

2.2 Data Warehouse and OLAP

Data warehouses model data according to the multidimensional model. Such a model defines the concepts of dimensions and measures. A dimension is composed of hierarchies and represents an analysis axis. A hierarchy organizes data into a hierarchical structure allowing decision-makers to analyze measures at different granularities. Measures are numerical indicators which describe the analysis subject. OLAP operators such as roll-up and drill-down allow decision-makers to navigate through hierarchies, aggregating data using SQL aggregation functions [13]. Some other operators have been defined to select a part of the data warehouse and to permute dimensions [20].

MOLAP systems use multidimensional data structures, such as multidimensional arrays, constructed from the original data, which are typically stored in relational databases. MOLAP systems improve storage performance for dense data warehouses through their particular storage data model [27]. Indeed, using multidimensional arrays allows storing only measure values, as they are indexed using the positions of dimension members. For example, according to the MOLAP representation of our case study illustrated by figure 1, and using the zero-based indices of that figure, the measure value at position ARRAY[1][1] is associated with the second member of the first dimension (2000-01-02) and the second member of the second dimension (Allier department).

Dimensions
  Time
    Time Dim[0]=2000-01-01
    Time Dim[1]=2000-01-02
    ...
    Time Dim[34121]=2010-04-10
  Location
    Location Dim[0]=France,Auvergne,Puy-de-Dôme
    Location Dim[1]=France,Auvergne,Allier
    ...
    Location Dim[99]=France,Rhône-Alpes,Isère
Measures
  Facts
    Profit Fact[0]=2000
    Profit Fact[1]=500
    ...

Fig. 1. MOLAP representation of data
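As a minimal sketch of this position-based indexing (not the authors' implementation), the following Python fragment stores only measure values in a dense array and retrieves them through the positions of dimension members. The two-member dimensions are a reduced sample of figure 1. The names `build_index`, `profit_at` and `as_relational_rows`, as well as the profit values 400 and 700, are assumptions made for illustration.

```python
# Reduced sample of figure 1: two members per dimension (the real arrays are larger).
time_dim = ["2000-01-01", "2000-01-02"]
location_dim = ["France,Auvergne,Puy-de-Dôme", "France,Auvergne,Allier"]

# Dense measure array: profit[t][l] belongs to time_dim[t] and location_dim[l].
# 2000 and 500 come from figure 1; 400 and 700 are invented for this sketch.
profit = [
    [2000, 500],
    [400, 700],
]


def build_index(members):
    """Map each dimension member to its zero-based position."""
    return {member: position for position, member in enumerate(members)}


time_index = build_index(time_dim)
location_index = build_index(location_dim)


def profit_at(date, location):
    """Look a measure up through member positions only."""
    return profit[time_index[date]][location_index[location]]


def as_relational_rows():
    """Equivalent relational rows, which repeat the members with every fact."""
    return [
        (date, location, profit[t_pos][l_pos])
        for t_pos, date in enumerate(time_dim)
        for l_pos, location in enumerate(location_dim)
    ]


print(profit_at("2000-01-02", "France,Auvergne,Allier"))
print(len(as_relational_rows()))
```

In the array form each member is written once in its dimension array, whereas the relational rows repeat the members with every fact, which is where the storage saving for dense data comes from.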

In order to store a multidimensional array as a unidimensional array, a simple formula has been provided. Let $d$ be the number of dimensions and $N_k$ the number of members of the $k$-th dimension; then the position $p$ of the measure value in the unidimensional array is:

$$p(i_1, \ldots, i_d) = \sum_{j=1}^{d} \Big( i_j \cdot \prod_{k=j+1}^{d} N_k \Big)$$

where $i_j$ is the position of the member of the $j$-th dimension.

2.3 Data Management on Cloud

In order to achieve scalable and efficient database management systems able to deal with data volumes that cannot be managed by classical relational or object DBMSs (for example, Facebook manages more than three hundred million users, more than two billion uploaded pictures and more than three million events added per day), new data management architectures have been developed on clouds. Data management in clouds usually follows a layered architecture, as illustrated by figure 2. The first level is the infrastructure tier. Typically, such a tier consists of one or several data centers that are used to run large data analysis processes [4] [1]. The main characteristic of this level is the associated pay-as-you-go model.



Fig. 2. Cloud data management architecture

The second tier is the storage tier. Its main objective is to provide a highly scalable and fault-tolerant system. In clouds, data are stored in files managed by such systems [11] [2]. The third tier is the execution environment tier. The best-known example of a cloud computing execution environment is probably Google MapReduce [9] and its open source version Hadoop [3]. Such an execution tier aims at providing elasticity by making it possible to adjust resources according to the application. On the one hand, this property avoids the large investments needed for applications to sustain peaks of use, investments which would otherwise lead to a global under-use of the infrastructure. On the other hand, it ensures the proper functioning of an application whose popularity has not been correctly foreseen, by increasing resources if necessary. The last tier is the high-level query language tier. Such a tier aims at providing user-friendliness and transparency with respect to the other tiers of the architecture and, as far as possible, to parallelism. Several query languages have been proposed, like Facebook Hive [25], Microsoft Scope [7], Google Sawzall [19] and Map-Reduce-Merge [26], which are based on particular data models such as the column-oriented model [22] or extensions of the relational model [17] [25]. In particular, the Pig Latin language [17] has been designed as a trade-off between the declarative style of SQL and the low-level, procedural style of MapReduce. The accompanying system, Pig, is fully implemented, and compiles Pig Latin into physical plans that are executed over a parallel execution environment.

2.4 Research Motivations

According to the pay-as-you-go principle, cloud users only pay for the resources (CPU, storage, bandwidth) they use. For example, with Microsoft Windows Azure [4], CPU costs 0.12 $ per hour of execution, storage costs 0.15 $ per GB per month, and bandwidth costs 0.10 $ per GB for upload and 0.15 $ per GB for download. Although all query languages for data in clouds support OLAP queries (indirectly, since no ad hoc operator [12] has been introduced), none of them addresses multidimensional data storage. Our idea is therefore to provide a particular organization of multidimensional data on clouds in order to reduce the storage and computation costs of OLAP queries, while taking advantage of the characteristics of cloud data management systems: scalability and performance.
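To make the pay-as-you-go arithmetic concrete, the following Python sketch computes a bill from the prices quoted above. The function names and the two warehouse sizes (40 GB as relational tuples, 10 GB as arrays) are hypothetical and are not measurements from this paper.

```python
# Prices quoted above for Microsoft Windows Azure [4], in US dollars.
CPU_PER_HOUR = 0.12
STORAGE_PER_GB_MONTH = 0.15
UPLOAD_PER_GB = 0.10
DOWNLOAD_PER_GB = 0.15


def storage_cost(size_gb, months):
    """Cost of keeping size_gb stored for the given number of months."""
    return STORAGE_PER_GB_MONTH * size_gb * months


def total_cost(cpu_hours, size_gb, months, upload_gb, download_gb):
    """Sum of CPU, storage and bandwidth costs."""
    return (
        CPU_PER_HOUR * cpu_hours
        + storage_cost(size_gb, months)
        + UPLOAD_PER_GB * upload_gb
        + DOWNLOAD_PER_GB * download_gb
    )


# Hypothetical sizes: the same warehouse stored as relational tuples or as arrays.
relational_gb, array_gb = 40, 10
saving = storage_cost(relational_gb, 12) - storage_cost(array_gb, 12)
print(f"Yearly storage saving: {saving:.2f} $")
```

Under these assumed sizes, the storage term scales linearly with the stored volume, so any reduction in file size translates directly into a lower monthly bill.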

3 Multidimensional Arrays in Clouds

In this section, we give an overview of the querying process on multidimensional arrays in clouds (subsection 3.1). Then, we present the storage and data processing in more detail (subsection 3.2) and our optimization of Pig OLAP queries (subsection 3.3).

3.1 Overview of Querying Multidimensional Arrays in Clouds

The querying process is composed of two steps, as illustrated by figure 3:

1. Data are structured as arrays. This reduces the size of the stored files and, as a consequence, the price to be paid by clients. When a query (or a set of queries) is posed, the arrays are translated into Pig data in a temporary file, using the algorithm presented in subsection 3.2. This file is removed after the analysis.
2. OLAP queries are formulated and optimized into an efficient execution plan of Pig Latin instructions. Note that these queries can be executed in parallel, using the MapReduce paradigm, which enables elasticity.

Fig. 3. Overview of the querying process
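The lifecycle of the temporary file in step 1 can be sketched as follows in Python. This is only an illustration under stated assumptions: `convert` and `run_queries` are hypothetical callbacks standing in for the array-to-Pig translation and for the execution of the Pig Latin plan, and the two sample tuples and the local file system are placeholders for the real data and the cloud storage tier.

```python
import os
import tempfile


def analyze_with_temporary_file(convert, run_queries):
    """Materialize the temporary file, run the analysis, then always remove the file.

    convert(path) and run_queries(path) are hypothetical callbacks standing in
    for the array-to-Pig translation and for the execution of the OLAP queries.
    """
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)  # only the path is needed; the callbacks reopen the file
    try:
        convert(path)
        return run_queries(path)
    finally:
        os.remove(path)  # the temporary file is not kept after the analysis


def write_two_tuples(path):
    # Stand-in conversion: two sample tuples, one per line.
    with open(path, "w", encoding="utf-8") as out:
        out.write("2000;01;01;France;Auvergne;Puy-de-Dôme;2000\n")
        out.write("2000;01;01;France;Auvergne;Allier;500\n")


def total_profit(path):
    # Stand-in query: sum the last field of every tuple.
    with open(path, encoding="utf-8") as data:
        return sum(int(line.rstrip("\n").split(";")[-1]) for line in data)


print(analyze_with_temporary_file(write_two_tuples, total_profit))
```

Removing the file in a `finally` clause means it is deleted even when a query fails, so the expanded data are only stored for the duration of the analysis.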

3.2 OLAP Queries Using Pig Latin and Multidimensional Arrays

This section presents how we propose to use multidimensional arrays with the storage capabilities of Pig. The Pig data model is an extension of the relational model, with the following concepts: Bag (set of values), Map (hash functions), nested tables and UDF (User Defined Functions).



Data are stored in logical multidimensional arrays, physically stored as a unidimensional array using the formula previously presented in subsection 2.2. Figure 4(a) illustrates the multidimensional arrays for the data of our case study. For example, the first fact (first line of the measure part) is the measure value associated with the members 2000-01-01 (first line of the time part) and France,Auvergne,Puy-de-Dôme (first line of the location part). When queries are posed, data are converted into Pig data in a temporary file. Each line represents a tuple, the values of a tuple being separated by semicolons. The conversion of the data of our case study is shown in figure 4(b).

2000-01-01
2000-01-02
...
2010-04-10

France,Auvergne,Puy-de-Dôme
France,Auvergne,Allier
...
France,Rhône-Alpes,Isère

2000;500;400;...

(a) Multidimensional arrays

2000;01;01;France;Auvergne;Puy-de-Dôme;2000
2000;01;01;France;Auvergne;Allier;500
...
2010;04;20;France;Rhône-Alpes;Isère;2500

(b) Pig data

Fig. 4. Data representation
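The mapping between the two representations of figure 4 can be sketched in Python as follows. This is a minimal illustration, not the authors' algorithm: it assumes zero-based member positions, with time as the first dimension and location as the second, and the function names `linear_position` and `to_pig_lines` are hypothetical. The measure values 2000, 500 and 400 come from figure 4(a); 700 is invented to complete a two-by-two sample.

```python
import math


def linear_position(indices, sizes):
    """Row-major position: sum over j of i_j times the product of N_k for k > j."""
    position = 0
    for j, index in enumerate(indices):
        stride = math.prod(sizes[j + 1:])  # empty product is 1 for the last dimension
        position += index * stride
    return position


def to_pig_lines(time_dim, location_dim, measures):
    """Return one semicolon-separated tuple per (time, location) pair."""
    sizes = (len(time_dim), len(location_dim))
    lines = []
    for t_pos, date in enumerate(time_dim):
        for l_pos, location in enumerate(location_dim):
            value = measures[linear_position((t_pos, l_pos), sizes)]
            fields = date.split("-") + location.split(",") + [str(value)]
            lines.append(";".join(fields))
    return lines


# Reduced sample of figure 4(a): two members per dimension, four measures.
sample_time = ["2000-01-01", "2000-01-02"]
sample_location = ["France,Auvergne,Puy-de-Dôme", "France,Auvergne,Allier"]
sample_measures = [2000, 500, 400, 700]

for line in to_pig_lines(sample_time, sample_location, sample_measures):
    print(line)
```

Because the tuples are generated in the same row-major order as the measure array, the tuple at a given rank always receives the measure value stored at that same rank.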

Conversion from multidimensional arrays to Pig data is done via algorithm 1. The inputs of this algorithm are the files which store the arrays of dimensions and measures (figure 4(a)). The output is the file which represents the warehouse data using the Pig data model (figure 4(b)). The idea of the algorithm is to build the Cartesian product of the first n − 1 dimensions. These data are then joined in a Cartesian product with the nth dimension, and the values of the measure array are added to the generated tuples as follows: the ith tuple receives the ith value of the measure array. When the analysis is complete, the temporary file is removed in order to save storage costs.

Algorithm 1. Algorithm for conversion of multidimensional arrays data to Pig data
Require: Tables files
Ensure: Pig file
int i ← 1;
int n;
file cartProdFile; {initialized by the cartesian product of the two last dimensions}
file pigFile;
file mAFile;
array dimensions; {set of dimensions}
while i