Optimizing OLAP cubes construction by improving data placement on multi-nodes clusters

Billel ARRES
Universite Lumiere Lyon 2, 69676 Bron, France
[email protected]

Nadia KABACHI
Universite Lyon 1, 69100 Villeurbanne, France
[email protected]

Omar BOUSSAID
Universite Lumiere Lyon 2, 69676 Bron, France
[email protected]

ABSTRACT

The increasing volume of relational data calls for alternatives to traditional processing approaches. The Hadoop framework, an open-source project based on the MapReduce paradigm, is a popular choice for big data analytics. However, the performance gained from Hadoop's features is currently limited by its default block placement policy, which does not take any data characteristics into account. Indeed, the efficiency of many operations, including indexing, grouping, aggregation and joins, can be improved by careful data placement. In this paper, we propose a data warehouse placement policy to improve query performance on multi-node clusters, especially Hadoop clusters. We investigate the performance gain for the OLAP cube construction query with and without data organization, varying both the number of nodes and the data warehouse size. We found that the proposed data placement policy lowers the global execution time for building OLAP data cubes by up to 20 percent compared to the default data placement.

Keywords: MapReduce, HDFS, data warehouses, block placement

1. INTRODUCTION

For some years now, we have witnessed dramatic increases in the volume of data in virtually every area: business, science, and daily life, to name a few. The storage and processing of such an overwhelming amount of data is a challenging task in current computing environments. To address the scalability requirements of today's data analytics, parallel shared-nothing architectures of commodity machines (often consisting of thousands of nodes) have lately been established as the de facto solution. Various systems have been developed, mainly by industry, to support big data analysis, including Google's MapReduce [13, 14], Yahoo's PNUTS [7], Microsoft's SCOPE [16] and Walmart Labs' Muppet [23]. Several companies, including Facebook [4], both use and contribute to Apache Hadoop (an open-source implementation of MapReduce) and its ecosystem.



Recently, MapReduce has been used not only for the analysis of homogeneous data sets (e.g., log processing) but also for heterogeneous data sets such as data warehouses and data marts. Particularly in business settings, online analytical processing (OLAP) applications can take great advantage of the MapReduce framework because these applications process multi-dimensional analytical queries over very large data warehouses (or 'big data'). In this context, researchers from the database community have initiated different techniques and efforts to get the maximum benefit from the MapReduce programming model. These techniques follow three major directions. The first direction focuses on integrating databases with MapReduce [1]. The second focuses on adding indexing capabilities to MapReduce [15]. The third direction attempts to empower Hadoop through careful data organization [17]. Other research efforts have tried to enhance the performance of MapReduce by optimizing the structure of data storage. Column-oriented storage techniques [3] and extensions of Hadoop such as Manimal [9] try to optimize the storage of related data on the same nodes. In this paper, we study the benefits of data colocation in the context of data warehousing and OLAP cube construction. Based on an existing data organization mechanism, our experiments suggest that a good data placement on the cluster during the implementation of the data warehouse can significantly increase the performance of OLAP cube construction and querying. Here, it is necessary to define a priori a strategy for data colocation that meets the objectives of the OLAP cube before its construction (which we discuss in Section 4). We use CoHadoop's approach [17] since it is an extension of Hadoop that enables applications to control where data are stored; moreover, it retains the flexibility and fault tolerance of Hadoop, which are among the most important MapReduce characteristics. In our experiments, we use Hive [5] as a data warehousing solution for the construction of the data cube on top of the proposed Hadoop extension, and we investigate the performance gain for OLAP cube construction with and without data organization. We also use a data sample from the Star Schema Benchmark [18], which is a well-known large-scale data analysis benchmark. The remainder of the paper is organized as follows. Section 2 presents background and related work. Section 3 describes the problem we address in this paper. Section 4 details the proposed approach and its implementation. Section 5 evaluates the performance of the proposed approach. Section 6 concludes the paper.

2. BACKGROUND AND RELATED WORK

In this section, we begin by describing the MapReduce paradigm in the context of OLAP. We then discuss the Hadoop data organization enhancement techniques in brief.

2.1 MapReduce and OLAP

MapReduce [13] is a framework for the parallel processing of massive data sets. As shown in Figure 1, a job to be performed using MapReduce has to be specified as two phases: the map phase, specified by a Map function, takes key/value pairs as input, performs computation on this input, and produces intermediate key/value results; and the reduce phase, which processes these results as specified by a Reduce function. The data from the map phase are shuffled, i.e., exchanged and merge-sorted, to the machines performing the reduce phase. It should be noted that the shuffle phase can itself be more time-consuming than the other two, depending on network bandwidth availability and other resources.
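To make the two phases concrete, here is a minimal Hadoop (Java) sketch of the kind of grouped aggregation that underlies OLAP cube construction: a mapper emits (grouping key, measure) pairs and a reducer sums the measure per key. The input layout (pipe-delimited rows, key in the first column, measure in the second) is an assumption for illustration, not the format used in the paper.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Sums a numeric measure (e.g., revenue) per grouping key (e.g., year). */
public class RevenueByKey {

  public static class GroupMapper
      extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      // Assumed layout: pipe-delimited rows, grouping key in column 0,
      // numeric measure in column 1 (illustrative only).
      String[] fields = line.toString().split("\\|");
      if (fields.length >= 2) {
        ctx.write(new Text(fields[0]),
                  new DoubleWritable(Double.parseDouble(fields[1])));
      }
    }
  }

  public static class SumReducer
      extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<DoubleWritable> values, Context ctx)
        throws IOException, InterruptedException {
      double sum = 0.0;
      for (DoubleWritable v : values) {
        sum += v.get();
      }
      ctx.write(key, new DoubleWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "revenue-by-key");   // Hadoop 1.x style constructor
    job.setJarByClass(RevenueByKey.class);
    job.setMapperClass(GroupMapper.class);
    job.setCombinerClass(SumReducer.class);      // map-side partial sums
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The combiner performs partial sums on the map side, which reduces the volume of data shuffled to the reducers, the phase noted above as potentially dominating execution time.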

Figure 1: The MapReduce model.

The MapReduce programming model has many advantages, such as high throughput and performance, the use of commodity machines, and fault tolerance. MapReduce is used not only for index construction in search engines but also for data analysis over both homogeneous and heterogeneous data sets. Data join processing, which is very important for complex analysis in data warehouses, is addressed with MapReduce in [11]. Other works, such as [6] and [10], used Hadoop-based implementations like Hive [5] as alternatives to relational DBMSs, benchmarking and comparing different approaches to compute data cubes with MapReduce. According to [12], a data warehouse is an online repository for data from the operational systems of an enterprise. It is usually maintained using a star schema composed of a fact table and a number of dimension tables. A fact table contains atomic data or records for business areas such as sales and production. Dimension tables have a large number of attributes that describe the records of the fact table. In the context of MapReduce, all data in the warehouse are stored as chunks in a distributed file system [19]. The data warehouse is split into blocks and distributed over the cluster nodes. However, the distributed file system tries to balance the load by placing blocks randomly, independently of the intended use of the data. In this paper, we focus on the performance gained by careful data organization for star schema data warehouses.
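As a side note, the scattering of a table's blocks across DataNodes can be observed directly through the HDFS client API. The sketch below lists, for each block of a file, the nodes holding a replica; the path is a hypothetical fact-table file, used only for illustration.

```java
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Prints, for each block of an HDFS file, the DataNodes holding a replica. */
public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    // Hypothetical location of a fact-table file loaded into HDFS.
    Path file = new Path(args.length > 0 ? args[0] : "/warehouse/lineorder.tbl");

    Configuration conf = new Configuration();  // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    FileStatus status = fs.getFileStatus(file);

    // One BlockLocation per block of the file.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (int i = 0; i < blocks.length; i++) {
      System.out.println("block " + i
          + " (offset " + blocks[i].getOffset()
          + ", length " + blocks[i].getLength() + "): "
          + Arrays.toString(blocks[i].getHosts()));
    }
    fs.close();
  }
}
```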

2.2 Hadoop Data Organization Improvements

Hadoop [21] is a MapReduce-based framework designed for scalable and distributed computing. It consists of two main parts: the Hadoop Distributed File System (HDFS) and MapReduce for distributed processing. Files in HDFS are split into a number of large blocks (usually a multiple of 64 MB) which are stored on DataNodes (worker nodes). A file is typically distributed over a number of DataNodes in order to facilitate high bandwidth and parallel processing. Efficient access to data is essential for achieving good performance during query processing, which is a very important step in data warehousing. In contrast to the many positive features of MapReduce and its open-source implementation Hadoop, such as scalability and fault tolerance, it has some limitations, especially in data access. According to the literature [2, 8], three subcategories of data access improvement exist, namely indexing, data layouts and intentional data placement. For indexing, Hadoop++ [15] is a system that provides indexing capabilities to Hadoop. Hadoop++ cogroups input files by creating a special "Trojan" file. However, Hadoop++ can only colocate two files that are created by the same job. In the case of data warehouses it is necessary to colocate all the data warehouse files, not only two, and this can be a major inconvenience. Data layouts are another subcategory of techniques for enhancing MapReduce performance. For example, Llama [24] proposes the use of a columnar file (called CFile) for data storage in HDFS. This enables selective access to only the columns used in the query. However, this type of storage provides more efficient access than traditional row-wise storage only for queries that involve a small number of attributes, which is not always the case for data warehouses. Other research efforts have tried to empower Hadoop through intentional data organization. CoHadoop [17] is an extension of Hadoop that enables applications to control where data are stored. It retains the flexibility of Hadoop since it does not require users to convert their data to a certain format such as a relational database or a specific file format. To achieve colocation, CoHadoop extends HDFS with a file-level property (locator), and files with the same locator are placed on the same set of DataNodes. In addition, a new data structure (the locator table) is added to the NameNode (master node) of HDFS. The contributions of CoHadoop include colocation and copartitioning, and it targets a specific class of queries that benefit from these techniques. For example, joins can be efficiently executed using a map-only join algorithm, thus eliminating the overhead of data shuffling and the reduce phase of MapReduce. In this work, we propose to use CoHadoop's mechanism to colocate the data warehouse table files, since the logic of CoHadoop does not change the data structure and preserves the performance of Hadoop in terms of fault tolerance and scalability [2]. We propose a data warehouse distribution strategy over a multi-node cluster that best meets the expectations of the final user queries, such as queries building OLAP data cubes.
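The locator idea can be summarized with a small, self-contained model. The class below is our own illustrative sketch of the bookkeeping involved, not CoHadoop's actual implementation: a table maps each locator value to the set of DataNodes already chosen for it, so every subsequent file carrying the same locator is steered to those nodes, while unlabeled files keep the default placement.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Toy model of a locator table: files that share a locator are assigned
 * to the same set of DataNodes; files without a locator fall back to a
 * default (e.g., random) placement. Illustrative only.
 */
public class LocatorTable {
  private final Map<Integer, List<String>> locatorToNodes = new HashMap<>();
  private final Placement defaultPlacement;

  /** Pluggable stand-in for the cluster's default placement policy. */
  public interface Placement {
    List<String> chooseNodes(String fileName, int replication);
  }

  public LocatorTable(Placement defaultPlacement) {
    this.defaultPlacement = defaultPlacement;
  }

  /** Returns the DataNodes a new file should be placed on. */
  public List<String> place(String fileName, Integer locator, int replication) {
    if (locator == null) {
      return defaultPlacement.chooseNodes(fileName, replication); // default path
    }
    // The first file seen for a locator fixes the node set; later files reuse it.
    List<String> nodes = locatorToNodes.get(locator);
    if (nodes == null) {
      nodes = defaultPlacement.chooseNodes(fileName, replication);
      locatorToNodes.put(locator, new ArrayList<>(nodes));
    }
    return nodes;
  }
}
```

Section 4 describes how such locators are assigned to the warehouse table files during the loading phase.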

3. PROBLEM STATEMENT

The implementation of a data warehouse that incorporates the best features of the MapReduce model, scalability and fault tolerance, has been the goal of several research efforts, e.g., [22] and [1]. However, the process remains the same. Assume a star schema data warehouse with a fact table FF and four dimension tables D1, D2, D3 and D4. With the distributed file system HDFS, all the data in the data warehouse are split into blocks of fixed size and stored on DataNodes. The block size is configurable and defaults to 64 MB. There will therefore be FF_i, D1_j, D2_k, D3_m and D4_n blocks in the file system, where the integers i, j, k, m and n depend on the size of each table (for example, at the default 64 MB block size, a 920 GB warehouse occupies on the order of 14,700 blocks). By default, the data placement policy of HDFS tries to balance the load by placing blocks randomly on the DataNodes. This default policy arbitrarily places partitions across the cluster, so that mappers often have to read the corresponding partitions from remote nodes. This causes high data shuffling costs and network overhead during the querying step.

4. SYSTEM IMPLEMENTATION

The main idea of our work is that, to improve data warehouse query performance, and in particular the performance of queries building OLAP cubes, we must first define a good strategy for data distribution. In this case, we propose to colocate on the same set of DataNodes the fact table partitions, related attributes and dimension partitions that are involved in the user query. This can significantly reduce network overhead. We implement our approach using the HDFS-385 API (version 1.2.0, released in May 2013), which is an expert-level interface for developers who want to try out custom placement algorithms that override the HDFS default placement policy. Based on CoHadoop's data placement mechanism, called the locator [17], our approach proceeds as follows. During the loading phase, each data warehouse table file is assigned to at most one locator, and many table files can be assigned to the same locator. Table data files and partitions with the same locator are placed on the same set of DataNodes, whereas files with no locator are placed via Hadoop's default strategy. The locator table is initialized with default values according to the placement policy defined beforehand. Figure 3 shows the locator table (denoted Reference Table) corresponding to the colocation of four table files on a cluster.

Figure 2: DW blocks processing without colocation vs. with colocation.

For the vast majority of OLAP cube-based queries, there are four basic steps: select measures and dimension attributes; join the cube and dimension views; apply measure and dimension attribute conditions; and use "All" filters to leverage summaries for excluded dimension columns. In this case, the network overhead can be eliminated, or at least reduced, by colocating the corresponding partitions, i.e., storing them on the same set of DataNodes as shown in Figure 2. Furthermore, colocation improves the efficiency of many operations such as joins and grouping. As a simple example, consider a query that involves the two dimensions D1 and D2: our strategy consists of colocating the D1_j, D2_k and FF_i partitions on the same or the closest nodes, using the reference mechanism described above, while D3_m and D4_n are placed via Hadoop's default strategy over the remaining nodes (see the sketch below). Nowadays, with the advent of big data and very large data warehouses (consisting of hundreds of tables), users face real challenges in terms of performance, availability and security, and with the emergence of cloud computing and pay-as-you-go pricing models, the issue of cost reduction remains crucial. In this context, our approach is a step forward. For example, a provider of virtual computing resources, such as computing nodes and storage, may for various reasons spread its resources over several regions of the world. A random distribution of the data warehouse over its various clusters automatically increases data transfer costs and network overhead when querying. A process that colocates files and data blocks, by contrast, helps to reduce both data transfer costs and response time, and therefore the total cost of utilization. We discuss the implementation details of the proposed approach below.
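To make the strategy concrete, the following sketch reuses the illustrative LocatorTable model from Section 2.2 (our own toy code, not the production HDFS extension): the fact table FF and the dimensions D1 and D2 used by the query share a locator, so their files land on one node set, while D3 and D4 fall back to the default policy. Node names, the replication factor of 3 and the random stand-in for HDFS's default placement are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Demo of locator-driven placement for a star-schema warehouse (illustrative). */
public class WarehousePlacementDemo {
  public static void main(String[] args) {
    // Hypothetical cluster of 8 DataNodes.
    final List<String> cluster = Arrays.asList(
        "dn01", "dn02", "dn03", "dn04", "dn05", "dn06", "dn07", "dn08");
    final Random rnd = new Random(42);

    // Stand-in for HDFS's default policy: pick 'replication' nodes at random.
    LocatorTable.Placement randomPolicy = (file, replication) -> {
      List<String> shuffled = new ArrayList<>(cluster);
      Collections.shuffle(shuffled, rnd);
      return shuffled.subList(0, replication);
    };

    LocatorTable table = new LocatorTable(randomPolicy);
    Integer cubeLocator = 1;  // locator shared by the tables involved in the cube query

    // FF and the two dimensions used by the query share the locator,
    // so they all resolve to the same set of DataNodes.
    System.out.println("FF -> " + table.place("FF.tbl", cubeLocator, 3));
    System.out.println("D1 -> " + table.place("D1.tbl", cubeLocator, 3));
    System.out.println("D2 -> " + table.place("D2.tbl", cubeLocator, 3));

    // The remaining dimensions are placed by the default (random) policy.
    System.out.println("D3 -> " + table.place("D3.tbl", null, 3));
    System.out.println("D4 -> " + table.place("D4.tbl", null, 3));
  }
}
```

Running this toy demo prints the same three DataNodes for FF, D1 and D2, which is exactly the property that the map-only star join relies on.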

Figure 3: Example of four table files colocated using the locator mechanism on a multi-node cluster.

In our work, we only extended the Hadoop file system (HDFS) to support the customized data placement policy. We did not address other aspects such as varying the block size, which remains at the default 64 MB, or the block replication policy, which remains at the default factor of 3. Our goal is to study the query performance gains due to careful data warehouse organization in the context of parallelization with MapReduce. On top of our extended Hadoop version, we used Apache Hive (version 0.10.0, released on January 14th, 2013). Hive is data warehouse software that facilitates querying and managing large datasets residing in distributed storage. Hive also provides an SQL-like language called HiveQL and has support for creating cubes [5]. Our approach was evaluated with a well-known large-scale data analysis benchmark: the SSBM (Star Schema Benchmark) [18]. It is a data warehousing benchmark derived from TPC-H [20]. However, unlike TPC-H, it uses a pure textbook star schema (the "best practices" data organization for data warehouses). Figure 4 shows the schema of the tables. The star-join query, in which the fact table is joined with one or more dimension tables, is one of the best-known queries in OLAP. In a context of parallelization and data colocation, we chose to evaluate our approach with a star-join OLAP cube query that involves two dimension tables.


Figure 4: Schema of the SSBM benchmark.

We chose to build a data cube answering the following decisional query: "What is the sum of sales revenues by year and brand in Asia since 1992?". Here, we look for the total income (REVENUE) as a measure according to the dimensions PRODUCT (PART) and TIME (DATE). To this end, we colocate the data blocks of the dimension tables PART and DATE involved in the query, as well as the fact table LINEORDER, on the same group (rack) of DataNodes. CUSTOMER and SUPPLIER are partitioned and placed via Hadoop's default HDFS strategy.
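For reference, one possible HiveQL formulation of this cube query is sketched below, submitted through Hive's JDBC interface. It is an illustrative reading of the query, not the benchmark statement verbatim: the Asia filter is assumed to apply to the supplier region (as in the SSBM Q2 query family), the table and column names follow the usual SSB conventions (with the date dimension named dwdate), and the HiveServer1 driver class and connection URL are deployment-specific assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

/** Submits the cube query over Hive JDBC (HiveServer1-style connection assumed). */
public class CubeQuery {
  public static void main(String[] args) throws Exception {
    // HiveServer1 JDBC driver bundled with Hive 0.x (deployment-specific assumption).
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    Connection conn =
        DriverManager.getConnection("jdbc:hive://namenode:10000/default", "", "");

    // Sum of sales revenue by year and brand for Asian suppliers since 1992.
    // Table and column names follow the usual SSB conventions; adapt to the
    // actual schema of the deployed warehouse.
    String hiveql =
        "SELECT d.d_year, p.p_brand1, SUM(l.lo_revenue) AS revenue "
      + "FROM lineorder l "
      + "JOIN dwdate d   ON l.lo_orderdate = d.d_datekey "
      + "JOIN part p     ON l.lo_partkey   = p.p_partkey "
      + "JOIN supplier s ON l.lo_suppkey   = s.s_suppkey "
      + "WHERE s.s_region = 'ASIA' AND d.d_year >= 1992 "
      + "GROUP BY d.d_year, p.p_brand1";

    Statement stmt = conn.createStatement();
    ResultSet rs = stmt.executeQuery(hiveql);
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getString(2)
          + "\t" + rs.getString(3));
    }
    rs.close();
    stmt.close();
    conn.close();
  }
}
```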

5. EXPERIMENTAL EVALUATION

5.1 Experimental Setup

In the experiments, we used 1 NameNode and 20 DataNodes, for a total of 21 PCs in a cluster. The NameNode is equipped with an Intel Xeon E5-2630 v2 (six cores with HT, 2.6 GHz, 15 MB cache), 16 GB (4x4 GB) of RAM, and a 1 TB SATA hard drive. The DataNodes are equipped with an Intel Core i5-2400M processor, 4 GB of RAM, and a 300 GB HDD. The operating system is Ubuntu 12.04 LTS, the Java version is JDK 7, and the MapReduce framework is Hadoop 1.0.3. The network speed is 1 Gbps. We compare the performance (execution time) of the OLAP cube construction query presented previously, first without optimization (Default), using the Hadoop 1.0.3 release, then with optimization (Optimized), using our HDFS extension on the same Hadoop version. We used Hive 0.10.0 to execute the query on both platforms. All experiments were repeated at least 3 times.

5.2 Varying the data warehouse size

Figure 5 shows the elapsed time for cube computation when varying the data warehouse size on a 20-node cluster. As shown in the figure, cube computation time increases significantly as the data size increases, and the benefits of the proposed colocation approach grow with the data size compared to the default HDFS data distribution. The figure shows that table file colocation improves query performance by 7% for a 160 GB warehouse and up to 19% for a 920 GB warehouse. This behaviour is expected since, in the context of data warehousing, colocation avoids network overhead and reduces the expensive data shuffling operation compared to the default data placement policy.

Figure 5: OLAP cube construction time by varying data warehouse size.

5.3 Varying the number of nodes

Figure 6 shows the elapsed time for cube computation when increasing the number of nodes, keeping the data warehouse size at 920 GB. From the results, we observe that the execution time of cube computation decreases as the number of nodes increases. Moreover, colocating data improves query performance as the cluster gets larger, from only 4% with 4 nodes to 20% with a 20-node cluster. We observe that intentional data warehouse placement is more suitable for large clusters: the map and reduce jobs avoid the data shuffling/sorting phase, and reading data locally is much faster than reading it over the network.

Figure 6: OLAP cube construction time by varying the number of nodes.

In Figure 7, we investigate the effect of intentional data organization on loading the data warehouse table files while varying the number of nodes. To load the data files into HDFS, both systems perform a full MapReduce job. However, as shown in the figure, the optimized file system is slightly slower. This is due to increased network utilization when colocating the different file partitions. Note that, in the initial tests performed in this work, the loading cost incurred by the proposed approach is very small compared to the savings at query time; the highest difference, observed on a 16-node cluster, does not exceed 6 percent.

Figure 7: Loading data warehouse time.

6. CONCLUSIONS

In this paper, we present a data warehouse placement strategy to improve query performance on a Hadoop cluster, in particular for queries building OLAP data cubes. By adopting an existing colocation mechanism, we empirically tested the performance gain for OLAP cube queries with and without data colocation. Our experimental results show that our approach, by colocating related data warehouse blocks, outperforms the default placement strategy by reducing network overhead. As a next step, we will extend the experiments to study the effects of other configuration parameters on data colocation in the context of parallel data warehousing, such as partition size, replication factor and OLAP query complexity. We are also studying a workload-based data placement strategy for large data warehouses by integrating clustering methods, Multi-Agent Systems (MAS) and intelligent agents into the process.

7. REFERENCES

[1] A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, and A. Rasin. HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads. PVLDB, 2(1):922–933, 2009.
[2] A. Elsayed, O. Ismail, and M. E. El-Sharkawi. MapReduce: State-of-the-art and research directions. International Journal of Computer and Electrical Engineering, 6(1):34–39, 2014.
[3] A. Floratou, J. Patel, E. Shekita, and S. Tata. Column-oriented storage techniques for MapReduce. PVLDB, 4(5):419–429, 2011.
[4] A. S. Aiyer, M. Bautin, G. J. Chen, P. Damania, P. Khemani, K. Muthukkaruppan, K. Ranganathan, N. Spiegelberg, L. Tang, and M. Vaidya. Storage infrastructure behind Facebook Messages: Using HBase at scale. IEEE Data Engineering Bulletin, 35:4–13, 2012.
[5] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive: A warehousing solution over a map-reduce framework. PVLDB, 2(2):1626–1629, 2009.
[6] B. Arres, N. Kabachi, and O. Boussaid. Building OLAP cubes on a cloud computing environment with MapReduce. In ACS International Conference on Computer Systems and Applications, pages 1–5, 2013.
[7] B. F. Cooper, R. Ramakrishnan, U. Srivastava, N. Puz, D. Weaver, and R. Yerneni. PNUTS: Yahoo!'s hosted data serving platform. PVLDB, 2:1277–1288, 2008.
[8] C. Doulkeridis and K. Nørvåg. A survey of large-scale analytical query processing in MapReduce. The VLDB Journal, 23(3):355–380, 2014.
[9] E. Jahani, M. Cafarella, and C. Ré. Automatic optimization for MapReduce programs. PVLDB, 4(6):385–396, 2010.
[10] H. Han, Y. C. Lee, S. Choi, H. Y. Yeom, and A. Y. Zomaya. Cloud-aware processing of MapReduce-based OLAP applications. In Proceedings of the Eleventh Australasian Symposium on Parallel and Distributed Computing, 140:31–38, 2013.
[11] H. Yang, A. Dasdan, R.-L. Hsiao, and D. S. Parker. Map-reduce-merge: Simplified relational data processing on large clusters. In SIGMOD '07, 2007.
[12] W. H. Inmon. The data warehouse and data mining. Communications of the ACM, 39(11):49–50, 1996.
[13] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2004.
[14] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51:107–113, 2008.
[15] J. Dittrich, J.-A. Quiané-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J. Schad. Hadoop++: Making a yellow elephant run like a cheetah. PVLDB, 3(1):515–529, 2010.
[16] J. Zhou, N. Bruno, M.-C. Wu, P.-A. Larson, R. Chaiken, and D. Shakib. SCOPE: Parallel databases meet MapReduce. The VLDB Journal, 21:611–636, 2012.
[17] M. Eltabakh, Y. Tian, F. Özcan, R. Gemulla, A. Krettek, and J. McPherson. CoHadoop: Flexible data placement and its exploitation in Hadoop. PVLDB, 4(9):575–585, 2011.
[18] P. O'Neil, E. O'Neil, and X. Chen. The Star Schema Benchmark. In Performance Evaluation and Benchmarking, 2007.
[19] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, pages 19–22, 2003.
[20] TPC-H. Transaction Processing Performance Council. http://www.tpc.org/tpch, 2012.
[21] T. White. Hadoop: The Definitive Guide. O'Reilly Media, Inc., 1st edition, 2009.
[22] H. Wang, X. Qin, Y. Zhang, S. Wang, and Z. Wang. LinearDB: A relational approach to make data warehouse scale like MapReduce. In Database Systems for Advanced Applications, volume 6588, pages 306–320, 2011.
[23] W. Lam, L. Liu, S. Prasad, A. Rajaraman, Z. Vacheri, and A. Doan. Muppet: MapReduce-style processing of fast data. PVLDB, 5:1814–1825, 2012.
[24] Y. Lin, D. Agrawal, C. Chen, and S. Wu. Llama: Leveraging columnar storage for scalable join processing in the MapReduce framework. In ACM SIGMOD International Conference on Management of Data (SIGMOD), pages 961–972, 2011.
