2016 IEEE International Conference on Big Data (Big Data)
EStore: An Effective Optimized Data Placement Structure for Hive

Xin Li¹, Hui Li¹, Zhihao Huang¹, Bing Zhu¹, and Jiawei Cai²

¹ Shenzhen Key Lab of Information Theory & Future Internet Architecture, PKU Inst. of Big Data Technology, Future Network PKU Lab of National Major Research Infrastructure, Shenzhen Eng. Lab of Converged Networking Technology, Huawei & PKU Jointly Eng. Lab of Future Network Based on SDN, Shenzhen Graduate School, Peking University, Shenzhen, China
² Guangdong Super Computing & Data Technology Co., Ltd

[email protected], [email protected], [email protected]
Abstract—The data warehouse system Hive has emerged as an important facility for supporting data computing and storage. In particular, RCFile is a tailor-made data placement structure implemented in Hive and designed for data processing efficiency. In this paper, we propose several optimization schemes based on RCFile and introduce EStore, an optimized data placement structure that improves the query rate and reduces the storage space of Hive. Specifically, EStore adopts both row-store and column-store within blocks, and further classifies columns by the usage frequency of each table column. Moreover, we employ the classic RDP code to store the files of a data table. We conduct experiments on a real cluster, and the results show that EStore outperforms RCFile in terms of query rate and storage space.

Keywords—Hive; RCFile; optimized schemes; EStore.
I. INTRODUCTION
In the era of terabytes of data, processing large-scale archived datasets is one of the most important and complex challenges for data warehouse systems. Structured data, the most common data layout type, is widely used in database management systems. In distributed systems, the way a structured data table is split has a great influence on query performance and storage space, because it determines both single-node processing efficiency and the amount of data transferred between nodes over the network. For example, row-store forces a node to read unnecessary columns during query processing, whereas column-store requires more data transfer to assemble a query result from different columns. Column-store usually achieves a higher data compression ratio because each column holds values of the same data type. It is therefore practical and essential to design a data placement structure that is efficient in terms of both query rate and storage space for data warehouse systems. Hadoop [1] is an open-source distributed system that implements the storage model HDFS and the parallel computing framework MapReduce [2]; together they provide mature storage and processing solutions that hide the details of task scheduling and fault tolerance from users. Data management technologies built on MapReduce have been surveyed in [3]. However, it is not always straightforward to convert a structured data query into map and reduce functions.
For this reason, high-level data processing systems have been proposed, such as Hive [4][5], Pig [6], Jaql [7], and others [8]. Among them, Hive is a significant member of the Hadoop ecosystem: a data warehouse system that processes large datasets residing in distributed storage using SQL. Hive supports several data layout structures. The drawbacks of the row-store layout [9] have been described in [10]. Correspondingly, column-store [11]-[14] also has a drawback: the data of different columns may reside on different physical nodes. Several effective data placements have therefore been proposed [15, 16, 17]. In particular, [18] proposed a block placement structure for Hive called RCFile, which adopts both row-store and column-store and outperforms other placement structures in query and space efficiency. However, there is still room for improvement. RCFile compresses all columns in a block with the same compression algorithm, so the system has to decompress data before querying. Moreover, it does not employ any space-efficient method for data fault tolerance. The system we design in this paper adopts optimization schemes aimed at these two inadequacies of RCFile, giving Hive higher efficiency in query rate and storage space. For fault tolerance, our system uses an erasure code [19] to save storage space. Distributed storage systems achieve fault tolerance by storing redundant data. Multiple replication is the most widely used method to protect data against loss or corruption, and it is the default storage method of HDFS. The alternative is to store data with erasure codes, which can reconstruct lost data by running a specific recovery algorithm over the surviving information and parity data. Among erasure codes, the RDP code was proposed in [20] for recovering from double disk failures in RAID systems, and to our knowledge no prior work has constructed HDFS parity blocks with the RDP code. The aim of this work is to design and implement an optimized storage structure for Hive that improves query rate and storage space. We present a data placement structure called EStore (Effective Store), which adopts three optimization schemes based on RCFile. We highlight the EStore structure as follows.
• EStore stores a data table by partitioning it into multiple row groups, whose size depends on the build parameter; each row group is then stored column-wise. In this way, the system avoids unnecessary data loading and cross-node field combination during queries.
• EStore adopts a code threshold for column classification, dividing columns into query-columns and code-columns. This improves the query rate by removing the decompression overhead of frequently queried columns.
• EStore utilizes the RDP code to store table files by generating parity blocks for each file group. The number of files in a file group is decided by the build parameter.

The rest of the paper is organized as follows. We briefly introduce the data placement structure of RCFile in Section II. Section III presents the design and implementation of the EStore schemes. Section IV presents the performance evaluation. Conclusions and future work are discussed in Section V.
II. THE DATA PLACEMENT STRUCTURE OF RCFILE
In MapReduce-based warehouse systems, the column-store scheme is used for read-optimized data. It partitions a table into column sub-relations and stores each of them contiguously on disk. Its advantages are that unnecessary columns are not read during query execution and that a higher compression ratio is achieved because each column holds a single data type. However, this scheme alone cannot guarantee good query performance, owing to the high overhead of record reconstruction: if a query touches column groups stored on different nodes, a large amount of data must be transferred over the network among the cluster's storage nodes. A storage scheme combining row-store and column-store was proposed for Hive in [18] as the data placement structure RCFile. Figure 1 shows an example of the RCFile data layout. RCFile organizes records in units of row groups, which all have the same size and are generated by splitting the table data horizontally. Each block stores one or more row groups, and within each row group the data is stored column-wise. In this way, RCFile obtains the advantages of both storage schemes for query and space efficiency. This paper proposes schemes that optimize the query and space performance of RCFile in two respects. First, RCFile compresses every column and stores them contiguously in the row group, so the system must decompress some of them when reading a file; compressing all columns indiscriminately costs more query time than compressing only the rarely used ones. Second, RCFile's fault tolerance relies entirely on the underlying storage system, such as HDFS; by constructing parity information over blocks, we can reduce the number of replicas needed for fault tolerance and thus the total storage space of a data table.
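To make the hybrid layout concrete, the following minimal Python sketch (illustrative only, not Hive's actual serialization code; the function names are ours) partitions a table into fixed-size row groups and stores each column contiguously inside a group, which is exactly the property that lets a reader skip unneeded columns:

```python
def to_row_groups(records: list, rows_per_group: int) -> list:
    """Split a table into row groups; inside each group, store data column-wise.

    Returns a list of row groups, where each row group is a list of
    column vectors (RCFile-style hybrid layout).
    """
    groups = []
    for start in range(0, len(records), rows_per_group):
        chunk = records[start:start + rows_per_group]
        # Transpose the horizontal slice so each column is contiguous.
        columns = [list(col) for col in zip(*chunk)]
        groups.append(columns)
    return groups

def read_column(groups: list, col_idx: int) -> list:
    """Read one column across all row groups without touching the others."""
    values = []
    for columns in groups:
        values.extend(columns[col_idx])
    return values

# Example: a query touching only column 1 never materializes columns 0 and 2.
table = [(1, "a", 0.5), (2, "b", 0.7), (3, "c", 0.9), (4, "d", 1.1)]
print(read_column(to_row_groups(table, rows_per_group=2), 1))  # ['a', 'b', 'c', 'd']
```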
Figure 1. Data Layout of RCFile in an HDFS block
III. THE DESIGN AND IMPLEMENTATION OF ESTORE
In this section, we present EStore, a data placement structure for Hive that is efficient in terms of query rate and storage space. It builds its block structure on the combination of row-store and column-store. A column-classification method decides how columns are stored within a row group, improving query efficiency by eliminating the decompression overhead of frequently queried columns. In addition, EStore uses a combination of RDP coding and block replication to store the blocks of a data table.

A. Block structure

EStore adopts a block placement strategy similar to RCFile: the data table is first partitioned horizontally into row groups, and each row group is then stored vertically. Unlike RCFile, whose group size is arbitrary, EStore sets a build parameter before storing table data. The build parameter is a prime number greater than 2, and it decides the sizes of both the row group and the file group (file groups are described in Part C). We use the symbol p to represent the build parameter. EStore stores the data table in HDFS blocks, and each block contains p - 1 row groups, so the row group size is determined by p and the block size. By default, we set p to 5 with a block size of 64 MB, so the maximum row group size is 16 MB. A large data table normally consists of many such blocks. In our implementation, a row group has three sections. The first is a sync marker, used mainly to separate two consecutive row groups in a block. The second is a metadata header, which records the information needed to distinguish the columns in the row group and the fields in each column; it also carries the column-classification information described in Part B, indicating whether each column must be decompressed after the row group is loaded into memory. The third section is the data itself, organized column-wise.
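As a rough illustration of the sizing rule and the three-section layout (a sketch under our own reading of the scheme; the RowGroup model and its field names are hypothetical):

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default HDFS block size used above

def row_group_size(p: int, block_size: int = BLOCK_SIZE) -> int:
    """Each block holds p - 1 row groups; p = 5 and 64 MB blocks give 16 MB."""
    assert p > 2  # the build parameter is a prime greater than 2 (primality not checked)
    return block_size // (p - 1)

@dataclass
class RowGroup:
    """Hypothetical model of the three sections of an EStore row group."""
    sync_marker: bytes  # separates two consecutive row groups in a block
    metadata: dict      # column/field boundaries, plus a per-column flag saying
                        # whether the column must be decompressed after the
                        # group is loaded into memory (see Part B)
    columns: list = field(default_factory=list)  # column-stored payload

print(row_group_size(5) // (1024 * 1024))  # -> 16
```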
B. Column-Classification

To reduce physical storage space, data warehouse systems usually compress table data before storing it. However, decompression increases query latency and computation cost.

TABLE I. COLUMNS OF THE TABLE ITEM

Column Name      Used by Queries                                           Frequency
item_sk          Q5 Q7 Q12 Q15 Q16 Q17 Q19 Q21 Q22 Q23 Q24 Q26 Q29 Q30    0.47
item_id          Q16 Q21 Q22                                               0.1
item_desc        Q21                                                       0.03
current_price    Q7 Q22 Q24                                                0.1
class_id         Q26                                                       0.03
category_id      Q1 Q15 Q29 Q30                                            0.13
category         Q5 Q7 Q12 Q17 Q26                                         0.17
Columns of a table are used with different frequencies. For a column used by most queries, nearly every query on the table incurs an extra decompression, so indiscriminate compression significantly reduces query efficiency. RCFile's approach is to compress all table columns in the row groups. EStore instead uses an optimized scheme that avoids decompressing the columns frequently used in queries. Before loading data into blocks, EStore performs a pretreatment that divides columns into two types and stores them in different ways; we call this column-classification. A column used frequently in query processing is a query-column, and any other column is a code-column. Columns are classified against a threshold on column usage frequency, which we call the code threshold. Data warehouse systems typically run periodic queries over their tables for data mining or decision support, and we classify columns by these queries. We first collect the list of periodic routine queries and set a code threshold for the data table, then compute the usage frequency of each column by summing the frequencies of the queries that involve it. Columns whose usage frequency exceeds the code threshold are query-columns; the rest are code-columns. As an example, consider the tables and queries of the TPCx-BB benchmark [21], shown in Table 1. We select a data table with seven columns; the benchmark contains 30 queries in total, and we assume they are all processed periodically for business decisions with the same frequency of 1/30. We then sum the query frequencies for each column. As Table 1 shows, 14 queries involve column 1, so its usage frequency is 14 × 1/30 ≈ 0.47. With a code threshold of 0.2, columns with usage frequency above 0.2 are query-columns and the rest are code-columns; in this table, column 1 is the only query-column. The two types of columns are stored in different ways within the row groups.
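The classification rule can be stated in a few lines. The sketch below (a simplified rendering; the function name is ours) reproduces the Table 1 example: with 30 equally likely periodic queries each contributing 1/30 to every column it touches, a column becomes a query-column once its summed frequency exceeds the code threshold:

```python
def classify_columns(column_queries: dict, num_queries: int, threshold: float):
    """Split columns into query-columns and code-columns by usage frequency."""
    query_cols, code_cols = [], []
    for col, queries in column_queries.items():
        freq = len(queries) / num_queries  # every query contributes 1/num_queries
        (query_cols if freq > threshold else code_cols).append(col)
    return query_cols, code_cols

# Table 1: TPCx-BB queries touching each column of the ITEM table.
item = {
    "item_sk": {"Q5", "Q7", "Q12", "Q15", "Q16", "Q17", "Q19", "Q21",
                "Q22", "Q23", "Q24", "Q26", "Q29", "Q30"},
    "item_id": {"Q16", "Q21", "Q22"},
    "item_desc": {"Q21"},
    "current_price": {"Q7", "Q22", "Q24"},
    "class_id": {"Q26"},
    "category_id": {"Q1", "Q15", "Q29", "Q30"},
    "category": {"Q5", "Q7", "Q12", "Q17", "Q26"},
}
query_cols, code_cols = classify_columns(item, num_queries=30, threshold=0.2)
print(query_cols)  # ['item_sk']: 14/30 ≈ 0.47 > 0.2; all other columns are code-columns
```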
Query-columns are stored for low query latency, while code-columns are stored for a high compression rate to reduce storage space. In EStore, query-columns are kept in raw form without any compression, and code-columns are compressed with the Gzip algorithm. Column-classification is practical during the normal use of a data warehouse system: it reduces query time by eliminating the decompression of query-columns, while compressing the code-columns lets a row group hold more raw data.

C. Table Files Storage Scheme

In this section, we describe how EStore stores the block files of a data table in HDFS using the RDP code together with replication. Fault tolerance must be taken into account in data warehouse systems, which are composed of large numbers of inexpensive machines where node failures are routine events. HDFS normally stores three replicas of each block, placed on different nodes whenever possible; if one replica is corrupted when a client requests a file, the client can fetch the block from one of the other two nodes. However, storing multiple replicas multiplies the physical storage required. Since storage space is one of the most important concerns of data warehouse systems, reducing it brings remarkable benefits. EStore therefore utilizes the RDP code to store data. Erasure codes protect data from disk failures by adding extra parity data, and RDP is one of the most important double-fault-tolerant erasure codes, achieving optimality in both computation and I/O efficiency. RDP takes a prime number greater than 2 as its parameter p; its coding result is a (p - 1) × (p + 1) two-dimensional array in which the first p - 1 columns are information columns and the last two are the row parity column and the diagonal parity column. Each symbol in the row parity column is the XOR sum of all information symbols in its row, and each symbol of the diagonal parity column is generated by XOR-summing all symbols along the same diagonal across the first p columns. The problem EStore must solve is how to arrange the blocks of a data table into such a two-dimensional array. To this end we define the file group, which represents one RDP array of blocks.
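The row- and diagonal-parity construction can be expressed directly. The following didactic Python encoder (our own illustrative helper following [20], not EStore's implementation) builds both parity columns over byte-string symbols; diagonal d collects the cells with (i + j) mod p = d across the first p columns, and diagonal p - 1 is left unstored:

```python
def rdp_encode(data: list, p: int):
    """Didactic RDP encoder over a (p-1) x (p-1) array of equal-length byte
    strings; data[i][j] is the symbol in row i of information column j.
    Returns the row-parity and diagonal-parity columns, each of length p - 1."""
    def xor(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    rows = p - 1
    zero = bytes(len(data[0][0]))

    # Row parity: XOR of the p - 1 information symbols in each row.
    row_par = [zero] * rows
    for i in range(rows):
        for j in range(rows):
            row_par[i] = xor(row_par[i], data[i][j])

    # Diagonal parity: diagonal d holds the cells with (i + j) mod p == d over
    # the first p columns (information plus row parity); diagonal p - 1 is
    # the "missing" diagonal and is not stored.
    full = [data[i] + [row_par[i]] for i in range(rows)]
    diag_par = [zero] * rows
    for i in range(rows):
        for j in range(p):
            d = (i + j) % p
            if d < rows:
                diag_par[d] = xor(diag_par[d], full[i][j])
    return row_par, diag_par

# p = 5: four information blocks per group, 1-byte symbols for illustration.
data = [[bytes([16 * i + j]) for j in range(4)] for i in range(4)]
row_par, diag_par = rdp_encode(data, p=5)
# The paper's example for r0,5: the XOR of {r0,0, r3,2, r2,3, r1,4},
# where column 4 is the row-parity column.
expected = data[0][0][0] ^ row_par[1][0] ^ data[2][3][0] ^ data[3][2][0]
assert diag_par[0] == bytes([expected])
```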
As described in Part A, the row group size depends on the build parameter, which is also used to construct the file groups. Below, the symbol p denotes the build parameter, which has the same meaning as the RDP parameter. File groups are generated as follows.
1) The data files of the table are split into file groups of p - 1 data files each.
2) Two parity blocks, a row parity block and a diagonal parity block, are generated for each file group with the RDP code.
3) The parity blocks are added to their file groups.
In the end, each file group contains p + 1 blocks in total, and every block in the file group contains p - 1 row groups. A file group of the data table is thus treated as one RDP coding array in EStore. Figure 2 shows the construction of a file group with p = 5: blocks 0 to 3 are table blocks, and blocks 4 and 5 are parity blocks. Each row group in block 4 holds the row parity symbols, e.g., r0,4 is the XOR sum of row groups {r0,0, r0,1, r0,2, r0,3}. Block 5 holds the diagonal parity symbols, e.g., r0,5 is the XOR sum of row groups {r0,0, r3,2, r2,3, r1,4}.

Figure 2. File Group of EStore with RDP code

We store a data table with a union of replication and RDP coding: each file group has two replicas of its data blocks plus the two parity files generated by RDP encoding. We retain replication in EStore because the recovery bandwidth of the RDP code is greater than that of replication recovery, especially when p is large. We therefore keep one additional replica of each block in HDFS, so that a single corrupted block can be recovered from its other replica; RDP decoding is used only when both replicas of a block are corrupted. Since this is not the common case in warehouse systems, the recovery bandwidth of RDP with a relatively small p is acceptable. EStore uses the default HDFS scheme to store the two block replicas, which normally places them on different nodes. The two parity blocks are stored on nodes that hold no file of the file group, so that any block of the group can be recovered even when both of its replicas are corrupted.
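As a back-of-the-envelope check of the space saving (our own estimate, assuming each data block keeps two replicas and each parity block is stored once, as described above), the raw storage overhead per file group is

\[
\text{overhead}(p) \;=\; \frac{2(p-1)+2}{p-1} \;=\; \frac{2p}{p-1},
\qquad \text{overhead}(5) \;=\; \frac{10}{4} \;=\; 2.5,
\]

i.e., 2.5 times the raw table size for the default p = 5, compared with 3 times under HDFS triple replication. This is consistent with the storage reduction observed in Section IV.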
IV. PERFORMANCE EVALUATION
We conduct experiments with EStore and Hive to demonstrate the performance of EStore. The testing platform is a 20-node Hadoop cluster; each node has a 3.7 GHz Intel Core i3 CPU, 8 GB DDR3 main memory, and a 1 TB disk, and runs Linux. The software stack comprises Hadoop 2.6.4 and Hive 2.0.0. The workload is the industry-standard TPCx-BB benchmark [21], which measures the performance of Hadoop-based big data systems. The experiments have two parts. In the first, we evaluate EStore with different build parameters from the perspectives of storage space, file recovery, and query execution. In the second, we compare EStore under different code thresholds with RCFile.

A. EStore with different build parameters

In this part, we examine how the build parameter affects data storage space, file recovery, and query execution time. We select three values of p that make block structures easy to build: 3, 5, and 17, with corresponding row group sizes of 32 MB, 16 MB, and 4 MB.

Figure 3. Performance of EStore with different build parameters: (a) table size; (b) recovery time; (c) execution time of Q17

1) Storage Space: We select the ITEM table from the benchmark and generate a dataset of about 150 GB. The table has seven columns, whose structure is described in Table 1. We also test a row-store structure with the Gzip compression algorithm for comparison. Figure 3(a) shows the storage space under each configuration. Storage structures with data compression significantly reduce storage space, and EStore needs less space than row-store for row group sizes from 4 MB to 32 MB. However, no obvious difference appears among the group sizes, which means that varying p over 3, 5, and 17 does not significantly improve compression efficiency.
2) File Recovery Throughput: We test the block file recovery throughput of the ITEM table when each file group loses one information block, with replication recovery as the baseline.
Figure 3(b) shows the recovery throughput for different p. Block file recovery with the RDP code is slower than replication recovery. The throughput for p = 17 is clearly the lowest, since the recovery module must fetch 16 blocks from a file group to decode the lost data, compared with 4 and 2 blocks in the other cases. This shows that a large file group distinctly reduces recovery effectiveness.
3) Query Execution Time: We select Q17 from the benchmark as the test query. It executes aggregations on the ITEM table and involves two columns: under the code threshold 0.2, item_sk is a query-column and category is a code-column. Figure 3(c) shows the execution time of Q17. EStore significantly outperforms row-store in all three cases, since it skips unnecessary columns during querying. Furthermore, the 4 MB group size is the fastest, because EStore uses lazy decompression as RCFile does, and a large row group causes more unnecessary data decompression.
4) Summary: Among the three build parameter values, the middle one (p = 5) achieves the most balanced performance across the three aspects. We therefore use it as the default parameter of EStore and in the experiments of Part B.

B. EStore versus RCFile

In this part, we configure Hive with the different data structures. RCFile and EStore use the same group size of 16 MB, and different code thresholds are tested for comparison; we use the symbol t to represent the code threshold. We use the dataset generated in the previous part, and Q17 from the benchmark again serves as the test query. Figure 4(a) shows the storage space of each structure after fault-tolerance measures are applied in HDFS. EStore reduces the storage space significantly in the case of t = 0.2, since it protects data with the combined RDP-plus-replication scheme, whereas RCFile uses the file system's default three replicas. In addition, the code threshold of EStore affects the storage size, which depends on the number and types of the query-columns. Figure 4(b) shows the query execution times of the different structures. EStore outperforms RCFile for every code threshold, since it can read the columns involved in the query directly without decompression. Besides, t = 0.12 has the same execution time as t = 0.16, because all columns used in Q17 are already query-columns when t = 0.16.

Figure 4. Comparison of the performance of EStore and RCFile: (a) storage space; (b) execution time of TPCx-BB Q17

We compared EStore under different code thresholds with RCFile in terms of storage space and query execution time. The experimental results reflect the performance advantages of the column-classification and fault-tolerance schemes of EStore over RCFile in all tested cases.
V. CONCLUSION
This paper proposes EStore, an optimized data placement structure based on RCFile. It constructs row groups and file groups whose sizes depend on the build parameter, and, like RCFile, stores data column-wise within each row group. EStore adopts a column-classification scheme to improve the query rate, splitting columns into query-columns and code-columns by a code threshold. File groups are built from information blocks and parity blocks generated by the RDP code. Experimental results show that EStore outperforms RCFile in terms of query rate and storage space on Hive. Our future work includes automatically selecting a proper code threshold according to the data types and usage frequencies of the columns in a data table, and automatically adjusting the file group size according to the system's space status.
ACKNOWLEDGMENT

This work is supported by the Natural Science Foundation of China (NSFC) (No. 61671001, No. 61521003), the National Key Research and Development Program of China (No. 2016YFB0800101), Guangdong Research Programs (2016B030305005), and Shenzhen Research Programs (ZDSYS201603311739428, JCYJ20150331100723974 & 20140509093817684).

REFERENCES

[1] Apache Hadoop. http://hadoop.apache.org/
[2] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," in OSDI, 2004, pp. 137-150.
[3] F. Li, B. C. Ooi, M. T. Özsu, and S. Wu, "Distributed data management using MapReduce," ACM Comput. Surv., vol. 46, no. 3, 2014.
[4] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy, "Hive - a warehousing solution over a MapReduce framework," PVLDB, vol. 2, no. 2, pp. 1626-1629, 2009.
[5] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Anthony, H. Liu, and R. Murthy, "Hive - a petabyte scale data warehouse using Hadoop," in ICDE, 2010, pp. 996-1005.
[6] A. Gates, O. Natkovich, S. Chopra, P. Kamath, S. Narayanam, C. Olston, B. Reed, S. Srinivasan, and U. Srivastava, "Building a high-level dataflow system on top of MapReduce: The Pig experience," PVLDB, vol. 2, no. 2, pp. 1414-1425, 2009.
[7] Jaql project hosting. http://code.google.com/p/jaql/
[8] C. Doulkeridis and K. Nørvåg, "A survey of large-scale analytical query processing in MapReduce," VLDB J., vol. 23, no. 3, pp. 355-380, 2014.
[9] R. Ramakrishnan and J. Gehrke, Database Management Systems. McGraw-Hill, 2003.
[10] D. Abadi et al., "Column-oriented database systems," PVLDB, vol. 2, no. 2, pp. 1664-1665, 2009.
[11] G. P. Copeland and S. Khoshafian, "A decomposition storage model," in SIGMOD Conference, 1985, pp. 268-279.
[12] M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. J. O'Neil, P. E. O'Neil, A. Rasin, N. Tran, and S. B. Zdonik, "C-Store: A column-oriented DBMS," in VLDB, 2005, pp. 553-564.
[13] D. J. Abadi, S. Madden, and N. Hachem, "Column-stores vs. row-stores: How different are they really?" in SIGMOD Conference, 2008.
[14] A. L. Holloway and D. J. DeWitt, "Read-optimized databases, in depth," PVLDB, vol. 1, no. 1, pp. 502-513, 2008.
[15] A. Floratou et al., "Column-oriented storage techniques for MapReduce," PVLDB, vol. 4, no. 7, pp. 419-429, 2011.
[16] A. Jindal, J.-A. Quiané-Ruiz, and J. Dittrich, "Trojan data layouts: Right shoes for a running elephant," in SOCC, 2011.
[17] Y. Lin et al., "Llama: Leveraging columnar storage for scalable join processing in the MapReduce framework," in SIGMOD, 2011, pp. 961-972.
[18] Y. He, R. Lee, Y. Huai, Z. Shao, N. Jain, X. Zhang, and Z. Xu, "RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems," in ICDE, 2011.
[19] M. K. Aguilera, R. Janakiraman, and L. Xu, "Using erasure codes efficiently for storage in a distributed system," in Proc. International Conference on Dependable Systems and Networks (DSN 2005), pp. 336-345, 2005.
[20] P. Corbett, B. English, A. Goel, T. Grcanac, S. Kleiman, J. Leong, and S. Sankar, "Row-diagonal parity for double disk failure correction," in FAST '04: Proceedings of the 3rd USENIX Conference on File and Storage Technologies, pp. 1-14, 2004.
[21] TPCx-BB. http://www.tpc.org/tpcx-bb/