Procedia Computer Science Volume 66, 2015, Pages 635–641 YSC 2015. 4th International Young Scientists Conference on Computational Science
Using Data Compression for Increasing Efficiency of Data Transfer Between Main Memory and Intel Xeon Phi Coprocessor or NVidia GPU in Parallel DBMS

Konstantin Y. Besedin, Pavel S. Kostenetskiy, and Stepan O. Prikazchikov
South Ural State University, Chelyabinsk, Russia
[email protected], [email protected], [email protected]
Abstract

The need to transfer data through the PCI Express bus is considered one of the main bottlenecks in programming for manycore coprocessors and GPUs. This paper focuses on using data compression methods, namely RLE, Null Suppression, LZSS and a combination of RLE and Null Suppression, to increase the efficiency of data transfer between main memory and a coprocessor. The chosen compression methods were implemented for two hardware platforms: Intel Xeon Phi coprocessors and NVidia GPUs. A number of experiments were conducted using these implementations. We show that the chosen compression methods can be used to increase the efficiency of database processing on Intel Xeon Phi and NVidia GPUs by reducing the time needed for transferring data to the coprocessor, provided the data satisfy certain conditions. We also show that, when a compression method allows data to be processed without decompression, such processing can additionally increase the efficiency of the method.

Keywords: Intel Xeon Phi, CUDA, parallel DBMS, compression, RLE, Null Suppression, LZSS
1 Introduction
In the modern world, coprocessors like the Intel Xeon Phi and GPUs have become an integral part of high performance computing. Today, the most powerful computers are equipped with either GPUs or manycore coprocessors [1]. These devices have shown their efficiency for many applications, parallel database management systems (DBMS) being one example [3, 6]. Using manycore coprocessors and GPUs in a parallel DBMS is an important scientific problem, because parallel DBMSs that are able to use a computing cluster play an important role in processing large amounts of data [11, 12, 15, 17].
However, the overhead caused by the need to transfer data through PCI Express can limit the effectiveness of manycore coprocessors and graphics accelerators for database processing. This overhead is not specific to databases and is considered one of the main bottlenecks in programming these devices [9, 10].

This paper evaluates the possibility of using data compression to reduce the overhead of transferring data from main memory to a GPU or Xeon Phi coprocessor through the PCI Express bus. We evaluate three compression schemes used in DBMSs: RLE, Null Suppression and LZSS. We also explore the possibility of using combinations of different compression methods; at the current stage of the research, we evaluate only the combination of RLE and Null Suppression. We have implemented parallel versions of the decompression algorithms of the evaluated compression methods for Intel Xeon Phi coprocessors and NVidia GPUs. These implementations were used in experiments to evaluate how effectively each compression method improves data transfer between main memory and an Intel Xeon Phi coprocessor or NVidia GPU.

This paper is a continuation of our research published in [4], extended with more detailed information about the implementation of the compression methods and about the experiments. The paper is structured as follows: Section 2 gives a brief overview of prior work. Section 3 describes the examined compression methods and their implementation. Section 4 presents the experiments and their results.
2 Prior Work
Data compression is one possible way to increase the efficiency of data transfer between main memory and a coprocessor or GPU. Data compression is already used by database systems to decrease the amount of data involved in I/O operations [7, 8, 13, 16]. Data compression for column-oriented databases is often evaluated separately, because storing data in a column-oriented manner introduces a number of additional opportunities for compression. Moreover, column-oriented databases can operate directly on compressed data [6, 2, 5].

There are a number of papers devoted to the implementation of different compression methods on GPUs. For example, [19] describes an implementation of lossy (JPEG, Vorbis) and lossless (LZ77) compression schemes, and [14] describes an implementation of the LZSS method. One possible approach is to use a combination of several simple compression schemes such as RLE, Null Suppression and Dictionary Coding; paper [6] shows that this approach can additionally increase the efficiency of compression. To the best of our knowledge, there is no prior research on using data compression for database processing on Intel Xeon Phi coprocessors.
3 Compression Methods Implementation
At this stage of the research, we have implemented all of the chosen compression methods for the Intel Xeon Phi coprocessor; for the NVidia Tesla K40m GPU we have implemented the RLE and Null Suppression methods. For the Intel Xeon Phi we used OpenMP and ran the coprocessor in offload mode. The implementation for the NVidia Tesla GPU was done with CUDA. We did not implement hardware-specific optimized versions of the evaluated compression methods, since this is not required for the purposes of our study.
3.1 Null Suppression
The Null Suppression compression method removes leading zeros from the bit representation of each input element. In our version, the number of removed zeros is the same for all input elements. For performance reasons we keep compressed values byte-aligned on both the GPU and the Xeon Phi, so whole leading zero bytes are removed. Decompression is straightforward: the required number of leading zero bits is prepended to each element.
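As an illustration, the following minimal sketch packs 64-bit values into a fixed number of significant low-order bytes and unpacks them again. It assumes little-endian storage and a uniform value size; the function and variable names are illustrative and not taken from our actual implementation.

#include <cstdint>
#include <cstring>
#include <vector>

// Byte-aligned Null Suppression keeping `value_size` significant bytes per element.
// Illustrative sketch only; assumes a little-endian machine and uint64_t input.
std::vector<uint8_t> ns_compress(const std::vector<uint64_t>& in, size_t value_size) {
    std::vector<uint8_t> out(in.size() * value_size);
    for (size_t i = 0; i < in.size(); ++i)
        std::memcpy(&out[i * value_size], &in[i], value_size);   // copy the low-order bytes
    return out;
}

std::vector<uint64_t> ns_decompress(const std::vector<uint8_t>& in, size_t value_size) {
    std::vector<uint64_t> out(in.size() / value_size, 0);        // zero-initialized: leading zero bytes are implicit
    #pragma omp parallel for
    for (long long i = 0; i < (long long)out.size(); ++i)
        std::memcpy(&out[i], &in[i * value_size], value_size);   // each element is independent
    return out;
}

Because every element is restored independently, the decompression loop parallelizes trivially, which is what makes this method convenient for both OpenMP and CUDA.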
3.2 RLE
In our implementations, data compressed with the RLE method consist of two arrays: an array of run values and an array of run lengths. For NVidia GPUs run lengths are represented with 64-bit integers, and for the Intel Xeon Phi with 8-bit integers.

We implemented RLE decompression for the NVidia GPU with the following algorithm:

1. Calculate the run positions in the decompressed data and the decompressed data size with a prefix sum.
2. Decompress each compressed run with a separate thread block.

Decompression on the Intel Xeon Phi is done in two steps:

1. Use a prefix sum to calculate the run positions in the decompressed data and the decompressed data size.
2. Divide the runs between several OpenMP threads and decompress them in parallel.

A sketch of the Xeon Phi variant is given below.
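The following sketch illustrates the two-step scheme for the OpenMP (Xeon Phi) variant under the assumptions above: 8-bit run lengths, an exclusive prefix sum to locate each run, and run-parallel expansion. The names are illustrative, not taken from our code.

#include <cstdint>
#include <vector>

// Sketch of two-step RLE decompression (OpenMP / Xeon Phi flavour).
// values[i] is repeated lengths[i] times in the decompressed output.
std::vector<uint64_t> rle_decompress(const std::vector<uint64_t>& values,
                                     const std::vector<uint8_t>& lengths) {
    size_t n = values.size();

    // Step 1: exclusive prefix sum over run lengths gives each run's start
    // position in the decompressed array and the total decompressed size.
    std::vector<size_t> start(n + 1, 0);
    for (size_t i = 0; i < n; ++i)
        start[i + 1] = start[i] + lengths[i];

    std::vector<uint64_t> out(start[n]);

    // Step 2: runs are independent, so they are divided between OpenMP threads.
    #pragma omp parallel for schedule(static)
    for (long long i = 0; i < (long long)n; ++i)
        for (size_t j = 0; j < lengths[i]; ++j)
            out[start[i] + j] = values[i];
    return out;
}

The GPU version follows the same structure, except that the prefix sum is computed on the device and each run is expanded by a separate thread block.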
3.3 LZSS
Unlike RLE and Null Suppression, the original LZSS algorithm, as described in [18], does not allow parallel compression and decompression. To enable parallelism, we divide the compressed data into several segments that can be processed independently. Our decompression algorithm consists of the following steps:

1. Calculate the segment positions in the decompressed data and the decompressed data size with a prefix sum.
2. Decompress the segments in parallel with several OpenMP threads.

A sketch of this segment-parallel driver is given below.
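A possible shape of that driver is sketched here. It assumes per-segment metadata (compressed offset and size, decompressed size) and a sequential single-segment LZSS decoder, lzss_decode_segment, which is hypothetical and not shown; everything else mirrors the two steps above.

#include <cstdint>
#include <vector>

// Hypothetical sequential decoder for one LZSS segment (implementation not shown);
// it writes exactly out_len decompressed bytes starting at out.
void lzss_decode_segment(const uint8_t* seg, size_t seg_len,
                         uint8_t* out, size_t out_len);

// Sketch of the segment-parallel decompression driver described above.
// seg_offset[i] / seg_size[i] locate segment i in the compressed stream,
// raw_size[i] is its decompressed size; all names are illustrative.
std::vector<uint8_t> lzss_decompress(const std::vector<uint8_t>& compressed,
                                     const std::vector<size_t>& seg_offset,
                                     const std::vector<size_t>& seg_size,
                                     const std::vector<size_t>& raw_size) {
    size_t n = raw_size.size();

    // Step 1: prefix sum over decompressed segment sizes gives output positions.
    std::vector<size_t> start(n + 1, 0);
    for (size_t i = 0; i < n; ++i)
        start[i + 1] = start[i] + raw_size[i];

    std::vector<uint8_t> out(start[n]);

    // Step 2: segments are independent, so OpenMP threads decode them in parallel.
    #pragma omp parallel for schedule(dynamic)
    for (long long i = 0; i < (long long)n; ++i)
        lzss_decode_segment(&compressed[seg_offset[i]], seg_size[i],
                            &out[start[i]], raw_size[i]);
    return out;
}

Segmentation trades a small amount of compression ratio (back references cannot cross segment boundaries) for the ability to decompress segments concurrently.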
3.4 Combination of RLE and Null Suppression
At this stage of the research, the combination of RLE and Null Suppression is the only evaluated combination of compression methods. Our implementation applies Null Suppression to data that has already been compressed with RLE.
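In terms of the sketches above, the combined scheme can be expressed as a simple composition: Null Suppression packs the run-value array produced by RLE, and decompression undoes the two steps in reverse order. The container and names below are illustrative assumptions, not our actual API, and the routines reuse the ns_decompress and rle_decompress sketches shown earlier.

#include <cstdint>
#include <vector>

// Illustrative container for data compressed with RLE followed by Null Suppression.
struct RleNsBlock {
    std::vector<uint8_t> packed_values;  // run values after Null Suppression
    std::vector<uint8_t> lengths;        // 8-bit run lengths (Xeon Phi variant)
    size_t value_size;                   // significant bytes per run value
};

// Decompression reverses the composition: first undo Null Suppression on the
// run values, then expand the runs with the RLE routine sketched earlier.
std::vector<uint64_t> rle_ns_decompress(const RleNsBlock& b) {
    std::vector<uint64_t> run_values = ns_decompress(b.packed_values, b.value_size);
    return rle_decompress(run_values, b.lengths);
}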
4 Experiments
Experiments with the Intel Xeon Phi were carried out on the "Tornado SUSU" supercomputer at South Ural State University, Chelyabinsk, Russia. Experiments with NVidia GPUs were carried out on a separate GPU cluster. The characteristics of both clusters are presented in Table 1.

Table 1: Hardware setup

Cluster name     CPU                           RAM      Coprocessor            Number of nodes with coprocessor
"Tornado SUSU"   2x Intel Xeon 5680, 3.33 GHz  24 GB    Intel Xeon Phi SE10X   384
GPU cluster      2x Intel Xeon E5-2687W v2     128 GB   NVidia Tesla K40m      3

For each compression method we evaluate:

- processing time for uncompressed data;
- processing time for compressed data with prior decompression;
- processing time for compressed data without prior decompression (if allowed by the compression method).

By "processing time" we mean the sum of the time of transferring the data to device memory, the time of decompression (if needed) and the time of applying an aggregate function (sum) to the data. We measure the processing time without prior decompression for all compression methods except LZSS. The amount of test data in each experiment is 1500 MB; the test data consist of 64-bit integers (uint64_t).

For each compression scheme, the compression ratio depends on certain characteristics of the data the method is applied to [6]. We vary such a characteristic for each compression method. For RLE, LZSS and the combination of RLE and Null Suppression we use the ratio of the number of runs to the number of elements, so the experiments for these methods are done on the same test data. For Null Suppression we use the "value size", that is, the number of significant bytes.

Figure 1 shows the results of the experiments with the RLE compression method. The results for LZSS are presented in Figure 2, the results for Null Suppression in Figure 3, and the results for the combination of RLE and Null Suppression in Figure 4.

Figure 1: Processing time for data compressed with RLE (time, sec, versus number of runs / number of elements; series: GPU, no compression; GPU, with decompression; GPU, no decompression; Xeon Phi, no compression; Xeon Phi, with decompression)

For the RLE compression scheme and the combination of RLE and Null Suppression it can be seen that if the ratio of the number of runs to the number of data elements is low (run lengths are long), applying these compression methods decreases the time needed for transferring data from main memory to the device. It can also be seen that processing the data in compressed form additionally increases the efficiency of these compression methods; a sketch of computing the sum aggregate directly on RLE-compressed data is given below.
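As an illustration of processing without prior decompression, the sum aggregate used in our experiments can be computed directly on RLE-compressed data, since each run contributes its value multiplied by its length. The sketch below shows this for the OpenMP variant; it is an illustration of the idea, not our measured implementation, and the names are ours.

#include <cstdint>
#include <vector>

// Sum aggregate computed directly on RLE-compressed data:
// sum(data) = sum over runs of value_i * length_i, so no run is ever expanded.
uint64_t rle_sum(const std::vector<uint64_t>& values,
                 const std::vector<uint8_t>& lengths) {
    uint64_t total = 0;
    #pragma omp parallel for reduction(+ : total)
    for (long long i = 0; i < (long long)values.size(); ++i)
        total += values[i] * lengths[i];   // each run contributes value * run length
    return total;
}

Skipping decompression both removes the decompression step itself and lets the aggregate touch far fewer elements, which is why this mode is consistently faster in Figures 1 and 4.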
Figure 2: Processing time for data compressed with LZSS (time, sec, versus number of runs / number of elements; series: no compression, with decompression)
Figure 3: Processing time for data compressed with Null Suppression

The same applies to the LZSS compression scheme: if the ratio of the number of runs to the number of data elements is low enough, applying this compression method reduces the overhead of transferring data through the PCI Express bus. LZSS, however, does not allow processing data without prior decompression. For Null Suppression, applying compression to the transferred data outperforms transfer without compression whenever the number of significant bytes allows leading zero bytes to be removed. As with RLE and the combination of RLE and Null Suppression, Null Suppression makes it possible to increase efficiency further by processing the data in compressed form without prior decompression.
5 Conclusion
This paper focuses on using data compression methods, namely RLE, Null Suppression, LZSS and a combination of RLE and Null Suppression, to increase the efficiency of data transfer between main memory and a coprocessor. The chosen compression methods were implemented for Intel Xeon Phi coprocessors and NVidia GPUs. It is shown experimentally that these compression methods can be used to increase the efficiency of database processing on the Intel Xeon Phi coprocessor and NVidia GPUs, provided the data satisfy certain conditions. It is also shown that, when a compression method allows data to be processed without decompression, such processing can additionally increase the efficiency of the method.
Figure 4: Processing time for data compressed with the combination of RLE and Null Suppression (time, sec, versus number of runs / number of elements; series: no compression, with decompression, no decompression)

Possible directions of future work are:

- implementation and evaluation of LZSS and of the combination of RLE and Null Suppression for NVidia GPUs;
- evaluation of other compression methods;
- evaluation of other combinations of compression methods.
Acknowledgment

This work was supported in part by the Ministry of Education and Science of Russia under the Federal Targeted Programme for Research and Development in Priority Areas of Development of the Russian Scientific and Technological Complex in 2014-2020 (Agreement No. 14.574.21.0035) and by the Russian Federation president grant no. СП-1229.2015.5.
References

[1] Top 500 lists, June 2015. July 2015.
[2] Daniel J. Abadi, Samuel Madden, and Nabil Hachem. Column-stores vs. row-stores: How different are they really? In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12, 2008, pages 967-980, 2008.
[3] Konstantin Y. Besedin and Pavel S. Kostenetskiy. Simulating of query processing on multiprocessor database systems with modern coprocessors. In 37th International Convention on Information and Communication Technology, Electronics and Microelectronics, MIPRO 2014, Opatija, Croatia, May 26-30, 2014, pages 1614-1616, 2014.
[4] Konstantin Y. Besedin, Pavel S. Kostenetskiy, and Stepan O. Prikazchikov. Increasing efficiency of data transfer between main memory and Intel Xeon Phi coprocessor or NVidia GPUs with data compression. In Victor Malyshkin, editor, Parallel Computing Technologies, volume 9251 of Lecture Notes in Computer Science, pages 319-323. Springer International Publishing, 2015.
[5] Carsten Binnig, Stefan Hildenbrand, and Franz Färber. Dictionary-based order-preserving string compression for main memory column stores. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2009, Providence, Rhode Island, USA, June 29 - July 2, 2009, pages 283-296, 2009.
[6] Wenbin Fang, Bingsheng He, and Qiong Luo. Database compression on graphics processors. Proc. VLDB Endow., 3(1-2):670-680, September 2010.
[7] Goetz Graefe and Leonard D. Shapiro. Data compression and database performance. In Proc. ACM/IEEE-CS Symp. on Applied Computing, pages 22-27, 1991.
[8] Balakrishna R. Iyer and David Wilhite. Data compression support in databases. In VLDB'94, Proceedings of 20th International Conference on Very Large Data Bases, September 12-15, 1994, Santiago de Chile, Chile, pages 695-704, 1994.
[9] James Jeffers and James Reinders. Intel Xeon Phi Coprocessor High Performance Programming. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1st edition, 2013.
[10] David B. Kirk and Wen-mei W. Hwu. Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2nd edition, 2013.
[11] P. S. Kostenetskii and L. B. Sokolinsky. Simulation of hierarchical multiprocessor database systems. Program. Comput. Softw., 39(1):10-24, January 2013.
[12] Pavel S. Kostenetskiy and Leonid B. Sokolinsky. Analysis of hierarchical multiprocessor database systems. In International Conference on High Performance Computing, Networking and Communication Systems, HPCNCS-07, Orlando, Florida, USA, July 9-12, 2007, pages 245-251, 2007.
[13] Wee Keong Ng and Chinya V. Ravishankar. Block-oriented compression techniques for large statistical databases. IEEE Trans. on Knowl. and Data Eng., 9(2):314-328, March 1997.
[14] Adnan Ozsoy and Martin Swany. CULZSS: LZSS lossless data compression on CUDA. In Proceedings of the 2011 IEEE International Conference on Cluster Computing, CLUSTER '11, pages 403-411, Washington, DC, USA, 2011. IEEE Computer Society.
[15] Constantin S. Pan and Mikhail L. Zymbler. Taming elephants, or how to embed parallelism into PostgreSQL. In Database and Expert Systems Applications - 24th International Conference, DEXA 2013, Prague, Czech Republic, August 26-29, 2013, Proceedings, Part I, pages 153-164, 2013.
[16] Mark A. Roth and Scott J. Van Horn. Database compression. SIGMOD Record, 22(3):31-39, 1993.
[17] Leonid B. Sokolinskiy and Elena V. Ivanova. Using distributed column indexes for query execution over very large databases. In The International Conference "Advanced Mathematics, Computations and Applications - 2014", Institute of Computational Mathematics and Mathematical Geophysics of Siberian Branch of Russian Academy of Sciences, Novosibirsk, Russia, pages 51-52. Novosibirsk: Academizdat, June 2014.
[18] James A. Storer and Thomas G. Szymanski. Data compression via textual substitution. J. ACM, 29(4):928-951, October 1982.
[19] L. Wu, M. Storus, and D. Cross. Final project "CUDA WUDA SHUDA": CUDA compression project. Technical Report CS315A, Stanford University, 2009.