Exploring Database Workloads on Future Clustered Many-Core Architectures

Panayiotis Petrides, Andreas Diavastos, Pedro Trancoso
Department of Computer Science, University of Cyprus
Email: [email protected]
Abstract—Decision Support System (DSS) workloads are known to be among the most time-consuming database workloads, processing large data sets. Traditionally, DSS queries have been accelerated using large-scale multiprocessors. In this work we analyze the benefits of using future many-core architectures, more specifically on-chip clustered many-core architectures, for accelerating DSS query execution, and we study their performance behavior. To achieve this goal we propose data-parallel versions of the original database scan and join algorithms. In our experiments we study the behavior of three queries from the standard DSS benchmark TPC-H executing on the Intel Single Chip Cloud Computing experimental processor (Intel SCC). The results show that such architectures can exploit parallelism well, and also how the computational workload relative to the data size of each executed query can influence performance. Our results show linear scalability for queries where the computation to data size ratio is balanced.
I. INTRODUCTION

Database applications are among the most demanding workloads. More specifically, Decision Support System (DSS) database applications combine the processing of large data sets with the computation of statistical information extracted from the data. The purpose of this paper is first to understand the benefits of using a future clustered many-core architecture, such as Intel's Single Chip Cloud Computing experimental processor (Intel SCC) [1], in a large-scale data center that handles DSS applications. Moreover, we want to investigate how these architectures can be of benefit in terms of performance. To achieve our goal we have analyzed the performance of the basic database algorithms, parallelized using the available toolchain of the Intel SCC. These algorithms are the basis for the execution of standard, representative DSS queries taken from the TPC-H benchmark suite [3].

II. MAPPING DATA-PARALLEL DATABASE QUERIES TO THE INTEL SCC EXPERIMENTAL PROCESSOR

We have formatted the processing data as data streams, resembling the data arrays of regular high-level languages. For the purpose of our work the data are stored column-wise, i.e., all values of a particular attribute, belonging to different records, are stored in the same data stream. More details about the algorithms used and how the data are mapped can be found in [2]. Given this arrangement of the data as streams, we have chosen the Sequential Scan for our implementation, achieving load balancing and exploiting locality of memory accesses. In our algorithm, all records are traversed and each record's attributes are checked against a certain condition. If the condition is satisfied, the record belongs to the results; otherwise it is discarded. The data-parallel nested-loop Join is executed by exploiting the streaming model. More specifically, if we want to join Table A with key ka and Table B with foreign key ka, then the key values are compared against the foreign-key values. The Join operation is performed by checking a certain key value from one table against all the key values of the other table. The checks are performed in a loop, and the records that satisfy them are then passed to the next condition.

III. EXPERIMENTAL SETUP

For this work we have used the Intel Single Chip Cloud Computing experimental processor, RockyLake version. The operating system used on the Intel SCC cores is the default Linux kernel provided by the RCCE SCC Kit 1.3.0. The host PC, responsible for controlling the execution of applications on the Intel SCC processor, is configured with an Intel Core i7 processor at 3.7GHz and 4GB of memory. The host PC is connected to the Intel SCC through a PCIe Expansion Card and a PCIe x4 cable. For porting and executing the applications on the SCC we have used the RCCE 1.3.0 toolchain. In our work we focused on the execution of the basic database algorithms and their parallelization. As such, the queries analyzed in this work were implemented as programs that execute the operations determined by the queries, and their results were validated. We have ported three queries of different complexity and demands from the TPC-H benchmark suite: Queries 3, 6 and 12, from now on referred to as Q3, Q6 and Q12, respectively. Different input sizes, generated using the dbgen tool, were used in our evaluation in order to study performance scalability. The input sizes and the number of tables used for each query are shown in Table I.

TABLE I
TPC-H QUERIES INPUT SIZE

Query   Tables   Input Size 0.01   Input Size 0.1
Q3      3        4.24MB            93.56MB
Q6      1        3.71MB            74.24MB
Q12     2        3.74MB            91.14MB

IV. RESULTS

We have monitored the execution of the three queries, scaling them from 1 to 48 cores on the Intel SCC processor. Our first investigation was to monitor the time taken to read the input data into the different cores for the different input sizes. The results were obtained by measuring the time taken for all cores to read the input data simultaneously, with each core reading the same amount of data, i.e., creating a local copy of the input data at each core. It is important to note that for both input sizes the time taken to read the input data is stable and does not show high deviation across the different numbers of cores. This can be explained by the high bandwidth available to the cores from the on-chip integrated memory controllers.
[Figure 1 plot: speedup (y-axis, 0-50) versus number of cores (x-axis, 1 to 48), with computational-speedup curves for Q6, Q12 and Q3 at input sizes 0.01 and 0.1.]
Fig. 1. Computation speedup for TPC-H Queries 3, 6 and 12 for input sizes 0.01 and 0.1.
Secondly, we wanted to investigate the scalability of the algorithms in the section where the computation is done. In Figure 1 we can observe the speedup results for the three queries for the two different input data sizes, 0.01 and 0.1. From our results we observe an almost linear speedup for Q12. For Q6 with input data size 0.01, the speedup reaches a maximum of 10x at 16 cores. For 32 and 48 cores we observe a speedup degradation, caused by the low computational complexity and the limited workload of Q6. Even though at scale factor 0.1 the workload increases significantly, the speedup does not increase linearly from 32 to 48 cores, but instead remains stable. This can also be explained by the low computational complexity of this query. For Q3 we observe a performance improvement for both input data sizes as the number of cores increases, although not in a linear way. This is caused by its high computational complexity compared to the other queries, which results in a slower performance increase as the number of cores increases. In Figure 2 we show the total speedup for the three queries, including both the time for reading the input data and the computation time until the completion of the queries. We can observe from our results that, even though we can achieve relatively good speedup scalability for Q6 in terms of computation, when it comes to the total time spent on the execution of this query the results are dominated by the time taken for reading the input data.
[Figure 2 plot: speedup (y-axis) versus number of cores (x-axis, 1 to 48), with total-speedup curves for Q6, Q12 and Q3 at input sizes 0.01 and 0.1.]
Fig. 2. Total speedup for TPC-H Queries 3, 6 and 12 for input sizes 0.01 and 0.1.
Q12 offers very good scalability since it combines the data transfers and the computation well, resulting in almost linear speedup for both input sizes: 36x and 42x, respectively. Q3, with the highest complexity of the three queries, offers relatively good speedup but at a lower level than Q12: 15x and 17x, respectively, for the two input sizes. Even though its behavior is similar to the other queries, the main impact on its performance comes from the computational part of the query. As described earlier, this query joins three tables, which can hurt performance due to data transfers from memory to the local cache of the cores and/or conflict misses in the local cache caused by the different data. These factors affect the performance of Q3 even though its performance still improves as the number of cores increases.

V. CONCLUSION

From our experiments we have observed different performance behavior for the different query algorithms. In order to achieve good performance, the ratio between an algorithm's complexity and its input data size must be well balanced; otherwise the performance is dominated by the data transfers. We plan to execute the query algorithms concurrently in order to determine how the SCC can be split into regions according to the performance characteristics of these algorithms.

ACKNOWLEDGMENT

The authors would like to thank Intel Labs for lending the Intel SCC research processor.

REFERENCES

[1] J. Howard et al., "A 48-Core IA-32 Message-Passing Processor with DVFS in 45nm CMOS," in Proceedings of the International Solid-State Circuits Conference (ISSCC), Feb. 2010.
[2] P. Trancoso, D. Othonos, and A. Artemiou, "Data parallel acceleration of decision support queries using Cell/BE and GPUs," in Proceedings of the 6th ACM Conference on Computing Frontiers (CF'09), pages 117-126, 2009.
[3] Transaction Processing Council, "TPC Benchmark H (Decision Support) Standard Specification, Revision 2.6.1," June 2006.
[4] N. Govindaraju, J. Gray, R. Kumar, and D. Manocha, "GPUTeraSort: High performance graphics co-processor sorting for large database management," in SIGMOD '06: Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, pages 325-336, 2006.