Category: Developer - Performance Optimization - DO06 Poster
P5300
Contact: Eyal Rozenberg <[email protected]>
Towards a real GPU-speedup in SQL query processing with a new heterogeneous execution framework
THIS IS A WIP
Adnan Agbaria | David Minor | Natan Peterfreund | Roman Talyansky | Ofer Rosenberg | Eyal Rozenberg
Heterogeneous Computing Group, Shannon Lab, Huawei Research
Motivation and contribution summary
The contribution of this work-in-progress consists of:
1. An application-agnostic execution environment (named AXE), supporting data and task parallelism over multiple GPUs and CPUs.
2. An analytic SQL processing framework, incorporating this engine, achieving significant speedups relative to a performant DBMS, via:
2.1 Pre-processing data to enhance parallelism.
2.2 Hardware-minded optimization of the execution graph with rewriting rules.
2.3 Breaking up SQL-processing-related computational operations and constituting alternate ones that are easier to optimize.

State of the art (?)
· An ongoing concern of the IC&T industry is maximizing analytic DBMS performance, with GPUs showing some promise towards this end.
· Previous works have shown speedups through the use of a GPU, but it has remained unclear whether their solutions are competitive with more performant CPU-based DBMSes.
Contact us: Natan Peterfreund <[email protected]>, Eyal Rozenberg <[email protected]>
Recent work on speeding up SQL query processing was presented at this conference last year (GTC’14):
· RedFox (Wu et al.; also in CGO’14)
· Galactica (Yong et al.; also in BDS’14)
… but the reference DBMSes used are quite slow compared to top-performing (columnar) DBMSes.
Reference CPU Processing Time: TPC-H Query 1, Scale Factor 1 (msec, plan execution)
[Bar chart over MonetDB, LogicBlox and PostgreSQL; MonetDB executes the plan in 159 msec, the other two take thousands of msec (e.g. 2,760).]
Note: the above chart does not represent a contribution of this work.
Accelerated Processing Time: TPC-H Query 1, Scale Factor 1 (msec, plan execution)
· MonetDB/AXE (this work): 17.9
· MonetDB: 159.4
· LogicBlox / Red Fox: 330.0
· PostgreSQL / Galactica: 580.0
CPUs used: RedFox: 2 x Xeon 2670; Galactica: 1 x Xeon 5680; this work: 2 x Xeon 2690.
GPUs used: RedFox: GTX Titan; Galactica: Tesla K40c; this work: GTX 780 Ti.
While some adjustment may be necessary, the CPUs are close in performance, and for the GPUs, review websites present different advantages for each card. Timing values are therefore not normalized, except for an optimistic factor of 0.5 applied to the PostgreSQL CPU performance figure. MonetDB version used: 11.15.15.
· When possible, we have timed a multithreaded CPU run with the same Logical Execution Graph, for reference (for some queries, we have not yet completed the work necessary to allow this).
Logical AXE Execution Plan
AXE: Partitioning & Device Selection
Overlapping Compute & Memory Transfer
Multi-GPU
· Our ‘aggressive’ graph rewriting is well reflected in the set of top GPU compute time consumers:
· In Q1 we see the significance of the GROUP BY Sum transformation.
· In both Q4 and Q9 we see consumers whose use is in no way evident from the query itself (e.g. outputting the indices of the bits turned on in a vector of booleans).
Realized AXE Execution Plan
AXE: Run-Time Engine
SQL-related operations
Multiple device support via graph partitions
1. The AXE execution environment receives a ‘Logical’ Execution Graph from an (arbitrary) application. 2. This graph is then partitioned into subgraphs; each can be assigned to a subset of the available devices (CPUs and GPUs). (A minimal sketch of these notions follows.)
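For concreteness, here is a minimal sketch of what a logical execution graph and its device-assigned subgraphs might look like; all type and field names are hypothetical illustrations, not AXE's actual interfaces:

#include <string>
#include <vector>

// Hypothetical illustration of a logical execution graph and its
// partitioning into device-assigned subgraphs (not AXE's actual API).
enum class Device { CPU, GPU0, GPU1 };

struct Node {
    std::string op;           // e.g. "Select", "Gather", "Join"
    std::vector<int> inputs;  // indices of producer nodes
};

struct Subgraph {
    std::vector<int> nodes;      // node indices within the logical graph
    std::vector<Device> devices; // the subset of devices assigned to it
};

struct LogicalGraph {
    std::vector<Node> nodes;
    std::vector<Subgraph> partitions; // result of the partitioning step
};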
· Typical execution, without any preprocessing of the Key column. Groups are materialized for aggregation.
Breaking operations up allows for cancel-out:
Required preprocessing: None.
[Diagram: a Select node (inputs d.in and s.indices, the latter produced by Determine Group Presence) is broken up into SelectIndices(s.indices) followed by Gather(d.in); adjacent Gather nodes then cancel out.]
The GROUP BY Sum example (bottom of this poster):
· Typical plan: Group Materialization (T.c2 → gm.groupby_value, T.c1 → gm.data), followed by Per-Group Sum.
· Rewritten plan: Group-Indexed Aggregate Sum (T.c1 → gias.data, pre-calculated Group Index → gias.group_indices).
Required preprocessing: a Group Index column for schema column T.c2.
Note: Preprocessing and data regularization also benefit CPU execution of queries – but GPUs benefit more.
Plan Execution Time Acceleration: TPC-H Queries (msec, plan execution)
[Grouped bar charts at Scale Factors 1 and 2 over Q01, Q04, Q09 and Q21, comparing MonetDB, MonetDB/AXE CPU, MonetDB/AXE 1 GPU and MonetDB/AXE 2 GPUs; times range from 356.3 msec for unaccelerated MonetDB down to the 20-30 msec range with AXE on GPUs.]
Notes:
1. The measured times do not include DBMS parsing & plan generation, execution graph optimization, etc.; see below. 2. Currently, (multi-threaded) CPU and multi-GPU execution rely on data partitioning of the respective execution graph; these partitionings are not yet ready for Q4 and Q21.
GPU compute time breakdown by operation (top 5 time-consumers, total msec):
[Bar charts for TPC-H Q1, Q4 and Q9. Q1's top consumers: arithmetic kernels, Select, Gather, Elementwise Compare and Group-Indexed Aggregate Sum, the latter dominating at roughly 7.8 msec. Q4's include Select Indices, Gather, Histogram, Group-Indexed Aggregate Sum and RHS-Unique Join. Q9's include Substring Search, Foreign Key Join, Select Indices Sorted and Gather.]
Note: In all queries, the sum of these is still far below the total plan execution time. The rest is non-overlapped I/O, GPU idle time, and post-GPU-activity work on the CPU.
Overall query processing time – activity breakdown:
[Stacked bar charts: TPC-H Q9 at Scale Factor 2, comparing MonetDB, AXE CPU, AXE 1 GPU and AXE 2 GPUs; and TPC-H Q1 at Scale Factors 1, 2, 4 and 8, on 1 GPU and 2 GPUs. Activities broken down: CPU compute, overall GPU activity, device selection & partitioning, graph compilation & optimization, MonetDB plan preparation, and other.]
We have presented results from a first iteration of development. We plan to continue developing the framework, both in relation to the SQL processing use case and independently of it, adding several features enhancing usability as well as performance:
· Increase SQL coverage & support other DSLs.
· Support more processor types, vendors & platforms (APUs, OpenCL-based devices).
· GPU memory management.
· Automatic fusion of consecutive operations.
· Improve I/O and computation overlap.
· Fully utilize CPU vectorization capabilities to explore more optimal mixtures of execution on the CPU and on the GPU.
· Et cetera.
Some challenges for future work: questions to ponder
- Where do splitter/joiner nodes get run?
- Why do we choose certain nodes for a subgraph?
- Why choose only some devices for subgraph execution?
· A pre-calculated Group Index for the Key column is available; we use it as an offset into the aggregates vector.
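A minimal CUDA sketch of a group-indexed aggregate sum in this spirit; the kernel name and signature are illustrative assumptions, not AXE's actual operation:

#include <cuda_runtime.h>

// Illustrative sketch only: each thread adds its data element into the
// accumulator selected by the pre-calculated group index. 'sums' must
// be zero-initialized and sized to the number of distinct groups.
__global__ void group_indexed_aggregate_sum(
    const int*   __restrict__ group_indices, // gias.group_indices
    const float* __restrict__ data,          // gias.data
    float*       __restrict__ sums,          // one accumulator per group
    int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // The group index serves directly as an offset into the
        // aggregates vector; no group materialization is needed.
        atomicAdd(&sums[group_indices[i]], data[i]);
    }
}

With few distinct groups, per-block shared-memory accumulation would reduce atomic contention; global atomics are shown here only for brevity.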
Example: a common query fragment,
SELECT sum(Data) FROM Table GROUP BY Key;
How would we execute it? Two alternatives are described above: typical execution with group materialization, and execution using a pre-calculated Group Index.
4. I/O operations are added as necessary to move data between devices’ memory spaces.
· It is also important to present a complete breakdown of the query processing time into the various kinds of activity, beyond just GPU compute and memory transfers. On the other hand, as the scale factor increases, these other activities become less significant.
Preprocessing ⇒ Parallelism
3. Subgraphs are replicated for processing on multiple devices, with splitter/joiner nodes effecting partitions of the data.
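Purely as an illustration (a contiguous split of one column across all visible GPUs; AXE's splitters are more general), a host-side sketch:

#include <algorithm>
#include <cuda_runtime.h>
#include <vector>

// Illustrative 'splitter': divide a host column into contiguous chunks,
// one per GPU, and copy each chunk to its device.
void split_to_devices(const float* host_col, size_t n,
                      std::vector<float*>& dev_chunks,
                      std::vector<size_t>& dev_lens)
{
    int num_devices = 0;
    cudaGetDeviceCount(&num_devices);
    if (num_devices == 0) return;
    dev_chunks.assign(num_devices, nullptr);
    dev_lens.assign(num_devices, 0);
    size_t chunk = (n + num_devices - 1) / num_devices;
    for (int d = 0; d < num_devices; ++d) {
        size_t offset = (size_t)d * chunk;
        size_t len = offset < n ? std::min(chunk, n - offset) : 0;
        dev_lens[d] = len;
        if (len == 0) continue;
        cudaSetDevice(d);
        cudaMalloc(&dev_chunks[d], len * sizeof(float));
        cudaMemcpy(dev_chunks[d], host_col + offset,
                   len * sizeof(float), cudaMemcpyHostToDevice);
    }
}

A matching 'joiner' would then concatenate (or merge) the per-device results on one device or on the host.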
Device Abstraction Layer: generic operations
Pre-calculated, query-independent column statistics:
· Specifically permitted by the TPC-H benchmark.
· Can be (partially) maintained during execution.
· Allow for data regularization, and for assumptions of locality, which in turn facilitate parallel processing.
One such statistic, a Group Index column, is sketched below.
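For example, the Group Index column used by the GROUP BY Sum rewrite can be built in a single pass; this is a host-side sketch of the idea only, not our implementation:

#include <cstdint>
#include <unordered_map>
#include <vector>

// Sketch: dictionary-encode a key column into a dense group-index
// column (values 0 .. num_groups-1), maintainable as a column statistic.
std::vector<int32_t> build_group_index(const std::vector<int64_t>& key_col)
{
    std::unordered_map<int64_t, int32_t> dict;
    std::vector<int32_t> group_index;
    group_index.reserve(key_col.size());
    for (int64_t key : key_col) {
        // New keys get the next dense index; existing keys reuse theirs.
        auto it = dict.try_emplace(key, (int32_t)dict.size()).first;
        group_index.push_back(it->second);
    }
    return group_index;
}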
Optimization via graph rewriting
Possible optimizations include:
· Removing redundant nodes.
· Breaking up nodes representing complex operations into simpler, constituent operations (available to AXE).
· Re-integrating simpler computations into more complex, specifically-optimized ones.
· Replacing general-case computations with better-performing special-case ones.
· Adjusting computation on individual input columns to utilize pre-calculated statistics.
A toy sketch of the second and third of these appears below.
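A toy sketch of the 'breaking up' rule, using hypothetical node types (AXE's actual rule machinery is not shown): a general Select is decomposed into SelectIndices plus Gather, exposing further rewrites such as cancelling back-to-back Gather nodes:

#include <string>
#include <vector>

// Toy rewrite: Select(data, mask) => Gather(data, SelectIndices(mask)).
// The simpler constituents expose further opportunities, e.g.
// Gather(Gather(x, i), j) => Gather(x, Gather(i, j)), letting
// intermediate gathers cancel out.
struct Node { std::string op; std::vector<int> inputs; };

void break_up_select(std::vector<Node>& graph, int select_id)
{
    if (graph[select_id].op != "Select") return;
    int data = graph[select_id].inputs[0];
    int mask = graph[select_id].inputs[1];
    graph.push_back({"SelectIndices", {mask}});
    int indices = (int)graph.size() - 1;
    // Rewrite in place, so consumers of the node id are unaffected.
    graph[select_id] = {"Gather", {data, indices}};
}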
· At this time, our fragment of supported SQL only allows us to process some of the TPC-H queries (with relative efficiency). The charts present results for Q1, Q4, Q9 and Q21.
Execution Graph Compiler & Optimizer
GPU Kernel Task Parallelism
Results, and a bit of analysis
DBMS Intermediate Representation
Simultaneous Processing: Multi-Core CPU & Multiple GPUs
The DSL-to-execution-graph compilation component has a triple functionality:
· Compiling the host DBMS’ internal execution plan representation into graph form.
· Transforming the graph to make it acceptable input for the application-agnostic execution environment (AXE).
· Effecting execution graph optimizations, in view of the available resources reported by AXE.
The latter two are effected through repeated application of multiple graph rewriting rules (see ‘Optimization via graph rewriting’).
At this time, the above is implemented only for a fragment of analytic SQL.
Analytic SQL Query
Multi-GPU Memory Transfer
Host DBMS SQL Parser & Initial Execution Plan Generator
· Overlapped I/O & computation
· Simultaneous I/O to multiple devices
· Parallel multi-device execution
· Graph node task parallelism
· Graph node data parallelism
… but it’s generally not useful to apply all of these together to any single subgraph. (A sketch of the first capability follows.)
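A minimal sketch of overlapped I/O and computation via double-buffered CUDA streams; process_chunk is a hypothetical placeholder kernel, and host buffers are assumed pinned (cudaHostAlloc) so the copies are truly asynchronous:

#include <cuda_runtime.h>

// Hypothetical placeholder kernel standing in for real plan operations.
__global__ void process_chunk(float* data, size_t n)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Double-buffered pipeline: the h2d copy, kernel and d2h copy of chunk c
// run on one stream while chunk c+1 uses the other, so transfers
// overlap compute. Same-stream ordering keeps each buffer safe to reuse.
void run_pipeline(const float* host_in, float* host_out,
                  float* dev_buf[2], size_t chunk, size_t num_chunks)
{
    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);
    for (size_t c = 0; c < num_chunks; ++c) {
        int b = (int)(c % 2);
        cudaMemcpyAsync(dev_buf[b], host_in + c * chunk,
                        chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[b]);
        unsigned blocks = (unsigned)((chunk + 255) / 256);
        process_chunk<<<blocks, 256, 0, s[b]>>>(dev_buf[b], chunk);
        cudaMemcpyAsync(host_out + c * chunk, dev_buf[b],
                        chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s[b]);
    }
    cudaDeviceSynchronize();
    for (int i = 0; i < 2; ++i) cudaStreamDestroy(s[i]);
}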
Foreign Key Join Avoidance: when j.in_lhs holds foreign keys into a primary-key column T2.c, the RHS-Unique Join can be replaced by a ScatterIndices over T2.c followed by a Gather: j.out_lhs is trivial, and the Gather’s output is equivalent to j.out_rhs. Required preprocessing: determining the maximum value of each primary key column. A sketch follows.
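A simplified CUDA sketch of this rewrite (kernel names illustrative; bounds and validity checks omitted): ScatterIndices builds a position-of-key lookup table, sized by the maximum key value determined during preprocessing, and Gather then resolves each foreign key to its RHS row index:

#include <cuda_runtime.h>

// ScatterIndices over the primary-key column T2.c:
// pos_of_key[key] = row index of that key in T2.c.
// pos_of_key is sized max_key+1, hence the preprocessing requirement.
__global__ void scatter_indices(const int* __restrict__ keys,
                                int* __restrict__ pos_of_key, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) pos_of_key[keys[i]] = i;
}

// Gather: out_rhs[i] = pos_of_key[in_lhs_fkeys[i]];
// equivalent to the RHS-Unique Join's right-hand output j.out_rhs.
__global__ void gather_join_rhs(const int* __restrict__ in_lhs_fkeys,
                                const int* __restrict__ pos_of_key,
                                int* __restrict__ out_rhs, int m)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < m) out_rhs[i] = pos_of_key[in_lhs_fkeys[i]];
}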
We graft our framework onto MonetDB; its unmodified CPU processing is often faster than the GPU-accelerated results of previous works (before our changing anything). It seems that we are just entering the ‘ballpark’ for exploring GPGPU acceleration of SQL query processing.
Our AXE execution environment uses its device abstraction layer to schedule execution on multiple devices.
The Logical Execution Graph is now a Realized Execution Graph ready for execution by AXE’s Run-Time Engine.
As our framework matures, we hope to secure approval for its release as Free Software.