Poster P5300
Category: Developer – Performance Optimization – DO06


Towards a real GPU-speedup in SQL query processing with a new heterogeneous execution framework

THIS IS A WIP

Adnan Agbaria | David Minor | Natan Peterfreund | Roman Talyansky | Ofer Rosenberg | Eyal Rozenberg
Heterogeneous Computing Group, Shannon Lab, Huawei Research

Motivation and contribution summary

· An ongoing concern of the IC&T industry is maximizing analytic DBMS performance – with GPUs showing some promise towards this end.
· Previous works have shown speedups through the use of a GPU, but it remained unclear whether their solutions are competitive with more performant CPU-based DBMSes.

The contribution of this work-in-progress consists of:

1. An application-agnostic execution environment (named AXE), supporting data and task parallelism over multiple GPUs and CPUs.
2. An analytic SQL processing framework, incorporating this engine, which achieves significant speedups relative to a performant DBMS via:
   2.1 Pre-processing data to enhance parallelism.
   2.2 Hardware-minded optimization of the execution graph with rewriting rules.
   2.3 Breaking up SQL-processing-related computational operations and constituting alternate ones which are easier to optimize.

Contact us: Natan Peterfreund [email protected] Eyal Rozenberg [email protected]

State of the art (?)

Recent work on speeding up SQL query processing was presented at this conference last year (GTC’14):
· RedFox (Wu et al., also in CGO’14)
· Galactica (Yong et al., also in BDS’14)
… but the reference DBMSes used are quite slow compared to top-performing (columnar) DBMSes.

[Bar chart – Reference CPU Processing Time, TPC-H Query 1, Scale Factor 1: MonetDB, LogicBlox and PostgreSQL, in msec of plan execution; axis 0–8,000, reported values include 159 and 2,760 msec.]

We graft our framework onto MonetDB; before our changing anything, its unmodified CPU processing is often faster than the GPU-accelerated results of previous works. It seems we are only now entering the ‘ballpark’ for exploring GPGPU acceleration of SQL query processing.
Note: the above chart does not present a contribution of this work.

Accelerated Processing Time, TPC-H Query 1, Scale Factor 1 (msec, plan execution):

    MonetDB                      159.4
    LogicBlox / Red Fox          330.0
    PostgreSQL / Galactica       580.0
    MonetDB / AXE (this work)     17.9

CPUs used: RedFox – 2 × Xeon 2670; Galactica – 1 × Xeon 5680; this work – 2 × Xeon 2690.
GPUs used: RedFox – GTX Titan; Galactica – Tesla K40c; this work – GTX 780 Ti.
While some adjustment may be necessary, the CPUs are close in performance, and for the GPUs, review websites present different advantages for each card. Timing values are therefore not normalized, except for an optimistic factor of 0.5 applied to the PostgreSQL CPU performance figure. MonetDB version used: 11.15.15.

· When possible, we have timed a multithreaded CPU run with the same Logical Execution Graph, for reference (for some queries, we have not yet completed the work necessary to allow this).

[Diagram – the AXE pipeline: a Logical AXE Execution Plan undergoes partitioning & device selection, yielding a Realized AXE Execution Plan which the AXE Run-Time Engine executes (generic and SQL-related operations), with multiple-device support via graph partitions, overlapping compute & memory transfer, and multi-GPU execution.]

1. The AXE execution environment receives a ‘Logical’ Execution Graph from an (arbitrary) application.
2. This graph is then partitioned into subgraphs; each can be assigned to a subset of the available devices (CPUs and GPUs).
3. Subgraphs are replicated for processing on multiple devices, with splitter/joiner nodes effecting partitions of the data.
4. I/O operations (h2d/d2h) are added as necessary to move data between devices’ memory spaces.

The Logical Execution Graph is now a Realized Execution Graph, ready for execution by AXE’s Run-Time Engine.
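As an illustration of steps 2–4, here is a minimal CUDA sketch (our own simplified illustration, not AXE code; all function names are hypothetical) of a splitter/joiner realization of a single elementwise node across up to two GPUs, with the h2d/d2h transfer nodes made explicit:

    #include <cuda_runtime.h>
    #include <vector>

    // Stand-in for an elementwise node's body: out[i] = (in[i] > threshold).
    __global__ void elementwiseCompare(const int* in, unsigned char* out,
                                       int n, int threshold) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) out[i] = in[i] > threshold;
    }

    // Splitter: each GPU receives a contiguous slice of the column.
    // Joiner: results land back in the matching slices of `result`.
    // (Host buffers should be page-locked for the async copies to overlap.)
    void runPartitionedCompare(const int* column, unsigned char* result,
                               int n, int threshold) {
      int deviceCount = 0;
      cudaGetDeviceCount(&deviceCount);
      int gpus = deviceCount < 2 ? deviceCount : 2;
      if (gpus == 0 || n == 0) return;
      int chunk = (n + gpus - 1) / gpus;

      std::vector<cudaStream_t> streams(gpus);
      std::vector<int*> dIn(gpus);
      std::vector<unsigned char*> dOut(gpus);

      for (int g = 0; g < gpus; ++g) {
        int begin = g * chunk;
        int len = (begin + chunk <= n) ? chunk : n - begin;
        cudaSetDevice(g);
        cudaStreamCreate(&streams[g]);
        cudaMalloc(&dIn[g], len * sizeof(int));
        cudaMalloc(&dOut[g], len);
        // h2d node: move this data partition onto GPU g.
        cudaMemcpyAsync(dIn[g], column + begin, len * sizeof(int),
                        cudaMemcpyHostToDevice, streams[g]);
        elementwiseCompare<<<(len + 255) / 256, 256, 0, streams[g]>>>(
            dIn[g], dOut[g], len, threshold);
        // d2h node: return this partition's results to the host-side joiner.
        cudaMemcpyAsync(result + begin, dOut[g], len,
                        cudaMemcpyDeviceToHost, streams[g]);
      }
      for (int g = 0; g < gpus; ++g) {
        cudaSetDevice(g);
        cudaStreamSynchronize(streams[g]);
        cudaFree(dIn[g]);
        cudaFree(dOut[g]);
        cudaStreamDestroy(streams[g]);
      }
    }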

Optimization via graph rewriting

Possible optimizations include:
· Removing redundant nodes.
· Breaking up nodes representing complex operations into simpler, constituent operations (available to AXE).
· Re-integrating simpler computations into more complex, specifically-optimized ones.
· Replacing general-case computations with better-performing special-case ones.
· Adjusting computation on individual input columns to utilize pre-calculated statistics.
… but it’s generally not useful to apply all of these together to any single subgraph.

Breaking operations up allows for cancel-out:

The monolithic Select node (taking a data column d.in plus the group-presence information produced by Determine Group Presence) is broken up into a SelectIndices node, producing s.indices, followed by a Gather of d.in – an equivalent graph. Once exposed this way, index-producing and index-consuming nodes can be composed, and intermediate gathers cancel out.

Required preprocessing: none.
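To show the mechanism concretely, here is a minimal C++ sketch over a toy IR of our own (the poster does not expose AXE's actual graph representation) of two such rules: breaking a monolithic Select into SelectIndices + Gather, and then cancelling a chain of two Gathers into one:

    #include <cstddef>
    #include <string>
    #include <vector>

    // Toy execution-graph node: an operation name plus the indices of the
    // nodes producing its inputs. Only the shape of the rewrite is real here.
    struct Node {
      std::string op;
      std::vector<std::size_t> inputs;
    };

    // Break-up rule: Select(presence, data) => Gather(data, SelectIndices(presence)).
    // (Removal of any now-dead nodes is a separate "removing redundant nodes" pass.)
    void breakUpSelects(std::vector<Node>& graph) {
      std::size_t originalSize = graph.size();
      for (std::size_t i = 0; i < originalSize; ++i) {
        if (graph[i].op != "Select") continue;
        std::size_t presence = graph[i].inputs[0];
        std::size_t data = graph[i].inputs[1];
        graph.push_back({"SelectIndices", {presence}});
        graph[i] = {"Gather", {data, graph.size() - 1}};
      }
    }

    // Cancel-out rule: Gather(Gather(data, idx1), idx2) => Gather(data, idx12),
    // where idx12 is itself just a Gather of idx1 at positions idx2 -- the
    // intermediate data materialization disappears.
    void fuseGatherChains(std::vector<Node>& graph) {
      std::size_t originalSize = graph.size();
      for (std::size_t i = 0; i < originalSize; ++i) {
        if (graph[i].op != "Gather") continue;
        std::size_t producer = graph[i].inputs[0];
        if (graph[producer].op != "Gather") continue;
        std::size_t data = graph[producer].inputs[0];
        std::size_t idx1 = graph[producer].inputs[1];
        std::size_t idx2 = graph[i].inputs[1];
        graph.push_back({"Gather", {idx1, idx2}});
        graph[i] = {"Gather", {data, graph.size() - 1}};
      }
    }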

The GROUP BY Sum example (bottom of this poster):

Before rewriting, a Group Materialization node over columns T.c2 (group-by) and T.c1 (data) produces gm.groupby_value and gm.data, feeding a Per-Group Sum. After rewriting, a single Group-Indexed Aggregate Sum node takes gias.data (the T.c1 values) and gias.group_indices – the pre-calculated Group Index column.

Required preprocessing: a Group Index column for schema column T.c2.
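As a sketch of the transformed operation's semantics (our simplified illustration; parameter names echo the diagram, and AXE's real kernel is likely more tuned, e.g. with privatized per-block accumulators), a group-indexed aggregate sum is a single parallel pass:

    #include <cuda_runtime.h>

    // Group-Indexed Aggregate Sum: each row's value is added directly into the
    // aggregates vector at its pre-calculated, dense group index -- no group
    // materialization, and all rows proceed in parallel. `aggregates` must be
    // zero-initialized and sized to the group count.
    // (float atomics work on all recent GPUs; double atomicAdd needs CC >= 6.0.)
    __global__ void groupIndexedAggregateSum(const int* group_indices,  // gias.group_indices
                                             const float* data,         // gias.data
                                             float* aggregates,
                                             int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) atomicAdd(&aggregates[group_indices[i]], data[i]);
    }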

Note: Preprocessing and data regularization also benefit CPU execution of queries – but GPUs benefit more.

Results, and a bit of analysis

[Bar charts – Plan Execution Time Acceleration, TPC-H queries Q01, Q04, Q09 and Q21 at Scale Factors 1 and 2 (msec of plan execution, axis 0–400), comparing MonetDB, MonetDB/AXE CPU, MonetDB/AXE 1 GPU and MonetDB/AXE 2 GPUs; reported values range from 22.6 to 356.3 msec.]

· At this time, our fragment of supported SQL only allows us to process some of the TPC-H queries (with relative efficiency); the charts present results for Q1, Q4, Q9 and Q21.

Notes:
1. The measured times do not include DBMS parsing & plan generation, execution-graph optimization, etc.; see the activity breakdown below.
2. Currently, multi-threaded CPU and multi-GPU execution rely on data partitioning of the respective execution graph; this is not yet ready for Q4 and Q21.

GPU compute time breakdown by operation (top 5 time-consumers):

[Bar charts – total msec of GPU compute per operation:
TPC-H Q1 (axis 0–10.0 msec): arithmetic kernels, Select, Gather, Elementwise Compare, Group-Indexed Aggregate Sum;
TPC-H Q4 (axis 0–1.0 msec): Select Indices, Gather, Histogram, Group-Indexed Aggregate Sum, RHS-Unique Join;
TPC-H Q9 (axis 0–2.8 msec): Select Indices, Gather, Substring Search, Foreign Key Join, Select Indices Sorted.]

· Our ‘aggressive’ graph rewriting is well reflected in the set of top GPU compute time consumers:
· In Q1 we see the significance of the GROUP-BY Sum transformation.
· In both Q4 and Q9 we see consumers whose use is in no way evident from the query itself (e.g. outputting the indices of bits turned on in a vector of booleans).

Note: In all queries, the sum of these is still far below the total plan execution time. The rest is non-overlapped I/O, GPU idle time, and post-GPU-activity work on the CPU.

Overall query processing time – activity breakdown

· It is also important to present a complete breakdown of the query processing time into the various kinds of activity, beyond just GPU compute and memory transfers. On the other hand, as the scale factor increases, these other activities become less significant.

[Stacked bar charts – activity breakdown into: MonetDB prepares its plan; Compile & Optimize Graph; Select Devices & Partition; CPU Compute; GPU Overall; Other.
Left (100%-stacked): TPC-H Q9, Scale Factor 2, comparing MonetDB, AXE CPU, AXE 1 GPU and AXE 2 GPUs.
Right: TPC-H Q1 at Scale Factors 1, 2, 4 and 8, with 1 GPU and with 2 GPUs.]

Preprocessing ⇒ Parallelism

Pre-calculated, query-inspecific column statistics:
· are specifically permitted by the TPC-H benchmark;
· can be (partially) maintained during execution;
· allow for data regularization, and for assumptions of locality,
· which in turn facilitate parallel processing.
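For instance, simple per-column bounds can be pre-calculated with off-the-shelf primitives; a minimal Thrust-based sketch (our illustration – the poster does not specify how AXE computes or maintains its statistics):

    #include <thrust/device_vector.h>
    #include <thrust/extrema.h>
    #include <utility>

    // Pre-calculate a simple column statistic: the min and max of a (non-empty)
    // column already resident on the GPU. Such bounds license locality
    // assumptions, and e.g. size the lookup table in the foreign-key rewrite below.
    std::pair<int, int> columnMinMax(const thrust::device_vector<int>& column) {
      auto mm = thrust::minmax_element(column.begin(), column.end());
      return {*mm.first, *mm.second};  // dereferencing copies the values to the host
    }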

A common query fragment:

    SELECT sum(Data) FROM Table GROUP BY Key;

Example: how would we execute the query fragment above?

· Typical execution, without any preprocessing of the Key column: GROUP BY Materialization of the Key and Data columns, followed by per-group Aggregation (∑ Data). Groups are materialized for aggregation.
· With preprocessing: a pre-calculated Group Index for the Key column is available; Group-Indexed Aggregation uses it as an offset into the aggregates vector, with no group materialization.


Foreign Key Join Avoidance:

When join input j.in_lhs holds foreign keys into T2.c, and T2.c is a primary key column, the general-case join can be avoided. A ScatterIndices node records each T2 row’s position under its key value, and the RHS-Unique Join’s output j.out_rhs then becomes equivalent to a Gather through those positions at the foreign-key values; the output j.out_lhs is trivial.

Required preprocessing: determine the maximum value of each primary key column.
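A minimal CUDA sketch of the rewritten join (our illustration of the idea; kernel and variable names are ours): with key values bounded by the pre-calculated maximum, a positions table replaces any actual join algorithm:

    #include <cuda_runtime.h>

    // ScatterIndices: for the primary-key column T2.c, record at positions[key]
    // the row position holding that key. `positions` is sized from the
    // pre-calculated maximum key value (+ 1).
    __global__ void scatterIndices(const int* primaryKey, int* positions, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) positions[primaryKey[i]] = i;
    }

    // The join's RHS output, j.out_rhs, is then equivalent to gathering a T2
    // payload column through `positions` at each foreign-key value in j.in_lhs.
    __global__ void gatherByForeignKey(const int* foreignKeys, const int* positions,
                                       const float* t2Column, float* out_rhs, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) out_rhs[i] = t2Column[positions[foreignKeys[i]]];
    }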

[Diagram – the SQL processing pipeline: Analytic SQL Query → Host DBMS SQL Parser & Initial Execution Plan Generator → DBMS Intermediate Representation → Execution Graph Compiler & Optimizer → AXE, with GPU kernel task parallelism, multi-GPU memory transfer, and simultaneous processing on a multi-core CPU & multiple GPUs.]

The DSL-to-execution-graph compilation component has a triple functionality:
· Compiling the host DBMS’ internal execution plan representation into graph form.
· Transforming the graph to make it acceptable input for the application-agnostic execution environment (AXE).
· Effecting execution graph optimizations, in view of the available resources reported by AXE.
The latter two are effected through repeated application of multiple graph rewriting rules.

At this time, the above is implemented only for a fragment of analytic SQL.

AXE itself provides:
· Overlapped I/O & computation
· Simultaneous I/O to multiple devices
· Parallel multi-device execution
· Graph node task parallelism
· Graph node data parallelism

Our AXE execution environment uses its Device Abstraction Layer (generic operations) to schedule execution on multiple devices.

Some challenges for future work – questions to ponder:
- Why do we choose certain nodes for a subgraph?
- Why choose only some devices for subgraph execution?
- Where do splitter/joiner nodes get run?

We have presented results from a first iteration of development. We plan to continue developing the framework, both in connection with the SQL-processing use case and independently of it, adding several features which enhance usability as well as performance:

· Increase SQL coverage & support other DSLs.
· Support more processor types, vendors & platforms (APUs, OpenCL-based devices).
· GPU memory management.
· Automatic fusion of consecutive operations.
· Improve I/O and computation overlap.
· Fully utilize CPU vectorization capabilities, to explore more optimal mixtures of execution on the CPU and on the GPU.
· Et cetera.

As our framework matures, we hope to secure approval for its release as Free Software.