SQL on Hadoop benchmarks using TPC-DS query set: Hive LLAP & Kognitio ... We deployed the EDW-Analytics option in HD
Technical Information Sheet
SQL on Apache Hadoop benchmarks Apache Hive LLAP and Kognitio 8.2 This document supports the “SQL on Apache Hadoop benchmarks – Apache Hive LLAP and Kognitio 8.2” whitepaper. 1 It contains the following information: • • •
Benchmark Architecture – details of the 9 node AWS system used in all the benchmarks Benchmark Deployment – Overview and schematic of the benchmarks for each platform: Hive LLAP and Kognitio Individual Query Timings – The query timings for each of the 99 TPC-DS queries on each of the platforms
Benchmark Architecture The 2 benchmarks: Apache Hive LLAP and Kognitio were executed on the same 9 node system. The hardware utilised were standard AWS instances. The infrastructure was deployed using Hortonworks Data Cloud available on Amazon Marketplace. This allows you to select a Hortonworks deployment from a list of options. Details of how to deploy HDC can be found at: https://hortonworks.github.io/hdp-aws/index.html#get-started For the benchmarks one m3.2xlarge was deployed as the edge node along with eight r3.8xlarge data nodes. Each data node has the following specifications: • • •
640GB available disk 244GB RAM. 32 cores
We deployed the EDW-Analytics option in HDC with Apache Hive 2 LLAP automatically deployed so that we did not have to do any set-up or configuration.
Benchmark Deployment Each of the benchmarks was run on the system with the other system stopped. This allowed the platform to utilise all of the available resources available to it during the benchmark. In all cases the 1TB TPC-DS data set was generated using the data generator (dsdgen) provided as part of the TPC-DS benchmarking tool suite. In all cases the TPC-DS query generation tool (dsqgen) was utilised to generate the queries. This tool generates a script for each query stream that randomises the order of the 99 queries in each script. The tool is also designed to insert randomised values for parameters in each of the queries. This ensures the benchmark is a truly mixed workload. For more details on how this tool works see http://www.tpc.org/tpcds/. Small syntax changes were done such as adding aliases for derived tables, renaming columns, renaming group by and sort by columns and editing when reserved words used but query rewriting was not allowed.
You can download the whitepaper from: https://kognitio.com/resources/whitepapers/hive-llap-kognitio-benchmarking-usingtpc-ds-query-set/ 1
Published August 2017
1
SQL on Hadoop benchmarks using TPC-DS query set: Hive LLAP & Kognitio
Kognitio
Notes: • • • • • •
Kognitio version 8.2.0-rel20170616 was used This is the current version available for download at http://kognitio.com/on-hadoop/ . Kognitio is a standard YARN application deployed from the edge node. Data was held in Kognitio RAM view images. The larger data sets were hashed on the columns most commonly used in the joins. These reside within the Kognitio YARN containers and can be utilised by multiple queries. Kognitio statistics were collected on all views. Queries were submitted from the edge node using the Kognitio command line tool wxsubmit for each of the randomised query streams in the benchmark. Each query is executed within all containers in the remaining RAM available (not utilised by view images)
Published August 2017
2
SQL on Hadoop benchmarks using TPC-DS query set: Hive LLAP & Kognitio
Apache Hive LLAP Notes: • •
• • •
Apache Hive2 LLAP was deployed with Hive Version 1.2.1. This was shipped as part of HDP2.6 (cloud) used in the Hortonworks Data Cloud deployment Hive LLAP was set-up and configured automatically when selecting the EDW-Analytics option. The only changes made to configuration were to allow 10 concurrent queries. This was done in Ambari and all recommended changes to the underlying configuration resulting from this change were accepted. Hive was restarted Data was held in Hive ORC formatted files. The tables were partitioned on date columns were applicable and larger tables were also bucketed on the columns most commonly used in joins. Analyze table (and columns) was executed prior to benchmarking allowing Hive LLAP to build up a cache. Queries were submitted from the edge node using the beeline command line tool for each randomly generated query script for each query stream of the benchmark. Beeline utilised the JDBC connection to HiveServer2.
Published August 2017
3
SQL on Hadoop benchmarks using TPC-DS query set: Hive LLAP & Kognitio
Individual Query Times for single stream @ 1TB Query Number
Execution Time (S) Kognitio
Query Number
LLAP
Execution Time (S) Kognitio
LLAP
1
2.5
14.4
26
1.9
4.0
2
9.6
32.9
27
2.3
11.4
3
1.0
2.8
28
6.5
16.2
4
122.4
117.3
29
1.7
5
3.2
14.9
30
9.0
12.3
6
9.8
11.1
31
31.0
21.6
7
2.8
6.0
32
1.2
4.5
8
3.8
6.8
33
11.4
8.4
9
9.1
34
4.0
8.1
10
5.2
10.2
35
17.6
29.3
11
75.1
72.3
36
2.5
54.1
12
2.7
3.9
37
0.8
4.0
13
10.2
5.1
38
14.6
40.4
14
54.0
218.1
39
6.7
23.6
15
3.0
8.4
40
1.0
9.8
16
10.6
50.4
41
0.6
17
2.0
15.4
42
0.7
2.0
18
6.8
15.9
43
2.1
5.5
19
4.1
10.4
44
1.7
16.5
20
2.7
4.0
45
2.7
24.5
21
0.7
2.0
46
4.2
9.5
22
2.6
190.3
47
5.9
23
173.3
697.1
48
7.6
24
45.4
1,023.7
49
1.5
37.0
25
1.5
15.0
50
2.5
20.9
Published August 2017
Sub Query Error
4
Run failure out of memory
Sub Query Error
SELECT * issue in complex SQL 4.9
SQL on Hadoop benchmarks using TPC-DS query set: Hive LLAP & Kognitio
Individual Query Times for single stream @ 1TB (continued) Query Number
Execution Time (S) Kognitio
Query Number
LLAP
Execution Time (S) Kognitio
LLAP
51
10.7
42.5
76
3.5
52
0.7
2.3
77
3.3
53
1.2
5.0
78
80.6
54
15.8
65.3
79
5.7
13.6
55
0.7
2.2
80
3.3
25.6
56
11.3
8.0
81
10.5
13.5
57
4.3
82
1.5
7.6
58
7.1
6.5
83
3.1
6.0
59
9.0
44.4
84
4.2
4.3
60
8.3
8.8
85
8.2
12.2
61
8.1
7.9
86
1.3
19.3
62
1.1
6.9
87
15.4
43.5
63
1.2
3.2
88
6.4
15.9
64
9.4
82.8
89
3.1
4.1
65
4.5
123.8
90
0.6
5.1
66
1.5
7.7
91
5.2
4.2
67
217.4
989.3
92
0.7
3.9
68
3.5
9.0
93
1.5
17.7
69
6.3
9.9
94
6.2
27.6
70
4.7
75.8
95
73.4
48.7
71
4.7
10.5
96
0.7
No COUNT(*) without GROUP BY
72
5.5
97.5
97
10.6
109.8
73
1.9
4.9
98
4.6
5.9
74
24.4
60.7
99
1.7
10.4
75
15.6
76.8
Queries Run
99
92
Fastest Query Count
88
11
Published August 2017
SELECT * issue in complex SQL
5
11.7 11.3 Run failure out of memory
SQL on Hadoop benchmarks using TPC-DS query set: Hive LLAP & Kognitio
Average Query Times for 10 streams @ 1 TB Query Number
Execution Time (S) Kognitio
Query Number
Execution Time (S) Kognitio
LLAP
LLAP
1
26.6
107.4
26
26.4
73.7
2
83.2
150.9
27
21.3
134.6
3
12.7
66.3
28
40.9
107.0
4
460.0
726.5
29
24.3
5
41.2
125.5
30
159.3
63.2
6
275.6
75.5
31
127.1
98.3
7
40.3
79.1
32
14.1
68.5
8
30.1
105.7
33
81.5
89.6
9
57.0
34
25.6
85.4
10
43.9
106.9
35
293.8
175.5
11
309.8
380.8
36
31.5
132.6
12
46.6
61.1
37
13.2
56.9
13
109.8
57.8
38
100.8
251.7
14
747.3
675.1
39
53.2
111.7
15
22.6
110.1
40
10.8
139.4
16
66.6
238.7
41
10.0
17
59.2
149.9
42
10.1
40.3
18
79.5
104.3
43
15.6
87.7
19
34.5
63.6
44
11.6
110.5
20
45.7
56.6
45
53.7
97.9
21
9.2
39.0
46
32.9
97.3
22
20.0
377.1
47
94.9
23
1,216.6
Long Running
48
144.6
68.8
24
469.4
Long Running
49
20.5
170.6
25
66.4
50
17.2
144.4
Published August 2017
Sub Query Error
127.8
6
Run failure out of memory
Sub Query Error
SELECT * issue in complex SQL
SQL on Hadoop benchmarks using TPC-DS query set: Hive LLAP & Kognitio
Average Query Times for 10 streams @ 1 TB (continued) Query Number
Execution Time (S) Kognitio
Query Number
LLAP
Execution Time (S) Kognitio
LLAP
51
131.6
182.9
76
34.3
52
15.3
38.9
77
22.1
53
12.5
66.8
78
804.9
54
67.6
160.5
79
27.4
91.3
55
8.9
36.8
80
31.2
211.5
56
82.2
57.0
81
155.4
93.0
57
104.6
82
23.2
53.8
58
42.6
71.8
83
41.7
98.2
59
58.8
156.3
84
35.8
71.0
60
93.5
59.1
85
179.3
146.6
61
44.4
68.8
86
20.1
108.1
62
11.8
70.7
87
109.6
275.4
63
16.9
65.2
88
68.3
68.2
64
96.1
452.2
89
27.7
93.6
65
29.8
230.1
90
7.4
52.7
66
16.8
97.9
91
51.8
32.5
92
15.0
52.7
93
7.4
107.9
67
Long Running
SELECT * issue in complex SQL
Long Running
96.9 74.5 Run failure out of memory
68
29.9
69
61.6
76.0
94
51.2
140.9
70
25.1
219.8
95
329.6
310.6
71
47.7
58.1
96
7.1
No COUNT(*) without GROUP BY
72
90.0
353.9
97
85.4
263.3
73
27.6
70.1
98
53.1
49.9
74
135.1
369.0
99
12.9
72.1
75
133.7
320.4
Queries Run
98
89
Fastest Query Count
83
15
Published August 2017
79.4
7
SQL on Hadoop benchmarks using TPC-DS query set: Hive LLAP & Kognitio