
Technical Information Sheet

SQL on Apache Hadoop benchmarks: Apache Hive LLAP and Kognitio 8.2

This document supports the “SQL on Apache Hadoop benchmarks – Apache Hive LLAP and Kognitio 8.2” whitepaper. [1] It contains the following information:

• Benchmark Architecture – details of the 9-node AWS system used in all the benchmarks
• Benchmark Deployment – overview and schematic of the benchmarks for each platform: Hive LLAP and Kognitio
• Individual Query Timings – the query timings for each of the 99 TPC-DS queries on each of the platforms

Benchmark Architecture

The two benchmarks, Apache Hive LLAP and Kognitio, were executed on the same 9-node system. The hardware consisted of standard AWS instances. The infrastructure was deployed using Hortonworks Data Cloud (HDC), available on Amazon Marketplace, which lets you select a Hortonworks deployment from a list of options. Details of how to deploy HDC can be found at: https://hortonworks.github.io/hdp-aws/index.html#get-started

For the benchmarks, one m3.2xlarge instance was deployed as the edge node along with eight r3.8xlarge data nodes. Each data node has the following specifications:

• 640GB available disk
• 244GB RAM
• 32 cores

We deployed the EDW-Analytics option in HDC, which deploys Apache Hive 2 LLAP automatically, so that we did not have to do any manual set-up or configuration.
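As a rough illustration of the capacity available to each platform, the per-node figures above can be multiplied out across the eight data nodes. This is simple arithmetic on the listed specifications, not a figure quoted by the whitepaper; the edge node is left out, and the function name is only illustrative.

```python
# Per-data-node figures taken from the specification list above.
DATA_NODES = 8
DISK_GB_PER_NODE = 640
RAM_GB_PER_NODE = 244
CORES_PER_NODE = 32


def cluster_totals(nodes, disk_gb, ram_gb, cores):
    """Multiply per-node resources by the number of data nodes."""
    return {
        "disk_gb": nodes * disk_gb,
        "ram_gb": nodes * ram_gb,
        "cores": nodes * cores,
    }


totals = cluster_totals(DATA_NODES, DISK_GB_PER_NODE, RAM_GB_PER_NODE, CORES_PER_NODE)
print(totals)
```

Not all of this is usable by the SQL engines, since the operating system and Hadoop services also consume memory and cores.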

Benchmark Deployment

Each benchmark was run with the other platform stopped, so that the platform under test could use all of the resources available on the system. In all cases the 1TB TPC-DS data set was generated using the data generator (dsdgen) provided as part of the TPC-DS benchmarking tool suite, and the queries were generated using the TPC-DS query generation tool (dsqgen). This tool generates a script for each query stream that randomises the order of the 99 queries in each script. It is also designed to insert randomised values for the parameters in each of the queries. This ensures the benchmark is a truly mixed workload. For more details on how this tool works see http://www.tpc.org/tpcds/.

Small syntax changes were made, such as adding aliases for derived tables, renaming columns, renaming group by and sort by columns, and editing queries where reserved words were used, but query rewriting was not allowed.
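The idea of one randomised query order per stream can be sketched as follows. This is not dsqgen's actual algorithm, only a minimal illustration of the concept: the seed value and the function name are assumptions, and parameter substitution is not modelled.

```python
import random

NUM_QUERIES = 99   # TPC-DS query templates
NUM_STREAMS = 10   # concurrent streams used in the multi-stream benchmark


def make_stream_order(stream_id, base_seed=2017):
    """Return a reproducible random permutation of query numbers 1..99.

    base_seed is a hypothetical value; seeding per stream makes each
    stream's order repeatable between runs.
    """
    rng = random.Random(base_seed + stream_id)
    order = list(range(1, NUM_QUERIES + 1))
    rng.shuffle(order)
    return order


streams = [make_stream_order(i) for i in range(NUM_STREAMS)]
print(streams[0][:5])  # first five query numbers of stream 0
```

Every stream still contains each of the 99 queries exactly once; only the order differs, which is what produces a mixed workload when the streams run concurrently.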

[1] You can download the whitepaper from: https://kognitio.com/resources/whitepapers/hive-llap-kognitio-benchmarking-usingtpc-ds-query-set/

Published August 2017


SQL on Hadoop benchmarks using TPC-DS query set: Hive LLAP & Kognitio

Kognitio

Notes:

• Kognitio version 8.2.0-rel20170616 was used. This is the current version available for download at http://kognitio.com/on-hadoop/.
• Kognitio is a standard YARN application deployed from the edge node.
• Data was held in Kognitio RAM view images. The larger data sets were hashed on the columns most commonly used in the joins. These reside within the Kognitio YARN containers and can be utilised by multiple queries.
• Kognitio statistics were collected on all views.
• Queries were submitted from the edge node using the Kognitio command line tool wxsubmit for each of the randomised query streams in the benchmark.
• Each query is executed within all containers in the remaining RAM available (not utilised by view images).




Apache Hive LLAP

Notes:

• Apache Hive2 LLAP was deployed with Hive Version 1.2.1. This was shipped as part of HDP2.6 (cloud) used in the Hortonworks Data Cloud deployment.
• Hive LLAP was set up and configured automatically when selecting the EDW-Analytics option. The only changes made to configuration were to allow 10 concurrent queries. This was done in Ambari and all recommended changes to the underlying configuration resulting from this change were accepted. Hive was restarted.
• Data was held in Hive ORC formatted files. The tables were partitioned on date columns where applicable and larger tables were also bucketed on the columns most commonly used in joins.
• Analyze table (and columns) was executed prior to benchmarking, allowing Hive LLAP to build up a cache.
• Queries were submitted from the edge node using the beeline command line tool for each randomly generated query script for each query stream of the benchmark. Beeline utilised the JDBC connection to HiveServer2.
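A minimal sketch of how one query script per stream might be submitted through beeline and timed is shown below. It relies only on beeline's `-u` (JDBC URL) and `-f` (script file) options; the host name, port, database name and script file names are hypothetical placeholders, and the source does not describe the actual submission scripts.

```python
import subprocess
import time


def beeline_command(jdbc_url, script_path):
    """Build the argument list that runs one query script through beeline."""
    return ["beeline", "-u", jdbc_url, "-f", script_path]


def run_stream(jdbc_url, script_path):
    """Run one query script and return (exit code, elapsed seconds)."""
    start = time.perf_counter()
    result = subprocess.run(beeline_command(jdbc_url, script_path))
    return result.returncode, time.perf_counter() - start


# Hypothetical connection string and script names, for illustration only.
JDBC_URL = "jdbc:hive2://edge-node.example.com:10000/tpcds_1tb"
commands = [beeline_command(JDBC_URL, f"stream_{i}.sql") for i in range(10)]
print(commands[0])
```

Calling `run_stream` requires beeline on the PATH and a reachable HiveServer2. For the 10-stream test the scripts would have to be started concurrently rather than one after another, for example with one process per stream.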




Individual Query Times for single stream @ 1TB

Execution Time (s)

Query Number | Kognitio (s) | LLAP (s)        | Query Number | Kognitio (s) | LLAP (s)
1            | 2.5          | 14.4            | 26           | 1.9          | 4.0
2            | 9.6          | 32.9            | 27           | 2.3          | 11.4
3            | 1.0          | 2.8             | 28           | 6.5          | 16.2
4            | 122.4        | 117.3           | 29           | 1.7          | Run failure out of memory
5            | 3.2          | 14.9            | 30           | 9.0          | 12.3
6            | 9.8          | 11.1            | 31           | 31.0         | 21.6
7            | 2.8          | 6.0             | 32           | 1.2          | 4.5
8            | 3.8          | 6.8             | 33           | 11.4         | 8.4
9            | 9.1          | Sub Query Error | 34           | 4.0          | 8.1
10           | 5.2          | 10.2            | 35           | 17.6         | 29.3
11           | 75.1         | 72.3            | 36           | 2.5          | 54.1
12           | 2.7          | 3.9             | 37           | 0.8          | 4.0
13           | 10.2         | 5.1             | 38           | 14.6         | 40.4
14           | 54.0         | 218.1           | 39           | 6.7          | 23.6
15           | 3.0          | 8.4             | 40           | 1.0          | 9.8
16           | 10.6         | 50.4            | 41           | 0.6          | Sub Query Error
17           | 2.0          | 15.4            | 42           | 0.7          | 2.0
18           | 6.8          | 15.9            | 43           | 2.1          | 5.5
19           | 4.1          | 10.4            | 44           | 1.7          | 16.5
20           | 2.7          | 4.0             | 45           | 2.7          | 24.5
21           | 0.7          | 2.0             | 46           | 4.2          | 9.5
22           | 2.6          | 190.3           | 47           | 5.9          | SELECT * issue in complex SQL
23           | 173.3        | 697.1           | 48           | 7.6          | 4.9
24           | 45.4         | 1,023.7         | 49           | 1.5          | 37.0
25           | 1.5          | 15.0            | 50           | 2.5          | 20.9


Individual Query Times for single stream @ 1TB (continued)

Query Number | Kognitio (s) | LLAP (s)                      | Query Number | Kognitio (s) | LLAP (s)
51           | 10.7         | 42.5                          | 76           | 3.5          | 11.7
52           | 0.7          | 2.3                           | 77           | 3.3          | 11.3
53           | 1.2          | 5.0                           | 78           | 80.6         | Run failure out of memory
54           | 15.8         | 65.3                          | 79           | 5.7          | 13.6
55           | 0.7          | 2.2                           | 80           | 3.3          | 25.6
56           | 11.3         | 8.0                           | 81           | 10.5         | 13.5
57           | 4.3          | SELECT * issue in complex SQL | 82           | 1.5          | 7.6
58           | 7.1          | 6.5                           | 83           | 3.1          | 6.0
59           | 9.0          | 44.4                          | 84           | 4.2          | 4.3
60           | 8.3          | 8.8                           | 85           | 8.2          | 12.2
61           | 8.1          | 7.9                           | 86           | 1.3          | 19.3
62           | 1.1          | 6.9                           | 87           | 15.4         | 43.5
63           | 1.2          | 3.2                           | 88           | 6.4          | 15.9
64           | 9.4          | 82.8                          | 89           | 3.1          | 4.1
65           | 4.5          | 123.8                         | 90           | 0.6          | 5.1
66           | 1.5          | 7.7                           | 91           | 5.2          | 4.2
67           | 217.4        | 989.3                         | 92           | 0.7          | 3.9
68           | 3.5          | 9.0                           | 93           | 1.5          | 17.7
69           | 6.3          | 9.9                           | 94           | 6.2          | 27.6
70           | 4.7          | 75.8                          | 95           | 73.4         | 48.7
71           | 4.7          | 10.5                          | 96           | 0.7          | No COUNT(*) without GROUP BY
72           | 5.5          | 97.5                          | 97           | 10.6         | 109.8
73           | 1.9          | 4.9                           | 98           | 4.6          | 5.9
74           | 24.4         | 60.7                          | 99           | 1.7          | 10.4
75           | 15.6         | 76.8                          |              |              |

Queries Run: Kognitio 99, LLAP 92
Fastest Query Count: Kognitio 88, LLAP 11
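The "Queries Run" and "Fastest Query Count" summary rows can be derived from the per-query cells. The sketch below shows one way to do that on five sample rows copied from the table. It assumes that a query which fails on one platform counts as fastest for the platform that completed it; that rule is an assumption here, although it is consistent with the reported totals. The function names are illustrative.

```python
def parse_seconds(cell):
    """Return the time in seconds, or None when the cell holds a failure label."""
    try:
        return float(cell.replace(",", ""))  # handles values such as "1,023.7"
    except ValueError:
        return None


def summarise(rows):
    """Count completed queries and fastest queries per platform."""
    run = {"Kognitio": 0, "LLAP": 0}
    fastest = {"Kognitio": 0, "LLAP": 0}
    for _query, kognitio_cell, llap_cell in rows:
        times = {"Kognitio": parse_seconds(kognitio_cell),
                 "LLAP": parse_seconds(llap_cell)}
        completed = {name: t for name, t in times.items() if t is not None}
        for name in completed:
            run[name] += 1
        if completed:
            # On an exact tie, min() keeps the first entry (Kognitio).
            fastest[min(completed, key=completed.get)] += 1
    return run, fastest


# Sample rows (query number, Kognitio cell, LLAP cell) from the single-stream table.
sample = [
    (1, "2.5", "14.4"),
    (4, "122.4", "117.3"),
    (13, "10.2", "5.1"),
    (24, "45.4", "1,023.7"),
    (96, "0.7", "No COUNT(*) without GROUP BY"),
]
run, fastest = summarise(sample)
print(run, fastest)
```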


Average Query Times for 10 streams @ 1 TB

Query Number | Kognitio (s) | LLAP (s)        | Query Number | Kognitio (s) | LLAP (s)
1            | 26.6         | 107.4           | 26           | 26.4         | 73.7
2            | 83.2         | 150.9           | 27           | 21.3         | 134.6
3            | 12.7         | 66.3            | 28           | 40.9         | 107.0
4            | 460.0        | 726.5           | 29           | 24.3         | Run failure out of memory
5            | 41.2         | 125.5           | 30           | 159.3        | 63.2
6            | 275.6        | 75.5            | 31           | 127.1        | 98.3
7            | 40.3         | 79.1            | 32           | 14.1         | 68.5
8            | 30.1         | 105.7           | 33           | 81.5         | 89.6
9            | 57.0         | Sub Query Error | 34           | 25.6         | 85.4
10           | 43.9         | 106.9           | 35           | 293.8        | 175.5
11           | 309.8        | 380.8           | 36           | 31.5         | 132.6
12           | 46.6         | 61.1            | 37           | 13.2         | 56.9
13           | 109.8        | 57.8            | 38           | 100.8        | 251.7
14           | 747.3        | 675.1           | 39           | 53.2         | 111.7
15           | 22.6         | 110.1           | 40           | 10.8         | 139.4
16           | 66.6         | 238.7           | 41           | 10.0         | Sub Query Error
17           | 59.2         | 149.9           | 42           | 10.1         | 40.3
18           | 79.5         | 104.3           | 43           | 15.6         | 87.7
19           | 34.5         | 63.6            | 44           | 11.6         | 110.5
20           | 45.7         | 56.6            | 45           | 53.7         | 97.9
21           | 9.2          | 39.0            | 46           | 32.9         | 97.3
22           | 20.0         | 377.1           | 47           | 94.9         |
23           | 1,216.6      | Long Running    | 48           | 144.6        | 68.8
24           | 469.4        | Long Running    | 49           | 20.5         | 170.6
25           | 66.4         | 127.8           | 50           | 17.2         | 144.4


Average Query Times for 10 streams @ 1 TB (continued)

Query Number | Kognitio (s) | LLAP (s)     | Query Number | Kognitio (s) | LLAP (s)
51           | 131.6        | 182.9        | 76           | 34.3         | 96.9
52           | 15.3         | 38.9         | 77           | 22.1         | 74.5
53           | 12.5         | 66.8         | 78           | 804.9        | Run failure out of memory
54           | 67.6         | 160.5        | 79           | 27.4         | 91.3
55           | 8.9          | 36.8         | 80           | 31.2         | 211.5
56           | 82.2         | 57.0         | 81           | 155.4        | 93.0
57           | 104.6        |              | 82           | 23.2         | 53.8
58           | 42.6         | 71.8         | 83           | 41.7         | 98.2
59           | 58.8         | 156.3        | 84           | 35.8         | 71.0
60           | 93.5         | 59.1         | 85           | 179.3        | 146.6
61           | 44.4         | 68.8         | 86           | 20.1         | 108.1
62           | 11.8         | 70.7         | 87           | 109.6        | 275.4
63           | 16.9         | 65.2         | 88           | 68.3         | 68.2
64           | 96.1         | 452.2        | 89           | 27.7         | 93.6
65           | 29.8         | 230.1        | 90           | 7.4          | 52.7
66           | 16.8         | 97.9         | 91           | 51.8         | 32.5
67           | Long Running | Long Running | 92           | 15.0         | 52.7
68           | 29.9         | 79.4         | 93           | 7.4          | 107.9
69           | 61.6         | 76.0         | 94           | 51.2         | 140.9
70           | 25.1         | 219.8        | 95           | 329.6        | 310.6
71           | 47.7         | 58.1         | 96           | 7.1          | No COUNT(*) without GROUP BY
72           | 90.0         | 353.9        | 97           | 85.4         | 263.3
73           | 27.6         | 70.1         | 98           | 53.1         | 49.9
74           | 135.1        | 369.0        | 99           | 12.9         | 72.1
75           | 133.7        | 320.4        |              |              |

Queries Run: Kognitio 98, LLAP 89
Fastest Query Count: Kognitio 83, LLAP 15
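One way to relate the two sets of timings is the ratio of a query's 10-stream average time to its single-stream time. The sketch below computes it for query 1 using the values from the tables. The ratio is a derived illustration and not a metric reported by the benchmark; the function name is hypothetical.

```python
# (single-stream seconds, 10-stream average seconds) for query 1, copied from the tables.
QUERY_1 = {
    "Kognitio": (2.5, 26.6),
    "LLAP": (14.4, 107.4),
}


def concurrency_ratio(single_s, ten_stream_avg_s):
    """How many times longer the query takes, on average, under 10 concurrent streams."""
    return ten_stream_avg_s / single_s


ratios = {name: round(concurrency_ratio(*timings), 2)
          for name, timings in QUERY_1.items()}
print(ratios)
```

A ratio close to 10 means the query takes roughly ten times longer when ten streams share the system; a lower ratio means it slows down less than proportionally under concurrency.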