High-Speed Query Processing over High-Speed Networks

Wolf Rödiger        Tobias Mühlbauer        Alfons Kemper        Thomas Neumann
TU München, Munich, Germany
[email protected]

arXiv:1502.07169v1 [cs.DB] 25 Feb 2015
ABSTRACT
Modern memory-optimized clusters entail two levels of networks: connecting CPUs and NUMA regions inside a single server via QPI in the small and multiple servers via Ethernet or Infiniband in the large. A distributed analytical database system has to optimize both levels holistically to enable high-speed query processing over high-speed networks. Previous work instead focussed on slow interconnects and CPU-intensive algorithms that speed up query processing by reducing network traffic as much as possible. Tuning distributed query processing for the network in the small was not an issue due to the large gap between network throughput and CPU speed. A cluster of machines connected via standard Gigabit Ethernet actually performs worse than a single many-core server. The increased main-memory capacity in the cluster remains the sole benefit of such a scale-out. The economic viability of Infiniband and other high-speed interconnects has changed the game. This paper presents HyPer’s distributed query engine that is carefully tailored for both the network in the small and in the large. It facilitates NUMA-aware message buffer management for fast local processing, remote direct memory access (RDMA) for high-throughput communication, and asynchronous operations to overlap computation and communication. An extensive evaluation using the TPC-H benchmark shows that this holistic approach not only allows larger inputs due to the increased main-memory capacity of the cluster, but also speeds up query execution compared to a single server.
Figure 1: Two levels of networks in a cluster: connecting CPUs in the small and servers in the large
1. INTRODUCTION

Main-memory database systems have gained increasing interest in academia and industry over the last years. The success of academic projects, including MonetDB [26] and HyPer [20], has led to the development of commercial main-memory database systems such as Vectorwise, SAP HANA, Oracle Exalytics, IBM DB2 BLU, and Microsoft Apollo. This development is driven by a significant change in the hardware landscape: Today's commodity servers are many-core machines with main-memory capacities of several terabytes. The advent of these brawny servers enables unprecedented single-node query performance. Moreover, a small cluster of these servers is sufficient even for large companies to analyze their business. For example, Walmart—the world's largest company by revenue—uses a cluster of only 16 nodes with 64 terabytes of main memory [30].

Such memory-optimized clusters entail two levels of networks as highlighted in Figure 1: The network in the small connects CPUs inside a single server via a high-speed QPI interconnect. The decentralization of memory controllers partitions memory into separate regions, one per CPU. Main-memory database systems have to adapt to this non-uniform memory architecture (NUMA) to avoid severe performance degradations caused by remote memory accesses [25, 24]. The network in the large connects separate servers in a cluster. In the past, limited network bandwidth used to significantly decrease overall query performance when scaling a database system out to a cluster. Consequently, previous research has focused on techniques that reduce network traffic in distributed query processing.
Now that high-speed networks such as Infiniband are economically viable, these CPU-intensive techniques have become prohibitively expensive. Instead, it is important to optimize the network in the small for the network in the large. This work describes a blueprint for designing distributed query engines that take NUMA characteristics into account and expose them to the local query execution engine in order to enable high-speed query processing over high-speed networks.

This paper presents the scale-out of our high-performance main-memory database system HyPer, which already scales almost linearly in the number of cores on a single node. Applying our holistic approach to the scale-out can indeed achieve a speed-up as shown in Figure 2 for a TPC-H scale factor 100 data set. The experiment increases the number of nodes while keeping the input size fixed. Six servers achieve a speed-up of 3.4× compared to single-node HyPer. In this experiment relations are not partitioned by primary key so that data redistribution is required for joins and aggregations. Figure 2 also shows that with standard Gigabit Ethernet (GbE) query performance degrades significantly.

Figure 2: A scale-out with RDMA speeds up query execution compared to single-node HyPer (SF 100)

In particular, this paper makes the following contributions:

1. A performance analysis of high-speed cluster interconnects. Specifically, we study how to optimize TCP- and RDMA-based communication for bulk transfers that are typical for analytical database workloads. The insights are gained from experiments run on Infiniband hardware using RDMA and IP over Infiniband but also apply to 10 Gbit and 100 Gbit Ethernet hardware.
2. Based on these findings, the implementation of a distributed query engine for our high-performance main-memory database system HyPer, which uses a Volcano-style exchange operator to encapsulate cross-node parallelization. HyPer's exchange operator is carefully optimized for both the network in the small and in the large, exposing NUMA characteristics to the local query engine. As we will show, this is essential to achieve a speed-up from the scale-out.
3. Several optimizations for distributed query plans that use exchange operators, including broadcast for distributed joins, and pre-aggregations for groupings.
4. Finally, a comprehensive performance evaluation that compares HyPer to several state-of-the-art distributed SQL systems: Apache Hive, Cloudera Impala, and Vectorwise Vortex. HyPer improves performance by 475× compared to Hive, 106× to Impala, and 3× to Vortex in the TPC-H analytical benchmark.

2. A DISTRIBUTED QUERY EXECUTION ENGINE FOR HYPER
HyPer’s NUMA-aware parallel query execution engine [24] scales query processing almost linearly to many cores inside a single server. Compared to classical Volcano-style (tuple-at-a-time) and vectorized (e.g., Vectorwise) query execution engines, HyPer’s data-centric code generation [29] interleaves operators that do not require intermediate materialization and compiles them into a single execution pipeline to maximize locality. The input data—coming from either a pipeline breaker or a base relation—is split into work units of constant size, called morsels1 . The dispatcher dynamically assigns morsels to worker threads, which push the tuples of their morsels all the way through the execution pipeline until the next pipeline breaker is reached. This keeps tuples in registers and low-level caches for as long as possible. Worker threads are created at startup to eliminate the cost of creating and destroying threads during query execution. HyPer creates one thread per hardware context to prevent oversubscription. These threads are pinned to cores to avoid the eviction of tuples from registers and low-level caches, which would otherwise lead to severe performance degradations. All database operators are parallelized such that all workers can process the same pipeline job in parallel, with each thread working on its own morsel. Workers share hash-tables for joins using lock-free and lightweight synchronization primitives. They process NUMA-local morsels to avoid expensive remote memory accesses, while load balancing is achieved via work stealing. This paper presents the scale-out of HyPer’s parallel query execution engine to a cluster of servers. We do not cover fault tolerance and workload management, which are orthogonal topics. Instead, we focus on the distributed query processing engine where our goal is to increase the mainmemory capacity as well as the query performance compared to a single server. There are several options to extend a database system from single-machine to distributed processing, ranging from highly invasive, i.e., making the relational operators aware of distributed processing, to low invasive, i.e., keeping operators oblivious of distributed processing. We decided to pursue the latter option via a two-level design: Locally, we use HyPer’s morsel-driven parallel query execution engine to parallelize pipelines. Globally, we introduce a Volcano-style exchange operator for data redistribution between servers during distributed query execution. All the relational operators including join and aggregation are left unaware of the distributed setting. This reduces the complexity of the distributed query execution engine and still achieves excellent performance (cf., Section 6). Figures 3(a) and 3(b) illustrate this straightforward process for TPC-H query 17: The single-node query plan, which was unnested by the query optimizer, is augmented with exchange operators where required. An exchange operator redistributes the tuples across machines so that joins and aggregations can be processed locally. Partitioning according to join and aggregation keys guarantees the correct result. The exchange operator was introduced by Graefe for the Volcano database system [18]. It is a landmark idea as it allows to encapsulate parallelism inside an operator. All other relational operators are kept oblivious to parallel execution.
1 A morsel encompasses 100,000 tuples, which is sufficient for an almost perfect degree of parallelism and load balancing.
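The following sketch illustrates the morsel-driven work loop described above. It is a simplified illustration, not HyPer's actual code: the dispatcher is reduced to an atomic counter and the relation, pipeline, and morsel handling are placeholder types.

#include <algorithm>
#include <atomic>
#include <cstddef>

constexpr size_t kMorselSize = 100000;   // tuples per morsel (cf. footnote 1)

struct Relation { size_t numTuples; /* columnar attribute data ... */ };

void worker(const Relation& rel, std::atomic<size_t>& nextTuple,
            void (*pipeline)(const Relation&, size_t begin, size_t end)) {
    for (;;) {
        // HyPer's dispatcher additionally prefers NUMA-local morsels and
        // falls back to work stealing; here it is just a shared counter.
        size_t begin = nextTuple.fetch_add(kMorselSize);
        if (begin >= rel.numTuples) break;
        size_t end = std::min(begin + kMorselSize, rel.numTuples);
        pipeline(rel, begin, end);   // push tuples through the whole pipeline
    }
}

Each worker thread is created at startup and pinned to a hardware context; all workers execute the same pipeline function on different morsels.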
Figure 3: Query plans for TPC-H query 17: (a) for local execution, (b) for distributed execution introducing exchange operators where required, and (c) optimized with broadcast and pre-aggregation where beneficial

The exchange operator was originally used to introduce parallelism both inside a single node and between nodes in a distributed setting. However, for modern hardware its materialization overhead and inflexible design lead to scalability issues when used for single-node parallelization [24]. E.g., the exchange operator fixes the degree of parallelism in the query plan, which makes it inflexible for dealing with load imbalances. The morsel-driven parallelization instead allows to dynamically change the degree of parallelism at runtime. Nevertheless, the exchange operator is still a perfect fit for distributed processing. In the distributed setting, the inflexibility of the exchange operator is not a problem as the data is fragmented across all nodes. Further, network communication is a pipeline breaker, hence tuples have to be materialized in any case. Due to the increased complexity of modern systems, the exchange operator has to be carefully tuned for both the network in the small and in the large. Otherwise, a scale-out will actually reduce the query performance compared to a single node. We integrated HyPer's exchange operator with the aforementioned morsel-driven query execution engine for local parallelization.

Section 3 analyzes the performance of high-speed networks for typical analytical database workloads. This includes a performance study of the two main options for the transport protocol for high-speed interconnects: TCP and RDMA. The access to the network has to be coordinated across servers to avoid switch contention due to head-of-line blocking for Ethernet and credit starvation for Infiniband. Network scheduling improves throughput by up to 40 %. Section 4 builds upon these findings and presents our design for an exchange operator that is tuned for modern hardware. It uses node-addressing instead of thread-addressing to avoid using an excessive number of connections for modern many-core systems that would otherwise lead to severe scalability problems in the communication layer. The exchange operator is further tuned for local processing and ensures NUMA-local processing of messages. Section 5 discusses several optimizations for distributed query plans with exchange operators. Two of these are illustrated in Figure 3(c): The first optimization introduces pre-aggregations to reduce the number of tuples that need to be redistributed for distributed group-by operators. The second optimization switches the redistribution method for joins from hash partitioning to broadcast when one relation is much smaller than the other.

                      GbE     4×QDR   4×FDR   4×EDR
Throughput [GB/s]     0.125   4       6.8     12.1
Latency [µs]          340     1.3     0.7     0.5
Year of introduction  1998    2007    2011    2014

Table 1: Comparison of network data link standards
3. HIGH-SPEED NETWORKS
Infiniband is a high-bandwidth and low-latency cluster interconnect. Several data rates have been specified, three of which are compared to Gigabit Ethernet (GbE) in Table 1. Infiniband QDR, FDR, and EDR offer 32×, 54×, and 97× the bandwidth of GbE and latencies as low as a microsecond. Infiniband offers the choice between two transport protocols: standard TCP using IP over Infiniband (IPoIB) and an Infiniband-specific verbs interface that implements remote direct memory access (RDMA). In the following we will analyze the performance of both transport protocols and tune them for typical analytical database workloads. This will allow us to keep HyPer’s exchange operator flexible and support both transport protocols at maximum performance.
3.1 TCP
Existing applications that use TCP or UDP for network communication can run over Infiniband using IPoIB. It is a convenient option to increase the network bandwidth for existing applications without changing their implementation. TCP was introduced in 1974 by Vint Cerf and Robert Kahn [5]. Since then, the bandwidth provided by the networking hardware has increased by several orders of magnitude.
The current socket interface relies on the fact that message data is copied between the application buffer and the socket buffer [13]. This makes it hard to scale TCP performance when the network bandwidth increases.
3.1.1 Data Direct I/O and Non-Uniform I/O Access
One of the main reasons for the high per-byte cost of TCP compared to RDMA is the additional copying from application to socket buffer and the resulting multiple trips over the memory bus. This was identified as one of the main reasons hindering TCP scalability and has been coined the “memory wall” [33]. However, this changed for modern systems when Intel introduced data direct I/O (DDIO) in 2012 for Sandy Bridge processors. DDIO reduces the number of memory trips required by TCP and similar protocols by allowing the I/O subsystem to directly access the last-level cache of the CPU for network transfers. It has no hardware dependencies and is invisible to hardware drivers and software. The difference between the classic I/O model and DDIO is highlighted in Figure 4. On the sender side, classic I/O (1) reads the data from the application buffer, (2) copies it into the socket buffer, and (3) sends it over the network, which causes (4) cache eviction and (5) a speculative read. On the receiver side, the data is (6) DMAed to the socket buffer in RAM, (7) copied into the CPU cache, (8) copied into the application buffer, and (9) written to main memory. DDIO instead targets the CPU cache directly. On the sender side, DDIO (1) reads the application data, (2) copies it into the socket buffer, and (3) sends it directly from cache. On the receiver side, the data is (4) allocated or overwritten in the cache via Write Allocate/Write Update, (5) copied into the application buffer, and (6) written to main memory. Write Allocate is restricted to 10 % of the last-level cache capacity to reduce cache pollution (2.5 MB for our servers). DDIO reduces the number of memory bus transfers from 3 to 1 compared to the classic I/O model. However, it does not avoid the copying itself and the associated CPU cost. NUMA systems add a complication: A network adapter is directly connected to one of the CPUs. Consequently, there is a difference between local and remote I/O consumption. This is called Non-Uniform I/O Access (NUIOA). NUIOA is—similar to NUMA—a direct consequence of the architecture of modern many-core systems that have multiple sockets. NUIOA systems restrict DDIO to threads running on the CPU local to the network card. We validated this by measuring the CPU performance counters in a micro-benchmark using Intel PCM: Running the network thread on the local NUMA node caused every byte to be read 1.03× on the sender side and written 1.02× on the receiver side. Running the network thread on the remote NUMA node read every byte 2.11× for the sender while on the receiver side it was read 1.5× and written 2.33× (differences are due to TCP control traffic, retransmissions, cache invalidations, and prefetching [13]). This demonstrates that DDIO was only active for the NUMA-local network thread running on the NUIOA-local CPU. Due to these findings, HyPer’s exchange operator pins the network thread to the NUIOA-local CPU to avoid additional memory bus trips and thus enables the I/O subsystem to directly target the cache.
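A minimal sketch of such pinning is shown below. It assumes libnuma is available and that the NUMA node of the Infiniband adapter can be read from sysfs; the device name is a placeholder and this is an illustration rather than HyPer's actual implementation.

#include <fstream>
#include <string>
#include <numa.h>   // link with -lnuma

// Pin the calling (network) thread to the NUMA node local to the given NIC.
void pinThreadToNicNode(const std::string& ibDevice /* e.g., "mlx4_0" */) {
    std::ifstream f("/sys/class/infiniband/" + ibDevice + "/device/numa_node");
    int node = -1;
    f >> node;
    if (node < 0 || numa_available() < 0) return;   // topology unknown

    bitmask* cpus = numa_allocate_cpumask();
    if (numa_node_to_cpus(node, cpus) == 0)
        numa_sched_setaffinity(0 /* calling thread */, cpus);
    numa_free_cpumask(cpus);

    numa_set_preferred(node);   // also allocate buffers on the NIC-local node
}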
Figure 4: Data direct I/O reduces the memory bus traffic for TCP significantly compared to classic I/O: (a) classic I/O involves three memory trips at sender/receiver, (b) data direct I/O reduces this to one memory trip each
3.1.2 Tuning TCP for Database Workloads

We designed another micro-benchmark to compare TCP and RDMA for database workloads and analyzed the impact of the various tuning options. TCP faces two quite different scenarios: small and large transfers. For small transfers processing time is dominated by kernel overheads, sockets and protocol processing. For bulk transfers data touching, i.e., checksums and copy, as well as interrupt processing accounts for most of the processing time [13]. Database applications mostly transfer large chunks of many tuples; we will thus focus on tuning TCP performance for large packets.

The micro-benchmark is a single-threaded program that sends 100k distinct messages of size 512 KB between two machines. From a variety of TCP options that should improve performance only SACK gave a measurable improvement. SACK enables fast recovery from packet loss, important for high-speed links. In a first experiment we transfer data only from sender to receiver while in the second we use full-duplex communication. The results are shown in Figure 5.

Figure 5: Tuning TCP for database workloads (single-threaded, 100k transfers, 512 KB messages)

The original specification of IPoIB in RFC 4391 and 4392 was limited to the datagram mode. This mode supports a 2,044 byte MTU, TCP offloading to the network card, and IP multicast. The connected mode was later added in RFC 4755. While it allows a MTU of up to 65,520 bytes, it does not support TCP offloading or IP multicasting. Disabling the TCP offloading features in datagram mode decreases the throughput for bidirectional transfer by 60 % from 0.93 GB/s to 0.37 GB/s. The connected mode with an MTU of 2,044 bytes performs similar with 0.38 GB/s. However, the missing offloading capabilities are more than offset by the larger MTU of 65,520 bytes available in connected mode. The large MTU increases the throughput by 62 % to 1.51 GB/s compared to datagram mode with offloading.
A further performance bottleneck for TCP is interrupt handling. The network device issues interrupt requests (IRQ) to which the kernel responds by executing an interrupt handler on a configured core. The kernel automatically schedules the network thread to this same core to reduce cache misses. However, TCP throughput increases by a further 44 % to 2.17 GB/s when the network thread is explicitly pinned to a different core. While this improves performance it also adds to the CPU overhead as now two cores are used. We also investigated whether NUIOA has an impact on TCP throughput. In our micro-benchmark, pinning the network thread to the local NUMA socket improves throughput by 15 % in datagram and 6 % in connected mode for bidirectional transfers. The bottleneck of TCP remains the CPU load of the receiver. Receive and send thread as well as the interrupt handler add a significant CPU overhead. The receiver experiences 100 %–190 % CPU utilization for the unidirectional transfers. The peak of 190 % (i.e., two fully occupied cores) is reached in datagram mode when network thread and interrupt handler are pinned to different cores. In contrast, RDMA has almost no CPU load and achieves full bandwidth already with a single data stream.
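The sender side of such a bulk-transfer micro-benchmark can be sketched as follows. Host address, port, and the chosen core are placeholders; IPoIB mode and MTU are configured outside the program, so this is only an outline of the kind of measurement described above.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sched.h>
#include <sys/socket.h>
#include <unistd.h>
#include <pthread.h>
#include <vector>

int main() {
    // Pin the network thread to an explicitly chosen core (placeholder: core 2).
    cpu_set_t set; CPU_ZERO(&set); CPU_SET(2, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(5000);                        // placeholder port
    inet_pton(AF_INET, "192.168.0.2", &addr.sin_addr);  // IPoIB address of the peer
    connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));

    std::vector<char> msg(512 * 1024, 'x');             // 512 KB message
    for (int i = 0; i < 100000; ++i) {                  // 100k transfers
        size_t sent = 0;
        while (sent < msg.size()) {
            ssize_t n = send(fd, msg.data() + sent, msg.size() - sent, 0);
            if (n <= 0) break;                          // error handling omitted
            sent += static_cast<size_t>(n);
        }
    }
    close(fd);
}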
3.1.3 Discussion
As visible from the micro-benchmark, it is possible to bring TCP's throughput close to that of RDMA via careful parameter tuning and by using multiple TCP streams for bidirectional transfers. However, this comes at the cost of an increased CPU load. We have also shown that tuning TCP for bulk transfers is complicated and fragile. RDMA, as we will further see in the next section, thus remains the best option for analytical database systems as it requires less tuning and frees the CPU for query processing.

3.2 RDMA

Infiniband offers an asynchronous, zero-copy communication method that incurs almost no CPU overhead called remote direct memory access (RDMA). Its low overhead frees resources for application-specific processing.
3.2.1 Memory vs. Channel Semantics
RDMA supports memory semantics via one-sided operations that allow to remotely read and write the main memory of a server without involving its CPU at all. The registration of a memory region creates a memory key that is required for remote access. These memory keys, in combination with protection domains, allow to restrict the access to memory regions further. A separate channel is required to exchange memory keys before communication can start. The alternative to the memory semantics are conventional channel semantics. These are enabled by two-sided send and receive operations. The receiver posts receive work requests that specify the target memory regions for the next incoming messages. This eliminates the requirement to exchange memory keys before the transfer. There is no performance difference between one- and two-sided operations [14]. An application should choose the semantics that fit best. For analytical database systems, it makes sense to use channel semantics as these allow the receiver to get notified when a new message has arrived. This is not supported for memory semantics as the receiver is not involved in the transfer at all and would need to be implemented via polling or a separate communication channel. For our RDMA implementation, we thus use channel semantics.
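The receive path of the channel semantics can be sketched with the verbs API as follows. Queue pair, memory region, and buffer are assumed to have been set up elsewhere, and error handling is omitted; this is an illustration, not HyPer's code.

#include <infiniband/verbs.h>
#include <cstdint>

// Pre-post a receive work request that names the buffer for the next message.
void postReceive(ibv_qp* qp, ibv_mr* mr, void* buf, uint32_t len, uint64_t id) {
    ibv_sge sge = {};
    sge.addr   = reinterpret_cast<uint64_t>(buf);  // registered message buffer
    sge.length = len;                              // e.g., 512 KB
    sge.lkey   = mr->lkey;                         // key from ibv_reg_mr

    ibv_recv_wr wr = {};
    wr.wr_id   = id;        // returned in the completion, identifies the buffer
    wr.sg_list = &sge;
    wr.num_sge = 1;

    ibv_recv_wr* bad = nullptr;
    ibv_post_recv(qp, &wr, &bad);                  // hand the buffer to the HCA
}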
3.2.2 Kernel Bypassing
The Infiniband HCA reads and writes main memory directly without interacting with the operating system or application during transfers. The performance-critical path of sending and receiving data thus involves no CPU overhead. Work requests and completion notifications are posted and received directly to and from the HCA and do not involve the kernel. This avoids the overhead of system calls and the copying between application and system buffers. As a consequence, the application has to manage buffers explicitly. For this purpose, RDMA introduces the concept of a memory region. A memory region provides the mapping between virtual and physical addresses so that the HCA can access main memory without involving the kernel. Memory regions have to be registered with the HCA before communication can start and the registration causes the kernel to provide the virtual-to-physical mapping to the HCA but also pins the memory to avoid swapping to disk. This guarantees that the HCA can access the memory at any time. Registering memory regions is a time-consuming operation [14] and they should thus be reused whenever possible. HyPer’s exchange operator implements this via a message buffer pool.
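Registering a pool of message buffers once at startup, so that pinning and address translation are not paid on the critical path, can be sketched like this. Pool structure and sizes are illustrative assumptions.

#include <infiniband/verbs.h>
#include <cstdlib>
#include <vector>

struct RegisteredBuffer { void* data; ibv_mr* mr; };

std::vector<RegisteredBuffer> createBufferPool(ibv_pd* pd, size_t count, size_t size) {
    std::vector<RegisteredBuffer> pool;
    pool.reserve(count);
    for (size_t i = 0; i < count; ++i) {
        void* data = aligned_alloc(4096, size);            // page-aligned buffer
        ibv_mr* mr = ibv_reg_mr(pd, data, size,
                                IBV_ACCESS_LOCAL_WRITE);   // pin + map for the HCA
        pool.push_back({data, mr});
    }
    return pool;   // buffers are reused for the lifetime of the connection
}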
3.2.3 Asynchronous Operation

The Infiniband verbs interface is inherently asynchronous. Work requests are posted to the send/receive work queue of the Infiniband host channel adapter (HCA). The HCA processes work requests asynchronously and adds a work completion to the completion queue once it has finished. The application can process completion notifications when it sees fit to do so. This asynchronous interface makes overlapping of communication and computation straightforward.

3.2.4 Polling vs. Events for Completion Notifications
RDMA with channel semantics provides two mechanisms to check for the availability of new messages. The first uses constant polling to check for new completion notifications. While this guarantees lowest latency it also incurs a CPU load of 100 % on one core. The second mechanism uses events to signal new completion notifications. The HCA raises an interrupt when a new message has arrived and wakes threads that are waiting for this event. The event-based handling of completion notifications reduces the CPU overhead to a mere 4 % for a full-speed bidirectional transfer with 512 KB messages at the cost of a potentially higher latency compared to polling. However, the latency increase is not noticeable for analytical database workloads with large
messages. The CPU overhead of handling completion notifications can be further reduced by processing only every nth completion notification. For Mellanox ConnectX-3 cards n is limited to 16,351 active work requests per connection.
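A minimal sketch of the event-based variant with the verbs API is shown below; completion queue and completion channel are assumed to exist, and the processing of completed work requests is only indicated.

#include <infiniband/verbs.h>

void waitForCompletions(ibv_comp_channel* channel, ibv_cq* cq) {
    ibv_req_notify_cq(cq, 0);                  // arm the CQ: request an event

    ibv_cq* evCq; void* evCtx;
    ibv_get_cq_event(channel, &evCq, &evCtx);  // block until the HCA signals
    ibv_ack_cq_events(evCq, 1);                // acknowledge the event
    ibv_req_notify_cq(evCq, 0);                // re-arm before draining

    ibv_wc wc;
    while (ibv_poll_cq(evCq, 1, &wc) > 0) {
        // wc.wr_id identifies the message buffer; hand it to a worker thread
    }
}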
Figure 6: Network scheduling avoids switch contention and improves throughput by up to 40 %: (a) head-of-line blocking caused by input queuing: port 1 blocks ports 3 and 4, (b) round-robin network scheduling improves throughput by up to 40 %, (c) 512 KB messages or larger hide synchronization cost completely (6 nodes)

3.3 Network Scheduling

Two servers can communicate with maximum throughput as long as their CPUs can keep up with TCP stack processing or, in the case of RDMA, the rate of completion notifications. This is not the case for all-to-all communication between several machines. Here, switch contention can arise in high traffic situations and reduce the throughput significantly—even for non-blocking switches that have enough capacity to support all ports simultaneously with maximum throughput. In the case of Ethernet switches, uncoordinated all-to-all communication can reduce throughput due to head-of-line (HOL) blocking. It is a consequence of input queuing in the switch as illustrated in Figure 6(a): Incoming port 1 blocks incoming ports 3 and 4 even though they could send data to outgoing port 4 and 3 without contention. Infiniband does not suffer from HOL blocking. A packet from a port cannot block the transmission of a packet from another port. However, something similar occurs: credit starvation. TCP and Infiniband differ in how they implement flow control. While TCP drops packets during congestion, Infiniband specifies a lossless fabric to achieve minimal latency. It uses a credit-based link-level flow control to guarantee lossless transport. Each credit granted by the receiver to the sender guarantees that 64 bytes can be received. No data is transmitted unless the credits indicate sufficient buffer space. Credit passes via management packets and is issued per virtual lane to enable traffic prioritization. When several input ports transmit data to the same output ports, the credits from the corresponding receiver can run out faster than they are granted. While other packets from the same input ports could still be processed, the buffer space for input ports is limited. Now, the situation can arise that the credits for all outstanding packets of a port are exhausted. This creates back pressure and the switch cannot receive more packets for this input port until it obtains new credits from the receiver. This is called credit starvation. To mitigate the problem of switch contention, we devised a network scheduling approach that avoids HOL blocking for Ethernet and credit starvation for Infiniband, respectively, by dividing the communication into distinct phases. In each phase a node has a single target to which it sends, and likewise a single source from which it receives. Using round-robin scheduling of the phases improves throughput by up to 40 % for our 6-node Infiniband 4×QDR cluster and two additional smaller nodes as demonstrated by the micro-benchmark in Figure 6(b). In this experiment, each node transmits 1,680 messages with 512 KB each. After sending 8 messages to a fixed target, all nodes synchronize via low-latency (∼1 µs) inline synchronization messages before they send to the next target. The data transfer between synchronizations has to be large enough to amortize the time needed for synchronization as illustrated in Figure 6(c). For our distributed engine we thus use a message size of 512 KB.
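One possible round-robin schedule with the properties described above (one target and one source per phase) can be computed as follows; the concrete formula is our illustrative assumption, and the synchronization between phases is omitted.

#include <cstddef>

struct Phase { size_t sendTo; size_t receiveFrom; };

// phase runs from 1 to numNodes-1; phase 0 would map a node onto itself.
Phase schedule(size_t node, size_t phase, size_t numNodes) {
    Phase p;
    p.sendTo      = (node + phase) % numNodes;
    p.receiveFrom = (node + numNodes - phase) % numNodes;
    return p;
}

// Each node sends, e.g., 8 messages of 512 KB to p.sendTo, then all nodes
// synchronize with a short inline message before advancing to the next phase.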
4. TUNING THE EXCHANGE OPERATOR FOR MODERN HARDWARE
The exchange operator allows to implement distributed query processing within a single-node database system without changing the relational operator implementations. However, the exchange operator needs to be carefully adapted for modern hardware. This section describes how we tuned the exchange operator for high-speed networks in the large and integrated it with the morsel-driven parallelization engine in the small, in order to enable fast processing on a cluster of NUMA servers.
4.1 Query Compilation
HyPer's exchange operator is implemented in the same framework as the other relational operators. It is fully parallelized and integrated with the morsel-driven parallelization engine [24]. The exchange operator consists of two parts: The first consumes tuples from the preceding execution pipeline, determines the target node for a tuple by hashing the specified distribution keys, serializes it to the message buffer that corresponds to this target node, and finally passes full messages to the network multiplexer for transfer. The second part receives messages from the multiplexer, deserializes the tuples, and pushes them through the subsequent execution pipeline. Currently, we process the first part strictly before the second. However, an interleaved execution is also possible and could improve pipelining. We do not rely on a third-party framework for serialization such as Protocol Buffers or Thrift. Instead, the exchange operator generates LLVM code that efficiently serializes and deserializes tuples. The code is generated for the specific schema of the input tuples and thus does not need to dynamically interpret a schema.
This avoids branching and leads to much better code and data locality. HyPer's code generation further knows which attributes are required for subsequent operators and prunes unnecessary columns as early as possible. This avoids that useless data is shipped over the network. Only attributes that are required by subsequent operators are serialized and sent to other nodes.
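The effect of schema-specific serialization can be illustrated with the following hand-written equivalent of what such generated code boils down to; the tuple layout and attribute names are made up for illustration and do not correspond to HyPer's generated LLVM code.

#include <cstdint>
#include <cstring>

struct ExampleTuple {          // only the attributes needed downstream
    int32_t  key;
    int64_t  quantity;         // fixed-size attributes ...
    uint32_t nameLen;
    const char* name;          // ... and one variable-length string
};

char* serialize(const ExampleTuple& t, char* out) {
    std::memcpy(out, &t.key, sizeof(t.key));           out += sizeof(t.key);
    std::memcpy(out, &t.quantity, sizeof(t.quantity)); out += sizeof(t.quantity);
    std::memcpy(out, &t.nameLen, sizeof(t.nameLen));   out += sizeof(t.nameLen);
    std::memcpy(out, t.name, t.nameLen);               out += t.nameLen;
    return out;   // next free byte in the message buffer
}

Because the layout is fixed at code-generation time, no schema has to be interpreted per tuple and no per-attribute branching is required.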
4.2 Communication Topology

Each worker thread in a parallel database system has to be able to exchange state with every other worker thread. Communication within a database server usually uses shared memory. Across nodes, each of the t local threads communicates with the (n − 1) × t remote threads via network connections. A straightforward implementation connects every thread with every remote thread as shown in Figure 7(a). This is called thread addressing and requires (n − 1) × t connections per thread, resulting in (n − 1) × t² per machine, and n × (n − 1) × t² connections in the cluster. For our relatively small 6-node cluster with a moderate number of 20 hyper-threaded cores and thus 40 threads per machine, this results in a total of 48,000 connections in the cluster. Even for small clusters this can lead to serious scalability problems. Additionally, the connections consume a lot of memory. E.g., every RDMA queue pair requires at least 4 KB of main memory, more than 180 MB for the 48,000 connections. The alternative is to create a dedicated communication thread on each machine that multiplexes communication to the remote nodes. This node addressing is shown in Figure 7(b). It reduces the number of connections in the cluster to n × (n − 1), or 30 for the example. However, the required multiplexing of the communication and scheduling of the network is not straightforward. HyPer implements this approach as we will describe in the following section. There is another possibility that reduces the number of connections even further: Unreliable communication, which requires only one connection per node. Instead, messages specify their target node. However, the database system now has to ensure reliability over the unreliable connection. This is complex and will likely introduce a significant overhead. Infiniband offers an alternative called XRC, which is a reliable connection-oriented communication method that also requires only one connection per node. Again, messages have to specify the target node, but reliability is realized in hardware. This reduces the number of connections to a minimum of n per node, resulting in 6 connections for the example. HyPer currently uses node-addressing, which sufficiently reduces the number of connections.

Figure 7: Node addressing requires significantly fewer connections than thread addressing: (a) thread addressing uses one socket per remote thread, (b) node addressing uses one socket per remote node
4.3 Communication Multiplexing

HyPer has a dedicated network thread per server, called multiplexer, that manages the data transfer between local workers and workers on remote nodes according to a global round-robin schedule as discussed in Section 3.3. Local workers are not connected to all remote workers as this would lead to an excessive number of connections. Instead, the multiplexers on different nodes are connected with each other. Any available worker can process any message as long as it belongs to its currently processed exchange operator. This enables work stealing and flexible parallelism, where the number of workers for a query can change dynamically.
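The core of such a multiplexer thread can be sketched as follows; queue and posting types are placeholders, and in HyPer the order additionally follows the global schedule from Section 3.3 rather than plain round-robin.

#include <cstddef>
#include <deque>
#include <vector>

struct Message;   // serialized tuples destined for one target node

void multiplexerLoop(std::vector<std::deque<Message*>>& sendQueues,   // one per remote node
                     const bool& done,
                     void (*post)(size_t targetNode, Message*)) {
    size_t node = 0;
    while (!done) {
        if (!sendQueues[node].empty()) {
            post(node, sendQueues[node].front());   // hand the message to the HCA
            sendQueues[node].pop_front();
        }
        node = (node + 1) % sendQueues.size();      // rotate over the targets
    }
}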
4.3.1 Message Queues

The multiplexer manages several send and receive queues. There is one send queue per remote node in the cluster and one receive queue per exchange operator as shown in Figure 8. The network thread continuously sends messages in a round-robin fashion to use all the available network bandwidth. Exchange operators serialize their input tuples and add them to a message corresponding to the target node. Each operator has its own message buffer for each target node to avoid synchronization when inserting tuples. Once a message is full it is inserted into the corresponding send queue in the network abstraction. The exchange operator then requests a new empty message. The processing of incoming messages happens in a similar fashion: Workers take messages from the receive queue corresponding to the currently processed exchange operator, deserialize the tuples, and push them into the next execution pipeline.

Figure 8: Communication multiplexing for 4 servers with 2 NUMA sockets each and 8 workers per socket
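A simplified, unsynchronized sketch of a message pool with retain counts, as used for the buffer reuse and broadcasting described in the next subsection, could look as follows; the class and its interface are illustrative.

#include <atomic>
#include <vector>

struct Message {
    std::vector<char> data;     // registered with the HCA once, then reused
    std::atomic<int> retain{0};
};

class MessagePool {
    std::vector<Message*> free_;
public:
    Message* acquire(int receivers = 1) {
        Message* m;
        if (free_.empty()) { m = new Message(); }      // registered once, lazily
        else { m = free_.back(); free_.pop_back(); }
        m->retain.store(receivers);                    // >1 for broadcast messages
        return m;
    }
    void release(Message* m) {                         // called once per receiver
        if (m->retain.fetch_sub(1) == 1) free_.push_back(m);
    }
};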
4.3.2 Message Buffer Reuse
HyPer's multiplexer manages the send and receive queues for outgoing and incoming messages as well as the reuse of messages via retain counting. Instead of deallocating messages when they are no longer needed, they are placed in a message pool. This avoids repeated memory allocation and deallocation during query processing as well as the expensive pinning of new messages and registering them with the
Infiniband HCA to enable RDMA [14]. We use one message pool per CPU socket to enable NUMA-local processing. HyPer supports dynamic strings of arbitrary length. To avoid the expensive copying of these strings to temporary storage when deserializing tuples, we keep the string data in the message and do not reuse it until the end of the query. Using dictionary compression for strings on a per-message basis could reduce message size and thus improve performance, which is not implemented in HyPer at the moment. The retain count of network messages is also used for broadcasting similar to the original exchange operator [18]. This avoids copying a broadcast message once for every target node. A message is released when its retain count reaches zero, i.e., after it was forwarded to the last remote node.

4.3.3 NUMA-aware Multiplexing

Modern servers with large main-memory capacities feature a non-uniform memory architecture (NUMA). Every CPU has its own local memory controller and accesses remote memory via QPI links that connect CPUs. As QPI speed is slower than local memory and has a higher latency, a remote access is more expensive than a local access. The query execution engine has to take this into account and restrict itself to local memory accesses as much as possible. Relying simply on the OS default placement incurs a performance penalty of up to 4.95× on a 4-socket Nehalem EX system and 5.81× on a Sandy Bridge EP (also 4-socket) in case of the TPC-H analytical benchmark [24]. The network abstraction exposes NUMA characteristics to the database system to avoid incurring this performance penalty. The multiplexer has one receive queue for every NUMA socket as shown in Figure 8 and alternately receives messages for each of them. This also means that NUMA is hidden inside the server so that servers in the cluster could potentially have heterogeneous architectures. The multiplexer supports work-stealing: workers take messages from remote queues when their NUMA-local queue is empty. For the 6-node cluster, allocating messages on a single socket reduces TPC-H performance for the hash-join plans by a mere 8 %. However, these servers have only two sockets that are further well-connected via two QPI links. This explains the minimal NUMA effects. Figure 9 shows the measurement for a 4-socket Sandy Bridge EP server with 15 cores per socket and 1 TB of main memory. Sockets are fully connected via one QPI link for each combination of sockets. Interleaved allocation of the network buffers reduces TPC-H performance by 24 % compared to NUMA-aware allocation, allocating messages on a single socket even by 84 %. This demonstrates that the network in the small can severely impact the performance of the network in the large.
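Allocating the per-socket message pools in a NUMA-aware fashion can be sketched with libnuma as follows; the pool structure and sizes are illustrative assumptions.

#include <numa.h>   // link with -lnuma
#include <vector>

std::vector<std::vector<void*>> allocatePerSocketPools(size_t buffersPerSocket,
                                                       size_t bufferSize) {
    int sockets = numa_available() < 0 ? 1 : numa_num_configured_nodes();
    std::vector<std::vector<void*>> pools(sockets);
    for (int node = 0; node < sockets; ++node)
        for (size_t i = 0; i < buffersPerSocket; ++i)
            pools[node].push_back(numa_alloc_onnode(bufferSize, node));
    return pools;   // release with numa_free(ptr, bufferSize) at shutdown
}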
Figure 9: Impact of message buffer allocation policy on TPC-H performance on a 4-socket NUMA server

5. DISTRIBUTED OPERATOR DETAILS
This section discusses several optimizations for distributed query plans with exchange operators. In particular, we cover distributed joins, grouping/aggregation, sorting, and top-k.

5.1 Join

A distributed equi-join requires that either both of its inputs are hash partitioned by one of the join keys or that one input is broadcast to all nodes. The former is known as Grace hash join [22] or hybrid hash join [9] and the latter as fragment-replicate join [11] or broadcast join. The broadcast join is cheaper than the hash-join when one input is at least (n − 1)× larger than the other, where n is the number of nodes in the cluster. Switching to the broadcast join when inputs have largely different sizes yields significant performance benefits as shown in Table 2 (e.g., an 11× improvement in execution time for query 17). The overall runtime for all TPC-H queries improves by 2.2× when hash joins are replaced by broadcast joins where beneficial. The exchange operator in HyPer implements broadcasting via reference counting of messages. This avoids copying broadcast messages once for every target node. A message is released when it reaches a zero retain count, i.e., after it was sent to the last node. Replacing hash joins with broadcast joins where applicable reduces the overall amount of data that needs to be materialized for the 22 TPC-H queries by a factor of 6 from 154 GB to 25 GB (SF 100). Using the broadcast redistribution method requires an additional redistribution of the join result to guarantee correctness for some types of non-equi joins. While a hash join computes semi, anti, and outer joins correctly without modification, a broadcast join might produce duplicates and false results for left semi, left anti, and left outer join. The left semi join may produce duplicate result tuples at different join sites. The result therefore has to be redistributed to eliminate these duplicates. The anti join will produce valid results n times, once for each join site. A replicated tuple may join only at a subset of the join sites and thus cause spurious results at remaining sites, where it does not find a join partner. Again, the result needs to be redistributed and a tuple is to be included in the result only if it occurs exactly n times, i.e., if it found no join partner at any join site. The left outer join may produce duplicates and false results similar to the anti join. Dangling tuples have to be redistributed and counted. Dangling tuples that found no partner across sites are kept in the final result [38].

            hash-join [s]   broadcast [s]   improvement
Q2          0.24            0.04            6.3×
Q3          0.53            0.34            1.5×
Q4          0.18            0.17            1.1×
Q5          1.00            0.22            4.5×
Q7          0.52            0.32            1.6×
Q8          1.07            0.17            6.4×
Q9          1.61            0.59            2.7×
Q10         0.71            0.47            1.5×
Q11         0.22            0.06            3.7×
Q12         0.21            0.12            1.7×
Q17         1.03            0.09            11.0×
Q18         1.39            0.70            2.0×
Q20         0.18            0.13            1.4×
Q21         1.01            0.47            2.2×
Q1-22       10.99           4.97            2.2×

Table 2: Broadcast improves performance when joining small with large relations (TPC-H SF 100, 6 nodes)

The broadcast join can further be used to compute theta joins as every combination of tuples from the two relations is compared at some join site. This is not the case for the hash join that only compares tuples that fall into the same partition. Table 3 summarizes the findings for non-equi joins.

            RNS   RQS   RTS   RWS   RES   RHS   RBθS
hash-join   ✓     ✓     ✓     ✓     ✓     ✓     –
broadcast   ✓     ✓2    ✓     ✓2    ✓     ✓3    ✓

Table 3: Support for distributed non-equi joins
2 Requires redistribution of the result.
3 Requires redistribution of dangling tuples.
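The plan-time choice between the two redistribution methods discussed above can be sketched as follows; cardinalities are optimizer estimates, and the rule of thumb is the one stated in the text.

#include <cstdint>

enum class Redistribution { HashPartitionBoth, BroadcastLeft, BroadcastRight };

Redistribution chooseJoinMethod(uint64_t leftCard, uint64_t rightCard, uint64_t numNodes) {
    // Broadcast the smaller input if the other is at least (n-1) times larger.
    if (rightCard >= (numNodes - 1) * leftCard) return Redistribution::BroadcastLeft;
    if (leftCard  >= (numNodes - 1) * rightCard) return Redistribution::BroadcastRight;
    return Redistribution::HashPartitionBoth;   // partition both on the join key
}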
5.2 Grouping/Aggregation

Distributed aggregations require that either their input is partitioned by one of the grouping keys or that all input tuples are sent to a single node. However, both options are overly expensive for large inputs. It is therefore often beneficial to introduce a pre-aggregation prior to the exchange operator. The actual aggregation operator following the exchange operator then combines the resulting partial aggregates (cf., Figure 3(c)). This can significantly reduce the number of tuples redistributed across the cluster—especially if the number of groups is small. It is straightforward to implement the pre-aggregation for sum and count. The average of an expression can be computed based on sum and count using a subsequent map operator. A pre-aggregation cannot be used when the aggregation computes a distinct count. HyPer's distributed query plans use a standard aggregation operator for pre-aggregation. A dedicated in-cache pre-aggregation operator could improve performance further. This operator would spill partially aggregated results to the exchange operator whenever its buffer exceeds the size of the cache—similar to the IBM DB2 BLU [32] implementation of the aggregation operator for a single machine. A spilling pre-aggregation may produce multiple output tuples for the same aggregation key. However, the final aggregation after the exchange operator will take care of these duplicates. Query 1 is an excellent example that demonstrates the benefits of pre-aggregation. It finishes in 6.4 seconds when the aggregation is computed without pre-aggregation on the node that is also responsible for producing the result. A distributed aggregation based on hash-partitioning the input reduces the execution time to 3.33 seconds. The introduction of a pre-aggregation before the redistribution reduces the runtime considerably to a mere 0.08 seconds. This is an improvement of 80× compared to aggregating on the final node and 40× compared to distributed aggregation.
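A minimal sketch of such a pre-aggregation for sum and count is given below; types and the hash table are illustrative, and the final aggregation after the exchange merges the partial states per group key.

#include <cstdint>
#include <unordered_map>

struct Partial { double sum = 0; uint64_t count = 0; };

using PreAgg = std::unordered_map<int64_t, Partial>;   // group key -> partial state

void preAggregate(PreAgg& local, int64_t groupKey, double value) {
    Partial& p = local[groupKey];
    p.sum   += value;
    p.count += 1;
}

// Each entry of `local` is serialized and sent to node hash(groupKey) % n;
// the receiver adds up the partial sums and counts and derives avg = sum/count.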
5.3 Sorting/Top-k

A sort operator can be distributed across nodes by sorting the input locally on every node and merging the resulting sorted runs on one node. HyPer's distributed query plans currently do not apply this optimization due to the lack of a merge operator in HyPer. However, the plans use local sorts to reduce the number of tuples redistributed for top-k queries. Every node computes a local top-k and transfers its k result tuples to a single node that then finalizes the distributed top-k computation with a final top-k operator. This optimization applies to query 2, query 3, and query 18 of the TPC-H benchmark. However, it does not result in measurable improvements as the number of input tuples for the top-k operators is too small in these cases. Query 2 computes a top-100 on 47,107 tuples, query 3 a top-10 on 131,041 tuples, and query 18 a top-100 on 6,398 tuples (SF 100). However, this optimization should reduce the number of redistributed tuples significantly when the input is larger. Distributed sorting can be further optimized using special techniques. One option is to split sort key and payload prior to actual sorting similar to CloudRAMSort [21] in order to overlap computation and communication further.
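The local part of such a distributed top-k can be sketched as follows; tuple type and comparator are placeholders, and the coordinator simply concatenates the n local results and applies the same function once more.

#include <algorithm>
#include <cstddef>
#include <vector>

template <typename Tuple, typename Less>
std::vector<Tuple> localTopK(std::vector<Tuple> in, size_t k, Less less) {
    if (in.size() > k) {
        std::partial_sort(in.begin(), in.begin() + k, in.end(), less);
        in.resize(k);          // at most k tuples leave the node
    } else {
        std::sort(in.begin(), in.end(), less);
    }
    return in;
}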
6. EVALUATION

We integrated our distributed query processing engine in HyPer, a fully-fledged main-memory database system that supports the SQL-92 standard and offers excellent single-node performance. The experiments focus on ad-hoc analytical distributed query processing performance.

6.1 Experimental Setup

We conducted all experiments on a cluster of six identical servers connected via ConnectX-3 host channel adapters (HCAs) to an 8-port QSFP InfiniScale IV InfiniBand IS5022 switch operating at 4× quad data rate (QDR) resulting in a theoretical network bandwidth of 4 GB/s per link. Each Linux server (Ubuntu 14.10, kernel 3.16.0-29) is equipped with two Intel Xeon E5-2660 v2 CPUs clocked at 2.20 GHz with 10 physical cores (20 hardware contexts due to hyper-threading) and 128 GB of main memory—resulting in a total of 120 cores (240 hardware contexts) and 1.5 TB of main memory in the cluster. The RDMA host channel adapter runs firmware version 2.32.5100 and driver version 2.2-1. It is connected via 16× PCIe 3.0. The hardware setup is illustrated in Figure 1. The thickness of a line in the diagram corresponds to the respective bandwidth of the connection. HyPer offers both row and column-wise data storage; all experiments were conducted using the columnar format and only primary key indexes were created. On each node, HyPer transparently distributes the input relations over all available NUMA sockets (two in our case) by partitioning each relation using the first attribute of the primary key. All execution times include memory allocation (from the operating system) and page faulting as well as deallocation for intermediate results, hash tables, etc. Messages are reused to avoid repeated allocation as well as the cost of pinning the memory regions and registering with the Infiniband HCA.

6.2 TPC-H

We compared HyPer with state-of-the-art distributed SQL systems: Apache Hive 0.13.1, Cloudera Impala 2.1 [23], and Vectorwise Vortex [7] (commercially known as Actian Analytics Platform, Express Hadoop SQL Edition 2.0). We installed Hive and Impala as part of the Cloudera Hadoop distribution (CDH 5.3). HDFS is configured with a replication factor of 3 and short-circuit reads. The cluster is configured so that the systems use the Infiniband network via IPoIB. We used unmodified TPC-H queries except for Hive—which we measured with the rewritten TPC-H queries from a recent study [12]—and a modified query 11 for Impala, which does not support subqueries in the having clause.
Figure 10: TPC-H query runtimes for several distributed database systems on a cluster of 6 nodes (SF 100)
We tuned the systems according to their documentation: Hive is measured with statistics processing ORC files using a tuned Hadoop YARN (optimized main-memory limits, JVM reuse enabled, output compression via Snappy, increased I/O buffers). The rewritten queries [12] enable Hive’s correlation optimization, predicate push down, and map-side join/aggregation. Impala is run with statistics on parquet files with disabled pretty printing. Short-circuit reads and runtime code generation are enabled. Vortex is measured with statistics, primary keys, and foreign keys. We used the following TPC-H configuration recommended by an Actian engineer specifically for our cluster: Vortex is set to a max_parallelism_level of 120 and num_cores of 20. Clustered indexes are created on l_orderkey for lineitem, o_orderdate for orders, p_partkey for part, ps_partkey for partsupp, r_regionkey for region, n_regionkey for nation, c_nationkey for customer, and s_nationkey for supplier. part is partitioned on p_partkey, partsupp on ps_partkey, orders on o_orderkey, and lineitem on l_orderkey with 24 partitions each. Nation, region, customer, and supplier are replicated across nodes. For non-replicated customer and supplier, the runtime increases to 19.6 seconds and the geometric mean to 0.56 (this also disables the clustered indexes on customer and supplier due to a technical restriction). For HyPer, we replicated only nation and region. All other relations are separated into 6 chunks using the TPC-H dbgen tool. HyPer makes no assumptions on the data distribution and thus has to use distributed joins and aggregations. Queries are run 10 times with the median being used for the comparison and to compute the geometric mean. File buffers are not flushed between runs to keep all data hot in main memory so that disk I/O performance should not be an issue. This was verified for Impala where the HDFS cache did not lead to a measurable performance improvement. Figure 11 compares the distributed SQL systems for the TPC-H benchmark (SF 100 with ∼100 GB data). The figure shows the runtime for all 22 queries. A complete TPC-H run takes 31 minutes and 39 seconds for Hive, 10 minutes and 33 seconds for Impala, 16.6 seconds for Vortex, and 5 seconds for HyPer. Figure 10 compares the detailed query execution times in logarithmic scale (numbers shown in Table 4). Facebook Presto is missing in the comparison as it lacks support for several SQL features required for TPC-H. Apache SparkSQL, the follow-up to Shark, is still early in development and unsupported in the current version of CDH.
Figure 11: Execution time for all 22 TPC-H queries comparing several distributed SQL systems (SF 100)

            Hive        Impala      Vortex      HyPer
Q1          54.58s      7.60s       0.81s       0.08s
Q2          84.59s      9.86s       0.24s       0.04s
Q3          67.70s      23.24s      0.19s       0.33s
Q4          105.81s     26.00s      0.04s       0.17s
Q5          67.06s      24.87s      0.34s       0.23s
Q6          23.15s      3.01s       0.04s       0.03s
Q7          132.54s     43.01s      0.35s       0.33s
Q8          69.42s      19.98s      0.63s       0.17s
Q9          67.90s      67.29s      1.36s       0.59s
Q10         67.27s      12.32s      1.69s       0.48s
Q11         96.34s      3.99s       0.25s       0.06s
Q12         67.11s      8.09s       0.10s       0.12s
Q13         91.46s      26.37s      2.14s       0.38s
Q14         28.10s      5.15s       0.66s       0.07s
Q15         111.40s     4.45s       1.14s       0.10s
Q16         59.42s      6.20s       0.66s       0.20s
Q17         70.02s      85.39s      0.52s       0.09s
Q18         111.09s     66.72s      1.27s       0.61s
Q19         27.96s      100.84s     0.74s       0.20s
Q20         155.24s     15.42s      0.45s       0.13s
Q21         195.44s     67.35s      1.85s       0.48s
Q22         145.31s     6.31s       1.13s       0.06s
geo. mean   75.84       16.94       0.49        0.16
total       31m 39s     10m 33s     16.6s       5s

Table 4: TPC-H query runtimes (cf., Figure 10)
Figure 12: Further TPC-H experiments on a 6-node cluster for scale factor 100: (a) HyPer's TPC-H performance for different data placement strategies, (b) speed-up when increasing the number of threads for Vortex and HyPer, (c) HyPer's TPC-H performance depending on the network throughput

HyPer processes a TPC-H SF 300 data set in 15.6 seconds. This is a factor of 3.1 longer than the 5 seconds for SF 100. The data placement impacts query performance: Hash-partitioning the TPC-H relations by the first attribute of the primary key improves performance by 44 % for hash-join plans and 20 % for broadcast plans as shown in Figure 12(a). Figure 12(b) shows the speed-up for a TPC-H SF 100 experiment that increases the number of threads for Vortex and HyPer; Impala's query execution is limited to one thread per node. Actian recommends that hyper-threads should be disabled; we thus constrained the experiment to the 20 physical cores for Vortex. HyPer scales up to 12× before it becomes network-bound. Vortex scales only to a factor of 3 due to its non-optimal scalability on many-core machines [24], and probably due to the use of thread-addressing, which likely makes the exchange operator a bottleneck. Our Infiniband hardware allows to configure the data rate from single data rate (SDR), over double data rate (DDR), up to quad data rate (QDR). Figure 12(c) correlates the TPC-H SF 100 performance of HyPer for these data rates with their respective maximum achievable throughput.
7. RELATED WORK
Parallel databases are a well-studied field of research that attracted considerable attention in the 1980s/90s with the Grace [16], Gamma [10], Bubba [4], Volcano [18], and Prisma [2] database systems. Several parallel database systems of the time, both commercial and academic, are discussed in an influential survey [8] that further highlights the advantages of the shared nothing architecture [35]. Today’s commercial parallel database systems include Teradata, Exasol, IBM DB2, Oracle RAC, Greenplum, SAP HANA, HP Vertica (which evolved from C-Store [36]), Cloudera Impala [23], and Vectorwise Vortex [7]. The comparatively low bandwidth of standard network interconnects such as Gigabit Ethernet creates a bottleneck for distributed query processing. Consequently, recent research focused on minimizing the network traffic: Neo-Join [34] and Track Join [31] decide during query processing whether to assign or broadcast tuples to exploit locality in the data distribution. Bloom filters [3] are an older technique that is commonly used to reduce the network traffic. High-speed networks such as Infiniband and RDMA remove this bottleneck and have been applied to database systems before. Frey et al. [15] designed the Cyclo Join for join processing within a ring topology. Goncalves and Ker-
8. CONCLUDING REMARKS
This systems paper aims to provide a blueprint for designing a distributed query engine for modern network clusters consisting of NUMA servers. Using the right abstractions for modern high-speed networks makes it possible to scale the query performance of main-memory database systems in a cluster of machines. While RDMA has been applied to database systems before, to the best of our knowledge we are the first to show a NUMA-aware implementation of a Volcano-style exchange operator tailored for Infiniband and RDMA. This operator allowed us to extend the HyPer database system to distributed query processing in a cluster and to achieve excellent performance on analytical workloads. Further, we have shown potential optimizations for joins, aggregations, sorting, and top-k.

We augmented the single-node query plans with exchange operators to enable distributed processing. Making the query optimizer aware of the distribution costs would certainly yield better plans; the actual integration into the optimizer is beyond the scope of this work and left for future work. Another direction for future research is the support of distributed OLTP&OLAP processing on the same data, extending our previous work on transaction processing in a cluster of machines [27] to fragmented relations.
9. ACKNOWLEDGMENTS
Wolf Rödiger is a recipient of the Oracle External Research Fellowship. Tobias Mühlbauer is a recipient of the Google Europe Fellowship in Structured Data Analysis, and this research is supported in part by this Google Fellowship. This work has further been partially sponsored by the German Federal Ministry of Education and Research (BMBF) grant RTBI 01IS12057.
10. REFERENCES

[1] M.-C. Albutiu, A. Kemper, and T. Neumann. Massively parallel sort-merge joins in main memory multi-core database systems. PVLDB, 5(10), 2012.
[2] P. M. G. Apers, C. A. van den Berg, J. Flokstra, P. W. P. J. Grefen, M. L. Kersten, and A. N. Wilschut. PRISMA/DB: A parallel main memory relational DBMS. TKDE, 4(6), 1992.
[3] B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. CACM, 13(7), 1970.
[4] H. Boral, W. Alexander, L. Clay, G. Copeland, S. Danforth, M. Franklin, B. Hart, M. Smith, and P. Valduriez. Prototyping Bubba, a highly parallel database system. TKDE, 2(1), 1990.
[5] V. Cerf and R. Kahn. A protocol for packet network intercommunication. TCOM, 22(5), 1974.
[6] D. D. Clark, J. Romkey, and H. C. Salwe. An analysis of TCP processing overhead. In LCN, 1988.
[7] A. Costea and A. Ionescu. Query optimization and execution in Vectorwise MPP. Master's thesis, Vrije Universiteit, Amsterdam, Netherlands, 2012.
[8] D. DeWitt and J. Gray. Parallel database systems: The future of high performance database systems. CACM, 35(6), 1992.
[9] D. J. DeWitt and R. H. Gerber. Multiprocessor hash-based join algorithms. In VLDB, 1985.
[10] D. J. DeWitt, S. Ghandeharizadeh, D. A. Schneider, A. Bricker, H.-I. Hsiao, and R. Rasmussen. The Gamma database machine project. TKDE, 2(1), 1990.
[11] R. Epstein, M. Stonebraker, and E. Wong. Distributed query processing in a relational data base system. In SIGMOD, 1978.
[12] A. Floratou, U. F. Minhas, and F. Özcan. SQL-on-Hadoop: Full circle back to shared-nothing database architectures. PVLDB, 7(12), 2014.
[13] A. P. Foong, T. R. Huff, H. H. Hum, J. R. Patwardhan, and G. J. Regnier. TCP performance re-visited. In ISPASS, 2003.
[14] P. W. Frey. Zero-copy network communication. PhD thesis, ETH Zürich, Zurich, Switzerland, 2010.
[15] P. W. Frey, R. Goncalves, M. Kersten, and J. Teubner. Spinning relations: high-speed networks for distributed join processing. In DaMoN, 2009.
[16] S. Fushimi, M. Kitsuregawa, and H. Tanaka. An overview of the system software of a parallel relational database machine GRACE. In VLDB, 1986.
[17] R. Goncalves and M. Kersten. The Data Cyclotron query processing scheme. TODS, 36(4), 2011.
[18] G. Graefe. Encapsulation of parallelism in the Volcano query processing system. In SIGMOD, 1990.
[19] J. S. Kay and J. Pasquale. Profiling and reducing processing overheads in TCP/IP. TON, 4(6), 1996.
[20] A. Kemper and T. Neumann. HyPer: A hybrid OLTP&OLAP main memory database system based on virtual memory snapshots. In ICDE, 2011.
[21] C. Kim, J. Park, N. Satish, H. Lee, P. Dubey, and J. Chhugani. CloudRAMSort: fast and efficient large-scale distributed RAM sort on shared-nothing cluster. In SIGMOD, 2012.
[22] M. Kitsuregawa, H. Tanaka, and T. Moto-Oka. Application of hash to data base machine and its architecture. NGC, 1(1), 1983.
[23] M. Kornacker, A. Behm, V. Bittorf, T. Bobrovytsky, C. Ching, A. Choi, J. Erickson, M. Grund, and D. Hecht. Impala: A modern, open-source SQL engine for Hadoop. In CIDR, 2015.
[24] V. Leis, P. Boncz, A. Kemper, and T. Neumann. Morsel-driven parallelism: A NUMA-aware query evaluation framework for the many-core age. In SIGMOD, 2014.
[25] Y. Li, I. Pandis, R. Mueller, V. Raman, and G. Lohman. NUMA-aware algorithms: The case of data shuffling. In CIDR, 2013.
[26] S. Manegold, M. L. Kersten, and P. Boncz. Database architecture evolution: mammals flourished long before dinosaurs became extinct. PVLDB, 2(2), 2009.
[27] T. Mühlbauer, W. Rödiger, A. Reiser, A. Kemper, and T. Neumann. ScyPer: Elastic OLAP throughput on transactional data. In DanaC, 2013.
[28] H. Mühleisen, R. Gonçalves, and M. Kersten. Peak performance: Remote memory revisited. In DaMoN, 2013.
[29] T. Neumann. Efficiently compiling efficient query plans for modern hardware. PVLDB, 4(9), 2011.
[30] H. Plattner. The impact of in-memory databases on applications. Personal communication, July 2014.
[31] O. Polychroniou, R. Sen, and K. A. Ross. Track join: distributed joins with minimal network traffic. In SIGMOD, 2014.
[32] V. Raman, G. Attaluri, R. Barber, N. Chainani, D. Kalmuk, V. KulandaiSamy, J. Leenstra, S. Lightstone, S. Liu, G. M. Lohman, T. Malkemus, R. Mueller, I. Pandis, B. Schiefer, D. Sharpe, R. Sidle, A. Storm, and L. Zhang. DB2 with BLU acceleration: So much more than just a column store. PVLDB, 6(11), 2013.
[33] Y. Ren, T. Li, D. Yu, S. Jin, and T. G. Robertazzi. Design and performance evaluation of NUMA-aware RDMA-based end-to-end data transfer systems. In SC, 2013.
[34] W. Rödiger, T. Mühlbauer, P. Unterbrunner, A. Reiser, A. Kemper, and T. Neumann. Locality-sensitive operators for parallel main-memory database clusters. In ICDE, 2014.
[35] M. Stonebraker. The case for shared nothing. DEBU, 9(1), 1986.
[36] M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. O'Neil, P. O'Neil, A. Rasin, N. Tran, and S. Zdonik. C-store: A column-oriented DBMS. In VLDB, 2005.
[37] J. Teubner and R. Mueller. How soccer players would do stream joins. In SIGMOD, 2011.
[38] Y. Xu and P. Kostamaa. A new algorithm for small-large table outer joins in parallel DBMS. In ICDE, 2010.
[39] M. Zukowski, M. van de Wiel, and P. Boncz. Vectorwise: A vectorized analytical DBMS. In ICDE, 2012.