HG-Bitmap Join Index: A Hybrid GPU/CPU Bitmap Join Index Mechanism for OLAP

Yu Zhang1,2, Yansong Zhang1,3, Ji Zhang1,3, Hong Chen1,2

1 School of Information, Renmin University of China, Beijing 100872, China
2 DEKE Lab, Renmin University of China, Beijing 100872, China
3 National Survey Research Center at Renmin University of China, Beijing 100872, China
[email protected]

Abstract. In-memory big data OLAP (on-line analytical processing) remains a time-consuming task due to data access latency and the overhead of complex star join processing. GPUs have been introduced into DBMSs for their remarkable parallel computing power, but they are restricted by limited GPU memory size and the low PCI-E bandwidth between GPU and host memory. A GPU is well suited to linear processing with its powerful SIMD (Single Instruction Multiple Data) parallelism, but it lacks efficiency for complex control and logic processing. Therefore, how to manage the dimension tables and the fact table, how to dispatch the different processing stages of OLAP (selection, projection, join, grouping, aggregation) between CPU and GPU devices, and how to minimize data movement latency while maximizing the GPU's parallel processing efficiency are all important issues for a hybrid GPU/CPU OLAP platform. We propose a hybrid GPU/CPU bitmap join index (HG-Bitmap Join index) for OLAP that exploits a GPU-memory-resident join index mechanism to accelerate star joins in star schema OLAP workloads. We design a memory-constrained bitmap join index with fine-granularity, keyword-based bitmaps mined from the TOP K predicates, so that a specified GPU memory budget can be accurately assigned to the most frequent keyword bitmap join indexes. An OLAP query is first transformed into bitwise operations on the matched bitmaps to generate a global bitmap filter that minimizes the scan cost on the big fact table. In this mechanism, the GPU is fully utilized with simple bitmap storage and processing, the small bitmap filter transferred from GPU to host memory minimizes data movement overhead, and the hybrid GPU/CPU join index can improve OLAP performance dramatically.

Keywords: OLAP, hybrid GPU/CPU platform, star join, bitmap join index

1 Introduction

Moore's law keeps driving hardware techniques forward. First, processor frequency kept increasing until the power wall became the bottleneck; then the core count of general-purpose CPUs grew gradually from dual-core to deca-core processors (ten cores, e.g., Intel Xeon E7-2850). Even compared with ever larger memory sizes (up to TBs), in-memory big data applications with just-in-time requirements demand more cores and larger memory bandwidth. Alternatively, many-core coprocessors such as GPGPUs and the Intel Xeon Phi™ Coprocessor 5110P can provide many more simple processing cores and higher memory bandwidth, as shown in Table 1.

Table 1. Multi-core and many-core processors

Type             Cores/threads     Frequency   Memory              Memory type   Memory bandwidth
Xeon E7-8870     10/20             2.40 GHz    4096 GB (maximum)   DDR-3         30 GB/s
Xeon Phi 5110P   60/240            1.053 GHz   8 GB                GDDR5 ECC     320 GB/s
Tesla K20X       2688 CUDA cores   732 MHz     6 GB                GDDR5         249.6 GB/s

GPU memory is much smaller than CPU memory but much faster in throughput; as Table 1 shows, GPU memory bandwidth nowadays exceeds 200 GB/s. However, the limited GPU memory forces data-intensive applications to transfer large datasets from host memory to the GPU, and this data movement is costly: in real systems, a PCI-E ×16 bus achieves a throughput of around 4 GB/s, and on a moderate computer it may be around 2 GB/s, far below the bandwidth of CPU memory. So we have two choices for the GPU. One is to use it as an off-board co-processor whenever the GPU processing profit outweighs the data movement overhead; as PCI-E throughput increases, the performance profit rises automatically. The other is to focus on the locality of GPU memory, making the critical processing stage GPU memory resident and using the GPU as a key accelerator in the whole query process.

Behind large web applications such as E-business stand big data warehouses, and it is essential for the backend data warehouse to provide real-time processing to create new value. In data warehousing workloads, query cost mainly lies in two operations: sequential scans on the big fact table, which consume high memory bandwidth, and star joins between the fact table and multiple dimension tables under foreign key references. A join index[1] is an index on the join relation between tables, and the bitmap join index[2] is commonly employed in data warehouses to reduce both join cost and fact table scan cost; however, state-of-the-art in-memory analytical databases such as MonetDB[3], VectorWise[4], and SAP HANA[5] have not announced support for in-memory bitmap join indexes for analytical workloads. A bitmap join index maps dimensional attributes to fact table bitmaps, so joins can be dramatically reduced by bitmap filters. For complex queries, more dimensions involve more candidate bitmap join index processing overhead. This kind of sequential bitwise operation suits the GPU's powerful SIMD processing, and the operation outputs only a small bitmap covering a low percentage of the fact table, which reduces query cost. We set up a lightweight bitmap join index on a hybrid GPU/CPU platform: the GPU serves as the bitmap join index container and engine, GPU memory is carefully used to hold the most frequently accessed bitmaps, bitmap operations are performed with low latency by the GPU's parallel processing power, and the small bitmap filter is piped to the CPU database engine to drop most of the useless processing cost. In this paper, our contributions are twofold:

- A keyword-based GPU bitmap join index for the global database
- A cost model for OLAP query processing on the hybrid GPU/CPU platform

The key issues for the GPU bitmap join index are how to manage bitmap join indexes with different cardinalities and sizes in limited GPU memory, and how to update the index from main memory to GPU memory. We propose a cost model for the hybrid GPU/CPU platform to identify the key performance bottlenecks and to balance the overhead by distributing processing stages across the different storage and processing platforms. Related work is presented in Sec. 2. The GPU bitmap join index is discussed in Sec. 3. Sec. 4 shows the experimental results. Finally, Sec. 5 summarizes the paper.
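As a preview of the flavor of such a cost model (the notation below is ours and merely illustrative, not the exact model), the cost of off-loading one processing stage to the GPU can be sketched as

$$T_{GPU} = \frac{S_{in}}{B_{PCIe}} + T_{kernel} + \frac{S_{out}}{B_{PCIe}},$$

where $S_{in}$ and $S_{out}$ are the bytes shipped to and from the GPU and $B_{PCIe}$ is the effective PCI-E throughput (the 2–4 GB/s quoted above). With GPU memory bandwidth exceeding 200 GB/s, $T_{kernel}$ tends to be small against the transfer terms, so a GPU-resident index that drives $S_{in}$ toward zero and shrinks $S_{out}$ to a single bitmap filter is the natural design point.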

2 Related work

2.1 OLAP query processing

Multiple joins, especially hash joins[6,7,8] for equi-joins between primary–foreign key pairs, seriously affect OLAP performance. On CPU architectures, cache locality is essential for join performance. Radix-Decluster[9] was proposed to improve the cache hit ratio through cache-conscious radix partitioning; it is widely accepted in databases and is also employed in GDB[10,11] as a GPU hash join solution. However, recent research found that although partition-based hash joins seem well suited to multicore parallel processing, in practice they are not[12]. In the OLAP scenario, a star join requires multiple passes of data partitioning and parallel join operations, and materializing the join stream between two parallel joins is a space- and time-consuming workload. Furthermore, partitioning needs large swap memory and more logical control, which is not suitable for the GPU. So, for star join processing on the GPU, a simple one-pass, scan-based star join is the best candidate. Invisible join[13] scans each foreign key column while probing the dimension hash tables to generate per-dimension join result bitmaps, and then performs a bitwise AND on these bitmaps to produce the star join result bitmap. DDTA-JOIN[14] uses the surrogate key as the offset address into dimension columns of the star schema and can avoid materializing fact table bitmaps for the star join. It is a bitmap- and vector-based star join algorithm whose one-pass, scan-based OLAP makes it suitable for multicore/many-core processing; it has also been shown to outperform hash join based OLAP engines. As in-memory analytical databases grow larger, the foreign keys of a big fact table are themselves big data for the limited GPU memory, and moving foreign keys from host memory to GPU memory to perform the foreign key join[15] becomes the main bottleneck. DMA (Direct Memory Access) and asynchronous bulk transfers can help somewhat, but the bandwidth between the GPU and host main memory still restricts sending foreign key columns for efficient GPU processing.
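To make the one-pass, scan-based pattern concrete, here is a minimal host-side sketch in the spirit of invisible join (our illustrative code, not the implementation of [13]): each foreign key column is scanned once against a dense predicate vector indexed by surrogate key, and the per-dimension bitmaps are then AND-ed into the star join filter.

```cuda
// Host-side sketch of an invisible-join style star join (illustrative only).
// pred[k] == 1 iff the dimension row with surrogate key k satisfies the
// query's predicates on that dimension.
#include <cstddef>
#include <cstdint>
#include <vector>

using Bitmap = std::vector<uint64_t>;  // 1 bit per fact table row

static void scan_foreign_key(const uint32_t* fk, size_t n_rows,
                             const std::vector<uint8_t>& pred, Bitmap& out) {
  for (size_t i = 0; i < n_rows; ++i)
    if (pred[fk[i]])                    // probe by surrogate key (offset)
      out[i >> 6] |= 1ULL << (i & 63);  // set bit i
}

Bitmap star_join_filter(const std::vector<const uint32_t*>& fks, size_t n_rows,
                        const std::vector<std::vector<uint8_t>>& preds) {
  const size_t words = (n_rows + 63) / 64;
  Bitmap result(words, ~0ULL);
  for (size_t d = 0; d < fks.size(); ++d) {
    Bitmap b(words, 0);
    scan_foreign_key(fks[d], n_rows, preds[d], b);
    for (size_t w = 0; w < words; ++w) result[w] &= b[w];  // bitwise AND
  }
  return result;  // set bits mark fact rows surviving the star join
}
```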

2.2 Bitmap join index

Bitmap join indexes can dramatically reduce the scan cost on the big fact table and avoid useless joins between dimension tables and the fact table. A bitmap join index is organized as a B+-tree on a low-cardinality attribute, whose leaf nodes are bitmaps representing join positions in the table to be joined. Bitmap join indexes can be created on a single column, multiple columns, columns from different tables, and even columns without direct join relations. They prefer low-cardinality columns for space efficiency, but low cardinality also means the selectivity of each member is higher than for a high-cardinality column; an index is always a trade-off between space and performance. [16] proposed a bitmap join index selection strategy based on cost models for bitmap join index storage, maintenance, data access, and joins without indexes. The size of a bitmap join index can be computed accurately from the cardinality of the indexed column and the number of fact table rows, and the selectivity of each bitmap accurately characterizes data access cost.

Selecting the attributes to index is the key issue. Many data mining algorithms[17,18,19] have been applied to choose better candidate attributes. A fine-granularity bitmap join index mechanism is essential for improving both space efficiency and index effectiveness. Because the frequency of indexed members may vary across workloads, a dynamic index update mechanism is required for a space-constrained bitmap join index, especially for a GPU-memory-resident bitmap join index of limited size. Existing research has usually focused on developing GPU-based general-purpose indexes, such as B+-tree and hash indexes; the GPU bitmap join index is still unexplored.
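As a toy illustration of what frequency-driven selection involves (our own sketch under simplified assumptions, not one of the mining algorithms of [17,18,19]), an index advisor can count keyword hits in the query log and keep only the TOP K bitmaps cached:

```cuda
#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

// Toy TOP-K selector: count keyword hits from the query log and return the
// K keywords whose bitmaps deserve a cache slot (GPU or RAM).
std::vector<std::string> top_k_keywords(
    const std::vector<std::string>& keyword_log, size_t k) {
  std::unordered_map<std::string, size_t> freq;
  for (const auto& kw : keyword_log) ++freq[kw];
  std::vector<std::pair<std::string, size_t>> items(freq.begin(), freq.end());
  const size_t n = std::min(k, items.size());
  std::partial_sort(items.begin(), items.begin() + n, items.end(),
                    [](const auto& a, const auto& b) { return a.second > b.second; });
  items.resize(n);
  std::vector<std::string> out;
  for (auto& it : items) out.push_back(std::move(it.first));
  return out;
}
```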

3 Hybrid GPU/CPU Bitmap Join Index

3.1 Architecture of the hierarchical bitmap join index

The bitmap join index is efficient for OLAP workloads: predicates on dimensions or hierarchies in a query can be mapped to bitmaps that identify the join result without real join operations. But a bitmap join index is space consuming when the indexed column has a large cardinality, since each member of the indexed column produces one bitmap whose length equals the number of rows of the table to be joined. For the join example D⋈F, let Di be the column to be indexed, |Di| the cardinality of Di, and N the number of rows of table F; then the bitmap join index size S can be calculated as S = |Di| × N / 8 bytes. OLAP is based on the multidimensional model, and queries commonly involve many dimension columns in different dimension tables, so bitmap join indexes must cover a large range of predicate candidates and need a large storage space. For in-memory OLAP workloads, memory is more expensive than disk, and raw data should stay memory resident as much as possible to maximize memory utilization. Bitmap join indexes must therefore be space constrained under a specified memory quota: only m specified bitmaps, drawn from the different disk-resident bitmap join indexes, can be memory resident. Moreover, bitwise operations on large bitmaps are still costly in a memory-resident scenario (hundreds of ms), as opposed to a disk-resident scenario, whereas a GPU is far faster at bitmap processing (hundreds of μs).

Compared with the existing move-data, process, move-results style of join processing, our bitmap join index is GPU resident and processed locally, minimizing data movement and eliminating the vast majority of tuple joins on the big fact table (the lowest selectivity is 0.0000762% in the Star Schema Benchmark). We therefore use the GPU as the bitmap join index processor to improve star join performance in a hybrid GPU/CPU platform, and this hybrid platform adapts to today's low PCI-E throughput.

We use three storage hierarchies for bitmap join indexes: GPU memory, main memory (RAM), and disk. Creating more bitmap join indexes improves query processing performance but consumes more space for the large set of candidate bitmaps. According to the power law, only a small part of all candidate bitmaps is frequently accessed. We therefore propose the hierarchical bitmap join index mechanism for the hybrid GPU/CPU platform shown in Figure 1.
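As a quick sanity check of S = |Di| × N / 8 (the row count and cardinalities below are assumptions for illustration, roughly the scale of an SSB fact table, not figures from the paper):

```cuda
#include <cstdio>
#include <cstdint>

// S = |Di| * N / 8 bytes: one N-bit bitmap per distinct member of Di.
uint64_t bitmap_index_bytes(uint64_t cardinality, uint64_t fact_rows) {
  return cardinality * ((fact_rows + 7) / 8);
}

int main() {
  const uint64_t N = 60000000ULL;  // assumed fact table rows (illustrative)
  // A 5-member region column needs ~37.5 MB; a 25-member nation column ~187.5 MB.
  printf("region: %llu MB\n", (unsigned long long)(bitmap_index_bytes(5, N) / 1000000));
  printf("nation: %llu MB\n", (unsigned long long)(bitmap_index_bytes(25, N) / 1000000));
  return 0;
}
```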

[Figure 1 depicts the three-level architecture: an L1 bitmap join index cache in GPU memory, an L2 bitmap join index cache in RAM, and L3 disk-resident bitmap join indexes, with example keyword bitmaps such as C_region_America, C_nation_China, S_region_America, S_region_Asia, P_mfgr_MFGR#1, P_mfgr_MFGR#2, and D_year_1997, each tagged with its cache level (G for GPU, C for CPU).]

Fig. 1. Hybrid GPU/CPU Bitmap Join Index Architecture

Bitmap join indexes are managed as a 3-level cache. Traditional bitmap join indexes are created on disk to support a large set of candidate join bitmaps; these disk-resident bitmap join indexes serve as the L3 bitmap join index cache, with bitmaps organized at column granularity. By analyzing keyword logs from query statements, we choose the TOP K most frequent bitmaps from the different bitmap join indexes as the global bitmap join index. These K bitmaps are further divided into two sets: one memory resident, the other GPU memory resident. The GPU-memory-resident bitmap join index is the L1 bitmap join index cache, with the highest processing power but limited memory size; the memory-resident bitmap join index is the L2 bitmap join index cache, with a specified memory quota and relatively lower bitmap processing performance. This hierarchical bitmap join index mechanism includes the following modules (a sketch of the last module follows the list):

- Keyword log derived from query statements to record keyword frequency
- TOP K keyword mining for GPU/CPU bitmap join index candidates
- Bitmap join index updating mechanism on the hybrid GPU/CPU platform
- Bitmap storage and compression
- Bitwise operations on the hybrid GPU/CPU platform
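A minimal sketch of the bitwise-operation module (our illustrative CUDA code, not the system's implementation): the kernel ANDs k GPU-resident keyword bitmaps into the global bitmap filter, and only that small filter, rather than foreign key columns, crosses the PCI-E bus back to the host.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// AND k GPU-resident bitmaps (each `words` 64-bit words long) into `out`.
// `bitmaps` is a device array holding k device pointers.
__global__ void and_bitmaps(const uint64_t* const* bitmaps, int k,
                            size_t words, uint64_t* out) {
  size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
  if (i >= words) return;
  uint64_t v = ~0ULL;
  for (int b = 0; b < k; ++b) v &= bitmaps[b][i];
  out[i] = v;
}

// Host side: build the global bitmap filter on the GPU and ship only this
// small, fact-table-length bit vector to the CPU database engine.
void build_filter(const uint64_t* const* d_bitmaps, int k, size_t words,
                  uint64_t* d_out, uint64_t* h_filter) {
  const int threads = 256;
  const int blocks = (int)((words + threads - 1) / threads);
  and_bitmaps<<<blocks, threads>>>(d_bitmaps, k, words, d_out);
  cudaMemcpy(h_filter, d_out, words * sizeof(uint64_t), cudaMemcpyDeviceToHost);
}
```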

3.2 Keyword log and TOP K keyword mining

In a typical query, the WHERE clause comprises two types of conditions: equi-join conditions and predicates. Predicates commonly consist of an attribute name and a predicate expression, as in the following query:

SELECT d_year, c_nation, SUM(lo_revenue - lo_supplycost) AS profit
FROM date, customer, supplier, part, lineorder
WHERE lo_custkey = c_custkey
  AND lo_suppkey = s_suppkey
  AND lo_partkey = p_partkey
  AND lo_orderdate = d_datekey
  AND c_region = 'AMERICA'
  AND s_region = 'AMERICA'
  AND (p_mfgr = 'MFGR#1' OR p_mfgr = 'MFGR#2')
GROUP BY d_year, c_nation
ORDER BY d_year, c_nation

Keywords are extracted from the WHERE clause, excluding equi-join clauses, no matter whether they come from the fact table or the dimension tables and regardless of which attributes they belong to. Keywords are organized with uniform names, for example (a normalization sketch follows the list):

- d_year=1993 ➪ d_year_1993
- lo_discount between 1 and 3 ➪ lo_discount_between_1_3
- lo_quantity < 25 ➪ lo_quantity_
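A minimal sketch of this normalization (our illustrative code; the paper does not publish its parser) might look like:

```cuda
#include <sstream>
#include <string>

// Normalize a non-join predicate such as "d_year = 1993" or
// "lo_discount between 1 and 3" into a uniform keyword name, e.g.
// "d_year_1993" and "lo_discount_between_1_3".
std::string normalize_predicate(const std::string& pred) {
  std::string s = pred;
  for (char& c : s)  // treat comparison operators and quotes as separators
    if (c == '=' || c == '<' || c == '>' || c == '\'') c = ' ';
  std::istringstream in(s);
  std::string tok, out;
  while (in >> tok) {
    if (tok == "and" || tok == "AND") continue;  // drop BETWEEN's connective
    if (!out.empty()) out += '_';
    out += tok;
  }
  return out;
}
```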
