International Journal of Information Technology

Vol. 22 No. 1 2016

GPU SQL Query Accelerator

Keh Kok Yong, Hong Hoe Ong
Accelerative Technology Lab, MIMOS Berhad, Kuala Lumpur, Malaysia
[email protected], [email protected]

Vooi Voon Yap
Department of Electronic Engineering, Universiti Tunku Abdul Rahman, Perak, Malaysia
[email protected]

Abstract

The world is rapidly filling with connected sensors and devices that report their geo-location. Data analytics industries are finding ways to store this data and to turn the raw records into valuable information for business intelligence services. The result is a flood of granular data about our world, and crucially, this flood has outpaced traditional compute capabilities to process and analyze it. This gap reveals potential economic benefits and has become a compelling research area requiring sophisticated mechanisms and technologies to meet the demand.

Over the past decade, there have been attempts to use accelerators alongside multicore CPUs to boost large-scale data computation. We propose an SQL-like query accelerator, Mi-Galactica. In addition, we extend the system by offloading geo-spatial computation to GPU devices. Query operations execute in parallel, drawing support from high-performance and energy-efficient NVIDIA Tesla technology. Our results show significant speedups.

Keywords: Geospatial, Graphics Processing Units, Database Query Processing, Big Data, Cloud


I. Introduction

The world is rapidly filling with connected sensors and devices with geo-location capabilities that continually report their location. Such location-aware data is referred to as a spatial dataset. Gartner reported that Cisco had projected 50 billion connected objects, and the Digital Universe study by EMC estimated that 44 trillion gigabytes of data will be collected by the year 2020 [1].

Data analytics industries are finding ways to store spatial datasets and to turn this raw data into valuable information for business intelligence services. The value of spatial datasets is already evident: DataSift uses collected social media data to predict consumer actions, and Facebook uses an accumulation of 350 million daily photo uploads for deep learning in image recognition [2]. Importantly, speedy computation with appealing visualization is crucial to success. This reveals potential economic benefits and has become a compelling research area requiring sophisticated mechanisms and technologies to meet the demand.

A Graphics Processing Unit (GPU) is not only used for optimizing image filtering and video processing; it is also widely adopted to accelerate big data analytics for scientific, engineering, and enterprise applications. Jern uses the GPU to accelerate texture-based geographic mapping, exploiting its rendering performance [3]. Lai uses two recent many-core technologies, Kepler GPUs and MIC, to accelerate geospatial applications; the parallel implementation shows massive speedup and strong scalability in a cluster [4]. Various recent studies have shown that the GPU delivers unprecedented acceleration of applications by offloading compute-intensive tasks [5] [6] [7]. Ultra-fast analytics applications are crucial to driving business success through quick and accurate decisions mined from massive data.

Over the past decade, GPUs have taken the lead in high performance computing. Their evolution from fixed-function parallel processing components to fully programmable, powerful co-processors working alongside CPUs has allowed cheaper, more energy-efficient, and faster supercomputers to be built. Fan uses a cluster of GPUs with 30 worker nodes to develop a parallel flow simulation using the lattice Boltzmann model (LBM) [8].

Titan was the first major supercomputing system to utilize a hybrid architecture, with both 16-core AMD Opteron CPUs and NVIDIA Tesla K20 GPU accelerators, for scientific computation such as climate change simulation, nuclear energy modelling, nanoscale analysis of materials, and other disciplines. The top-ranked energy-efficient supercomputers in the world, TSUBAME-KFC, Wilkes, and HA-PACS, use NVIDIA's Kepler GPUs along with high-speed network communication devices such as InfiniBand. These facilities have allowed computational scientists and researchers to address the world's most challenging computational problems up to several orders of magnitude faster.

However, there are studies pointing out that using the GPU as a general-purpose computing device has limitations [9] [10]. The fundamental problem of data transfer between CPUs and GPUs is a cause of tremendous concern to the accelerator community. The ultra-high-speed computation provided by the GPU may not be able to compensate for the I/O latency experienced at the PCIe bus. Moreover, offloading may turn out to be even more expensive if the parallel computation is not complex enough, so that more time is spent transferring data to and from the GPU than on computation. Despite this shortcoming, various empirical studies and experiments have shown that the GPU is highly energy efficient and has contributed to significant performance breakthroughs across the computing industry.

To exploit current GPU computing capabilities for database operations, we have to take the hardware characteristics into consideration when executing a parallel algorithm. The main processor, the CPU, also needs to direct the main workflow. We propose and implement a GPU query accelerator called Mi-Galactica using CUDA, and benchmark its performance on the NVIDIA Tesla Kepler architecture against standard PostgreSQL and various distributed Hadoop systems. The detailed requirements for this accelerator and for parallel query processing in our work are:

• Partitioning data into fine-grained chunks for parallel processing and reduced I/O access.

• Applying compression and decompression mechanisms to speed up data I/O operations over the PCI-e transfer.

• Maximizing the use of single instruction, multiple data execution to optimize the degree of parallelism for query operations.

• Performance of the GPU implementation should yield a significant acceleration over one based solely on the CPU.
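The first requirement, partitioning columns into fine-grained chunks, can be sketched as follows. This is an illustrative sketch, not Mi-Galactica's actual code; the function name and chunk size are assumptions for the example.

```python
def partition_column(values, chunk_size):
    """Split one column of values into fixed-size segments so each
    segment fits in CPU/GPU memory and can be processed independently."""
    return [values[i:i + chunk_size]
            for i in range(0, len(values), chunk_size)]

# Example: a 10-element column split into segments of 4 elements;
# each segment can be dispatched to a separate worker or GPU transfer.
column = list(range(10))
segments = partition_column(column, 4)
# segments -> [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

The segment size trades off parallelism against per-transfer overhead: smaller segments allow more overlap of I/O and compute, while larger segments amortize the PCI-e transfer cost.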

The paper is organized as follows. Section 2 provides a review of related work on database accelerators. Section 3 briefly discusses parallel CUDA programming on the GPU and the architecture of the latest NVIDIA Tesla technology. Section 4 presents the implementation of the proposed GPU query accelerator with ESRI GIS software. Section 5 briefly discusses experiments and performance results. Finally, Section 6 concludes and discusses future work.

II. Database Accelerator

Database systems are extremely important to a wide array of industries. There have been tremendous changes in the hardware technologies used to accelerate database operations. Well-known emerging technologies such as the GPU and the FPGA (Field-Programmable Gate Array) have driven an evolution in parallelism, compilation, and I/O reduction, producing far more efficient systems. Govindaraju presented several common query operations on millions of records stored in a database using an NVIDIA GeForce FX5900 [11], showing that GPUs are an effective co-processor for performing database operations. Mueller used FPGAs to accelerate data processing [12]; this work opened up interesting opportunities for heterogeneous many-core implementations. In addition, these hardware accelerators offer significant benefits in power consumption.

Recent works on FPGA query accelerators have attempted various approaches to parallelize data processing. Glacier [13] implements a set of streaming operators composed into digital circuits; it has a library of compositional hardware modules, and each circuit executes a specific query. Woods [14] presents an FPGA framework for static complex event detection. This line of research looks to transfer more complex data processing to FPGAs as a means to enhance classical CPU-based architectures. Netezza [15], [16] provides a pipeline consisting of DMA, Decompress, Project, and Restrict computing engines. It reduces the amount of data accessed by performing projection and restriction using data from previously requested tables, and it hides slow I/O transfer latency through compression and decompression of data. However, the FPGA raises immense challenges for developers: it is generally more complicated and difficult to implement and debug. Thus, it has not been able to gain a large foothold in the market.

A large body of research has investigated accelerating database systems using NVIDIA GPUs with CUDA programming. Bakkum¹ implemented a GPU query-accelerated database called the Virginian database [17]. It is based on SQLite and implements a subset of commands that are executed directly on the GPU. It also uses GPU-mapped memory with efficient caching, so it can compute on data larger than the GPU's physical memory. CoGaDB² is a new column-oriented, GPU-accelerated database management system. It designs a co-processing scheme for GPU memory that caches the working set of data, and it minimizes the performance penalty by representing data as a list of tuple identifiers rather than the complete data, reducing transport between CPU and GPU [18]. Heimel's³ approach [19] integrates GPU-assisted query optimization for real-valued range queries, based on kernel density estimation, into PostgreSQL. It uses OpenCL because the open standard allows it to be easily ported to other accelerator devices. PG-Strom⁴ develops a Foreign Data Wrapper module for PostgreSQL and offloads sequential scan operations over massive data to the GPU. It also takes advantage of the GPU's massively parallel computation capabilities to perform numerical calculation. Todd and Sam built a Massively Parallel Database (MapD⁵) to handle big data analytics for an almost boundless number of interactive socio-economic queries. It powers a geospatial visualization tool that can probe and inspect more than a billion tweets worldwide. This has set a new emerging trend for database management systems.

¹ https://github.com/bakks/virginian
² http://wwwiti.cs.uni-magdeburg.de/iti_db/research/gpu/cogadb/
³ https://bitbucket.org/mheimel/gpukde/
⁴ https://wiki.postgresql.org/wiki/PGStrom
⁵ http://www.map-d.com/


A plethora of research into query-related parallel algorithms has cultivated the development of GPU databases. Red Fox enables relational operators to be executed on the GPU in parallel [20]. [21], [22] investigate GPU acceleration of indexing, scan, and search operations. [23] examine the important computational building blocks of aggregation. [24] focus on optimizing GPU sort. These studies have significantly raised awareness of using GPUs in big data analytics businesses. It is our belief that the GPU can be beneficial for query processing and widely deployable in big data analytics for database systems. Such a GPU query accelerator has to be carefully designed around parallel data structures, harmonizing the processing between CPU and GPU.

III. Graphics Processing Unit

In this section, we first discuss the background of GPUs and introduce NVIDIA's Kepler architecture. Next, we describe how threads and blocks work in the Kepler architecture. Finally, we discuss its memory hierarchy.

A. Background

GPUs first gained popularity with the rise of 3D gaming in the mid-1990s. The demand for ever more powerful and energy-efficient GPUs has been increasing ever since. The growing computational power of GPUs has attracted many researchers to use them for more general-purpose computing. NVIDIA realized the potential of GPUs for general computing and released CUDA (Compute Unified Device Architecture) in 2006 so that the research community could leverage the power of the large number of streaming processors in GPUs. GPUs nowadays power a wide range of industries, from supercomputers to embedded systems.


The latest NVIDIA GPU architecture is Maxwell, introduced in Q3 2015; these new cards focus on the deep learning sector. This paper, however, is based on the Kepler architecture, which includes many improvements over its predecessor, Fermi. With this architecture, a single GPU die can contain up to 2880 CUDA cores. The Kepler architecture also introduced new features such as Dynamic Parallelism, Hyper-Q, the Grid Management Unit, and NVIDIA GPUDirect. It contains enhanced memory subsystems offering additional caching capabilities, more bandwidth at each level of the hierarchy, and a fully redesigned, substantially faster DRAM I/O implementation. The principal design goal of the Kepler architecture, improved power efficiency, has been met with these new features.

B. Grids, Blocks, Threads and Warps

The CUDA programming model introduces the concepts of threads, blocks, and grids, which run GPU code called kernels. These threads, blocks, and grids run on multiple SMXs (streaming multiprocessors) in the GPU in groups of warps. Figure 1 shows examples of threads, blocks, and grids. From a programmer's perspective, one only needs to handle thread, block, and grid assignment and kernel programming, while the hardware manages how all the threads, blocks, and grids are mapped onto SMXs and warps. In CUDA, all threads in the same grid execute the same kernel function, but each thread mostly handles different data. This style of programming model is known as Single Instruction, Multiple Data (SIMD). With the Kepler architecture, a block can consist of up to 1024 threads arranged across its x, y, and z dimensions, and the maximum number of blocks in the x dimension of a grid can go up to 2^32 - 1. Previously, on the Fermi architecture, once a kernel was launched its dimensions could not be changed. In the Kepler architecture, the programmer is allowed to launch another set of grids and blocks from within a kernel, which enables a more flexible programming model. This feature is called Dynamic Parallelism.


A warp is the unit of thread scheduling in the SMXs. Once a block is assigned to an SMX, it is divided into warps; each warp contains up to 32 threads in the Kepler architecture. Each thread in a warp runs in parallel, executing the same line of code. To increase the efficiency of the warps, we should avoid branch divergence as much as possible. Branch divergence occurs when threads inside a warp branch into different execution paths.

Figure 1: Threads, Blocks and Grids

Figure 2: Hierarchy of GPU memory (registers; shared memory, L1 cache and read-only data cache; L2 cache; DRAM)

C. Memory Hierarchy

There are four levels of memory hierarchy in NVIDIA GPUs, as shown in Figure 2. The first level is register memory. Registers are local to the CUDA cores and have a total size of 64KB; they are the fastest memory of all the memory types in the SMX. The second level consists of Shared Memory, the L1 cache, and the read-only data cache. These memories are located very near the SMX cores and are shared among the 192 CUDA cores of the SMX. Shared Memory is usually used for communication among the threads of a block. The third level is the L2 cache, and the fourth level is the DRAM, which serves as the main storage on the GPU and is used to send and read data in bulk from the CPU's memory.

In the Fermi architecture, Shared Memory and L1 cache can be configured as 48 KB of Shared Memory with 16 KB of L1 cache, or vice versa. The Kepler architecture allows additional flexibility by permitting a 32KB / 32KB split between Shared Memory and L1 cache. The read-only data cache is also new in the Kepler architecture. Previously, programmers would use the texture unit to cache loads, but this method had many limitations. The benefit of the read-only data cache is that it uses a separate load path whose footprint is off the Shared/L1 cache memory, and it supports full-speed unaligned memory access patterns.

IV. Implementation

A. Overview of Mi-Galactica

Mi-Galactica is an SQL-like query accelerator. Four major components form the system: Connector, Preprocessor, Scheduler, and Query Engine, as shown in Figure 3. The Connector enables Mi-Galactica to communicate with PostgreSQL and MySQL; it performs front-end application interaction, data extraction, and data interchange, and it also supports processing of comma-separated values (CSV) files. The Scheduler is an internal task engine that manages user workloads. The Query Engine carries out query analysis by performing basic parsing and positioning operations, and then produces an execution query plan. The plan is further adjusted by analyzing and tracing parallelizable points and rearranging the execution order of clause objects. The Mi-Galactica execution engine then performs the accelerated query execution on either the CPU or the GPU. Source data in the database must be transformed and output to a parallel, columnar, accessible storage system. These components are designed to run on energy-efficient commodity GPU accelerators, empowering the system to tackle big data challenges.

Mi-Galactica adopts effective techniques from previous studies [17], [18], [19], [20] and [21] on query co-processing of heterogeneous workloads. Figure 4 shows the architectural design for coupled CPU-GPU architectures. It is designed to support plug-ins for acceleration components, enabling customization. This eases the addition of new features, improves developer productivity, and reduces the size of the application. Plug-in functionality is implemented with shared libraries installed in a location prescribed by Mi-Galactica.


Figure 3: Mi-Galactica Four Major Components

Figure 5: GPU Columnar File System

Figure 4: Mi-Galactica Architecture

Source data in the database requires a preprocessing stage. It converts the data into a parallelizable file structure, the GPU FS (File System), and stores it in a column-based orientation, as shown in Figure 5. Thus, data can be accessed independently, computed in parallel, and processed with maximal CPU multithreading. Each column is segmented into multiple files, and the size of each segment is customizable, so the CPU and GPU have a sufficient amount of memory to compute larger datasets. Furthermore, this allows the GPU to process each column in a segment independently. Nevertheless, a change to the data in the database does not automatically trigger an update of the preprocessed data; it must be re-created, or complemented when only new data is added.

The CudaSet is a parallel file structure that improves parallel geo-spatial processing jobs on the GPU. It is not a legacy array-of-structures (AOS) design, which loses bandwidth and wastes L2 cache memory. Mi-Galactica uses the CudaSet representation to arrange data in a structure-of-arrays (SOA) access pattern, which gains high throughput by coalescing memory accesses on the GPU; this is critical for memory-bound kernel functions. The required elements of the structure can be loaded individually, with no data interleaving, as shown in Figure 6. Thus, it achieves high global memory performance.


Figure 6: SOA CudaSet Structure
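To illustrate the difference between the two layouts, the following sketch contrasts an AOS arrangement of points with the SOA arrangement that CudaSet favors. This is an illustrative sketch only; the field names are assumptions, not Mi-Galactica's actual record format.

```python
# AOS: one record per point; reading only the x values must skip over y and id.
aos = [{"x": 1.0, "y": 2.0, "id": 7},
       {"x": 3.0, "y": 4.0, "id": 8},
       {"x": 5.0, "y": 6.0, "id": 9}]

# SOA: one contiguous array per field; a kernel that needs only x reads a
# dense array, so adjacent GPU threads touch adjacent addresses (coalesced).
soa = {
    "x":  [p["x"] for p in aos],
    "y":  [p["y"] for p in aos],
    "id": [p["id"] for p in aos],
}
# soa["x"] -> [1.0, 3.0, 5.0]
```

With SOA, a query touching only one column streams a single dense array across the PCI-e bus and through global memory, which is exactly the access pattern memory-bound kernels need.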

The overhead of data transfer is an important factor: it is the bottleneck in fetching data for GPU computation. Mi-Galactica uses compression to alleviate this performance issue. It compresses the data to a smaller size, which reduces I/O when offloading the task to the GPU, restructuring data processing with a co-processor scheme suited to the given database architecture. Two compression schemes are implemented on the GPU. The first applies to the integer data type and is based on Zukowski's PFOR-Delta [25]. Instead of actual values, only the differences between subsequent values are stored; a bit-packing mechanism then further optimizes this by using just enough bits to store each element.
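A minimal sketch of the delta-plus-bit-packing idea behind PFOR-Delta [25] is shown below. Real PFOR-Delta also handles exception values for outliers; this simplified version, with assumed function names, only delta-encodes a sorted column and computes the packed bit width.

```python
def delta_encode(values):
    """Store the first value, then differences between subsequent values."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def bits_needed(deltas):
    """Smallest bit width that can hold every (non-negative) delta."""
    return max(d.bit_length() for d in deltas) or 1

timestamps = [1000, 1003, 1005, 1010, 1012]
deltas = delta_encode(timestamps)
# deltas -> [1000, 3, 2, 5, 2]: after the first value, each tail delta
# packs into 3 bits instead of the 10 bits the raw values would need.
```

The win compounds on the GPU side: smaller payloads cross the PCI-e bus faster, and decompression is a cheap, embarrassingly parallel prefix-sum over the deltas.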

The second, a string compression scheme, applies to character or text data types; it is based on the Lempel–Ziv (LZ77) compression algorithm [26] with dynamic representation and expression matching. It provides fine-grained parallel redundancy encoding and decoding of data with a flexible representation. The key to its efficiency is fast retrieval from the compressed data on the CPU, with the lookup process then offloaded to the GPU.

The Query Engine comprises both CPU and GPU phases. The CPU phases are in charge of parsing clauses into objects: they identify the required data sources, translate the operations into low-level instruction sets, then arrange the execution sequence and dispatch it for execution. A SQL query parser is implemented using a combination of Bison and Flex. There can be both CPU- and GPU-related workloads; the CPU initializes GPU contexts, prepares input data, launches GPU kernel functions, materializes query results, and controls the steps of query progress. The GPU phases execute specific optimized kernel functions, mostly aggregate and compute-intensive functions, processing the data on thousands of cores at once. The GPU-accelerated operations include select, sort, projection, joins, and basic aggregation. The engine utilizes a mixture of our in-house accelerated parallel processing library, Mi-AccLib⁶, and open source libraries such as NVIDIA Thrust⁷ and CUDPP⁸.

The Scheduler is responsible for managing received queries. Queue processing is implemented across a pool of worker threads on the CPU, controlling the concurrency level and the intensity of resource contention on the CPU. A resource monitor collects the current usage status of the GPU devices, and the Scheduler uses the collected information to assign tasks to available GPUs. At this stage, the CPU performs an important role in concurrent queueing: data can safely be added by one thread and joined or removed by another without corrupting the data.
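The thread-safe work queue across a pool of worker threads can be sketched with Python's standard library. This is an illustrative sketch of the pattern only, not Mi-Galactica's Boost-based C++ implementation; squaring a number stands in for dispatching a query to a GPU.

```python
import queue
import threading

def worker(tasks, results):
    """Each worker pulls queries off the shared queue and records a result."""
    while True:
        task = tasks.get()
        if task is None:              # sentinel: no more work for this worker
            tasks.task_done()
            break
        results.append(task * task)   # stand-in for executing a query on a GPU
        tasks.task_done()

tasks = queue.Queue()
results = []
threads = [threading.Thread(target=worker, args=(tasks, results))
           for _ in range(4)]
for t in threads:
    t.start()
for q in range(8):                    # one thread safely enqueues work...
    tasks.put(q)
for _ in threads:                     # ...then signals each worker to stop
    tasks.put(None)
tasks.join()                          # blocks until every task is processed
```

The queue serializes hand-off between producer and consumers, so items are added by one thread and removed by others without corruption, matching the concurrent-queueing role the text assigns to the CPU.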

In addition, it maintains an optimal concurrency level and workload on the GPUs. A data-swapping mechanism maximizes the effective utilization of GPU device memory. Through these processes, resource utilization improves. This implementation uses a mixture of APIs (Application Program Interfaces) from the Boost⁹ libraries.

Mi-Galactica optimizes concurrency through a pipelining mechanism that overlaps data transfer over the PCI-e bus with arithmetic computation. CUDA streams can execute asynchronously: work is queued and control returns to the CPU immediately. A pinned-memory mechanism is adopted in certain query implementations; it uses the Direct Memory Access (DMA) engine, which achieves a higher percentage of peak bandwidth. Hyper-Q in the Kepler architecture supports multiple CPU threads launching work on the GPU simultaneously, thereby greatly raising GPU utilization and reducing CPU idle time.

⁶ MIMOS accelerated library consisting of high-speed multi-algorithm search engines for text processing, a data security engine, and video analytics engines, http://atl.mimos.my/
⁷ Thrust provides a flexible, high-level interface for GPU programming that greatly enhances developer productivity, https://developer.nvidia.com/thrust
⁸ CUDPP is the CUDA Data Parallel Primitives Library, http://cudpp.github.io/
⁹ Boost is a set of libraries for the C++ programming language that provide support for tasks and structures, http://www.boost.org/

B. Interacting Mi-Galactica with ESRI ArcGIS

There are two ways for Mi-Galactica to interact with ESRI Geographic Information System software: web-based GIS, and geodatabase management applications such as ArcGIS Desktop, FileGDB, and ArcGIS JavaScript. Both methods adopt a standard database system to view, store, query, and analyze the contained geo datasets.

Users choose a database such as Oracle, PostgreSQL, Microsoft SQL Server, and others. ArcGIS transforms the geographical computation into a set of SQL statements, then channels them to Mi-Galactica for parallel computation. Once completed, the result set returns to the ArcGIS application, which renders the map visualization. Our in-house ODBC database connector diverts the SQL operations to Mi-Galactica.

V. Experiment Results

In this section, we report our experimental results and analysis. We focus on execution time, comparing Mi-Galactica utilizing the GPU accelerator against a CPU-based system, Apache Spark, one of the fastest big data processing engines on the market at the moment. We measure the execution period, adding a fixed number of data records for each test run; the dataset ranges from 1 to 20 million rows of records. In addition, we compare the data preprocessing operation on both systems. A heat map is generated from location data at two-hour intervals for every day of the month.

A. Hardware and Operating System

We performed the experiments on a single NVIDIA Tesla K20c GPU computational device, which has 2496 CUDA cores and 5GB of GDDR5 RAM and was launched in 2013. The workstation is an HP Z800 containing dual-socket Intel Xeon X5680 CPUs with a total of 12 cores at a 3.33GHz clock rate, 32GB DDR3 RAM, a 1TB hard disk, and an ATI FireMV 2260 as the display device. For software, it runs Microsoft Windows 7 (64-bit), Spark version 1.3.0, and CUDA 7 (Release Candidate).


B. Experiment 1: Data Preprocessing

The computation time of data preprocessing is tested with various data sizes. The raw input data is stored in CSV format and needs to be processed and imported into the corresponding data warehouse systems: for Spark, it is converted into Parquet format, a columnar storage layout; for Mi-Galactica, it is stored in a GPU-accessible columnar format. Figure 7 shows the execution time of data preprocessing; the timing includes data transfer between the hard disk, CPU, and GPU memory. Data preprocessing is not a compute-intensive operation, but it requires high I/O data transfer between CPU and GPU memory. In addition, Spark has minimal startup overhead when processing the raw CSV files without data compression enabled. As observed in Figure 7, the processing speed of Spark eventually overtakes Mi-Galactica; Spark benefits from optimizations that utilize CPU multi-threading features to prepare these CSV files.

Figure 7: Result of Data Preprocessing

C. Experiment 2

A series of processes produces a heat map, which represents the geographic density of features on a map; the colored areas represent points that make up layers with a large number of features. Producing and visualizing the map requires certain toolboxes in ArcGIS, such as the Density toolset and the Spatial Analyst & Statistics toolboxes. These toolboxes generate the SQL statements and pass them to the database system to execute. Figure 8 shows a sample SQL statement for the heat map.


Figure 8: Sample of Heat Map SQL statement

The speedup is calculated as (Spark execution time) / (Mi-Galactica execution time). The results show that Mi-Galactica outperformed Spark, with speedups between 4x and 18x when executing the SQL statement, as shown in Figure 9. As observed, the speedup decreases as the number of rows increases. This is due to the handling of I/O (Input/Output) movement between CPU and GPU memory without an efficient streaming mechanism during this stage; moreover, the computation is not complex enough to maximize GPU resource utilization, so gains are lost to transferring data between CPU and GPU. Nevertheless, Mi-Galactica's performance is still good enough to provide timely visualization of geo-location data. Figure 10 shows a snapshot of the final visualization output of the heat map.
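The speedup metric above is simply a ratio of wall-clock times; a small helper makes the computation explicit. The timing values here are invented for illustration and are not the paper's measurements.

```python
def speedup(spark_seconds, migalactica_seconds):
    """Speedup = (Spark execution time) / (Mi-Galactica execution time)."""
    return spark_seconds / migalactica_seconds

# Hypothetical timings: if Spark takes 36 s and the GPU accelerator 2 s,
# the speedup is 18x, the upper end of the range reported above.
print(speedup(36.0, 2.0))  # -> 18.0
```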

Figure 9: Heat map query execution

Figure 10: Visualization of Heat Map


VI. Conclusion

We have presented Mi-Galactica, a GPU query accelerator that assists geo-spatial data computation via the SQL statements generated by ESRI software. It applies to any data analytics operation expressed as an SQL statement, data cleansing being one example application. The results have shown that a GPU-based solution is a viable alternative for Big Data processing. In addition, our GPU query accelerator approach facilitates seamless integration with other front-end applications via a database connector. It allows users to exploit the power of the GPU by distributing work efficiently across GPU cores with regard to I/O access and compression. In the future, we will extend the system to run in a distributed computing environment with multi-node processing to support bigger datasets. Furthermore, Mi-Galactica strives toward full support of the SQL standard with parallel accelerated query processing.

VII. References

[1] C. McLellan, "The internet of things and big data: Unlocking the power," ZDNet, Mar. 2015.
[2] C. Smith, "Social Media's New Big Data Frontiers -- Artificial Intelligence, Deep Learning, And Predictive Marketing," Business Insider Australia, 2014.
[3] M. Jern, T. Astrom, and S. Johansson, "GeoAnalytics Tools Applied to Large Geospatial Datasets," in Information Visualisation, 2008 (IV '08), 12th International Conference, 2008, pp. 362–372.
[4] C. Lai, M. Huang, X. Shi, and H. You, "Accelerating Geospatial Applications on Hybrid Architectures," in High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (HPCC_EUC), 2013, pp. 1545–1552.
[5] J. Zhang and S. You, "CudaGIS: Report on the Design and Realization of a Massive Data Parallel GIS on GPUs," in Proceedings of the Third ACM SIGSPATIAL International Workshop on GeoStreaming, 2012, pp. 101–108.
[6] C. H. Nadungodage, Y. Xia, J. J. Lee, M. Lee, and C. S. Park, "GPU accelerated item-based collaborative filtering for big-data applications," in Big Data, 2013 IEEE International Conference on, 2013, pp. 175–180.
[7] S. K. Prasad, M. McDermott, S. Puri, D. Shah, D. Aghajarian, S. Shekhar, and X. Zhou, "A Vision for GPU-accelerated Parallel Computation on Geo-spatial Datasets," SIGSPATIAL Special, vol. 6, no. 3, pp. 19–26, Apr. 2015.
[8] Z. Fan, F. Qiu, A. Kaufman, and S. Yoakum-Stover, "GPU Cluster for High Performance Computing," in Proceedings of the 2004 ACM/IEEE Conference on Supercomputing, 2004, p. 47.
[9] L. A. S. Gomes, B. S. Neves, and L. B. Pinho, "Empirical Analysis of Multicore CPU and GPU-Based Parallel Solutions to Sustain Throughput Needed by Scalable Proxy Servers for Protected Videos," in Computer Systems (WSCAD-SSC), 2012 13th Symposium on, 2012, pp. 49–56.
[10] K. E. Niemeyer and C.-J. Sung, "Recent progress and challenges in exploiting graphics processors in computational fluid dynamics," J. Supercomput., vol. 67, pp. 528–564, 2014.
[11] N. K. Govindaraju, B. Lloyd, W. Wang, M. Lin, and D. Manocha, "Fast Computation of Database Operations Using Graphics Processors," ACM, New York, NY, USA, 2005.
[12] R. Mueller, J. Teubner, and G. Alonso, "Data Processing on FPGAs," Proc. VLDB Endow., vol. 2, no. 1, pp. 910–921, Aug. 2009.
[13] R. Mueller, J. Teubner, and G. Alonso, "Glacier: A Query-to-hardware Compiler," ACM, New York, NY, USA, pp. 1159–1162, 2010.
[14] L. Woods and G. Alonso, "Fast data analytics with FPGAs," pp. 296–299, Apr. 2011.
[15] F. D. Hinshaw, J. K. Metzger, and B. M. Zane, "Optimized database appliance," 2011.
[16] P. Francisco, "The Netezza Data Appliance Architecture: A Platform for High Performance Data Warehousing and Analytics," 2011.
[17] P. Bakkum and K. Skadron, "Accelerating SQL database operations on a GPU with CUDA," in Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, 2010, pp. 94–103.
[18] S. Breß and G. Saake, "Why It is Time for a HyPE: A Hybrid Query Processing Engine for Efficient GPU Coprocessing in DBMS," Proc. VLDB Endow., vol. 6, no. 12, pp. 1398–1403, Aug. 2013.
[19] M. Heimel and V. Markl, "A First Step Towards GPU-assisted Query Optimization," Proc. VLDB Endow., pp. 33–44, 2012.
[20] H. Wu, G. Diamos, S. Baxter, M. Garland, T. Sheard, M. Aref, and S. Yalamanchili, "Red Fox: An Execution Environment for Relational Query Processing on GPUs," in Proceedings of the International Symposium on Code Generation and Optimization (CGO), 2014, pp. 44:44–44:54.
[21] F. Beier, T. Kilias, and K.-U. Sattler, "GiST Scan Acceleration Using Coprocessors," ACM, New York, NY, USA, pp. 63–69, 2012.
[22] K. K. Yong and E. K. Karuppiah, "Hash match on GPU," in IEEE International Conference on Open Systems, ICOS 2013, 2013, pp. 150–155.
[23] T. Lauer, A. Datta, Z. Khadikov, and C. Anselm, "Exploring Graphics Processing Units As Parallel Coprocessors for Online Aggregation," ACM, New York, NY, USA, pp. 77–84, 2010.
[24] D. Cederman and P. Tsigas, "GPU-Quicksort: A Practical Quicksort Algorithm for Graphics Processors," J. Exp. Algorithmics, vol. 14, pp. 4:1.4–4:1.24, Jan. 2010.
[25] M. Zukowski, S. Heman, N. Nes, and P. Boncz, "Super-Scalar RAM-CPU Cache Compression," p. 59, Apr. 2006.
[26] J. Ziv and A. Lempel, "A universal algorithm for sequential data compression," IEEE Trans. Inf. Theory, vol. 23, no. 3, pp. 337–343, May 1977.
