Exploiting Reconfigurable FPGA for Parallel Query Processing in Computation Intensive Data Mining Applications

Kelvin T. Leung, Professor Milos Ercegovac, Professor Richard R. Muntz
{kelvin,milos,muntz}@cs.ucla.edu
MICRO Proj#97-126
Computer Science Department, University of California, Los Angeles, Los Angeles, CA 90024-1596

ABSTRACT

This work concentrates on exploiting a reconfigurable, SRAM-based Field Programmable Gate Array (FPGA) coprocessor for query processing in computation-intensive data mining applications. Complex, computation-intensive data mining applications in geoscientific and medical information systems environments often require support for extensibility and parallel processing to deliver the necessary functionality and high performance. Emerging FPGA technology represents a promising hybrid hardware/software (HW/SW) co-design approach [20,21] to augment traditional query processing techniques through the efficient use of reconfigurable, task-specific hardware kernels together with the host processor(s). In this work, we study the properties and characteristics of an FPGA coprocessor system, VCCEVC, which uses a Xilinx 4020E part and a slave interface for data acquisition. We present a simple HW/SW cost model for parallel query processing, and we study the HW/SW partitioning problem by examining the NASA Cyclone-Tracking Data Mining application. We identify the most computation-intensive routines in this application, which runs on our extensible parallel geoscientific query-processing environment, Conquest. Finally, we discuss the applicability of the hybrid approach for the given application based on our results.

1. INTRODUCTION

As data becomes more complex and abstract (e.g., multimedia and geoscientific data types [1]) and storage costs decrease rapidly, it is a natural consequence that data processing and manipulation will become the dominant factor in query processing. Data I/O [7] most likely will not be the primary bottleneck in future high-performance query processing. General Purpose Processors (GPPs) do not provide enough computation power for processing and manipulating such complex data [17,22,23]. The gap between specialized hardware (ASICs) and GPPs has been widening, as exemplified by the appearance of specialized PC co-processing peripherals such as graphics/video accelerator cards. Specialization often provides tremendous gains in performance. Coprocessors such as SRAM-based FPGAs bridge these two extremes of computer architecture (ASICs and GPPs) and provide not only the flexibility of binding hardware functionality dynamically but also a significant performance gain in certain applications [22-28].

Figure 1: Cyclone Tracking in Global Climate Change Study

"Adaptability", "extensibility", "scalability", and "high query throughput" are the key factors for future high-performance query execution [1,2,4-6]. It is our intention to explore the emerging FPGA technology in the domain of query processing for computation-intensive data mining applications, in the hope of achieving high query throughput.

2. BACKGROUND

A tremendous amount of raw spatio-temporal data is generated as a result of various observations, experiments, and model simulations. For example, NASA EOS expects to produce over 1 TByte of raw data and scientific data products per day by the year 2000, and a 100-year UCLA AGCM simulation [3] running at a resolution of 1° × 1.25° with 57 levels generates approximately 30 TBytes of data when the model's output is written to the database every 12 simulated hours. In geoscientific studies, a scientist often wants to extract interesting geoscientific phenomena that are not directly observable in the raw dataset. Figure 1 depicts a set of cyclone tracks, the trajectories traveled by low-pressure areas over time, which can be extracted from a sea-level pressure dataset by linking observed areas of local pressure minima at successive time steps. We cast such geoscientific studies as a "database processing problem": a study (query) can be expressed in terms of at least one Query Execution Plan (QEP) [8,9], a logical tree-structured collection of operations that computes the final result.
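The following minimal, self-contained sketch illustrates such a QEP for the cyclone-tracking study. The operator names and the greedy linking heuristic are hypothetical illustrations, not Conquest's actual API: a leaf operator finds local pressure minima in each time step's grid, and a parent operator links nearby minima across successive time steps into track segments.

```python
# Illustrative sketch (not Conquest's actual operators): a QEP as a tree of
# operators extracting cyclone-track segments from a sea-level pressure dataset.

class FindLocalMinima:
    """Leaf operator: positions of strict local minima in each 2-D pressure grid."""

    def __init__(self, fields):
        self.fields = fields                      # one grid (list of rows) per time step

    def evaluate(self):
        per_step = []
        for field in self.fields:
            rows, cols = len(field), len(field[0])
            minima = []
            for r in range(rows):
                for c in range(cols):
                    neighbors = [field[r + dr][c + dc]
                                 for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                                 if (dr or dc)
                                 and 0 <= r + dr < rows and 0 <= c + dc < cols]
                    if all(field[r][c] < v for v in neighbors):
                        minima.append((r, c))
            per_step.append(minima)
        return per_step


class LinkTracks:
    """Inner operator: greedily link minima at successive time steps into trajectories."""

    def __init__(self, child, max_dist=2):
        self.child, self.max_dist = child, max_dist

    def evaluate(self):
        minima_per_step = self.child.evaluate()
        tracks = [[m] for m in minima_per_step[0]] if minima_per_step else []
        for step in minima_per_step[1:]:
            for track in tracks:
                r, c = track[-1]
                near = [(abs(r - r2) + abs(c - c2), (r2, c2)) for (r2, c2) in step]
                near = [cand for cand in near if cand[0] <= self.max_dist]
                if near:
                    track.append(min(near)[1])    # extend with the closest minimum
        return tracks


# Toy dataset: two time steps of a 3x3 sea-level pressure grid.
fields = [[[10, 9, 10], [9, 5, 9], [10, 9, 10]],   # minimum at (1, 1)
          [[10, 9, 10], [10, 9, 6], [10, 10, 9]]]  # minimum at (1, 2)
plan = LinkTracks(FindLocalMinima(fields))          # the QEP as an operator tree
print(plan.evaluate())                              # [[(1, 1), (1, 2)]]
```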

2.1 Reconfigurable Computing for Data Mining

2.1.1 SRAM-based FPGA coprocessors

In general, an SRAM-based FPGA is a kind of reconfigurable coprocessor that possesses the following properties:

1. Configurable computational resources [15,16].
2. Software-controlled hardware [13,14].
3. Low-level granularity parallelism [18,19,29].
4. High computation throughput for streaming data [24-28].
5. High pipelining efficiency [17,29].
6. Low inter-process communication costs.

2.1.2 HW/SW cost model for parallel query processing

The availability of SRAM-based FPGA coprocessors creates a new design space and a new optimization problem for high-performance parallel query processing. The problem can be cast as a HW/SW partitioning problem, in which the computation tree of a computation-intensive routine is partitioned between FPGA coprocessor(s) and the host processor(s) for parallel execution in order to obtain higher computation throughput. The applications that have been reported to perform well on FPGAs are stream-oriented, signal-processing-style applications [22,23]: a relatively small algorithm is applied to large, regular blocks of data. The data moves through the logic and the results are returned to users as a stream, without waiting for the completion of the entire task. The requirement for a relatively small algorithm means that the algorithm must be small enough to fit into the FPGA given its area and interconnect wiring constraints. The large, regular blocks of data imply potentially long execution times and opportunities for parallelization. In other words, the ratio between the total execution time of a computation and the overhead due to communication must be large enough to yield a high performance gain (10x-100x). Note that the overhead time is the maximum of $T_{conf}$ and $T_{I/O}$, where $T_{conf}$ is the FPGA configuration time and $T_{I/O}$ is the data I/O time. Thus, we have

$$\frac{T_{exe}}{T_{overhead}} \gg 1$$

The above equation implies that it is not desirable for the FPGA reconfiguration time or the data I/O time to dominate the total execution time on the FPGA. To simplify our study, we assume that there is only one host processor and one FPGA coprocessor in the system. Given a computation tree, we may want to divide the tree into sub-trees for pipelined execution between the host and the coprocessor, where the branch of the sub-tree running in the FPGA satisfies the above constraint. Hence, a portion of the tree would be executed on the host processor and the rest would be executed on the FPGA coprocessor in a pipelined fashion, and the partial results would be merged at the host to produce the final result. For a given computation, in order to justify pipelining the FPGA coprocessor and the host processor to achieve higher computation throughput, the following equation must be observed:

$T_{HW/SW}$
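As a concrete illustration of the cost-model constraint $T_{exe}/T_{overhead} \gg 1$, the following minimal sketch (not from the paper; the function name, threshold, and time estimates are hypothetical) tests whether a QEP sub-tree is a reasonable candidate for FPGA execution: it qualifies only when its estimated execution time dominates the overhead $\max(T_{conf}, T_{I/O})$.

```python
# Illustrative sketch (not the paper's implementation): decide whether a QEP
# sub-tree should be offloaded to the FPGA coprocessor, using the cost model
# T_overhead = max(T_conf, T_I/O) and the constraint T_exe / T_overhead >> 1.
def offload_to_fpga(t_exe, t_conf, t_io, min_ratio=10.0):
    """t_exe      -- estimated execution time of the sub-tree on the FPGA
       t_conf     -- FPGA (re)configuration time
       t_io       -- time to stream the sub-tree's input/output data
       min_ratio  -- illustrative threshold standing in for ">> 1"
    """
    t_overhead = max(t_conf, t_io)          # overhead is whichever cost dominates
    return t_exe / t_overhead >= min_ratio

# Example: a long-running kernel easily amortizes a 50 ms reconfiguration,
# while a short one does not.
print(offload_to_fpga(t_exe=5.0, t_conf=0.05, t_io=0.4))   # True
print(offload_to_fpga(t_exe=0.1, t_conf=0.05, t_io=0.4))   # False
```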