FPGA 2009 Poster Session 1: Processors & CAD Tools

Soft Vector Processors vs FPGA Custom Hardware: Measuring and Reducing the Gap

Peter Yiannacouras, J. Gregory Steffan, Jonathan Rose, University of Toronto, Canada Contact Author’s Email: [email protected]

Soft processors are often used in FPGA-based systems because of their ease of use, but for a given computation there is a significant gap in area/performance between a C code implementation executing on a soft processor and a custom FPGA hardware implementation. Recent research has demonstrated that soft processors augmented with support for vector instructions provide significant improvements in performance and scalability for data-parallel workloads. In this work, using an FPGA platform equipped with DDR memory executing data-parallel benchmarks from the industry-standard EEMBC suite, we measure the area/performance gaps between (i) C programs executing on a scalar soft processor, (ii) hand-vectorized programs executing on a soft vector processor, and (iii) custom FPGA hardware. We demonstrate that the wall-clock performance gap between scalar-executed C and custom hardware can be drastically reduced using our improved soft vector processors, even though they are still clocked 3x slower than custom hardware. We identify loop overhead, data delivery, and exact resource usage as three key advantages of custom hardware, which we mitigate in our soft vector processor by decoupling pipelines, tuning the cache design and supporting prefetching, and automatically eliminating unused instructions and datapath width, respectively. We show that together these improvements increase performance by 3x and reduce the area of the fastest soft vector processor by 2x, significantly reducing the need for designers to resort to more challenging custom hardware implementations.

ACM Categories & Descriptors: C.1.3 Other Architecture Styles~ Adaptable architectures General Terms: Design; Performance; Measurement Keywords: Vector; FPGA; Processor; Adaptable; VIRAM; Data parallel; SIMD; EEMBC; Soft

Data Streaming and SIMD Support for the MicroBlaze Architecture

Paul E. Marks, Cameron D. Patterson, Virginia Tech, USA Contact Author’s Email: [email protected]

The MicroBlaze architecture is a configurable soft-core processor by Xilinx intended for use in embedded FPGA applications. Some embedded data processing algorithms benefit from the use of vectorized code and streaming I/O. Thus, configurable extensions that add these to the MicroBlaze architecture are developed. We discuss the current architecture, the necessary modifications, and the methods employed, and analyze the theoretical gains. We then study the full synthesis of an example hardware platform with and without the extensions, comparing the expected performance of each. Following that, we discuss the work toward integrating with Xilinx's development tools, which allows simple and rapid deployment of the extended architecture. Modifications to the software toolchain are discussed as they pertain to extending the instruction set. Finally, we conclude with an assessment of the developed extensions and future work that could make them even more useful for reconfigurable, high-performance embedded applications.

ACM Categories & Descriptors: C.1.2 Multiple Data Stream Architectures (Multiprocessors)~ Single-instruction-stream, multiple-data-stream processors (SIMD); C.3 SPECIAL-PURPOSE AND APPLICATION-BASED SYSTEMS~ Real-time and embedded systems General Terms: Algorithms; Design; Performance Keywords: Reconfigurability; Vector units; Streaming coprocessors
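Both abstracts above rely on packing data-parallel work into wide vector (SIMD) operations. A minimal Python sketch of the underlying strip-mining idea, using an illustrative saxpy kernel and a hypothetical vector length of 8 lanes (neither is taken from the papers):

```python
def saxpy_scalar(a, x, y):
    # Scalar soft processor: one result per iteration, so the loop
    # overhead (increment, compare, branch) is paid n times.
    out = []
    for i in range(len(x)):
        out.append(a * x[i] + y[i])
    return out

VL = 8  # hypothetical number of vector lanes

def saxpy_vector(a, x, y):
    # Soft vector processor: the loop is strip-mined so that each
    # "vector instruction" fills VL lanes at once; loop overhead is
    # paid only ceil(n / VL) times.
    out = []
    for base in range(0, len(x), VL):
        lanes = range(base, min(base + VL, len(x)))
        out.extend(a * x[i] + y[i] for i in lanes)  # one vector op
    return out

x, y = list(range(20)), [1.0] * 20
assert saxpy_vector(2.0, x, y) == saxpy_scalar(2.0, x, y)
```

Amortizing loop overhead across VL lanes per instruction is one of the custom-hardware advantages the first abstract explicitly targets.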

Copyright is held by the author/owner(s). FPGA’09, February 22–24, 2009, Monterey, California, USA. ACM 978-1-60558-410-2/09/02.


Small Scale Multiprocessor Soft IP (SSM IP): Single FPGA Chip Area and Performance Evaluation

Xinyu Li, Omar Hammami, ENSTA ParisTech, France Contact Author’s Email: [email protected]

Future-generation multiprocessor systems on chip (MPSoC) will be based on hundreds of processors connected through networks on chip. One of the challenges is to achieve the design productivity required to reach this goal. We propose a NoC-based small scale multiprocessor IP (SSM IP) as a building block for large-scale multiprocessors. The architecture of the small scale multiprocessor is based on a 2x2 mesh with 3 Xilinx MicroBlaze processors and 2 SRAM on-chip memories per switch. The network-on-chip topology is a mesh, chosen for its scalability and easy extensibility. A clustered design has been preferred over a full mesh in order to fully exploit the data locality of image and multimedia processing applications. The small scale multiprocessor has been implemented by targeting the largest Xilinx Virtex-4 FPGA chip, the FX140. The design was realized using the Xilinx tools (EDK, ISE) with the Xilinx library of IPs. The objective of the implementation was to design a multiprocessor of sufficient scale to be significant while leaving some chip area and resources for design space exploration. Images can be distributed equally among the shared memories of each cluster so that processors belonging to a cluster can operate on the image portion associated with that cluster. Architectural variations among 4 selected architectures demonstrate the area savings and performance potential of soft IP. In addition, reasonable synthesis and place-and-route execution times and the achieved target frequencies justify the design effort.

ACM Categories & Descriptors: C.3 Real-time and embedded systems General Terms: Algorithms; Design Keywords: FPGA; Multiprocessor; Network on chip

Customizable Bit-width in an OpenMP-based Circuit Design Tool

Timothy F. Beatty, Eric E. Aubanel, Kenneth B. Kent, University of New Brunswick, Canada Contact Author’s Email: [email protected]

As transistor density grows, increasingly complex hardware designs are implemented. In order to manage this complexity, hardware design can be performed at a higher level of abstraction. High-level synthesis enables the automatic conversion of algorithms into hardware implementations, abstracting the underlying complexities of hardware away from the designer. A number of high-level synthesis tools have recently been developed, including an OpenMP to Handel-C translator. Improvements to the translator, including a new compiler directive allowing customizable register width, are described. Using a set of benchmark tests, the OpenMP to Handel-C translator is evaluated on several criteria, with the goal of evaluating the effects of variable bit-width and identifying further areas for improvement.

ACM Categories & Descriptors: B.6.3 Design Aids~ Hardware description languages General Terms: Design; Performance Keywords: Handel-C; OpenMP; Hardware specification

Revisiting Bitwidth Optimizations

Jason Cong, Karthik Gururaj, Bin Liu, Chunyue Liu, Yi Zou, UCLA Computer Science Department, USA; Zhiru Zhang, Sheng Zhou, AutoESL Design Technologies, USA Contact Author’s Email: [email protected]

This paper revisits the classical bitwidth optimization problem for fixed-point designs. Our approach also starts from static analysis (range analysis and precision analysis) techniques. We first point out that AA-based precision analysis, which is widely used in many previous works, is not correctly applied in some corner cases. We propose to use interval coefficients to model the sensitivity (the mathematical form is similar to classical generalized interval arithmetic) to correct those issues. Based on this analysis, we formulate a convex optimization problem to minimize area cost. In the formulation, we also allow different bitwidths for the multiple uses of the same signal. We obtain an additional 2% to 3% area reduction compared with previous AA-based analyses and optimizations. We also conduct an optimality study to quantify the gap between current methods and the possible optimum.

ACM Categories & Descriptors: B.7.2 Design Aids General Terms: Algorithms; Design; Performance Keywords: FPGA; bitwidth; arithmetic; fixed-point
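The static range analysis that bitwidth optimization starts from can be illustrated with plain interval arithmetic. The sketch below is a simplified stand-in for the affine/interval-coefficient analyses the abstract discusses, with an illustrative (conservative) rule for counting the integer bits of a signed fixed-point signal:

```python
import math

class Interval:
    """Closed interval [lo, hi], propagated through arithmetic ops."""

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other):
        p = [self.lo * other.lo, self.lo * other.hi,
             self.hi * other.lo, self.hi * other.hi]
        return Interval(min(p), max(p))

def int_bits(iv):
    # Conservative count of integer bits (sign bit included) needed by
    # a signed fixed-point signal whose value stays inside [lo, hi].
    m = max(abs(iv.lo), abs(iv.hi))
    return 1 + max(1, math.ceil(math.log2(m + 1)))

a = Interval(-4, 4)     # e.g. an input known to lie in [-4, 4]
b = Interval(0, 3)
r = a * b + a           # range of the expression a*b + a
assert (r.lo, r.hi) == (-16, 16)
assert int_bits(r) == 6  # [-16, 16] fits in 6 signed integer bits
```

Plain intervals ignore correlation between repeated operands (e.g. they give a*a over [-4, 4] the range [-16, 16] instead of the true [0, 16]); affine arithmetic tracks such correlations, and the interval-coefficient sensitivity model the abstract proposes refines AA further to fix its corner cases.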


A Clustering Framework for Task Partitioning Based on Function-level Data Usage Analysis

S. Arash Ostadzadeh, Roel J. Meeuws, Kamana Sigdel, Koen Bertels, Delft University of Technology, Netherlands Contact Author’s Email: [email protected]

Recently, reconfigurable computing has received a great deal of attention due to its ability to increase application performance through hardware execution while retaining the flexibility of a software solution. One of the major requirements for such systems is to identify which applications, or parts of an application, should be implemented as software and which should be mapped onto reconfigurable devices. Grouping the tasks within an application can strengthen coarse-grained partitioning of the application, which can eventually improve the performance of the system. In this work, we introduce a clustering framework along with a flexible multipurpose clustering algorithm that performs task clustering at the function level based on dynamic profiling information. The clustering framework can be used as the basic step to adjust the granularity of tasks in the hardware/software partitioning and scheduling phases. As a result, an elaborate mapping onto the system resources and possibly a higher degree of task parallelism can be obtained. In an initial attempt, the framework addresses two primary objectives: creating workload-balanced and loosely-coupled clusters. The experimental results show that the clustering complies with the desired metrics defined through these objectives.

ACM Categories & Descriptors: C.3 SPECIAL-PURPOSE AND APPLICATION-BASED SYSTEMS~ Real-time and embedded systems General Terms: Algorithms; Design; Performance Keywords: Reconfigurable Computing; Heterogeneous Multiprocessors; Task Clustering; Hardware/Software Co-design

An Intermediate Hardware Model with Load/Store Unit for C to FPGA

Akira Yamawaki, Masahiko Iwane, Kyushu Institute of Technology, Japan Contact Author’s Email: [email protected]

We propose the semi-programmable hardware (SPHW) as an intermediate hardware model that can be used by designers and by high-level synthesis tools converting C programs to FPGAs. The SPHW consists of a load/store unit (LSU), a reconfigurable register file (RRF), and an execution unit (EXU). The LSU, executing load/store instructions, transfers data between the memory and the RRF. The hardware designed to accelerate the computation is implemented on the EXU, a reconfigurable hardware unit that processes the data on the RRF. The LSU flexibly performs complex memory accesses and buffering under program control, so that the EXU can uniformly process sequential data on the RRF. Since the EXU runs in parallel with the LSU, memory access can be overlapped with data processing. In addition, the SPHW provides a synchronization mechanism that supports execution of multiple hardware threads on the EXU. Using the SPHW, C programs can easily be converted into hardware modules with data prefetching mechanisms. An experiment is performed using several application programs that exhibit different memory access patterns. Compared with attaching a custom data prefetching circuit in place of the LSU, the SPHW significantly reduces design cost while achieving comparable performance.

ACM Categories & Descriptors: B.5.2 Design Aids General Terms: Verification; Design; Performance Keywords: Memory; Design Method; Buffering; FPGA

N-port Memory Mapping for LUT-based FPGAs

Zuo Wang, High Performance Embedded Computation Lab, P.R. China; Feng Shi, Qi Zuo, Weixing Ji, Mengxiao Liu, Beijing Institute of Technology, P.R. China Contact Author’s Email: [email protected]

As current FPGAs grow in logic capacity, they are widely used to implement entire systems. In some specific applications, such as our embedded multi-core processor TriBA [1], user memory models are not limited to single-port or dual-port. Thus, we need a cost-effective way to realize N-port memory on FPGAs, since most commercial products do not provide N-port physical arrays. In this paper, we propose a hierarchical N-port memory architecture for LUT-based FPGAs. The principle of this architecture is to create a two-level memory hierarchy formed from different resources. We map the memory resources inside LUTs as 1-port memory banks and interleave these banks to create an N-port L1 memory. We also interleave physical dual-port arrays to build an N-port L2 memory. We provide data transfer between the L1 and L2 memories and assume that such transfers are managed under software control, as in the strategy used by scratchpad memories (SPM). Compared to the L1 memory, the L2 memory has an advantage in cost but also several disadvantages, such as longer access time and higher conflict probability. If most accesses are served by the L1 memory portion, the hierarchical memory architecture achieves both goals of cost and access time. We implement this architecture on Xilinx Virtex-II chips to measure its cost, and use memory traces collected from a multi-core simulator to measure its average access time. The product of cost and average access time shows that the hierarchical memory architecture is a cost-effective way to realize N-port memory on FPGAs.

ACM Categories & Descriptors: B.7.1 Types and Design Styles~ Gate arrays; B.6.3 Verification, VHDL General Terms: Design Keywords: FPGA; N-port memory; Logical-to-physical mapping; Hierarchy
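The bank-interleaving principle behind the N-port L1 memory above can be sketched behaviorally: addresses are spread word-wise across single-port banks, and a group of simultaneous accesses is serialized by its most-contended bank. The model below is an illustration of the idea, not the paper's hardware design:

```python
from collections import Counter

class InterleavedMemory:
    """N single-port banks, word-interleaved: address a lives in bank
    a % n_banks, at offset a // n_banks.  A group of simultaneous
    accesses finishes in one cycle only if no two hit the same bank."""

    def __init__(self, n_banks, bank_words):
        self.n = n_banks
        self.banks = [[0] * bank_words for _ in range(n_banks)]

    def _locate(self, addr):
        return self.banks[addr % self.n], addr // self.n

    def write(self, addr, value):
        bank, off = self._locate(addr)
        bank[off] = value

    def read(self, addr):
        bank, off = self._locate(addr)
        return bank[off]

    def cycles(self, addrs):
        # Conflicting accesses are serialized bank by bank, so the
        # group takes as many cycles as the most-contended bank.
        return max(Counter(a % self.n for a in addrs).values())

mem = InterleavedMemory(n_banks=4, bank_words=16)
for a in range(8):
    mem.write(a, a * 10)
assert mem.read(5) == 50
assert mem.cycles([0, 1, 2, 3]) == 1   # distinct banks: one cycle
assert mem.cycles([0, 4, 8, 12]) == 4  # all map to bank 0: serialized
```

The `cycles` figure is what the abstract calls conflict probability: interleaving buys N-port behavior only when concurrent accesses land in distinct banks, which is why serving most accesses from L1 matters.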


Parallel Placement for FPGAs Revisited

Cristinel Ababei, North Dakota State University, USA Contact Author’s Email: [email protected]

The runtime of classic sequential placement algorithms for FPGAs continues to represent a serious problem, aggravated by the continuous increase in FPGA capacity. The traditional way to parallelize the placement step is to use parallel distributed implementations run on a network of processors. This approach can suffer from significant communication and synchronization runtime overheads. To address this, we propose the use of multithreading for parallelization. The top-level placement problem is decomposed into region-based placement sub-problems using four-way min-cut partitioning. These sub-problems are then processed in parallel by worker threads. The final solution, constructed from the results of all sub-problems, is further improved using a fast low-temperature annealing refinement step. Using this technique, we parallelize the simulated-annealing-based placement algorithm of VPR. The new parallel placement algorithm achieves an average speed-up of 2.5x using four threads, while the wirelength after placement and the circuit delay after routing increase on average by only 3.7% and 2.15%, respectively.

ACM Categories & Descriptors: D.1.3 Concurrent Programming~ Parallel programming; G.1.0 General~ Parallel algorithms General Terms: Algorithms Keywords: Parallel simulated annealing; FPGA placement; Multithreading
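The decomposition the abstract describes — partition the placement, anneal each region in a worker thread, then refine the merged result at low temperature — can be sketched as follows. The 1-D placement model, net sizes, and annealing schedule are illustrative choices, not taken from the paper (and CPython threads model the structure rather than deliver real speedup, because of the GIL):

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor

def wirelength(order, nets):
    # Half-perimeter wirelength of a 1-D placement (cell -> slot index).
    pos = {c: i for i, c in enumerate(order)}
    return sum(max(pos[c] for c in net) - min(pos[c] for c in net)
               for net in nets)

def anneal(order, nets, temp, seed):
    # Pairwise-swap simulated annealing; returns the best solution seen.
    rng = random.Random(seed)
    cur = list(order)
    cost = wirelength(cur, nets)
    best, best_cost = list(cur), cost
    while temp > 0.05:
        for _ in range(200):
            i, j = rng.randrange(len(cur)), rng.randrange(len(cur))
            cur[i], cur[j] = cur[j], cur[i]
            new = wirelength(cur, nets)
            if new <= cost or rng.random() < math.exp((cost - new) / temp):
                cost = new
                if cost < best_cost:
                    best, best_cost = list(cur), cost
            else:
                cur[i], cur[j] = cur[j], cur[i]  # reject: undo the swap
        temp *= 0.9
    return best, best_cost

cells = list(range(16))
rng = random.Random(1)
nets = [rng.sample(cells, 3) for _ in range(24)]

# Decompose into regions (a simple split stands in for min-cut
# partitioning) and anneal each region in its own worker thread,
# using only the nets fully contained in that region.
regions = [cells[:8], cells[8:]]
local = [[n for n in nets if set(n) <= set(r)] for r in regions]
with ThreadPoolExecutor(max_workers=2) as pool:
    placed = list(pool.map(lambda rl: anneal(rl[0], rl[1], 2.0, 7)[0],
                           zip(regions, local)))

# Merge the sub-placements, then refine the full solution with a fast
# low-temperature annealing pass over all nets.
merged = placed[0] + placed[1]
final, cost = anneal(merged, nets, 0.5, 7)
assert sorted(final) == cells and cost <= wirelength(merged, nets)
```

Nets that cross region boundaries are invisible to the workers, which is exactly why the final low-temperature pass over the full netlist is needed; the abstract's 3.7% wirelength degradation is the price of that decomposition.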
