Very High Level Synthesis for image processing applications

Yanjing BI
Laboratory CPTC, Univ. Bourgogne Franche-Comté, F-21000 Dijon, France
[email protected]

Chao LI
Laboratory Le2I UMR6306, CNRS, Arts et Métiers, Univ. Bourgogne Franche-Comté, F-21000 Dijon, France
[email protected]

Fan YANG
Laboratory Le2I UMR6306, CNRS, Arts et Métiers, Univ. Bourgogne Franche-Comté, F-21000 Dijon, France
[email protected]

ABSTRACT

Over the past 20 years, High Level Synthesis (HLS) has made significant progress. By automating the C-to-RTL conversion, it greatly improves the R&D productivity of FPGA designs and adds to the maintainability of the products. However, due to their high complexity and computational intensity, image processing designs usually require a higher abstraction level than C-synthesis, which current HLS tools do not offer. This paper presents a Very High Level Synthesis (VHLS) method that allows fast prototyping and verification of FPGA designs in the Matlab environment. We build a heterogeneous design flow with currently-available tool kits to implement the proposed approach, and evaluate it on two real-life applications. Experimental results demonstrate that it effectively reduces design complexity and exploits the advantages of FPGAs relative to other devices.

Keywords: High-Level Synthesis, Programming Languages, FPGA, Software/Hardware Co-design, Image Processing, Embedded Systems, Design Space Exploration, Electronic Design Automation

1. INTRODUCTION

With the development and popularization of computer graphics and vision, the algorithms and architectures of today's image processing designs have become increasingly complicated to specify. Furthermore, some designs need to be carefully adjusted for different targets, even when the algorithm or the fundamental concepts stay the same. This results in a high R&D effort-cost and requires the products to be highly maintainable. Designers therefore usually carry out the development in a high-abstraction environment

in order to facilitate prototyping, verification, simulation, deployment, maintenance and updating. Meanwhile, Field Programmable Gate Arrays (FPGAs) are ever more widely used in digital signal processing designs for their significant advantages in terms of running cost, embeddability, power consumption and flexibility. The Advanced Digital Sciences Center (ADSC) of the University of Illinois at Urbana-Champaign reported that FPGAs can achieve speedups of 2-2.5× and save 84-92% of the energy consumption relative to Graphics Processing Units (GPUs) [6]. However, FPGA users must contend with a complex design flow, because configuring devices of this type requires low-abstraction register-transfer languages such as the VHSIC Hardware Description Language (VHDL) or Verilog.

Historically, complex embedded image processing designs are carried out within a Software/Hardware Co-design framework, in which the overall design discipline is partitioned into software and hardware sub-disciplines handled by different experts. Since FPGAs are mature stand-alone devices, the design flows based on them are usually horizontally partitioned as follows: first, the desired system is analyzed and partitioned; second, mathematical or software engineers prototype and functionally verify the algorithm at a user-convenient level; third, FPGA experts port the verified algorithm from the research environment to the Register-Transfer Level (RTL); finally, the generated RTLs are evaluated within a co-simulation test bench. If the evaluation results do not satisfy the design requirements, the flow returns to the second step. However, this approach cannot effectively improve the R&D productivity of FPGA designs, for three reasons: a) the two sub-disciplines have to be scheduled serially and cannot be parallelized, which lengthens the development period; b) team work multiplies the development effort (measured in persons × time); and c) some complex algorithms are time-consuming to port to RTL even for well-trained and experienced FPGA experts. According to the evaluation results of Xilinx shown in Fig. 1 [15], RTL-based FPGA designs require the longest development time among all devices available for digital signal processing. This point is corroborated by statistics published by ADSC, which indicate that a manual FPGA design may take 6-18 months, or even years for a full-custom hardware, while GPU (CUDA) based designs take only 1-2 weeks [1].

Figure 1: Design Time vs. Application Performance with RTL Design Entry (cf. [15]).

Consequently, today's electronic manufacturers are increasingly pushed by market competition to select devices with a low design effort-cost, so that in most cases designers will trade performance, power, or cost for a shorter design time. Despite their many advantages over other devices, FPGAs are thus usually applied only when no other solution can satisfy the design requirements. The abstraction-level gap between user-convenient and hardware-available languages seriously narrows the industrial application of FPGAs in the image processing field.

In our work, we attempt to build an automatic Very High Level Synthesis (VHLS) framework with the following properties: a) it handles algorithm behaviors described in very high level languages, such as Matlab or OpenCL; b) it handles code written without FPGA expertise, or even code written for platforms of other types rather than for FPGAs; c) it optimizes the performance of the designs under hardware constraints such as the frequency or area of the target device; d) it automatically generates the desired RTL implementation in a short time, rather than hours or even days; and e) it can be implemented with currently-available Electronic Design Automation (EDA) tools. The last point is important for industrial designs, because it effectively reduces R&D cost by helping to quickly build the desired Design Space Exploration (DSE) framework and avoiding additional cost for new tool kits.

To this end, we select Matlab as the user-level design environment for its advantages in terms of vector processing and powerful built-in image processing tools. Next, the challenges of Matlab-to-RTL synthesis are explored. In this exploration we find that Matlab is inherently a vector-based representation, whereas RTLs are fully scalar-based. This issue seriously constrains the benefits of an FPGA-targeted Matlab and can hardly be solved by a direct synthesis process. Therefore, we incorporate source-to-source compilation into the proposed approach in order to turn the nature of the source code from vector-oriented into scalar-oriented programming. Finally, the generated code is classically synthesized via control and data flow extraction and RTL generation.

The proposed approach is evaluated on two complex image processing algorithms: the Kubelka-Munk Genetic Algorithm (KMGA) for multi-spectral image based skin lesion assessment [10], and the Level Set Method (LSM) for very high resolution satellite image segmentation [3]. Experimental results demonstrate that the proposed approach effectively reduces the description complexity of the target algorithm relative to the reference method. Meanwhile, the generated RTL implementations outperform the reference implementations realized on other devices. The remainder of this paper is organized as follows: Section 2 presents the proposed very high level synthesis method; Section 3 evaluates our methods and analyzes the experimental results; finally, a conclusion is given in Section 4.

2. PROPOSED VERY HIGH LEVEL SYNTHESIS

This section presents the proposed VHLS method in detail. Fig. 2 displays the flowchart of the synthesis process, which consists of three steps: source-to-source compilation, code and data extraction, and RTL generation. Within each step, interdependent tasks are executed.

2.1 Source-to-Source Compile

The vector-based nature of Matlab and its powerful built-in tools raise many challenges for synthesizing it to RTL. We sum these issues up as three problems, namely the dynamic variable problem, the operation polymorphism problem and the built-in function problem, and solve them by compiling the Matlab source code into an intermediate code.

Dynamic variable problem. Since the corresponding memories can be re-allocated over and over for new contents of different lengths (types), Matlab users do not have to declare variable types at initialization or allocate the right amount of memory for a vector variable before each use: variables automatically change their lengths (types) or dimensions depending on the content to be held. But none of the currently-available register-transfer languages supports dynamic variables of this kind. Therefore, we must explicitly allocate enough storage for every variable. In the proposed approach, a manually-specified define file is required for memory allocation. Furthermore, when a variable is dynamically reused in the source code, e.g. x and X, additional definitions with new names are required, e.g. x_1 and X_1.

Operation polymorphism problem. Matlab allows operator and function polymorphism; that is, its operators and functions may accept either a matrix or a scalar as a second operand or argument, and their return types change accordingly. For the operators, the nature of each invocation has to be determined first: if scalar, the operator is mapped directly to the corresponding monomorphic operator of the target language; otherwise it is replaced with a loop construct. For function mapping, the types of all arguments must be determined as well. Depending on the invocation natures, we have to create multiple versions of the same Matlab function to cover the different invocations. A minimal sketch of both transformations follows.
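As an illustration, consider the compiled output below for a reused variable and a polymorphic element-wise product. This is a minimal sketch in C (the intermediate language assumed here); the names and the vector length are hypothetical, and the real compiler output differs in detail.

    /* Matlab source:
     *   x = 3.5;            % scalar double
     *   x = ones(1,8);      % the same name reused as a 1x8 vector
     *   c = a .* b;         % element-wise product of two 1x8 vectors
     * Compiled intermediate code (illustrative): the reused variable is
     * split into x and x_1, and the polymorphic operator .* becomes a
     * loop whose bound comes from the manually-specified define file.
     */
    #define VEC_LEN 8                 /* vector length, from the define file */

    static float x;                   /* first use: scalar                   */
    static float x_1[VEC_LEN];        /* second use: renamed vector instance */

    void compiled_body(const float a[VEC_LEN], const float b[VEC_LEN],
                       float c[VEC_LEN])
    {
        int i;
        x = 3.5f;
        for (i = 0; i < VEC_LEN; i++)
            x_1[i] = 1.0f;            /* ones(1,8)         */
        for (i = 0; i < VEC_LEN; i++)
            c[i] = a[i] * b[i];       /* scalarized a .* b */
    }

The static loop bound (VEC_LEN) is what makes the construct synthesizable, as discussed next.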

Figure 2: Flowchart of the proposed VHLS method. [The flowchart shows: behavior specification (behavior.m, define file) → source-to-source compile (variable mapping, operation mapping, built-in function compile; function library) → intermediate code → control-and-datapath extraction (control flow extraction, data flow extraction; CDFG model) → RTL generation (scheduling, allocation and binding; IP library; generation) → target RTL.]
We can see that either operator or function mapping may necessitate additional loop constructs at the present or a deeper function level. In RTL, all loop boundaries must be initialized in advance rather than set by unknown variables, because FPGAs support only static compilation. In this paper, we use the vector information explicitly defined in the define file to compute the boundaries of the loops generated for the operation polymorphism problem. Meanwhile, it should be noted that different vector dimensions may require further function versions, even for invocations of the same nature.

Built-in function problem. The built-in algorithms/functions of Matlab give users many benefits in specifying their algorithms, but they are usually invisible to a third-party compiler and are detected as undefined functions when invoked. We therefore build a new library of synthesizable code for these algorithms. For example, the Matlab function sort is re-specified in the library; when it is invoked in the source code, the corresponding routine is compiled simply by including its specification in the generated file. Since the built-in functions of Matlab are also polymorphic, we create several versions of each; depending on the argument types, the right version is bound to the invocation in the source code. A sketch of one such library version is given below.
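For illustration, one monomorphic library version of sort might look as follows. This is a minimal sketch in C, not the actual library code; the function name, the fixed length and the choice of insertion sort are our assumptions.

    /* sort_f32_16: one monomorphic, synthesizable version of Matlab's
     * sort() for an ascending sort of a fixed-length float vector.
     * Other versions (other lengths, element types or sort directions)
     * would be generated the same way, one per invocation nature.
     */
    #define SORT_LEN 16               /* fixed at compile time (define file) */

    void sort_f32_16(const float in[SORT_LEN], float out[SORT_LEN])
    {
        int i, j;
        for (i = 0; i < SORT_LEN; i++)
            out[i] = in[i];           /* copy so the input stays untouched  */
        for (i = 1; i < SORT_LEN; i++) {  /* insertion sort, static bounds  */
            float key = out[i];
            for (j = i - 1; j >= 0 && out[j] > key; j--)
                out[j + 1] = out[j];  /* shift larger elements right        */
            out[j + 1] = key;
        }
    }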

2.2 Code and Data Extraction

Since there may still be a large gap between the semantics of the pre-compiled code and the target architectures, a canonical intermediate representation is needed for data dependency analysis. As presented in Chapter 4 of [7], this issue is classically solved with a Data Flow Graph (DFG), in which all the intrinsic parallelism of the specification is exhibited easily. However, this method yields a large formal representation, with which designers usually struggle to meet the area constraints of the design. In this paper, we formally represent the pre-compiled code with a Control and Data Flow Graph (CDFG), one of the most widely accepted modeling paradigms for specifications processed by HLS tools. Compared with a pure DFG, this method leads to more efficient circuit implementations in terms of area, frequency, power, etc.
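To make the representation concrete, a CDFG node could be modeled as below. This is our own minimal sketch in C, not a structure taken from any particular HLS tool:

    /* A minimal CDFG node: data edges capture operand dependencies that
     * expose parallelism (the DFG part); control edges sequence the
     * basic blocks around branches and loops (the control part).
     */
    typedef enum { OP_ADD, OP_MUL, OP_LOAD, OP_STORE, OP_BRANCH } opcode_t;

    typedef struct cdfg_node {
        int               id;            /* unique node identifier          */
        opcode_t          op;            /* operation carried by the node   */
        struct cdfg_node *data_succ[2];  /* data-dependency successors      */
        struct cdfg_node *ctrl_succ;     /* control-flow successor, or NULL */
    } cdfg_node_t;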

In the experiments of this paper, we classically create the CDFG as a Finite State Machine (FSM) with a datapath, one of the most popular models for digital system specification at RTL [15]. The generated FSM divides the elements of the CDFG into a set of states S and control steps for synthesis. The overall transformation can be automated or user-driven [5, 8, 13], as sketched below.
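The following C fragment mimics the FSM-with-datapath model for a small multiply-accumulate kernel. It is an illustrative software analogue, under our own naming, of the hardware structure the flow generates:

    /* FSM with datapath, modeled in C: the switch is the controller
     * (next-state and output logic), the variables acc and i are the
     * datapath registers, and each case body is one control step.
     */
    typedef enum { S_IDLE, S_COMPUTE, S_DONE } state_t;

    float fsmd_mac(const float *a, const float *b, int n)
    {
        state_t state = S_IDLE;   /* state register    */
        float   acc   = 0.0f;     /* datapath register */
        int     i     = 0;        /* counter register  */

        for (;;) {
            switch (state) {
            case S_IDLE:              /* initialize the datapath registers */
                acc = 0.0f; i = 0;
                state = (n > 0) ? S_COMPUTE : S_DONE;
                break;
            case S_COMPUTE:           /* one multiply-accumulate per step  */
                acc += a[i] * b[i];
                i++;
                state = (i < n) ? S_COMPUTE : S_DONE;
                break;
            case S_DONE:
                return acc;
            }
        }
    }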

2.3 RTL Generation

This subsection presents how the desired RTL is generated, through three interdependent tasks: scheduling, allocation and binding.

Scheduling. The scheduling task assigns the operations represented in the CDFG to cycles. More precisely, for every operation, the operands must be read from storage or functional-unit components, and the results must be assigned to their destinations (another operation, storage, or a functional unit). These operations need to be scheduled within a single clock cycle or over several cycles one by one. In the experiments of this paper, an Integer Linear Programming (ILP) algorithm [11] is used for this problem; its solver is guaranteed to find an optimal schedule for the problem model (a minimal form of the model is sketched at the end of this subsection).

Allocation and binding. Allocation and binding come after scheduling. In allocation, the type and quantity of hardware resources (functional units, storage and connectivity components) are determined first, depending on the scheduled CDFG. Next, the desired hardware resources are selected from a given IP library that contains all the necessary information for every component, such as area, delay, power and other metrics used by the remaining synthesis tasks. Finally, binding is done with the following tasks: a) functional binding, which binds all arithmetic or logic operations to the functional units allocated from the IP library; b) storage binding, in which each variable that carries values across cycles is bound to a storage unit such as a register; and c) connectivity binding, which binds data transfers to connectivity units such as assignments, buses, etc. In addition, if an interconnect is shared by multiple data transfers, a multiplexer is needed between the sources and destinations. Since an IP library usually offers several alternatives for a given functional or storage unit (e.g. ripple-carry adder vs. carry-look-ahead adder, or reset-only register vs. register with both preset and reset), the selected hardware must not only provide the right function but also optimize the generated RTL, minimizing either the total resource cost or the interconnection length under the design constraints. This paper classically formulates the allocation problem as a clique-partitioning problem and solves it with Tseng and Siewiorek's algorithm [14].

Generation. Once all the decisions of the preceding tasks (scheduling, allocation and binding) have been made, we can generate the desired RTL. In this paper, the FSM diagram is mapped to a logic controller that orchestrates the data flow through control signals, e.g. selecting the right inputs for functional units, registers or multiplexers. This architecture consists of state logic, a state register and output logic. The state register stores the present state of the datapath, while the state logic computes the next state to be loaded depending on the control inputs from the external world and the state signals from the processor (datapath architecture). Finally, the corresponding control signals and control outputs are generated and exported by the output logic. On the other hand, the computation of each state is mapped to a datapath architecture consisting of a set of storage components, a set of functional units and connectivity components. The quantities and types of storage components and functional units follow the decisions made in the allocation and binding tasks, and they are then connected through the connectivity components. The interconnection behavior of the datapath architecture is described in a register-transfer-level language.
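As a reference for the scheduling task above, a minimal time-constrained form of the ILP model (in the spirit of [11]; the notation is ours, simplified by assuming single-cycle operations) can be written as:

    \begin{align*}
    \min \ & \sum_{k} c_k\, a_k
      && \text{(total cost of the allocated units)}\\
    \text{s.t.}\ & \sum_{t} x_{i,t} = 1
      && \forall i \ \text{(each operation } o_i \text{ starts exactly once)}\\
    & \sum_{i \in K_k} x_{i,t} \le a_k
      && \forall k, t \ \text{(at most } a_k \text{ units of type } k \text{ busy in step } t)\\
    & \sum_{t} t\, x_{j,t} \ge \sum_{t} t\, x_{i,t} + 1
      && \forall (o_i \rightarrow o_j) \ \text{(data dependencies respected)}
    \end{align*}

where x_{i,t} in {0,1} indicates that operation o_i is scheduled to start at control step t, K_k is the set of operations executed on unit type k, a_k is the number of allocated units of type k, and c_k is the unit cost.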

3. EXPERIMENTS

We evaluate the proposed approach on two real-life applications: the KMGA method [10] and the LSM based image segmentation algorithm [3] (referred to as KMGA and LSM in this paper). Tab. 1 describes the implementations used in these evaluation experiments. In order to obtain an unbiased comparison, the Matlab and VHLS implementations are developed from the same algorithm behavior described in Matlab, while the two C versions are implemented manually in addition. Furthermore, we deliberately select a GPU-optimized implementation for LSM because of its highly parallel nature.

First of all, we functionally verify the generated RTLs, kmga_fpga_vhls and lsm_fpga_vhls. Fig. 3 compares the fitness values of the evolution process of KMGA (lower is better). We can see that the fitness differences of kmga_fpga_vhls vs. kmga_cpu_m and kmga_fpga_vhls vs. kmga_cpu_c are only 5.77E-4 and -0.03E-4 respectively. On the other hand, the function of LSM is verified by observing the similarities between the segmentation results of the proposed and reference implementations. Tab. 2 shows the image similarities for the three images taken by IKONOS and GeoEye-1 shown in Fig. 4. The similarities are computed with the built-in Matlab function corr2().
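For reference, corr2() computes the 2-D correlation coefficient r = sum((A-Ā)(B-B̄)) / sqrt(sum((A-Ā)²) · sum((B-B̄)²)); a plain-C equivalent over flattened images (our own sketch, not part of the evaluated code) is:

    #include <math.h>

    /* 2-D correlation coefficient between two images of n_pix pixels,
     * following the definition used by Matlab's corr2(). Assumes the
     * images are non-constant (the denominator is nonzero).
     */
    double corr2_like(const double *a, const double *b, int n_pix)
    {
        double ma = 0.0, mb = 0.0, num = 0.0, da = 0.0, db = 0.0;
        int i;
        for (i = 0; i < n_pix; i++) { ma += a[i]; mb += b[i]; }
        ma /= n_pix;                      /* mean of image A */
        mb /= n_pix;                      /* mean of image B */
        for (i = 0; i < n_pix; i++) {
            num += (a[i] - ma) * (b[i] - mb);
            da  += (a[i] - ma) * (a[i] - ma);
            db  += (b[i] - mb) * (b[i] - mb);
        }
        return num / sqrt(da * db);       /* r = 1 for identical images */
    }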

Figure 3: Fitness values of KMGA implementations.

Table 2: Segmentation result similarities of the LSM implementations: proposed implementation vs. reference implementations.

Image        lsm_cpu_m   lsm_cpu-gpu_m   lsm_cpu_c
Uxmal        0.9025      0.9025          0.9362
Volcano      0.9838      0.9838          0.9896
Ice sheet    0.9357      0.9357          0.9949

We can see that lsm_fpga_vhls yields slightly different segmentation results from the other three implementations. The differences in the obtained experimental results have two causes: (a) for the KMGA algorithm, the evolution process is random, so the results are rarely identical between runs; and (b) we convert the data type from double to float to save area in the design, which lowers the accuracy. Furthermore, in the C-synthesis process the single-precision floating-point numbers are further mapped to fixed-point ones, which changes the results a second time. However, all these differences are tiny and do not seriously influence the performance of the final implementations. These experiments demonstrate that the proposed VHLS can generate the desired implementations with correct functionality.

Next, Tab. 3 shows the total resource consumption estimated for kmga_fpga_vhls and lsm_fpga_vhls as implemented with the proposed approach. Neither implementation consumes a large amount of resources: kmga_fpga_vhls takes up only 4% of the Look-Up Tables (LUTs) of its target device, and lsm_fpga_vhls only 55.5%. This is because the selected High-Level Synthesis (HLS) tool, AutoPilot, provides high-quality scheduling, allocation and binding.

Thirdly, Fig. 5 shows the running-speed ratios. This experiment uses a standard skin lesion reflectance cube retrieved through a neural network designed by Mansouri et al. [12] for KMGA, and the photos taken by the IKONOS satellite for LSM. The Matlab CPU-only implementation of each algorithm is set as the reference. The proposed VHLS method achieves speedups of around 6× for kmga_fpga_vhls vs. kmga_cpu_m and 1.42× for lsm_fpga_vhls vs. lsm_cpu_m. This demonstrates that our approach can effectively exploit the running-speed advantages of FPGAs when the designs are made at the same high abstraction level.

Table 1: Implementation description.

Algorithm   Implementation   Tool                              Target device/platform
KMGA [10]   kmga_cpu_m       Matlab R2012a                     Intel Q6600 processor
            kmga_cpu_c       Intel C++ Compiler (ICC) 13.1.1   Intel Q6600 processor
            kmga_fpga_vhls   proposed VHLS                     Xilinx Virtex-7
LSM [3]     lsm_cpu_m        Matlab R2012a                     Intel Q6600 processor
            lsm_cpu-gpu_m    Matlab R2012a                     Host: PC AMD Athlon 5200; Device: NVIDIA GPU GT 430
            lsm_cpu_c        ICC 13.1.1                        Intel Q6600 processor
            lsm_fpga_vhls    proposed VHLS                     Xilinx Kintex-7

Figure 4: Photos taken by IKONOS and GeoEye-1 for LSM evaluation: (a) Uxmal, (b) Volcano, (c) Ice sheet.

Table 3: Resource consumption estimates of the implementations realized with the proposed approach.

Implementation   BRAM_18K   DSP48E   FF      LUT
kmga_fpga_vhls   14         116      19475   32245
lsm_fpga_vhls    3          74       13462   23403

Relative to the C implementations, our approach achieves a speedup of 2.2× for KMGA but reduces the running speed of LSM by 47.4%. This is because the selected C compiler, ICC, supports optimizations of several forms [2, 4, 9]; in this experiment, the "Maximize speed" mode is used to schedule the algorithm automatically. Since the evolution process of KMGA is an iteration-dependent loop, parallel optimizations such as vectorization or Streaming SIMD Extensions cannot benefit the running cost of that design. In contrast, thanks to the Lattice Boltzmann Method, LSM is highly parallel, so this mode produces a very high quality implementation. It should be noted, however, that the comparison between the C and VHLS based implementations is not entirely fair, because the former must be developed at the lower C abstraction level, which usually requires more design effort than the latter. Furthermore, this performance gap is not impassable, because we have not yet applied any parallel optimizations to the VHLS based implementations.

Compared with the FPGA, the GPU cannot effectively accelerate the design. LSM has an iterative architecture, and at the beginning or end of each iteration the entire image must be transferred between the host and the device. Since host-device communication on a platform of this type incurs a high running cost, the architecture of LSM is not well suited to GPU acceleration. This problem does not exist on FPGAs, so our approach achieves better implementations than the former.

Finally, the source-code complexity of the different implementations is evaluated. We represent the complexity of each implementation by the number of lines of its routines and show the result in Fig. 6. The VHLS based implementations have the same complexity as the Matlab ones, which demonstrates that the proposed approach is highly compatible with Matlab. Meanwhile, the line counts of kmga_fpga_vhls and lsm_fpga_vhls are only around 50% of those of their C counterparts, so we conclude that the proposed approach effectively reduces design complexity relative to the reference methods.

Figure 5: Acceleration ratio: (a) KMGA, (b) LSM.

Figure 6: Complexity comparison.

4. CONCLUSIONS AND PERSPECTIVES

This paper proposes a VHLS method for image processing designs by combining recent source-to-source compilation and HLS techniques. It provides a high-abstraction development environment to users, which greatly benefits development productivity by automating the Matlab-to-RTL synthesis process. We implemented the proposed synthesis method with currently-available Electronic Design Automation tools, and then verified it on two real-life designs. The experiments demonstrate that the proposed method can effectively exploit the advantages of FPGAs relative to devices of other types at the same abstraction level. Furthermore, it does not increase the complexity of the algorithm behaviors described in Matlab at the routine level, even when they are not specially developed for FPGAs; the method can thus effectively facilitate porting between different platforms, such as CPU vs. FPGA.

On the other hand, some new issues and challenges were found as well. The first is that manual code transformation is still needed, owing to incompatibilities between the selected source-to-source compiler and the HLS tool. The second is that a performance gap still exists when the General-Purpose Processor based designs are made at a lower abstraction level, C for example. Finally, in HLS, many loops offer optimization opportunities that may further improve the performance of the generated RTL, especially the running speed, but these have not yet been explored. Consequently, we plan to focus future work on improving the VHLS method of this paper: source-to-source compilation based optimization strategies may be developed to improve the quality of the generated RTLs. We hope these contributions will also help other related performance optimization research.

5. ACKNOWLEDGMENTS

The authors would like to thank the China Scholarship Council and the Conseil Régional de Bourgogne (France) for funding this work.

6. REFERENCES

[1] ADSC research highlights: Synthesize hardware, without hardware expertise. Online, January 2016.
[2] B. Armstrong, S. W. Kim, I. Park, M. Voss, and R. Eigenmann. Compiler-based tools for analyzing parallel programs. Parallel Computing, 24:401-420, May 1998.
[3] S. Balla-Arabe, X. Gao, B. Wang, F. Yang, and V. Brost. Multi-kernel implicit curve evolution for selected texture region segmentation in VHR satellite images. IEEE Transactions on Geoscience and Remote Sensing, 52(8):5183-5192, Aug 2014.
[4] B. Barney. Introduction to parallel computing. Article published online.
[5] P. Coussy and A. Morawiec. High-Level Synthesis: From Algorithm to Digital Circuit. Springer Publishing Company, Incorporated, 1st edition, 2008.
[6] C. Deming, L. Eric, R. Kyle, and C. Zheng. Hardware synthesis without hardware expertise. Technical report, Advanced Digital Sciences Center (ADSC) of the University of Illinois at Urbana-Champaign, 2011.
[7] M. Fingeroff. High-Level Synthesis Blue Book. Xlibris Corporation, 2010.
[8] M. Girkar and C. Polychronopoulos. Automatic extraction of functional parallelism from ordinary programs. IEEE Transactions on Parallel and Distributed Systems, 3(2):166-178, Mar 1992.
[9] Intel Corporation. Intel® C++ Compiler User and Reference Guides, 304968-022us edition, 2008.
[10] R. Jolivot, Y. Benezeth, and F. Marzani. Skin parameter map retrieval from a dedicated multispectral imaging system applied to dermatology/cosmetology. International Journal of Biomedical Imaging, 2013:15, 2013.
[11] J.-H. Lee, Y.-C. Hsu, and Y.-L. Lin. A new integer linear programming formulation for the scheduling problem in data path synthesis. In Digest of Technical Papers, 1989 IEEE International Conference on Computer-Aided Design (ICCAD-89), pages 20-23, Nov 1989.
[12] A. Mansouri, F. Marzani, and P. Gouton. Neural networks in two cascade algorithms for spectral reflectance reconstruction. In ICIP (2), pages 718-721. IEEE, 2005.
[13] S. Gupta, R. K. Gupta, N. D. Dutt, and A. Nicolau. SPARK: A Parallelizing Approach to the High-Level Synthesis of Digital Circuits. Springer US, 2004.
[14] C.-J. Tseng and D. Siewiorek. Automated synthesis of data paths in digital systems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 5(3):379-395, July 1986.
[15] Xilinx. Introduction to FPGA design with Vivado High-Level Synthesis. Technical Report UG998 (v1.0), Xilinx, July 2013.
