Definition and SIMD Implementation of a Multi-Processing Architecture Approach on FPGA

Philippe Bonnot, Fabrice Lemonnier, Gilbert Edelin
Thales Research & Technology, RD 128, 91767 Palaiseau, France
[email protected] [email protected] [email protected]

Gérard Gaillat, Olivier Ruch, Pascal Gauget
Thales Optronique SA, rue Guynemer, 78283 Guyancourt, France
[email protected] [email protected] [email protected]

Abstract

In a context of high performance, low technology-access cost and application code reusability objectives, this paper presents an “architectured FPGA” approach that consists in defining a general framework for the implementation of embedded system applications. Addressing image processing as a first application domain, an FPGA architecture implementation based on that approach is presented. Built around a SIMD architecture, the “Ter@Core” FPGA implementation illustrates the competitiveness of the approach compared to off-the-shelf processors and to the usual FPGA approach. The presented implementation gathers 128 processing elements on a single FPGA, providing 19.2 GOPS of performance and very high application development productivity.

Keywords: image processing, data dependent processing, long lifecycle, FPGA, platform approach, domain specific API, MIMD architecture, SIMD architecture, middleware.

1 Introduction

Embedded image or signal processing systems require high performance and highly integrated implementation solutions. The domain of interest considered here concerns defense and aerospace applications such as smart cameras, airborne radar, etc. In this domain, one has to reconcile high processing power, limited size, limited power consumption, limited production volumes, procurement issues due to the need to guarantee shipping and support over long lifecycles and, nevertheless, a strong need for evolution in order to remain competitive during the lifecycle. In addition, there is an


increasing customer demand to move from simple display systems to intelligent systems capable of extracting relevant information from a signal or an image. This means a strong increase in the complexity of algorithms (more complex codes and more instructions per processed element), combined with data dependent processing which can be implemented either through reconfigurable hardware or through a programmable processor.

When looking for a solution to meet such requirements, one can observe today that neither general purpose COTS microprocessors nor general purpose COTS DSP implement massively parallel architectures. As a consequence, they are no longer able to take full advantage of the evolution of integrated circuit density (Moore's law). The introduction of “multi-core” devices in recent years is obviously a step in the right direction, but it is unlikely to result in large scale parallelism, since most customers of these chips prefer the ease of programming in a sequential language to the extreme performance which (today) requires the programmer to take parallelism into account when coding. Only the introduction of a new generation of compilers and tools could change this situation. Significant breakthroughs in processing power can be expected from new architectural paradigms and products (MPSoC, parallel tiled chips or “many-cores” [1] [2]). However, these highly parallel programmable components are generally considered not mature enough for defense and aerospace applications. This might change in the future thanks to markets like video and games but, today, designing a long lifecycle system on top of such components means taking the risk of severe procurement issues and redesign costs all over the lifecycle of the system.

FPGA are another solution from the market. Compared to ASIC, they provide significantly less processing power but, on the other hand, they require

much lower development costs, which is a very important issue as far as limited production volumes are concerned. Hence, they appear as a good compromise between microprocessors or DSP, which most often do not provide the required processing power, and ASIC, which lead to unaffordable development costs. As far as long term availability is concerned, they are promoted by companies which can be expected both to survive and to keep a stable strategy in the long term. In practice, although the architecture of FPGA has significantly evolved over the years and will probably continue to evolve, some kind of stability is guaranteed by the design environment and languages. Hence, FPGA are very widely used in many defense and aerospace applications today.

In fact, the main limitation of FPGA regarding the requirements listed above does not come from the component itself but from the way the component is currently used in signal or image processing applications. Very often, an FPGA is used to implement (to “hardwire”) a specific algorithm. Due to the increasing complexity of algorithms in recent systems, the FPGA can most often implement only some parts of the algorithm, while the other parts need to be coded on a DSP. Several people from different disciplines (algorithm, FPGA, interface software) have to co-operate and understand each other. This usually leads to very high development costs and poor time to market, and makes evolutions very difficult or at least very costly. Huge efforts have been made to support such a process, but the road seems very long before tools are mature and stable enough to allow efficient implementation, low cost development and easy evolution over a long period at the same time. Furthermore, data dependent processing is very difficult to implement with this type of “hardwired” implementation of an algorithm. Some efforts are currently being made by FPGA manufacturers to provide some kind of reconfigurability, but truly reconfigurable hardware will probably require that much more innovative architectures be introduced on the market, such as the one considered in the EC supported MORPHEUS project [3].

For all those reasons, our choice has been to design a programmable parallel processing architecture on top of an FPGA. Doing this, we can benefit from the advantages of both FPGA and programmable processors. FPGA bring a fairly high processing power and a rather good guarantee of stability in the long term. By this we mean that an architecture built on top of an FPGA can be expected to be reproduced more or less identically for years on successive generations of FPGA, and hence to benefit from increasing performance due to technology evolution without requiring too costly an effort. Furthermore, large families of products are available, which allows the same architecture concept to be implemented from low consumption devices to high end systems. Implementing a

programmable parallel processing architecture on top of an FPGA allows the same VHDL code to be used for a large variety of applications, provided these applications belong to the same domain. It makes it possible to address much more complex algorithms than “hardwiring” algorithms on an FPGA, while still making good use of the processing power of the FPGA. In fact, the two main advantages are, first, a dramatic reduction in development costs, evolution costs and time to market and, second, the capability to implement sophisticated data dependent processing algorithms.

A possible drawback of such a programmable solution compared to “hardwiring” an algorithm is the loss of efficiency due to the use of a general purpose solution. This is true in principle, but a first remark is that this comparison applies only to those algorithms which are simple enough to allow both implementations: for sophisticated or data dependent algorithms, there is no choice. A second remark is that the effort which can be afforded to optimize a multi-application architecture is much higher than the effort which can be afforded to optimize a single-application architecture; this will appear clearly later in this paper. A third remark is that an algorithm engineer can understand implementation issues on a parallel processing architecture, and can even implement the algorithm himself, while understanding implementation issues on an FPGA is generally too far from his background. As a consequence, when solving a problem, he can select the algorithm which leads to the most efficient implementation in the case of a parallel processing architecture, but not in the case of a “hardwired” FPGA implementation.

The approach we have selected is described in section 2. A first release for image processing applications is described in sections 3 (hardware) and 4 (middleware). Implementation is described in section 5. Performances and productivity gains in an application are discussed in section 6. Evolution perspectives are sketched in section 7.

2 “Architectured FPGA” approach

Although at present we have made only one implementation, targeting the image processing domain, our long term goal is to promote a general hardware and middleware architecture framework which can be used for a large range of image or signal processing applications. Hence, this framework is intended first to be customized per domain, second to be customized per size (from low consumption devices to high end systems) and third to be ported in order to maintain competitiveness every time a new FPGA technology is introduced. Our analysis is that although each domain requires specific optimizations (typically, dealing with complex-number FFTs in signal processing does not require the same kind of optimization as dealing with 1-bit or 16-bit integers in image processing), these optimizations can be quite local, and 80% of both the hardware and the middleware can be the same for a large variety of applications.

At the highest level, the framework features a MIMD scheme (see figure 1): several µP (general purpose microprocessors), possibly assisted by PU (processing units), exchange data and messages through two networks. A first network (in fact a bus) with low latency and low throughput is intended to support the exchange of (short) messages. A second network called the DMU (data mover unit), with high latency and high throughput, is intended to support the exchange of data. External RAM (DDR, QDR, …) are connected to the DMU to allow storage of intermediate data. Bridges are provided to allow the µP of one chip to communicate with the µP of other chips or with single board computers such as COTS PC boards. In such a scheme, the µP are primarily intended to implement the control part of the application, while the PU are primarily intended to implement the inner parts of the algorithms, those parts where very high processing power is required. One FPGA can contain one or several µP, and each µP can control zero, one or several PU. In the same chip, all the PU can be either the same (for instance a SIMD) or different.

Figure 1. “Architectured” FPGA global view.

3 Ter@Core hardware architecture

Ter@Core (see figure 2) is the name of the first circuit developed within the framework discussed above. It targets image processing applications such as image enhancement, detection, identification and tracking. It has been implemented on a Virtex-4 SX55 from Xilinx [4]. It contains only one µP, one PU, one DMU and one bus. It is linked to a Pentium PC board and acts as an accelerator for the Pentium. The µP uses the MicroBlaze softcore from Xilinx. The PU is a SIMD processor ([6], [7], [8] also proposed SIMD architectures on FPGA). It implements a global controller plus 128 PE (processing elements) and runs at 150 MHz. The PE are connected according to a one-dimensional ring topology in order to keep interconnection distances at a minimum.

Figure 2. Ter@Core architecture: 1 µP and 1 PU.

Design of the PE (see figure 3) is strongly constrained by the resources available on the chip. Each PE contains one DSP block used as an ALU, one 1K×18b RAM, one 512×36b RAM and a maximum budget of 96 slices for the glue logic (these characteristics can be compared with the generally higher slice budgets or more complex interconnects of [9], [10]). The content of a 16K×18b global RAM can be broadcast to all the PE. Indirect addressing is supported at PE level. A patented addressing scheme allows the set of 128 PE to fetch or store 128 adjacent pixels of an image line, even if those pixels are not located at the same address on each PE. Compared to many other SIMD, this brings a huge simplification in image processing codes and saves a lot of processing power.
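The paper does not disclose the details of this patented scheme, so the following is only a plausible reconstruction under an explicit assumption: pixels of an image line are distributed cyclically over the PEs, pixel x residing on PE (x mod 128) at local word (x div 128). Under that assumption, an unaligned fetch of 128 adjacent pixels starting at x0 only requires each PE to compute its own local address, as in this minimal C sketch (all names are ours):

```c
#include <stdio.h>

#define NPE 128 /* number of processing elements */

/* For a fetch of NPE adjacent pixels starting at pixel x0, return the
 * local word address PE 'pe' must use so that each PE delivers exactly
 * one of the pixels x0 .. x0+NPE-1 (assumed cyclic distribution). */
static unsigned local_addr(unsigned pe, unsigned x0)
{
    unsigned x = x0 + ((pe + NPE - (x0 % NPE)) % NPE); /* pixel owned by this PE */
    return x / NPE;                                    /* its local word address */
}

int main(void)
{
    const unsigned x0 = 70; /* unaligned start index */
    for (unsigned pe = 0; pe < NPE; pe++)
        printf("PE %3u fetches local word %u\n", pe, local_addr(pe, x0));
    /* PEs 70..127 read word 0 and PEs 0..69 read word 1: the whole
     * unaligned fetch takes a single access, with no realignment loop. */
    return 0;
}
```

Without some such per-PE address generation, every unaligned access would cost an explicit realignment sequence on each PE, which is where the savings in code size and processing power quoted above come from.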






Figure 3. Processing Element architecture.

It is clear however that the challenge of getting the best from an existing FPGA does not leave much room for innovation at the inner level of the architecture. But the innovation lies elsewhere: it is to propose a competitive alternative to “hardwired” algorithm implementations. To win such a competition on every contract, efficiency has to be preferred to innovation as a design driver. Note also that our framework permits several types of PU to co-operate on the same chip. Hence, the role of the SIMD is not to cover 100% of the domain but to implement the most commonly used algorithms efficiently. For instance, it does not efficiently implement operators such as image rotation that require complex processor-memory interconnections. Thanks to our multi-PU scheme, another type of PU can be added to implement such an operator if required for another contract.

4 Ter@Core middleware architecture


Middleware plays a double role. The first role is to maximize productivity and hence to reduce development and evolution costs. The second role is to implement an Application Programming (or Portability) Interface (API) in order to guarantee that the hardware layer can evolve without affecting the application software layer, and conversely. Such a double role is often referred to as a “platform approach”. Note that this second role is particularly important in the case of an architecture with a long lifecycle goal but based on evolving devices like FPGA: whatever our efforts in the long term to reproduce more or less the same hardware architecture over the years, the constraint of maintaining competitiveness will require some tuning for each new generation of FPGA. Hence, an API must not be defined at the level of the hardware layer but at a higher level. The middleware splits into three parts (see figure 4). A first part implements a standard operating system like Linux or Windows. It runs on a µP generally located outside the FPGA. Its role is to support communication with external devices, the man machine interface, the overall control of the application, etc.


A second part implements a set of message passing primitives which allow tasks running on the various µP to exchange information and to synchronize. A third part implements a series of image processing operators. An operator is a piece of microcode running on the SIMD. It can be activated by a µP through a remote call which indicates the starting address in the microcode plus the call parameters. The list of the presently implemented operators is given in figure 5.
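To make the call style concrete, here is a minimal C sketch of the µP side of such a remote call. All names (simd_call, simd_wait, OP_UPPER_THRESHOLD) are hypothetical, since the paper does not give the actual API; only the mechanism, a microcode start address plus call parameters followed by an execution report, comes from the text:

```c
#include <stdint.h>
#include <stdio.h>

typedef struct { uint16_t x, y, w, h; } window_t;

/* Remote activation of a SIMD operator (the third middleware part).
 * 'ucode_start' is the starting address of the operator in the SIMD
 * microcode; stubbed here with printf instead of the real call bus. */
static void simd_call(uint32_t ucode_start, const uint32_t *params, int n)
{
    printf("call microcode @0x%04x with %d parameter(s)\n",
           (unsigned)ucode_start, n);
    (void)params;
}

static void simd_wait(void)
{
    /* stub: would block until the SIMD's execution report is received */
}

/* Example: thresholding a window already present in the PE local RAMs.
 * OP_UPPER_THRESHOLD is a made-up microcode address. */
#define OP_UPPER_THRESHOLD 0x0140u

static void threshold_window(const window_t *win, uint32_t level)
{
    uint32_t params[5] = { win->x, win->y, win->w, win->h, level };
    simd_call(OP_UPPER_THRESHOLD, params, 5);
    simd_wait();
}

int main(void)
{
    window_t win = { 0, 0, 128, 128 };
    threshold_window(&win, 40);
    return 0;
}
```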

Figure 5. Domain-dedicated operators (partial list):
- image = f(image1, image2): addition/subtraction of 2 images; Inf or Sup of 2 images; weighted addition (ax+by) of 2 images; multiplication of 2 images.
- Statistics: mean and variance of an image; min, max, histogram of an image.
- image = f(image, coefficients): upper or lower thresholding of an image; linear transformation of luminance (ax+b); look-up table transformation of luminance; division by 2^n; binarisation against a threshold.
- Filtering: convolution with a horizontal segment; convolution with a vertical segment; convolution with a rectangular window.
- Geometric transformations: interpolated zoom/de-zoom along x or y; factor 2^N de-zoom along x or y; 90° rotation.
- Morphology: erosion and dilatation (2, 4 and 8 neighbour topology); partition into connex areas.
- Miscellaneous: absolute value of an image; normalisation to [0, 255] dynamics.

Note that due to memory size limitations inside the FPGA, operators generally do not work on a full image but on a window. The role of the µP is to split images into windows and to make remote calls in order, first, to get a window loaded by the DMU from the storage RAM, second, to get it processed by the SIMD, and third, to get the result unloaded by the DMU into the storage RAM. Of course, data transfer and processing tasks are done in parallel. Figure 6 illustrates how the work can be shared between a PC Pentium in charge of the overall control of the application, a MicroBlaze in charge of activating the SIMD and the DMU, and the SIMD in charge of running the operators called by the MicroBlaze.
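The parallelism between transfers and processing can be sketched as a classic double-buffered loop on the µP. Again the primitive names (dmu_load, dmu_unload, dmu_wait, simd_call_on_bank) are hypothetical and merely continue the sketch above; the paper itself only states that the DMU and the SIMD work in parallel thanks to the dual-port PE RAMs:

```c
#include <stdint.h>

/* Hypothetical DMU and SIMD primitives, continuing the sketch above:
 * asynchronous window transfers between the storage RAM and one of two
 * PE local-RAM banks, plus banked operator activation. */
void dmu_load(int window, int bank);
void dmu_unload(int window, int bank);
void dmu_wait(void);
void simd_call_on_bank(uint32_t ucode_start, int bank);
void simd_wait(void);

/* Double-buffered pipeline: while the SIMD processes the window held in
 * one bank, the DMU fills the other bank with the next window, so the
 * transfer time is hidden behind the processing time. */
void process_image(uint32_t op, int nwin)
{
    int bank = 0;
    dmu_load(0, bank);                 /* prime the pipeline */
    for (int w = 0; w < nwin; w++) {
        dmu_wait();                    /* current window is in place */
        if (w + 1 < nwin)
            dmu_load(w + 1, 1 - bank); /* prefetch the next window */
        simd_call_on_bank(op, bank);   /* process the current bank */
        simd_wait();
        dmu_unload(w, bank);           /* result back to storage RAM */
        bank = 1 - bank;
    }
    dmu_wait();                        /* drain the last unload */
}
```

With the 128×128 windows used in section 6, each transfer moves 32 KB, which is consistent with the negligible MicroBlaze load reported there.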

Figure 4. Programming layers: application layer (application software), capitalisation layer (re-use library: algo, MMI, MPI, tests, fault detection, …; high-level API), middleware layer (operators + messages API, messages middleware in C, library of domain operators in µcode + C, Posix drivers + Linux or Windows) and hardware layer (master µP, e.g. a PC, plus the SoC).

Figure 6. Programming paradigm: control level (C code on the Pentium), algorithms level (C code on the MicroBlaze), operators level (µcode on the SIMD).

Our design driver for these second and third parts was to keep them as simple and as fast as possible. Having a very short but very carefully optimized code is a way to guarantee two major properties: efficiency and low-cost portability from one hardware platform to another. More sophisticated primitives can be capitalized on top of this first, simple-minded API so as to provide a second-level, much more sophisticated API, as shown in figure 4.
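As an illustration of this capitalisation, a second-level primitive can simply compose first-level operators. The sketch below is hypothetical (the operator names are ours); it approximates a Sobel edge magnitude as |Gx| + |Gy| using only operator types listed in figures 5 and 6:

```c
/* Hypothetical second-level primitive capitalised on top of the first-level
 * API: a Sobel edge magnitude approximated as |Gx| + |Gy|. The operator
 * names below are made up; horizontal/vertical Sobel appear in figure 6,
 * absolute value and weighted addition in figure 5. Images are referred to
 * by integer handles for the sake of the sketch. */
void op_sobel_horizontal(int src, int dst);
void op_sobel_vertical(int src, int dst);
void op_absolute_value(int src, int dst);
void op_weighted_add(int a, int b, int dst); /* here with unit weights */

void sobel_magnitude(int src, int tmp1, int tmp2, int dst)
{
    op_sobel_horizontal(src, tmp1);   /* Gx */
    op_sobel_vertical(src, tmp2);     /* Gy */
    op_absolute_value(tmp1, tmp1);
    op_absolute_value(tmp2, tmp2);
    op_weighted_add(tmp1, tmp2, dst); /* |Gx| + |Gy| */
}
```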

5 Implementation results and performances

The Ter@Core architecture implements 128 PE on a Xilinx Virtex-4 SX55 FPGA. Place and route results lead to an area occupation of 82% for logic block slices, 75% for DSP blocks and 95% for RAM blocks. The critical resource is the memory. In the present design, half of the DSP blocks are used just for multiplexing purposes in order to save slices. No more than 128 DSP blocks could be used for computation purposes in the PE because, in the Virtex-4 architecture, one of the DSP inputs is shared by 2 DSP blocks. This is no longer the case in the new Xilinx Virtex-5 architecture, so we can envisage using the DSP blocks more efficiently in a future version of our design.

The MicroBlaze runs at 100 MHz, which is sufficient for its global control role. The throughput of the external RAM is 1 GB/s; the same holds for the DMU. A frequency of 150 MHz has been reached for the SIMD thanks to careful optimization of the routing and of the pipelines. This results in a 19.2 GOPS peak processing power (note that one OP stands for one ADD + one MULT + two memory accesses + address incrementing). Averaged over the applications considered, 10% of the cycles are NOP instructions due to pipeline effects in inter-PE communications. This means an efficiency of 90%; 100% efficiency can be reached in favorable cases such as convolution processing. Since the RAM of the PE are dual port, memory load/unload by the DMU and the processing job of the SIMD are achieved fully in parallel without any interference or overhead. The power consumption is 14 W.

To put these figures in perspective in the domain of image processing, it is interesting to compare them with the input rate induced by a conventional TV rate, which is 576 x 768 x 25 Hz = 10.6 Mpixels/s or 21.2 MB/s. 19.2 GOPS means that 1800 operations are available per pixel, while 1 GB/s means that 47 inputs or outputs are available per frame. Even at HDTV rate, which is 1080 x 1920 x 25 Hz = 52 Mpixels/s, 370 operations per pixel and 9 inputs or outputs per frame are available. These figures are at least one order of magnitude above what can be reached on general purpose microprocessors. For instance, in the same class of power consumption, a PowerPC G4 tops out at 1 GOPS for the application that has been studied.
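For the reader who wants to check the arithmetic, the per-pixel budgets quoted above follow directly from the peak figures:

```latex
\[
\frac{19.2\ \mathrm{GOPS}}{10.6\ \mathrm{Mpixels/s}} \approx 1800\ \mathrm{OP/pixel},
\qquad
\frac{1\ \mathrm{GB/s}}{21.2\ \mathrm{MB/s}} \approx 47\ \text{image inputs or outputs per frame time};
\]
\[
\frac{19.2\ \mathrm{GOPS}}{52\ \mathrm{Mpixels/s}} \approx 370\ \mathrm{OP/pixel},
\qquad
\frac{1\ \mathrm{GB/s}}{2 \times 52\ \mathrm{MB/s}} \approx 9\ \text{I/O at HDTV rate}.
\]
```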

They also prove that the “architectured” FPGA approach is competitive with “hard-wired” FPGA solutions even for those simple algorithms which can be implemented both ways. In fact, the poor use of the DSP blocks for computation purposes is compensated by the strong optimization of the design, which leads to a high frequency. Such an optimization can be afforded only for a multi-application architecture. Furthermore, 100% of the device runs at maximum speed 100% of the time, which is rarely the case in “hard-wired” implementations. In addition, the fact that a processor architecture is much easier for algorithm engineers to understand gives an opportunity to optimize at the algorithm design level. Compared to hardware-based SIMD solutions such as the Xetal-II [11] from NXP, which provides around 100 GOPS, the ratio is a factor of 5. This can be considered the penalty to pay to secure long-term procurement and evolving performance for long lifecycle systems.

6 Performance and productivity results on a typical application

The Ter@Core platform has been used to emulate a reconfigurable architecture in the framework of the EC-supported MORPHEUS project. In this contract, we wanted first to demonstrate the interest of reconfigurable architectures, second to get dimensioning factors to specify such an architecture, and third to demonstrate that the middleware architecture discussed above could be used both for reconfigurable architectures and for parallel processing, thus allowing easy migration from one technology to the other. In order to achieve this, we have implemented a motion detection application.

The first step was to code a set of 30 application-independent operators (the API operators mentioned above). This resulted in 800 lines of SIMD microcode; 4 weeks were needed to develop and test those 800 lines. This was done by an engineer from the algorithm group, without requiring assistance from other hardware or software engineers. The second step was to code the application itself. This resulted in 225 lines of C code to be run by the MicroBlaze (essentially calls to the SIMD and to the DMU). One week was needed to develop and test those 225 lines. Before coding, one more week was devoted to optimizing transfers on the DMU. Again, all of this was done by an engineer from the algorithm group. Note that only 11 out of the 30 operators are invoked by this application, representing 380 lines of SIMD microcode. In total, we spent 4 weeks writing a multi-application generic code (the API operators) and 2 weeks implementing an application on top of this API. This proves the dramatic productivity gain of such a programmable platform compared to “hardwired” FPGA approaches. As a comparison, the cost estimate for the same algorithm implemented on an FPGA in a conventional “hardwired” way was 18 months. Remember also that many algorithms which can be implemented on such a platform cannot be implemented using a “hardwired” FPGA approach, either because they are too sophisticated or because they are too data dependent.

Concerning performances, let us mention that the input for this application is a conventional TV rate on gray scale images, which means 576 x 768 x 25 Hz = 10.6 Mpixels/s or 21.2 MB/s. 873810 cycles are required to process an image, which means that the load on the SIMD is 14.56%. The MicroBlaze CPU load is negligible since calls to the SIMD and the DMU occur for large 128x128 windows. Since 5 inputs and 4 outputs per image are required, the load on the DMU is 19.1%. Obviously, the machine is strongly oversized for this application, but remember that this application is supposed to be only a first processing step in a real operational surveillance system.
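These load figures can be reproduced from the numbers given earlier (150 MHz SIMD clock, 25 images/s, 21.2 MB/s per image stream, 1 GB/s DMU throughput):

```latex
\[
\frac{150\ \mathrm{MHz}}{25\ \mathrm{images/s}} = 6 \times 10^{6}\ \mathrm{cycles/image},
\qquad
\frac{873\,810}{6 \times 10^{6}} \approx 14.56\%\ \text{(SIMD load)};
\]
\[
(5 + 4) \times 21.2\ \mathrm{MB/s} = 190.8\ \mathrm{MB/s},
\qquad
\frac{190.8\ \mathrm{MB/s}}{1\ \mathrm{GB/s}} \approx 19.1\%\ \text{(DMU load)}.
\]
```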

7 Evolution perspectives

The evolution perspectives of the presented approach are numerous. The main evolutions, either under development or only planned, are:
- An implementation of the proposed approach on other FPGA families or technologies: Xilinx Virtex-5 [4], Altera Stratix III [5], and other future technologies.
- Tool enhancements. This concerns compilation first: the objective is to avoid manually coding the VLIW parallel ALU code of the algorithmic primitives. This can be obtained through optimized compilation of a domain-specific C language description of these primitives into parallel VLIW assembly code (a sketch of what such a primitive might look like follows this list). Among the intended tracks, integer linear programming is one of our preferred approaches, thanks to the near-optimum results it can provide for the small programs met at that level. High-level placement tools are also being investigated, on the basis of the SPEAR tool [12], which makes it possible to generate communication code (for example on a communication API) from an interactive mapping of data arrays onto the target architecture along the processing chain, for a data-streaming class of applications.
- An extension to other application domains. This will require customizing the processing architecture to the particularities of these domains: for example, the data format but also the inter-PE communication principles may differ. The tools will thus have to be adapted (compilation, assembly, simulation). Nevertheless, the global architecture and building blocks such as the global controllers, as well as the global control primitives, are expected to be common from one domain to another.
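As announced in the list above, here is a hypothetical example of what such a domain-specific C primitive could look like (the paper does not show its actual input language). Its short, branch-free loop body is exactly the kind of small program for which integer linear programming can produce near-optimum VLIW schedules; the operation itself is the weighted addition (ax+by) from figure 5:

```c
#include <stdint.h>

/* Hypothetical domain-specific primitive: weighted addition (ax+by) of two
 * image lines of 16-bit pixels, rescaled back to 16 bits. A VLIW scheduler
 * can pair the multiply-accumulate with the two memory reads, the write and
 * the address increments, i.e. the "OP" as defined in section 5. */
void weighted_add_line(const int16_t *a, const int16_t *b, int16_t *out,
                       int n, int16_t ka, int16_t kb)
{
    for (int i = 0; i < n; i++)
        out[i] = (int16_t)((ka * a[i] + kb * b[i]) >> 8); /* fixed-point rescale */
}
```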

8 Conclusion

Our analysis is that an “architectured FPGA” approach such as the one described in this paper is a promising solution for implementing the new generation of computation-intensive signal or image processing systems. Compared to classical FPGA implementations, not only does it dramatically reduce development and evolution costs, but it also gives access to sophisticated data dependent algorithms such as the ones required to make systems more intelligent. In fact, seen from industries dealing with long lifecycles, it gives access to features similar to those offered by emerging technologies such as massively parallel processing or reconfigurable computing, while providing a much more secure way to guarantee long-term availability. Moreover, the platform approach with an API, as suggested in this paper, is probably a way to allow a smooth transition toward those emerging technologies when they are mature.

9 Acknowledgment

The work presented in this paper is partly carried out within the MORPHEUS [3] project (IST FP6, project no. 027342), which is sponsored by the European Commission under the 6th Framework Programme.

References

[1] www.tilera.com
[2] www.ambric.com
[3] www.morpheus-ist.org
[4] www.xilinx.com
[5] www.altera.com
[6] Jones, A.K.; Hoare, R. et al., “A 64-way VLIW/SIMD FPGA architecture and design flow”, ICECS, Dec. 2004.
[7] Hoare, R.; Tung, S. et al., “An 88-way multiprocessor within an FPGA with customizable instructions”, IPDPS, April 2004.
[8] Lopez, R. et al., “SIMD Architecture for Image Segmentation using Sobel Operators Implemented in FPGA Technology”, ICEEE, September 2005.
[9] Kulkarni, C. et al., “Micro-coded datapaths: populating the space between finite state machine and processors”, FPL 2006.
[10] Nikolov, H. et al., “Efficient automated synthesis, programming and implementation of multiprocessor platforms on FPGA chips”, FPL 2006.
[11] Abbo, A. et al., “XETAL-II: A 107 GOPS, 600 mW Massively-Parallel Processor for Video Scene Analysis”, ISSCC 2007.
[12] Lenormand, E.; Edelin, G., “An industrial perspective: pragmatic high-end signal processing design environment”, SAMOS 2003.
