A Fast Run Time Reconfigurable Platform for Image Edge ... - CiteSeerX

9 downloads 28514 Views 343KB Size Report
1 ITEE, University of Queensland, Brisbane,. QLD 4072, Australia ... is a reconfigurable architecture built using hardware-software co-design concepts. For each application ... designing custom hardware for each new algorithm. The QUKU .... for Fast Board Development and Reconfigurable Computing, Proc. SPIE 2607, pp.
QUKU: A Fast Run Time Reconfigurable Platform for Image Edge Detection Sunil Shukla1,2, Neil W. Bergmann1, Jürgen Becker2 1

ITEE, University of Queensland, Brisbane, QLD 4072, Australia {sunil, n.bergmann}@itee.uq.edu.au http://www.itee.uq.edu.au/ 2 ITIV, Universität Karlsruhe , 76131 Karlsruhe, Germany {shukla, becker}@itiv.uni-karlsruhe.de

Abstract. To fill the gap between increasing demand for reconfigurability and performance efficiency, CGRAs are seen to be an emerging platform. In this paper, a new architecture, QUKU, is described which uses a coarse-grained reconfigurable PE array (CGRA) overlaid on an FPGA. The low-speed reconfigurability of the FPGA is used to optimize the CGRA for different applications, whilst the high-speed CGRA reconfiguration is used within an application for operator re-use. We will demonstrate the dynamic reconfigurability of QUKU by porting Sobel and Laplacian kernel for edge detection in an image frame.

1 Introduction With the rapid emergence of multi-protocol standards, the need for a platform which is able to reconfigure itself rapidly and on the fly has grown tremendously. This situation demands a high level of flexibility, as seen in GPP (General Purpose Processor), performance and power efficiency. GPP has failed to show its mark in terms of processing efficiency. To fill the gap between flexibility and efficiency, several architectures have been proposed. Generally, ASICs are seen as the only low-power solution which can meet this demand for fast computational efficiency and high I/O bandwidth. However, the NRE cost of ASICs are a strong motivation for some sort of general purpose computing chip with better power efficiency than microprocessors. Recently, coarse grained reconfigurable architectures (CGRA) have been developed to bridge the gap between power-hungry microprocessors and single-purpose ASICs. Coarse-grained arrays have the advantages of power-efficiency for their intended application domain, and also they are designed with efficient implementation of dynamic reconfiguration supported in hardware and in design software. Furthermore, this dynamic reconfiguration can be very fast, since configuration codes are very short. But their use remains in question due to the lack of a well defined design flow and commercial availability. Traditionally, FPGAs have been thought of as a prototyping device for small scale

digital systems. However current generation mega-gate FPGA chips have changed this perception. FPGAs can be reconfigured for an unlimited number of times to implement any circuit. FPGAs have proved their usability in control intensive as well as computational intensive tasks. An attractive feature of FPGAs is the ability to partially and dynamically reconfigure the FPGA function. The next section gives a short overview of FPGA based reconfigurable architectures. Section 3 describes the QUKU architecture. Section 4 describes the run time reconfigurability of QUKU. Section 5 discusses the implementation results followed by conclusion.

2 Related Work Some other systems consisting of MPU and FPGAs have been proposed in past [3, 4, 5, 6, 7, 8]. Guccione [9] provides a detailed list of FPGA based reconfigurable architectures. YARDS [5] is a hybrid system of MPU connected with an array of FPGA and has been targeted for telecommunication applications. Spyder [7] consists of a processor with 3 reconfigurable execution units working in parallel. The configuration of the execution units is determined from the set of applications. DISC [6] is an FPGA based processor which loads application specific instruction from a library of image processing elements. It takes advantage of the partial reconfiguration feature of FPGA and implements a system analogous to memory paging in software. PRISM [8] is a reconfigurable architecture built using hardware-software co-design concepts. For each application, new instructions are compiled, synthesized and loaded. GARP [4] combines a MIPS processor with a reconfigurable array to achieve accelerated performance in certain applications.

3 QUKU Architecture QUKU [1, 2] is a merger of two technologies: CGRAs and FPGAs. Fig 1 shows the system level description of QUKU. It consists of a dynamically reconfigurable PE array, configuration controller and address controller for data and result memory along with a Microblaze based soft processor. QUKU is a coarse-grained PE matrix overlaid on a conventional FPGA. The aim is to develop a system which is based on commercially available and affordable technologies, but at the same time provides active support for fast and efficient dynamic reconfiguration. Our system is unique in that it provides two levels of application-specific reconfigurability. The operation of each PE, and the interconnections between PEs can be reconfigured on a cycle-bycycle basis, giving maximum reuse of arithmetic and logical operators without long reconfiguration delays. However, the CGRA does not suffer from the normal CGRA problem of trying to identify the optimal mix of operators for all present and future applications to be used on that array. Rather, the structure of the CGRA can be periodically re-optimized for each new application at a cost of several milliseconds of FPGA reconfiguration time.

Configuration Contoller Module

Configuration BRAM

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

Configuration controller

Address Manager Module

Processor for reconfiguration control and also for running software application processes.

Fig. 1. System level diagram of QUKU

A commonly perceived obstacle to provide algorithm speedup is the difficulty of designing custom hardware for each new algorithm. The QUKU architecture moves the design problem from a complex hardware design problem to a simpler problem of programming a PE array. QUKU compiles these PEs to a heterogeneous array, with each PE optimized for just the range of operations it requires to implement for one application set. It is our conjecture that such a PE array has the potential to provide area and power efficiencies not normally associated with FPGAs, and this paper outlines some of our initial investigations into this potential.

4 Run Time Reconfiguration In [2] we have described the design process and shown that QUKU performance is much better than a FPGA based soft processor implementation. In this section, we will describe the configuration memory organization to support run time reconfiguration of whole or partial PE array. As shown in fig. 1, the configuration controller is responsible for the loading of different modules on the PE array. The configuration BRAM contains configuration codes for up to four modules. Fig 2 shows the address space of modules and the configuration word structure. The first location of a module configuration space is reserved for specifying the address of PEs involved in the module mapping. Address of PE is written in decoded form say, if a module is to be loaded on PE 0, 1, 2, 4, 7, 8 and 10 then it is written in the address location as “0000010110010111”. This ensures that the corresponding PEs can be reset before they are loaded with the new configuration. Actual PE configuration word starts after the PE reset information. A PE array consists of 16 PEs. Each PE may have three configuration layers: start, mid-

dle and end layer. Each configuration layer is programmed to run for a certain number of iterations as a 10 bit value in iteration counter. The iteration counter value is written with the configuration bits in odd numbered locations. The even numbered locations contain the address of PEs for which the configuration is valid. Address 000000000

Module 1 '1' Reserved[30:16] PE address[15:0]

000000001 Configuration[31:16] Iteration Cntr[15:0] 000000010

'1' Reserved[30:16] PE address[15:0]

00XXXXXX1 Configuration[31:16] Iteration Cntr[15:0] 00XXXXXX0

010000000 100000000 110000000

'0' Reserved[30:16] PE address[15:0]

Module 2 Module 3 Module 4

Fig. 2. Configuration BRAM memory structure

Depth of the address space is decided by the worst case in which no two PEs share the same configuration. For this worst case, each PE requires 3 configuration word (start, middle and end layer) and each configuration word along with address occupies two locations leading to a total of 96 (16X3X2) locations, plus one location for the reset information. Hence a total of 97 locations are required. To keep the memory map simple, this depth is extended to the nearest power of 2, making it 128. To mark the end of configuration, we have reserved bit 31 of the even numbered location. The last configuration word is indicated by a ‘0’ at bit 31 in the even numbered location. 4.1 Run Time Reconfiguration for Image Processing Kernel In this section, we will be concentrating on edge detection algorithms that find widespread application in image processing [10,11]. There are a lot of methods for edge detection. But most of them can be grouped into either gradient or Laplacian. Gradient method looks for finding out minima or maxima in the first derivative while Laplacian method looks for zero crossings in the second derivative. We have mapped Sobel and Laplace edge detection algorithms on QUKU. The Sobel and Laplacian masks are shown in Fig. 3. Sobel edge detector uses a pair of 3X3 convolution masks, one estimating the gradient in X direction and the other in Y direction. The approximate gradient magnitude is then obtained by adding the X and Y gradients The Laplacian technique uses a 5X5 convoluted mask to approximate the second derivative in both X and Y directions. The idea of generating gradient from image frame, utilizes the concept of sliding window over the image frame. The mask

is slid over an area of the input image. The new value is calculated for the middle pixel, then the mask is slid one pixel to the right. This continues from left to right and top to bottom.

-1

0

1

1

2

-1

-1

-1

-1

1

-1

-1

-1

-1

-1 -1

-1

24

-1

-1

-2

0

2

0

0

0

-1

-1

0

1

-1

-2

-1

-1

-1

-1

-1

-1

-1

-1

-1

-1

-1

Sobel X Mask

Sobel Y Mask

Laplacian Mask

Fig. 3. Sobel & Laplacian filter mask for edge detection in image frame

5 Result Analysis For edge detection purpose, we have chosen an image with 336 rows and 341 columns (fig. 4(a)) of pixels. Fig. 4(b) and 4(c) represents the image obtained after applying Laplacian and Sobel edge detection algorithms respectively. Initially, Laplacian kernel was mapped followed by Sobel kernel. Table 1. Result comparison for Laplacian and Sobel kernel implementation on QUKU

Application Laplacian Kernel Sobel kernel

a) Original Image

Execution time/frame 33.5 ms 54.4 ms

Configuration time 105 ns 105 ns

b) Image filtered using Laplacian algorithm

c) Image filtered using Sobel algorithm

Fig. 4. Original image and image after processing with Laplacian and Sobel filter

6 Conclusion In this paper we showed how QUKU can be dynamically reconfigured at the application level within a few clock cycles. The configuration time was found to be a very small fraction (10-7) of the execution time. The preliminary results support the fact that QUKU can be used in real time image processing applications and can be dynamically reconfigured at a very fast speed to switch between different kernels Development of QUKU is still in progress. This first set of experiments is mostly to confirm the APEX design flow, and to establish performance and reconfigurability features. We are yet to develop a Microblaze based soft processor interface to QUKU for system and software control.

7 Acknowledgement This project is proudly supported by the International Science Linkages programme established under the Australian Government’s innovation statement, Backing Australia’s Ability.

References 1. Sunil Shukla, Neil Bergmann, Jürgen Becker, “APEX – A Coarse Grained Reconfigurable Overlay for FPGA”, Proceedings of the IFIP VLSI SoC 2005, pp 581-585 2. Sunil Shukla, Neil Bergmann, Jürgen Becker, “QUKU: A Two-Level Reconfigurable Architecture”, accepted for presentation in ISVLSI 2006 3. Alexandro M. S. Adário, E. L. Roehe and S. Bampi, “Dynamically Reconfigurable Architecture for Image Processor Applications”, DAC 99, New Orleans, Louisiana 4. Hauser, J. R.; Wawrzyneck J., “GARP: A MIPS Processor with a Reconfigurable Coprocessor” Proceedings of FCCM (1997), pp 24-33 5. A. Tsutsui, T. Miyazaki, “YARDS: FPGA/MPU Hybrid Architecture for Telecommunication Data Processing”. Proceedings of FPGA 1997, pp. 93-99 6. M. J. Wirthlin, B. L. Hutchings, “DISC: The dynamic instruction set compiler”, in FPGAs for Fast Board Development and Reconfigurable Computing, Proc. SPIE 2607, pp. 92-103 (1995) 7. C. Iseli, E. Sanchez, “Spyder: A Reconfigurable VLIW processor using FPGAs” in proceedings of FCCM, 1993, pp. 17 – 24 8. P. Athanas, H. F. Silverman, “Processor Reconfiguration through Instruction Set Metamorphosis”, IEEE Computer, March 1993, pp. 11-18 9. S. Guccione, “List of FPGA based Computing Machine”, online at http://www.io.com/~guccione/HW_list.html 10. L. S. Davis, “A survey of edge detection techniques”, Computer Graphics and Image Processing, vol. 4, no. 3, pp. 248-270, September 1975 11. V. S. Nalwa and T. O. Binford, “On detecting edges”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-8, no. 6, pp. 699-714, November 1986

Suggest Documents