A Custom DSP for Feature Extraction with a Calculation Capacity of > 1G MAC/s and Low I/O Bandwidth
Viktor Öwall (1), Mats Torkelson (1), and Peter Egelberg (2)
(1) Dept. of Applied Elec., Lund University, 221 00 Lund, Sweden, Tel: +46 46 222 91 10, Fax: +46 46 12 99 48, email: [email protected] and [email protected]
(2) Agrovision AB, 223 70 Lund, Sweden, Tel: +46 46 18 26 00, Fax: +46 46 14 26 69, email: [email protected]
Abstract
A customized processor for real-time image convolution has been designed to increase the performance of an instrument for automated cereal grain quality assessment. Image convolution requires an extensive amount of calculation capacity that is hard to sustain with standard processors. Therefore, a tailored architecture with a calculation capacity surpassing that of the TMS320C80 has been developed. A simplified system design is achieved with an on-chip pixel memory bank which reduces the amount of external data transfers. The processor achieves a calculation capacity of >1G MAC/s at a clock frequency of 20MHz and has been designed for a Plessey gate array implementation.
1 Introduction
An instrument for automated cereal grain quality assessment using digitized video images has been developed by AgroVision AB [1]. To enhance the performance of the instrument, feature extraction by two-dimensional convolution of the digitized image is desired. The two-dimensional convolution is computationally very intensive and must be performed in real time. Furthermore, one convolution extracts one feature, so if several features are of interest, several convolutions have to be performed, with a corresponding increase in required calculation capacity. Achieving the desired filtering with standard DSPs would lead to a very high hardware cost and a complicated system design. Therefore, an algorithm-specific DSP with a tailored processor architecture has been developed. Due to the tailored architecture, the presented processor has a calculation capacity surpassing that of the MVP TMS320C80 for this application. Since the performance requirements of the design are high, no predefined processor cores have been used; instead a tailored architecture has been assembled. The presented processor architecture: (1) yields a very high calculation capacity, (2) reduces data transfers, (3) reduces the number of off-chip busses, and (4) reduces the amount of external data handling. These factors lead to a simplified system design and ease system integration. The processor has been designed and simulated in the Plessey Design System which guarantees the performance of the fabricated chip to coincide with the simulation.
Figure 1: Convolution of an image by a kernel function. [Labels in the figure: image array, x(), with K1 columns and K2 rows; kernel array, h(), of size M1 × M2; frame; (M1 − 1)/2; (M2 − 1)/2.]
2 Image Convolution
Two-dimensional image convolution [2, 3], h * x, is performed by scanning the K1 × K2 image, x(), with the M1 × M2 kernel function, or impulse response, h(), Fig. 1. A value is calculated for each position according to

$$y(k_1, k_2) = \sum_{m_1} \sum_{m_2} x(k_1 - m_1, k_2 - m_2)\, h(m_1, m_2) \tag{1}$$

where m1 goes from −(M1 − 1)/2 to (M1 − 1)/2 and m2 from −(M2 − 1)/2 to (M2 − 1)/2. This operation is performed for all pixel values according to

$$\frac{M_1 - 1}{2} \le k_1 \le K_1 - 1 + \frac{M_1 - 1}{2} \tag{2}$$

$$\frac{M_2 - 1}{2} \le k_2 \le K_2 - 1 + \frac{M_2 - 1}{2} \tag{3}$$

and a filtered output image is produced. To deal with border effects, an image frame is added to the image data according to Fig. 1 [3].
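As a software reference for Eq. (1), the following minimal Python sketch evaluates the double sum directly. It is not the processor's implementation: the function name `convolve2d` is hypothetical, indices run from 0 over the unframed image rather than over the framed range of Eqs. (2) and (3), and the frame is assumed to contain zeros, which the text does not specify.

```python
def convolve2d(x, h):
    """Directly evaluate Eq. (1) for an image x and an odd-sized kernel h.

    x and h are lists of rows. Pixels outside the image (the frame) are
    assumed to be zero; the paper does not state the frame contents.
    """
    K1, K2 = len(x), len(x[0])
    M1, M2 = len(h), len(h[0])
    r1, r2 = (M1 - 1) // 2, (M2 - 1) // 2

    def pixel(i, j):
        # Return the image value, or 0 inside the assumed zero frame.
        if 0 <= i < K1 and 0 <= j < K2:
            return x[i][j]
        return 0

    y = [[0] * K2 for _ in range(K1)]
    for k1 in range(K1):
        for k2 in range(K2):
            acc = 0
            for m1 in range(-r1, r1 + 1):
                for m2 in range(-r2, r2 + 1):
                    # h is stored with its centre at index (r1, r2).
                    acc += pixel(k1 - m1, k2 - m2) * h[m1 + r1][m2 + r2]
            y[k1][k2] = acc
    return y
```

Each output value costs M1 × M2 multiply-accumulate operations, which is the source of the computational burden discussed in the following section.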
Figure 2: Architecture of the designed image convolution processor. [Blocks shown: controller with microprogram memory, loop handling, flag handling, and kernel RAM address; address processors (AP); processor cores 1 to 4; pixel memory bank with line memories 1 to 15. Signals shown: 8-bit input, 8-bit size, 6-bit pixel, 6-bit kernel, 4 × 8-bit data out, R/W, line memory addresses, external control.]
Figure 3: Calculation capacity as a function of size of square images. [Vertical axis: processing capacity, GMAC/s, 0 to 1.4. Horizontal axis: number of pixels per side of a square image, 1 to 250. Curves and levels: theoretical, asymptotic, sustained, theoretical MVP. Marker: maximum size for square images, 128 × 128 pixels.]
3 Design Considerations
When designing a processor, trade-offs have to be made since the available die area is limited. For the designed processor, trade-offs have been made between wordlength, kernel size, and on-chip memory to be able to fit the processor on an affordable gate array. Gray scale/color resolution is important in image analysis applications but often spatial resolution is more important. Therefore, in the trade-off between the size of the convolution kernel and the gray scale resolution, the choice for this application has been to have a relatively large kernel and to reduce the wordlength. A large kernel size of 15 × 15 sustains powerful and versatile filtering while the wordlength has been minimized to reduce hardware cost. The high level simulations have been performed in C and have shown that a 6-bit gray scale resolution is adequate for the target application to achieve the desired recognition rate. The size of an image is not fixed, but the processor limits it to 128 × 255 pixels, which is enough for the application.
Due to the computational burden of image convolutions, very high demands are set on the system implementation. To perform image convolutions with computers or standard DSPs would require a considerable number of processing elements with a corresponding increase in system complexity. Each of the designed processors extracts four features and has a theoretical processing capacity of 1.2G MAC/s (4 cores × 15 multipliers × 20MHz), if a 100% utilization of the multipliers is achieved. Since the data flow of the designed processor is tailored to the algorithm, the sustained processing capacity comes close to the theoretical optimum for large images, as shown in Fig. 3. In the microprogram, one pixel operation takes 16 clock cycles compared to the optimal 15, which results in an asymptotic level for infinitely large images of 1.125G MAC/s (15/16 of 1.2G MAC/s). The sustained processing capacity is further reduced by the initial filling of the pixel memory bank and by new line handling. For image sizes relevant for the application, i.e. > 41 × 41, a sustained calculation capacity of close to or >1G MAC/s is achieved.
A comparison has been made with the TMS320C80 Multimedia Video Processor (MVP) which has five programmable processors: four parallel DSPs and one master processor. Each DSP contains one 16 × 16 bit multiplier, which can be split into two 8 × 8 bit multipliers. At a clock frequency of 50MHz this corresponds to a maximum calculation capacity of 0.4G MAC/s (4 DSPs × 2 multipliers × 50MHz).
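The capacity figures quoted in this section follow from simple arithmetic, which the short Python sketch below reproduces. The constant and function names are illustrative only; the numeric values are the ones stated in the text.

```python
CLOCK_HZ = 20_000_000         # processor clock frequency (20MHz)
CORES = 4                     # identical processor cores
MULTIPLIERS_PER_CORE = 15     # one kernel column per clock cycle
CYCLES_IDEAL = 15             # cycles per pixel operation with a full pipe
CYCLES_ACTUAL = 16            # one extra cycle for loop counter operations


def peak_mac_rate(units, multipliers_per_unit, clock_hz):
    """Peak multiply-accumulate rate assuming 100% multiplier utilization."""
    return units * multipliers_per_unit * clock_hz


# Theoretical capacity of the presented processor.
theoretical = peak_mac_rate(CORES, MULTIPLIERS_PER_CORE, CLOCK_HZ)

# Asymptotic capacity for infinitely large images: 15 useful cycles out of 16.
asymptotic = theoretical * CYCLES_IDEAL // CYCLES_ACTUAL

# TMS320C80: four DSPs, each with two 8 x 8 bit multipliers, at 50MHz.
mvp_peak = peak_mac_rate(4, 2, 50_000_000)

print(theoretical, asymptotic, mvp_peak)  # 1200000000 1125000000 400000000
```

The sustained curve in Fig. 3 lies below the asymptotic value because of pixel memory bank filling and new line handling, which this calculation does not model.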
4 Processor Architecture
A design environment for implementation of arbitrary algorithms on fully customized processors [4] has been used to develop the presented processor. No system-defined processor cores limit the architectural freedom. Instead the designer has complete control of the processor architecture and is able to make design trade-offs. The designed processor architecture is divided into six main parts according to Fig. 2: four identical processor cores, a Pixel Memory Bank (PMB), and a controller with address processors. The identical processor cores perform four convolutions in parallel on the same image with different kernel functions. The PMB stores pixel values allowing each value to be read only once, thus reducing the input data rate. The processor has a single 8-bit input bus, a 32-bit output bus, and 5 signals for processor control and communication.
Figure 4: Schematic diagram of a processor core. [Elements shown: 6-bit pixel inputs from line memories 1 to 15, each multiplied by a 6-bit kernel value; a tree of adders with pipeline registers (Z^-1); wordlength labels of 11, 12, 14, and 18 bits; Clear and Load signals; control pipelining; 8-bit output register.]
Other custom processors for two-dimensional convolution are presented in [5, 6], [7], and [8]. These papers present different trade-offs between computational power and on-chip image memory. Due to technological advances in silicon fabrication the complexity of a single die has increased. Therefore, it has been possible to combine a high calculation capacity with a large on-chip image memory in the presented processor.
Processor Core
A processor core contains 15 multipliers, an adder tree, and an accumulator, Fig. 4. Each processor core performs 15 multiplications, corresponding to one column of the kernel, each clock cycle, which enables one pixel operation to be calculated in 15 clock cycles (cc) when the pipe is filled. However, one extra clock cycle for loop counter operations is added, so a complete pixel operation is calculated in 16cc. The calculated values are added in a tree structure of adders and pipeline registers and stored in an accumulator. The wordlength entering the tree is 11 bits, since the LSB of the multiplier output is truncated, and it increases through the tree to avoid overflow.
As the design environment does not tie the design to a particular implementation technique or cell library, one processor core with kernel RAMs was fabricated and tested in a full custom design. A one micron standard CMOS technology was used and the circuit contains > 50 000 transistors on a die area of 8 × 6.5 mm², Fig. 5. The designed processor core is highly data intensive and does not make use of a complex control structure. A simple controller has been implemented containing an FSM which can be seen close to the center of the die.
Kernel Function
The kernel size has been fixed to 15 × 15 programmable values and each processor core requires 15 kernel values each clock cycle (cc). Therefore, to provide multiple ports for the kernel function and to reduce data transfers over long wires, the kernel function is placed in distributed RAMs throughout the processor cores. Each RAM is connected to the input port through an input register, Fig. 2, and the kernel functions are stored during a dedicated part of the microprogram.
Figure 5: Die photo of the fabricated processor core.
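The column-per-cycle operation of a processor core can be illustrated with a small behavioural model in Python. This is only a sketch under simplifying assumptions: the names `adder_tree` and `core_pixel_operation` are hypothetical, pipeline latency and the LSB truncation of the multiplier outputs are ignored, and the kernel is assumed to be stored in the order that matches the pixel window.

```python
M = 15  # the kernel size is fixed to 15 x 15


def adder_tree(values):
    """Sum a list pairwise, level by level, as a tree of adders would."""
    while len(values) > 1:
        paired = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:
            paired.append(values[-1])  # odd element passes to the next level
        values = paired
    return values[0]


def core_pixel_operation(window, kernel):
    """Model one pixel operation: one kernel column per clock cycle.

    window[row][col] and kernel[row][col] are 15 x 15 lists. Returns the
    accumulated result and the number of clock cycles used.
    """
    acc = 0      # accumulator, cleared at the start of a pixel operation
    cycles = 0
    for col in range(M):
        # 15 multipliers work in parallel on one column (one value per line memory).
        products = [window[row][col] * kernel[row][col] for row in range(M)]
        acc += adder_tree(products)
        cycles += 1
    cycles += 1  # extra cycle for loop counter operations
    return acc, cycles
```

The model returns 16 cycles per pixel operation, matching the count given above, and the accumulated value equals the sum of all 225 products.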
Pixel Memory Bank
To fully utilize the processor capacity, 15 pixel values have to be passed to the processor cores each clock cycle. If pixel values were fed from input ports each clock cycle, a large number of input pads and a high input bandwidth would be required, i.e. 15 × wordlength bits/cc. Even at moderate wordlengths this becomes a substantial number of input ports, with the additional drawback that external processing capacity is required to handle those data transfers. However, by studying Eqn. 1 and Fig. 1 we see that each pixel, except the extreme corner pixels, is used in several calculations. Therefore, a Pixel Memory Bank (PMB) is implemented on-chip, storing all pixel values to be used in consecutive calculations and enabling each value to be read only once [5, 6]. In the designed system an external processor stores the image in an external FIFO and triggers the image convolution processor. During the image convolution the external processor is not involved and can perform other tasks. At the beginning of a convolution, the PMB is filled according to Fig. 6a to allow the first pixel operation to begin. As the kernel moves through the image, three transfers are made for each pixel operation: one pixel value is read representing the lower right corner of the kernel, one value is shifted between the line memories, and the pixel value corresponding to the upper left corner of the kernel, which will not be used in further calculations, is discarded, Fig. 6b. Hereby, the input bandwidth is reduced from 15 pixels/cc to 1 pixel/16cc. At the end of a line the first 14 pixel values of the next line are read at maximum speed, Fig. 6c.
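The read-once property of the PMB can be sketched in Python with a simplified model. It is an assumption-laden illustration, not the chip's memory organization: the 15 line memories and their address processors are replaced by a single flat FIFO, the function name `stream_windows` is hypothetical, and the kernel size `m` is a parameter so that small cases can be checked.

```python
from collections import deque


def stream_windows(image, m):
    """Stream an image in raster order and collect every m x m window.

    Returns (windows, reads), where windows is a list of
    (top_row, left_col, window) tuples and reads counts external pixel reads.
    """
    height, width = len(image), len(image[0])
    span = (m - 1) * width + m        # pixels that must be kept on chip
    buffer = deque(maxlen=span)       # the oldest pixel is discarded automatically
    windows = []
    reads = 0
    for r in range(height):
        for c in range(width):
            buffer.append(image[r][c])  # the only external read of this pixel
            reads += 1
            if r >= m - 1 and c >= m - 1:
                # The newest pixel is the lower right corner of the window;
                # buffer[0] is then its upper left corner.
                window = [[buffer[i * width + j] for j in range(m)]
                          for i in range(m)]
                windows.append((r - m + 1, c - m + 1, window))
    return windows, reads
```

In this model the first m − 1 pixels of each new line are read without producing an output, which corresponds to the 14 pixel values read at maximum speed at the end of a line. With 6-bit pixels, feeding 15 values per cycle from input ports would require 90 bits/cc, whereas the buffered scheme needs one 6-bit pixel per pixel operation.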
Figure 6: The pixel memory bank: (a) Initial filling of the PMB, (b) one pixel operation, (c) new line. [Labels in the figure: New Pixel Value; Throw Away Used Pixel Value; Pixel Values of New Line; Throw Away Used Pixel Values.]
5 The Controller
The processor cores require a very simple controller with just a single control signal while the PMB requires extensive address calculations and loop control. Therefore, a controller synthesizer [9] has been used to synthesize a complete controller from a microprogram. The synthesized control structure consists of the microprogram memory which, in combination with the loop counters and the flag handling unit, executes the microprogram. The controller sends control signals to the processor cores and the address processors as well as read/write signals to memories. In addition, the loop counters are used to calculate the addresses of the kernel RAMs located in the processor cores. The controller responds to external signals and is thus controlled from off-chip.
The size of the synthesized controller depends both on the implementation technique of the control logic and on the structure of the microprogram. The controller architecture resulting in the lowest gate count was a decomposed controller structure with separate microinstruction logic, generating control signals, and sequencing logic, handling subroutine addresses. A separate flag handling module was used to handle conditional statements and loop control.
6 Results
The image convolution processor has been designed and extensively simulated in the Plessey Design System using the Classic70000 cell library. Though it has not been fabricated, the Plessey Design System guarantees the fabricated circuit to coincide with simulations and the design is thus presented in this paper. The processor has been specified, designed and simulated for a clock frequency of 20MHz which, due to the tailored architecture, yields a calculation capacity of >1G MAC/s for image sizes relevant for the application. The useable gate count for the four processor cores, the control structure, and registers in the pixel memory bank totals 42k gates (one gate is four transistors). The final gate count depends significantly on the implementation of RAMs, i.e. the pixel and the kernel memories. For Classic70000 the final useable gate count is 150k gates.
7 Conclusion
To enhance the system performance of an instrument for cereal grain quality assessment, an algorithm-specific DSP for two-dimensional image convolution has been developed. Image convolution requires a high calculation capacity as well as a large amount of data for each calculation. The tailored architecture yields a very high sustained calculation capacity and a low I/O bandwidth. Four tailored processor cores enable four convolutions to be performed in parallel with a total processing capacity of >1G MAC/s at the relatively low clock frequency of 20MHz. This processing capacity surpasses that of existing off-the-shelf processors for this application, including the MVP TMS320C80. The use of an on-chip pixel memory bank reduces the input bandwidth from 15 pixels/cc to 1 pixel/16cc.
References
[1] P. Egelberg et al. "Assessing Cereal Grain Quality with a Fully Automated Instrument Using Artificial Neural Networks Processing of Digitized Color Video Images". In Proc. of the SPIE's Intl. Symp. on Photonic Sensors & Controls for Commercial Applications, 1994.
[2] J. S. Lim. Two-Dimensional Signal and Image Processing. Prentice-Hall, 1990.
[3] W. K. Pratt. Digital Image Processing. John Wiley & Sons, Inc., 1991.
[4] V. Öwall et al. "Custom DSP Design of a GSM Speech Coder". J. of VLSI Signal Processing, Vol. 11, No. 3, 1995, pp. 213-228.
[5] P. A. Ruetz and R. W. Brodersen. "Architectures and Design Techniques for Real-Time Image-Processing IC's". IEEE J. of Solid-State Circuits, Vol. SC-22, No. 2, Apr. 1987, pp. 233-250.
[6] R. W. Berger. "VLSI Structures for Real-Time Image Convolution". In Proc. of IEEE Intl. Conf. on Cybernetics and Society, pp. 676-679, 1985.
[7] F. Jutand et al. "A New VLSI Architecture for Large Kernel Real Time Convolution". In Proc. of IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, pp. 921-924, 1990.
[8] P. A. Ruetz. "The Architectures and Design of a 20MHz Real-Time DSP Chip Set". IEEE J. of Solid-State Circuits, Vol. 24, No. 2, Apr. 1989, pp. 338-348.
[9] V. Öwall. Synthesis of Controllers from a Range of Controller Architectures. PhD thesis, Lund University, Sweden, Dec. 1994.