CNN Image Processing on a Xilinx Virtex-II 6000

Suleyman Malki and Lambert Spaanenburg*

* Department of Information Technology, Lund University, P.O. Box 118, SE-22100 Lund, Sweden, e-mail: {[email protected], [email protected]}.

Abstract - Image processing is one of the popular applications of Cellular Neural Networks. Macro-enriched field-programmable gate-arrays can be used to realize such systems on silicon. The paper discusses a pipelined implementation that supports the handling of gray-level images at 180 to 240 Mpixels per second by exploiting the Virtex-II macros to spatially unroll the local feedback.
1 INTRODUCTION
Over the past years, images have gained increasing interest as a means to compactly represent complex forms of information. Advances in micro-electronic fabrication have led to a sharp decrease in the cost of camera and signal-processing technology, allowing many existing markets to be conquered and many new ones to be opened. Such prospects gave rise to the heralding of image wave computing as the computational paradigm of the future [1].

Since their invention in 1988, Cellular Neural Networks (CNNs) have notably been pursued for solving problems defined in space, like image processing tasks and partial differential equations [2]. Such problems are often characterized by the fact that the information necessary to compute the solution at a certain point in space lies within a finite distance of that point. An example is edge extraction for digital images: whether or not a pixel belongs to an edge depends only on the color of neighboring pixels. After the introduction of the Chua and Yang network, a large number of CNN models have appeared in the literature. Harrer and Nossek have introduced the discrete-time version DT-CNN [3], as will be applied here. Like a cellular automaton [4], a CNN is made of a regularly spaced grid of processing units (cells) that only communicate directly with cells in the immediate neighborhood. Despite the obvious advantages for VLSI implementation, the first CNN hardware has been in analog (the ACE4k chip) and mixed analog/digital technology (the Aladdin visual computer). In this paper we discuss the potential of a fully digital approach.

The implementation relies on the field-programmable gate-array (FPGA), originally popularized as a post-fabrication programmable container for logic circuitry. André DeHon has posed that the archetypical phase of FPGA-based design was characterized by the severe limitation on hardware resources, making it necessary to use every hardware element as much as possible. The popular way to achieve this goal is by unraveling in time: the computational process is scheduled to execute in order on the few computational elements. This is called "temporal computing", in contrast to "spatial computing", where the process is unraveled in area to reduce spurious latency [5]. The facility for spatial computing already makes the FPGA very popular as a hardware accelerator. The other innovation, partial reconfiguration of hardwired modules as an additional level of programming, has been utilized less, though its potential benefit was already illustrated early on [6]. The recent addition of specialized macros for popular digital signal processing functions, together with a modular construction, has created a flexibility that promises an excellent experimentation platform for silicon systems.

Early attempts at the FPGA realization of CNN functionality have shown impressive prospects [7], [8]. This paper continues in that direction by stressing the exploitation of the built-in macros to spatially unroll the local feedback. It assumes that the templates for the CNN are optimized from an initial morphological specification, as originally proposed in [9] and recently systematized to minimize the number of discrete processing steps [10]. Therefore a short network depth may be expected.

The paper is structured as follows. In section 2 we discuss how a CNN can be architected for spatial computing on an FPGA. Section 3 then gives some implementation details for a particular design, the Halvar. Finally we conclude on the FPGA metrics.
2 MAPPING A CNN
The major problem for the implementation of a CNN on a spatial systolic architecture using an FPGA is to define a suitable geometry. Though the CNN nodal equation,
x(k) = \sum_{\delta \in N_r} a_\delta \, y_\delta(k) + \sum_{\delta \in N_r} b_\delta \, u_\delta + i,
assumes all data to be present simultaneously, it would be next to impossible to provide such highways on the limited wiring of the FPGA. Instead we will mix spatial with temporal elements. This can be done in several ways; here we confine ourselves to a number of generally valid observations.
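For reference, a minimal software model of this nodal equation for a single cell might look as follows; it is a plain floating-point sketch with illustrative names and boundary handling, while the fixed-point formats and pipelining of the actual design are discussed later.

#include <stddef.h>

/* One DT-CNN state update over a 3x3 neighborhood N_r (r = 1).
 * y and u are row-major images of size rows x cols, a and b are the
 * feedback and control templates, i_bias is the offset i.
 * Names and boundary handling are illustrative, not taken from the design. */
static double dtcnn_state(const double *y, const double *u,
                          size_t rows, size_t cols, size_t row, size_t col,
                          const double a[3][3], const double b[3][3],
                          double i_bias)
{
    double x = i_bias;
    for (int dr = -1; dr <= 1; dr++) {
        for (int dc = -1; dc <= 1; dc++) {
            size_t rr = row + dr, cc = col + dc;
            if (rr >= rows || cc >= cols)               /* zero boundary */
                continue;
            x += a[dr + 1][dc + 1] * y[rr * cols + cc]; /* feedback term */
            x += b[dr + 1][dc + 1] * u[rr * cols + cc]; /* control term  */
        }
    }
    return x;
}

/* DT-CNN output non-linearity: a hard limiter here; the FPGA design uses
 * an arctan-shaped lookup table instead (see the cell design section). */
static double dtcnn_output(double x)
{
    return (x > 0.0) ? 1.0 : -1.0;
}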
In order to bring the image into the system, we assume a serial pipeline. The image (or in effect a cut through the image with a width conforming to the number of pixels the CNN system can handle in parallel) is entered on a line-by-line basis. For the purpose of the discussion we identify such lines by subsequent characters: A, B, C, ... (Figure 1c). In order to imitate direct image access, we have to read a landscape picture from right to left. Then, if a column in the CNN structure contains line B, the column to the left will contain the next line C and the column to the right the previous line A (Figure 1b). In other words, at any given moment the CNN contains a part of the image as seen by viewing the picture.

Figure 1: Numbering of CNN cells (a), pixels (c) and in combination (b)

As this snapshot is moving through the CNN, it will overrun the previous shots in a FIFO-like fashion. This gives the facility to unroll the computation of the basic CNN equation in several ways. Looking back at the basic formula for the output of a single neuron, three contributions can be distinguished:

· \sum_{\delta \in N_r} a_\delta \, y_\delta(k) for the feedback a of the present and surrounding states y,
· \sum_{\delta \in N_r} b_\delta \, u_\delta for the control b by the applied pixels u, and
· i for the offset for the present neuron.

The feedback contribution is based on the cellular grid; it does not move with the image, but it is involved in the iterations towards convergence. The control contribution is valid for the snapshot and depends neither on the grid nor on the iterations, while the offset simply places the summed contributions at the right position for the final discrimination. In short, we have three dimensions to consider: (a) grid structure, (b) image size, and (c) number of iterations. But these dimensions are not independent, as the feedback is intimately coupled to the iterations. A likely architecture is based on the following "Hänsl und Gretchen" algorithm (see also Figure 2):

    for a pixel line of limited length do {
        compute the constant contribution U + i
        pass U + i with Y to the next layer
        perform an iteration
        while there are stages left do {
            pass U + i and the iteration result to the next layer
            perform an iteration
        }
        output the neuron states to the image store
    }
Figure 2: Dataflow of U- and Y-matrix (a U-cell performing the precalculation, followed by Y-cells 1 to 5 performing iterations 1 to 5).

The underlying idea is that the 2-dimensional computation of the local neuron is flattened into a series of 1-dimensional computations by dropping intermediate results on the computational path. This is supposed to mimic how Hänsl and Gretchen planned to find their way out of the forest by leaving breadcrumbs on the path.
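A minimal C sketch of this per-line dataflow, reduced to a 1-dimensional neighborhood for brevity, is given below. The constant contribution (the breadcrumb) is computed once per pixel and carried along unchanged, while each stage performs exactly one iteration. All names are illustrative assumptions; this is not the hardware description.

#define LINE_LEN 24      /* pixels handled in parallel, as in the design     */
#define STAGES   5       /* one stage per iteration (cf. Figure 2)           */

/* One iteration step for one pixel, 1-D neighborhood for brevity:
 * new output from the old outputs and the precalculated constant k.         */
static double iterate_pixel(const double y[LINE_LEN], int p,
                            const double a_row[3], double k)
{
    double x = k;                                 /* B*U + i, the breadcrumb  */
    for (int d = -1; d <= 1; d++)
        if (p + d >= 0 && p + d < LINE_LEN)
            x += a_row[d + 1] * y[p + d];         /* feedback contribution    */
    return (x > 0.0) ? 1.0 : -1.0;
}

/* "Hänsl und Gretchen" pipeline for one pixel line:
 * precalculate U + i once, then run it through the iteration stages.         */
static void process_line(const double u_line[LINE_LEN], double y_line[LINE_LEN],
                         const double a_row[3], const double b_row[3],
                         double i_bias)
{
    double k[LINE_LEN];
    for (int p = 0; p < LINE_LEN; p++) {          /* precalc stage (U-cell)   */
        double s = i_bias;
        for (int d = -1; d <= 1; d++)
            if (p + d >= 0 && p + d < LINE_LEN)
                s += b_row[d + 1] * u_line[p + d];
        k[p] = s;
    }
    for (int stage = 0; stage < STAGES; stage++) {/* Y-cell stages            */
        double next[LINE_LEN];
        for (int p = 0; p < LINE_LEN; p++)
            next[p] = iterate_pixel(y_line, p, a_row, k[p]);
        for (int p = 0; p < LINE_LEN; p++)
            y_line[p] = next[p];                  /* pass to the next stage   */
    }
}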
3 CELL DESIGN
A gray-level CNN cell needs I/O information from eight neighbors to find the output for the next iteration. This is achieved by letting every cell in the CNN network contain image data from three pixels, which limits the cell communication: one cell only requires contact with the cell above and the cell below in order to process a pixel to the next iteration. The time required for processing one pixel into the next iteration is called one iteration cycle. Figure 3 shows how data flows over the chip and which values are used to calculate the next pixel. The figure shows one pipeline row in the FPGA on the horizontal axis and time on the vertical axis (one iteration at every line). The arrows go from the used pixels to the calculated pixels. The index tells which iteration level the pixel has reached, and the input image row looks like Figure 1c. (A rough software view of such a cell is sketched below.)
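Assuming this organization, a cell's local state can be pictured roughly as in the following sketch (hypothetical names, software view only): three pixel slots plus links to the two vertical neighbors give access to the 9 pixels of the 3x3 neighborhood.

#include <stdint.h>

/* Rough software model of one cell in a pipeline stage. Each cell holds the
 * pixels of three consecutive image lines (A, B, C) at its own position, so
 * together with the cells directly above and below it covers the 9 pixels
 * of the 3x3 neighborhood needed for one update. */
typedef struct cnn_cell {
    int8_t  pixel[3];            /* pixels A, B, C as 1.7 fixed-point      */
    int32_t k[3];                /* feedback constants (K Memory), 10.11   */
    struct cnn_cell *up;         /* cell holding the position above        */
    struct cnn_cell *down;       /* cell holding the position below        */
} cnn_cell;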
In the figure it is assumed that the feedback constant has already been calculated; based on this assumption, it shows how per vertical line 9 pixels from 3 cells are used to calculate a new pixel of the Y image for the next iteration. One cell contains three pixels and calculates the new value for one pixel and one iteration. This takes 15 clock cycles, which is what is referred to as one iteration cycle. During this time, each neighboring pixel, and the pixel itself, must be multiplied by its template value, the results summed up, and the feedback constant added. From this sum a new value is calculated by a threshold function.
Time step   Stage 1     Stage 2     Stage 3     Stage 4   Stage 5
    1       A0
    2       B0 A0
    3       C0 B0 A0    A1
    4       D0 C0 B0    B1 A1
    5       E0 D0 C0    C1 B1 A1    A2
    6       F0 E0 D0    D1 C1 B1    B2 A2
    7       G0 F0 E0    E1 D1 C1    C2 B2 A2    A3
Figure 3: Fundamental systolic operation

Note that a pixel only iterates every other iteration cycle and is only used as a "help pixel" at every other iteration. Consequently, every cell adds a latency of two iteration cycles per pixel. All data and signals belonging to a pixel (feedback constant, template select, template control and frame control) must also have a latency of two iteration cycles. This requires double flip-flops with enable in every data path.

Because of the fixed geometry of a CNN structure and the normalized character of the input data, there is no need for floating-point arithmetic. As a fixed-point number representation suffices, all multiplications can be built from the Virtex-II multiplier macros. As a result, a CNN node can be very area efficient and needs only a Block-Select macro, a multiplier macro and the attached LUTs. Image data is represented with 8-bit (1.7) fixed-point signed values. The templates are represented with 8-bit (4.4) fixed-point signed values. The bias is represented with a 16-bit (9.7) fixed-point signed value. The feedback constants, which flow over the chip between iteration steps, are represented as full 21-bit (10.11) fixed-point signed values.

This brings us to the actual design of the cells. Due to page limitations we will only present the schematic of the Y cell; the U cell has a similar structure. The K Memory stores three feedback constants, since each cell contains three pixels A, B and C, where C is already calculated, B is currently being calculated and A is to be calculated in the next iteration cycle. The 21→8 selector chooses 8 bits from the feedback constant to send to the next cell.
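The arithmetic of one iteration cycle can then be sketched as below: a 1.7 pixel multiplied by a 4.4 template value yields a 5.11 product that is already aligned with the 21-bit 10.11 accumulator, so only sign extension is needed before adding. The helper names are ours; the code is a software approximation, not the hardware description.

#include <stdint.h>

/* Signed fixed-point formats used in the design (two's complement):
 *   pixels             8 bit,  1.7
 *   template values    8 bit,  4.4
 *   bias              16 bit,  9.7
 *   feedback constant 21 bit, 10.11
 */
typedef int8_t  q1_7;
typedef int8_t  q4_4;
typedef int32_t q10_11;   /* only 21 bits of this are kept on chip */

/* One multiply-accumulate: 1.7 * 4.4 = 5.11, sign-extended to 10.11. */
static q10_11 mac(q10_11 acc, q1_7 pixel, q4_4 coeff)
{
    int32_t product = (int32_t)pixel * (int32_t)coeff;   /* 5.11 value */
    return acc + product;                                /* 10.11 sum  */
}

/* One iteration cycle of a Y cell: nine multiply-accumulates on top of the
 * precalculated feedback constant k (holding B*U + i), all within the
 * 15-clock-cycle budget of the hardware. */
static q10_11 iteration_cycle(const q1_7 y[9], const q4_4 a[9], q10_11 k)
{
    q10_11 acc = k;
    for (int n = 0; n < 9; n++)
        acc = mac(acc, y[n], a[n]);
    return acc;           /* goes on to the threshold address converter */
}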
Figure 4: Schematic of the Y cell (K Memory, Data Memory, multiplier, sign extension, accumulator, normalization, main controller, circular address and template update blocks)
The Virtex-II macro Dual Port Data Block Memory is used for storing the threshold-function lookup table, the control template data and the pixel data in 8-bit words. Template data is generally addressed at port B, and pixel data plus threshold lookup are addressed at port A, in Read-after-Write mode. The multiplier is also a built-in primitive in the Virtex-II. Port A is used for pixel data (1.7 fixed-point) and port B is used for template data (4.4 fixed-point). The output is thus a 5.11 fixed-point value, which is sign-extended to 10.11 fixed-point. The accumulator collects the 21-bit sign-extended values from the multiplier. It has a bypass signal, controlled by the Control Block, for reset. The Threshold Address Converter provides for normalization by converting the 21-bit wide sum to an 11-bit wide lookup address for the threshold. The function implemented is an arctan function. The Main controller has a 15 clock cycle period and steers most of the parts in a cell. The Circular Address block converts a request for a pixel, A, B or C, to an address; this is updated every iteration cycle. The Template Update block counts iteration cycles for a template update.
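As an illustration of the normalization and threshold steps, the sketch below reduces the 21-bit (10.11) sum to an 11-bit lookup address and fills the table with an arctan-shaped output in the 1.7 pixel format. The saturation and scaling constants are assumptions for illustration; the paper does not specify the exact mapping.

#include <stdint.h>
#include <math.h>

#define LUT_BITS 11
#define LUT_SIZE (1 << LUT_BITS)

/* Reduce the 21-bit (10.11) accumulator sum to an 11-bit lookup address by
 * saturating to the 21-bit range and keeping the top bits (assumed scheme). */
static uint16_t threshold_address(int32_t sum_10_11)
{
    const int32_t max = (1 << 20) - 1;
    const int32_t min = -(1 << 20);
    if (sum_10_11 > max) sum_10_11 = max;
    if (sum_10_11 < min) sum_10_11 = min;
    return (uint16_t)((sum_10_11 >> 10) + (1 << 10));   /* 0 .. 2047 */
}

/* Fill the Block RAM image of the threshold table with an arctan shape,
 * output in 1.7 format; the steepness factor is an illustrative choice. */
static void build_threshold_lut(int8_t lut[LUT_SIZE])
{
    for (int addr = 0; addr < LUT_SIZE; addr++) {
        double x = (addr - LUT_SIZE / 2) / (double)(LUT_SIZE / 2); /* -1..1 */
        double f = atan(4.0 * x) / atan(4.0);                      /* -1..1 */
        lut[addr] = (int8_t)lround(f * 127.0);                     /* 1.7   */
    }
}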
4 CONCLUSIONS
The presented design takes 10M equivalent gates according to ISE 5.2; it uses all the available Block RAMs and multipliers, but consumes just 50% of the available slices and somewhat more than half of their logic content after synthesis with Synplify. This is primarily due to interconnect limitations. According to ModelSim simulation of the back-annotated design after mapping on a Xilinx Virtex-II 6000, the system clock operates at 110 MHz. With 24 pixels handled in parallel during an iteration of 15 clock cycles, the effective speed becomes 180 Mpixels per second.

For a complete image processing system the communication between the CNN and the image RAM will be crucial. Each new pixel requires the transfer of 8 bits of initial activation pattern u, 8 bits of initial output y and 8 bits of processed pixel. This brings the required maximum data transfer to 540 Mbytes per second.

Three variations on the same spatial architecture have been built and analyzed. Details can be found at http://www.it.lth.se/vlsi/vlsi2002/final.html. They show a different balance between cycles per iteration and maximum clock speed. In Ilva, a 10-cycle iteration raises the throughput to 205 to 240 Mpixels per second, while in Wickie a 13-cycle iteration proved to be worse. Of the ways to improve the current design, a strict interleaving of the micro cycles seems the most promising at the moment. Furthermore, the potential improvement by in-line reconfiguration is still untried.
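For reference, the quoted throughput and bandwidth figures follow directly from these numbers:

    24 pixels x 110 MHz / 15 cycles = 176 Mpixels/s (rounded to 180 above)
    3 x 8 bits x 180 Mpixels/s = 540 Mbytes/s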
ACKNOWLEDGMENTS

The authors mention with pleasure the collaboration with Zalan Blenessy, Irina Fältman, Marcus Hast, Johan Hoberg, Tory Li, Andreas Lundgren, Erik Montnémery, Anders Rångevall, Johannes Sandvall, and Milan Stamenkovic in the development of the basic architecture and the further design of the three variations: Halvar, Ilva and Wickie. Furthermore, the advice of Joe G. Thompson of Advanced Principles on the use of ISE 5.2 is sincerely appreciated.

REFERENCES

[1] T. Roska, "Computational and Computer Complexity of Analogic Cellular Wave Computers", Proceedings 7th IEEE Workshop on CNNs and their Applications, R. Tetzlaff, 323-338, World Scientific (Singapore), 2002.
[2] L. O. Chua and L. Yang, "Cellular Neural Networks: Theory", IEEE Transactions on Circuits and Systems, 35, 1257-1272 and 1273-1290, October 1988.
[3] H. Harrer and J.A. Nossek, "Discrete-Time Cellular Neural Networks", International Journal of Circuit Theory and Applications, 20, 453-467, September 1992.
[4] K. Preston Jr. and M.J.B. Duff, Modern Cellular Automata: Theory and Applications, Plenum Press, New York, 1984.
[5] A. DeHon, Reconfigurable Architectures for General-Purpose Computing, Ph.D. Thesis, MIT, Cambridge (USA), 1996.
[6] J. Villasenor, C. Jones, and B. Schoner, "Video Communications Using Rapidly Reconfigurable Hardware", IEEE Transactions on Circuits and Systems for Video Technology, 5, 565-567, December 1995.
[7] T. Uchimoto, H. Hjime, Y. Tanji and M. Tanaka, "Design of DTCNN image processing" (in Japanese), Vol. J84-D-2, No. 7 (2001) 1464-1474.
[8] Z. Nagy and P. Szolgay, "Configurable Multi-Layer CNN-UM Emulator on FPGA", Proceedings 7th IEEE Int. Workshop on Cellular Neural Networks and their Applications (2002) 164-171.
[9] M.H. ter Brugge, J.A.G. Nijhuis and L. Spaanenburg, "Transformational DT-CNN design from morphological specifications", IEEE Transactions on Circuits and Systems I, 45, nr. 9 (1998) 879-888.
[10] M.H. ter Brugge, Morphological Design of Discrete-Time Cellular Neural Networks, Ph.D. thesis, Rijksuniversiteit Groningen, Groningen (Netherlands), 2003.