Image Processing Architecture for Local Features Computation

Javier Díaz¹, Eduardo Ros¹, Sonia Mota², and Richard Carrillo¹

¹ Dep. Arquitectura y Tecnología de Computadores, Universidad de Granada, Spain
² Dep. Informática y Análisis Numérico, Universidad de Córdoba, Spain
{jdiaz, eros, rcarrillo}@atc.ugr.es, [email protected]
Abstract. Quadrature filters are widely used in the Computer Vision community because of their biological support and because they allow an efficient coding of local features. They can be used to estimate local energy, phase and orientation, or even to classify image textures. The drawback of this image decomposition is that it requires intensive pixel-wise computations, which makes it impossible to use in most real-time applications. In this contribution we present a high performance architecture capable of extracting local phase, orientation and energy at a rate of 56.5 Mpps. Taking into account that FPGA resources are constrained, we have implemented a steerable filter bank (instead of a Gabor filter bank) based on second order Gaussian derivatives. This method has the advantage that the filters are 2-D separable and each image orientation can be extracted from a basic set of seven filters. We present the proposed architecture and analyze the degradation introduced by fixed point arithmetic quantization. We show the resource consumption and the performance and, finally, we present some results from the developed system.

Keywords: Real-time image processing, quadrature filters, local image phase, orientation and energy, fixed point arithmetic.
1 Introduction

The analysis of image features such as colour, edges, corners, phase, orientation or contrast provides significant cues in the process of image understanding. These features are used as the basis for higher level tasks such as object segmentation, object recognition, texture analysis, motion and stereo processing, image enhancement and restoration, or special effects [1], [2]. In this contribution we focus on three basic image properties: phase, energy and orientation. They have been extensively used in computer vision [3], [4] and, as we will see, they can be obtained from the same set of image operations. Local orientation is an early vision feature that encodes the geometric information of the image. Since orientation is a 180º periodic measurement, its representation is ambiguous between the two possible directions along each orientation. We solve this uncertainty using the double angle representation

P.C. Diniz et al. (Eds.): ARC 2007, LNCS 4419, pp. 259-270, 2007. © Springer-Verlag Berlin Heidelberg 2007
proposed by G. H. Granlund in [5], which simply doubles the angle of each orientation estimate. Energy and phase information have been widely used in the literature. Energy indicates the areas of higher contrast (and is therefore a luminance-dependent feature). Phase information, as stated in the literature [6], is more stable against changes in contrast, scale or orientation, and can be used to interpret the kind of contrast transition at its maximum [7]. It has been applied in numerous applications, especially motion [8], [9] and stereo processing [10], [11]. These three features can be extracted with the same preliminary operations using Quadrature filters, which are based on symmetric and antisymmetric filter pairs such as Gabor filters [12] or Gaussian derivatives [13] and allow decomposing the image into structural and luminance information. This contribution is structured as follows. Section 2 presents the computer vision theory required for the development of the presented architecture. Section 3 evaluates the accuracy degradation due to the use of fixed point arithmetic. Section 4 covers the hardware architecture details, the resources and the performance evaluation. Finally, Section 5 describes our main conclusions and future work.
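The double-angle representation mentioned above can be sketched as follows (an illustrative NumPy snippet; the function names are ours, not from the paper): each orientation θ ∈ [0, π) is mapped to the vector (cos 2θ, sin 2θ), so that θ and θ + π collapse to the same point and orientations can be averaged without wrap-around artifacts.

```python
import numpy as np

def to_double_angle(theta):
    """Map orientations in [0, pi) to unit vectors via angle doubling."""
    return np.cos(2 * theta), np.sin(2 * theta)

def average_orientation(thetas):
    """Average orientations in the double-angle domain, then halve back."""
    c, s = to_double_angle(np.asarray(thetas))
    return (0.5 * np.arctan2(s.mean(), c.mean())) % np.pi

# 0 and pi - 0.01 are nearly the same orientation; a naive arithmetic mean
# would give ~pi/2, while the double-angle mean stays near 0 (mod pi).
print(average_orientation([0.0, np.pi - 0.01]))
```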
2 Quadrature Filters Based on Steerable Gaussian Derivatives for Local Orientation, Energy and Phase Computation

A Quadrature filter is a complex filter whose output can be decomposed into phase and magnitude. If we write h(x; f₀) for a complex filter tuned to a spatial frequency f₀ along the x axis, then:
h(x; f₀) = c(x; f₀) + j·s(x; f₀)    (1)
where c and s represent the even and odd components of the filter respectively, fulfilling the condition that the even and odd filters are Hilbert transforms of each other. The convolution with the image I(x) is expressed in equation (2):
I * h = ∫ I(ξ)·h(x − ξ; f₀) dξ = C(x) + j·S(x) = ρ(x)·e^{jφ(x)}    (2)
where ρ(x) denotes the amplitude (which we will also call magnitude, and energy when referring to its square value), φ(x) is the phase, and C(x) and S(x) represent the responses of the even and odd filters respectively. Although local energy can easily be extended to a bidimensional signal, phase information is intrinsically one-dimensional. Therefore, phase can only be computed for image areas which can be locally approximated as lines or edges, and their main orientation needs to be estimated first. An extensive analysis of the available theories can be found in [14]. In conclusion, orientation needs to be computed to properly estimate phase. This relation between features makes Quadrature filters based on Gabor or Gaussian derivatives a good approach, because they are oriented filters which analyze the image at discrete orientations.
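The decomposition in equation (2) translates directly into code. The following sketch (illustrative floating-point NumPy, not the paper's fixed-point implementation) recovers amplitude, energy and phase from the even and odd filter responses:

```python
import numpy as np

def quadrature_features(C, S):
    """Amplitude, energy and phase from even (C) and odd (S) responses.

    Follows Eq. (2): I * h = C + jS = rho * exp(j*phi).
    """
    rho = np.hypot(C, S)      # amplitude (magnitude)
    energy = rho ** 2         # local energy = squared magnitude
    phi = np.arctan2(S, C)    # local phase in (-pi, pi]
    return rho, energy, phi

rho, energy, phi = quadrature_features(np.array([3.0]), np.array([4.0]))
print(rho, energy, phi)  # amplitude 5, energy 25, phase ≈ 0.927 rad
```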
Furthermore, Quadrature filters are based on convolutions, which are hardware-friendly operations that fit quite well on FPGA devices.

2.1 Steerable Gaussian Derivatives

Given the constraints of real-time embedded systems, we focus on Gaussian derivatives because they represent a good trade-off between accuracy and resource consumption. Gaussian derivatives allow the computation of any particular orientation from a basic set of separable kernels; this property is usually referred to as steerability [13]. In our architecture we use the second derivative order with a 9-tap kernel; the peak frequency is f₀ = 0.21 pixels⁻¹ and the bandwidth is β = 0.1 pixels⁻¹. As presented in [13], this derivative order provides high accuracy at a reasonable system cost. The kernel equations can be found in [13] and their shapes are presented in Fig. 1. Note that one of the advantages of Gaussian derivatives (compared to Gabor filters) is that they are computed from a set of 2-D separable convolutions, which significantly reduces the required resources.
Fig. 1. Second order Gaussian derivative separable basis set. From left to right: filters Gxx, Gxy, Gyy, Hxx, Hxy, Hyx, Hyy. Subscripts stand for the derivative variables; G indicates a Gaussian filter and H its Hilbert transform. With this set of filters, we can estimate the output at any orientation by simply combining the basis set outputs. This allows building oriented Quadrature filter banks at any desired orientation.
After convolving the image with the filters presented in Fig. 1, we steer the basis to estimate the Quadrature filters at some predefined orientations. In our system, we compute the Quadrature filter outputs at 8 different orientations, which properly sample the orientation space (for further details cf. [14]). The filter steering follows equation (3), where θ stands for the orientation:

cθ = cos²(θ)·Gxx − cos(θ)sin(θ)·Gxy + sin²(θ)·Gyy
sθ = cos³(θ)·Hxx − 3cos²(θ)sin(θ)·Hxy − 3cos(θ)sin²(θ)·Hyx + sin³(θ)·Hyy    (3)
The computation stages described in this section can be summarized as follows:
1. Computation of the second order Gaussian derivatives Gxx, Gxy and Gyy and their Hilbert transforms Hxx, Hxy, Hyx and Hyy, using separable convolutions.
2. Linear filter steering at 8 orientations using equation (3).
After these two stages, for each pixel we have 8 complex numbers corresponding to the even-odd filter output pairs at the different orientations. Section 2.2 describes how to interpolate these data to properly estimate the orientation, and then the phase and energy at this local image orientation.
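The two stages above can be sketched in floating-point Python as follows. This is illustrative only: the Gaussian taps and σ below are placeholders, not the f₀ = 0.21 pixels⁻¹ kernels of [13], and the odd (Hilbert) basis is omitted for brevity; only the even part of equation (3) is steered.

```python
import numpy as np
from scipy.ndimage import convolve1d

def gaussian_taps(n=9, sigma=1.8):
    x = np.arange(n) - n // 2
    g = np.exp(-x**2 / (2 * sigma**2))
    return x, g / g.sum()

def even_basis(img, n=9, sigma=1.8):
    """Separable 2nd-order Gaussian derivative responses Gxx, Gxy, Gyy."""
    x, g = gaussian_taps(n, sigma)
    g1 = -x / sigma**2 * g                       # 1st derivative taps
    g2 = (x**2 / sigma**2 - 1) / sigma**2 * g    # 2nd derivative taps
    # Separable 2-D convolution: filter rows first, then columns.
    Gxx = convolve1d(convolve1d(img, g2, axis=1), g,  axis=0)
    Gxy = convolve1d(convolve1d(img, g1, axis=1), g1, axis=0)
    Gyy = convolve1d(convolve1d(img, g,  axis=1), g2, axis=0)
    return Gxx, Gxy, Gyy

def steer_even(Gxx, Gxy, Gyy, theta):
    """Even filter response at orientation theta, as in Eq. (3)."""
    c, s = np.cos(theta), np.sin(theta)
    return c**2 * Gxx - c * s * Gxy + s**2 * Gyy

img = np.random.rand(32, 32)
Gxx, Gxy, Gyy = even_basis(img)
# Eight orientations sampling [0, pi), as in the architecture.
outputs = [steer_even(Gxx, Gxy, Gyy, k * np.pi / 8) for k in range(8)]
```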
2.2 Features Interpolation from a Set of Oriented Filters

If we consider 8 oriented filters (computed using Gabor filters or Gaussian derivatives), it is likely that the local orientation of some features does not coincide with this discrete set of orientations. In this case, we need to interpolate the feature values from the set of filter outputs. In this paper we focus on the tensor-based Haglund approach [15], in which the information is computed from a local tensor that projects the different orientations, as described in equations (4-6). Note that these equations, as described in [14], are a modification of those presented in [15] to improve hardware feasibility. In these equations, j stands for the complex unit and i indexes the orientation.
Elocal = (1/2) Σi Ei,  where Ei = ci² + si² and the sums run over the N = 8 orientations    (4)

θlocal = arg( Σi (ci² + si²)^(2/3) · exp{j2θi} )    (5)

c = Σi ci,  s = Σi si,  Plocal = arctan(s/c)    (6)
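A floating-point sketch of this interpolation might look as follows (illustrative code, not the paper's fixed-point datapath). As in the hardware version, energy rather than a power of the magnitude (Haglund's original formulation) is used as the orientation weight, and the halving of the argument undoes the double-angle representation; arctan2 is used so the phase covers the full (-π, π] range.

```python
import numpy as np

def interpolate_features(c, s, thetas):
    """Local energy, orientation and phase from N oriented filter pairs.

    c, s: even/odd responses at each orientation, shape (N, ...).
    Follows Eqs. (4)-(6); energy is used as the orientation weight,
    as in the hardware version of the architecture.
    """
    E = c**2 + s**2
    E_local = 0.5 * E.sum(axis=0)                            # Eq. (4)
    z = (E * np.exp(2j * thetas.reshape(-1, 1))).sum(axis=0)
    theta_local = (0.5 * np.angle(z)) % np.pi                # Eq. (5), double angle undone
    P_local = np.arctan2(s.sum(axis=0), c.sum(axis=0))       # Eq. (6)
    return E_local, theta_local, P_local

thetas = np.arange(8) * np.pi / 8
c = np.random.randn(8, 16)
s = np.random.randn(8, 16)
E, th, P = interpolate_features(c, s, thetas)
```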
3 Bit-Width Analysis and Evaluation of the Accuracy Using Fixed Point Arithmetic

This section addresses the effect of the arithmetic representation and bit-width quantization on the embedded system design. On one hand, current standard processors use floating point representation, which has a very large data range and precision. On the other hand, digital circuits usually use fixed point representation with a very restricted number of bits. Unfortunately, embedded systems with specific circuitry for each computation can only afford floating point representation at critical stages [16], and most of the computation should be done using fixed point arithmetic. The quantization effects should be properly studied to ensure that the data dynamic range is properly adjusted and the quantization noise is kept small. Furthermore, we need to reduce the variables' bit-widths to limit resource consumption. Each of the algorithm's main stages requires a careful bit-width design of its substages, but here we only highlight the main design decisions. Therefore, in this section we focus on the required bit-width of the following stages:
1. Convolution kernel weights of Gxx, Gxy, Gyy, Hxx, Hxy, Hyx and Hyy.
2. Convolution outputs of this filter bank.
3. Oriented quadrature filters after the steering operations.
4. Trigonometric functions (sin, cos and their powers or products) used at the steering and interpolation stages.
5. Main non-linear operations: division and arctan.
6. Estimated goal features: local energy, phase and orientation.

This analysis has been performed using the MCode for DSP Analyzer tool described in [14]. The error metric used to evaluate the system degradation is the Signal to Quantization Noise Ratio (SQNR). Regarding the first point of the list, the convolution kernel weights, the methodology is the following: we keep all the computation in double precision floating point representation except for the kernel values, and we compare the results with the complete software version. The results are shown in Fig. 2.
Fig. 2. System SQNR vs. Gaussian derivative kernel quantization. Energy and phase show approximately linear behavior. For orientation, bit-widths larger than 11 bits do not produce any significant accuracy improvement. This is because, in the hardware simulator, we use energy instead of magnitude as the weight for equation (5), which makes it impossible to achieve a SQNR higher than 57.4 dB. Nevertheless, there is no significant degradation, and the quality of the system is already very high using 11 bits at this stage.
Our first conclusion is that kernel values wider than 11 bits do not improve the accuracy of the orientation, but they do for the energy and phase features. The next stage is the analysis of the bit-width used to store the outputs produced by convolution with these quantized kernels. The results are illustrated in Fig. 3, where different kernel bit-widths are represented by different curves for each image feature. In this experiment we have considered (without loss of generality) 8 bits for the integer part and a variable number of bits for the fractional part of the convolution outputs (represented on the x axis). This is a general method because proper scaling was used. The main conclusion is that, in order to take full advantage of larger convolution kernel bit-widths, a larger bit-width should be used for the convolution outputs. At this point we need to define some accuracy goals: we consider SQNR values larger than 30 dB acceptable. In Table 1 we show the SQNR of our system for
Fig. 3. SQNR (dB) evolution for different kernel and convolution output quantization configurations. The y-axis represents the SQNR values and the x-axis the fractional-part bit-width of the convolution outputs; the integer part is fixed to eight bits. Four curves are presented in each plot, representing different kernel bit-width configurations: triangles 9 bits, stars 11, squares 13 and circles 15. Note that in the orientation plot the top curve combines the stars, squares and circles data.
two well-balanced, low-area design configurations. Fig. 2 already showed that phase estimation is the most demanding feature (in terms of bit-width) to achieve a high SQNR; for instance, with 11 bits we obtain approximately SQNR = 40 dB. The results presented in Table 1 show that the most sensitive stages are the kernel and convolution output quantizations. Trigonometric and non-linear functions, as well as the image features themselves, can be quantized with a relatively small number of bits without extra accuracy degradation, as can be deduced from the SQNR results. From this table we also conclude that, due to the high SQNR obtained, fixed point arithmetic can properly be used for the design of embedded systems.
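The quantization methodology can be emulated in software. The sketch below is illustrative (the paper used the MCode for DSP Analyzer tool [14]): it rounds a signal to a signed fixed-point grid with a given integer/fractional bit split and measures the resulting SQNR against the floating-point reference, showing the roughly 6 dB gained per extra fractional bit.

```python
import numpy as np

def quantize(x, int_bits, frac_bits):
    """Round to a signed fixed-point grid with the given bit split."""
    scale = 2.0 ** frac_bits
    lo = -(2.0 ** int_bits)
    hi = 2.0 ** int_bits - 1.0 / scale
    return np.clip(np.round(x * scale) / scale, lo, hi)

def sqnr_db(reference, quantized):
    """Signal-to-quantization-noise ratio in dB."""
    noise = reference - quantized
    return 10 * np.log10(np.sum(reference**2) / np.sum(noise**2))

rng = np.random.default_rng(0)
signal = rng.uniform(-1, 1, 10_000)
for frac_bits in (5, 8, 11):
    print(frac_bits, round(sqnr_db(signal, quantize(signal, 8, frac_bits)), 1))
```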
Table 1. SQNR at different stages for two well-defined configuration examples. Example A uses 11 bits for the kernel quantization and 9 bits for the convolution stage (only one fractional bit). Example B uses 13 bits for the kernels and 11 for the convolution outputs. Each stage (0 to 3) reports the SQNR of the energy, orientation and phase after consecutive bit-width quantizations. At stage 0, only the kernel and convolution bits have been quantized. Stage 1 represents the SQNR after the previous quantization plus the quantization of the oriented quadrature filters with the same number of bits as the convolution outputs. Stage 2 shows the results after adding the next quantization stage, the trigonometric and non-linear functions (arctan and division); trigonometric functions require only 9 bits, while the arctan function uses (2·convolution-output bit-width + 2) bits to avoid degrading the system. Finally, stage 3 shows the SQNR of the whole system quantized using fixed-point data representation. Orientation and phase are angles and therefore their dynamic ranges are well defined (0 to π for orientation and -π to π for phase); we have used 9 bits to represent their values. For the energy, the dynamic range depends on the convolution output bit-width; we have used (2·convolution-output bit-width) bits to achieve satisfactory results.
Stage                                                      SQNR (dB)
                                                           Energy    Orientation    Phase
A.0  Kernel and convolution outputs quantization           40.739    36.323         27.036
A.1  Oriented Quadrature filters quantization              39.853    35.404         26.061
A.2  Trigonometric and non-linear functions quantization   39.849    35.403         26.061
A.3  Image features results quantization                   39.842    35.027         26.008
B.0  Kernel and convolution outputs quantization           53.234    48.238         39.194
B.1  Oriented Quadrature filters quantization              48.560    46.923         40.512
B.2  Trigonometric and non-linear functions quantization   48.557    46.922         40.512
B.3  Image features results quantization                   48.550    46.272         40.163
4 System Architecture

We have designed a very fine-grained pipelined architecture whose parallelism grows along the stages to keep our throughput goal of one pixel output per clock cycle. The main circuit stages are described in Fig. 4, each of them finely pipelined to achieve our
throughput goal. Note that the coarse parallelism level of the circuit is very high: for example, stage S1 requires 16 parallel paths (eight orientations times two filter types, even and odd). Stage S0 is composed of 7 separable convolvers corresponding to the convolutions with the Gxx, Gxy, Gyy, Hxx, Hxy, Hyx and Hyy kernels. The separability of the convolution operation allows computing the convolution first by rows and then by columns. This configuration requires only 8 dual-port memories shared by all the convolvers; they are arranged storing one image row per memory to allow parallel data access at each column. Stage S0 is finely pipelined in 24 microstages, as indicated in brackets in the upper part of Fig. 4.
Fig. 4. Image features processing core. Coarse pipeline stages are represented at the top and superpipelined scalar units at the bottom. The number of parallel datapaths increases according to the algorithm structure. The whole system has more than 59 pipelined stages (without counting other interfacing hardware controllers such as memory or video input/output interfaces). This allows computing the three image features at one estimation per clock cycle. The number of substages for each coarse-pipeline stage is indicated in brackets in the upper part of the figure.
Stage S1 computes the oriented Quadrature filters. This is achieved by multiplying each output of stage S0 by the interpolation weights shown in equation (3). These coefficients are stored in distributed registers for parallel access. In a sequential approach they could be stored in embedded memory, but the required data throughput demands fully parallel access to all of them, which makes such a configuration hard to implement.
Stage S2 computes the image features from this set of even and odd complex filters. Figure 4 shows the three different computations which calculate the local features of orientation, phase and energy. Note that this last stage S2 has different latencies (7, 27 and 30 stages) for the energy, phase and orientation features. Nevertheless, since we need all of them at the same time, we use delay buffers to synchronize the outputs.

4.1 Hardware Resources Consumption Analysis

Based on the analysis of Section 3, we have chosen the bit-widths described in Table 1, configuration B. The study described in the previous sections can be summarized as follows: we use 13 bits for the kernel coefficients and 11 for the convolution outputs; the trigonometric LUT values use 9 bits; intermediate values are properly scaled to keep the data accuracy; the final orientation and phase results use 9 bits, and the energy 22 bits. The final SQNRs of the energy, orientation and phase after the full quantization process are 48, 46 and 40 dB respectively. This accuracy fulfils the requirements to use the system as a coprocessor without any significant degradation, and Fig. 3 shows that this choice represents a good trade-off between accuracy and resource consumption. The whole core has been implemented on the RC300 board of Celoxica [17]. This prototyping board is provided with a Xilinx Virtex II XC2V6000-4 FPGA as processing element, and also includes video input/output circuits and user interfaces/communication buses. The hardware consumption and performance measures of our implementation are shown in Table 2.

Table 2. Complete system resources required for the local image features computing circuit. The circuit has been implemented on the RC300 prototyping board [17]; the only computing element is the Xilinx Virtex II XC2V6000-4 FPGA. The system includes the image features processing unit, memory management unit, camera frame-grabber, VGA signal output generation and user configuration interface. Results obtained using the Xilinx ISE Foundation software [18]. (Mpps: mega-pixels per second at the maximum system processing clock frequency; EMBS: embedded memory blocks.)
Slices / (%)   EMBS / (%)   Embedded multipliers / (%)   Mpps   Image resolution   Fps
9135 (27%)     8 (5%)       65 (45%)                     56.5   1000x1000          56.5
Detailed information about the image features core and the coarse pipeline stages is shown in Table 3. Note that the sum of the partial stages does not equal the total resource count. This is easily explained by the synthesis tools: for the whole system, it is more likely that the synthesizer can share resources and reduce hardware consumption. This also explains why the whole system has a lower clock frequency than its slowest subsystem.
The previous argument does not explain why the number of multipliers grows in the complete system with respect to the sum of the multipliers used at each substage. We have used automatic inference of multipliers. Because the number of multiplications by constants is quite high, it is difficult to determine manually when it is better to use an embedded multiplier and when to compute the product using FPGA logic. Synthesis tools define cost functions, based on technological parameters, that evaluate for each multiplication which option is best. We have compared manual and automatic inference of multipliers and, though the differences are not large, automatic inference usually achieves higher clock rates than manual generation with slightly less logic. This automatic inference changes depending on the circuit being synthesized, which produces the differences presented in Table 3. Final qualitative results for a test image are shown in Fig. 5. Though there are significant differences between the bit-widths of the software and hardware data, the high SQNR guarantees high accuracy with imperceptible degradation.
Fig. 5. Example of system results for the Saturn image. (a) Original image. (b) From left to right: local energy, phase and orientation computed with double floating point precision. Energy and phase are encoded using gray levels; orientation is represented by oriented arrows. (c) From left to right: local energy, phase and orientation computed with our customized system. Note that the differences are practically indiscernible, though the hardware uses a much more constrained data bit-width.
Table 3. Partial system resources required on a Virtex II XC2V6000-4 for the coarse pipeline stages described for this circuit. Results obtained using the Xilinx ISE Foundation software [18]. (EMBS stands for embedded memory blocks). The differences between the sum of partial subsystems and the whole core are explained in the text.
Circuit stage                                     Slices / (%)   EMBS / (%)   Embedded multipliers / (%)   fclk (MHz)
S0   Gaussian base convolutions                   4,170 (12)     8 (5)        50 (34)                      85
S1   Oriented quadrature filters                  1,057 (3)      0            0                            69
S2   Energy, Phase and Orientation computation    2,963 (8)      0            6 (4)                        89
     Whole processing core                        7,627 (22)     8 (5)        65 (45)                      58.8
5 Conclusions

We have presented a customized architecture for the computation of local image features: energy, phase and orientation. The implementation combines a well-justified bit-width analysis with a high performance computing architecture. The circuit's SQNR is high (more than 40 dB), which provides high computational accuracy. The architecture processes more than 56 Mpixels per second, and the image resolution can be adapted to the target application; for instance, at VGA resolution (640x480 pixels) the device can compute up to 182 fps. This is achieved thanks to the fine-grained pipelined computing architecture and the high degree of parallelism employed. This performance is required by the target specifications of complex vision projects such as DRIVSCO [19]. Note that even current standard processors running at more than 3 GHz, 60 times faster than the presented FPGA (which runs at about 50 MHz), can hardly process this model in real time. The proposed superscalar, superpipelined architecture fulfils our requirements and illustrates that, as in biological systems, even at a low clock rate a well-designed architecture can achieve a computing performance unattainable with current processors. Future work will cover the implementation of this system using image analysis at different resolutions, trying to properly sample the spatial frequency spectrum of the images. We also plan to evaluate different accuracy vs. resource consumption trade-offs to tune the proposed architecture for specific application fields.
Acknowledgment

This work has been supported by the Spanish National Project DEPROVI (DPI2004-07032) and by the EU grant DRIVSCO (IST-016276-2).
References [1] Gonzalez and Woods: Digital Image Processing. 2nd Edition, Prentice Hall, (2002). [2] Granlund, G. H. and Knutsson, H.: Signal Processing for Computer Vision. Norwell, MA: Kluwer Academic Publishers, (1995). [3] Krüger, N. and Wörgötter, F.: Multi Modal Estimation of Collinearity and Parallelism in Natural Image Sequences. Network: Computation in Neural Systems, vol. 13, no. 4, (2002) pp. 553-576. [4] Krüger, N. and Felsberg, M.: An Explicit and Compact Coding of Geometric and Structural Information Applied to Stereo Matching. Pattern Recognition Letters, vol. 25, Issue 8, (2004) pp. 849-863. [5] Granlund, G. H.: In search of a general picture processing operator. Computer Graphics and Image Processing, vol. 8, no. 2, (1978) pp. 155-173. [6] Fleet, D. J. and Jepson, A. D.: Stability of phase information. IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 15, issue 12, (1993) pp. 1253-1268. [7] Kovesi, P.: Image Features From Phase Congruency. Videre [Online]. 1 (3) (1999). Available: http://mitpress.mit.edu/e-journals/Videre/001/v13.html. [8] Fleet, D. J.: Measurement of Image Velocity. Norwell, MA: Engineering and Computer Science, Kluwer Academic Publishers, (1992). [9] Gautama, T. and Van Hulle, M. M.: A Phase-based Approach to the Estimation of the Optical Flow Field Using Spatial Filtering. IEEE Trans. Neural Networks, vol. 13, issue 5, (2002) pp. 1127-1136. [10] Solari, F., Sabatini, S. P., Bisio, G. M.: Fast technique for phase-based disparity estimation with no explicit calculation of phase. Elec. Letters, vol. 37, issue 23, (2001) pp. 1382 -1383. [11] Fleet, D. J., Jepson, A. D., Jenkin, M. R. M.: Phase-Based Disparity Measurement. CVGIP: Image Understanding, vol. 53, issue 2, (1991) pp. 198-210. [12] Nestares, O., Navarro, R., Portilla, J., Tabernero, A.: Efficient Spatial-Domain Implementation of a Multiscale Image Representation Based on Gabor Functions. Journal of Electronic Imaging, vol. 7, no. 1, (1998) pp. 166-173. 
[13] Freeman, W. and Adelson, E.: The design and use of steerable filters. IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 13, issue 9, (1991) pp. 891-906. [14] Díaz, J.: Multimodal bio-inspired vision system. High performance motion and stereo processing architecture. PhD Thesis, (2006), University of Granada. [15] Haglund, L.: Adaptive Multidimensional Filtering. Ph.D. thesis, nº 284, Linköping University, SE-581 83 Linköping, Sweden, (1992). [16] Díaz, J., Ros, E., Pelayo, F., Ortigosa, E. M., Mota, S.: FPGA based real-time optical-flow system. IEEE Trans. on Circuits and Systems for Video Technology, vol. 16, no. 2, (2006) pp. 274-279. [17] Celoxica, RC300 board information. Last access December 2006. Available at: http://www.celoxica.com/products/rc340/default.asp. [18] Xilinx, ISE Foundation software. Last access January 2007. Available at: http://www.xilinx.com/ise/logic_design_prod/foundation.htm. [19] DRIVSCO, project web page. Last access January 2007. Available at: http://www.pspc.dibe.unige.it/~drivsco/