Real Time Optical Flow Processing System

Javier Díaz, Eduardo Ros, Sonia Mota, Richard Carrillo, and Rodrigo Agis

Departamento de Arquitectura y Tecnología de Computadores, E.T.S.I. Informática, Universidad de Granada, Periodista Daniel Saucedo Aranda s/n, 18071 Granada, Spain
{jdiaz, eros, smota, rcarrillo, ragis}@atc.ugr.es

Abstract. We describe an optical flow processing system that works as a virtual motion sensor. It is based on an FPGA device, which enables easy modification of the configuration parameters to adapt the sensor to different motion speeds, light conditions and other environmental factors. We call it a virtual sensor because it consists of a conventional camera as front-end and a processing FPGA device that embeds the frame grabber, the optical flow algorithm, the output module and some configuration and storage circuitry. To the best of our knowledge, this paper represents the first description of a fully working optical flow processing system that includes accuracy and processing speed measurements to evaluate the platform performance.

1 Introduction

Optical flow algorithms have been widely described in the literature. Some authors have addressed comparative studies of their accuracy on synthetic sequences [1]. Their evaluation on real-world sequences is difficult because the true optical flow of such sequences is unknown. We have chosen to implement a classical gradient model based on the work of Lucas & Kanade (L&K) [1, 2]. Several authors have pointed out the good trade-off between accuracy and efficiency of this model, an important factor when deciding which approach is most suitable for implementation as a real-time processing system. For example, in [1] the L&K algorithm provides very accurate results. Liu et al. [3] evaluate the efficiency vs. accuracy trade-off of different optical flow approaches and conclude that L&K is a good candidate. Finally, McCane et al. [4] also give L&K a good score and conclude that it requires affordable computational power. This has motivated other authors to focus on the L&K algorithm [5, 6].

In this paper we describe a hardware implementation of the L&K algorithm. Other authors have recently described hardware implementations of optical flow algorithms [7, 8, 9], but most do not provide results to evaluate the performance of the system, i.e. its accuracy and computation speed. Our approach is a fully working system running at a conventional camera frame rate of 30 Hz with image sizes of 320x240 pixels. To the best of our knowledge, this is the first description of such a system and it therefore represents the state of the art in this area.


2 Optical Flow Model

Although the original algorithm was proposed as a method to estimate the disparity map of stereo image pairs [2], we have implemented Barron's description of the L&K algorithm, which applies it to optical flow computation [1]. Several modifications have been added to improve the feasibility of the hardware implementation. Instead of temporal FIR filters, we have implemented IIR filters as described in [10]. A FIR approach is also feasible on the prototyping platform used, since it includes four independent memory banks, but an IIR approach is much easier to configure for different time constants (only a single coefficient needs to be modified), whereas the FIR approach requires memory resources and memory accesses that grow with the time constant of the temporal filter. Another modification is adopted to provide estimations when the aperture problem appears: in this situation, as described in [11], a small modification allows the model to provide an estimation in the direction of maximum gradient.

In the following equations we briefly describe the computations on which the L&K approach is based; we will refer to these computational stages when describing the system architecture. A detailed description of the L&K model is provided in [1, 2]. The algorithm belongs to the gradient-based techniques, which are characterized by a gradient search performed on extracted spatial and temporal derivatives. Assuming that luminance values remain constant over time, the L&K method constructs a flow estimate from the first-order derivatives of the image. By least-squares fitting, the model extracts the motion estimate under the hypothesis that velocity values are similar within a neighbourhood $\Omega$ around a central pixel. $W(\mathbf{x})$ is a window that weights the constraints, with higher weights near the centre of $\Omega$. The known solution to this problem is:

\[ \vec{v} = \left[ A^T W^2 A \right]^{-1} A^T W^2 \vec{b} \tag{1} \]

with

\[ A = \left[ \nabla I(\mathbf{x}_1), \ldots, \nabla I(\mathbf{x}_n) \right]^T, \qquad W = \operatorname{diag}\left[ W(\mathbf{x}_1), \ldots, W(\mathbf{x}_n) \right], \qquad \vec{b} = -\left[ I_t(\mathbf{x}_1), \ldots, I_t(\mathbf{x}_n) \right]^T \tag{2} \]

An inherent limitation of these models appears in blank-wall or aperture-problem situations. In these cases the problem has no solution (the matrix $A^T W^2 A$ is not invertible) and the model cannot provide any motion estimation. For this reason we have added a small constant $\alpha$ to the matrix diagonal, according to [9], which allows us to estimate the normal velocity field in situations where the 2-D velocity cannot be extracted due to the lack of contrast information. Therefore, the first term of equation (1) is computed with expression (3):

\[ A^T W^2 A = \begin{bmatrix} \sum_{\mathbf{x}\in\Omega} W^2 I_x^2 + \alpha & \sum_{\mathbf{x}\in\Omega} W^2 I_x I_y \\[4pt] \sum_{\mathbf{x}\in\Omega} W^2 I_x I_y & \sum_{\mathbf{x}\in\Omega} W^2 I_y^2 + \alpha \end{bmatrix} \tag{3} \]

Summarizing, we have to compute the 2x2 matrix of equation (3), its inverse, and the 2x1 matrix given in expression (4):

\[ \vec{B} = A^T W^2 \vec{b} = \begin{bmatrix} -\sum_{\mathbf{x}\in\Omega} W^2 I_x I_t \\[4pt] -\sum_{\mathbf{x}\in\Omega} W^2 I_y I_t \end{bmatrix} \tag{4} \]
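To make the computation concrete, here is a minimal Python sketch of the per-neighbourhood velocity estimate of equations (1), (3) and (4). This is our own illustration with assumed function and variable names, not the hardware pipeline described in Sect. 3:

```python
import numpy as np

def lk_velocity(Ix, Iy, It, W, alpha=1.0):
    """Estimate (vx, vy) for one neighbourhood Omega.

    Ix, Iy, It : arrays of spatial/temporal derivatives over Omega.
    W          : weighting window of the same shape (higher weights
                 near the centre), applied squared as in eq. (1).
    alpha      : small constant added to the diagonal of A^T W^2 A
                 so the system stays invertible under the aperture
                 problem, as in eq. (3).
    """
    W2 = W ** 2
    # Elements of the 2x2 matrix A^T W^2 A of equation (3).
    a11 = np.sum(W2 * Ix * Ix) + alpha
    a12 = np.sum(W2 * Ix * Iy)
    a22 = np.sum(W2 * Iy * Iy) + alpha
    # Elements of the 2x1 vector B = A^T W^2 b of equation (4).
    b1 = -np.sum(W2 * Ix * It)
    b2 = -np.sum(W2 * Iy * It)
    # Invert the 2x2 matrix explicitly: only subtraction,
    # multiplication and division are needed, which is exactly the
    # operation set of the floating-point unit of Sect. 3.3.
    det = a11 * a22 - a12 * a12
    vx = (a22 * b1 - a12 * b2) / det
    vy = (a11 * b2 - a12 * b1) / det
    return vx, vy
```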

Gaussian smoothing is the pre-processing stage applied before computing the image derivatives that appear in the matrix elements of equations (3) and (4); it reduces image noise and increases the correlation between adjacent pixels. Typically, spatio-temporal Gaussian filters of 2 pixels variance plus a 5-pixel temporal derivative are used. These temporal operations require storing 15 images for the entire process, which is hardly affordable in embedded systems; therefore, as indicated in [10], a more efficient implementation can be obtained with IIR temporal recursive smoothing and derivative filters. In this way the temporal storage requirement is reduced to 3 frames and the computation time is improved, at the cost of slightly reduced accuracy. For an extensive discussion of how to design the IIR filters, see [10].
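As a rough illustration of this storage saving, the following first-order recursive filter is a minimal stand-in for the filters of [10] (the actual filters there are designed more carefully; the class, coefficient and derivative estimate below are our assumptions):

```python
import numpy as np

class TemporalIIR:
    """Simplified first-order recursive temporal smoothing and
    differentiation. Only the previous smoothed frame is stored,
    instead of the ~15-frame window a FIR implementation needs,
    and the time constant is set by the single parameter `tau`."""

    def __init__(self, tau=2.0):
        self.k = 1.0 / tau   # smoothing coefficient (one parameter)
        self.prev = None     # previous smoothed frame

    def step(self, frame):
        frame = frame.astype(np.float32)
        if self.prev is None:
            self.prev = frame
        smoothed = self.prev + self.k * (frame - self.prev)
        It = smoothed - self.prev   # crude temporal-derivative estimate
        self.prev = smoothed
        return smoothed, It
```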

3 Hardware Implementation

Nowadays, software real-time computation of simple optical flow algorithms on small images is possible thanks to the outstanding computational power of PCs. The drawback is that it is difficult to adapt such systems for use as embedded solutions. In the approach presented here, the motion computation chip can be regarded as part of a smart sensor. Several hardware technologies can be used to implement an embedded system. Application-specific integrated circuits (ASICs) can achieve sufficient power for real-time computation, but they are an expensive option. DSPs are a valid alternative, but when considerable computational power is needed a single DSP is not powerful enough and a multiprocessor scheme has to be designed [12]. The solution we propose is based on programmable logic devices (FPGAs). These circuits allow us to design a customized DSP circuit of high computational power on a single chip, through intensive use of their intrinsic parallelism and pipelining resources. As we show in later sections, our solution uses this technology to implement a real-time hardware device capable of working as a PC coprocessor or as a smart sensor in embedded applications.

For our design we have used the RC1000-PP board from Celoxica [13] and Handel-C [14] as the hardware specification language. This board is connected to the PC by a PCI bus and can be used as a hardware accelerator or as a prototyping board. It contains a 2-million-gate Virtex-E FPGA and four 2 MB SRAM memory banks accessible in parallel.

3.1 System Implementation Overview

An efficient implementation of the algorithm on an FPGA device requires exploiting the intrinsic processing parallelism of this kind of device. We use a segmented pipeline architecture, as shown in Fig. 1.


Fig. 1. Coarse pipeline processing architecture.

The basic computational stages in Fig. 1 can be summarised as follows:

S0. Frame grabber: receives the pixels from the camera and stores them in one of the external memory banks, using a double-buffer technique to avoid timing problems.
S1. Spatial Gaussian smoothing filters.
S2. IIR temporal filter: provides the temporal derivative and the spatio-temporally smoothed images.
S3. Spatial derivatives stage.
S4. Construction of the least-square matrices of equations (3) and (4).
S5. Custom floating-point unit: the final velocity estimation needs a matrix inversion, which requires a division operation. At this stage the bit width of the incoming data is significant and expensive arithmetic operations are required; fixed-point arithmetic becomes too expensive, so we have designed a customised floating-point unit.

The computation bit-depth increases throughout the pipeline structure. For example, for a high-precision system with low degradation, we use 8 bits in the first two stages, 12 bits in the third and fourth stages, 24 bits in the construction of the least-square matrices and 25 bits in the floating-point unit, although a less hardware-expensive configuration has also been tested with good results. The computation of the least-square matrices (S4) is the stage most expensive in computational resources, and different parallelism strategies can be adopted at this point. The basic parameters of the pipeline structure are the latency (L) and the maximum number of cycles (MNC) required by the longest stage, which is the limiting factor of the computing speed. The circuit scheme gives a basic relationship between the MNC and the system clock frequency (fclk) for the computing speed in pixels per second (pps): pps = fclk / MNC; a worked example follows. Because stages S4 and S5 are the most demanding, the following subsections focus on their implementation and architectural design strategy.
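As a quick sanity check of the pps = fclk/MNC relation, using the concrete figures reported in Sect. 4 (fclk = 27 MHz, MNC = 10 for the fastest configuration):

```python
f_clk = 27e6        # system clock frequency in Hz (cf. Table 2)
MNC = 10            # cycles consumed by the longest pipeline stage
pps = f_clk / MNC   # peak throughput: 2.7 Mpixels/s, i.e. 2700 Kpps

# Theoretical frame rate at 320x240 before any I/O overhead:
fps = pps / (320 * 240)   # ~35 fps; Table 2 reports 24 fps measured,
print(pps, fps)           # the gap being PCI transfer time (Sect. 4)
```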


3.2 Least Square Matrices Construction

This is a critical stage where the trade-off between efficiency and cost can be studied in depth. Equations (3) and (4) require the generation of five products: Ix², Iy², IxIy, IxIt, IyIt. We then have to compute a weighted sum over a window W spanning a neighbourhood of size wx by wy. Due to memory limitations we store the Ix, Iy and It values instead of the five cross products. The operations to perform are therefore: a) computation of the products for all the elements in a neighbourhood, requiring 5 · wx · wy multiplications; b) row convolutions, of which there are 5 · wy; and c) column convolutions, of which there are 5. The scheme of these operations is shown in Fig. 3; a software sketch is given after the figure caption.

Fig. 3. Least-square matrices builder circuit for a 3x3 neighbourhood.
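The following Python sketch (our own illustration using SciPy's convolve1d; the 3-tap window and function names are assumptions, not the paper's code) mirrors steps a)–c), exploiting the separability of the squared weighting window W²:

```python
import numpy as np
from scipy.ndimage import convolve1d

def weighted_sums(Ix, Iy, It, w=np.array([0.25, 0.5, 0.25])):
    """Build the five neighbourhood sums of equations (3) and (4).

    Step a) forms the five cross products per pixel; steps b) and c)
    apply the separable window (here w plays the role of W^2 along
    each axis) as a row convolution followed by a column convolution,
    matching the row/column convolution units of Fig. 3.
    """
    products = [Ix * Ix, Iy * Iy, Ix * Iy, Ix * It, Iy * It]
    sums = []
    for p in products:
        p = convolve1d(p, w, axis=1)   # b) row convolution
        p = convolve1d(p, w, axis=0)   # c) column convolution
        sums.append(p)
    return sums   # [S_xx, S_yy, S_xy, S_xt, S_yt]
```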

This is an important stage where we can bias the trade-off between efficiency and hardware cost. The parameters to choose are: the area of the neighbourhood weighted sum, the number of multiplication units, and the number of row and column convolution units. For example, with a 3x3 neighbourhood we can use from 1 to 45 multipliers, from 1 to 15 row convolution units and from 1 to 5 column convolution units. This choice allows us either to compute the weighted sums in one clock cycle with a highly parallel hardware unit or to compute them sequentially. Results using different configurations are shown in Section 4.

3.3 Final Velocity Calculation Using a Custom Floating Point Unit

At this stage expression (1) is computed. Until now the arithmetic operations have been carried out in integer or fixed-point arithmetic with truncation. Convolution operations work well in this representation, but when the bit depth becomes too high, a floating-point representation is better suited for hardware implementation. This is done with a customised superscalar floating-point unit.


Its architecture is illustrated in Fig. 4. Since a high bit-depth (24 bits) is used at the previous stage to preserve computation accuracy, this is a very expensive stage in terms of hardware resources, and it strongly affects the accuracy vs. processing-speed trade-off. The stage computes the inverse of a 2x2 matrix and the product of a 2x2 matrix with a 2x1 vector, a calculation that involves only the basic arithmetic operations of subtraction, multiplication and division.

Fig. 4. Floating-point unit scheme

When arithmetic operations are performed at large bit depths, the signal delays associated with the carry lines degrade the overall system performance, decreasing the maximum clock frequency. To avoid this, either pipelined arithmetic operators or sequential iterative operators can be used. The former compute in one or two clock cycles after a given latency, at a high cost in hardware resources; the latter take several clock cycles, degrading the MNC of the system, but allow the same hardware to be reused on each iteration. We use single-cycle floating-point circuits for all the operations, since they run at the desired maximum clock frequency without becoming the limiting stage; the exception is division, which is a harder operation. We use a sequential hardware divisor instead of a pipelined one; it needs 21 cycles to divide 25-bit floating-point numbers, which by itself would raise the MNC and severely limit the pipeline performance. The chosen solution therefore uses up to 3-way division units, and depending on the required system performance more or fewer ways can be synthesised. Each floating-point unit needs: one to five fixed-point to floating-point converters; one to six 25-bit floating-point multipliers; one to three subtractors; and one to two divisor units (with an n-way divisor scheme, n to 2n divisor units are used). Results using different configurations are shown in Section 4.
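To illustrate why a sequential divisor costs roughly one cycle per result bit (which is where the 21-cycle figure comes from), here is a classic restoring-division loop; this is our own sketch of the general technique, not the unit actually synthesised:

```python
def restoring_divide(num, den, bits=21):
    """Integer restoring division producing one quotient bit per
    iteration, mimicking a sequential hardware divisor. A pipelined
    divisor would instead spend one hardware stage per bit."""
    q, rem = 0, 0
    for i in range(bits - 1, -1, -1):
        rem = (rem << 1) | ((num >> i) & 1)   # shift in next dividend bit
        q <<= 1
        if rem >= den:                        # trial subtraction succeeds
            rem -= den
            q |= 1                            # accept the quotient bit
    return q, rem

# e.g. restoring_divide(1000, 7) -> (142, 6)
```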

4 Hardware Resources Consumption Study

The system has been designed in a very modular way, so the parallelism and the bit accuracy of the different stages can be easily modified. Thanks to the high level of abstraction that Handel-C provides [14], it is easy to manage the parallelism of the computing circuits and the bit-depth at the different stages. Table 1 summarises the hardware resources of the different stages on a Virtex XCV2000E-6 FPGA for a concrete implementation (called HSHQ in the following section).


Table 1. Detailed hardware requirements of the sub-circuits for a Virtex XCV2000E-6. Note that the sum of the device percentages in the first column is larger than 100%; this is because the data were obtained from partial compilations, in which the synthesis tool makes wide use of the available resources. When the whole design is compiled it consumes 99% of the device.

| Stage | Slices / (% of device) / equivalent gates | Computing cycles | ISE max. clock frequency (MHz) | Memory requirements / (% of device) |
|---|---|---|---|---|
| Spatial Gaussian (17 taps) | 220 / (1%) / 270,175 | 8 | 29.2 | 16 / (10%) |
| IIR filter | 134 / (1%) / 51,971 | 7 | 38.5 | 3 / (1%) |
| Spatial derivative convolution | 287 / (1%) / 121,296 | 7 | 28.0 | 7 / (4%) |
| Least-square matrices construction | 15,288 / (79%) / 642,705 | 10 | 20.3 | 24 / (15%) |
| Superscalar floating-point unit | 5,720 / (29%) / 90,993 | 10 | 17.4 | 0 |

The last two stages have the largest MNC values. Lower MNCs would be possible for the other stages, but there is no reason to improve them given the existing limiting stages. The results of the Xilinx timing analyser are not always accurate: it can underestimate the speed at which a circuit can run, leading the place-and-route tool to take much longer than it needs to. The maximum frequency allowed by the system has therefore been measured experimentally, and it is 10-20 MHz higher than the very conservative figures given by ISE. This arises because the analyser looks at the static logic path rather than the dynamic one (see [15]).

One important topic is the range of possible system configurations. We have evaluated several configurations to explore different trade-offs between accuracy, hardware cost and computing speed. All of them use the same basic architecture but with different parallelism levels; Table 2 summarises their main properties. The ones using a 5x5 averaging window for the least-square-matrix neighbourhood we call high quality (HQ) approaches, and the ones using a 3x3 window, medium quality (MQ). Other modifiable parameters are the sizes of the smoothing and spatial derivative filters: the HQ and MQ models use 5-pixel derivative filters and 9-pixel Gaussians, while a low-cost (LQ) version uses 3-pixel derivatives and a Gaussian filter of the same size.

Once the optical flow quality of the system is fixed, the remaining factor is the performance vs. hardware cost trade-off. With maximum parallelism the MNC is 10; lower-cost approaches are possible if we reduce the parallelism level, thus increasing the MNC. For example, we implemented a high-speed (HS) version with MNC = 10 cycles using a three-way division unit and maximum parallelism. A slower version was implemented by reducing the parallelism; we call this


version medium-speed (MS). Finally, we implemented a low-speed (LS) version. Table 2 summarises the performance and hardware costs of the systems.

Table 2. Performance and hardware cost of different configurations on a Virtex 2000-E FPGA (2 million gates and 640 Kbits of internal memory). Kpps = kilopixels per second; Fps = frames per second. All performance values were measured with a system clock frequency of fclk = 27 MHz. These measurements (Kpps and Fps) are underestimates because the measured computing time also includes data transmission to the prototyping board.

| Version | % device occupation | % on-chip memory | Kpps | Image resolution | Fps (fclk = 27 MHz) | Max. fclk (MHz) |
|---|---|---|---|---|---|---|
| HSHQ | 99 | 17 / 31 | 1776 | 160x120 / 320x240 | 95 / 24 | 35 |
| HSMQ | 65 | 16 / 31 | 1776 | 160x120 / 320x240 | 97 / 24 | 35 |
| MSMQ | 43 | 16 | 625 | 160x120 | 33 | 35 |
| LSLQ | 36 | 8 | 400 | 120x90 | 38 | 35 |

It is important to note that in our experiments data transmission to the prototyping board across the PCI bus takes about 40% of the total processing time; higher frame rates are therefore expected with a direct connection between the camera and the FPGA. For instance, following the relation given in Sect. 3.1, the theoretical throughput of the HSHQ configuration is 2700 Kpps. This topic is discussed at length in [16].

So far we have shown the system flexibility and the trade-off between number of gates and system performance. Another important topic is scalability at the level of functional units. All our results assume that a single computational unit is used, but local image-processing algorithms can take advantage of the possibility of splitting the FPGA: several computational units can be synthesised in the same FPGA, or across several FPGAs, to compute larger images in real time. If a memory buffer is used, it is straightforward to assign a small image area to each computational unit and run them in parallel; the computational power then increases by a factor equal to the number of units running in parallel. Within the pipeline computing structure, the same scalability principle has been applied in the floating-point unit, where we implemented a three-way superscalar division unit. This reduces the number of cycles required by this stage from 21 to 7, yielding a well-balanced pipeline computing architecture.

5 Performance Evaluation

As noted in the introduction, the accuracy of optical flow computed on real-world sequences is difficult to assess because the true flow of these sequences is unknown. Therefore, to evaluate the accuracy of our design, which depends on the bit-depth of the different stages, we have adopted the test scheme and the synthetic sequence of the comparative study by Barron et al. [1]. The results using the HSHQ approach are summarised in Table 3.

The first row of Table 3 gives the accuracy of the L&K algorithm computed on a standard PC using double-precision variables and the IIR filters, with the error measure proposed in [17, 18]. The second row gives the performance obtained by our hardware implementation. The accuracy is reasonable considering that fixed-point variables and restricted bit depths are used in this approach.

Table 3. Yosemite sequence results using the Fleet angular error measure [17, 18].

| Model | Average error | Standard deviation | Density (%) | Parameters |
|---|---|---|---|---|
| L&K IIR software vs. real flow | 15.91° | 11.5° | 100 | λmin = 0, σxy = 0.8, τ = 2, α = 1 |
| Hardware implementation vs. real flow | 18.30° | 15.8° | 100 | λmin = 0, σxy = 0.8, τ = 2, α = 1 |

We have also compared the performance of the software and hardware implementations using sinusoidal grating sequences, with different stimulus frequencies (f0 = 0.02 and f0 = 0.05) and velocities (V = 0.25 ppf and V = 1 ppf). In these tests the hardware results are very similar to those of the software approach (less than 5% error in the speed estimates). The software implementation (standard C) runs at 30 fps on 160x120 images on an AMD 1800+. Such test stimuli are straightforward to reproduce, as sketched below.
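A minimal sketch of how a grating sequence with known ground-truth velocity can be generated (our own test-harness code, not the authors'; names are assumptions):

```python
import numpy as np

def grating_sequence(width=160, height=120, frames=32, f0=0.05, v=1.0):
    """Translating sinusoidal grating: spatial frequency f0
    (cycles/pixel) drifting horizontally at v pixels/frame, so the
    true optical flow is (v, 0) everywhere in every frame."""
    x = np.arange(width)[None, :].repeat(height, axis=0)
    seq = [np.sin(2 * np.pi * f0 * (x - v * t)) for t in range(frames)]
    return np.stack(seq)   # shape (frames, height, width)
```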

6 Conclusions and Future Work

The system described here shows how an optical flow estimation circuit can be implemented on an FPGA platform, acting as a special-purpose DSP, to achieve real-time computation. The paper describes a scalable architecture that can process large images at video frame rate. Table 3 summarises the comparison between the software and hardware results on the Yosemite sequence with unthresholded output. The performance of the hardware is only slightly worse (a 2.48° increase in error) than that of the software version with 64-bit data precision, and the results of the hardware implementation are in the range of the software approaches considered in the study of Barron et al. [1]. The hardware performance is therefore of reasonable quality, given that it computes in real time (at a speed of 1776 Kpps).

In the future we plan to address two main goals. The first is to study the bit-depth needed for different applications: although this hardware approach is already being used on real-world sequences with satisfactory results, it uses a high bit-depth that could be reduced significantly. The second is to use multiscale computation to detect fast motion properly. Classical gradient models estimate velocities well for speeds below one pixel per frame, but faster motion produces temporal aliasing. The basic solution is to compute motion at higher frame rates, which requires special and expensive cameras; alternatively, multiscale approaches allow the same cameras to be used.


Acknowledgments. This work has been supported by the Vth EU research framework programme through the European Project ECOVISION (IST-2001-32114).

References

[1] J. Barron, D. Fleet, S. Beauchemin: Performance of optical flow techniques. International Journal of Computer Vision, Vol. 12, No. 1, pp. 43-77, 1994.
[2] B. Lucas, T. Kanade: An iterative image registration technique with an application to stereo vision. Proc. DARPA Image Understanding Workshop, pp. 121-130, 1981.
[3] H. Liu, T.H. Hong, M. Herman, T. Camus, R. Chellappa: Accuracy vs. efficiency trade-offs in optical flow algorithms. Computer Vision and Image Understanding, Vol. 72, No. 3, pp. 271-286, 1998.
[4] B. McCane, K. Novins, D. Crannitch, B. Galvin: On benchmarking optical flow. Computer Vision and Image Understanding, Vol. 84, pp. 126-143, 2001.
[5] S. Baker, I. Matthews: Lucas-Kanade 20 years on: a unifying framework. International Journal of Computer Vision, Vol. 56, No. 3, pp. 221-255, March 2004.
[6] A. El Gamal: Optical flow estimation using high frame rate sequences. Proc. 2001 International Conference on Image Processing, Vol. 2, pp. 925-928, 2001.
[7] P. Cobos, F. Monasterio: FPGA implementation of the Horn & Schunck optical flow algorithm for motion detection in real time images. Proc. XIII Design of Circuits and Integrated Systems Conference, pp. 616-621, 1998.
[8] P. Cobos, F. Monasterio: FPGA implementation of the Camus correlation optical flow algorithm for real time images. Proc. Int. Conf. on Vision Interface, pp. 7-9, 2001.
[9] S. Maya-Rueda, M. Arias-Estrada: FPGA processor for real-time optical flow computation. Lecture Notes in Computer Science, Vol. 2778, pp. 1103-1016, 2003.
[10] D.J. Fleet, K. Langley: Recursive filters for optical flow. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 17, No. 1, pp. 61-67, 1995.
[11] E.P. Simoncelli, E.H. Adelson, D.J. Heeger: Probability distributions of optical flow. Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Maui, Hawaii, June 1991.
[12] T. Rowekamp, M. Platzner, L. Peters: Specialized architectures for optical flow computation: a performance comparison of ASIC, DSP, and multi-DSP. Proc. ICSPAT'97, 1997.
[13] Celoxica Ltd.: RC1000-PP board, www.celoxica.com
[14] Handel-C language reference manual. Celoxica, 2003.
[15] Celoxica application note AN 68 v1.1: Timing Analysis and Optimisation of Handel-C Designs for Xilinx Chips.
[16] D. Benitez: Performance of reconfigurable architectures for image-processing applications. Journal of Systems Architecture, Vol. 49, No. 4-6, pp. 193-210, 2003.
[17] D.J. Fleet, A.D. Jepson: Computation of component image velocity from local phase information. International Journal of Computer Vision, Vol. 5, No. 1, pp. 77-104, 1990.
[18] D.J. Fleet: Measurement of Image Velocity. Kluwer Academic Publishers, 1992.
