Active Visual Tracking of Heading Direction By Combining Motion Energy Neurons

Stanley Y. M. Lam and Bertram E. Shi
Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Kowloon, Hong Kong
email: {eelym, eebert}@ust.hk

Abstract - We describe a robotic vision system that aligns a camera's optical axis with its direction of translation by estimating the focus of expansion. Visual processing is based on functional models of populations of neurons in cortical areas V1 through MST. Populations of motion energy neurons tuned to different orientations, positions and directions of motion are successively transformed into a population of neurons that collectively encode the focus of expansion at 25 frames per second. We characterize the performance of the system as it translates through a cluttered environment, and show that the performance is robust to variations in system parameters.

I. INTRODUCTION

The visual motion perceived by a subject interacting with its environment may have many different sources. For example, motion may be introduced by translation through the environment or by gaze shifts. Although it is intuitively appealing to treat these cases separately, behaving subjects combine them simultaneously. Combining multiple types of motion may simplify the interpretation of visual motion. For example, if a translating subject shifts its gaze to align it with its direction of translation, the processing required for tasks such as the estimation of time to contact can be simplified significantly. This work describes the implementation of this strategy in a robotic system whose visual processing is based upon functional models of neurons in the dorsal stream of mammalian visual cortex [1][2].

Neurons in lower cortical areas, such as primary visual cortex (V1), have spatially localized receptive fields and tend to respond to simple patterns of motion, such as oriented sine wave gratings or bars moving in a preferred direction. Higher cortical areas, such as MT and MST, generally have larger receptive fields and respond to more complex stimuli. We refer to a population of neurons with similar stimulus selectivity but at different retinal locations as a map.

Our system estimates the heading direction from the time varying image received by the camera. For pure translation with a forward component, the optical flow (the projection of the 3D motion vectors onto the 2D image plane) expands radially around an image point called the focus of expansion (FOE), which is the intersection of the image plane with the direction of translation. However, our system does not estimate optical flow explicitly. Rather, it successively transforms the population outputs to generate a new population of neurons whose activity implicitly encodes the focus of expansion. This avoids problems where the optical flow vector may not be well defined, e.g. in the case of transparency or occlusion.

This work was supported by the Hong Kong Research Grants Council.

In these cases, a population encoding that can represent multiple hypotheses may be preferable. Our approach is similar to template matching [3] and population heading map [4] approaches in that we build neurons tuned to different heading directions, and decode heading by searching for the neuron with the maximum response. However, rather than assuming a set of neurons tuned to different speeds in different directions, or knowledge of the optical flow, it is based on a front end with a coarse binary quantization of velocity in different directions, exploiting key insights provided by Fermuller and Aloimonos [5]. Although this binary quantization reduces velocity resolution, it also reduces the effect of confounding stimulus dimensions such as image contrast, orientation and spatial frequency on the responses of the motion energy neurons we use in the front end. We were motivated to examine the performance based on binary encoded signals, since binary tuning is optimal for spike rate transmission [6], which we plan to adopt in a distributed processor implementation of this algorithm. Binary quantization of velocity for the measurement of self motion was also adopted in the neuromorphic multi-chip architecture presented by Higgins and Shams [7]. Our work differs in the front end (motion energy filters versus edge tracking), and in that we estimate heading, rather than trying to differentiate between eight different motion types (expansion, contraction, two rotations and four translations).

Section 2 describes the hardware that implements the system. Section 3 describes the algorithm used to estimate the heading direction. Section 4 gives experimental results, demonstrating that the system can correctly estimate the location of the focus of expansion and issue a motor command to align the camera so that the optical axis intersects the image plane at the focus of expansion. We also characterize the effect of changing the properties of the motion selective neurons on the system performance.

II. HARDWARE

The system, pictured in Figure 1, consists of a digital processor that implements the visual processing, and a robotic platform that controls the translation and orientation of the camera. The digital processing board was developed to enable fast, flexible and expandable simulation of large populations of visual cortical neurons [8]. The bulk of the computation is done by a 600 MHz Texas Instruments 6416 DSP chip. Images are captured by an OmniVision OV6630 CMOS sensor through a 4 mm focal length lens. The sensor resolution is 352x288 pixels. Communication between the camera and the DSP chip is controlled by a Spartan III FPGA, which buffers the incoming frame from the camera so that the DSP is free to concentrate on implementing the neuron populations. A USB connection sends data to a PC for logging and visualization.
Fig. 1: (a) A photograph of the accelerator board. (b) A photograph of the experimental setup.

The camera is mounted onto a pan-tilt head. Two Futaba S2303 servo motors enable us to direct the optical axis over a pan range of 180 degrees and a tilt range of 90 degrees. The pan-tilt head itself is mounted onto an Arrick Robotics single axis linear position table. The platform holding the pan-tilt head can be translated both forward and backward by a stepper motor. In our experiments, we used a forward translation speed of 5.08 cm/s (0.18 km/hr).

III. CONSTRUCTION OF THE HEADING MAP

The heading map is a retinotopic array of neurons, each corresponding to a different possible focus of expansion. The neuron with the largest activity is the system's estimate of the FOE, and is used to update the camera orientation. Heading maps are computed at a frame rate of 25 fps, and integrate information over the entire image as well as the recent past. Figure 2 illustrates the algorithm.

Fig. 2: Heading map construction. The incoming image is expanding around the focus of expansion. The image is filtered by a bank of motion energy filters tuned to different orientations and directions of motion, modelling motion selective complex cells in V1. By comparing pairs of filters tuned to the same orientation but opposite directions, the system estimates the direction of the projection of the image motion vector. Each cell in the heading map integrates information from neurons consistent with its preferred FOE location over space and time.

A. Image preprocessing

Incoming images are first spatially high pass filtered to approximate the pre-whitening operation commonly associated with the retina. The high pass filter takes the difference between the original image and the output of a unity DC gain spatial low pass filter. The low pass filter is a two-dimensional radially symmetric Gaussian with standard deviation 3 pixels. We separate the filter into 1D filters applied along the rows and columns of the image. Each 1D filter is implemented using the forward and backward algorithm proposed by Young et al. [9].

We apply contrast normalization to the high pass filter output so that the magnitudes of the motion energy outputs are less dependent on the image contrast, thus better reflecting the motion in the image. The normalization applies a spatially dependent gain to the filter output at each pixel. The gain at each location is the inverse of the full wave rectified filter output pooled over a Gaussian weighted neighborhood with standard deviation 3 pixels. Thus, the gain is high in areas where the magnitude of the high pass filter output is low, and vice versa.
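As a rough illustration of this preprocessing stage, the sketch below implements the high-pass filter and the divisive contrast normalization directly with SciPy's Gaussian filtering rather than the recursive 1D filters of [9]. The two standard deviations of 3 pixels follow the text; the small regularizer `eps` is our own addition to avoid division by zero and is not part of the described system.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def preprocess(image, sigma_lp=3.0, sigma_pool=3.0, eps=1e-6):
    """High-pass filter an image and apply divisive contrast normalization.

    The high-pass output is the image minus a unity-DC-gain Gaussian low-pass
    (sigma = 3 pixels). The gain at each pixel is the inverse of the full-wave
    rectified high-pass output pooled over a Gaussian neighborhood
    (sigma = 3 pixels). `eps` is an assumed regularizer, not from the paper.
    """
    image = image.astype(np.float64)
    highpass = image - gaussian_filter(image, sigma_lp)      # pre-whitening
    pooled = gaussian_filter(np.abs(highpass), sigma_pool)   # local contrast estimate
    gain = 1.0 / (pooled + eps)                              # high gain where contrast is low
    return gain * highpass
```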

B. Motion energy maps and motion opponent maps

Spatio-temporal Gabor-like filters model the receptive field properties of motion sensitive neurons in V1. We square and sum the outputs of a pair of phase quadrature filters tuned to the same orientation and the same spatial and temporal frequencies to give a retinotopic map of motion energy. We compute four pairs of motion energy maps. The maps in each pair are tuned to the same orientation chosen from $\{0, \pi/4, \pi/2, 3\pi/4\}$, but opposite directions of motion. We implement the spatio-temporal filters as a cascade of complex valued spatial and temporal filters. The real and imaginary parts of the filter outputs represent the pair of phase quadrature filters. Motion energy is computed as the squared magnitude of the filter output.

Each pair of motion energy maps shares the same spatial filtering stage. For the map tuned to orientation $\theta$, the spatial filter impulse response is given by

$$g_\theta(x, y) = \frac{1}{2\pi \sigma_x \sigma_y} \exp\left( -\frac{(x')^2}{2\sigma_x^2} - \frac{(y')^2}{2\sigma_y^2} \right) \exp(j \Omega_x x') \quad (1)$$

where

$$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} \quad (2)$$

is a rotated coordinate system, $\Omega_x$ is the preferred spatial frequency, and $\sigma_x$ and $\sigma_y$ control the width of the impulse response in the directions parallel and orthogonal to the preferred orientation. Consistent with the responses of cortical neurons, we choose $\sigma_y = 2\sigma_x$, resulting in elongated receptive fields. For computational efficiency, we computed the Gabor filter outputs using the modulation/demodulation approach combined with recursive approximations to oriented Gaussian filters [9][10]. The parameters we used were $\Omega_x = 2\pi/10$, $\sigma_x = 4$, $\sigma_y = 2$.

The output of the spatial filtering stage is fed into a pair of first order complex valued temporal filters. The output of a temporal filter satisfies

$$w_{\pm\theta}(x, y, t) = a\, e^{\pm j\Omega_t} w_{\pm\theta}(x, y, t-1) + (1 - a)\, u_\theta(x, y, t) \quad (3)$$

where $u_\theta(x, y, t)$ is the spatial filter output, $0 < a < 1$ controls the temporal bandwidth, and $\Omega_t$ is the preferred temporal frequency. The change in sign of $\Omega_t$ changes the preferred direction. Unlike spatio-temporal Gabor filters, our filter is causal in time. The causality and recursive implementation enable real-time operation by eliminating the need to store a large number of past frames. In addition, causality better models the temporal responses of biological neurons.
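The spatial and temporal stages of Eqs. (1)-(3) can be sketched as follows. For clarity the sketch convolves with an explicitly sampled Gabor kernel via FFT instead of the recursive approximations of [9][10]. It assumes $\sigma_y = 2\sigma_x = 8$ (following the stated relation), a kernel truncation radius of 15 pixels, and a preferred temporal frequency $\Omega_t$, none of which are taken from the paper; $a = 0.8$ lies in the range that Fig. 4 reports as giving the lowest variance.

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(theta, omega_x=2*np.pi/10, sigma_x=4.0, sigma_y=8.0, half=15):
    """Complex Gabor kernel of Eqs. (1)-(2).

    sigma_y = 2*sigma_x follows the stated relation; the truncation radius
    `half` is an assumed value.
    """
    y, x = np.mgrid[-half:half+1, -half:half+1].astype(np.float64)
    xr = np.cos(theta)*x - np.sin(theta)*y      # rotated coordinates, Eq. (2)
    yr = np.sin(theta)*x + np.cos(theta)*y
    env = np.exp(-xr**2/(2*sigma_x**2) - yr**2/(2*sigma_y**2)) / (2*np.pi*sigma_x*sigma_y)
    return env * np.exp(1j*omega_x*xr)

def motion_energy_step(prev_states, frame, theta, a=0.8, omega_t=np.pi/4):
    """One temporal update of the opposite-direction pair tuned to `theta`.

    prev_states = (w_plus, w_minus), the complex filter states from the
    previous frame (initialize with zero arrays). Returns the new states and
    the two motion energy maps (squared magnitudes). omega_t is an assumed
    preferred temporal frequency in radians per frame.
    """
    u = fftconvolve(frame, gabor_kernel(theta), mode="same")    # spatial stage, Eq. (1)
    w_plus, w_minus = prev_states
    w_plus = a*np.exp(1j*omega_t)*w_plus + (1 - a)*u            # Eq. (3), preferred direction
    w_minus = a*np.exp(-1j*omega_t)*w_minus + (1 - a)*u         # Eq. (3), opposite direction
    return (w_plus, w_minus), (np.abs(w_plus)**2, np.abs(w_minus)**2)
```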

C. Heading map

Each neuron in the heading map represents a hypothesized FOE location. It accumulates evidence that is consistent with its hypothesis over space and time. The system estimates heading as the FOE location corresponding to the neuron with the largest response. The neurons accumulate binary valued information extracted from the outputs of the motion energy neurons. Fermuller and Aloimonos [5] show that robust and reliable egomotion information can be extracted from the binary patterns of the sign of the projection between the optical flow vectors and pre-defined directions.

For an observer undergoing pure translation in a static environment with velocity $u = (u_x, u_y, u_z)$, the optical flow vector at image point $(x, y)$ is given by

$$v(x, y) = \frac{u_z}{z(x, y)} \begin{bmatrix} x - x_{FOE} & y - y_{FOE} \end{bmatrix}^T \quad (4)$$

where $z(x, y)$ is the depth of the environmental point imaged at $(x, y)$ and $(x_{FOE}, y_{FOE}) = (-u_x/u_z, -u_y/u_z)$ is the focus of expansion. Although the magnitudes of the optical flow vectors depend upon the scene depth, which is unknown, their directions give scene-independent information about the location of the focus of expansion: they always point radially away from the focus of expansion if $u_z > 0$. Thus, the projected flow $v_\theta(x, y)$ along the direction $\theta$,

$$v_\theta(x, y) = \begin{bmatrix} \cos\theta & \sin\theta \end{bmatrix} \cdot \frac{u_z}{z(x, y)} \begin{bmatrix} x - x_{FOE} & y - y_{FOE} \end{bmatrix}^T \quad (5)$$

is positive along one side of a line that passes through the FOE and is perpendicular to $\theta$, and negative along the other.

The output of the motion energy filter tuned to orientation $\theta$ does not encode the projected flow $v_\theta(x, y)$ explicitly. Rather, it responds maximally to sine wave gratings of a preferred spatial frequency and orientation moving in a preferred direction with a preferred temporal frequency. In a retinotopic array, we would expect the response of the temporal filter $w_{+\theta}(x, y, t)$ to be large if the sign of the projected flow $v_\theta(x, y)$ is positive and small if it is negative. Since the terms "large" and "small" are relative, we use motion opponency to set the reference for comparison. In cortical area MT, neurons tuned to the same orientation but opposite directions of motion are often co-located. Since variations in the input orientation and spatial frequency should cause similar changes in the responses of both types of neurons, differences in their responses are most likely due to differences in the sign of the component of motion in the preferred direction. Thus, the system estimates the sign of the projected flow by comparing them. In particular, it obtains a pair of binary valued maps which are equal to one if the corresponding difference between motion energy neurons tuned to opposite directions exceeds a threshold $Z$, and zero otherwise:

$$S_{\pm\theta}(x, y, t) = U\left( |w_{\pm\theta}(x, y, t)|^2 - |w_{\mp\theta}(x, y, t)|^2 - Z \right) \quad (6)$$

where $U(\cdot)$ is the unit step function.
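Given the two motion energy maps of an opponent pair, Eq. (6) reduces to a single thresholded comparison. A minimal sketch follows; the value of the threshold $Z$ is not given in the text and is left as a parameter.

```python
import numpy as np

def opponent_maps(energy_plus, energy_minus, Z):
    """Binary opponent maps S_{+theta}, S_{-theta} of Eq. (6).

    energy_plus / energy_minus are the motion energy maps of a pair tuned to
    the same orientation but opposite directions; Z is the decision threshold.
    """
    s_plus = (energy_plus - energy_minus > Z).astype(np.uint8)   # U(|w_+|^2 - |w_-|^2 - Z)
    s_minus = (energy_minus - energy_plus > Z).astype(np.uint8)  # U(|w_-|^2 - |w_+|^2 - Z)
    return s_plus, s_minus
```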


Fig. 3: Experimental results when the camera is moving forward. (a) Original image. (b), (c), (e) and (f) Motion opponent maps of filters with preferred orientations $\theta = 0, \pi/4, \pi/2, 3\pi/4$, respectively. The two opposite directions of motion are shown in black and white. (d) The computed heading map. The detected FOE location is indicated by the marker.

The neuron encoding FOE location $(x_{FOE}, y_{FOE})$ integrates information according to the following equations:

$$H(t) = \alpha H(t-1) + (1 - \alpha) F(t) \quad (7)$$

where $H(t)$ is the neuron response and

$$F(t) = \sum_\theta \left( \sum_{(x, y) \in R_{+\theta}} S_{+\theta}(x, y) + \sum_{(x, y) \in R_{-\theta}} S_{-\theta}(x, y) \right) \quad (8)$$

The regions of summation are defined by

$$R_{+\theta} = \{ (x, y) \mid \cos\theta\,(x - x_{FOE}) + \sin\theta\,(y - y_{FOE}) > 0 \} \quad (9)$$

and similarly for $R_{-\theta}$, except that the inequality sign is reversed.

Finally, the system aligns the camera with the heading direction by updating its orientation (pan and tilt angles) with a saccadic eye movement. The direction and magnitude of the saccade are based upon the difference between the current estimated FOE and the image point that intersects the camera's optical axis.
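A brute-force sketch of the accumulation in Eqs. (7)-(9) is given below. It evaluates the half-plane sums explicitly for a subsampled grid of candidate FOE locations; the subsampling stride and the value of $\alpha$ are our assumptions to keep the illustration small (the paper's heading map is a full retinotopic array, and the value of $\alpha$ is not stated). Decoding by argmax follows the description above.

```python
import numpy as np

def heading_map_step(H_prev, S, alpha=0.8, stride=8):
    """One update of the heading map, Eqs. (7)-(9).

    H_prev : previous heading map over candidate FOE locations, or None.
    S      : dict mapping theta -> (S_plus, S_minus) binary maps from Eq. (6).
    alpha  : temporal integration constant of Eq. (7); 0.8 is a placeholder.
    stride : spacing of candidate FOE locations (an assumption for brevity).
    """
    rows, cols = next(iter(S.values()))[0].shape
    ys, xs = range(0, rows, stride), range(0, cols, stride)
    y, x = np.mgrid[0:rows, 0:cols].astype(np.float64)
    if H_prev is None:
        H_prev = np.zeros((len(ys), len(xs)))
    H_new = np.empty_like(H_prev)
    for i, y_foe in enumerate(ys):
        for j, x_foe in enumerate(xs):
            F = 0.0
            for theta, (s_plus, s_minus) in S.items():
                proj = np.cos(theta) * (x - x_foe) + np.sin(theta) * (y - y_foe)
                F += s_plus[proj > 0].sum() + s_minus[proj < 0].sum()   # Eqs. (8), (9)
            H_new[i, j] = alpha * H_prev[i, j] + (1 - alpha) * F        # Eq. (7)
    return H_new

# Decoding: the FOE estimate is the candidate with the largest response.
#   i, j = np.unravel_index(np.argmax(H), H.shape)
#   x_foe, y_foe = j * stride, i * stride
```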

IV. EXPERIMENTAL RESULTS

Our system operates on 176x144 subsampled input images. In each 40 ms frame, fifteen 176x144 neuron maps are computed: the high pass filter map, the gain map, 8 motion energy maps, 4 motion opponency maps and one heading map. Since all temporal filters are first order, the system updates each filter state directly and no image frames need to be stored. Figure 3 illustrates the output of our system during operation.

To characterize the performance of the system as the bandwidth of the temporal frequency tuning of the motion energy neurons changes, we considered the variability in the FOE estimate over different trials and environments. We translate the camera through a cluttered environment. In each trial, the system generates a time trajectory of estimates of the focus of expansion. We define the estimated focus of expansion as the average over 100 frames in time, and consider the variability in this estimate over 10 trials with 5 different environments. Figure 4(a) shows that the detected FOE locations remain similar over a range of temporal bandwidth parameters $a$ between 0.6 and 0.9. Figures 4(b) and 4(c) show that both the trial-to-trial variance and the variance over time within one trial achieve broad minima near $a = 0.8$. Taken collectively, these results indicate that the algorithm is insensitive to the exact parameter choice. This is not due to increased temporal smoothing, since if this were the case, we would expect the variance to decrease monotonically with $a$. Figure 4(d) compares the temporal frequency tuning curves for selected values of $a$. Interestingly, the curves do not cross over at the point of maximum slope, as would be predicted by a sensitivity analysis. The system may be trading off increased temporal integration of information against the sensitivity of the tuning function to changes in temporal frequency.

Fig. 4: Performance of the system as the temporal bandwidth changes. (a) The mean detected FOE locations computed over 10 trials. (b) The variance of the FOE locations over 10 trials. (c) The variance of the FOE locations across 100 frames in time. (d) Temporal tuning curves for selected values of $a$.

Fig. 5: (a) The detected FOE locations as a function of time for one trial. A saccade is made at the 75th frame (grey vertical line). (b) Trajectory of FOE locations over 22 trials. The crosses and circles represent the estimated FOE locations before and after the saccade.

In our second test, we examined how well the system could center the FOE after a single saccade. In each trial, a random initial camera orientation was selected, and the system was run for 75 frames (3 seconds). The instantaneous estimate of the FOE was then used to compute a correction to the camera angles by linearly interpolating values from a lookup table, and the camera angles were updated in a single saccade. The lookup table was generated by storing the estimated FOE for a 10x10 array of camera angles differing by 3 degrees. After the saccade, the system translated for another 3 seconds and the location of the new estimated FOE was recorded. Figure 5 shows that the system can successfully center the FOE after a single saccade.
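The lookup-table correction amounts to interpolating the calibration data in the reverse direction, from a measured FOE back to a pan/tilt offset. A possible sketch using SciPy's scattered-data interpolation is shown below; the 10x10 grid at 3-degree spacing follows the text, while the use of `scipy.interpolate.griddata` and the sign convention of the returned command are our own choices, not details of the authors' implementation.

```python
import numpy as np
from scipy.interpolate import griddata

def saccade_correction(foe_estimate, calib_angles, calib_foes):
    """Interpolate a pan/tilt correction from a calibration lookup table.

    calib_angles : (N, 2) array of (pan, tilt) offsets in degrees, e.g. a
                   10x10 grid at 3-degree spacing flattened to N = 100 rows.
    calib_foes   : (N, 2) array of the FOE locations measured at those offsets.
    foe_estimate : (x, y) FOE estimate taken from the heading map.

    Returns the interpolated (pan, tilt) offset associated with the estimate.
    """
    xi = np.atleast_2d(np.asarray(foe_estimate, dtype=float))
    pan = griddata(calib_foes, calib_angles[:, 0], xi, method="linear")[0]
    tilt = griddata(calib_foes, calib_angles[:, 1], xi, method="linear")[0]
    return float(pan), float(tilt)
```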

V. CONCLUSION

We described a neuromorphic robotic visual system that computes a topographic map of heading direction, based on the motion patterns induced by pure translation. Incoming image sequences are filtered by spatio-temporal Gabor-like filters, which model the functional properties of neurons in V1, and combined in a retinotopic map of neurons tuned to large expanding patterns of motion, as observed in MST. Our experimental measurements indicate that the performance of the system is insensitive to exact parameter choices. Constraining the focus of expansion to align with the direction of translation can simplify subsequent processing. Our system shows how this constraint can be actively enforced.

REFERENCES

[1] M. Lappe, "Computational mechanisms for optic flow analysis in primate cortex," International Review of Neurobiology, vol. 44, pp. 235-268, 2000.
[2] E. H. Adelson and J. R. Bergen, "Spatiotemporal energy models for the perception of motion," J. Opt. Soc. Am. A, vol. 2, pp. 284-299, 1985.
[3] J. A. Perrone, "Model for the computation of self-motion in biological systems," J. Opt. Soc. Am. A, vol. 9, pp. 177-194, 1992.
[4] M. Lappe and J. P. Rauschecker, "A neural network for the processing of optic flow from ego-motion in man and higher mammals," Neural Computation, vol. 5, pp. 374-391, 1993.
[5] C. Fermuller and Y. Aloimonos, "Direct perception of three-dimensional motion from patterns of visual motion," Science, vol. 270, pp. 1973-1976, 1995.
[6] M. Bethge, D. Rotermund, and K. Pawelzik, "Binary tuning is optimal for neural rate coding with high temporal resolution," in Advances in Neural Information Processing Systems, vol. 15, 2002.
[7] C. M. Higgins and S. A. Shams, "A biologically inspired modular VLSI system for visual measurement of self-motion," IEEE Sensors Journal, vol. 2, pp. 508-528, 2002.
[8] B. Shi, E. Tsang, S. Lam and Y. Meng, "Expandable hardware for computing cortical maps," presented at the IEEE International Symposium on Circuits and Systems (ISCAS) 2006, Island of Kos, Greece.
[9] I. T. Young, L. J. van Vliet and M. van Ginkel, "Recursive Gabor filtering," IEEE Transactions on Signal Processing, vol. 50, pp. 2798-2805, 2002.
[10] J. M. Geusebroek, A. W. M. Smeulders and J. van de Weijer, "Fast anisotropic Gauss filtering," IEEE Transactions on Image Processing, vol. 12, pp. 938-943, 2003.
