Embedded Early Vision Systems: Implementation Proposal and Hardware Architecture

Fabio DIAS, Pierre CHALIMBAUD, François BERRY, Jocelyn SEROT, François MARMOITON
LASMEA - Laboratoire des Sciences et Matériaux pour l'Electronique et d'Automatique (Laboratory of Sciences and Materials for Electronics and Automation)
Université Blaise Pascal - Clermont-Ferrand - France
24 avenue des Landais - 63177 AUBIERE Cedex
{Fabio.Dias, Pierre.Chalimbaud, Francois.Berry, Francois.Marmoiton, Jocelyn.Serot}@univ-bpclermont.fr
Abstract: We present our approach towards a smart sensor with embedded processing resources to perform early vision tasks. Our work is based on the active vision paradigm, adapting the perceptual aspects of biological vision to artificial systems. The FPGA/DSP-based hardware platform developed as a smart camera prototype is presented, and a design methodology to reduce implementation time and complexity is sketched. Some examples of implemented applications are shown.

Keywords: Early vision, active vision, embedded systems, smart sensors, FPGA-based architectures
1. Introduction

Boosted by the constant evolution of microelectronics, embedded systems are present today in several technological domains, and artificial vision is one of them. Embedded electronic vision systems may be found in cell phones, automobiles, robots, surveillance systems, biometric devices and even in artistic performances. The purpose of such systems is no longer only to acquire images to be viewed or analyzed by a human observer/operator, but also to process these images automatically, in order to extract useful information and execute tasks of varying complexity. To perform embedded image processing, electronic vision systems can draw on a wide variety of off-the-shelf devices such as microcontrollers, microprocessors, DSPs and reconfigurable structures such as FPGAs (Field Programmable Gate Arrays). The latter are the subject of growing interest in the scientific community, mainly due to their flexibility and to the impressive evolution of array sizes. ASIC solutions, known as electronic retinas, are also exploited. Vision sensors with embedded processing are often called "smart cameras".

In this paper, we propose an embedded early vision system, biologically inspired and based on the active vision paradigm. Our approach toward a smart camera consists of performing most low-level vision tasks at the sensor level, before transmitting the information to the main processing unit. This behaviour is
inspired by the human visual system, where the eyes are responsible for attention and fixation tasks, sending to the brain only pertinent information about the observed scene. This way, the amount of data to be transmitted and analyzed to perform high-level tasks is strongly reduced, and communication bottlenecks can be avoided. The exploitation of such biological vision behaviours in artificial vision systems is known as active vision and will be explained in section 2, where the early vision concept will also be presented and discussed.

Guided by the framework of active and early vision, we developed a suitable hardware architecture to host such applications. Parallel processing and reconfigurability being two important desired features, the choice of an FPGA-based platform was natural. On the other hand, specialized digital signal processing devices (DSPs) are interesting when dealing with complex calculations, which frequently occur in image processing. In section 3, our mixed FPGA/DSP architecture will be briefly described.

The heterogeneous nature of the proposed architecture increases the complexity of the design and implementation stages. Data and task sharing between processing units must be managed and optimized, as well as communication among the different devices (image sensor, memory blocks, processing units, external host system...). A methodological approach would help to reduce design complexity, performing some of these tasks in a quasi-automatic manner, through pre-defined data-flow models and communication modes. One of the main goals of our work is to propose a design methodology based on a previously designed library of functional blocks, containing several elementary operators that can be connected serially or in parallel to perform a given task. This would help to simplify and speed up the design and implementation of early vision systems on a heterogeneous platform. This subject will be briefly discussed in section 4. Finally, some examples of early vision applications implemented on our smart camera are shown in section 5, followed by conclusions and some future perspectives of our work in the last section.
2. Active and Early Vision
Computer vision algorithms like target detection, motion tracking and face recognition are known for their high complexity, dealing with large amounts of data and involving resource-consuming calculations. For a long time, research efforts in this domain were guided by the idea that an image understanding system must transform two-dimensional data into a description of the three-dimensional world, inferring surfaces, volumes, boundaries, shadows, occlusion, depth and motion. However, many attempts to build full representations of a three-dimensional scene were unsuccessful, and even with the outstanding evolution observed in related technological domains, such tasks remain challenging.

Active vision appears as an alternative approach to deal with artificial vision problems. The central idea is to take into account the perceptual aspect of visual tasks, based on biological vision systems. So, instead of building a full 3D representation of the observed scene, the system is supposed to extract only the information useful to solve a given problem, through a task-driven observation strategy. In recent years, several researchers have worked to develop and apply active vision [1] [2] [3]. Most often, studies concentrate on robotics, using binocular systems to control the movements of a mechanical head or a robot.

One of the basic features of active vision is retroaction: a feedback loop can drive the dynamic adaptation of the data acquisition process, depending on the state of the system and on the current task to perform. In artificial systems, this retroactivity may appear under different forms: mechanical (camera movements), optical (zooming), electronic (image acquisition control) or algorithmic (acquisition strategy). The originality of our work is to exploit electronic and algorithmic retroaction using a monocular system, instead of the more classical pan/tilt/zoom approach used in robotic binocular heads. A schematic model of our vision system is shown in the figure below:
Figure 1: Synoptic scheme of an embedded early vision system. (Smart camera: CMOS sensor feeding embedded processing for early vision tasks; controlled visual information flow to the host system for high-level tasks and display; feedback loop for task-driven adaptation.)
In our approach towards an active smart sensor, we suppose that visual processes can be divided into three main layers: attention, fixation and high-level processing. The attention and fixation layers are responsible for data selection and acquisition, extracting suitable visual information to correctly achieve the high-level tasks. They work as pre-processing stages and are usually called early vision. Other pre-processing operations, such as contrast correction or noise filtering, may also be considered low-level early vision tasks. In the human visual system, early vision tasks are performed by the eyes, through their movements (saccades), pupil dilation, lens accommodation, etc. The combination of these behaviours, in a task-driven strategy, allows acquisition of the best information set to perform a given task, like reading or face recognition [4]. Saccadic eye movements are extremely important, due to the heterogeneous resolution of the retina: resolution decreases with eccentricity from the optical axis. Through saccades, the central high-resolution zone (the fovea) scans the scene, fixating on its "salient" points, i.e. those points presenting a peculiar feature in relation to their neighbourhood (colour, contrast, orientation, etc.). Meanwhile, peripheral vision keeps monitoring the whole scene, and the retina's low-resolution zone is used to identify new "salient" points, candidates for the next foveal fixations. In the brain, images are constructed as a combination of several saccade/fixation actions. This behaviour can be summarized as attention and fixation tasks executed in parallel and at different resolutions, collecting data in a selective way in order to accomplish a given task.
Figure 2: Saccadic eye movement and task-driven strategy.
Figure 2 shows an example of saccadic eye movements under a task-driven strategy, from an experiment done by Yarbus [5]. On the left, a reproduction of Repin's picture "An Unexpected Visitor" is shown; a subject is asked to examine this picture and perform a given task. The top right picture records the eye movements when the subject is asked to remember the positions of people and objects in the room. The bottom right shows the record when the subject is asked to give the ages of these people. These recordings show that the observation strategy strongly depends on the task to accomplish.
The heterogeneous retina resolution, combined with the saccade/fixation strategy, results in a strong reduction of the amount of data to be analyzed. It allows humans to obtain a good image resolution while keeping a wide field of view and a satisfactory acquisition rate. We believe that an analogous strategy can be applied to machine vision, avoiding the communication bottleneck caused by huge amounts of data. Attention modules can be employed to identify those points "popping out" from the scene, using low-resolution images and detecting simple features like edges, corners or contrast. Colour and motion can also be used (motion is definitely one of the most "salient" visual features: as soon as a moving object enters our field of view, our attention is immediately drawn to it). Different feature detections can be weighted and combined to compose a "saliency map" [6]. This way, the system can determine those image regions deserving a "closer look", i.e. those points presenting high visual interest according to a specified goal. Finally, saccadic fixations in full resolution may be performed over these specific regions, extracting more detailed information. Examples of multi-resolution attention/fixation algorithms can be found in [7] and [8].

However, even if an active approach, compared to a classical passive method, tends to simplify some vision tasks, its processing remains relatively resource-consuming. Parallelization is necessary and desirable, on one hand to respect real-time constraints, and on the other hand due to intrinsic characteristics of active vision algorithms, which often suppose the parallel execution of several tasks. A dedicated architecture for active vision must consider these aspects, offering a suitable platform to support such applications. In the next section, our dedicated hardware platform is presented and discussed.
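As a concrete illustration of the weighted combination mentioned above, the C sketch below normalizes three feature maps and sums them, with motion weighted highest. It is a minimal software sketch under assumed map sizes and weights, not the mechanism of [6]:

```c
#include <stdint.h>
#include <stddef.h>

#define AW 80   /* width of the low-resolution attention image  (assumed) */
#define AH 60   /* height of the low-resolution attention image (assumed) */

/* Normalize a raw feature map to the common range [0, 255]. */
static void normalize(const uint32_t *in, uint8_t *out, size_t n)
{
    uint32_t lo = in[0], hi = in[0];
    for (size_t i = 1; i < n; i++) {
        if (in[i] < lo) lo = in[i];
        if (in[i] > hi) hi = in[i];
    }
    uint32_t range = (hi > lo) ? (hi - lo) : 1;
    for (size_t i = 0; i < n; i++)
        out[i] = (uint8_t)((uint64_t)(in[i] - lo) * 255u / range);
}

/* Weighted combination of three normalized feature maps into a
   saliency map; motion gets the largest weight, being the most
   "salient" cue. The weights are illustrative assumptions. */
void saliency_map(const uint32_t *edges, const uint32_t *contrast,
                  const uint32_t *motion, uint8_t *saliency)
{
    static uint8_t e[AW * AH], c[AW * AH], m[AW * AH];
    normalize(edges,    e, AW * AH);
    normalize(contrast, c, AW * AH);
    normalize(motion,   m, AW * AH);
    for (size_t i = 0; i < AW * AH; i++)
        saliency[i] = (uint8_t)((e[i] + c[i] + 2u * m[i]) / 4u);
}
```

The regions with the highest saliency values would then be the candidates for full-resolution fixations.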
3. Hardware Architecture
Guided by the context described in the previous section, we developed a smart camera prototype, based on a CMOS image sensor and embedded processing resources [10]. The purpose of this camera is to perform early vision tasks in real time, before transmitting data to the external host system.
Figure 3: Smart camera hardware architecture.
The prototype architecture is represented in figure 3. A Stratix® EP1S60 FPGA device plays the central role in the system, being responsible for interconnecting all the other hardware devices. Surrounding it, 10Mb (5x2) of SRAM and 64Mb of SDRAM are available for image and other data storage. The VGA (640x480 pixels) CMOS imager is connected to 4 digital-to-analog converters (DACs), allowing the pixel analog-to-digital conversion to be controlled via 4 reference voltages. This feature can be exploited to control the sensor light response; an illustrative example of dynamic contrast optimization is shown in section 5.
The choice of a CMOS imager was justified by its ability to address specific pixels. This feature allows the controlled acquisition of a WOI (window of interest) instead of the whole image, which is extremely useful when dealing with fixation algorithms: capturing a small image sample, containing only the necessary data, allows very high acquisition rates. The acquisition frequency is 8Mpixels/s, i.e. about 25 full images, or more than 7000 windows of 32x32 pixels, per second. A processor core (NIOS) instantiated in the FPGA is used to manage communications and imager access. NIOS is a user-configurable soft-core processor, featuring many implementation and optimization options. Interface with the host system is performed through a high-speed USB 2.0 link. The image processing functions are implemented in the FPGA, through hardwired operators, and in the DSP, the latter integrated on a separate board which can be connected to the main circuit. From a functional point of view, the DSP works as a supplementary FPGA operator. The presented architecture allows different parallelization modes, like task parallelism and pipelining. In the next section we will discuss some implementation issues on this hardware platform.
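As a quick sanity check, the acquisition rates quoted above follow directly from the 8Mpixels/s throughput; the back-of-the-envelope sketch below neglects readout and control overheads:

```c
/* Frame and window rates implied by the imager's pixel throughput. */
#include <stdio.h>

int main(void)
{
    const double pix_per_s = 8e6;         /* imager throughput, pixels/s */
    const double full_vga  = 640 * 480.0; /* one VGA frame: 307200 px    */
    const double woi       = 32 * 32.0;   /* one 32x32 window: 1024 px   */

    printf("full VGA frames/s: %.0f\n", pix_per_s / full_vga); /* ~26   */
    printf("32x32 windows/s  : %.0f\n", pix_per_s / woi);      /* ~7812 */
    return 0;
}
```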
Figure 4: Smart camera prototype.
4. Design Methodology
One of the main goals of our work is to define a design methodology allowing the fast prototyping and implementation of different applications on the smart camera platform, with a minimal burden for the application programmer. The central concept is to base the approach on a pre-defined set of elementary operators (EOs), covering the main functions currently employed in early vision algorithms (image filtering, image difference, motion detection/estimation, mask convolution, etc.). A library of basic EOs is available to the programmer, who is able to add new operators to this library according to the needs of his application. However, these new custom EOs must meet some design constraints to ensure their compatibility with the other operators, as well as with the rest of the system. A communication protocol, including control signals driven by the NIOS, data formats and synchronization rules, must be respected. Our proposal of SoPC (System on Programmable Chip) is shown in figure 5:
Figure 5: Example of SoPC pre-defined functional architecture.

The dark rectangle ("Set of custom modules") indicates the only block that must be configured by the programmer, according to the application: suitable EOs are interconnected in order to perform the desired tasks, following a data-flow scheme. Control signals such as operator activation/deactivation, data-path control (mux/demux) and parameter changes are managed by the NIOS processor. Addressing and control of the image sensor, as well as the interface with the host computer, are also managed by the NIOS. The NIOS processor thus plays a central role in the system, managing almost all data transfers and synchronization. Using an embedded soft-core processor as the system master is an interesting way to combine the flexibility of a software solution with the high performance offered by hardwired operators.

The main goal of this methodology is to make all resource-management tasks almost transparent to the application designer. Ideally, he would be able to design and implement his application without considering hardware aspects: his work would only be to connect suitable EOs, following some connection rules and the data-flow scheme of his tasks, and to program the NIOS to coordinate the general functioning of the whole system. The SoPC shown in figure 5 has an integrated FPN correction module [10]. FPN (Fixed Pattern Noise) is a known problem of CMOS imagers; the integrated corrector reduces its influence, yielding better images. The DSP device is not represented in the figure: it is seen as a complementary EO, communicating through the same modes as the other ones, and is therefore transparent to the rest of the system. In the next section we will show some examples of elementary operators which can be used to perform early vision tasks. In a first stage of our work, such operators were implemented and tested separately; a future work will be to interconnect these operators to compose a more complex system.
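To make the compatibility constraints more concrete, the C sketch below models what a common EO interface could look like: a pixel-stream token carrying synchronization signals, plus NIOS-driven enable and parameter inputs. The real operators are hardwired FPGA blocks, so every name and the token format here are illustrative assumptions, not the actual protocol:

```c
#include <stdint.h>
#include <stdbool.h>

/* One pixel-clock "beat" on the stream between two EOs (assumed format). */
typedef struct {
    uint8_t pixel;   /* grey-level data            */
    bool    valid;   /* data-valid strobe          */
    bool    sof;     /* start-of-frame sync signal */
} eo_token;

/* Every EO exposes the same interface: enable and parameter inputs
   driven by the NIOS, and a token-in/token-out processing step. */
typedef struct eo {
    bool     enabled;   /* activation/deactivation, NIOS-controlled */
    uint32_t param;     /* runtime parameter, NIOS-controlled       */
    eo_token (*step)(struct eo *self, eo_token in);
} eo;

/* Example EO: thresholding, with the threshold as NIOS parameter. */
static eo_token threshold_step(eo *self, eo_token in)
{
    eo_token out = in;
    if (self->enabled && in.valid)
        out.pixel = (in.pixel > (uint8_t)self->param) ? 255 : 0;
    return out;
}

/* EOs connected serially form a pipeline: one token per clock tick. */
eo_token run_chain(eo *chain, int n, eo_token in)
{
    for (int i = 0; i < n; i++)
        in = chain[i].step(&chain[i], in);
    return in;
}
```

Under such a convention, the DSP only has to wrap its transfers in the same token handshake, which is why it can appear to the rest of the system as just another EO.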
5. Applications and Examples
In order to assess the validity of the proposed hardware platform and design methodology, some early vision algorithms were studied, implemented and tested. Three examples will be briefly described here. The first one is an example of attention module, scanning the image to detect movement in the scene. The second example is a dynamic contrast optimization algorithm, allowing the acquisition of good-quality images even under poor or saturating illumination conditions. The last example is a tracking algorithm, able to track a given image pattern at very high acquisition rates (1000 images/s).
Motion detection: based on an image difference method, this algorithm searches for moving objects in the scene. In the image plane, motion translates into temporal and spatial grey-level changes. This module detects temporal changes and defines rectangular windows surrounding the moving objects. In a first step, a difference image is computed between two consecutive frames. This difference image is thresholded, resulting in a binary 2D array. Its vertical projection (column-wise sum) is computed, and a peak detector is applied to identify the vertical bands where moving objects can probably be found. Finally, the horizontal projection (row-wise sum) is computed inside each selected vertical band, and a second peak detection defines the position of the moving object. This way, it is possible to detect and localize several moving objects simultaneously.
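The C sketch below is a plain software transcription of these steps (frame difference, thresholding, vertical projection, band detection), for clarity only; in the camera the operations are hardwired and applied directly to the pixel flow. The horizontal projection inside each detected band works the same way on rows:

```c
#include <stdint.h>
#include <stdlib.h>

#define W 640
#define H 480

/* Threshold the absolute difference of two consecutive frames. */
void frame_diff(const uint8_t *prev, const uint8_t *cur,
                uint8_t *bin, uint8_t thresh)
{
    for (int i = 0; i < W * H; i++)
        bin[i] = (abs((int)cur[i] - (int)prev[i]) > thresh) ? 1 : 0;
}

/* Column-wise sum of the binary image (vertical projection). */
void v_projection(const uint8_t *bin, uint32_t *proj)
{
    for (int x = 0; x < W; x++) proj[x] = 0;
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            proj[x] += bin[y * W + x];
}

/* Find the [start, end) extent of the next projection peak above
   min_level, scanning from `from`; returns 0 when no band is left.
   Calling again from *end enumerates all bands, i.e. all objects. */
int find_band(const uint32_t *proj, int n, uint32_t min_level,
              int from, int *start, int *end)
{
    int x = from;
    while (x < n && proj[x] <= min_level) x++;
    if (x == n) return 0;
    *start = x;
    while (x < n && proj[x] > min_level) x++;
    *end = x;
    return 1;
}
```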
The motion detector is an example of an attention module. Exploiting the characteristics of the proposed hardware platform, it was implemented and tested. It is executed as a pipeline of simple operations applied directly to the incoming pixel flow (image difference – thresholding – accumulation – peak detection), and gives satisfactory results for the detection of multiple moving objects (figure 6).
Figure 6: Motion detection algorithm. Top: two consecutive frames of a sequence. Middle and bottom left: thresholded difference image and its vertical projection. Middle right: horizontal projection inside a selected vertical band. Bottom right: final result – two moving objects were correctly detected and localized in the image.
Other feature detections can be performed in parallel, and their results can be combined to compose a saliency map. Still exploiting a pipelined chain, this saliency map may then be used as an input for other algorithms. It is therefore a first step toward a cognitive artificial vision system.
Contrast optimization: working like the eye's pupil, this algorithm dynamically adapts the image acquisition to the light conditions of the scene. A control loop acts directly on the conversion reference voltages of the CMOS sensor, yielding an optimal image acquisition even in situations of improper illumination [11]. It can work as pre-processing for other algorithms, conditioning the acquired images to allow a better detection or tracking, for instance. The results in figure 7 show that suitable images are obtained even in extreme illumination conditions, where a standard camera would normally be completely bloomed.

Figure 7: Contrast optimization example. Top left: a light-saturated scene, acquired with standard adjustment. Top right: contrast optimization result. Bottom: zoom on the lamp filament using contrast optimization: the image is not bloomed, even in this extreme illumination condition.
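As an illustration of such a control loop, the sketch below adjusts the two conversion references from the grey-level histogram of the last frame, with a simple proportional rule. The DAC stand-ins, the channel mapping and the update rule are assumptions made for illustration; the actual controller is described in [11]:

```c
#include <stdint.h>

#define VREF_LOW   0   /* DAC channel for the low conversion reference  */
#define VREF_HIGH  1   /* DAC channel for the high conversion reference */

/* Stand-ins for the real DAC access (illustrative only). */
static uint16_t vref[2] = { 512, 3584 };
static void     dac_set(int ch, uint16_t v) { vref[ch] = v; }
static uint16_t dac_get(int ch)             { return vref[ch]; }

/* One loop iteration: widen or narrow the conversion range so that
   only a small fraction of pixels saturates at either histogram end. */
void contrast_step(const uint32_t hist[256], uint32_t n_pixels)
{
    const uint32_t target_pm = 5;  /* tolerate ~0.5% saturated pixels */
    const uint16_t step      = 16; /* proportional step, in DAC LSBs  */

    /* Per-mille of pixels stuck at the dark and bright extremes. */
    uint32_t dark   = (uint32_t)((uint64_t)hist[0]   * 1000 / n_pixels);
    uint32_t bright = (uint32_t)((uint64_t)hist[255] * 1000 / n_pixels);

    /* Too many dark-saturated pixels: lower the low reference;
       none at all: the range is wider than needed, tighten it. */
    if (dark > target_pm) dac_set(VREF_LOW, dac_get(VREF_LOW) - step);
    else if (dark == 0)   dac_set(VREF_LOW, dac_get(VREF_LOW) + step);

    /* Same reasoning at the bright end of the range. */
    if (bright > target_pm) dac_set(VREF_HIGH, dac_get(VREF_HIGH) + step);
    else if (bright == 0)   dac_set(VREF_HIGH, dac_get(VREF_HIGH) - step);
}
```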
Motion tracking: the tracking algorithm is based on the KLT (Kanade-Lucas-Tomasi) method [9]. Given a reference image sample, the system tracks the position of this sample, estimating its movements across the scene. Acquisition is performed only for a small portion of the image, where the searched sample is supposed to be. This allows a very high acquisition rate (necessary for efficient tracking), and is an example of a fixation algorithm.
Figure 8: Motion tracking algorithm. (Reference image template; acquired samples with estimated translations (0,6), (-3,4) and (-7,-8); corrected window.)
Once the reference template and its initial position are defined, the system repeatedly acquires an image sample at the defined position, estimates its translation relative to the reference, and finally updates the acquisition window position to compensate for the object motion (figure 8). This way, the acquisition window follows the object across the scene, trying to always acquire a sample identical to the reference one. More details about this algorithm and its implementation are found in [10].
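The translation estimation at the core of this loop can be sketched as a single gradient-based least-squares step, in the spirit of the KLT method [9]. This is a simplified, non-iterative C version under a small-displacement assumption, not the embedded implementation of [10]:

```c
#include <stdint.h>

#define N 32  /* tracked window is N x N pixels */

/* Estimate the translation (dx, dy) between the reference template
   and the newly acquired window by solving the 2x2 normal equations
   built from the template's spatial gradients. The sign convention
   depends on how the acquisition window is re-positioned. */
void estimate_translation(const uint8_t ref[N][N],
                          const uint8_t cur[N][N],
                          double *dx, double *dy)
{
    double gxx = 0, gxy = 0, gyy = 0, ex = 0, ey = 0;

    for (int y = 1; y < N - 1; y++) {
        for (int x = 1; x < N - 1; x++) {
            /* Central-difference gradients of the template. */
            double gx = (ref[y][x + 1] - ref[y][x - 1]) / 2.0;
            double gy = (ref[y + 1][x] - ref[y - 1][x]) / 2.0;
            double e  = (double)ref[y][x] - (double)cur[y][x];
            gxx += gx * gx;  gxy += gx * gy;  gyy += gy * gy;
            ex  += gx * e;   ey  += gy * e;
        }
    }
    /* Solve [gxx gxy; gxy gyy] * [dx; dy] = [ex; ey]. */
    double det = gxx * gyy - gxy * gxy;
    if (det != 0.0) {
        *dx = (gyy * ex - gxy * ey) / det;
        *dy = (gxx * ey - gxy * ex) / det;
    } else {
        *dx = *dy = 0.0;  /* degenerate texture: no update */
    }
}
```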
The three functional modules presented above may be taken as elementary operators in the context of our design methodology. Using the presented hardware architecture, they can be interconnected and executed simultaneously, in a processing chain performing a complete early vision process: image conditioning, attention and fixation.

6. Conclusions and Perspectives
We presented our implementation proposal for an embedded early vision system. Our work is based on the active vision paradigm, with special focus on early vision processes: we call early vision all those visual processes which do not involve decision or interpretation skills. The latter are considered high-level tasks and are executed by the host system instead. A smart camera prototype has been developed and briefly described here. Its hardware architecture combines a CMOS imager with embedded processing resources, through a mixed FPGA/DSP platform. This prototype is used as a research and test platform for early vision algorithms, and some examples of implemented applications were presented.
We showed how parallel processing can give artificial vision systems a cognitive behaviour. Such an approach may contribute to simplify some complicated high-level visual tasks, through a reduction of the data amount and a better pertinence of the acquired information. A future work will be to implement and evaluate the interconnection of several elementary operators: for example, contrast optimization, motion detection and sample tracking working simultaneously and cooperating in a pipelined chain. A fundamental goal of our work is to propose a design methodology speeding up the design and implementation stages of active vision systems; design complexity must also be reduced, through a previously designed library of compatible EOs.

7. References

[1] D. Ballard: "Animate Vision", Artificial Intelligence, 48(1): 57-86, 1991.
[2] R. Bajcsy: "Active Perception", Proceedings of the IEEE, 76(8): 996-1005, 1988.
[3] G. Granlund: "Does vision inevitably have to be active", SCIA (Scandinavian Conference on Image Analysis), Kangerlussuaq, Greenland, 1999.
[4] J. Findlay, I. Gilchrist: "Visual attention: the active vision perspective", in M. Jenkin and L.R. Harris (Eds.), Vision and Attention, chapter 5, pp. 83-103, Springer-Verlag, 2001.
[5] A. Yarbus: "Eye movements and vision", Plenum Press, 1967.
[6] M. Park, K. Cheoi, T. Hamamoto: "A Smart Image Sensor with Attention Modules", CAMP (International Workshop on Computer Architecture for Machine Perception), Palermo, Italy, 2005.
[7] W. Wong, R. Hornsey: "Design of an Electronic Saccadic Imaging System", CCECE (Canadian Conference on Electrical and Computer Engineering), Ontario, Canada, 2004.
[8] P. Camacho, F. Arrebola, F. Sandoval: "Multiresolution Sensors with Adaptive Structure", IECON (Conference of the IEEE Industrial Electronics Society), Aachen, Germany, 1998.
[9] C. Tomasi, T. Kanade: "Detection and tracking of point features", Carnegie Mellon University Technical Report CMU-CS-91-132, April 1991.
[10] P. Chalimbaud, F. Berry: "Design of an Imaging System based on FPGA Technology and CMOS Imager", FPT (IEEE International Conference on Field-Programmable Technology), Brisbane, Australia, 2004.
[11] P. Chalimbaud, F. Berry: "Contrast Optimization in a Multi-Windowing Image Processing Architecture", MVA (IAPR Conference on Machine Vision Applications), Tsukuba, Japan, 2005.