FPGA–DSP co-processing for feature tracking in smart video sensors

Matteo Tomasi, Shrinivas Pundlik & Gang Luo

Journal of Real-Time Image Processing, ISSN 1861-8200, DOI 10.1007/s11554-014-0413-2





SPECIAL ISSUE PAPER




Received: 9 December 2013 / Accepted: 12 March 2014 / © Springer-Verlag Berlin Heidelberg 2014

Abstract Motion estimation in videos is a computationally intensive process. A popular strategy for dealing with such a high processing load is to accelerate algorithms with dedicated hardware such as graphics processing units (GPU), field programmable gate arrays (FPGA), and digital signal processors (DSP). Previous approaches addressed the problem using accelerators together with a general purpose processor, such as an Acorn RISC Machine (ARM) processor. In this work, we present a co-processing architecture using an FPGA and a DSP. A portable platform for motion estimation based on sparse feature point detection and tracking is developed for real-time embedded systems and smart video sensor applications. A Harris corner detection IP core is designed with a customized fine-grain pipeline on a Virtex-4 FPGA. The detected feature points are then tracked using the Lucas–Kanade algorithm in a DSP that acts as a co-processor for the FPGA. The hybrid system offers a throughput of 160 frames per second (fps) for VGA image resolution. We have also tested the benefits of our proposed solution (FPGA + DSP) in comparison with two other traditional architectures and co-processing strategies: hybrid ARM + DSP and DSP only. The proposed FPGA + DSP system offers a speedup of about 20 times and 3 times over the ARM + DSP and DSP only configurations, respectively. A comparison of the Harris feature detection algorithm performance between different embedded processors (DSP, ARM, and FPGA) reveals that the DSP offers the best performance scaling when going from QVGA to VGA resolutions.

Keywords Feature tracking · Embedded systems · FPGA · DSP · Real-time processing · Co-processing

M. Tomasi (✉) · S. Pundlik · G. Luo
Schepens Eye Research Institute, Massachusetts Eye and Ear, Harvard Medical School, 20 Staniford Street, Boston, MA 02114, USA
e-mail: [email protected]
S. Pundlik, e-mail: [email protected]
G. Luo, e-mail: [email protected]


1 Introduction

In general, the term "smart video sensor" implies that the conventional image acquisition process is accompanied by an intrinsic scene understanding procedure, to offer a value-added solution to the users. The image processing outcomes provided by the camera, in addition to the actual images, can be used for higher-level computer vision tasks. In video surveillance, for example [1], integrated processing allows the system to save hardware resources and communication bandwidth, while providing real-time results. Smart cameras are increasingly being used in industrial machine vision, automobiles, robots, and mobile devices. Some of the desired characteristics of a smart camera system are flexibility, powerful processing capabilities, low power consumption, and small size. Popular processor platforms of smart video sensors include field programmable gate arrays (FPGA), digital signal processors (DSP), general purpose processors such as Acorn RISC Machines (ARM), or a hybrid system containing two of these processors on a single chip, commonly referred to as a system on chip (SoC) [2]. A large number of existing approaches [3–5] adopt DSP and ARM processors because of their user-friendly programmability, large-scale production, and relatively low cost. While being more flexible, this solution provides less processing power as



compared to a DSP only implementation for computationally intensive vision algorithms [6–8]. In general, the common trend in embedded systems is to use a general purpose processor accelerated by dedicated hardware such as a DSP or a graphics processing unit (GPU). Such SoCs are commonly seen in mobile devices and can provide a flexible computation platform with low power requirements. In the context of embedded computer vision applications involving heavy computation, the main drawback of these devices is their fixed level of parallelism, which can significantly constrain the throughput of the output stream. In order to achieve processing optimization for such computationally intensive applications, FPGAs or application-specific integrated circuits (ASIC) are ideal. However, they need a longer development time compared to general purpose processors. Some hardware manufacturers [2, 9] now offer development boards that combine customizable hardware (FPGAs) and easy-to-program processor architectures (DSPs or ARM processors) to achieve higher degrees of operational capability, while maintaining flexibility in system development. In order to facilitate the communication between the two processors, co-processors are often embedded in the same platform, or within the same chip. These optimized architectures facilitate HW-SW co-development [10]. In this paper, we present the implementation of sparse optical flow intended for smart video sensors on a hybrid DSP + FPGA architecture. With this rarely used combination, we intended to achieve a balance between throughput and development time. In fact, our solution runs faster than a full DSP or an ARM + DSP based implementation, and is relatively easier to develop than a fully FPGA-based solution. We implement Harris corner detection [11] using a fine-grain pipeline in the FPGA that allows a throughput of 60 megabytes per second (MBPS). The DSP receives the detected features from the FPGA and tracks them concurrently, achieving an overall frame rate of 160 fps for VGA images. The task division has been done in this way so as to leverage the strengths of both processing cores. We decided to implement a pixel-wise operation such as Harris corner detection in the FPGA because its massive parallel processing capability can be optimally used to compute a dense Harris strength map and perform the non-maxima suppression and thresholding operations. While the DSP does not offer the kind of parallelism that is offered by the FPGA, it provides a robust platform to deal with the iterative processing involved in tracking the detected feature points. Furthermore, since the detected feature points are far fewer than the total number of image pixels, the DSP has to process less data than the FPGA. Compared to two conventional solutions based on ARM + DSP or DSP only, the FPGA + DSP system offers a speedup of about


20 times and 3 times, respectively. We designed a system architecture where the FPGA acts as a heavy-duty, pixel-level image processor for pre-filtering and feature detection, in addition to its common role as a master for communication and data control. The communication with an external DSP is achieved through a specific protocol that takes advantage of the video ports. The DSP, taking advantage of its enhanced direct memory access (EDMA) capabilities, concurrently processes a pair of video frames using a double-buffer technique: a video frame pair is processed while another one is being received. To the best of our knowledge, this is the first time in the case of FPGA + DSP systems that the FPGA performs a major image processing task instead of just acting as a controller for I/O devices and the co-processor. The proposed approach and the developed IP cores have the advantages of high modularity and scalability, and can be easily integrated in other existing platforms, or optimized for new ad-hoc smart video sensors. For this work we use a commercial platform that includes a DSP and an FPGA on a single board. However, due to the inherent design constraints of the platform, there is no optimum way of communicating between the two processors. Oftentimes, circuitry for data sharing represents the bottleneck of co-processing. Nevertheless, our outcome, proving the viability of the DSP–FPGA co-processing concept, can inspire new commercial solutions for smart sensors where the DSP acceleration is embedded in the FPGA, reducing the burden of this inter-processor communication.

2 Related work and main contributions

The increasing popularity of smart video systems is playing an important role in shaping imaging and computer vision technology. Nowadays, many imaging sensor manufacturers provide flexible systems that integrate cameras and image processing hardware. For example, Vision Components offers a commercial product with a DSP and a small FPGA, where the latter just plays a synchronization role without actually doing any video processing. In this case, the DSP is the main processor and the manufacturer provides custom development tools for rapid algorithm development. Similar smart camera architectures have been developed for video surveillance, using the FPGA as a flexible controller for the co-processor (a Blackfin DSP) and the data flow [12]. In [13] and [14], an ad-hoc multi-processor smart camera system is proposed that integrates a camera and inertial measurement units (IMU). The gyroscopic signal in this case is used to provide depth estimates with the moving camera. In [3], the depth estimation instead is provided by a stereo camera pair and the motion sensor information is


used to refine the feature tracking and calculate the visual odometry in the ARM processor. Other approaches such as [15] focus more on optimizing the bandwidth and the power consumption of camera sensor networks. They omit the use of the FPGA for performing image processing tasks and just use an Intel portable processor instead. A trend emerges from the previous work in the case of FPGA + DSP smart camera systems, where the FPGA is merely used as a logic controller for the video stream and not as the HW accelerator. In this paper, we present a novel strategy for image processing in FPGA + DSP hybrid systems that fully utilizes the capabilities of the FPGA to perform computationally intensive image processing tasks along with the DSP. We choose feature tracking as an example for demonstrating our FPGA + DSP co-processing strategy. Feature matching algorithms (feature tracking) have been a focus of intense scrutiny over the past few decades, and the proposed solutions are achieving increasingly accurate results [16]. Most of the approaches that produce highly accurate results tend to be complex and computationally expensive. Thus, many state-of-the-art techniques are not able to run in real time. Algorithms that can run on PCs, such as variational approaches [6], are not suitable for embedded systems because they include hierarchical processing that is not efficient when running at low frequencies. Implementations on embedded systems have been reported in the literature, such as [3, 17–21]. Overall, it is a challenge to develop an accurate motion estimation algorithm that is fast and hardware friendly [22]. In the particular case of optical flow computation, many contributions have been made to improve the state-of-the-art in FPGA systems [23–31] and GPU-based systems [32–34]. The latter have been explored more lately because of their high processing power. However, GPUs are not feasible for smart camera implementations in terms of size and power consumption. With this work, we demonstrate the idea of accelerating intensive processing tasks with an FPGA-based device and we compare our solution with some of the existing FPGA-based solutions for feature tracking. We also exploit various embedded processors and hardware/software co-design schemes to determine the optimum strategy for feature tracking. In spite of the common tendency to use only one hardware accelerator for speeding up a general purpose processor (low level of parallelism), we propose a concurrent processing approach between architectures consisting of high-level (FPGA) and middle-level parallelism (DSP). Making use of different development boards with a multi-processor system, we also evaluate various combinations of hybrid parallel processing and compare them to our approach. Furthermore, this work attempts to quantify the disparity between some of the commercial processors using the same feature detection algorithm as a benchmark.

The adopted processor platforms, FPGA, DSP, and ARM, are among the most commonly used in current embedded systems. Specifically, we develop a hardware-accelerated tracking system with a customized IP feature detection core in the FPGA and compare it to its embedded software version for the ARM and the DSP. This proposed strategy of FPGA + DSP co-processing is potentially faster than the latest FPGA–ARM approaches [2]. The main contributions of this work are summarized as follows:

• design of an FPGA + DSP co-processing strategy for image processing applications, where the FPGA plays an important role in core image processing tasks,
• development of a reusable IP core in the FPGA that accelerates the parallel DSP co-processing, and
• design and analysis of real-time feature detection modules for various embedded systems (FPGA, DSP, and ARM).

3 System overview

In this work we adopt existing approaches for sparse feature tracking and refine them for embedded system integration. The whole tracking system can be divided into two main stages: feature detection and feature tracking. Features are detected using the Harris corner detection process [11], denoted by the following equation:

$$\mathrm{Det}(A) - k \cdot \mathrm{Tr}(A)^2 > s, \qquad (1)$$

where $k$ is the Harris parameter, $s$ is the feature detection threshold (a fixed predefined value), and $A$ is the Harris strength matrix,

$$A = \begin{bmatrix} I_x^2 & I_{xy} \\ I_{xy} & I_y^2 \end{bmatrix}. \qquad (1a)$$

If $I$ is the input image, $I_x$ and $I_y$ represent the partial derivatives along the x and y axes. This is a pixel-wise processing step, which reduces the total number of pixels in the given image to a smaller set of corner points that are good for tracking. For these pre-selected feature points, we calculate the optical flow using the Lucas and Kanade [15] local approach and obtain the corresponding motion vectors. Being a local approach, the Lucas and Kanade method can be efficiently implemented in embedded hardware. The sparse nature of the optical flow computation is sufficient and practical in many real-world applications, where an object's movements can be represented by the motion of the feature points associated with it. This approach can be adapted for implementation on embedded systems and is well suited for the different levels of parallelism. We have implemented this sparse feature tracking method using a concurrent processing


Fig. 1 Two platforms used in this work for implementing the feature tracking system: (left) an SMT339 platform with FPGA + DSP processing and (right) the Texas Instruments DM3730 architecture with ARM + DSP. Both platforms adopt the DaVinci architecture DSP with fixed point notation

architecture between FPGA and DSP. The implementation is done using an SMT339 image processing module from Sundance [9], which includes a Virtex XC4VFX60 FPGA from Xilinx and a TMS320DM642 DSP from Texas Instruments. We have also implemented a more traditional approach with a general purpose processor (ARM) and HW acceleration (DSP), to compare it with the FPGA + DSP implementation. The ARM + DSP approach is implemented on the DM3730 architecture from Texas Instruments, which includes an ARM Cortex A8 processor and a C64x DSP. Figure 1 shows an overview of the main task division for our embedded architecture according to the available processors. While the feature detection is implemented and tested on all three processors (FPGA, DSP, and ARM), the tracking is performed only in the DSP. Many other strategies can be explored, but we decided to take advantage of the high degree of parallelism in the FPGA for pixel-wise operations, while leaving the iterative tracking step for the DSP to handle. Feature tracking on an ARM or an FPGA-based processor is not considered in this work. The adopted feature tracking algorithm performs an iterative local search which, if implemented on an ARM processor, would slow down the overall processing significantly because of the reduced level of parallelism of the ARM processor. On the other hand, an FPGA implementation would increase the processing speed, but it would lead to various design


challenges and a reduction in system flexibility. In this work we aim to work around the loss of flexibility in FPGA programming and the increase in development time by using the DSP. Previous works have reported the benefits of using a DSP or a GPU as compared to an FPGA [35, 36]. In the case of an FPGA, hardware re-configuration and design are much more time demanding than traditional software implementation, compilation, and execution on a general purpose processor. For this study we make use of commercial platforms for both testing and debugging. However, the proposed system and the described cores can be transferred to other existing platforms with optimized interfaces and communication, such as [13]. We report the details for each module and the hardware architecture in the following sections.

4 Feature detection

Since feature detection is a pixel-wise operation, different optimization strategies need to be adopted for different platforms. In this paper, we implemented Harris corner detection at a high level of parallelism (customized pipeline in an FPGA), a middle level of parallelism (TI C64x DSP), and a low level (general purpose ARM Cortex A8 processor), and a quantitative comparison experiment was conducted.

Fig. 2 High-level diagram for the feature detection IP core in the FPGA. For each processing step we report the number of pipeline stages

4.1 FPGA

As shown in Fig. 1, we developed a feature detection core for the FPGA. The fine-grain pipeline strategy adopted for the core increases the maximum clock frequency by reducing the largest path delay between flip-flops. This design strategy is radically different from the more common multi-core processing adopted in common parallel architectures such as DSPs, GPUs, and multi-core general purpose processors, where different tasks are assigned to each core to exploit the parallelism. Our approach for the customized FPGA architecture has been adopted previously in [26, 37, 38], and it has been demonstrated to be efficient for data streams and image processing. The only drawback of such large pipelines is a higher latency (a constant delay before the first pixel appears at the output). However, in our case, we get a small latency of five image lines (on the order of microseconds), which is negligible compared to the time for an entire image (on the order of milliseconds). The latency is introduced only once at the beginning of the processing and the delay remains constant through the continuous data streaming, so it does not affect the real-time performance. This means that once the pipeline is filled, the circuit is able to process 1 pixel per clock cycle. The local processing for the 2D filters and the convolution operations is optimized by the use of local embedded memories with fast access. They can be considered as a customized cache memory where the stream is temporarily stored with a FIFO strategy. The core produces the final list of detected features including pixel coordinates, pixel intensity values, and the feature flag (Relevant signal in Fig. 2).
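As a rough check of the orders of magnitude quoted above, the five-line latency and the frame time can be estimated from the synthesized clock reported later in Table 1; the sketch below assumes a 640 × 480 stream with no blanking, which is an illustrative simplification.

```python
# Back-of-the-envelope check of the pipeline latency versus the frame time,
# assuming the 62.1 MHz clock of Table 1, a 640 x 480 image, 1 pixel per clock
# cycle and no horizontal blanking (an illustrative simplification).
f_clock = 62.1e6                      # Hz, synthesized clock frequency (Table 1)
latency_s = 5 * 640 / f_clock         # five image lines buffered before the first output pixel
frame_s = 640 * 480 / f_clock         # time to stream one full frame
print(latency_s * 1e6)                # ~51.5 (microseconds)
print(frame_s * 1e3)                  # ~4.9 (milliseconds)
```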

The feature detection module using the Harris method [11] can be divided into five main steps (a software reference sketch of these steps follows the list):

• 3 × 3 Gaussian filter
• 3 × 3 gradient computation
• 7 × 7 average filter
• feature detection according to Eq. (1)
• 5 × 5 non-maxima suppression.
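The following is a PC-side software sketch of these five steps; it is not the FPGA code itself. The OpenCV calls mirror the window sizes listed above, while the Gaussian smoothing parameters and the default values of k and the threshold are illustrative assumptions, since the fixed-point hardware uses its own scaling (k equivalent to 0.5 and s = 1,000, as described later in this section).

```python
import cv2
import numpy as np

def harris_features(gray, k=0.0625, tau=1e6):
    """Reference model of the five-step detector; k and tau defaults are illustrative."""
    g = cv2.GaussianBlur(gray.astype(np.float32), (3, 3), 0)    # 3 x 3 Gaussian filter
    ix = cv2.Sobel(g, cv2.CV_32F, 1, 0, ksize=3)                # 3 x 3 gradients
    iy = cv2.Sobel(g, cv2.CV_32F, 0, 1, ksize=3)
    ixx = cv2.boxFilter(ix * ix, -1, (7, 7))                    # 7 x 7 averaging of the
    iyy = cv2.boxFilter(iy * iy, -1, (7, 7))                    # entries of matrix A
    ixy = cv2.boxFilter(ix * iy, -1, (7, 7))
    score = ixx * iyy - ixy * ixy - k * (ixx + iyy) ** 2        # Eq. (1): Det(A) - k Tr(A)^2
    local_max = cv2.dilate(score, np.ones((5, 5), np.uint8))    # 5 x 5 non-maxima suppression
    ys, xs = np.nonzero((score == local_max) & (score > tau))   # threshold s
    return np.stack([xs, ys], axis=1)                           # (x, y) feature coordinates
```

In the FPGA, each of these stages is instead realized as a segment of the fine-grain pipeline fed by line buffers, as described next.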

Each one of them has further internal divisions into small pipeline stages, as detailed in Fig. 3. The circuits in Fig. 3a and c are very similar and implement a regular 2D convolution over windows of size 3 × 3 (Gaussian filter) and 7 × 7 (average filter) pixels. Both make use of the local multi-port random access memory (MPRAM) embedded in the FPGA to store the temporary information about the local window. As soon as the pixels enter the streamline, they are convolved with the assigned mask (horizontal orientation) and then stored for the following vertical convolution; separable filters allow the decomposition into two 1D convolutions. In the Gaussian filter, we use a predefined look-up table to store the weight vector, while for the average filter we use a mathematical approximation to avoid the division: the division is converted into a multiplication by 5 and a shift by 8 (Fig. 3c).
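A minimal sketch of the shift-based division mentioned above: dividing the 49-pixel window sum by 49 is replaced by a multiplication by 5 followed by a shift by 8 bits (i.e., division by 51.2), which avoids a hardware divider at the cost of a small underestimate. The concrete numbers below are only an illustration.

```python
# Shift-based approximation of the 7 x 7 average: sum/49 is approximated by
# (sum * 5) >> 8 = sum * 5 / 256, i.e. division by 51.2 instead of 49.
def avg_7x7_exact(window_sum):
    return window_sum // 49

def avg_7x7_shift(window_sum):
    return (window_sum * 5) >> 8

s = 49 * 200                                  # a 7 x 7 window with all pixels equal to 200
print(avg_7x7_exact(s), avg_7x7_shift(s))     # 200 versus 191 (~4.5 % low)
```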



Fig. 3 Details of the fine-grain pipeline and blocks in Fig. 2. a 3 × 3 Gaussian filter, b 3 × 3 gradient computation, c 7 × 7 average filter, d feature detection, and e 5 × 5 non-maxima suppression

The circuits of Fig. 3b and e also use MPRAM and the same strategy as in Fig. 3a to store the local window, but this time no convolution operation is performed. In the first circuit (Fig. 3b), pixels are stored and the gradients are calculated from the 1 × 3 vectors in the x and y directions. The second circuit (Fig. 3e) stores the local 5 × 5 window in a vector and, after calculating the maximum, compares the pixels to suppress the smaller values in the neighborhood. The feature detection module (Fig. 3d) does not have a local MPRAM; it just calculates the feature score with combinatorial logic. Note that in this stage we apply the thresholds k and s of Eq. (1) using a shifter and a comparator. In particular, we adopt k = 1, which means a shift by one bit position, equivalent to a k value of 0.5 in floating point notation. The threshold s can be


adjusted as a user parameter; it has a value of 1,000 for the experiments of Sect. 6. The circuit described above has been synthesized for an xc4vfx60 Virtex-4 device from Xilinx [39] and was described in a high-level description language, Handel-C from Mentor Graphics. A comparison between this high-level description language and a regular implementation of

Table 1 Synthesis results provided by the ISE 11.4 design tools

Circuit | Slices (out of 25,280) | LUTs (out of 50,560) | eDSPs (out of 128) | BlockRAMs (out of 232) | Max. clock frequency (MHz)
Feature detection | 5,766 (22 %) | 5,893 (11 %) | 9 (7 %) | 59 (25 %) | 62.1

The device adopted is an xc4vfx60

the same circuit in VHDL has been previously evaluated in [40], where the authors state that a careful design in Handel-C maintains high performance and loses only in terms of the number of logic gates used. The input stream is a grayscale image and the resolution can be defined by the user as a parameter. This parameter controls the indices for the scanning of the image row during the convolution operations when the embedded memories (FIFOs) are filled. Since the achieved throughput is 1 pixel per clock cycle, the frame rate varies with the image size according to the equation:

$$\mathrm{fps} = \frac{f_{\mathrm{clock}}}{X_{\mathrm{size}} \cdot Y_{\mathrm{size}}}, \qquad (2)$$

where $f_{\mathrm{clock}}$ is the working frequency of the synthesized circuit, and $X_{\mathrm{size}}$ and $Y_{\mathrm{size}}$ are the image dimensions. Table 1 reports the synthesis results for the adopted FPGA. Details are expressed in terms of occupied slices, LUTs, embedded DSPs (eDSPs), BlockRAMs, and the longest path delay (maximum frequency). According to this table, the working frequency for our circuit is 62 MHz. Note that in Table 1 the eDSP represents a specific logic cell within the FPGA, where cascades of multipliers and accumulators optimize the computation, and not the co-processor. The last column of Table 1 reports the maximum clock frequency according to the Xilinx ISE tools. If we use this value for the system clock and consider HDTV images, the estimated frame rate achieved according to Eq. (2) is 28 fps.

4.2 DSP

Detection of feature points in the DSP is performed using fixed point image processing libraries, VLIB and IMGLIB, provided by Texas Instruments. The IMGLIB library contains fixed point implementations of basic image processing functions such as Gaussian smoothing. The VLIB library contains fixed point implementations of the functions for computing image gradients, the Harris score, thresholding of the Harris strength map, and non-maxima suppression to obtain the locations of the feature points. The input image is first convolved with a 3 × 3 Gaussian kernel. Gradients in the x and y directions are computed using the smoothed input image. We apply the Harris corner detection procedure [11] to the first frame to obtain a strength image over 7 × 7 image patches. The function

takes the smoothed image and the two gradient images as input. The size of the feature window (7 × 7) is fixed in the function. The decimal part of the Harris sensitivity parameter is assigned a precision of 15 bits and it is set at 2,048 (to reflect the corresponding floating point value of 0.0625). A thresholding and non-maxima suppression step (over a 5-pixel radius) is applied to the Harris strength map to obtain the feature point locations. The non-maxima suppression function produces a binary image for the locations of the detected feature points. A list of feature point locations is obtained from this binary image, which is passed on to the feature tracking module to be tracked to the next frame. The feature detection module in the DSP using the VLIB library is not as accurate as the one performed on a PC using standard floating point hardware. The main reason for this relative inaccuracy of results in the case of the DSP is the fixed point computation, which leads to lower precision of the output and apparent rigidity in setting the algorithm parameters. Since the DSP architecture is similar for both the SMT339 and DM3730 platforms, the DSP code can be easily ported from one system to the other with a slight modification of the I/O interface. The user-friendly environment with embedded Linux on the DM3730 drives our choice of it for testing and performance measurement.
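The fixed-point encoding of the sensitivity parameter mentioned above can be sketched as a plain Q15 conversion (15 fractional bits); the helper names below are illustrative and are not VLIB functions.

```python
# Q15 view of the Harris sensitivity parameter: with 15 fractional bits the
# floating point value 0.0625 is stored as the integer 2048.
def to_q15(value):
    return int(round(value * (1 << 15)))

def from_q15(q):
    return q / float(1 << 15)

print(to_q15(0.0625))   # 2048
print(from_q15(2048))   # 0.0625
```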



4.3 ARM

The ARM Cortex A8 processor embedded in the DM3730 architecture has a 1 GHz clock frequency, 64 KB of L1 cache, and 256 KB of L2 cache. It benefits from more accurate floating point computation and a NEON multimedia architecture, both of which are used for our purposes. For the implementation of the feature detection step on this architecture, we do not use any pre-built library. We optimize the code with hardware-specific NEON instructions, which allows a speed-up factor of 1.5× when compared with the non-optimized code. The platform adopted for testing is again the DM3730 SoC, as illustrated in Fig. 1 (right). Feature detection in the ARM for this work has been implemented from scratch. Although the processing steps implemented in the ARM are broadly similar to those in the FPGA and the DSP, some modifications were made, primarily to utilize the SIMD NEON intrinsics (for the C language) to optimize the performance for the given application. The ARM NEON architecture offers sixteen 128-bit wide registers that can be used in different combinations of 8-bit, 16-bit, or 32-bit elements (vectors of length 16, 8, or 4, respectively). The NEON intrinsics cover basic register operations such as load/store, addition, subtraction, multiplication, and bit shifting, among others. Some of the intermediate steps in the feature detection module, such as gradient and Harris strength map computation, are more suited to optimization with NEON intrinsics than others, but the floating point operations require more processing time. Hence, to enhance the overall speed of the feature detection operation in the ARM, we maintained the gradient covariance computation in 32-bit integers; conversion to floating point at a later stage of the Harris score computation improved the overall speed. Another modification in the ARM implementation of feature detection is related to obtaining the list of feature point locations after the non-maxima suppression step. In the DSP and FPGA implementations, the locations of the

feature points are obtained by row-wise scanning of the binary image that results from the non-maxima suppression step (which indicates the locations of the feature points). In the case of the ARM implementation, we sort the list of feature points according to their strengths in descending order (strongest feature points first). Since the number of detected features is relatively small (in the range of 500 to 1,000), this sorting operation does not add any significant computational cost to the overall feature detection operation.

5 Feature tracking in DSP

In this work, feature tracking is performed only in the DSP, which has to communicate with its co-processor; this is a very important point to consider, and there are significant differences in the communication strategies adopted by the various system configurations (shown in Fig. 4). This implementation, along with the particular inter-processor communication, is detailed in the following subsections for the SMT339 and the DM3730.

5.1 SMT339 platform: FPGA + DSP

Fig. 4 An overview of the three different frameworks for feature tracking. Note that the two processors reside on the same chip in the case of the DM3730, while out-of-chip communication is needed in the case of the SMT339 (a)



In our system design involving HW/SW co-processing with the SMT339 platform (left side of Fig. 1), we have an external interface because the two processors reside in physically different chips. The use of the SMT339 commercial platform constrained the communication between the two processors to the available interfaces. It should be noted that different communication protocols can be developed depending on the adopted platform. A common way to share data between processors is to use an external shared memory. This specific architecture does not have a shared memory between the FPGA and the DSP. The only available modes of communication are the 3 video ports of 20 bits each and ad-hoc blocking channels with 64 or 32 bits per transmitted word. We design a communication protocol between the processors through 2 video ports for the frame input and output, while the grayscale images are transferred with a standard BT.656 communication stream from the International Telecommunication Union (ITU). In addition to the input frames, the FPGA sends the information regarding the detected feature points to the DSP, for the purpose of tracking them to the next frame, by multiplexing the information in the video port. Since the Cb and Cr components are not used, we utilize this unused bandwidth to multiplex in time the detected feature flag for every pixel. In addition to making optimum use of the video port bandwidth, this strategy also helps in avoiding synchronization problems with the video stream.
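The paper does not spell out the exact word layout used on the video port, so the sketch below is only a hypothetical illustration of the idea: in a BT.656-style Cb-Y-Cr-Y byte sequence carrying grayscale video, the otherwise unused chroma slots can carry a one-bit feature flag per pixel. The byte values chosen here (0x80 plus the flag, to stay away from the reserved 0x00/0xFF sync codes) are assumptions.

```python
# Hypothetical packing of a per-pixel feature flag into the unused chroma slots
# of a BT.656-style Cb Y0 Cr Y1 sequence (grayscale video). Layout is illustrative.
def pack_line(luma, flags):
    """luma: 8-bit pixel values; flags: 0/1 feature flags (same even length)."""
    out = bytearray()
    for i in range(0, len(luma), 2):
        out += bytes([0x80 | flags[i], luma[i],           # Cb slot carries flag of pixel i
                      0x80 | flags[i + 1], luma[i + 1]])  # Cr slot carries flag of pixel i+1
    return bytes(out)

def unpack_line(data):
    luma, flags = [], []
    for i in range(0, len(data), 4):
        flags += [data[i] & 0x01, data[i + 2] & 0x01]
        luma += [data[i + 1], data[i + 3]]
    return luma, flags
```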

Table 2 Profiling results for different functions used in the feature tracking step

Function | Clock cycles
Gradient | 1,148,865
Tracking (10 iterations) | 130,411
Smoothing (31 × 31) | 8,054,310
Total | 9,082,883

The time is expressed in clock cycles for a TMS320DM642 DSP running at 720 MHz, operating on VGA image resolution, and with 500 detected features

During the FPGA processing, the DSP keeps receiving new feature points along with the input images and tracks the features from the previously stored frame with a double-buffer technique. This is done using the fixed point version of the Lucas–Kanade feature tracking algorithm [41] provided in the VLIB library. Since the video ports are standard in a DaVinci DSP processor, this communication design can also be adopted in different commercial platforms, or in new prototypes with these processors. For optimizing the feature tracking code, we profile the DSP processing for 640 × 480 input images with a maximum of 500 features and obtain the timing performance reported in Table 2. We do not use any optimization at the assembly level and just run the code with different compiler options.

5.2 DM3730: ARM + DSP

In the case of the DM3730 architecture shown in Fig. 1, the DSP is a co-processor for the ARM and is embedded in the same chip. The programming framework for the DM3730, the Digital Video Software Development Kit (DVSDK), is provided by Texas Instruments for embedded Linux platforms. In this case, both the DSP and the ARM processor access a shared memory of 64 KB of on-chip RAM and 32 KB of on-chip ROM through a shared bus which is managed by the DSPLink library. The architecture also provides a memory controller which, in our case, can access an external DDR SDRAM of 256 MB and a NAND flash of 512 MB. The communication between the processors is not faster than the one we implemented between the FPGA and the DSP, but it is much more general and flexible to use, since the low-level data transfer is abstracted by the API functions provided in the DVSDK for ease of software development. The OS (ARM Linux 2.36) resides in the ARM processor, which is the master for all the SoC operations. From the software development perspective, the DM3730 requires the user to perform the I/O operations from the ARM, while the DSP can only be accessed internally from the ARM. In our experiments, the ARM performs all the I/O

operations by default. It reads the images either from a file (external memory) or from a camera, and also displays or logs the output. In the ARM + DSP configuration, the ARM detects the features in the acquired images and sends them along with the input images via DSPLink to be read by the DSP. The DSP reads the input information, tracks the features, and writes the tracked output locations to the DSPLink bus. We also implement a DSP only configuration using the DM3730, where the ARM processor only sends the input images to the DSP, which then performs both feature detection and tracking. In all these cases, we adopted the same feature tracking code for the DSP. We obtain a frame rate of 160 fps for feature tracking in VGA images (same as in the SMT339).

6 Results

In order to first validate our approach, we assess the accuracy of the optical flow estimation using a well-known benchmarking dataset [16]. We choose three popular images with known ground truth (Fig. 5): Yosemite, Dimetrodon, and RubberWhale. We report qualitative results with the tracked features in Fig. 5, as well as a quantitative analysis in terms of the average angular error (AAE) in degrees, the end point (EP) error in pixels, and the percentage of motion vectors with an angular error less than 1° in Table 3. The performance of our system is comparable to many other contributions [24, 25]. For the Yosemite sequence (Fig. 5), we obtain a high AAE as compared to previous contributions because of a few wrong estimates on the left side of the scene. Still, the overall percentage of points with an error less than 1° is above 90 (Table 3). After comparing with well-known benchmarks, we now analyze our FPGA + DSP system using some real-world scenarios. In the absence of ground truth flow, we instead compare the feature results of the FPGA + DSP system to PC-based LK feature tracking results. Sparse optical flow results for three different cases are shown in Fig. 6. Three movement patterns are tested: moving camera and moving object, static camera and moving object, and moving camera and static objects. The figure also shows two zoomed-in regions per input image to clearly illustrate the qualitative comparison between the embedded and the PC versions of the LK algorithm. The details are shown for regions where the errors between the two LK versions are typically low and high. The high-error regions are the ones where the magnitude of the motion is small. Since our system uses a fixed point notation, the accuracy (especially in terms of angular error) for small feature displacements is relatively low, but for large displacements



Fig. 5 Feature tracking results with the proposed system (FPGA + DSP) for three images in the Middlebury dataset. The output of our tracking is displayed with red arrows

Table 3 Accuracy of motion estimates for the Middlebury dataset images of Fig. 5

Image | AAE (°) | EP (pixels) | Percentage Err < 1° | Tracked features
Yosemite | 8.69 | 0.41 | 91 | 656
Dimetrodon | 5.84 | 0.32 | 96 | 575
RubberWhale | 12.49 | 0.36 | 90 | 1,333

We report the AAE and EP with respect to the provided ground truth, as well as the number of detected features with our FPGA + DSP system

(>1 pixel), the results of the embedded implementation are comparable with the PC version. The PC version of the Lucas and Kanade algorithm used for the comparison is from the OpenCV library. The OpenCV version of LK itself has limitations and its results are not necessarily the ground truth. However, the OpenCV LK

Fig. 6 Feature tracking results with the proposed system (FPGA + DSP) for three scenarios: static objects with the camera moving toward the right (a), static camera with an object moving to the right (b), and moving camera and moving objects (c). Blue and green colored windows overlaid on the images are the locations for which relatively low and high accuracy of the outputs are obtained, respectively. These windows are zoomed in for clarity. In both cases, the output of our implementation (red arrows) is compared with a PC version of the LK algorithm (dashed green motion vectors). It can be seen that small motion in the image leads to higher inaccuracies in the output


algorithm should be more accurate than our embedded system version because it uses a floating point notation with a hierarchical approach. Therefore, we use it as a handy tool to evaluate the accuracy of our implementation. For the given sequences, the OpenCV LK algorithm is run with 3 pyramid levels, a tracking window of 25 pixels per level, a convergence factor of 0.001, and a maximum of 30 iterations. We report the tracking accuracy measures for the three sequences shown in Fig. 6 in terms of AAE and EP error, as suggested by [16], in Table 4. In the same table, along with our FPGA + DSP system, we also present a comparison with the two other embedded systems that we implemented: ARM + DSP and DSP only. We evaluate these solutions with the same benchmark adopted for the FPGA + DSP system. The use of the fixed-point VLIB library functions for feature detection in the DSP is not optimized for accuracy and leads to less accurate motion estimates as compared to its FPGA and ARM counterparts.
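For reference, a PC-side run of the OpenCV baseline and the two error measures can be sketched as follows. The detector settings (maxCorners, qualityLevel) and the mapping of "3 pyramid levels" onto OpenCV's maxLevel argument are assumptions; the window size, iteration count, and convergence factor follow the values stated above.

```python
import cv2
import numpy as np

def reference_flow(prev_gray, next_gray):
    """Detect Harris features and track them with OpenCV's pyramidal LK."""
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500, qualityLevel=0.01,
                                  minDistance=5, useHarrisDetector=True)
    crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.001)
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None,
                                              winSize=(25, 25), maxLevel=3,
                                              criteria=crit)
    ok = status.ravel() == 1
    return pts.reshape(-1, 2)[ok], (nxt - pts).reshape(-1, 2)[ok]

def aae_deg(uv_est, uv_ref):
    """Average angular error between (u, v, 1) vectors, in degrees (as in [16])."""
    num = uv_est[:, 0] * uv_ref[:, 0] + uv_est[:, 1] * uv_ref[:, 1] + 1.0
    den = np.sqrt(uv_est[:, 0] ** 2 + uv_est[:, 1] ** 2 + 1.0) * \
          np.sqrt(uv_ref[:, 0] ** 2 + uv_ref[:, 1] ** 2 + 1.0)
    return np.degrees(np.arccos(np.clip(num / den, -1.0, 1.0))).mean()

def ep_error(uv_est, uv_ref):
    """Mean end point error in pixels."""
    return np.linalg.norm(uv_est - uv_ref, axis=1).mean()
```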


Table 4 Accuracy of the motion estimates for different embedded system solutions in terms of AAE and EP (computed with respect to the PC-based LK implementation)

Sequence | ARM + DSP (AAE (°) / EP (pixels) / detected features) | DSP only (AAE (°) / EP (pixels) / detected features) | FPGA + DSP (AAE (°) / EP (pixels) / detected features)
Image A | 10.6948 / 0.5548 / 343 | 21.2737 / 1.0450 / 251 | 15.2 / 0.76 / 295
Image B | 6.557 / 0.2346 / 210 | 15.0725 / 0.5404 / 203 | 12.8 / 0.46 / 172
Image C | 6.5738 / 0.189 / 326 | 15.4664 / 0.5336 / 307 | 9.76 / 0.31 / 254

Table 5 Speed performance comparison for feature detection between embedded processors

Image size | ARM (ms) | DSP (ms) | FPGA (ms)
320 × 240 | 30 | 9.5 | 1.2
640 × 480 | 138 | 10.6 | 4.9

If high flexibility, high accuracy of the output, and lower development time represent mandatory requirements for the final system, then the floating point solution consisting of ARM + DSP co-processing can be considered, as it turns out to be the most accurate (Table 4). In general, we obtain high values of AAE due to the large angular errors for small displacements, as shown in Fig. 6. The EP errors, however, are always less than 1 pixel, which is reasonable considering the fixed point nature of our implementation. For a complete overview and comparison of the FPGA + DSP approach with the ARM + DSP and DSP only systems, we also present an analysis in terms of processing times and hardware performance. The ARM + DSP version of the algorithm is implemented on a DM3730 platform, where the detection is performed in the ARM and the tracking in the DSP, obtaining a frame rate of only 7 fps for VGA images. This is caused by the low scalability of the feature detection step in the ARM (Table 5). The fastest solution, as expected, is our HW/SW co-processing performed in FPGA and DSP, where the detection runs on the FPGA and the frame rate is accelerated up to 160 fps for VGA images. As explained in Sect. 5.1, the communication channel (video port) between the FPGA and the DSP has sufficient bandwidth. Hence, the performance bottleneck here is the DSP, because of its reduced level of parallelization. The solution involving full processing on the DSP comes second in terms of processing speed. While there is no co-processing benefit, inter-processor communication is not a big issue since the processing is done in the same chip. Again, we implement this solution on the DM3730 chip and obtain a final frame rate of 61 fps. This is slower than the FPGA + DSP implementation by a factor of 2.7. Table 5 reports the processing speed in milliseconds for two different image resolutions and all the proposed


embedded system solutions for the feature detection step. The reader should also bear in mind the great flexibility and scalability of the DSP processing from QVGA to VGA image sizes. The above analyses show that our FPGA + DSP approach offers one of the best trade-offs between processing performance and accuracy of the output. We now report the results of some further experiments with the FPGA + DSP approach. Since we are using a fixed point implementation in both the FPGA and the DSP, the overall accuracy of the output optical flow vectors is considerably lower than that of floating point implementations (such as the LK implementation in the OpenCV library). We improve the results of our fixed point computation by applying a median filter to the motion vectors obtained from the tracking step. We have developed two different versions of median filtering. The first one, optimized for speed, transforms the tracked features to a fixed grid for smoothing and is able to produce good results only in cases of high feature density (for example, in the image shown in Fig. 6a). The second version of the median filter is optimized for accuracy and is based on a user-defined kernel size. It should be noted that the minimum distance between two features is set at 5 pixels during the feature detection stage, which sets the lower bound on the median filter kernel size. Computational expense is the chief factor in setting the upper bound on the kernel size. Table 6 includes the results for both the fast median filter and a median filter of size 21 × 21 (the same values reported in Table 4). We chose this particular filter size since it is a compromise between speed and accuracy, as shown later. Again we use the OpenCV results as a ground truth to obtain the AAE and EP errors. Table 6 also reports a measure of the processing time in clock cycles for the tracking and smoothing operations in the DSP. The fast median filter reduces the clock cycles by up to a factor of 58, but it clearly compromises the accuracy of the output optical flow.
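The accuracy-oriented variant can be pictured with the following sketch (not the authors' DSP code): each feature's motion vector is replaced by the per-component median of the vectors of all features falling inside a k × k window centered on it. How the DSP actually rasterizes and scans the neighborhood is not detailed in the paper.

```python
import numpy as np

def median_filter_sparse(points, vectors, k=21):
    """points: (N, 2) feature coordinates; vectors: (N, 2) motion vectors.
    Replaces each vector by the median over features within a k x k window."""
    half = k // 2
    out = np.empty(vectors.shape, dtype=float)
    for i, (x, y) in enumerate(points):
        near = (np.abs(points[:, 0] - x) <= half) & (np.abs(points[:, 1] - y) <= half)
        out[i] = np.median(vectors[near], axis=0)   # the window always contains the feature itself
    return out
```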


Table 6 Quantitative comparison between our embedded system (FPGA + DSP) results and the OpenCV version of the Lucas and Kanade algorithm

Input image | Number of features | Fast median filtering (AAE (°) / EP (pixels) / clock cycles) | 21 × 21 median filter kernel (AAE (°) / EP (pixels) / clock cycles)
Image A | 295 | 13.25 / 0.75 / 53,033 | 15.2 / 0.76 / 3,075,285
Image B | 172 | 23.08 / 0.72 / 43,247 | 12.8 / 0.46 / 1,094,625
Image C | 254 | 38.44 / 0.95 / 50,220 | 9.76 / 0.31 / 2,201,903

Average angular error (AAE) and end point (EP) error are reported for the same images presented in Fig. 6

Fig. 7 The effect of varying the size of the post-processing median filter on the system output. The plots show the effect of increasing the kernel size on the average angular error (a), the end point error (b), and the clock cycles (c)

We also conducted further analysis, taking into account the variation of two key parameters of the proposed implementation. As a general rule, improvements in accuracy correspond to worse timing performance. In particular, we study the effect of two different processing parameters, the size of the post-processing median filter and the number of LK iterations, on the accuracy of the motion estimates and the computation time. The main purpose of these analyses is to closely examine the trade-off between the speed of operation and the accuracy of the results. In this case we are not acquiring new images from the camera; for comparison purposes we instead simulate the results using the previously acquired images (Fig. 6). We studied the effect of changing the median filter kernel size on the AAE (Fig. 7a) and the EP (Fig. 7b) of the tracked features and on the number of clock cycles (Fig. 7c). For this analysis, we use the version of the median filter optimized for accuracy because it allows changing the kernel size. For the three studies we report the kernel size on the x axis, while the errors or the clock cycles are on the y axis. A kernel size of 0 means no filtering, while a positive number represents the filter size. Our ground truth is again based on the floating point PC-based OpenCV result. As expected, the tracking errors decrease as the size of the kernel increases, while the number of clock cycles increases, for all three image sequences presented in Fig. 6. Note that for kernel sizes larger than 31 we do not obtain any significant improvement in accuracy, and in general, a big improvement is obtained for kernel sizes larger than 11. We also study the effect of varying the number of iterations in the LK tracking function. For this experiment, keeping the median filter size fixed at 21 × 21 pixels, we


recorded the AAE (reported in Fig. 8a), the EP (Fig. 8b), and the clock cycles (Fig. 8c) for LK iteration numbers ranging from 5 to 15. This range is close to the recommended range of values for the LK tracking function from the VLIB library. Figure 8 shows that the improvements in accuracy achieved by varying the number of LK iterations are not as significant as those obtained from varying the median filter size. The accuracy is not dramatically affected for LK iteration counts greater than 9, while the number of clock cycles increases almost linearly. This suggests keeping the number of iterations low for fixed point LK implementations, because the benefit (accuracy improvement) is much lower than the cost (clock cycles). After giving a detailed view of our approach across different implementations and embedded systems, we now compare the performance of our system (the FPGA + DSP based solution) with some of the previous approaches described in the literature. Our primary areas of comparison are the kind of device used, the maximum processor frequency available, and the throughput in fps and MBPS. Table 7 reports the comparison with different approaches (if all the data are not reported for a certain publication, only the available values are shown). For the sake of clarity we separated the feature detection implementations (highlighted in gray) from the feature detection and tracking

Fig. 8 The plots show the effect of changing the number of LK iterations on the average angular error (a), the end point error (b), and the clock cycles (c)

Table 7 Performance comparison of our FPGA + DSP based feature tracking with some of the previous approaches in the literature

Work | Image resolution | Max. frequency (MHz) | Frame rate (fps) | MBPS | Algorithm | Device
Proposed detection | 1,920 × 1,080 | 62.1 | 29 | 60.1 | Harris^a | Virtex4 FPGA
Birem et al. [42] | 256 × 256 | 22.5 | – | – | Harris^a | Cyclone III FPGA
Forlenza et al. [19] | 640 × 480 | 600 | 22.2 | 6.8 | Harris^a | BlackFin ADSP BF561
Giacon et al. [44] | 512 × 512 | – | 100 | 26.2 | Kanade and Tomasi^a | Spartan3 FPGA
Cabani and MacLean [20] | 640 × 480 | – | 30 | 9.2 | Multi-scale Harris^a | Altera Stratix S80
da Cunha Possa et al. [45] | 1,024 × 1,024 | 100 | 94 | 98.5 | Harris^a | Altera Cyclone III
Proposed tracking | 640 × 480 | 800 | 160 | 49.1 | LK | DaVinci DSP
Schlessman et al. [46] | – | 67.4 | 30 | – | KLT | Virtex II
Roudel et al. [18] | 500 × 500 | 35 | 40 | 10 | LK | Altera Stratix EP1S60
Goldberg and Matthies [3] | 320 × 240 | 720 | 28 | 2.1 | FAST detection + SAD | OMAP3530 (ARM side)
Ali et al. [17] | 360 × 288 | – | 25 | 2.5 | Mean shift | Spartan3 + MicroBlaze
Díaz et al. [26] | 800 × 600 | – | 170 | 81.6 | LK | Virtex 4 FPGA
Mahalingam et al. [25] | 316 × 252 | 55 | 125 | 9.95 | LK | Virtex-II Pro FPGA
Ayuso et al. [32] | 256 × 256 | – | 185 | 13 | Neuromorphic | Tesla C1060 GPU
Barranco et al. [43] | 640 × 480 | 83 | 32 | 9.8 | Pyramidal LK | Virtex 4 FPGA
Pauwels et al. [36] | 640 × 512 | 1,296 | 185 | 61 | Phase-based | GeForce GTX 280 GPU

Note that our approach is divided into detection (first 6 rows) and tracking for comparison purposes. The final complete system works at a frame rate equal to 160 fps for VGA images (same as tracking)
^a Detection only

implementations (at the bottom of Table 7). Our feature detection approach in the FPGA achieves one of the highest processing speeds, especially when compared to similar works such as [20, 42] or [19]. da Cunha Possa et al. [45] report a faster performance than ours, mainly because they use a higher processor frequency. For the feature tracking step, we have a drop-off in frame rate as compared to pure FPGA processing [26], but we still maintain a very high overall system throughput of 160 fps for VGA images using a co-processing strategy. Our throughput, compared to the multi-scale FPGA implementation [43], is higher, but it loses in terms of accuracy. With

the same platform and a mono-scale design, Barranco et al. [43] achieved 270 fps, which is faster than our approach. In the table we also report a few approaches on GPUs just for the sake of comparison. Pauwels et al. [36] present a fast image processing system on a GPU and compare it with its counterpart on an FPGA, presenting advantages and drawbacks of both. For the GPU implementation they obtained a throughput of 185 fps at 640 × 512 resolution for a phase-based approach. Other authors [32] have explored the possibility of implementing a neuromorphic approach for optical flow on a GPU, thereby achieving a throughput of 13 Mpixels/s.



Other HW-SW co-processing approaches have also been presented, for example in [17], where the co-processor is a soft core (MicroBlaze) implemented in a Xilinx FPGA. The MicroBlaze is embedded in the FPGA, and this solves many issues related to inter-processor communication. At the same time, the maximum frequency of the processor is limited (generally the slower of the two); in the case of [17] it is 50 MHz. To the best of our knowledge, co-processing using FPGA and DSP similar to our approach is seldom reported for feature tracking. The principle adopted in our approach relies on the reduction of information from the whole set of image pixels to a small amount of relevant information: the FPGA provides the DSP with only the detected features. Usually, the number of detected feature points is much smaller than the total number of image pixels (e.g., fewer than 295 feature points for the images of Fig. 6, as reported in Table 6), and this number remains relatively low even when the input image dimensions are scaled up. This means that in the DSP we have only a small drop-off in frame rate when we increase the resolution. Some of the key advantages of using a DSP are flexibility of operation, the relatively short development time, and better portability as compared to the FPGA. Our final throughput for the entire feature tracking algorithm in the FPGA + DSP system is determined by the slower of the two processors, since the communication bandwidth of the video port allows us to send up to 80 M words per second (pixel values plus feature flags). Even if the DSP acts as the bottleneck in slowing down the high performance achieved by the FPGA, our FPGA + DSP solution for feature tracking is still faster than other LK approaches such as [18] and [17]. Even though we are not achieving the same performance as a full FPGA implementation of the LK algorithm, such as in [26], we are gaining in development time.
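A quick arithmetic check of the figures above, under the stated assumption of one word per pixel (pixel value plus feature flag):

```python
# VGA at the reported 160 fps versus the stated 80 M words/s video-port budget.
words_needed = 640 * 480 * 160          # ~49.2 M words/s (cf. 49.1 MBPS in Table 7)
port_budget = 80_000_000                # words/s, as stated in the text
print(words_needed / 1e6)               # 49.152
print(words_needed < port_budget)       # True: the video port has headroom; the DSP is the bottleneck
```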

7 Discussion

Our approach can be considered suitable for smart video sensor applications because it not only provides fast results, but also has low power consumption. The feature detection operation in the FPGA consumes 686 mW according to the estimate provided by the XPower tool from Xilinx. The DSP consumes 1,697 mW in our configuration, according to the Texas Instruments documentation [47]. In contrast to the common commercial approaches, where the full processing is done in the DSP, we decided to optimize feature tracking with a HW-SW co-design. In this way we maintain some of the flexibility of DSP processing without spending excessive work effort on the hardware acceleration. The proposed feature detection in the FPGA is capable of real-time performance for full


HDTV resolutions and allows a speed-up factor of almost 3× with respect to a system completely executed in the DSP. In fact, the same Harris feature detection would have been much slower in the DSP, at 25.7 Mpixels/s. A fast parallel co-processor such as the FPGA relieves the DSP from this big workload by executing it with a maximum throughput of 60.1 Mpixels/s. The same strategy can also be applied to other pixel-wise operations such as pre- or post-processing. The experiments performed to analyze the effects of varying different algorithm parameters show a significant improvement in the accuracy of the output motion vectors with a median filtering post-processing step. The highest accuracy of the motion estimates is achieved with a local median filtering window of 31 × 31 pixels. The optimum window size is a trade-off between the speed of operation and the accuracy of the output, and also depends on the overall texture of the scene being processed. An equivalent analysis of the motion estimation accuracy with respect to the number of iterations in the LK tracking algorithm for the given sequences is inconclusive. A further valuable contribution we present in this work is the comparison between implementations of the same feature detection approach in three of the most used architectures in embedded systems: FPGA, DSP, and ARM. This is seldom addressed in the literature. This work can be useful to inform embedded system designers about the various design strategies and choices of platforms for balancing the expected output accuracy and the speed of operation. The FPGA is confirmed to be the fastest processor, by a factor of 25 as compared to the ARM processor. At the same time, we demonstrate the scalability of the DSP as compared to the FPGA. For the Harris feature detection algorithm, the DSP lost only 11 % in processing time going from QVGA to VGA images, while the FPGA decreased its performance by 75 %. The fixed point notation for both DSP and FPGA was evaluated and asserted to be sufficient for both synthetic and real images. Floating point computation is easy to implement in the ARM, but the gain in accuracy may not be worthwhile because of the sacrifice in real-time performance. All these findings can be reproduced with the implementation of the same algorithm in similar processors, and the general principle can be extended to other local image processing tasks. Another important factor for co-processing system design is the interface between the co-processors. The commercial system used in our work (SMT339) lacks an optimized communication protocol and a shared memory between the FPGA and the DSP. We propose an efficient communication protocol using the dedicated video ports connecting the FPGA and the DSP. This strategy is designed for our specific application and might need some adaptation for other algorithms, but the principle of using


A further valuable contribution of this work is the comparison between implementations of the same feature detection approach on three of the most widely used architectures in embedded systems: FPGA, DSP, and ARM. Such comparisons are seldom addressed in the literature, and this one can help embedded system designers weigh the various design strategies and platform choices when balancing the expected output accuracy against the speed of operation. The FPGA is confirmed to be the fastest processor, with a factor of 25× over the ARM processor. At the same time, we demonstrate the better scalability of the DSP compared to the FPGA: for the Harris feature detection algorithm, the DSP lost only 11 % in processing time when going from QVGA to VGA images, while the FPGA performance dropped by 75 %. The fixed-point notation used in both the DSP and the FPGA was evaluated and found to be sufficient for both synthetic and real images. Floating point computation is easy to implement on the ARM, but the gain in accuracy may not be worthwhile because of the sacrifice in real-time performance. These findings can be reproduced by implementing the same algorithm on similar processors, and the general principle extends to other local image processing operations.

Another important factor in co-processing system design is the interface between the co-processors. The commercial board used in our work (SMT339) lacks an optimized communication protocol and a shared memory between the FPGA and the DSP. We therefore propose an efficient communication protocol that uses the dedicated video ports connecting the FPGA and the DSP. This strategy is designed for our specific application and might need some adaptation for other algorithms, but the principle of exploiting the high bandwidth of the video ports can be adopted in other embedded designs. The results of our current approach could be further improved by using other parallel communication channels, such as the universal parallel port (uPP) from TI, or memory sharing protocols such as those in Zynq [2]. The use of the video ports for sending the feature locations takes advantage of the 80-MHz communication channel and can be applied to most of the Texas Instruments DSP families with the DaVinci architecture. However, great care has to be taken in the system (software) design so that large data transfers (such as images) between the multiple processors are minimized. FPGA + DSP systems are commercially available, but in a large number of them the FPGA is used merely as a controller, without contributing to the actual image processing tasks. In this paper we demonstrate that our strategy of exploiting the acceleration capability of the FPGA for video processing in a DSP co-processor environment can be considered for new smart sensor designs.
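As an illustration of the kind of data layout such a channel can carry, the C sketch below serializes detected corner coordinates into a raw video line on one side and recovers them on the other. The packing format (a leading count word followed by interleaved 16-bit x and y values) and all identifiers are assumptions made for this example only; they do not reproduce our exact protocol.

#include <stdint.h>

#define MAX_FEATURES 1024          /* illustrative upper bound on corners per frame */

typedef struct { uint16_t x, y; } Corner;

/* Pack corners into a raw video line: word 0 = feature count, then
 * interleaved x/y pairs. The FPGA would do the equivalent in logic;
 * C is used here only to document the layout. */
int pack_corners(const Corner *c, int n, uint16_t *line, int line_len)
{
    if (n > MAX_FEATURES || 1 + 2 * n > line_len) return -1;
    line[0] = (uint16_t)n;
    for (int i = 0; i < n; ++i) {
        line[1 + 2 * i]     = c[i].x;
        line[1 + 2 * i + 1] = c[i].y;
    }
    return 1 + 2 * n;               /* words actually used */
}

/* Unpack on the DSP after the video-port transfer has filled 'line'. */
int unpack_corners(const uint16_t *line, Corner *c, int max_out)
{
    int n = line[0];
    if (n > max_out) n = max_out;
    for (int i = 0; i < n; ++i) {
        c[i].x = line[1 + 2 * i];
        c[i].y = line[1 + 2 * i + 1];
    }
    return n;
}

Assuming one 16-bit word per pixel clock, a single 640-pixel line can hold roughly 300 coordinate pairs plus the count word, which is typically more than enough for a sparse feature set and avoids transferring full images between the processors.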

8 Conclusion

We have presented a HW-SW co-design for feature tracking in smart video sensors. The proposed approach differs from previous work by combining the high level of parallelism of FPGA devices with the flexibility and shorter development time of the DSP. A Harris feature detection step runs in a Virtex-4 FPGA, while the Lucas–Kanade feature tracking algorithm runs in a TMS320DM642 DSP. The two processors run in parallel and generate sparse motion estimates at 160 fps for VGA resolution. We tested the co-processing platform with real sequences including static and moving camera scenarios. Quantitative results show good accuracy in terms of the average angular error and the end point error compared to a floating point version of the same algorithm on a PC. Detailed experiments show that post-processing the sparse optical flow with a median filter is effective in increasing the accuracy of the results, but it comes with a penalty in terms of increased clock cycles. This step can be optimized for large speed-ups, but doing so degrades the accuracy of the feature tracking. Optimizing both speed and accuracy is a challenge, and an effective solution is to select the parameters that suit the given application. In terms of performance and processing speed, our embedded system represents a significant contribution to the state of the art compared to existing approaches.

If integrated into a compact smart camera system, our solution can be reused for further high-level vision processing applications such as autonomous vehicle navigation. For a comparative study, we have also implemented the same feature tracking algorithm on different multi-processor systems, such as the DM3730 from Texas Instruments, which includes a Cortex-A8 ARM processor and a DSP. This solution may require less development time and is accessible to a wide variety of applications, but its main drawback is the processing speed. Our study of the same feature detection algorithm on FPGA, ARM, and DSP shows a speed-up factor of 8× between FPGA and DSP and 25× between FPGA and ARM for QVGA images.

Acknowledgments This work was supported in part by DOD Grant DM090201.

References

1. Bramberger, M., Doblander, A., Maier, A., Rinner, B., Schwabach, H.: Distributed embedded smart cameras for surveillance applications. Computer 39, 68–75 (2006)
2. Xilinx: Zynq-7000 All Programmable SoC Overview (2013). http://www.xilinx.com/support/documentation/data_sheets/ds190-Zynq-7000-Overview.pdf
3. Goldberg, S.B., Matthies, L.: Stereo and IMU assisted visual odometry on an OMAP3530 for small robots. In: 2011 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 20–25 June 2011, pp. 169–176 (2011)
4. Zhou, J., Zhou, J.: Research on embedded digital image recognition system based on ARM-DSP. In: 2nd IEEE International Conference on Computer Science and Information Technology, ICCSIT 2009, 8–11 Aug 2009, pp. 524–527 (2009)
5. Jun, Y., Peihuang, L., Xing, W.: A dual-core real-time embedded system for vision-based automated guided vehicle. In: IITA International Conference on Control, Automation and Systems Engineering, CASE 2009, 11–12 July 2009, pp. 207–211 (2009)
6. Brox, T.: Large displacement optical flow: descriptor matching in variational motion estimation. IEEE Trans. Pattern Anal. Mach. Intell. 33, 500–513 (2011)
7. Ralli, J., Díaz, J., Ros, E.: Spatial and temporal constraints in variational correspondence methods. Mach. Vis. Appl. 24, 275–287 (2013)
8. Ralli, J., Díaz, J., Guzmán, P., Ros, E.: Experimental study of image representation spaces in variational disparity calculation. EURASIP J. Adv. Signal Process. 2012, 254 (2012)
9. Sundance. http://www.sundance.com/prod_info.php?board=smt339. April 2013
10. Gómez-Pulido, J.A.: Editorial: recent advances in hardware/software co-design. J. Syst. Archit. 56, 303–304 (2010)
11. Harris, C., Stephens, M.: A combined corner and edge detector. In: The Fourth Alvey Vision Conference, Manchester, pp. 147–151 (1988)
12. Sánchez, J., Benet, G., Simó, J.E.: Video sensor architecture for surveillance applications. Sensors 12, 1509–1528 (2012)
13. Chalimbaud, P., Berry, F.: Embedded active vision system based on an FPGA architecture. EURASIP J. Embed. Syst. 2007, 26–38 (2007)


14. Chalimbaud, P., Marmoiton, F., Berry, F.: Towards an embedded visuo-inertial smart sensor. Int. J. Robot. Res. 26, 537–546 (2007)
15. Chen, P., Hong, K., Naikal, N., Sastry, S.S., Tygar, D., Yan, P., Yang, A.Y., Chang, L.-C., Lin, L., Wang, S., Lobatón, E., Oh, S., Ahammad, P.: A low-bandwidth camera sensor platform with applications in smart camera networks. ACM Trans. Sens. Netw. 9, 1–23 (2013)
16. Baker, S., Scharstein, D., Lewis, J.P., Roth, S., Black, M.J., Szeliski, R.: A database and evaluation methodology for optical flow. Int. J. Comput. Vis. 92, 1–31 (2011)
17. Ali, U., Malik, M.B.: Hardware/software co-design of a real-time kernel based tracking system. J. Syst. Archit. 56, 317–326 (2010)
18. Roudel, N., Berry, F., Serot, J., Eck, L.: Hardware implementation of a real time Lucas and Kanade optical flow. In: Conference on Design and Architectures for Signal and Image Processing (2009)
19. Forlenza, L., Carton, P., Accardo, D., Fasano, G., Moccia, A.: Real time corner detection for miniaturized electro-optical sensors onboard small unmanned aerial systems. Sensors 12, 863–877 (2012)
20. Cabani, C., MacLean, W.J.: Implementation of an affine-covariant feature detector in field-programmable gate arrays. In: The 5th International Conference on Computer Vision Systems (2007)
21. Botella, G., Gonzalez, D.: Real-time motion processing estimation methods in embedded systems. In: Babamir, S.M. (ed.) Real-Time Systems, Architecture, Scheduling, and Application. InTech (2012)
22. Porikli, F.: Achieving real-time object detection and tracking under extreme conditions. J. Real-Time Image Process. 1, 33–40 (2006)
23. Wei, Z., Lee, D.-J., Nelson, B.E., Archibald, J.K.: Hardware-friendly vision algorithms for embedded obstacle detection applications. IEEE Trans. Circuits Syst. Video Technol. 20(11), 1577–1589 (2010). doi:10.1109/TCSVT.2010.2087451
24. Botella, G., Garcia, A., Rodriguez-Alvarez, M., Ros, E., Meyer-Baese, U., Molina, M.C.: Robust bioinspired architecture for optical-flow computation. IEEE Trans. Very Large Scale Integr. Syst. 18, 616–629 (2010)
25. Mahalingam, V., Bhattacharya, K., Ranganathan, N., Chakravarthula, H., Murphy, R.R., Pratt, K.S.: A VLSI architecture and algorithm for Lucas–Kanade-based optical flow computation. IEEE Trans. Very Large Scale Integr. Syst. 18, 29–38 (2010)
26. Díaz, J., Ros, E., Agís, R., Bernier, J.L.: Superpipelined high-performance optical-flow computation architecture. Comput. Vis. Image Underst. 112, 262–273 (2008)
27. Maya-Rueda, S., Arias-Estrada, M.: FPGA processor for real-time optical flow computation. In: Cheung, P., Constantinides, G. (eds.) Field Programmable Logic and Application, vol. 2778, pp. 1103–1106. Springer, Berlin (2003)
28. Monson, J., Wirthlin, M., Hutchings, B.L.: Implementing high-performance, low-power FPGA-based optical flow accelerators in C. In: Proceedings of the 2013 IEEE 24th International Conference on Application-Specific Systems, Architectures and Processors (ASAP), 5–7 June 2013, pp. 363–369 (2013)
29. Botella, G., Meyer-Baese, U., García, A., Rodríguez, M.: Quantization analysis and enhancement of a VLSI gradient-based motion estimation architecture. Digital Signal Process. 22(6), 1174–1187 (2012). doi:10.1016/j.dsp.2012.05.013
30. Guzmán, P., Díaz, J., Agís, R., Ros, E.: Optical flow in a smart sensor based on hybrid analog-digital architecture. Sensors 10(4), 2975–2994 (2010)


31. Honegger, D., Greisen, P., Meier, L., Tanskanen, P., Pollefeys, M.: Real-time velocity estimation based on optical flow and disparity matching. In: Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 7–12 October 2012, pp. 5177–5182 (2012)
32. Ayuso, F., Botella, G., García, C., Prieto, M., Tirado, F.: GPU-based acceleration of bio-inspired motion estimation model. Concurr. Comput. Pract. Exp. 25, 1037–1056 (2013)
33. Minami, S., Yamaguchi, T., Harada, H.: Real-time optical flow measurement based on GPU architecture. In: 2012 12th International Conference on Control, Automation and Systems (ICCAS), 17–21 Oct 2012, pp. 305–307 (2012)
34. Chase, J., Nelson, B., Bodily, J., Zhaoyi, W., Dah-Jye, L.: Real-time optical flow calculations on FPGA and GPU architectures: a comparison study. In: 16th International Symposium on Field-Programmable Custom Computing Machines, FCCM '08, 14–15 April 2008, pp. 173–182 (2008)
35. Duren, R., Stevenson, J., Thompson, M.: A comparison of FPGA and DSP development environments and performance for acoustic array processing. In: 50th Midwest Symposium on Circuits and Systems, MWSCAS 2007, 5–8 Aug 2007, pp. 1177–1180 (2007)
36. Pauwels, K., Tomasi, M., Díaz Alonso, J., Ros, E., Van Hulle, M.M.: A comparison of FPGA and GPU for real-time phase-based optical flow, stereo, and local image features. IEEE Trans. Comput. 61, 999–1012 (2012)
37. Barranco, F., Díaz, J., Gibaldi, A., Sabatini, S.P., Ros, E.: Vector disparity sensor with vergence control for active vision systems. Sensors 12, 1771–1799 (2012)
38. Tomasi, M., Vanegas, M., Barranco, F., Díaz, J., Ros, E.: Massive parallel-hardware architecture for multiscale stereo, optical flow and image-structure computation. IEEE Trans. Circuits Syst. Video Technol. 22, 282–294 (2012)
39. Xilinx. http://www.xilinx.com/support/documentation/virtex-4.htm. April 2013
40. Ortigosa, E.M., Cañas, A., Ros, E., Ortigosa, P.M., Mota, S., Díaz, J.: Hardware description of multi-layer perceptrons with different abstraction levels. Microprocess. Microsyst. 30, 435–444 (2006)
41. Lucas, B.D., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: Proceedings of the 7th International Joint Conference on Artificial Intelligence, vol. 2, pp. 674–679. Morgan Kaufmann Publishers Inc., Vancouver (1981)
42. Birem, M., Francois, B.: Real-time Harris and Stephen implementation on smart camera. In: Workshop on Architecture of Smart Cameras, Clermont-Ferrand, France, 5–6 April 2012 (2012)
43. Barranco, F., Tomasi, M., Vanegas, M., Díaz, J., Granados, S., Ros, E.: Hierarchical architecture for motion and depth estimations based on color cues. J. Real-Time Image Process., pp. 1–18 (2012)
44. Giacon, P., Saggin, S., Tommasi, G., Busti, M.: Implementing DSP algorithms using Spartan-3 FPGAs. XCell J. 53, 22–25 (2005)
45. da Cunha Possa, P., Mahmoudi, S.A., Harb, N., Valderrama, C.: A new self-adapting architecture for feature detection. In: 2012 22nd International Conference on Field Programmable Logic and Applications (FPL), 29–31 August 2012, pp. 643–646 (2012)
46. Schlessman, J., Cheng-Yao, C., Wolf, W., Ozer, B., Fujino, K., Itoh, K.: Hardware/software co-design of an FPGA-based embedded tracking system. In: Conference on Computer Vision and Pattern Recognition Workshop, CVPRW '06, 17–22 June 2006, pp. 123–131 (2006)

47. Garcia, I.: TMS320DM64x Power Consumption Summary. SPRA962F, Texas Instruments (2005)

Matteo Tomasi received his M.Sc. degree in Electronic Engineering from the University of Cagliari (Italy) in 2006. The same year he joined the Computer Architecture and Technology Department in the University of Granada (Spain), where he received his M.Sc. in Computer Engineering in 2008 and his Ph.D. degree in 2010. Currently, he is a post-doctoral fellow at the Schepens Eye Research Institute and a research fellow in Ophthalmology at Harvard Medical School. His main research interests include reconfigurable hardware, computer vision, and image processing.

Shrinivas Pundlik obtained his M.S. and Ph.D. degrees in Electrical Engineering from Clemson University, USA, in 2005 and 2009, respectively. He is currently working as a post-doctoral fellow at the Schepens Eye Research Institute, and a research fellow in Ophthalmology at Harvard Medical School, Boston. His research interests include computer vision, image processing, and pattern recognition.

Gang Luo received his Ph.D. degree from Chongqing University, China, in 1997. He is a faculty member at the Schepens Eye Research Institute and an Assistant Professor at Harvard Medical School. He has been working in the fields of image processing and optics. His primary research interest is assistive technology for visually impaired people. He is also interested in vision science. He has been a peer reviewer for multiple journals across engineering and vision science.

