An FPGA-Based Verification Framework for Real-Time Vision Systems

Gooitzen van der Wal, Frederic Brehm, Michael Piacentino, James Marakowitz, Eduardo Gudis, Azhar Sufi, James Montante
{gvanderwal, fbrehm, mpiacentino, jmarakowitz, egudis, asufi, jmontante}@sarnoff.com
Embedded Vision Systems, Sarnoff Corporation, Princeton, NJ 08543

Abstract
Field-Programmable Gate Arrays (FPGAs) have become a mainstay in the digital electronics world, both for their ease of implementation and for their usefulness in incrementally refining hardware designs. When moving to an Application Specific Integrated Circuit (ASIC) or System on a Chip (SoC), verification becomes a very time consuming process, with virtually no room for error. As a result, a variety of methods have been devised to decrease the risk when creating an ASIC or SoC. We describe a hardware and software framework for testing real-time vision algorithms that lowers the uncertainty in FPGA and SoC development while reducing the SoC verification time. The framework combines hardware and software verification, is easily reconfigured to test multiple vision algorithms, and supports iterative hardware/software co-design.

1. Introduction
Over the years, embedded vision systems have evolved to become semi-autonomous intelligent visual agents due to their front-end decision-making capabilities. They are now being used in a variety of fields including, but not limited to, smart surveillance cameras, automotive safety systems, aerial surveillance, military night vision, and visual navigation systems. These systems require high performance computing platforms able to cope with extremely high data rates. Hence, they are often implemented on FPGAs or ASICs. However, implementation of multiple vision algorithms on such platforms has often been done by designing task-specific hardware modules, undermining reusability, flexibility, and programmability. In this paper, we present the verification environment developed to support a reusable, flexible, and programmable parallel pipelined architecture for implementation of vision algorithms. We further emphasize the impetus for FPGA-based verification and visual algorithm optimization to minimize the risks of an undesired SoC design, while reducing the verification time. And finally, we articulate the advantages of our approach by presenting our methods of prototyping using an FPGA-based platform and verification using C-models, specific to vision systems, followed by examples of fusion and stereo applications.

2. Parallel pipelined architecture
A video pipeline architecture consists of a set of devices that process video data streams. A video device, such as an image filter, scaler, DCT device, variable-length bit encoder, or block motion estimator, may be programmed (configured) before the video data is sent through the device. A system of video devices may also be programmed by connecting multiple video devices in a certain order through muxes or more general crosspoint switches. Pipelined devices are often more power efficient than DSPs or CPUs, as they are optimized to perform a specific function. Pipelined devices do not need to be controlled every cycle using a sequence of instructions. Moreover, with a crosspoint switch, memory interfaces, and a fast clock, individual devices can be used several times in different configurations while processing a single frame [4,5].

2.1 A system example
Computing disparity from two cameras (stereo processing) is an example of an algorithm well suited to a pipelined architecture. Many implementations of stereo on FPGAs or ASICs have been documented, including the few referenced here [2,3,6,7]. Figure 1 shows the basic steps of a stereo algorithm, starting with warping to align the left and right cameras and correct for lens distortions. Lowpass and gradient filters (FLT) follow the warping in each path, followed by two sum-of-absolute-difference (SAD) devices that compute the disparity over 32 shifts with a 7x13 integration window. One SAD performs the left-to-right disparity computation (LR) and the other the right-to-left disparity computation (RL). The ALU performs a consistency check between the LR and RL SADs and could compute the depth. This process is then repeated for multi-resolution images 0, 1, and 2 (full, half, quarter resolution). The lower resolution levels fill in the sparse data of the highest resolution and provide an extra consistency check for the disparity data. Section 5 shows the results of this stereo algorithm, where the warper is also used to perform tilted horopter stereo, providing optimum sensitivity to objects on a road [3].

Figure 1: Example Stereo Algorithm (the left and right inputs are warped, filtered, and processed by two SAD devices and an ALU to produce the disparity dx; the process is repeated for full, half, and quarter resolution)
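To make the filter/SAD/ALU steps concrete, the sketch below (a software illustration, not the hardware design) computes a disparity for one pixel with a 32-shift SAD search over a 7x13 window and applies the left-right consistency check; the Image helper and the border handling are assumptions added for the example.

    #include <cstdint>
    #include <cstdlib>
    #include <limits>
    #include <vector>

    // Illustrative software analogue of the SAD + left-right consistency check
    // described above (32 disparity shifts, 7x13 window). Image layout and
    // border handling are assumptions, not the paper's hardware design.
    struct Image {
        int width, height;
        std::vector<uint8_t> pix;                   // row-major grayscale
        uint8_t at(int x, int y) const { return pix[y * width + x]; }
    };

    // SAD over a 7x13 window: match (x,y) in 'ref' against (x+shift, y) in 'tgt'.
    // Callers must keep (x,y) far enough from the borders for the window and shift.
    static long sadCost(const Image& ref, const Image& tgt, int x, int y, int shift) {
        long cost = 0;
        for (int dy = -3; dy <= 3; ++dy)            // 7 rows
            for (int dx = -6; dx <= 6; ++dx)        // 13 columns
                cost += std::abs(int(ref.at(x + dx, y + dy)) -
                                 int(tgt.at(x + dx + shift, y + dy)));
        return cost;
    }

    // Best of 32 disparity shifts; 'sign' is -1 for left-to-right, +1 for right-to-left.
    static int bestDisparity(const Image& ref, const Image& tgt, int x, int y, int sign) {
        long best = std::numeric_limits<long>::max();
        int bestD = 0;
        for (int d = 0; d < 32; ++d) {
            long c = sadCost(ref, tgt, x, y, sign * d);
            if (c < best) { best = c; bestD = d; }
        }
        return bestD;
    }

    // Left-right consistency check (the ALU stage): accept the disparity only if
    // the reverse search maps back to (nearly) the same pixel.
    static int checkedDisparity(const Image& L, const Image& R, int x, int y) {
        int dLR = bestDisparity(L, R, x, y, -1);        // match L(x) to R(x - d)
        int dRL = bestDisparity(R, L, x - dLR, y, +1);  // match R(x - d) back to L
        return (std::abs(dLR - dRL) <= 1) ? dLR : -1;   // -1 marks inconsistent pixels
    }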

An example parallel pipeline system is shown in Figure 2, where a single filter and SAD device are reused 6 times per frame. The filter and SAD are used twice for each of the three image resolution levels: L0, L1, and L2. Within the system the Frame Store Ports (FSP) provide access to video images through a multi-port DRAM interface, sending or receiving video streams through the crosspoint switch. Such a system operating at a 100 MHz pixel rate can support the stereo application with 25 MHz video input. This is more than sufficient for the standard video format of 720 x 480 at 30 Hz. If Figure 2 is augmented with a second warp, filter, and SAD unit, then this system operating at a 100 MHz pixel rate can support the stereo application with 75 MHz video input for all three multi-resolution levels. Note that the video devices can also be used in parallel on different-size images as long as sufficient memory ports are available.

Figure 2: Example Stereo System (DRAM and a multi-port DRAM interface with frame store ports; the WARP, FLT, SAD, and ALU devices and video I/O connected through a crosspoint switch; an ARM control processor)

The Acadia chip [5] is a parallel, pipelined ASIC similar to the one shown above that can perform the stereo processing on 720 x 480 images and two lower resolution levels simultaneously at 30 Hz. The multi-FPGA prototyping system described in Section 4 can perform the same function, and much more, on 1280x1024 images at 30 Hz.

2.2 Video devices and their common structures
A generic video device, as shown in Figure 3, may have video inputs, video outputs, a multi-port DRAM interface, and a control interface. The video inputs and outputs have a common format for video data. Each device has a simple common control interface that can be easily translated to SoC interface busses, such as the AMBA bus [11]. The device also conforms to a common control model, facilitating re-use of device control in software and hardware.

Figure 3: Generic Video Device (video in, video out, control interface, and multi-port DRAM interface)

With a common video and control interface, new video devices such as DCT filters, block motion estimators, etc. can easily be added to this unified structure and connected to the crosspoint and/or the multi-port memory interface. An FPGA implementation may use multiple chips, multiple crosspoint switches, and may duplicate some devices to speed up the operations. An SoC will be optimized for an application, often reducing some of the flexibility. However, video devices can be fully interchangeable between FPGA and SoC implementations.

2.3 Processor
Processing video using the video devices requires relatively little work on the part of the system control processor (CPU). The system processes one frame at a time by setting up crosspoint connections to form a network of video devices, setting parameters in the devices, and starting the video network. The CPU receives an interrupt when the devices have finished processing the frame. The internal system clock is much higher than the external video clock, so video devices can be used many times within one frame time. Even 100 video operations per frame load a PowerPC 603e processor to less than 10%. This leaves plenty of time for the CPU to perform critical higher-level operations (matrix inversions, etc.) that are done more easily by a general-purpose processor than by dedicated hardware.
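As a rough illustration of this frame-at-a-time control model, the sketch below configures a hypothetical crosspoint and device set for one frame and then waits for the end-of-frame interrupt; the connect/setParam/start calls, port numbers, and register values are invented for illustration and are not the actual driver API.

    #include <functional>

    // Illustrative only: frame-at-a-time control as described in Section 2.3.
    // Crosspoint, Device, and the register/interrupt helpers are hypothetical.
    struct Device     { void setParam(int /*reg*/, int /*value*/) { /* write a control register */ } };
    struct Crosspoint { void connect(int /*srcPort*/, int /*dstPort*/) { /* route a video stream */ } };

    void processOneFrame(Crosspoint& xp, Device& warp, Device& flt, Device& sad,
                         std::function<void()> startNetwork,
                         std::function<void()> waitFrameDoneIrq) {
        // 1. Build the video network for this frame by switching the crosspoint.
        xp.connect(/*src FSP: left image*/ 0, /*warp input*/ 4);
        xp.connect(/*warp output*/ 5,         /*filter input*/ 6);
        xp.connect(/*filter output*/ 7,       /*SAD input*/ 8);

        // 2. Program device parameters (e.g., warp coefficients, filter taps).
        warp.setParam(/*reg*/ 0, /*value*/ 0x1234);
        flt.setParam(0, 0x0040);
        sad.setParam(0, 32 /* disparity shifts */);

        // 3. Start the network, then let the CPU do other work until the
        //    devices raise the end-of-frame interrupt.
        startNetwork();
        waitFrameDoneIrq();
    }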

2.4 System memory
Our embedded vision architectures rely on high-bandwidth memories with multi-port access by the video stream interfaces such as the FSP (Frame Store Port). System memory for our vision applications consists of both processor memories and vision core memories. System memory is a key requirement for vision processing: it stores raw input imagery, the intermediate image transforms used by the real-time algorithms, and various output images. Within FPGA hardware, system memory is often distributed across FPGAs, while for SoCs system memory is typically unified and shared between vision cores and processors for a more optimized solution. Our verification framework, described in the next section, allows for both models.

3. Verifying the video device and system
The video pipeline architecture just described allows verification of a large system in a reasonable time and with a reasonable effort. Video and control I/O standards allow us to test a device in relative isolation with high confidence that it will work the same way when it is connected into the final system. We use a hierarchy of test environments to cover the needs of the device designer through final system verification, before sending the design to the integrated circuit fab.
The test environments are, for the most part, software environments that are used to test the response of a device against its specification. The device designers use an RTL (Register Transfer Level) modeling language, such as Verilog, to design a device. A test environment, called a testbench, is a combination of the hardware models and additional code to provide overall control, manage the presentation of stimulus data from files or generators, and record device output data to files. Software called a C-model, written in a standard computer language (typically C or C++), provides an independent model of the behavior of the devices. The C-model is used to predict the response of a device for comparison with the RTL model behavior, to ensure proper design functionality.
There are three software-based testbenches, each fitting its own niche within the verification environment:
− The device designer uses the local testbench for initial testing and for debugging.
− A crosspoint testbench is the environment used for regression testing of individual video devices and combinations of video devices.
− The system testbench is a full software model of the target SoC, which we use to verify the operation of our custom video devices and any Commercial Off-The-Shelf (COTS) components that make up the rest of the system.
In addition to the purely software testbenches, we use a hardware testbench that contains a combination of software, FPGA(s), and computers. This FPGA testbench runs the device under test at high speed, although not necessarily at the speed of the final system. The device is tested with larger images and more conditions, including randomization of data and parameters, than is possible with a software testbench in the same amount of time.
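As an illustration of what such a bit-accurate C-model can look like, the following sketch models a hypothetical 3-tap horizontal filter device; the fixed-point format, rounding, and saturation shown are assumptions about a typical device, not the definition of any device described in this paper.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Illustrative bit-accurate C-model of a hypothetical 3-tap horizontal filter.
    // The fixed-point format (signed 8-bit Q7 coefficients, round-to-nearest,
    // saturate to 8 bits) is an assumption; a real C-model mirrors the RTL spec.
    std::vector<uint8_t> filter3TapModel(const std::vector<uint8_t>& line,
                                         const int8_t coef[3]) {
        std::vector<uint8_t> out(line.size(), 0);
        for (size_t x = 1; x + 1 < line.size(); ++x) {
            int32_t acc = 0;
            for (int k = -1; k <= 1; ++k)
                acc += int32_t(coef[k + 1]) * int32_t(line[x + k]);
            acc = (acc + 64) >> 7;                                       // Q7 rounding
            out[x] = uint8_t(std::min(255, std::max(0, int(acc))));      // saturate to 8 bits
        }
        return out;
    }
    // The same stimulus line is driven through the RTL testbench; the test passes
    // only if the RTL output matches this model bit for bit.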

3.1 The local testbench
A device designer uses the local testbench to test the device with minimum overhead and maximum debugging capability. This testbench has a small amount of supporting code that is used to stimulate the device using one frame of video input at a time. The small amount of extra code reduces the overhead so that the simulation software can run much faster than a full system model. The designer uses this testbench to perform the initial tests on the Device Under Test (DUT) while it is being designed, and to debug problems that are discovered in other testbenches.

Figure 4: Hardware/Software Local Testbench (file readers supply the input vectors in1.vec and in2.vec and the register settings reg.vec to the controller and DUT; the DUT output out.vec is written to a file and compared against the golden response g.vec for a pass/fail result)

Figure 4 illustrates the local testbench structure. An external functional or bit-accurate C-model is used to generate the “golden” response (g.vec in Figure 4) for a given stimulus image. The functional model is used early in the development process. It is usually derived from a software prototype of the target video system. The software prototype typically executes on a PC and is written using floating-point arithmetic and a graphical user interface (GUI). Generating a response image requires manual interaction with the GUI. The video device rarely uses floating-point math, and when it does, it is not fully IEEE compliant. This means that the response from the functional model is not bit-for-bit the same as the hardware response for the local testbench. This is not a large problem, but it requires manual interpretation of the simulation results. Later in the development process, when a bit-accurate model is available, it can be used instead of the functional model to simplify the interpretation of results.
The local testbench meets the needs of the device designer because it provides maximum visibility and control over a device or a fixed pipeline. Before starting the fabrication of an integrated circuit, however, it is necessary to test all devices in all of the configurations in which they will be used. The local testbench requires manual intervention to run every test, so it is not possible to automatically run a suite of tests. The accuracy of the functional model necessitates a human element in determining the outcome of every test, as some elements that cause failure may be subjectively ignored as only minor incongruities.
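The consequence of a floating-point functional model is easy to see in the compare step: an exact compare against the fixed-point hardware output flags harmless rounding differences, so the local testbench effectively needs a tolerance, or a human, in the loop. The routine below is an illustrative sketch; the response formats are assumed.

    #include <cmath>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Illustrative compare of a functional (floating-point) golden response
    // against the fixed-point DUT response. With a functional model, only a
    // tolerance-based compare is meaningful; a bit-accurate model allows tol = 0.
    int compareResponses(const std::vector<double>& golden,
                         const std::vector<uint8_t>& dut, int tol) {
        int mismatches = 0;
        for (size_t i = 0; i < golden.size() && i < dut.size(); ++i) {
            int g = int(std::lround(golden[i]));          // quantize the golden value
            if (std::abs(g - int(dut[i])) > tol) {
                std::printf("pixel %zu: golden=%d dut=%d\n", i, g, int(dut[i]));
                ++mismatches;
            }
        }
        return mismatches;                                // 0 means pass
    }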

3.2 The video crosspoint testbench


The video crosspoint testbench solves these problems by providing a test environment that includes multiple video devices, a crosspoint switch, and a bit-accurate C-model for every device. A test designer can write a single program that generates stimulus images, runs the images through the proper sequence of C-models, and generates the proper sequence of control signals to control the devices. The hardware simulations can be run in a batch environment, and the pass/fail result is simply determined by comparing the responses from the hardware and C-model. The structure of the crosspoint testbench may be seen in Figure 5.


Figure 5: Video Crosspoint Testbench (video input files are loaded into an initial DRAM image; FSPs stream the images through the devices under test via the crosspoint switch, driven by the testbench controller; the final DRAM contents are written to output files)

Figure 6: Test Compiler (a test program is compiled into a C-Model Script, a Crosspoint Controller Program, and an FPGA Controller Program; each flow starts from the same initial DRAM image and produces a final DRAM image, and the images are compared to produce an error report on any mismatch)

We use special memory access devices, called Frame Store Ports (FSPs), to read the stimulus images from the DRAM memory and to write the responses back to another area of memory. The test compiler creates an initial memory image and uses the C-models to create a final memory image. The hardware simulation loads the initial memory image at the beginning of a run and writes the final image at the end of the run. A test passes if the final memory images from the C-model and hardware simulators match bit-for-bit.
This approach has several benefits. While a C-model needs to be bit-accurate, it does not need to know about pipelining in the RTL or the actual timing. In some cases it is necessary to add blanking clocks at the beginning of the output stream in order to model the delay through a device; this delay may be ignored for many devices. This is much simpler than the bus cycle-accurate model normally used for RTL verification. The hardware and software implementations are also very different. Hardware takes in a video stream, while the software takes in an array of input data. This necessitates design differences to achieve equivalent functionality. Further, two separate design specialists generally code the respective hardware and software modules. It is very unlikely that the same mistake appears in such different implementations, giving high confidence that a device passing its test meets the functional requirements.
Another benefit is the ability to test a single device in isolation from, or in combination with, others within one test program. The test program does this by using the crosspoint switch to reconfigure the device input and output connections. A regression test program can assess each device in isolation to ensure that it works correctly, and then connect several devices in an application-specific network to make sure the devices work together.

The test compiler is central to the flexibility of this testbench. It converts a Test Program containing a sequence of device parameter settings and crosspoint connections into several different executable formats. The compiler generates an initial memory image that is used as the starting point by all executable formats. It also facilitates randomization of image data, image sizes, and device parameters. The C-Model Script in Figure 6 is a shell script that runs the individual device C-model programs in the correct sequence to generate a final memory image. The Crosspoint Controller Program is an assembly language program for the Testbench Controller (a simple processor) to load device registers, make crosspoint connections, and run the devices in the hardware model. The final memory image is captured at the end of the run. A simple binary compare of the memory images is sufficient to detect any disagreement between the two simulation methods. Additional utilities are used to determine the testbench configuration that caused the disagreement.
A real benefit is that the C-model of a video device does not have to be cycle-accurate, only bit-accurate. Furthermore, several critical devices or interfaces do not have to be modeled, but are verified implicitly. A good example of this is the multi-port memory interface. By using a proven behavioral model of the external DRAM, and a simple behavioral model of the FSP (where and in what format a video stream is stored in memory or read from memory), the complete multi-port memory interface can be tested by reading and storing random-sized images with random data through multiple ports with random timing.
Regression tests can be fully automated with the test compiler and crosspoint testbench. Failures detected in this environment can be reproduced and debugged in the local testbench environment. The main drawback is the speed of operation: regression testing forces the use of unrealistically small images for testing, or waiting days for the completion of a test when using realistic image sizes.
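The final compare step of this flow can be as simple as the following sketch, which checks the two final DRAM dumps byte for byte; the file-based interface and file names are assumptions for illustration.

    #include <cstdio>
    #include <fstream>
    #include <iterator>
    #include <string>
    #include <vector>

    // Illustrative: compare the final DRAM image produced by the C-model script
    // against the one dumped by the hardware simulation. A byte-for-byte compare
    // is the pass/fail criterion; the first mismatch offset is reported so other
    // utilities can map it back to a device and configuration.
    bool compareDramImages(const std::string& cmodelDump, const std::string& rtlDump) {
        std::ifstream a(cmodelDump, std::ios::binary), b(rtlDump, std::ios::binary);
        std::vector<char> ia((std::istreambuf_iterator<char>(a)), std::istreambuf_iterator<char>());
        std::vector<char> ib((std::istreambuf_iterator<char>(b)), std::istreambuf_iterator<char>());
        if (ia.size() != ib.size()) { std::printf("size mismatch\n"); return false; }
        for (size_t i = 0; i < ia.size(); ++i)
            if (ia[i] != ib[i]) { std::printf("first mismatch at offset %zu\n", i); return false; }
        return true;   // bit-for-bit identical: the test passes
    }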

3.3 The FPGA testbench
The biggest challenge for verification of vision devices is the amount of data required at the input/output and the very large number of configurations required to test corner cases. Traditional software simulation on high-end computers can quickly scale from a few minutes up to a few hours for a single frame of video. This limitation restricts the amount of data and the number of configurations that can be used to verify the correct functionality of a vision device.
The FPGA testbench consists of one or more FPGAs containing a crosspoint and several video devices. A processor connected to a local area network (LAN) controls the testbench. The test compiler supports this testbench by creating a program that runs on the control processor. This program downloads the initial DRAM image from a networked disk, runs the tests on the devices in the FPGA(s), and compares the result with the final DRAM data stored on a networked disk. The FPGA testbench executes tests much faster than the crosspoint testbench. Each test program compiles into a stand-alone executable, making automated regression testing difficult to implement.
The true power of the FPGA testbench is its ability to run more complex programs than the purely software testbenches. Test engineers use the runtime library for the FPGA testbench to write test programs that are more complex than are possible with the crosspoint testbench: the FPGA control processor, mentioned in Section 2.3, is a general-purpose CPU, while the crosspoint Testbench Controller is a simple sequencer with a restricted instruction set, and the Test Compiler is also restricted in its expressiveness. For example, a program can generate random device control parameters (such as coefficients for a filter) and may even include the C-model of the device under test to check the result at runtime. This approach allows us to run thousands of different scenarios through the system in near real time. These scenarios can include different input resolutions as well as configurations. Further, this allows verification of the device and algorithm functionality prior to the SoC implementation.
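A runtime test of this kind might look like the following sketch, where the device call and the C-model are passed in as callbacks; the randomization ranges and the 3-tap filter signature are hypothetical, standing in for the real runtime library and device model.

    #include <cstdint>
    #include <functional>
    #include <random>
    #include <vector>

    // Illustrative FPGA-testbench test: randomize the image size, the image data,
    // and the filter coefficients, run the hardware device, and check the result
    // against the bit-accurate C-model at runtime. The two callbacks stand in for
    // the real runtime-library call and the device's C-model.
    using FilterFn = std::function<std::vector<uint8_t>(
        const std::vector<uint8_t>&, int, int, const int8_t*)>;

    bool randomizedFilterTest(std::mt19937& rng, FilterFn runDevice, FilterFn runCModel) {
        std::uniform_int_distribution<int> width(16, 1280), height(16, 1024),
                                           pix(0, 255), c(-128, 127);
        int w = width(rng), h = height(rng);                 // random image size
        std::vector<uint8_t> img(size_t(w) * h);
        for (auto& p : img) p = uint8_t(pix(rng));           // random image data
        int8_t coef[3] = { int8_t(c(rng)), int8_t(c(rng)), int8_t(c(rng)) };
        return runDevice(img, w, h, coef) == runCModel(img, w, h, coef);
    }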

3.4 Subjective verification of algorithms
A video device does not necessarily implement a complete vision algorithm by itself. The design of a vision system often requires a tradeoff between functions performed in hardware and software. There are some things that the hardware can do easily at very high speed, much faster than a software algorithm running on a processor. Other operations can be very easy for software, but require enormous resources in a high-speed hardware implementation. Those are the easy choices. There are other choices where a designer must find precisely the right set of functions to put into hardware so that the processor can solve the problem within a given time budget.
Debugging and testing the full algorithm, containing both hardware and software pieces, is very important. Running myriad simulations helps, but it is hard to visualize the results. The FPGA environment includes camera inputs and display outputs. A test program can configure the hardware and run the full algorithm using video from the cameras, showing the results on a display concurrently with the testing. This feature enables the algorithm, hardware, and software team members to evaluate the algorithm implementation for accuracy, and tweaks can often be applied in real time to quickly and easily enhance the functionality of the algorithm.

3.5 The system testbench
An SoC can contain far more than just video devices; SoCs contain video devices along with COTS processors, busses, and peripherals. The most complex of the software testbenches, the system testbench, comprises a full model of the final SoC, models of the I/O devices required by the SoC, and infrastructure to control test execution and analysis. Test engineers write a program and load it into a simulated FLASH memory. The processor boots from the memory to execute the test. The SoC outputs are analyzed to determine whether tests passed or failed. The system testbench runs much slower than the others. To reduce the testing problem, we presume that the COTS devices work correctly and that our devices have already been verified. The main purpose of the test programs is to verify that all of the internal connections are correct.
Thus the hierarchy of testbenches (local testbench, crosspoint testbench, FPGA testbench, and system testbench) provides increasing confidence in the correctness of the video devices and of the application implementation in which these devices interact. In addition, this framework avoids tedious cycle-accurate C-model implementations where possible.
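A representative, purely illustrative connectivity check in this environment might walk the address map and exercise one register per device, since the goal is to confirm wiring rather than re-verify device behavior; the base addresses, ID values, and register offsets below are invented.

    #include <cstdint>
    #include <cstdio>

    // Illustrative system-testbench connectivity check, run by the boot program
    // from simulated FLASH. Device base addresses and the ID/scratch register
    // offsets are hypothetical; the point is to verify internal bus wiring, not
    // device behavior (already covered by the other testbenches).
    volatile uint32_t* regAt(uintptr_t addr) { return reinterpret_cast<volatile uint32_t*>(addr); }

    bool checkDeviceWiring(uintptr_t base, uint32_t expectedId) {
        if (*regAt(base + 0x00) != expectedId) return false;    // read the ID register
        *regAt(base + 0x04) = 0xA5A5A5A5u;                       // write a scratch register
        return *regAt(base + 0x04) == 0xA5A5A5A5u;               // and read it back
    }

    int runConnectivityTests() {
        struct { uintptr_t base; uint32_t id; } devices[] = {
            { 0x40000000u, 0x57415250u },   // hypothetical WARP device
            { 0x40001000u, 0x53414420u },   // hypothetical SAD device
        };
        int failures = 0;
        for (auto& d : devices)
            if (!checkDeviceWiring(d.base, d.id)) {
                std::printf("FAIL @%lx\n", (unsigned long)d.base);
                ++failures;
            }
        return failures;
    }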


4. A real-time vision test platform
Many real-time video pipeline systems that use one or more FPGAs have been built and reported over the last decade or more [1,2,9]. With advancing FPGA complexity and clock speed, the FPGA testbench becomes easier to implement and can meet or approach the real-time processing speed desired for the final system implementation.
We developed, demonstrated, and tested a multi-spectral fusion system with five high-resolution cameras and two displays. We selected the 6000k10 “Monster” board from the Dini Group [10]. Figure 7 shows the FPGA development board, with nine Xilinx VP100 FPGAs, and the CameraLink interface board we chose for our implementation. This board also contains numerous banks of DDR memory, ideal for our image processing requirements.

Figure 8: Custom Video Interface (a PMC PowerPC processor board connects to FPGA A over the PCI bus; FPGAs A through I are linked by a control bus and Aurora/RocketIO; bidirectional RocketIO at 3 Gbit/s carries video; the IO board accepts camera inputs up to 1280x1024 at 30 Hz and drives a display at 1280x1024 at 60 Hz)


Figure 7: Dini 6000k10 Monster FPGA Board with digital CameraLink input board

The use of the COTS FPGA development board was well suited to the implementation of our multi-spectral fusion algorithm, but required a couple of key infrastructure additions. To demonstrate real-time fusion we needed to use five video cameras as inputs to the development platform. For this demonstration we chose to jointly develop a custom IO board with the Dini Group to receive video from cameras and distribute video to displays. A processor interface was also required for the application, to support control of the real-time fusion application. Figures 8 and 9 show the interconnects between the PMC-based processor board and the FPGA board.
Emulating an SoC application in an FPGA environment may require several FPGAs. Even if the final implementation on an SoC does not require the general crosspoint interconnect, the crosspoint facilitates the interface between many FPGA devices and provides an efficient test strategy for video devices. The bottom half of Figure 9 shows an example of a set of video devices implemented on an FPGA, where the video data of the neighboring FPGAs are interconnected through the crosspoint. There is also a common control interface to all the FPGAs.

The CameraLink board accepts five SXGA camera inputs via CameraLink connections and provides four SXGA display outputs, also over CameraLink. The video inputs to the system can support up to 1280x1024 SXGA resolution at 30 Hz, and the output can drive up to 1280x1024 at 60 Hz to CameraLink displays and capture boards. We utilize the RocketIO high-speed serial interface to support the large bandwidth required to sustain the data flow necessitated by these resolutions and frame rates.
The FPGA development board has nine FPGAs interconnected as shown in Figure 9. We developed a general and very flexible structure to easily partition, modify, and test our vision algorithm designs. Each FPGA has a full crosspoint that connects to other FPGAs. The video devices connect to the crosspoint and, if required, to a memory interface. The AMBA-AHB [11] is used for processor-device communication. The RocketIO interface allows multiple camera inputs and display outputs to be connected to the system. This architecture enables any device to send video to any FPGA and memory. It also enables multiple cameras and displays to send or receive video from any device. A PowerPC processor controls the vision applications being run on each of the nine FPGAs, and provides an Ethernet interface to external PCs for debugging, user control of the applications, and sharing of processed imagery. For simulating an SoC, the PowerPC can emulate the embedded processors, such as an ARM.
The basic infrastructure for each FPGA consists of a full 31x30 crosspoint switch, four frame-store ports (FSP), a timing control module, a DDR memory interface, and a control interface from the AMBA bus. Each frame-store port can read or write a video stream to external DDR and perform other functions like scaling and timing adjustments. The timing control module is used to synchronize multiple video streams. The total size of this infrastructure is 15% of flip-flops, 25% of LUTs, 10% of BRAMs, and 2% of multiply blocks on a Virtex-II Pro XC2VP100.
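As a rough sanity check on that bandwidth (using the stated resolutions and frame rates, and assuming 16 bits per pixel, which is not specified in the paper), the per-stream rates fit comfortably within a 3 Gbit/s RocketIO link:

    #include <cstdio>

    // Back-of-the-envelope link budget for the SXGA video paths. The pixel depth
    // (16 bits) is an assumption for illustration; the resolutions, frame rates,
    // and the 3 Gbit/s RocketIO rate come from the text and Figure 8.
    int main() {
        const double pixelsPerFrame = 1280.0 * 1024.0;                  // ~1.31 Mpixels
        const double bitsPerPixel   = 16.0;                              // assumed
        double camGbps  = pixelsPerFrame * 30.0 * bitsPerPixel / 1e9;    // ~0.63 Gbit/s per camera
        double dispGbps = pixelsPerFrame * 60.0 * bitsPerPixel / 1e9;    // ~1.26 Gbit/s per display
        std::printf("camera in: %.2f Gbit/s, display out: %.2f Gbit/s "
                    "(vs. 3 Gbit/s per RocketIO link)\n", camGbps, dispGbps);
        return 0;
    }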


Figure 9: Partitioned FPGA connections in the development environment (each FPGA contains a crosspoint switch, frame-store ports, a DDR memory interface, an AHB interface to the external processor, RocketIO links to neighboring FPGAs, and video devices such as a warper, an enhance filter, and a test pattern generator)

5. Real-time vision algorithm example
The FPGA testbench described in Section 4 provides an ideal environment to support the full verification hierarchy described in Section 3. There was significant effort involved in implementing and verifying the infrastructure with the AMBA bus, DDR interfaces, crosspoint switches, and numerous FSPs. With that infrastructure in place, it was very easy to add video devices, verify them individually, and then use them in the final test application. Following are two real-time vision examples that were implemented on this system, executing complex video functions on 1280 x 1024 video at 30 Hz. The FPGA platform is large enough to perform both examples at the same time.

5.1 Fusion example: VNIR, SWIR and LWIR
The first full application we demonstrated on the FPGA system was a real-time, low-latency multi-spectral fusion processor. Multi-spectral fusion is a vision algorithm that combines multiple video modalities into one video stream while preserving the most salient features of each sensor modality [8]. Figure 10 illustrates one channel of the real-time multi-spectral fusion algorithm we implemented. For our full FPGA testbench implementation we incorporated two channels of our real-time fusion, each driving its own output display. Within our FPGA testbench we provide the infrastructure for the control and genlocking of each of the camera sources. Genlocking and control are critical to ensure the imagery is temporally aligned, as well as to provide minimal latency through the processing. Prior to the fusion of each of the channels, each video source is enhanced and warped. Enhancement of the imagery includes non-uniformity correction, noise reduction, and histogramming. Warping of the imagery includes not only warping for pixel alignment prior to fusion, but also warping to remove the inherent lens distortion of each imager's respective lens.

Figure 10: Fusion Video Path (the VNIR, SWIR, and LWIR channels are each enhanced and warped, then fused and sent to the display)

Our implementation of multi-spectral fusion requires numerous FPGAs to implement the required features. The use of the video crosspoint, described in Section 2, is critical to ensure elements of the full fusion design are properly distributed across the FPGAs. Figure 11 shows example frames of the real-time SXGA 30 Hz 3-band multi-spectral fusion. The short-wave, long-wave, and very-near infrared (SWIR, LWIR and VNIR) cameras are source inputs to the FPGA board at 30 Hz.

Figure 11: Clockwise from top left: VNIR, SWIR, fused, and LWIR
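The selection rule at the heart of pattern-selective pyramid fusion [8] can be summarized in a few lines: at each pyramid level, keep the coefficient with the largest magnitude across the modalities. The sketch below operates on one precomputed Laplacian pyramid level and is only a conceptual illustration, not the FPGA implementation, which adds enhancement, warping, genlocking, and chrominance handling.

    #include <cmath>
    #include <vector>

    // Conceptual sketch of the pattern-selective fusion rule [8]: for each
    // Laplacian pyramid coefficient, keep the value with the largest magnitude
    // among the VNIR, SWIR, and LWIR bands. The full system fuses every level
    // and then reconstructs the output image from the fused pyramid.
    std::vector<float> fuseLevel(const std::vector<float>& vnir,
                                 const std::vector<float>& swir,
                                 const std::vector<float>& lwir) {
        std::vector<float> fused(vnir.size());
        for (size_t i = 0; i < vnir.size(); ++i) {
            float best = vnir[i];
            if (std::fabs(swir[i]) > std::fabs(best)) best = swir[i];
            if (std::fabs(lwir[i]) > std::fabs(best)) best = lwir[i];
            fused[i] = best;    // the most salient feature wins at this scale
        }
        return fused;
    }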


We learned several key lessons from our first real-time fusion testbench application. First, developing a versatile infrastructure is the key to ease of future use of the testbench. Second, the ability to demonstrate the algorithm on the FPGA board and adjust its performance in real time to optimize the overall algorithm was crucial. Third, when developing a new SoC without real-time optimization and evaluation prior to the fabrication step, you risk building an SoC with an algorithm that may not meet your expected performance criteria; the FPGA testbench therefore provides a key risk mitigation step.

5.2 Stereo example: with left-right checking
Another real-time vision application we implemented on the FPGA verification platform is a stereo-based range estimation algorithm [3]. Figure 12 illustrates the basic algorithm used to achieve this real-time performance.


Figure 12: Example Stereo Algorithm

This algorithm supports the real-time processing of stereo image pairs, including warping for lens distortion correction and perspective alignment. Since the FPGA infrastructure was already in place from the fusion application, the migration to this new application took a minimal amount of time and effort. Within a month we achieved real-time processing (1280 x 1024 at 30 Hz) of the algorithm for demonstration and algorithm verification. The essential infrastructure components shared between the fusion and the stereo application are:
− AMBA processor interface
− DMA data transfer
− Video crosspoints within each FPGA
− DDR memory interface for video storage
− FPGA I/O definition and DCM timing
With these components already in place from the fusion implementation, it was only a matter of replacing the fusion-specific vision cores with the stereo-specific vision cores to create the new stereo application testbench. Figure 12 illustrates the key vision cores required for the implementation of our stereo processing algorithm. The stereo core shown in Figure 12 has warper, filter, SAD (Sum of Absolute Differences), and ALU components connected via the video crosspoint. The SAD parameters are listed in Section 2.1 and include left-right checking.

Figure 13: Stereo disparity image with tilted horopter

Figure 13 shows the original image on the left and the tilted-horopter stereo image on the right, where the horopter is aligned with the road surface. The stereo image on the right therefore provides a measure of the height of each object in the field of view above the horopter plane. Pixels with greater intensity (whiter) indicate points above the plane; darker pixels indicate depressions in the road.
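One way to see why this works (an illustrative formulation, not taken from [3]): for a rectified camera pair, the disparity induced by a planar road surface is an affine function of image position, d_plane(x, y) = a*x + b*y + c. Warping one image by this plane-induced disparity, which is what tilting the horopter onto the road amounts to, leaves a residual disparity d_res(x, y) = d(x, y) - d_plane(x, y) that is zero for points on the road and grows with height above it; that residual is what the intensity in the right-hand image encodes.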

6. Conclusions
This paper presents an efficient verification framework for real-time video ASICs or SoCs using a combination of C-models and FPGA testbenches. This framework was developed for parallel pipelined architectures characterized by reusability and programmability. Such a framework is applicable to all real-time vision applications that require robust and simultaneous algorithm implementation and verification. In other words, the architecture provides a flexible and rapid development sandbox for tweaking and assessing vision algorithms, verifying real-time system performance, and optimizing algorithms, while providing an efficient method for verifying implementation accuracy. We demonstrated this for fusion, stereo, and other applications.
We learned several key lessons from our real-time algorithm implementations. First, the key to future reuse is developing a versatile infrastructure using common video and control interfaces; after the initial infrastructure development, adding video devices for testing and verification becomes an easy task. Next, the ability to demonstrate the algorithm on the FPGA board and adjust and optimize the algorithm performance in real time is crucial to rapid turn-around. Finally, when developing a new SoC without real-time optimization and evaluation prior to the fabrication step, you risk building an SoC with an algorithm that may not meet expected performance criteria. The FPGA platform provides a key risk mitigation step.

Acknowledgements The authors would like to thank the DARPA MANTIS program for support of this work, Raytheon Advanced Technology Directorate and Dini Group for teaming with us in these implementations, and the reviewers for their constructive comments.


References
[1] D. Buell, J. Arnold, and W. Kleinfelder. Splash 2: FPGAs in a Custom Computing Machine. IEEE CS Press, 1996.
[2] J. Woodfill and B. Von Herzen. Real time stereo vision on the PARTS reconfigurable computer. In 5th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, pages 201-210, 1997.
[3] R. Mandelbaum, L. McDowell, L. Bogoni, B. Reich, and M. Hansen. Real-time stereo processing, obstacle detection, and terrain estimation from vehicle-mounted stereo cameras. In Proceedings of the 4th IEEE Workshop on Applications of Computer Vision (WACV'98), Princeton, New Jersey, October 1998.
[4] M.R. Piacentino, G.S. van der Wal, and M.W. Hansen. Reconfigurable Elements for a Video Pipeline Processor. In IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM99), Napa Valley, California, April 21-23, 1999, pp. 82-91.
[5] G.S. van der Wal, M.W. Hansen, and M.R. Piacentino. The Acadia Vision Processor. In Proc. IEEE Int. Workshop on Computer Architectures for Machine Perception (CAMP), Italy, Sept. 2000, pp. 31-40.
[6] Ahmad Darabiha, Jonathan Rose, and W. James MacLean. Video-rate stereo depth measurement on programmable hardware. In Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, pages 203-210, Madison, WI, June 2003.
[7] Divyang K. Masrani and W. James MacLean. Expanding Disparity Range in an FPGA Stereo System While Keeping Resource Utilization Low. In Proc. Workshop on Embedded Computer Vision, IEEE Conf. on Computer Vision and Pattern Recognition, July 2005.
[8] P.J. Burt. Pattern selective fusion of IR and visible images using pyramid transforms. In National Symposium on Sensor Fusion, 1992.
[9] CHAMP-FX: www.cwcembedded.com/products/0/2/78.html
[10] Dini Group: www.dinigroup.com
[11] AMBA: www.arm.com
