Real-time Hierarchical Visual Tracking Using a ... - Semantic Scholar

5 downloads 0 Views 207KB Size Report
Custom computing machines (CCMs) have recently emerged as an attractive alternative to traditional systems. CCMs retain a general-purpose nature, but can.
Real-time Hierarchical Visual Tracking Using a Configurable Computing Machine Bharadwaj Pudipeddi, A. Lynn Abbott, and Peter M. Athanas The Bradley Department of Electrical and Computer Engineering Virginia Tech Blacksburg, Virginia 24061, USA Abstract This paper describes a custom computing approach to real-time visual tracking. Traditionally, tracking systems require dedicated hardware to accommodate the computational demands and input/output rates imposed by real-time video sources. A radical alternative is represented by custom computing machines such as Splash 2, which use interconnected Field-Programmable Gate Arrays (FPGAs) to provide fine-grain parallelism and reconfigurability so that high-speed performance is possible for many different applications. The efficacy of such architectures to image-based computing is illustrated here through the implementation of a tracking system that consists of two parts: a Gaussian pyramid generator and a correlation-based tracker. The pyramid generator converts each input image to a hierarchy of images, each representing the original image at a different resolution. An object is tracked on successive frames by a coarse-to-fine search through this image hierarchy, using the sum of absolute differences as the matching criterion. Splash 2 performs these operations at rates of 15 or 30 frames per second. Its performance therefore rivals that of application-specific systems, although the architecture is inherently general-purpose in nature.

1. Introduction The purpose of a visual tracking system is to follow a moving object through a sequence of images. Visual tracking is needed in many applications, including surveillance, autonomous vehicle navigation, robotic gaze control, and missile guidance. Unfortunately, most of these involve large amounts of image data that must be processed at high speed. Because typical general-purpose computing platforms are not capable of meeting these demands, system designers have traditionally developed special-purpose hardware that is carefully tailored to meet the needs of a specific task. The problems associated with the use of application-specific hardware

Proceedings of the 1997 Computer Architectures for Machine Perception (CAMP '97) 0-8186-7987-5/97 $10.00 © 1997 IEEE

are well known, however: lengthy design cycles, high cost, and inability to accommodate problems that are not specifically addressed in the design. Custom computing machines (CCMs) have recently emerged as an attractive alternative to traditional systems. CCMs retain a general-purpose nature, but can be configured at a low level to offer performance that rivals application-specific hardware. A single CCM platform can be quickly reconfigured to implement such vastly different operations as median filtering [Abbo95], genetic database search [Hoan93], the two-dimensional Fourier transform [Shir95], or even RISC processing, and yet each of these is performed at much higher speed than is possible with today’s general-purpose machines. CCMs are typically characterized by high-capacity data paths and programmable interconnects between processing elements (PEs). Example CCMs are Splash 2 [Arno92, Atha95, Buel96], PRISM [Atha93], and Spyder [Isel95]. This paper describes the implementation of a realtime visual tracking system using the Splash 2 custom computing machine. The system generates image pyramids for 512 × 512 images at 30 frames per second, and performs a coarse-to-fine search to track the movement of an object in the field of view. The implementation combines piplining, data recirculation, and SIMD operation to demonstrate the utility of the custom computing approach on data-intensive imageanalysis tasks. The remainder of this paper describes the system in more detail. Section 2 describes the tracking algorithm, and Section 3 presents a brief overview of the Splash 2 architecture and its development environment. Section 4 discusses the Splash 2 design for Gaussian pyramid generation, and Section 5 describes the implementation of a coarse-to-fine image matching system. Section 6 describes the performance of the resulting system, and Section 7 presents concluding remarks.

2. Coarse-to-fine tracking The visual tracking problem may be stated as

follows: Given an image I n obtained at time n, and given the row and column location (r, c) for a target of interest in that image, determine the new row and column location for the same target in the subsequent image I n +1 . Typically, this requires a search within I n +1 that begins near location (r, c), and this process repeats for each successive image using the most recent knowledge of target position to select the starting point for the new search. For the implementation described here, a simple area-based matching method is used during search. A subimage S of size w × w (here, w = 16) is extracted from the reference image I n , and is compared with w × w windows within I n +1 . For each window that is chosen, the sum of absolute differences (SAD) of pixel values is computed, and the window minimizing this criterion indicates the new target location within I n +1 . This may be represented as w−1 w −1

d ( x , y ) = ∑ ∑ S (r , c ) − I n +1 (r − x , c − y ) r = 0 c =0

where the offset (x, y) that minimizes d determines the best-match location. In this implementation, x and y are both constrained to lie in a small range which yields a total search window in I n +1 of size 32 × 32. A problem with this simple matching method is that false matches may result because of noise, rapid object movement, or a host of other problems. To combat this, a common approach is to employ a hierarchical search method that begins with an initial search over the entire image at very coarse resolution, and then progresses through increasingly finer levels of resolution (but over smaller extent) for the same image. This coarse-to-fine approach is well-known in the computer vision community (e.g., see [Hall76, Tani78]), but is often prohibitively expensive from a computational perspective. For the implementation presented here, we use the popular Gaussian pyramid decomposition to obtain a hierarchy of images at different resolutions [Burt83]. Briefly, the original image is taken to be the base of the pyramid. The next higher level of the pyramid is an image with half the spatial resolution in both directions, and each pixel value for this level is computed as a weighted average of pixels from a 5 × 5 neighborhood of the previous level. Each higher level of the pyramid is computed in the same way form the previous level, resulting in increasingly coarse levels of resolution. An example pyramid, as computed by Splash 2, is given in Figure 1. The system described in this paper performs both Gaussian pyramid generation and coarse-to-fine search

Proceedings of the 1997 Computer Architectures for Machine Perception (CAMP '97) 0-8186-7987-5/97 $10.00 © 1997 IEEE

through successive image pyramids. The following is a summary of the complete algorithm. More details are provided in [Abbo94, Chen94, Pudi96]. 1. Extract the reference subimage S from the current image I n . 2. Capture the next image I n +1 and compute its 5level Gaussian pyramid. 3. For each level of the pyramid, beginning at the coarsest level, compute the sum of absolute differences over a 32 × 32 search area. At the coarsest level, in this implementation, this implies a search over the entire image. At finer levels, the search is conducted only near the best match that was detected at the previous (coarser) level. 4. Report (r-x, c-y) as the target location in the fullresolution image I n +1 . 5. Increment n, and repeat from Step 1.

3. The Splash 2 architecture Splash 2 [Buel96] is a general-purpose system that can be customized to provide computation rates that rival high-speed, dedicated hardware systems. The customization is facilitated by the use of fieldprogrammable gate arrays (FPGAs) as processing elements (PEs). Before processing can begin, a host workstation downloads configuration files into the FPGAs of Splash, effectively "programming" the system by specifying the actual hardware at run time. A single Splash 2 system contains up to 15 processor boards, each with 17 Xilinx 4010 FPGAs that are connected in a linear array, and through a 16 × 16 crossbar. The 36-bit data paths of the linear array facilitate the use of Splash as a pipelined processor, although they are bidirectional. Each FPGA contains 400 configurable logic blocks (CLBs) with associated interconnections and I/O blocks (IOBs), and is provided with an external 256K × 16 static RAM. In order to use Splash 2, it is first necessary to complete a hardware design that determines the operations of each FPGA. The high-level design must be partitioned appropriately into different FPGAs, and then each FPGA's design is specified individually using VHDL. Design tools by Synopsis are used to simulate the designs and synthesize netlist files, and after format conversion, Xilinx tools are used to partition, place, and route the logic onto the CLBs and IOBs. The result is a set of bitstream files that can be downloaded directly to FPGAs by the host.

4. Gaussian pyramid generation on Splash 2 Figure 2 illustrates a complete pyramid generation

and tracking system that has been implemented on Splash 2. This design utilizes all 17 FPGAs on one processor board, which are designated X0 to X16 and are colloquially referred to as PEs. The main duty of X0 is to accept and format input pixels, and to control crossbar interconnections between the PEs. The upper half of the figure represents the portion of the system that generates Gaussian pyramids, and the lower half performs tracking using those image pyramids. The Splash system clock is programmable, and is set to 10 MHz in this design. For the pyramid generator shown, there are 2 separate (and nearly identical) blocks, and these are pipelined for near-minimum latency. The first block, comprising PEs X1-X4, accepts 512 × 512 images and generates the next level of the pyramid, 256 × 256 images only. The second block, represented by X5-X8, accepts the output of the first block and computes the remaining levels of the pyramid. This latter block is a “recirculating pipeline” that produces these smaller images for one pyramid at the same time that the first block works to produce the 256 × 256 level for the next pyramid in the sequence. For both blocks, filter weights have been chosen to permit decomposition into 2 one-dimensional filters, designated G x (for horizontal filtering) and For simplicity, this G y (vertical) in the figure. discussion will consider only the first block. The first filter operates on a single row of an image and is easily implemented in a single chip, X1, whereas the second filter requires 2 chips to implement a delay line of 4 image rows, using local RAM. X1 also performs resampling in the horizontal dimension. X2 computes 3 of the 5 partial sums and passes the 12-bit partial result to X3. X3 computes the remaining 3 partial sums and passes the rounded 8-bit result to X4. X4 resamples the image data in the vertical dimension to reduce the number of pixels per column by half, thus completing the 256 × 256 level of the pyramid. As stated above, the design of block X5-X8 is almost identical to that of X1-X4. The primary difference is that X5 can accept input from either X4 or X8. Initially, X5 accepts a 256 × 256 image from X4 and immediately begins computing the next level (128 × 128) from that. This result is passed from X8 to X5, and the 64 × 64 level is computed. The final (32 × 32) level is generated in a similar fashion. This design generates 5-level Gaussian pyramids at a rate of 30 per second. It is also possible to use a single 4chip recirculating block to generate 5-level pyramids at a rate of 15 per second.

Proceedings of the 1997 Computer Architectures for Machine Perception (CAMP '97) 0-8186-7987-5/97 $10.00 © 1997 IEEE

5. Hierarchical image matching on Splash 2 The lower half of Figure 2 represents the tracking system, which is subdivided into 4 separate stages. For the system discussed here, only levels 256 × 256 and smaller are processed, performing the tracking operation at a rate of 30 images per second. A second design, which is not discussed here in detail, performs tracking also using the full-resolution 512 × 512 images, but at the slower rate of 15 images per second. The first stage of the tracking system, comprising X9 only, receives each incoming pyramid from X4 and X8. It passes the 256 × 256 image to stage 4, and simultaneously sends the lower-resolution levels to stage 2. Stage 2, consisting of PEs X10-X11, stores the pyramids in its memory and furnishes image data to stage 3 for processing. It sends both reference-window pixels and search-window pixels to stage 3 as needed. Stage 3 (X12-X13) performs the SAD computation and transfers the row/column locations of the winning matches to stages 2 and 4. Stage 4 (X14-X15) computes the final target position within the 256 × 256 image, highlights the target location, and transfers the resulting image to X16 for final formatting and display. Because image matching begins at the coarsest level, tracking computations do not commence until a complete pyramid has been received. To accommodate this, 2 chips each within stages 2 and 4 store and process alternate image pyramids in the sequence. Using the convention that image frames are numbered beginning with 0 at system initialization, the first chip (X10 and X14) stores all odd-numbered pyramids while the second chip (X11 and X15) stores all even-numbered pyramids. As one pyramid is being stored, the other PE in these blocks facilitates image matching on the other pyramid. The actual matching computations performed in stage 3 are compute-intensive, and utilize 2 FPGA chips in an SIMD mode to achieve real-time operation. The first chip (X12) performs computations for 16 × 16 image windows that begin on even columns, and the second chip (X13) simultaneously performs the same computations for windows that begin on odd columns. This is a form of SIMD computing because the same operations occur on different PEs, but with different data.

6. Results This system has been tested at full speed on Splash 2, and at slower speeds to aid in design verification. Sample results are shown in Figure 3, and are identical to results obtained using a software implementation on a Sun workstation. A timing analysis, as performed by the Xilinx design

tools, indicates a worst-case maximum clock frequency for the FPGAs of the tracking stages of 8.7 MHz. This is higher than the minimum 7.86 MHz clock frequency required to transfer 512 × 512 images at a rate of 30 images/second. The current Splash implementation operates at a clock frequency of 10 MHz, however, and the design of 2 of the chips will need to be improved before correct operation can be guaranteed at this frequency. However, no speed-related problems were observed at this higher clock frequency.

[Atha95]

[Buel96]

[Burt83]

7. Conclusion This paper has described the design and implementation of a real-time visual tracking system on the Splash 2 architecture. The use of Splash 2 (or other CCMs) for such image-based tasks is attractive because dedicated hardware is not needed, although the system delivers performance approaching fully custom hardware (such as that described in [Burt88, vand92, Zhan93]). This is possible because the fine-grain configurability of the system permits designs that utilize high-level pipelining coupled with lower-level data recirculation and SIMD-like processing. A drawback of this approach is that the designer faces a fairly steep learning curve, although this may be due largely to the limitations of current design tools. Based on the lessons learned using Splash, some newer CCMs are under development that incorporate such helpful features as a mixture of local and global memory, dual-port access to some of the memory elements, and increased I/O capability on the FPGAs.

References [Abbo94]

[Abbo95]

[Arno92]

[Atha93]

A. L. Abbott, P. M. Athanas, L. Chen, and R. L. Elliott, "Finding Lines and Building Pyramids with Splash 2," Proceedings: IEEE Workshop on FPGAs for Custom Computing Machines, Napa, CA, April 1994, pp. 155-163. A. L. Abbott, P. M. Athanas, and A. Tarmaster, "Accelerating Image Filters using a Custom SPIE Computing Machine," Proceedings: Conference on Field Programmable Gate Arrays (FPGAs) for Fast Board Development and Reconfigurable Computing, vol. 2607, Philadelphia, PA, Oct. 1995, pp. 62-70. J. Arnold, D. Buell, and E. Davis, "Splash 2," Proceedings: 4th Annual ACM Symposium on Parallel Algorithms and Architectures, June 1993, pp. 316-322. P. M. Athanas and H. F. Silverman, "Processor Reconfiguration through Instruction Set

Proceedings of the 1997 Computer Architectures for Machine Perception (CAMP '97) 0-8186-7987-5/97 $10.00 © 1997 IEEE

[Burt88]

[Chen94]

[Hall76]

[Hoan93]

[Isel95]

[Pudi96]

[Shir95]

[Tani78]

[vand92]

[Zhan93]

Metamorphosis," IEEE Computer, vol. 26, no. 2, Feb. 1993, pp. 11-18. P. M. Athanas and A. L. Abbott, "Real-Time Image Processing on a Custom Computing Platform," IEEE Computer, vol. 28, no. 2, Feb. 1995, pp. 16-25. D. A. Buell, J. M. Arnold, and W. J. Kleinfelder, Splash 2: FPGAs in a Custom Computing Machine, IEEE Computer Society Press, 1996. P. J. Burt and E. H. Adelson, "The Laplacian Pyramid as a Compact Image Code," IEEE Transactions on Communications, vol. COM31, pp. 532-540, 1983. P. J. Burt, "Smart Sensing within a Pyramid Vision Machine," Proceedings of the IEEE, vol. 76, no. 8, Aug. 1988, pp. 1006-1015. L. Chen, "Fast Generation of Gaussian and Laplacian Image Pyramids using an FPGAbased Custom Computing Platform," M. S. Thesis, Bradley Dept. of Electrical Engineering, Virginia Tech, 1994. E. L. Hall, J. Rouge, and R. Y. Wong, "Hierarchical Search for Image Matching," Proceedings: IEEE Conference on Decision and Control, Dec. 1976, pp. 791-796. D. Hoang, "Searching Genetic Databases on Splash 2," Proceedings: IEEE Workshop on FPGAs for Custom Computing Machines, 1993, pp. 185-191. C. Iseli and E. Sanchez, "Spyder, a SURE, SUperscalar and REconfigurable, Processor," Journal of Supercomputing, vol. 9, pp. 231252, 1995. B. Pudipeddi, "Implementation of Coarse-toFine Visual Tracking on a Custom Computing Machine," M. S. Thesis, Bradley Dept. of Electrical Engineering, Virginia Tech, 1996. N. Shirazi, P. M. Athanas, and A. L. Abbott, "Implementation of a 2-D Fast Fourier Transform on a FPGA-based Custom 5th Computing Machine," Proceedings: International Workshop on FieldProgrammable Logic and Applications, Oxford, England, Aug. 1995. S. L. Tanimoto, "A Comparison of Some Image IEEE Searching Methods," Proceedings: Conference on Pattern Recognition and Image Processing, June 1978, pp. 280-286. G. S. van der Wal and P. J. Burt, "A VLSI Pyramid Chip for Multiresolution Image Analysis," International Journal of Computer Vision, vol. 8, no. 3, 1992, pp. 177-189. Z.-Y. Zhang and B.-Z. Yuan, "Multiresolution Target Detection and Tracking through a parallel Coarse-to-Fine Search Approach," Proceedings: IEEE TENCON '93, Beijing, P. R. China, 1993, pp. 1198-1202.

Figure 1. Four levels of a Gaussian pyramid. These were computed by Splash 2 from a 512 × 512 image which is not shown. The image sizes here are 256 × 256, 128 × 128, 64 × 64, and finally 32 × 32, which serves as the coarsest level in this implementation. Filter by Gx

Input image X0

Filter by Gy

Store

Filter by Gx

Filter by Gy

Combine

X1

X2

X3

X4

X5

X6

X7

X8

X16

X15

X14

X13

X12

X11

X10

X9

Output image

Format output

Tracking stage 4

Tracking stage 3

Tracking stage 2

Tracking stage 1

Figure 2. Block diagram of a tracking system on Splash 2. Chips X0 - X16 represent the 4010 FPGAs of one processor board, and they are connected as a linear array. The other connections are made though a crossbar switch. The top half of the figure is the Gaussian pyramid generator, and the bottom half performs image matching.

Figure 3. Three frames of a test image sequence, in which a taxi is seen turning a corner at an intersection. The target to be tracked was taken from the center of the first frame (not shown). White rectangles of size 16×16 represent the best matches found in each image, relative to the target's appearance in the previous image.

Proceedings of the 1997 Computer Architectures for Machine Perception (CAMP '97) 0-8186-7987-5/97 $10.00 © 1997 IEEE