Copyright © 2001 Society of Photo-Optical Instrumentation Engineers. This paper was published in Proceedings of the SPIE, Electronic Imaging 2001, vol. 4310, pp. 667-675 and is made available as an electronic reprint with permission of SPIE. One print or electronic copy may be made for personal use only. Systematic or multiple reproduction, distribution to multiple locations via electronic or other means, duplication of any material in this paper for a fee or for commercial purposes, or modification of the content of the paper are prohibited.
A Research Environment for Developing and Testing Object Tracking Algorithms

Todd Schoepflin*, Christopher Lau, Rohit Garg, Donglok Kim, and Yongmin Kim

Image Computing Systems Laboratory, Departments of Bioengineering and Electrical Engineering, Box 352500, University of Washington, Seattle, WA 98195-2500

*Correspondence: Email: [email protected]; WWW: http://icsl.ee.washington.edu/projects/mpeg4/

ABSTRACT

We present an integrated research environment (RAVEN) that we have built for developing and testing object tracking algorithms. As a Windows application, RAVEN provides a user interface for loading and viewing video sequences and for interacting with the segmentation and object tracking algorithms, which are included at run time as plug-ins. The plug-ins interact with RAVEN via a programming interface, enabling algorithm developers to concentrate on their ideas rather than on the user interface. Over the past two years, RAVEN has greatly enhanced the productivity of our researchers, enabling them to create a variety of new algorithms and to extend RAVEN's capabilities via plug-ins. Examples include several object tracking algorithms, a live-wire segmentation algorithm, a methodology for evaluating segmentation quality, and even a mediaprocessor implementation of an object tracker. Once an algorithm has been implemented, RAVEN makes it easy to present the results, since it provides several mask display modes and output options for both images and video. We have found that RAVEN facilitates the entire research process, from prototyping an algorithm to visualizing the results to producing a mediaprocessor implementation.

Keywords: algorithm development environment, dynamic programming, image segmentation, interactive systems, object tracking, video segmentation
1. INTRODUCTION

With the advent of the MPEG-4 standard [1] for object-based video compression and manipulation, research in the area of object tracking algorithms has grown rapidly. Nevertheless, this growth has not been without challenges, particularly in the area of software development. Researchers want to quickly prototype their ideas and visualize the results. Traditional image processing toolboxes such as MATLAB [2] and Khoros [3] offer a rich software development environment. However, they provide fairly weak support for the display of video data, and their built-in interactive capabilities for file management and video segmentation are quite limited. Fortunately, both packages provide toolkits that allow researchers to design a custom GUI to meet their needs. They also offer the ability to create stand-alone applications, complete with automatic source code generation. At first glance, these characteristics would seem very attractive for researchers who are developing object tracking algorithms. However, the necessity of programming a user interface and the need to develop a specialized application often drive researchers toward other implementations. For example, Zhong and Chang [4] developed a video analysis and retrieval system (AMOS) based on segmentation and region tracking algorithms. Their system was implemented in Java, which can be easily integrated into web-based applications. Although their algorithm gives excellent results, they report a processing time of 20 s/frame on a Sun UltraSPARC-2 workstation due to the calculation of complex feature maps and inefficiencies in how the Java Virtual Machine handles array access. Even with the feature maps precalculated for research purposes, the frame-to-frame delay seems to remain excessive. Interpreted languages, like MATLAB's scripting language, suffer some of the same disadvantages, though it is now easy to convert MATLAB code to C/C++ once the code no longer requires heavy modification.
In our experience, we have found that faster implementations eventually become essential in order to fully develop an algorithm in a timely fashion. On the other hand, some researchers explore new tracking algorithms in the context of developing an MPEG-4 encoding system (e.g., the Microsoft Windows-based system of Gu and Lee [5]). We distinguish two different ways to develop a tracking algorithm and then integrate and test it with such a system: 1) develop the algorithm in a development environment (e.g., MATLAB), then either compile it as a library or port the code to a lower-level language and integrate it into the application; or 2) develop the algorithm using an enhanced version of the encoding system GUI, with the algorithm code either linked as a library or integrated into the application. As described earlier, both options require a new user interface to manipulate and visualize the video data. Furthermore, each alternative offers two different compilation options that differ in the degree of modularity they offer and the debugging facilities they provide. Modularity is an essential component of any research environment because it enables researchers to easily test and compare different algorithms, potentially in the same run-time session. It also helps to reduce the impact of changes in the algorithm upon the parent application. When choosing an implementation pathway, however, one must balance the advantages of modularity against the ability to step through the code and debug it or gain insight from it when necessary. Based on these examples and our experience, we believe research in object tracking benefits from the following: 1) a user interface that caters to the visualization of and interaction with video data, 2) a reasonably fast (less than 2 s/frame) implementation of the algorithm, 3) the ability to easily test different algorithms, 4) a way to debug the algorithm at every implementation level, and 5) a straightforward way for an application to use an algorithm, even when it is still undergoing substantial modifications. These considerations strongly influenced the direction of our object tracking research toward an MPEG-4 video editor and encoder. Our choice of a development environment was shaped both by the interactive and encoding aspects of the application and by our need for a reasonably fast implementation of the algorithm that we could easily debug, modify, and re-integrate with the system. Because of these factors and our familiarity with Microsoft Visual C++, we chose to develop an entirely new research environment rather than build on Khoros or MATLAB. Nevertheless, the major design components that we describe, namely the specialized user interface and the programming interface between the application and the algorithm, are independent of the implementation.
2. RESEARCH ARCHITECTURE FOR VIDEO EDITING AND ENCODING

RAVEN is a Microsoft Windows application that provides user input, presentation, and file I/O services to object tracking algorithm implementations. Standard Windows features, such as clipboard and undo/redo support, were implemented, as was support for the native Windows AVI video format. RAVEN was originally designed to aid research in video object extraction for MPEG-4 encoding, following these functional requirements. An MPEG-4 video editor should be able to:

• Define and edit video objects with the least possible manual input.
• Maintain a record of objects in a video sequence, so that an object can be recognized if it disappears and later reappears in the sequence.
• Encode video streams using video objects defined during editing while simultaneously showing information on the decoded image quality and bitrate.
• Be easily extended by the addition of new segmentation and tracking algorithms without recompiling the entire editor.
• Edit the shape or position of a previously defined video object, again with a small amount of user input.
• Automatically segment a given sequence of frames while allowing the user to specify the parameters to be used for segmentation.
• Automatically track defined video objects across frames and refine the initial definitions of objects based on motion-related cues.
• Track objects forward and backward in time.
• Allow the user or an algorithm to correct for tracking errors.
• Aid the user in evaluating the performance of segmentation and object tracking algorithms.
The RAVEN user interface is shown in Figure 1, and the following usage scenario illustrates how many of the user interface requirements are met:
Figure 1. RAVEN user interface: (A) video window, (B) timeline window, (C) plug-in list, (D) zoom window.

1. A user opens a video sequence and views it in the video window (A). He/she can use the VCR-like controls in this window to sequentially view every frame or to skip around the sequence quickly.
2. After deciding on a video object to extract, he/she defines a video object, which can be named for easier reference in the timeline window (B). This step creates a placeholder for the video object where the segmentations for each frame will be stored.
3. The user then chooses an interactive segmentation plug-in from the plug-in list (C). He/she proceeds to place a set of control points on the video object by clicking on the video window. The zoom window (D) magnifies the pixels around the mouse and helps in accurately positioning the control points. After the last control point is placed, the algorithm automatically outlines the object boundaries between the control points.
4. The algorithm cuts off part of the object, so the user clicks and drags the portion of the boundary near the cut-off part to enclose the entire object. The interactive plug-in is automatically invoked again when the user releases the mouse, and it refines the edited boundary to snap to the edges of the video object.
5. In the video window, the user selects a range of frames across which to track the object (e.g., 80 frames starting from frame 19). Then, he/she selects a tracking algorithm from the plug-in menu and executes it. The algorithm successfully tracks the object initially but starts to lose it after 60 frames. The user stops the algorithm at this point and selects another plug-in to take over for the rest of the frames.
6. After trying several algorithms on the remaining frames, the user cannot find a single one that can track the entire object. However, he/she notices that one algorithm is able to track the top part of the object while a separate algorithm can track the bottom part (though losing the top). He/she creates two sub-objects at this point, one for the top of the video object and one for the bottom. After the top and bottom parts of the object have been segmented separately, RAVEN can merge these two sub-objects into a single video object just before MPEG-4 encoding.
7. The segmented video object is now ready to be encoded into an MPEG-4 bitstream. The user chooses a menu command that exports the video object masks and configuration files into a specified directory. The user can modify the configuration files or execute the encoder using the default files produced by RAVEN. (Since the verification model encoder was being developed at the same time as RAVEN, a loose coupling between the user interface and the encoder was chosen.)
The RAVEN environment includes other conveniences for algorithm development. The alternatives for displaying segmentations include displaying the boundary only; shading the interior of a video object with a different color; showing only the object's interior; and displaying only the pixels in the video object, with the rest of the frame rendered as a solid color to emulate how the video object would appear when displayed by an MPEG-4 decoder. These display styles allow algorithm developers to quickly and qualitatively assess the performance of their algorithm. The visual results can also be easily exported, either as single-frame bitmaps transferred to a word processor via the Windows clipboard or as entire sequences saved to an AVI file.

2.1. Plug-ins

RAVEN plug-ins are compiled as Windows dynamic-link libraries (DLLs). At run time, RAVEN scans a pre-defined directory for plug-ins and adds them to a list from which the user can choose. When a plug-in is selected to run, RAVEN loads the plug-in DLL (i.e., the segmentation algorithm implementation) into memory and starts executing at one of a set of predefined entry points. To access segmentations and the video sequence, plug-ins can use a set of callback functions into RAVEN. The entry points allow a plug-in to execute in response to a user requesting non-interactive execution (e.g., initiating object tracking), requesting to adjust the plug-in's algorithm parameters, creating a new boundary, or editing an existing boundary. The boundary creation and editing entry points allow interactive segmentation plug-ins to be developed. These calls are structured such that plug-in developers can override RAVEN's default behavior at several points during the segmentation boundary creation/editing process or allow the default behavior to proceed. For example, any single-frame segmentation algorithm can become somewhat interactive by responding to the final step in boundary editing. The effect is that the algorithm executes automatically when a segmentation is edited instead of waiting for the user to click the execution button on the toolbar. More involved modifications are needed if the plug-in developer wants the algorithm to execute as the user drags the mouse to move control points, since the plug-in needs to respond to mouse events in real time. The plug-in API contains functions for fetching frames of video, retrieving segmentations, and storing segmentations. The segmentation data structures were defined by the original target application of MPEG-4 video editing. Segmentations are stored either as masks (an 8-bit image defining transparency) or as closed curves that can be transformed into masks by a standard computer graphics polygon scan-conversion algorithm [6]. Plug-in authors are free to choose the most convenient segmentation representation to use internally, as long as they can convert it to one of the two representations RAVEN accepts. Plug-ins can keep internal data structures in memory between calls.
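To make the entry-point and callback structure described above concrete, the following C++ header is a minimal sketch of what a RAVEN-style plug-in interface might look like. The paper does not give RAVEN's actual function names or signatures, so every identifier below (HostCallbacks, PluginRun, and so on) is hypothetical and intended only to illustrate the division of labor between the host application and a plug-in DLL.

```cpp
// Hypothetical sketch of a RAVEN-style plug-in interface.  The actual RAVEN
// function names and signatures are not given in the paper; this header is
// illustrative only.
#include <cstdint>

// A segmentation is exchanged either as an 8-bit transparency mask covering
// the whole frame or as a closed curve of vertices that the host converts to
// a mask with a polygon scan-conversion routine.
struct Mask {
    int width, height;
    std::uint8_t* alpha;     // width*height bytes; 0 = outside, 255 = inside
};

struct Curve {
    int numPoints;           // vertices of an implicitly closed polygon
    const int* x;
    const int* y;
};

// Callbacks supplied by the host so a plug-in can fetch video frames and read
// or store segmentations without knowing anything about the GUI.
struct HostCallbacks {
    const std::uint8_t* (*GetFrame)(int frameIndex, int* width, int* height);  // packed RGB
    bool (*GetMask)(int objectId, int frameIndex, Mask* out);
    bool (*StoreMask)(int objectId, int frameIndex, const Mask* mask);
    bool (*StoreCurve)(int objectId, int frameIndex, const Curve* curve);
};

// Entry points a plug-in DLL might export; the host resolves them by name
// (e.g., with GetProcAddress) and calls whichever ones the plug-in provides.
extern "C" {
    // Non-interactive execution, e.g. "track this object from firstFrame to lastFrame".
    int PluginRun(const HostCallbacks* host, int objectId, int firstFrame, int lastFrame);

    // Let the user adjust the algorithm's parameters (typically via a dialog box).
    int PluginConfigure(const HostCallbacks* host);

    // Interactive hooks; a nonzero return value tells the host that the plug-in
    // has overridden the default boundary creation/editing behavior.
    int PluginOnBoundaryCreated(const HostCallbacks* host, int objectId,
                                int frameIndex, const Curve* boundary);
    int PluginOnBoundaryEdited(const HostCallbacks* host, int objectId,
                               int frameIndex, const Curve* boundary);
}
```

Under this kind of interface, a plug-in keeps whatever internal representation it prefers and only converts to a mask or a closed curve when it stores its result through the host callbacks.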
3. PLUG-IN DESIGN EXAMPLES

RAVEN's powerful user interface and flexible programming interface enabled the rapid prototyping and development of multiple distinct object tracking algorithms and a multitude of video processing algorithms (cf. area C in Figure 1). Several months into the initial development of our object tracking algorithms, we found that manually segmenting the object in the first frame to initialize the tracker was time-consuming and a bottleneck, so we began exploring computer-assisted methods of object segmentation. In the initial segmentation plug-in design, we obtained control points from the user via the RAVEN GUI. Next, we connected them using the level-set approach presented by Cohen and Kimmel [7] and returned the segmentation to be displayed by RAVEN. Implementing the plug-in did not take long, since RAVEN assumed all responsibility for interaction with the user.
Although the level-set plug-in greatly improved the segmentation accuracy, we were disappointed by its performance on highly curved objects, especially when the correct path was obvious. For example, Figure 2(a) contains the image of a goose. The correct boundary is clearly visible from the image gradient magnitude in Figure 2(b), so it is puzzling why the level-set approach required so many control points (cf. Figure 2(c)). Taking advantage of RAVEN's plug-in architecture, we created several plug-ins as we explored a variety of dynamic programming methods for tracing the contour between two points. The ability to easily switch between plug-ins (cf. area C in Figure 1) proved very handy during the comparison process. Eventually, we discovered a new method we dubbed "QuickSeg" [8], which pursues strong edges over long distances, as shown by the relatively few control points needed to create the segmentation in Figure 2(d). RAVEN's file I/O features and its ability to adjust algorithmic parameters within the GUI environment made it easy to tune the few QuickSeg parameters by segmenting objects in a variety of images.
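QuickSeg itself is documented only in an invention disclosure [8], so the snippet below is not QuickSeg. It is a minimal sketch of the idea shared by the dynamic programming boundary tracers we compared: treat pixels as graph nodes, assign a low cost where the gradient magnitude is high, and find the minimum-cost path between two consecutive control points, here with a plain Dijkstra search. The function and type names are ours.

```cpp
// Generic minimum-cost path between two control points on a cost image.
// cost[y*w + x] should already be small on edges, e.g. 1.0f / (1.0f + gradMag).
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

struct Point { int x, y; };

std::vector<Point> TracePath(const std::vector<float>& cost, int w, int h,
                             Point start, Point goal)
{
    const float INF = std::numeric_limits<float>::infinity();
    std::vector<float> dist(static_cast<size_t>(w) * h, INF);
    std::vector<int> prev(static_cast<size_t>(w) * h, -1);

    using Node = std::pair<float, int>;               // (accumulated cost, pixel index)
    std::priority_queue<Node, std::vector<Node>, std::greater<Node>> pq;

    const int startIdx = start.y * w + start.x;
    const int goalIdx  = goal.y * w + goal.x;
    dist[startIdx] = 0.0f;
    pq.push({0.0f, startIdx});

    const int dx[8] = {1, -1, 0, 0, 1, 1, -1, -1};
    const int dy[8] = {0, 0, 1, -1, 1, -1, 1, -1};

    while (!pq.empty()) {
        auto [d, idx] = pq.top(); pq.pop();
        if (d > dist[idx]) continue;                  // stale queue entry
        if (idx == goalIdx) break;                    // reached the second control point
        int x = idx % w, y = idx / w;
        for (int k = 0; k < 8; ++k) {
            int nx = x + dx[k], ny = y + dy[k];
            if (nx < 0 || ny < 0 || nx >= w || ny >= h) continue;
            int nidx = ny * w + nx;
            float step = (k < 4) ? 1.0f : 1.41421356f;  // penalize diagonal moves
            float nd = d + step * cost[nidx];
            if (nd < dist[nidx]) { dist[nidx] = nd; prev[nidx] = idx; pq.push({nd, nidx}); }
        }
    }

    // Walk back from the goal; the path is returned in goal-to-start order.
    // If the goal was unreachable, only the goal point is returned.
    std::vector<Point> path;
    for (int idx = goalIdx; idx != -1; idx = prev[idx])
        path.push_back({idx % w, idx / w});
    return path;
}
```

A plug-in built this way would run the search between each consecutive pair of user-placed control points and concatenate the resulting paths into a closed boundary.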
Figure 2. (a) Portion of frame 97 from UW-Geese, (b) gradient magnitude of the image in Figure 2(a), (c) segmentation and control points using Cohen and Kimmel's method, and (d) segmentation and control points using the new QuickSeg method.

After creating the QuickSeg plug-in, we became aware of the live-wire interface [9,10], which inspired us to add new plug-in entry points to the RAVEN interface, enabling plug-ins to react to various mouse events. This additional capability allowed us to rapidly extend the QuickSeg plug-in to segment objects according to the live-wire paradigm. It also moved RAVEN and QuickSeg closer to meeting an important objective of the MPEG-4 video editor, namely, to define and edit video objects with the least possible manual input.

The design and development of our object tracking algorithms proceeded in much the same fashion. In these cases, however, the mask display options in RAVEN proved invaluable for comparing segmentations. We often modified an algorithm or adjusted its parameters to fix a particular type of error, so it was essential to visualize any differences between video segmentations. For example, the VSnakes tracking algorithm [11] performed well in many cases but failed catastrophically in others. The left column of Figure 3 shows several frames of the VSnakes tracking results from the Garden sequence. Based on our understanding of the VSnakes algorithm, we realized that it was introducing extra edge points and attaching itself to strong edges. However, we also observed from the edge points displayed in RAVEN that these could be filtered out by a morphological opening (a minimal sketch of such an opening follows Figure 3). Thus, we modified the VSnakes plug-in by adding several optional postprocessing steps to remove the extraneous edge points. Next, we connected the postprocessing option to the user interface by adding a "morphological postprocessing" checkbox to the VSnakes parameter dialog box. This made it particularly easy to test and compare the efficacy of the postprocessing option. After we obtained the encouraging results in the right-hand column of Figure 3, RAVEN's ability to copy images to the Windows clipboard and its convenient mask display modes made it easy to present the results shown. Thus, we see that RAVEN meets two additional criteria of the ideal MPEG-4 editor by allowing the user to alter the algorithmic parameters and options, and by enabling him or her to visualize and qualitatively evaluate the segmentation results.
Figure 3. Object tracking results for the Garden sequence in frames 1, 15, 20, 31, 32, and 40. Edge points are in black. Left column: standard VSnakes. Right column: VSnakes with morphological postprocessing.
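The paper does not specify the structuring element used in the modified VSnakes plug-in, nor whether the opening was applied to the edge-point map or to the object mask, so the following is only a minimal sketch of a binary morphological opening (erosion followed by dilation) with a 3x3 square structuring element, the kind of filter that removes isolated points and thin protrusions while leaving larger regions essentially intact. All names are illustrative.

```cpp
// Minimal binary morphological opening (erosion followed by dilation) with a
// 3x3 square structuring element.  Pixels are nonzero inside the shape and
// zero outside; image borders are treated as background for simplicity.
#include <cstdint>
#include <vector>

static std::vector<std::uint8_t> Erode3x3(const std::vector<std::uint8_t>& in, int w, int h)
{
    std::vector<std::uint8_t> out(in.size(), 0);
    for (int y = 1; y < h - 1; ++y)
        for (int x = 1; x < w - 1; ++x) {
            std::uint8_t keep = 1;                    // survives only if all 9 neighbors are set
            for (int dy = -1; dy <= 1 && keep; ++dy)
                for (int dx = -1; dx <= 1 && keep; ++dx)
                    if (!in[(y + dy) * w + (x + dx)]) keep = 0;
            out[y * w + x] = keep;
        }
    return out;
}

static std::vector<std::uint8_t> Dilate3x3(const std::vector<std::uint8_t>& in, int w, int h)
{
    std::vector<std::uint8_t> out(in.size(), 0);
    for (int y = 1; y < h - 1; ++y)
        for (int x = 1; x < w - 1; ++x) {
            std::uint8_t hit = 0;                     // set if any of the 9 neighbors is set
            for (int dy = -1; dy <= 1 && !hit; ++dy)
                for (int dx = -1; dx <= 1 && !hit; ++dx)
                    if (in[(y + dy) * w + (x + dx)]) hit = 1;
            out[y * w + x] = hit;
        }
    return out;
}

// Opening = erosion then dilation: small specks vanish under the erosion and
// are therefore never restored by the dilation.
std::vector<std::uint8_t> Open3x3(const std::vector<std::uint8_t>& mask, int w, int h)
{
    return Dilate3x3(Erode3x3(mask, w, h), w, h);
}
```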
During the development of an algorithm, a researcher typically relies on a subjective evaluation to guide the research direction. However, after developing several object tracking algorithms, our interest turned toward the objective evaluation of the segmentation quality of each algorithm. The most straightforward way to perform the evaluation is to compare the output to some ground truth. Since no reference segmentations were available, we embarked on the process of segmenting whole video sequences, assisted by QuickSeg. To accelerate this process, we extended QuickSeg from image segmentation to video segmentation in several ways. First, by clicking a button, the user may segment the current frame using the control points from the previous frame. Second, we employed the live-lane concept [10] by reducing the graph search to a band surrounding the previous segmentation (see the sketch below); this serves to prevent QuickSeg from exploring extraneous edges. Lastly, we added code at the appropriate plug-in entry points to handle the addition, deletion, and movement of control points, which enables QuickSeg to dynamically update the closed contour as the user drags a control point. These changes greatly reduced the user effort needed to segment video sequences; the user typically performs fewer than five editing operations on each frame after the initial segmentation. Using the image of Figure 2(a), we illustrate the need for and the benefits of these modifications. The live-wire segmentation of the goose can be tedious because of its highly curved boundary. However, only the head and neck of the goose have noticeable motion, so the information needed to describe the boundary of the goose may be reduced to just a few control points. To demonstrate the feasibility of this approach, we segmented frame 97 as shown in Figure 2(d). Next, we copied the control points to frame 100 and connected them using QuickSeg. Figure 4 contains the segmentation results from both frames, showing that the same control points can provide a very reasonable segmentation.
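Since the paper describes the band restriction only at a high level, the following is a minimal sketch, under our own assumptions, of how such a live-lane-style restriction could be imposed on the path cost used by the graph search: pixels farther than a chosen radius from the previous frame's boundary are made unreachable. The parameter names, the dense-sampling assumption, and the band width are all hypothetical.

```cpp
// Restrict a path-cost image to a band around the previous frame's boundary:
// every pixel outside the band gets an infinite cost, so the graph search
// (e.g., the TracePath sketch above) cannot leave the band.
#include <cmath>
#include <cstdint>
#include <limits>
#include <vector>

struct Pt { int x, y; };

void RestrictCostToBand(std::vector<float>& cost, int w, int h,
                        const std::vector<Pt>& prevBoundary, float radius)
{
    const float INF = std::numeric_limits<float>::infinity();
    std::vector<std::uint8_t> inBand(cost.size(), 0);

    // Mark a disk of the given radius around every point of the previous
    // boundary.  The boundary is assumed to be densely sampled, roughly one
    // point per pixel along the contour.
    const int r = static_cast<int>(std::ceil(radius));
    for (const Pt& p : prevBoundary)
        for (int dy = -r; dy <= r; ++dy)
            for (int dx = -r; dx <= r; ++dx) {
                int x = p.x + dx, y = p.y + dy;
                if (x < 0 || y < 0 || x >= w || y >= h) continue;
                if (dx * dx + dy * dy <= radius * radius)
                    inBand[y * w + x] = 1;
            }

    // Everything outside the band becomes unreachable for the search.
    for (size_t i = 0; i < cost.size(); ++i)
        if (!inBand[i]) cost[i] = INF;
}
```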
Figure 4. Segmentation of the goose in frame 100 (dark gray) of the UW-Geese sequence using the seed points for the segmentation from frame 97 (light gray).

The availability of a ground truth reference paved the way for the design of a plug-in to evaluate the segmentation quality of the masks produced by an object tracker. The new plug-in compares the masks and object contours from the object tracker's output with the reference segmentation and writes the values of several metrics to a text file (a sketch of such a comparison is given at the end of this section). This data can then be easily manipulated and visualized with a spreadsheet. The addition of this plug-in considerably enhances the user's ability to evaluate algorithmic performance, helping RAVEN meet yet another criterion for the ideal MPEG-4 editor.

The final plug-in design example provides a good illustration of the flexibility and power of the RAVEN plug-in interface. We first prototyped and validated the VSnakes object tracking algorithm [11] as a RAVEN plug-in compiled for Intel x86 processors. Later, we ported it to the MAP processor [12], one of the most recent mediaprocessors. In the new implementation, only the VSnakes algorithm executed on the MAP, which resides on a PC expansion card, while the user interface and initial frame segmentation tools in RAVEN continued to run on the host machine's Pentium. We were able to achieve a 10:1 speed improvement over the Pentium version of VSnakes without losing the utility of the user interface and the supporting plug-ins or having to port the entire RAVEN environment to the MAP.
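Returning to the evaluation plug-in described above: the paper does not list which metrics it computed, so the sketch below uses two common stand-ins, the fraction of misclassified pixels relative to the reference object area and the Jaccard overlap between the result and reference masks, and appends them, one line per frame, to a tab-separated text file that a spreadsheet can import. The function and structure names are ours.

```cpp
// Illustrative mask-comparison metrics; the specific metrics RAVEN's
// evaluation plug-in wrote out are not given in the paper.
#include <cstdint>
#include <cstdio>
#include <vector>

struct MaskMetrics {
    double errorFraction;   // misclassified pixels / reference object area
    double jaccard;         // intersection / union of result and reference
};

// Both masks must have the same dimensions; nonzero means "inside the object".
MaskMetrics CompareMasks(const std::vector<std::uint8_t>& result,
                         const std::vector<std::uint8_t>& reference)
{
    long long inter = 0, uni = 0, diff = 0, refArea = 0;
    for (size_t i = 0; i < reference.size(); ++i) {
        bool a = result[i] != 0, b = reference[i] != 0;
        inter += (a && b);
        uni   += (a || b);
        diff  += (a != b);
        refArea += b;
    }
    MaskMetrics m;
    m.errorFraction = refArea ? static_cast<double>(diff) / refArea : 0.0;
    m.jaccard = uni ? static_cast<double>(inter) / uni : 1.0;
    return m;
}

// Append one tab-separated line per frame, ready for import into a spreadsheet.
void AppendMetrics(const char* path, int frame, const MaskMetrics& m)
{
    if (FILE* f = std::fopen(path, "a")) {
        std::fprintf(f, "%d\t%.4f\t%.4f\n", frame, m.errorFraction, m.jaccard);
        std::fclose(f);
    }
}
```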
4. RESULTS AND DISCUSSION

RAVEN has enabled several researchers in our laboratory to develop four unique object tracking algorithms. These, together with the segmentation evaluation and QuickSeg plug-ins, constitute a fairly advanced set of tools for ongoing object tracking research. A review of the aforementioned characteristics of an ideal MPEG-4 video editor reveals that we met most of them, either through the RAVEN environment or via plug-ins. As our object tracker research took shape, we applied most of the major plug-ins (two trackers, QuickSeg, and the segmentation evaluator) in an effort to characterize the relative capabilities of the two object trackers. Eventually, we used QuickSeg to segment over 2,500 frames in more than 20 video sequences to establish a ground truth. After segmenting each sequence, we applied the two tracking algorithms, using several sets of parameters or options for each algorithm. Finally, the evaluator plug-in processed the segmentations, and we used Visual Basic macros in Excel to extract and graph the data.

Over the past two years RAVEN has proven to be a very fruitful research environment. In addition to the three patents filed on RAVEN itself, there are 14 patents either applied for or approved, along with 8 invention disclosures to date, based on work performed in the context of RAVEN. As shown by the examples above, RAVEN reduces the amount of code that needs to be written to implement and test segmentation and object tracking algorithms. RAVEN essentially frees researchers from dealing with issues regarding user interaction, enabling them to concentrate on developing the algorithm itself. The file-handling user interface proved particularly useful, enabling us to move rapidly between data sets. In addition, the specifications for the user interface and plug-in architecture enabled the development of the QuickSeg plug-in, which allows the user to easily define initial segmentations of video objects for refinement or tracking. Furthermore, RAVEN's ability to operate on several different file formats (AVI, MPEG-1/2, and various raw image formats) allows developers to test their algorithms on many different sequences and on different video objects within those sequences. After running experiments, the interactive segmentation display enables users to quickly assess algorithm performance qualitatively. Finally, the storage of segmentations in a common format allows users to compare the performance of different algorithms and/or different parameters either qualitatively or quantitatively. Each of these capabilities enabled us to conduct the large-scale evaluation experiment described above; the video sequence retrieval and segmentation storage system was particularly important during this experiment. The two implementations of the VSnakes tracking algorithm and the screen captures in Figure 3 provide an excellent example of how RAVEN serves researchers from the conception of an idea through its development, publication, and mediaprocessor implementation. Put another way, RAVEN enables researchers to attack research problems both from the perspective of the user (where algorithmic details are hidden) and from that of the developer (where the focus is on the algorithm, not the user interface). This dualism has enabled RAVEN to function both as a testbed for algorithm development and as a user interface for demonstrations. From the example of our segmentation quality research above, one can begin to see how the burgeoning number of plug-ins has a positive feedback effect. New plug-ins not only extend the capabilities of RAVEN, but they also open new horizons of research that might not have been possible in the past; QuickSeg is an excellent example of this phenomenon. Clearly, the extensibility and flexibility of the RAVEN architecture and user interface have proven extremely beneficial.
5. CONCLUSIONS

As an environment for object segmentation and tracking research, RAVEN has greatly facilitated the research and development of many algorithms. On the front end, the environment allows users to load and view video sequences, segment objects frame by frame, apply object tracking algorithms to a sequence, and save the segmentation results to disk. From the development perspective, RAVEN provides a plug-in architecture with a multilayered programming interface that enables researchers to rapidly prototype their ideas without worrying about the user interface. Other conveniences that RAVEN provides include many alternatives for displaying segmentation masks/boundaries and the ability to read a variety of video data formats. Taken together, the features and architecture of RAVEN enabled us to easily exercise many algorithms in our research on segmentation quality. As with any software tool, RAVEN could be extended in several ways, depending on the application. Researchers would find it useful if RAVEN could provide functions for the visualization and manipulation of images and other data, similar to what MATLAB offers. They might also appreciate a macro or scripting language to automate tasks such as the application of a tracking algorithm to a set of video sequences. Although RAVEN was not designed to be an MPEG-4 authoring tool, it could serve as the basis for such an application if the integration with the encoder were more complete and if the environment could intelligently guide the user in selecting an appropriate tracking algorithm for a given video sequence.
As RAVEN has matured over the past two years, it has enabled us to greatly increase our research productivity from conception to demonstration and publication. In fact, it now forms the basis for all of our video segmentation research and its plug-in architecture shapes how we develop new algorithms.
REFERENCES

1. ISO/IEC 14496-2, "MPEG-4 visual fixed draft international standard," ISO/IEC, Oct. 1998.
2. MATLAB software. http://www.mathworks.com/
3. Khoros software. http://www.khoral.com/
4. D. Zhong and S. Chang, "An integrated approach for content-based video object segmentation and retrieval," IEEE Trans. Cir. and Sys. for Vid. Tech., 9, pp. 1259-1268, 1999.
5. C. Gu and M. Lee, "Semiautomatic segmentation and tracking of semantic video objects," IEEE Trans. Cir. and Sys. for Vid. Tech., 8, pp. 572-584, 1998.
6. J. D. Foley, A. van Dam, S. K. Feiner, and J. F. Hughes, Chap. 3 in Computer Graphics: Principles and Practice, Addison-Wesley, Reading, MA, 1998.
7. L. Cohen and R. Kimmel, "Global minimum for active contour models: a minimal path approach," Int. J. Comp. Vis., 24, pp. 57-78, 1997.
8. T. Schoepflin and Y. Kim, "Interactive frame segmentation with dynamic programming," UWICSL Invention Disclosure, 2000.
9. E. N. Mortensen and W. A. Barrett, "Interactive segmentation with Intelligent Scissors," Graph. Mod. and Image Proc., 60, pp. 349-384, 1998.
10. A. X. Falcao, J. K. Udupa, S. Samarasekera, and S. Sharma, "User-steered image segmentation paradigms: Live Wire and Live Lane," Graph. Mod. and Image Proc., 60, pp. 223-260, 1998.
11. S. Sun, D. R. Haynor, and Y. Kim, "Semiautomatic video object segmentation using VSnakes," IEEE Trans. Cir. and Sys. for Vid. Tech., submitted, 2000.
12. C. Basoglu, W. Lee, and J. S. O'Donnell, "The MAP1000A VLIW mediaprocessor," IEEE Micro, 20, pp. 48-59, 2000.