INTERACTIVE TRACKER – A SEMI-AUTOMATIC VIDEO OBJECT TRACKING AND SEGMENTATION SYSTEM

Hua Zhong, Liu Wenyin, Shipeng Li
Microsoft Research China
No. 49 Zhichun Road, Beijing 100080, China
(i-zhua, wyliu, spli)@microsoft.com
Abstract
A novel strategy for video object segmentation in normal video frames, an indispensable preprocessing step for object-based video encoding in the MPEG-4 standard, is presented. It is a semi-automatic strategy that combines the efficiency (speed) of automatic segmentation with the accuracy of manual segmentation. Initially, the object contour in the first frame is specified interactively by a user with a computer-aided tool based on our improved Intelligent Scissors. The subsequent frames then undergo an automatic region-based object tracking process. A user interface is designed to display the tracking result so that the user can observe it during the tracking process. The user can pause the process when the result is considered inaccurate or unacceptable, and interactively correct the object contour in the first frame containing the unacceptable result with the proposed computer-aided tool. The automatic region-based tracking process is then resumed with the corrected object contour as the new initial object contour. This tracking-correcting cycle can be repeated until the end of the video clip is reached.
1. Introduction
After many years of academic research and industrial practice, a series of MPEG standards has been proposed and adopted for video data representation and manipulation. Among them, MPEG-4 provides object-based video coding functionality, which enables better representation of, manipulation of, and interaction with video data. However, how to obtain the video objects from normal video frames is still a long-standing open problem in the machine vision field. Obviously, a fully automatic way of obtaining the segmentation result would be preferable. However, since segmentation is a basic but not well-solved problem, especially for semantic objects, fully automatic segmentation cannot yield universally accurate results due to the variety of complex, non-rigid object motions. Hence, there is no automatic, working segmentation system that can be used as a general preprocessing system for MPEG-4 object-based coding. In [1], Gu and Lee attempted to solve the problem using manual segmentation of the initial frame and automatic tracking for the subsequent frames. Although manual segmentation can obtain accurate results, it is too labor intensive to be a practical solution.

In this paper, we combine the efficiency of automatic segmentation and the accuracy of manual segmentation and propose a semi-automatic strategy to segment video objects. Compared to manual segmentation, our semi-automatic strategy requires the user to label or modify the object contours for only a few I frames (initial frames) when necessary. All the other frames are processed by an automatic region-based object tracking process. Hence, it is much faster than manual segmentation. Compared to automatic segmentation, the semi-automatic strategy takes the manually specified initial segmentation result and propagates it automatically to the subsequent frames until the user considers the accuracy unacceptable. After the manual correction, the automatic tracking process is resumed. In this way, the accuracy of the segmentation is guaranteed. Therefore, the proposed strategy is very practical and provides a feasible solution for preprocessing video sequences for object-based MPEG-4 coding.
Figure 1. The flow chart of the proposed strategy.
2. System Overview
The entire semi-automatic video object segmentation strategy consists of two main steps: interactive segmentation and automatic tracking (segmentation). A system using this strategy therefore comprises two corresponding components. The interactive segmentation part takes a single image (either the first frame or an intermediate frame that requires correction) as input. The user can specify the contour of a semantic object quickly and easily with a computer-aided segmentation tool. The manually segmented frame is the input of the automatic tracking part, which propagates the initial segmentation result to the subsequent frames of the video sequence. Because of the cumulative error of the automatic tracking algorithm, the automatic segmentation result may become inaccurate after some frames. Hence, we require that the automatic tracking process be supervised by the user. When the user finds the automatic tracking result unacceptable, he or she can pause the tracking process, interactively correct the contour, and then resume the automatic tracking with the corrected object contour. This cycle is repeated until the end of the video sequence is reached. The flow chart of the proposed strategy is shown in Figure 1.
In our implementation, we use an improved Intelligent Scissors [2] as the interactive segmentation tool and the algorithm of Gu and Lee [1] as the automatic tracking part. However, the proposed strategy can also be implemented with other algorithms.

During the segmentation process, the user moves the mouse around the object of interest and clicks a few key points as necessary. The Intelligent Scissors links every two consecutive key points on the contour with an optimal path in a weighted graph computed from the image. Dijkstra's algorithm [6] is used to find the optimal path. Whenever the user clicks a key point, the system also snaps the point to the nearest gradient peak automatically, to ensure the accuracy of the contour. Finally, by connecting all these piece-wise optimal paths between consecutive key points, we obtain the object's contour.

Sometimes the live-wire boundary may be attracted away from a desired weak edge toward an undesired strong edge. Hence, different cost weights for different gradient values are trained on the fly during the interactive specification of the object's contour: the cost weights of desired gradient values are decreased, while those of undesired gradient values are increased. This keeps the live-wire boundary attached to the correct edges.
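Purely as an illustration of such weight adjustment (the histogram-based scheme and function names below are assumptions for exposition, not the exact training rule of [2]), one could sample gradient magnitudes along the contour segments the user has already accepted and turn their frequencies into a cost lookup table:

```python
import numpy as np

def train_gradient_costs(gradients_on_contour, num_bins=64, strength=0.8):
    # Illustrative sketch only: gradient values that occur often on the
    # accepted (desired) boundary get a low cost, unseen values stay costly.
    hist, _ = np.histogram(gradients_on_contour, bins=num_bins, range=(0.0, 1.0))
    freq = hist / max(hist.max(), 1)       # 1.0 for the most common gradient value
    return 1.0 - strength * freq           # frequent (desired) values -> low cost

def gradient_cost(g, cost_lut):
    # Look up the trained cost for a normalized gradient magnitude g in [0, 1].
    idx = min(int(g * len(cost_lut)), len(cost_lut) - 1)
    return cost_lut[idx]
```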
3. Interactive Segmentation
Since the segmentation result of the initial frame is the input of the automatic tracking process, which will propagate and accumulate errors, human interaction is necessary to guarantee its accuracy. Unlike the automatic tracking part, this interactive segmentation process works on a single image. There are many approaches to segmenting an image or finding an object's contour in an image, such as edge-based methods [5] and energy-minimization methods [3]. Considering both efficiency and accuracy, we choose the Intelligent Scissors [2] for our implementation. The Intelligent Scissors lets the user drag the mouse around the contour of the object of interest and click a few key points as necessary; the system then automatically snaps the contour onto the actual object boundary. One drawback of the Intelligent Scissors, however, is that each time the user clicks a new key point there is a noticeable delay, which greatly degrades the user experience. Even its accelerated version [4] spends considerable time on pre-segmentation, and its final result may no longer be a piece-wise optimal path. In this paper, we propose a new acceleration scheme that both guarantees the piece-wise optimal solution and keeps the interaction smooth. In the rest of this section we briefly describe the original Intelligent Scissors and then explain how we have improved it.
3.1 Original Intelligent Scissors
The Intelligent Scissors is an interactive image segmentation tool. It enables users to extract objects of interest from a digital image accurately and quickly with simple mouse motions. It formulates the edge-finding task (between two key points in the image) as an optimal path search problem in a weighted graph. Each pixel of the image is a node of the weighted graph, and directed weighted edges connect each pixel with its 8 neighbors. The cost of each edge is a weighted sum of several image features: Laplacian zero-crossing, gradient magnitude, gradient direction, edge pixel value, inside pixel value, and outside pixel value. The last three features are used in the later on-the-fly training. Multi-level gradient magnitude and multi-level Laplacian zero-crossing determine the location of strong edges; they are multi-level in order to handle both fine details and larger-scale edge features. Gradient direction adds a smoothness constraint on the contour of the object of interest. See [2] for the detailed definition of the weighted sum of these features.
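As a rough illustration of this cost (the exact feature definitions and weight values are those of [2] and are not reproduced here), the local cost of the directed edge from pixel p to a neighboring pixel q can be written as a weighted sum of the six features:

$$
l(p,q) \;=\; \omega_Z f_Z(q) + \omega_G f_G(q) + \omega_D f_D(p,q) + \omega_P f_P(q) + \omega_I f_I(q) + \omega_O f_O(q),
$$

where f_Z, f_G, and f_D denote the Laplacian zero-crossing, gradient magnitude, and gradient direction features, f_P, f_I, and f_O the edge, inside, and outside pixel value features, and the ω's their respective weights.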
3.2 Improvements of the Intelligent Scissors
In the original Intelligent Scissors, for each newly inserted key point the system calculates the optimal paths from that key point to all other pixels, which incurs a noticeable delay. This is a "compute once for all" strategy. After the delay, when the user moves the mouse to search for the next key point, the system can display the optimal path from the newly added key point to the current mouse position in real time, because the optimal paths from the key point to all other points have already been calculated. In this paper, we propose a "compute as necessary" strategy, which completely eliminates the lag when adding a new key point while still keeping the subsequent interaction smooth. This improvement still guarantees the optimal path.

The original implementation of the optimal path search is based on dynamic programming; its pseudo code can be found in [2]. It calculates all the optimal paths from a given node (key point) to all other nodes in the graph. Its basic idea is to take advantage of the optimal paths already found from the key point to some points in order to compute the optimal paths to other points. To do so, all optimal paths that have been calculated are kept sorted from low cost to high cost during the search process, and the algorithm expands from the existing path with the lowest cumulative cost to find other optimal paths. Hence, nodes whose optimal paths have lower cumulative cost are finalized earlier. Our improvement exploits exactly this property. We use the same optimal path search algorithm as the original Intelligent Scissors but allocate a buffer to store all intermediate optimal paths from the current key point to the points processed so far. The search process can therefore be paused as soon as the needed optimal path becomes available and resumed later when a new, not-yet-available optimal path is required. The resulting optimal path is exactly the same as the one found by the original Intelligent Scissors.

If the current key point is K and the user moves the mouse to a temporary point P, the interactive tool should display the optimal path from K to P on the screen. The revised search process continues only until the optimal path KP is available; we then pause the process so that no unnecessary calculation is done. When we find the optimal path KP, we also know all optimal paths from K to other temporary points whose cumulative costs are less than that of KP. Hence, the next time the user moves the mouse to a new temporary point P', we can find KP' using the following steps:
(1) Check whether the optimal path from K to P' is already available. If so, output it; otherwise, go to (2).
(2) Resume the last paused search (or start it for the first time if P' is the first temporary point after K was specified) until the optimal path from K to P' is available.
(3) Pause the search and output the optimal path.
This "compute as necessary" strategy introduces no noticeable delay, so the interaction feels completely smooth as the user moves the mouse forward from a clicked key point around the object to examine and search for a satisfying object contour.
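The following is a minimal sketch of this pausable search, assuming hypothetical helper callbacks neighbors(p) and edge_cost(p, q); the class and parameter names are illustrative, not the authors' code:

```python
import heapq

class LazyScissorsSearch:
    # Sketch of the "compute as necessary" strategy: a Dijkstra search rooted
    # at the key point that pauses as soon as the currently needed path is
    # finalized and resumes later from where it stopped.
    def __init__(self, key_point, neighbors, edge_cost):
        self.neighbors = neighbors          # neighbors(p) -> adjacent pixels
        self.edge_cost = edge_cost          # edge_cost(p, q) -> local cost of edge p->q
        self.best = {key_point: 0.0}        # best known cumulative cost
        self.pred = {key_point: None}       # predecessor on the best path
        self.done = set()                   # nodes whose optimal path is final
        self.frontier = [(0.0, key_point)]  # min-heap of (cost, node)

    def path_to(self, target):
        # Resume the search until `target` is finalized, then pause.
        while target not in self.done and self.frontier:
            cost, node = heapq.heappop(self.frontier)
            if node in self.done:
                continue                    # stale heap entry
            self.done.add(node)             # optimal path to `node` is now known
            if node == target:
                break                       # pause: nothing unnecessary is computed
            for nxt in self.neighbors(node):
                new_cost = cost + self.edge_cost(node, nxt)
                if nxt not in self.done and new_cost < self.best.get(nxt, float("inf")):
                    self.best[nxt] = new_cost
                    self.pred[nxt] = node
                    heapq.heappush(self.frontier, (new_cost, nxt))
        # Walk predecessors back to the key point to reconstruct the path.
        path, node = [], target
        while node is not None:
            path.append(node)
            node = self.pred[node]
        return list(reversed(path))
```

Each call to path_to either returns immediately when the target has already been finalized (step 1 above) or resumes the paused search just far enough to finalize it (steps 2 and 3).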
4. Automatic Tracking
For the video object segmentation problem, there is currently no automatic method that works well under all conditions. Some solutions generate fairly good results for very specific problems, but for general MPEG-4 object-based coding the input video may violate the assumptions made by those automatic segmentation algorithms. For instance, non-rigid deformations and free-form movements appear frequently in object motion in video. Hence, a fully automatic approach to object segmentation in video for MPEG-4 coding is hardly feasible. A semi-automatic approach, such as the one proposed in this paper, can solve the problem in a compromised way. Since the initial segmentation result is specified manually and the user supervises the automatic segmentation result, the main task of the automatic tracking part is simply to propagate the initial segmentation result as far as it can to the subsequent frames of the video sequence. The automatic tracking algorithm we use in this paper is the one proposed by Gu and Lee [1], which can deal with generic semantic video objects that may contain disconnected components and multiple non-rigid motions. Hence, it is quite suitable for extracting video objects for MPEG-4 applications. Next we briefly explain some key issues of the algorithm.
4.1 Region Extraction
Gu and Lee's algorithm [1] is a backward region-based classification algorithm. All later processing, such as motion estimation and backward classification, takes regions rather than pixels as the primitive units. Hence, reliable region extraction is the basis of, and the guarantee for, the accuracy of all later procedures. The algorithm first applies a median filter to the input image before region extraction. A fast spatial segmentation algorithm named "LabelMinMax" is then carried out to extract regions. It is a fast, sequential, region-growing-like algorithm in which homogeneity is controlled by the color difference within a region.
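Purely as an illustration of this kind of region growing (the exact LabelMinMax algorithm is defined in [1]; the function below is a simplified sketch, not the authors' implementation), a region can keep absorbing 4-connected neighbors as long as the per-channel max-min color spread stays below a threshold:

```python
import numpy as np
from collections import deque

def grow_regions(image, max_diff=25.0):
    # image: H x W x 3 array; returns an H x W label map and the region count.
    h, w, _ = image.shape
    labels = np.full((h, w), -1, dtype=np.int32)
    next_label = 0
    for sy in range(h):
        for sx in range(w):
            if labels[sy, sx] != -1:
                continue
            lo = image[sy, sx].astype(np.float64)      # running per-channel min
            hi = lo.copy()                             # running per-channel max
            labels[sy, sx] = next_label
            queue = deque([(sy, sx)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and labels[ny, nx] == -1:
                        c = image[ny, nx].astype(np.float64)
                        new_lo, new_hi = np.minimum(lo, c), np.maximum(hi, c)
                        if (new_hi - new_lo).max() <= max_diff:  # homogeneity test
                            lo, hi = new_lo, new_hi
                            labels[ny, nx] = next_label
                            queue.append((ny, nx))
            next_label += 1
    return labels, next_label
```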
4.2 Region-based Motion Estimation
Let I_{t-1} be the vector image at time t-1 and I_t the vector image at time t. The region extraction procedure has extracted N regions in I_t: R_i (i = 1, 2, ..., N). We assume that each region is a translation of its counterpart in the previous frame I_{t-1}, with motion vector V_i for region R_i. The motion vector V_i is selected by minimizing the following prediction error (PE):

$$
PE = \min_{V_i} \sum_{p \in R_i} \left\| I_t(p) - I_{t-1}(p + V_i) \right\|,
$$

where ‖·‖ denotes the absolute difference between two vectors and I(p) is the color vector of point p. Under the assumption that a region moves only slightly between two consecutive frames, all candidates for V_i are selected within a threshold V_max, which can be adjusted to different situations. V_i thus indicates the relative location of R_i in the previous frame at time t-1.
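As a sketch of this search (the function name and interface are hypothetical; the real system works on the extracted region labels and vector images), the exhaustive candidate search within V_max for one region could be written as:

```python
import numpy as np

def estimate_motion(I_t, I_prev, region_mask, v_max=8):
    # Try every integer translation within [-v_max, v_max] for the pixels of
    # one region and keep the one minimizing the prediction error PE.
    ys, xs = np.nonzero(region_mask)             # pixels p in region R_i
    h, w = I_t.shape[:2]
    best_v, best_pe = (0, 0), float("inf")
    for dy in range(-v_max, v_max + 1):
        for dx in range(-v_max, v_max + 1):
            py, px = ys + dy, xs + dx             # warped coordinates p + V_i
            valid = (py >= 0) & (py < h) & (px >= 0) & (px < w)
            if not valid.all():
                continue                          # skip candidates leaving the frame
            diff = I_t[ys, xs].astype(np.float64) - I_prev[py, px].astype(np.float64)
            pe = np.abs(diff).sum()               # sum of absolute color differences
            if pe < best_pe:
                best_pe, best_v = pe, (dy, dx)
    return best_v, best_pe
```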
4.3 Region-based Classification
For each region R_i at time t, we now have a motion vector V_i from the previous step. Hence, we can warp the region backward to the previous frame at time t-1. The warped region R'_i is defined as:

$$
R'_i = \bigcup_{p \in R_i} \{\, p + V_i \,\}.
$$

We also have the segmentation result at time t-1: O_{t-1,j} (j = 1, 2, ..., m), where O_{t-1,j} is the region occupied by semantic object j at time t-1. The extracted regions can then be classified so that we can determine to which object each region R_i at time t should be assigned. We use the Majority Overlapped Area (MOA) criterion to classify these regions: R_i is first warped backward to R'_i; the overlapped area of R'_i with every O_{t-1,j} (j = 1, 2, ..., m) is then calculated; finally, R_i is assigned to the object whose region at time t-1 has the largest overlap with R'_i. In this way, every region is classified to a particular object. The region of an object at time t is then the union of all regions classified to it, and the contour of this combined region is the contour of the semantic object. After the classification, some post-processing is applied to remove very small areas and smooth the object contour.
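A minimal sketch of the MOA rule under these definitions (the function name and mask representation are assumptions for illustration):

```python
import numpy as np

def classify_regions(region_labels, motions, prev_object_mask, num_regions):
    # region_labels: H x W region ids at time t; motions[i] = (dy, dx) for region i;
    # prev_object_mask: H x W object ids at time t-1 (0 = background).
    h, w = region_labels.shape
    object_of_region = {}
    for i in range(num_regions):
        ys, xs = np.nonzero(region_labels == i)
        dy, dx = motions[i]
        py = np.clip(ys + dy, 0, h - 1)           # backward-warped region R'_i
        px = np.clip(xs + dx, 0, w - 1)
        ids, counts = np.unique(prev_object_mask[py, px], return_counts=True)
        object_of_region[i] = ids[np.argmax(counts)]   # majority overlapped object
    # The mask of each object at time t is the union of its assigned regions.
    new_mask = np.zeros_like(prev_object_mask)
    for i, obj in object_of_region.items():
        new_mask[region_labels == i] = obj
    return new_mask, object_of_region
```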
5. The User Interface
The UI displays five consecutive frames of the video in a row, as shown in Figure 2. To slide forward and backward, the user can type ">" and "<", respectively.