Interactive Feature Tracking using K-D Trees and Dynamic Programming

Aeron Buchanan, Dept. Engineering Science, University of Oxford, UK ([email protected])
Andrew Fitzgibbon, Microsoft Research, Cambridge, UK ([email protected])
Abstract

A new approach to template tracking is presented, incorporating three distinct contributions. Firstly, an explicit definition of a feature track is given. Secondly, the advantages of an image preprocessing stage are demonstrated and, in particular, the effectiveness of highly compressed image patch data stored in k-d trees for fast and discriminatory image patch searches. Thirdly, the k-d trees are used to generate multiple track hypotheses which are efficiently merged to give the optimal solution using dynamic programming. The explicit separation of feature detection and trajectory determination creates the basis for the novel use of k-d trees and dynamic programming. Multiple appearances and occlusion handling are seamlessly integrated into this framework. Appearance variation through the sequence is robustly handled in an iterative process. The work presented is a significant foundation for a powerful off-line feature tracking system, particularly in the context of interactive applications.
Figure 1. An example track generated using our system on 140 frames of a giraffe sequence (PAL video). The eye of the background giraffe was selected in the first frame. Detection of 50 candidate matches from the entirety of every frame was conducted at over 50fps. Track optimization using dynamic programming, including the correct analysis of the occlusion in the middle of the sequence, ran at 175fps on a 3GHz PC in MATLAB.
1. Introduction

Interactive feature tracking is the process of extracting long and accurate tracks of 3D features observed in 2D video. Points of interest are indicated with a single mouse click in one frame of the video, and the desired output of the tracker is the location of the point's 2D projection in every frame of the sequence. A perfect system would be robust to occlusion of the feature over arbitrary numbers of frames, would impose no inherent restriction on object speed, and would cope with considerable appearance change while avoiding algorithmic artifacts such as the "template update problem" [15]. Interactive feature tracking is a worthwhile goal for two main reasons: firstly, high-precision, accurate tracks of arbitrary features are required by many people (particularly in the special effects industry); and secondly, tracks are often required on fine detail that defeats optical flow algorithms (e.g. the plume of a feather, the tip of a spear), and on locations which have semantic importance (the corners of buildings, animals' eyes), but which cannot be expected to emerge from full-frame motion computation.

For a system to be interactive, the time between an operator's input and the system's response must be short. In this work we wish to compute the correlation of an arbitrary image neighbourhood in one frame with all other frames of a sequence at rates of over 50 frames per second, i.e. twice real-time. At modern TV resolutions (e.g. 'PAL', by which we mean 576×720 pixel frames at 25fps) this is not possible using even the fastest template matching algorithms. Modern commercial implementations [1, 2, 16] offer interactive speeds only when the feature's 2D velocity is highly restricted, and thus occlusion cannot be handled. In this work we overcome this problem using task-neutral preprocessing of the sequences (which can be computed overnight, say, or as video is downloaded or film is digitized).

In this paper, we present an interactive feature tracking system incorporating all of the above points: the delivery of precise and accurate tracks over long sequences in a matter of seconds, updated after parameter tweaking in fractions of a second. We achieve this by separating image searching
from track calculation, drawing on the speed and accuracy of k-d trees for detection and the efficiency and optimality of dynamic programming for tracking. Before describing the system itself in more detail, however, let us define what we mean by the term feature track.

We consider an image sequence of length F frames. A feature track at its simplest is a set of F image locations X = {x_f}, f = 1...F. With each 2D location is associated an appearance description vector, or filter jet [17], p_f, representing the image intensities in the neighbourhood of x_f in frame f. The set of description vectors is denoted P_X = {p_f}, f = 1...F. A track is initialized by the user indicating points in keyframes: a subset N of the input frames. For many tracks, the number of keyframes |N| = 1, but if, for example, a feature changes appearance significantly while occluded, it will be necessary to insert an additional keyframe when the feature emerges from occlusion. The keyframes are denoted by the set of (location, descriptor) tuples Q = {(y_i, q_i)}, i ∈ N. An off-line tracking algorithm, therefore, can be described as a system that takes Q as input and returns X as output, describing the position (and associated appearance) of the feature through the image sequence.

Given this representation of the tracking problem, we may define measures of the 'quality' of a track: given two tracks, X_1 and X_2, which best describes Q through the image sequence? We may now introduce priors on inter-frame motion and appearance change. We do not expect a feature in frame f to appear far from its position in frame f+1. Also, appearance changes tend to be gradual; at standard video frame rates, the image of a feature does not radically alter from one frame to the next. By the Taylor expansion of the transformation of image patches, it can be seen that all incremental patch warpings lead to smooth trajectories of the patch description vector through patch space (depending on the description transformation). Therefore, we can say that x_f and p_f move smoothly through their respective spaces, and the quality of a candidate track X can be defined via

    E(X) = Σ_f e(f)                                                          (1)

    e(f) = λ_d d(x_f, x_{f−1}) + λ_u u(p_f, p_{f−1}) + min_i a(p_f, q_i)     (2)

which includes a velocity term, d(·), an appearance update penalty, u(·), and a measure of the deviation of appearance from the keyframe set Q, a(·). In addition, the hard constraints x_i = y_i for i ∈ N must be satisfied. Thus the tracking problem may be restated as an optimization problem: choose X to minimize E(X). Track occlusion is handled by augmenting X with an occlusion flag, so that the appearance terms are suppressed when the feature is occluded, paying instead a fixed occlusion penalty. This state transition cost is similar to that in [6]. Our contribution is to show how this optimization can be computed efficiently, with a strong likelihood of finding the desired optimum, even in the presence of significant occlusion.
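As a concrete illustration of equations (1) and (2), the following Python sketch evaluates the track energy. The paper's implementation was in MATLAB, so this is an illustrative transcription; the Euclidean choices for d, u and a anticipate Section 3.4 ("Defining the error functions"), and the array layout and default weights (taken from the giraffe example in Section 4) are our assumptions.

    import numpy as np

    def frame_cost(f, X, P, Q, lam_d=0.67, lam_u=0.95):
        """e(f), equation (2). X: F x 2 locations x_f; P: F x 16 filter
        jets p_f; Q: list of keyframe descriptors q_i. d, u and a are
        Euclidean distances; the lambda values shown are illustrative."""
        d = np.linalg.norm(X[f] - X[f - 1])            # velocity term d(x_f, x_{f-1})
        u = np.linalg.norm(P[f] - P[f - 1])            # appearance update u(p_f, p_{f-1})
        a = min(np.linalg.norm(P[f] - q) for q in Q)   # deviation from keyframes, min_i a(p_f, q_i)
        return lam_d * d + lam_u * u + a

    def track_energy(X, P, Q):
        """E(X), equation (1): the sum of e(f) over the sequence."""
        return sum(frame_cost(f, X, P, Q) for f in range(1, len(X)))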
The system presented in this paper can be separated into four stages. Firstly, the target image sequence is preprocessed: for a preset patch size, the subwindow around every pixel of every image is transformed to its descriptor, p. We show that a 16-element filter jet is sufficient to model 20×20 patches. For each frame, a k-d tree is built holding all feature vectors in that frame. Efficiently implemented, this can be stored at a cost of 24 bytes per pixel. The second stage is performed during interactive use: an image patch, represented by its 16-element filter jet, is selected by clicking in one frame, and the top M matches in every frame of the sequence are found. In the third stage, all M^F possible track candidates are considered and the optimal track, minimizing equation (1), is found in O(FM^2) time using dynamic programming. Typical track optimization times are around 1 second for a 100-frame sequence. Finally, if required, the track can be automatically refined to sub-pixel accuracy.

After a discussion of previous work in the next section, details of the implementation are given in Section 3. Results follow in Section 4.
2. Background

Previous research which informs our work falls into a number of categories. Tracking algorithms which explicitly treat occlusion and/or appearance change have been the subject of recent investigation [7, 8]; dynamic programming has appeared many times in computer vision [5, 14, 23]; fast template matching [10, 11, 13, 18, 20, 24] considers efficient patch search in long video sequences; and k-d trees are but one of a variety of geometric data structures which accelerate closest-point computations. However, we believe that ours is the first work to combine these techniques to solve the adaptive tracking problem.

The essential theme of adaptive tracking is this: if a template is defined by a single frame, its appearance will change over time, and the track will be lost. We call this tracking model "track to first". If instead the appearance is adapted from frame to frame (in the extreme case, the previous frame being the template for the current one), the track will soon drift onto a stationary piece of background. We call this scheme "track to previous". Consider "track to first": in situations where the appearance change is cyclic, or due to temporary occlusion, and if the tracker is willing to search the entirety of every frame (sometimes called the detection stage), the target can be found again. For trackers which depend on a motion model to limit computational cost, however, the cost of detection
in every frame is significant. Fast detection is possible using techniques such as those of Viola and Jones [21], but such techniques depend on a large number of training images, take time to train, and do not currently run at over 50fps on the large images (and small-scale features) we wish to process. Using the SVM tracking paradigm of Avidan [3] and Williams et al. [22], detection time is reported to be of the order of one second per frame [22], two orders of magnitude slower than we require.

Recent work has attempted to find a balance between track-to-first and track-to-previous. The "Wandering, Stable, Lost" algorithm of Jepson et al. [8] represents image patches using filter jets, as we do, and defines tracking as an optimization over explicitly labelled detection states, similar to our optimization of E in equation (1). The algorithm attempts to learn mixing parameters for each frame which weight the relative contributions of recent frames and keyframes in tracking, and shows impressive short-time occlusion resistance, coupled with fast adaptation to appearance change. However, their use of an online EM algorithm, and the relatively strong dependence on their motion model, mean that if the object moves significantly while occluded it will be lost, and that a full-frame search is computationally prohibitive.

Techniques based on interest points and hashed descriptors, such as SIFT [12], are also competitive with our work in computational cost. However, we wish to allow the user to explicitly specify the image location to be tracked. Even with generous interest-point detection thresholds, we expect no more than, say, 1000 detections per image, so it is unlikely in practice that an interest point will be found near the feature we wish to track. Systems which use interest points for tracking [19] depend on having relatively large objects containing several interest points in order to obtain a reliable track.
Figure 2. The 16 basis vectors used for the projection of patches from the ICCV’03 (see Figure 6) sequence, viewed as patches. For maximum impact the absolute value (magnitude) of the normalized pixel values is displayed.
3. Implementation

We now describe in more detail the implementation of our system. Summarizing Section 1, the key stages are: preprocessing, patch search, track optimization, and track refinement. All of the above are expressed in a "track to first" framework. We then describe how an adaptive tracker can be built which conservatively augments the keyframe set Q to cope with varying appearance. First, however, we consider in detail the question of patch size.

3.1. Patch size

Throughout this paper, we consider the tracking of fixed-size patches (we use 20×20). This is because the most difficult tracking problem is that of tracking relatively small objects in large images. If the object of interest occupies a considerable portion of the frame, it may be tracked by down-sampling the image sequence, or by using a "bag of SIFT" representation [19]. Thus we are interested in the tracking of objects whose diameter is of the order of 5% of the image diameter. In this work, we also require that scale does not change significantly through the sequence. The extension of the technique to multiple scales is well understood [12]: all computations are repeated on down-sampled sequences, with an increase in computational cost of the order of Σ_{i=1}^{∞} (1/2)^{2i} ≈ 33%.
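The 33% overhead is the tail of a geometric series: pyramid level i has half the linear resolution of level i−1, hence a quarter of the pixels, so the extra work relative to the full-resolution pass is

    Σ_{i=1}^{∞} (1/2)^{2i} = Σ_{i=1}^{∞} (1/4)^i = (1/4) / (1 − 1/4) = 1/3 ≈ 33%.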
3.2. Preprocessing

Preprocessing comprises two stages, data compression and geometric indexing, described in turn below. Geometric indexing is the storage of every pixel's patch independently in a k-d tree, and thus it is important that each patch be represented as compactly as possible.

Data compression. For our choice of patch size, each patch is a 1200-element vector (20×20×3) in colour sequences. This is a large number, especially as there are almost 400,000 patches in every frame of PAL video. We reduce each patch to a much shorter vector using a bank of filters [9] obtained by PCA over the sequence. Figure 2 shows the filters for the sequence in Figure 6. Taking a large set of patches at random from the sequence, we compute the first C principal components, so that a patch window W can be approximated as a linear combination of orthonormal basis windows, i.e. W ≈ Σ_{k=1}^{C} p_k C_k. The coefficients p_k are given by multiplication by the pseudo-inverse of the basis filters C_k. Thus our 3-channel colour image is expanded to a 16-channel filter-jet image. We further quantize this image to 8 bits per channel.

In order to characterize the impact of this data reduction, we generated a number of ground-truth feature tracks, and compared the precision-recall performance of matching using the full 20×20 patch against filter-jet representations for various values of C, both with and without the 8-bit quantization. Figure 3 shows that, surprisingly, the quantization has a negligible effect, and that comparing 16-element vectors is best. To make maximal use of the 8 bits for each frame, the upper and lower bounds of the points in the filter-jet space are found for each k-d tree separately and used to normalize the data before quantization. This does mean that each dimension of the patch coefficient vectors stored in each k-d tree has been scaled separately. Although this implicitly weights some coefficients over others, the ROC curves show that recall performance does not change compared to using full-precision unscaled coefficients.

Figure 3. ROC curves (true positive rate against false positive rate) comparing quantized filter-jet representations of patches (3-, 12-, 16- and 20-dimensional double precision, and 16-dimensional 8-bit) with the norm of the full patch residual. A log scale has been used to exaggerate the differences between the plots; the grey line represents expected random performance.
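The data-compression stage can be sketched in a few lines of Python (the paper's implementation was in MATLAB; the function names, the choice of SVD for PCA, and the mean-centering step are our assumptions):

    import numpy as np

    def learn_basis(frames, n_samples=10000, patch=20, C=16, rng=np.random):
        """PCA filter bank (Figure 2) from patches drawn at random from the
        sequence. Each colour patch is a 20x20x3 = 1200-element vector; the
        first C principal components become the basis filters C_k."""
        H, W = frames[0].shape[:2]
        S = np.empty((n_samples, patch * patch * 3))
        for n in range(n_samples):
            f = rng.randint(len(frames))
            y, x = rng.randint(H - patch), rng.randint(W - patch)
            S[n] = frames[f][y:y + patch, x:x + patch].ravel()
        S -= S.mean(axis=0)                  # centering before PCA: our assumption
        _, _, Vt = np.linalg.svd(S, full_matrices=False)
        return Vt[:C]                        # C orthonormal 1200-element filters

    def project(patches, basis):
        """Filter-jet coefficients p_k. For an orthonormal basis the
        pseudo-inverse is just the transpose, so projection is one product."""
        return patches @ basis.T             # N x 1200 -> N x C jet image

    def quantize(jets):
        """Per-dimension 8-bit quantization, normalized by the bounds of the
        points stored in this frame's k-d tree (Section 3.2)."""
        lo, hi = jets.min(axis=0), jets.max(axis=0)
        scale = np.maximum(hi - lo, 1e-12)
        return np.round(255 * (jets - lo) / scale).astype(np.uint8), lo, hi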
Geometric indexing. The idea of the preprocessing stage is to create data structures from which it is quick to recall information. After projection onto the filter basis, we view each W×H input frame as a collection of 16-dimensional vectors {p_1, ..., p_WH}. The problem of searching the image for a query patch q is thus reduced to a nearest-neighbour search in R^16, and in particular we would like to efficiently retrieve the locations of the M vectors most similar to q. For this we construct a k-d tree for each frame, taking care in the implementation that the storage cost of the tree is no more than about 24 bytes per pixel (after which point we may discard the 16-byte-per-pixel filter-jet image). This is achieved by storing balanced k-d trees of constant depth h, chosen so that the average number of patches stored at each leaf is about 100; this order of magnitude of leaf density was seen to give the best search times. When constructing the tree, it is not necessary to achieve perfect balance: splitting on the median coordinate value at each node results in leaves with occupancies only a few percent away from the target number. A pointer-based data structure is not required, as the deterministic layout of a full binary tree means that it is quick and easy to locate node or leaf attributes from within a linear array. The path through the tree to every leaf is a binary vector of length h which, converted to an integer, may be used to index into an offset array which defines the set of patch vectors in that leaf. Although in principle no copies of the tree data (the matrix of patch vectors) need be made, we do reorder the patch vectors so that those in the same leaf are adjacent in memory, in order to maximize cache coherency of the lookup.

It is worth noting that although k-d trees are sometimes considered inefficient in spaces of even moderately high dimension [12], our PCA projection of the data places it in an ideal geometric configuration, meaning that, on average, lookups benefit from more than a threefold speedup over efficient exhaustive search in R^16.
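A minimal sketch of the pointer-free balanced tree described above, in Python. The split-dimension heuristic (widest spread) and the leaf-offset bookkeeping are our choices; the essential properties, constant depth h, median splits, a flat full-binary-tree array, and leaf-contiguous reordering of the jets, follow the text:

    import numpy as np

    def build_implicit_kdtree(P, leaf_target=100):
        """Balanced k-d tree of constant depth h over the N x 16 jet matrix P,
        stored without pointers: split dims/values live in a flat array, and
        each leaf is a contiguous slice of the reordered data (for cache
        coherency). Roughly in line with the paper's ~24-byte/pixel budget:
        a reordered 16-byte jet plus a small index per pixel."""
        N = len(P)
        h = max(0, int(np.ceil(np.log2(max(N, 1) / leaf_target))))
        split_dim = np.zeros(2**h - 1, dtype=np.int32)
        split_val = np.zeros(2**h - 1, dtype=P.dtype)
        order = np.arange(N)                           # permutation of the rows of P

        def recurse(node, lo, hi, depth):
            if depth == h:
                return
            idx = order[lo:hi]
            dim = int(np.argmax(P[idx].max(axis=0) - P[idx].min(axis=0)))
            mid = (lo + hi) // 2
            part = np.argpartition(P[idx, dim], mid - lo)   # median split by count
            order[lo:hi] = idx[part]
            split_dim[node], split_val[node] = dim, P[order[mid], dim]
            recurse(2 * node + 1, lo, mid, depth + 1)
            recurse(2 * node + 2, mid, hi, depth + 1)

        recurse(0, 0, N, 0)
        # leaf offsets: the recursion always splits an interval at its midpoint
        bounds = np.array([0, N])
        for _ in range(h):
            mids = (bounds[:-1] + bounds[1:]) // 2
            bounds = np.unique(np.concatenate([bounds, mids]))
        return P[order], order, split_dim, split_val, bounds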
3.3. Image searching

Searching the images for the input patches Q is simply a matter of querying the k-d trees for each patch in turn. The top M matches, i.e. the M nearest neighbours to the input patch, are retained from the search (we use M = 30); this is easy to implement efficiently for k-d trees. Standard k-d tree searches cannot be restricted spatially, so every k-d tree search is a search of an entire image. Indeed, this is exactly what is desired, as the image search stage is to be kept independent of the track optimization stage: motion models are restrictive (even spatial consistency and smoothness), and with occlusions and long sequences any sensible motion model would require full-frame searches at some stage.
It was mentioned in Section 1 that small transformations (including translations) of a patch remain close to the original in patch space. This is seen in the output of the image searches, with many clusters of matches appearing among the nearest neighbours. With a convolution-style strategy, a non-max suppression technique would normally be used, but the sparseness of the data here makes non-max suppression less straightforward. We use a fast clustering algorithm and retain only the best response in each spatial cluster, discarding the rest, generally leaving M̂ ≈ M/8 matches for each frame. Because the dynamic programming algorithm is sensitive to the number of matches (having a time complexity quadratic in M̂), further match pruning is performed as successive patches from Q are searched for: any new matches that are adjacent to existing matches in a frame are removed.
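A sketch of the per-frame search and cluster pruning. SciPy's k-d tree stands in for the implicit tree above purely to show the interface (in the real system the trees are built once during preprocessing), and the suppression radius is an assumed value:

    import numpy as np
    from scipy.spatial import cKDTree

    def top_matches(jets, q, M=30):
        """Top-M nearest neighbours of the query jet q over one frame's
        N x 16 jet matrix; each row index maps back to a pixel location."""
        err, idx = cKDTree(jets).query(q, k=M)
        return np.asarray(idx), np.asarray(err)

    def cluster_prune(xy, err, radius=5.0):
        """Keep only the best response in each spatial cluster (Section 3.3),
        discarding the rest; typically leaves M-hat ~ M/8 matches. A greedy
        stand-in for the fast clustering algorithm used in the paper."""
        keep, taken = [], np.zeros(len(xy), bool)
        for i in np.argsort(err):             # best (lowest error) first
            if not taken[i]:
                keep.append(i)
                taken |= np.linalg.norm(xy - xy[i], axis=1) < radius
        return np.array(keep, dtype=int)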
frame number:      1        2        3        4        5        6        7        8      ...
input patches:   inf,nan  0.2, 4   inf,nan  inf,nan  inf,nan  inf,nan  1.5, 2   inf,nan  ...
matches:         0.2,nan  inf,nan  0.4, 1   0.5, 2   1.0, 3   1.2, 2   inf,nan  1.8, 1   ...
                 0.2,nan  inf,nan  0.3, 1   0.6, 2   0.9, 2   1.1, 3   inf,nan  1.9, 1   ...
                 0.2,nan  inf,nan  0.4, 1   0.8, 5   inf,nan  1.2, 3   inf,nan  1.6, 1   ...
                 0.1,nan  inf,nan  inf,nan  0.7, 3   inf,nan  inf,nan  inf,nan  inf,nan  ...
occlusion:       0.5,nan  inf,nan  0.7, 1   0.9, 5   1.1, 5   1.4, 2   inf,nan  2.0, 1   ...
Figure 4. The dynamic programming table is filled one column at a time from left to right. In this example, each element of the table holds the error of the optimal path up to that match in that frame and the match in the previous frame on that path. Note that this approach naturally deals with multiple input patches, varying numbers of matches per frame and occlusions. Here, the best track is {4,1,2,3,2,2,1,4}.
3.4. Track path optimization
At this stage there are, on average, not many more than M̂ matches in each frame. Every possible candidate track, X_c, takes one match from each image. The best track, i.e. the one of the combinatorially explosive M̂^F candidate tracks that minimizes the track error E(X_c), is sought. Fortunately, this problem is solved by dynamic programming (DP), which finds the optimal choice of matches over the whole sequence while searching over a greatly reduced number of candidates. Occlusion handling is easily incorporated into the optimization by classing occlusion as a special kind of match and modifying the error in the summation to depend on the transition state of the feature from frame f−1 to f:

    E′(X_c) = Σ_f c(f),   where c(f) = λ_o   if the feature becomes occluded
                                       λ_r   if the feature remains occluded
                                       v(f)  if the feature becomes visible
                                       e(f)  if the feature remains visible     (3)

where v(f) is the velocity term d, as used in e(f), summed over the whole occlusion ending at frame f, as if the image coordinates of the track over these frames had progressed so as to minimize v(f). When v(f) comprises only d and no acceleration term, this is a straight line.

Defining the error functions. For d, u and a we currently use the simple Euclidean distance on the two vector arguments. We have also experimented with an acceleration term of the form s(x_f, x_{f−1}, x_{f−2}), but this increases the computational cost of the DP stage by a factor of M̂ and does not currently allow us to track at interactive rates.
Dynamic Programming. Dynamic programming is a way of reducing the search space for problems where one of a number of states must be chosen for each of a series of successive stages. Here, the stages are the frames of the sequence and the states are the possible matches (detections) in each frame. It can be implemented by filling a table whose rows and columns represent the matches (states) and frames (stages) respectively, as shown in Figure 4. By the principle of optimality, the table's entries may be calculated one column at a time, forwards through the frames, so that each entry holds the error of the optimal track, ending on that match, through the whole sequence up to that frame. To calculate the entry of a match in a given frame (column): cycle through each match of the previous frame, determine the total error of the track up to that point (by adding the pairwise error of the two matches from equation (3)), and store the lowest error together with the index of the chosen match. At the end of the sequence, the global optimum is the track whose error in the final column is lowest.

We set λ_r to 0.4λ_o, so there are only three coefficients to choose values for: λ_d, λ_u and λ_o. The calculation of the optimal track is fast enough that these can be adjusted by hand to get the best results; this is the interactive element of our system. In practice we have found that good values can be found with little effort in very few iterations. It is mainly the occlusion cost, λ_o, that most often needs tweaking.
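The table-filling and backtracking procedure, with the occlusion state of equation (3), can be sketched as follows. This is our illustrative reconstruction (the paper's implementation was in MATLAB): in particular, handling v(f) as a single straight-line penalty from the stored occlusion-entry location, and the frame-0 initialization of the occluded state, are simplifications.

    import numpy as np

    def dp_track(matches, app_cost, lam_d=0.67, lam_o=80.0):
        """DP over per-frame candidate matches plus one explicit 'occluded'
        state per frame, equation (3). matches[f]: (M_f, 2) candidate
        locations; app_cost[f]: (M_f,) appearance costs. The occluded state
        remembers where the track disappeared, so the 'becomes visible'
        cost v(f) is a straight-line velocity penalty over the occlusion."""
        lam_r = 0.4 * lam_o                        # remain-occluded cost (Section 3.4)
        F = len(matches)
        E = app_cost[0].copy()                     # cumulative error, visible states
        E_occ, occ_xy = lam_o, matches[0].mean(axis=0)   # crude frame-0 init
        back = []                                  # per-frame backpointers

        for f in range(1, F):
            X, Xp = matches[f], matches[f - 1]
            move = lam_d * np.linalg.norm(X[:, None, :] - Xp[None, :, :], axis=2)
            tot = E[None, :] + move                # visible -> visible, O(M^2)
            bp, E_vis = tot.argmin(axis=1), tot.min(axis=1)
            # occluded -> visible: straight-line penalty from occlusion start
            from_occ = E_occ + lam_d * np.linalg.norm(X - occ_xy, axis=1)
            came_from_occ = from_occ < E_vis
            E_vis = np.where(came_from_occ, from_occ, E_vis)
            bp = np.where(came_from_occ, -1, bp)   # -1 encodes the occluded state
            # visible -> occluded vs. occluded -> occluded
            enter, stay = E.min() + lam_o, E_occ + lam_r
            if enter < stay:
                E_occ, occ_xy, occ_bp = enter, Xp[int(E.argmin())], int(E.argmin())
            else:
                E_occ, occ_bp = stay, -1
            back.append((bp, occ_bp))
            E = E_vis + app_cost[f]

        track = [int(E.argmin())]                  # best final state (assumed visible)
        for bp, occ_bp in reversed(back):
            j = track[-1]
            track.append(int(bp[j]) if j != -1 else occ_bp)
        return track[::-1], float(E.min())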
Dynamic Programming. Dynamic programming is a way of reducing the search space for problems where one of a number of states must be chosen for each of a series of successive stages. Here, the stages are the frames of the sequence and the states are the possible matches (detections) in each frame. It can be implemented by filling a table whose rows and columns represent the matches (states) and
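The parabolic fit is the standard three-point fit applied per axis around the correlation peak; a minimal sketch (the exact interpolation scheme is assumed, as the paper does not spell it out):

    import numpy as np

    def subpixel_peak(corr):
        """Sub-pixel location of the maximum of a small correlation surface
        (Section 3.5): fit a parabola through the peak and its neighbours
        along each axis. Peaks on the boundary are returned unrefined."""
        y, x = np.unravel_index(np.argmax(corr), corr.shape)
        dy = dx = 0.0
        if 0 < y < corr.shape[0] - 1:
            c0, c1, c2 = corr[y - 1, x], corr[y, x], corr[y + 1, x]
            dy = 0.5 * (c0 - c2) / (c0 - 2 * c1 + c2)   # parabola vertex offset
        if 0 < x < corr.shape[1] - 1:
            c0, c1, c2 = corr[y, x - 1], corr[y, x], corr[y, x + 1]
            dx = 0.5 * (c0 - c2) / (c0 - 2 * c1 + c2)
        return y + dy, x + dx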
3.6. Adaptive tracking

The above discussion has been expressed in terms of a fixed keyframe set Q, which for most purposes will be a single patch. In situations where the patch appearance varies cyclically (for example, a face occasionally looking at the camera), this is often very effective, because the frames where the match is good tend to "lock down" the track, and the patch smoothness term overrides the difference between even quite distant frames and the single template. However, many sequences have more abrupt or more severe appearance change, for example an eye which blinks. In these cases, the M candidate patches in a given frame may not include the correct match, and two options are available. The obvious recourse is to supply additional user input. We also implemented an 'auto-track' feature to reduce the number of input patches that need to be selected by the operator. It is a simple iteration comprising two steps. Firstly, the track is optimized, as described above, and the local minimum of the track error, as defined in equation (3), is found for each frame. Secondly, over those frames which have not yet been used to define input patches, the feature appearance is copied from P, and the one most unlike the patches in Q is added to Q as a new patch. The sequence is searched for the new patch and the two steps are repeated. This continues until no new patches are added or the error of the track stops decreasing. This is a useful tool for dealing with false negatives. However, it assumes that the current track contains only a small proportion of false positives, if any: a reasonable assumption given that the process is user-driven.
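The iteration can be sketched as below; search and optimize are stand-ins for the k-d tree search and DP stages described above, and the stopping test mirrors the description in the text (excluding frames that already contribute keyframes is omitted for brevity):

    import numpy as np

    def auto_track(Q_jets, search, optimize, max_iters=10):
        """'Auto-track' iteration of Section 3.6, sketched. Q_jets: list of
        keyframe filter jets. search(Q) returns per-frame candidates;
        optimize(cands) returns (track, error, P) with P the per-frame
        tracked appearances. Both callables are hypothetical stand-ins."""
        track, err, P = optimize(search(Q_jets))
        for _ in range(max_iters):
            # distance of every frame's appearance to its nearest keyframe
            d = np.array([min(np.linalg.norm(p - q) for q in Q_jets) for p in P])
            Q_jets = list(Q_jets) + [P[int(d.argmax())]]   # most unlike the keyframes
            new_track, new_err, new_P = optimize(search(Q_jets))
            if new_err >= err:                             # no improvement: stop
                break
            track, err, P = new_track, new_err, new_P
        return track, Q_jets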
Figure 5. Top row: frames 50, 150, 250 and 350 of a 464-frame sequence (360×270) of a girl reading, with six tracks. Detection rates averaged 135fps. Bottom row: three example tracks through the 1145-frame 'Dudek' sequence (320×240), shown at frames 100, 800, 950 and 1100. Detection consistently ran at 200fps.
4. Results

The system is demonstrated with four video sequences. The first is shown in Figure 1: a giraffe in the background is occluded by another giraffe walking in front of it. A single click on the rear giraffe's eye is enough to obtain a perfect track through all 140 PAL-resolution colour frames. The coefficients were set to λ_u = 0.95, λ_d = 0.67 and λ_o = 80.

The second sequence can be seen in Figure 6. It is a talk from ICCV'03; on the presentation DVD, a cropped version of this video was used. A real-world task, therefore, is the tracking of the presenter for an aesthetic crop, keeping him central in the frame. Although, technically, the speaker is never occluded, he walks to the projector screen several times, resulting in drastic lighting changes in addition to continual pose variation. These attributes make this sequence surprisingly troublesome. We attempted this task using a variety of modern tracking methods with little success: continual restarts were necessary throughout the sequence to overcome persistent tracking failures. Two comparisons are shown in Figure 6: firstly, an advanced mean-shift tracker, implemented from [4], which ran at real time on the whole sequence and failed to give even satisfactory results; secondly, a colour KLT tracker, which initially worked very well but had failed completely by frame 3000. There are obviously many parameters that could be adjusted to improve the performance of the KLT tracker for this particular sequence, but the time taken for each iteration is not interactive; indeed, we spent several hours optimizing the algorithm to obtain the track described.

Using our system, we were able to track the presenter over the first 12000 frames with only three input clicks and seven parameter adjustments. Each patch search ran at almost 200fps and track updates took around 0.5 seconds; the track was also extended automatically. The total time spent was just over three minutes. The sequence was down-sampled to half resolution and loaded in frame steps of 15. Similar parameters were used as for the giraffe sequence, but with a much higher penalty for occlusions to minimize 'drop-outs'. Input patches were defined in frames 5, 725 and 4415. The iterative track extension routine was then used to obtain a further 6 patches automatically. The resulting track is not as accurate as for the giraffe sequence, but it is a more than satisfactory solution to the problem laid down above. Throughout the video the track does wander slightly over the presenter's shoulders, occasionally significantly, and at a couple of points loses him completely for a few frames. However, the system always recovered and the track was robustly restored each time.

Two more example sequences, with six and three tracks respectively, are shown in Figure 5 (refer to the figure caption for details).
Figure 6. Six of the 12000 frames (6, 707, 4754, 8105, 8411 and 11780) from the ICCV'03 sequence. A real-world problem is to track the speaker for aesthetic cropping. First column (k-d tree + DP): with three clicks and seven parameter adjustments, an almost complete track was obtained; the head's centre-line was successfully tracked for 98.4% of the sequence, with full recoveries after the small number of wandering and lost tracks. Second column (mean-shift): the mean-shift tracker performed poorly; it failed during the zoom, was manually restarted in frame 830, and continually lost track (being automatically restarted in the centre of the image). Third column (colour KLT): the KLT tracker was successful, even through the zoom, until about frame 3000, when it was distracted by the lectern and then started tracking the presenter's shoes.
5. Conclusions

This paper shows how separation of the feature tracking problem into preprocessing and interactive stages allows a significant improvement in interactive performance. Long sequences can be tracked with very little operator input, and tuning parameters can be varied in real time for maximal efficiency. We argue that this is a valuable addition to the tracking literature: it sacrifices fully automatic performance, but allows the system to be applied to a wide variety of footage and use-cases.

The preprocessing stage is the system's major strength and weakness. It is a strength because it allows arbitrarily large sequences to be tracked, given arbitrarily large RAM in which to store the preprocessed data. Its weakness is the cost of RAM. Our examples were produced on a laptop with 1GB of RAM, allowing us to store of the order of 100 PAL frames' worth of metadata. Given the increasing popularity of compute clusters, however, it is easy to foresee "k-d tree servers", where each cluster node handles patch searching for a subset of the sequence, performed completely independently per subset. The cluster nodes need large memories, but not fast processors; only the DP need then be performed on the client machine, allowing interactive tracking of even very long sequences. Note that because our computation is orders of magnitude faster than raw correlation, orders of magnitude fewer cluster nodes are required.

Extensions include the simultaneous tracking of multiple features on the same object, allowing some between-track coherence to strengthen the motion model. The patch search stage currently uses a Euclidean distance kernel to compare patches; it might be useful to use a robust kernel to cope with specularities and partial occlusions. Alternative basis sets for the patch projection, or completely different descriptors (e.g. using invariant transformations), may give better performance. Also, our current DP implementation is much slower than it need be; an optimized implementation would allow the use of second-order motion models. In summary, there are many variations on the components of the basic system which need to be investigated, but we feel that the argument for this approach is strong.
References

[1] 2D3. Boujou. http://www.2d3.com/.
[2] Apple. Shake. http://www.apple.com/shake/.
[3] S. Avidan. Subset selection for efficient SVM tracking. In Proc. CVPR'03, 2003.
[4] C. Bibby and I. Reid. Visual tracking at sea. In Proc. ICRA'05, 2005.
[5] Y. Chen, T. Huang, and Y. Rui. Optimal radial contour tracking by dynamic programming. In Proc. ICIP'01, 2001.
[6] Y. Huang and I. Essa. Tracking multiple objects through occlusions. In Proc. CVPR'05, 2005.
[7] T. Ishikawa, I. Matthews, and S. Baker. Efficient image alignment with outlier rejection. Technical Report CMU-RI-TR-02-27, Carnegie Mellon University, 2002.
[8] A. Jepson, D. Fleet, and T. El-Maraghi. Robust online appearance models for visual tracking. PAMI, 25(10), 2003.
[9] D. Jones and J. Malik. A computational framework for determining stereo correspondence from a set of linear spatial filters. Image and Vision Computing, 10(10):699–708, 1992.
[10] T. Kawanishi, T. Kurozumi, K. Kashino, and S. Takagi. A fast template matching algorithm with adaptive skipping using inner-subtemplate distances. In Proc. ICPR'04, 2004.
[11] J. Lewis. Fast normalized cross-correlation. Vision Interface, 1995.
[12] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.
[13] B. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Proc. Imaging Understanding Workshop, pages 121–130, 1981.
[14] R. Mann, A. Jepson, and T. El-Maraghi. Trajectory segmentation using dynamic programming. In Proc. ICPR'02, 2002.
[15] I. Matthews, T. Ishikawa, and S. Baker. The template update problem. In Proc. BMVC'03, 2003.
[16] RealViz. Matchmover. http://www.realviz.com/.
[17] C. Schmid and R. Mohr. Local grayvalue invariants for image retrieval. PAMI, 19(5):530–535, 1997.
[18] H. Schweitzer, J. Bell, and F. Wu. Very fast template matching. In Proc. ECCV'02, 2002.
[19] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In Proc. ICCV'03, 2003.
[20] L. Stefano and S. Mattoccia. Fast template matching using bounded partial correlation. Machine Vision Applications, 13:213–221, 2003.
[21] P. Viola and M. Jones. Robust real-time face detection. IJCV, 57(2):137–154, 2004.
[22] O. Williams, A. Blake, and R. Cipolla. A sparse probabilistic learning algorithm for real-time tracking. In Proc. ICCV'03, 2003.
[23] Z. Yao and H. Li. Tracking a detected face with dynamic programming. In Proc. CVPRW'04, 2004.
[24] S. Yoshimura and T. Kanade. Fast template matching based on the normalized correlation by using multiresolution eigenimages. In Proc. Int. Conf. Intelligent Robots and Systems, pages 2086–2093, 1994.