Dense descriptor for visual tracking and robust update model strategy

Pier Luigi Mazzeo · Paolo Spagnolo · Marco Leo · Pierluigi Carcagni · Marco Del Coco · Cosimo Distante
Received: date / Accepted: date
Abstract Context analysis is a research field that has attracted growing interest in recent years, especially thanks to the encouraging results obtained by semantic-based approaches. However, semantic strategies require trackers that are robust to long-term occlusions, viewpoint changes and identity swaps, which are the main weaknesses of many tracking-by-detection solutions. This paper proposes a robust tracking-by-detection framework based on Dense SIFT descriptors, combined with an ad-hoc target appearance model update that overcomes these issues. The obtained results show that our tracker competes with the state of the art and handles occlusions, clutter, and changes of scale, rotation and appearance better than competing tracking methods.

Keywords Visual Tracking · Dense SIFT · RANSAC
1 Introduction

In recent years the analysis of human/object interactions in free contexts (Crispim et al, 2013), as well as the recognition of specific actions, has become a topic of growing interest, especially thanks to the encouraging results obtained with semantic-based approaches. Many state-of-the-art algorithms integrate a classic bottom-up approach with some knowledge of the scene (Gaüzère et al, 2015). A semantic approach allows a higher-level representation of knowledge, as a human cognitive system does: prior knowledge of a generic context allows the definition of a set of semantic rules that lead to the recognition of specific events involving the actors of the scene under analysis.

Pier Luigi Mazzeo
National Research Council of Italy (CNR), Institute of Applied Sciences and Intelligent Systems (ISASI)
E-mail:
[email protected]
However, the wild contexts that usually surround these kinds of applications make the tracking strategy a key issue. Indeed, occlusions, scale or perspective variations and the other typical problems involved in object/people tracking are the most frequent causes of identity swapping or target loss, with the straightforward consequence of degrading the semantic capability of the scene analysis. This makes the correct update of the target appearance a prominent problem. In the visual object tracking field this task is one of the most recurring problems (Comaniciu et al, 2003; Isard and Blake, 1998; Jepson et al, 2003; Supancic and Ramanan, 2013; Wu et al, 2013; Yilmaz et al, 2006) and it remains highly challenging. Many application fields rely on motion-based recognition: automated surveillance, video indexing, human-computer interaction and robot or vehicle navigation.

Most previous works address this problem either as motion tracking or as object detection. Tracking-by-detection approaches have become a trend for object tracking (Avidan, 2007; Babenko et al, 2011; Grabner et al, 2008; Hare et al, 2016; Kalal et al, 2012; Leibe et al, 2008; Ramanan et al, 2007; Supancic and Ramanan, 2013; Wu and Nevatia, 2007): once a target object is identified by the user, it is tracked by detecting it frame by frame. Several evaluation papers (Pang and Ling, 2013; Song and Xiao, 2013; Wu et al, 2013) showed that this approach (e.g. Struck (Hare et al, 2016)) gives good performance on different benchmark sets. The most critical aspect of these methods is the updating of the object appearance over time. For example, if an object is occluded (e.g. a car occluded by trees) and the model is updated at every frame, the well-known drift problem arises (Matthews et al, 2004): the occluding object (the trees) may end up being tracked instead of the object of interest (the car).

In this paper we propose an algorithm based on a dense local representation which is able to manage such occlusions as well as complex interactions between illumination and object/camera movements. It requires a bounding box initialization in the first frame, and then it tracks objects in challenging scenarios (very long sequences, several occlusions, changes in illumination conditions). The proposed approach starts by learning an initial appearance model from the first user annotation, similarly to (Hare et al, 2016; Kalal et al, 2012; Supancic and Ramanan, 2013), and then uses it to track the object in the subsequent frames. The implemented algorithmic pipeline introduces a semantic model in which the entities are the objects and the contexts, the related significant regions are the attributes, and the relationships among entities are defined on the basis of distance constraints among the region descriptors. We use Dense SIFT as the local descriptor for appearance modelling, combined with a nearest neighbour (NN) classifier able to discriminate the target object from the context, which also prevents improper object template updating during occlusions.

The main contributions of this paper are:
– The use of a dense descriptor makes the feature extraction independent of key-point detection, making it easier to match features across different images.
Redundancies and irrelevant data in Dense SIFT are handled by a proper learning phase based on a nearest neighbour (NN) classifier, which captures changes in appearance better than other classifiers, as geometrically demonstrated by Gu et al (Gu et al, 2011). It also requires no collection of training data and is efficient even with small data samples.
– The object template update process operates only on the descriptors with low dissimilarity, avoiding the update of descriptors belonging to temporarily occluded object regions. The feature pruning scheme does not immediately discard less significant descriptors (i.e. features that exist in both the object and context models). Features are not discarded in the case of missing or ambiguous matching (i.e. a simultaneous match with both an object and a context descriptor) but are kept alive long enough to properly handle drift.

In the light of the above contributions, the proposed approach is able to handle low-contrast, weakly textured or blurred images. This is achieved by avoiding a sample-based training phase, since classification is carried out through distance computation in the feature space. In addition, the use of two NN classifiers specifically dedicated to object and context discrimination increases classification robustness. Finally, drift is handled by not updating temporarily occluded features and by using an incremental feature storing process. On the other hand, this leads to a more complex management of the feature sets, with a slightly heavier computational load. This combination leads to a unified tracking-by-detection algorithm which handles appearance changes, occlusions, background clutter and scale changes with promising performance, as our experiments demonstrate.

The paper is organized as follows: in section 2 we review the state of the art of visual tracking, in section 3 we describe the proposed algorithm, and in section 4 we present a comparison with two of the most recent state-of-the-art methods. Conclusions and future work are finally given in section 5.

2 Related Work

Surveys of the most important tracking methods were published in (Li et al, 2013) and (Yang et al, 2011); comparisons of their performance under different conditions are studied in (Wang et al, 2011), (Salti et al, 2012) and (Wu et al, 2013). Many visual tracking strategies have been published in the literature and they generally have two key components: i) the object representation (appearance model); ii) the inference method used to localize the object in each frame. The object is described by color histograms (Comaniciu et al, 2003) or by appearance representations learned with generative (Lee et al, 2005; Liu et al, 2013; Mei and Ling, 2011; Ross et al, 2008) and discriminative (Avidan, 2007; Collins et al, 2005; Hare et al, 2016; Supancic and Ramanan, 2013) models. The inference methodologies vary from Kalman filtering techniques to those that use multiple cues (Badrinarayanan et al, 2007; Birchfield, 1998;
Du and Piater, 2008; Moreno-Noguer et al, 2008; Park et al, 2012; Perez et al, 2004; Spengler and Schiele, 2003; Stenger et al, 2009) and merge the obtained results with methods like particle filtering (Isard and Blake, 1998), error analysis (Stenger et al, 2009), and Markov chain Monte Carlo schemes (Park et al, 2012). The cited inference methods suffer from drift, even though they have shown good results, and it is not explained how they can recover when the object leaves and re-enters the field of view. Tracking-by-detection approaches (Avidan, 2007; Leibe et al, 2008; Ramanan et al, 2007; Wu and Nevatia, 2007) have risen in popularity as object detection algorithms have achieved significant results (Dalal and Triggs, 2005; Everingham et al, 2010). The advantage of tracking-by-detection lies in the resilience of the underlying appearance representation, which makes handling occlusions easier; however, the objects must carry enough discriminative visual information. Object template updating is required to maintain a complete object model, enabling the tracker to cope with factors that might corrupt the object representation and to support long-term tracking without drifting (Babenko et al, 2011; Grabner et al, 2008; Hare et al, 2016; Kalal et al, 2012; Pernici and DelBimbo, 2014; Supancic and Ramanan, 2013). Template updating generally consists of two steps: (i) tracking, where the detector predicts the new object location; (ii) learning, where the object detections provide training samples to improve the learned object model. A real-time tracker that deals with complex occlusions involving several moving objects simultaneously is described in (Lascio et al, 2013). Our tracker is also closely related to two other recent approaches (Kalal et al, 2012; Pernici and DelBimbo, 2014). The TLD algorithm (Kalal et al, 2012) combines the benefits of tracker- and detector-based approaches: it divides the tracking task into specialized components for tracking, learning and detection, which work concurrently. The ALIEN tracker (Pernici and DelBimbo, 2014) exploits oversampling of local invariant representations to build a robust object/context discriminative classifier.
3 Tracking algorithm

Our tracking algorithm is composed of distinct and independent stages. Initially, a classifier discriminates between object and context. It uses local feature descriptors to capture local object appearance similarity according to descriptor matching; the matched object features then vote for a similarity transformation model. The similarity transformation is combined with a RANSAC-like voting scheme, as described in section 3.4; we estimate it by evaluating descriptor correspondences between the object template and the search window in the current frame. We implement a tracking-by-detection algorithm in which tracking is carried out in both scale and rotation space, using an object-oriented bounding box. Furthermore, the object template is updated as described in section 3.5: in this way the object appearance is not corrupted, which permits tracking without drifting even in very long sequences.
The system block diagram is shown in figure 1. In the following subsections we will discuss how the algorithm discriminates between object and context, the way in which the algorithm estimates the object status and the procedure for the continuous updating of the object/context appearance.
Fig. 1 Block diagram presenting the principal workflow and the functional components of the proposed tracking algorithm
3.1 Dense SIFT as Local Descriptors

SIFT is a local descriptor aimed at extracting local gradient information (Lowe, 1999). In (Lowe, 1999), SIFT is a sparse feature representation composed of two stages: feature detection and feature extraction (description). In this paper we only use the feature extraction component. For every pixel in an image, we divide its neighborhood (e.g. 16x16) into a 4x4 cell array, quantize the orientation into 8 bins in each cell and obtain a 4x4x8=128-dimensional vector as the SIFT representation of that pixel. We call these per-pixel SIFT descriptors, which constitute the densest SIFT representation that can be computed for an image. These dense descriptors (DSIFT) (Vedaldi and Fulkerson, 2010) are roughly equivalent to running SIFT on a dense grid of locations at a fixed scale and orientation. Dense SIFT captures a lot of redundant information in an image, whereas normal SIFT tries to retain only the relevant information.
Introducing redundancies, as in Dense SIFT, is a key advantage in many practical cases: Dense SIFT contains a lot of irrelevant data, but this irrelevancy can be removed in a learning phase. In practice there are many scenarios where a dense feature representation works better, since Dense SIFT provides enough feature samples for the subsequent learning algorithms. With normal SIFT it is not always true that a feature will be re-detected accurately, which affects the robustness of the algorithm, whereas with Dense SIFT there are many alternative redundant cues that can be used once other features are no longer available. Dense SIFT shows better performance than SIFT (Liu et al, 2011); a basic explanation is that a larger set of local image descriptors computed over a dense grid usually provides more information than the corresponding descriptors evaluated at a much sparser set of image points.
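To make the construction above concrete, the following Python/NumPy snippet is a minimal sketch of a per-pixel SIFT-like descriptor computed on a regular grid. It is not the implementation used in our experiments (which relies on the DSIFT routine of VLFeat (Vedaldi and Fulkerson, 2010)); the function name and the step/patch/cells/bins parameters are illustrative, and the sketch omits the Gaussian weighting and trilinear interpolation of the full SIFT descriptor.

```python
import numpy as np

def dense_sift_like(gray, step=2, patch=16, cells=4, bins=8):
    """SIFT-like 128-d descriptors on a regular grid (simplified sketch)."""
    gray = gray.astype(np.float32)
    gy, gx = np.gradient(gray)
    mag = np.hypot(gx, gy)
    ori = np.mod(np.arctan2(gy, gx), 2 * np.pi)            # orientation in [0, 2*pi)
    bin_idx = np.minimum((ori / (2 * np.pi) * bins).astype(int), bins - 1)

    half, cell = patch // 2, patch // cells
    h, w = gray.shape
    locations, descriptors = [], []
    for y in range(half, h - half, step):
        for x in range(half, w - half, step):
            desc = np.zeros((cells, cells, bins), np.float32)
            for cy in range(cells):
                for cx in range(cells):
                    y0, x0 = y - half + cy * cell, x - half + cx * cell
                    b = bin_idx[y0:y0 + cell, x0:x0 + cell]
                    m = mag[y0:y0 + cell, x0:x0 + cell]
                    # accumulate gradient magnitude into 8 orientation bins per cell
                    desc[cy, cx] = np.bincount(b.ravel(), weights=m.ravel(),
                                               minlength=bins)
            d = desc.ravel()
            d /= (np.linalg.norm(d) + 1e-7)                 # L2 normalisation
            locations.append((x, y))
            descriptors.append(d)
    return np.array(locations), np.array(descriptors)       # shapes (n, 2), (n, 128)
```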
3.2 Object/Context appearance model

Let $D(I) = \{(x_1, d_1), \dots, (x_n, d_n)\}$ be the set of key points of an image $I$, where $x_i \in \mathbb{R}^2$ is the 2D coordinate and $d_i \in \mathbb{R}^d$ is the $d$-dimensional descriptor vector of the feature ($d = 128$ in the proposed algorithm). We use $D(W; I)$ to denote the set of key point descriptors of $I$ within the window $W$: $D(W; I) = \{ d \in \mathbb{R}^d \mid (x, d) \in D(I),\ x \in W \}$. Considering an image sequence $(I_{t_0}, W_{t_0}), (I_{t_1}, W_{t_1}), \dots, (I_{t_k}, W_{t_k})$, where $W_{t_i}$ is the candidate window in image $I_{t_i}$, we show how features are matched to the object and to the context respectively. Let $C_{t_k} \subset \mathbb{R}^d$ be the context dynamic appearance model updated up to frame $t_k$ and let $O_{t_k} \subset \mathbb{R}^d$ be the object template model updated up to frame $t_k$. Initialization is done through a user bounding box defining the object of interest. We set $O_{Rect} = D(W_{BB}; I_{t_0})$, where $W_{BB}$ is the bounding box defined by the user, and put the rest of the key point descriptors of the scene into the context model $C_{t_0} = D(AW_{t_0}; I_{t_0})$, where $AW_{t_0}$ is the annular region surrounding the object bounding box ($W_{BB}$). In order to keep the object/context models updated we use the filter $F_\lambda : 2^{\mathbb{R}^d} \times 2^{\mathbb{R}^d} \times 2^{\mathbb{R}^d} \to 2^{\mathbb{R}^d}$ defined in (Gu et al, 2011) on the feature sets:

$$F_\lambda[A, B, C] = \{\, d \in A \mid \|d - NN_B(d)\| < \lambda \,\|d - NN_C(d)\| \,\} \tag{1}$$

where $NN_B(v) = \arg\min_{u \in B} \|u - v\|$ is the nearest neighbour of $v$ in the set $B$. Using equation 1 we add new features to the object model by computing:

$$O_{t_k} \approx O_{t_{k-1}} \cup F_{\lambda_O}[D(W_{t_k}; I_{t_k}), O_{t_{k-1}}, C_{t_{k-1}}] \tag{2}$$

We perform the same operation to add new features to the context model, again using equation 1:

$$C_{t_k} \approx C_{t_{k-1}} \cup F_{\lambda_C}[D(W_{t_k}; I_{t_k}), C_{t_{k-1}}, O_{t_{k-1}}] \tag{3}$$
The underlying concept of $F_\lambda$ is simple: for each descriptor $d \in D(W_{t_k}; I_{t_k})$ we compute a likelihood ratio to establish whether $d$ is closer to the object model descriptor set $O_{t_{k-1}}$ or to the context model descriptor set $C_{t_{k-1}}$. The likelihood-ratio thresholds $\lambda_C$ and $\lambda_O$ are conceptually linked to the familiar matching criterion used in SIFT (Lowe, 1999). In this way we obtain an updating model based on two nearest neighbour classifiers that handles appearance changes and occlusions while avoiding confusion between object and context descriptors.
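A minimal NumPy sketch of the $F_\lambda$ filter of equation 1 and of the model growth of equations 2 and 3 could look as follows. The helper names (nn_dist, f_lambda, update_models) are hypothetical, nearest-neighbour search is brute force, descriptors are stored as rows of 2-D arrays, and the default thresholds mirror the setting $\lambda_O = \lambda_C = 0.6$ used in section 4.

```python
import numpy as np

def nn_dist(queries, model):
    """Distance from each query descriptor to its nearest neighbour in `model`."""
    d2 = ((queries[:, None, :] - model[None, :, :]) ** 2).sum(-1)   # (n, m) pairwise
    return np.sqrt(d2.min(axis=1))

def f_lambda(A, B, C, lam):
    """Eq. (1): keep descriptors of A that are closer to set B than
    lam times their distance to set C (two-NN likelihood-ratio test)."""
    keep = nn_dist(A, B) < lam * nn_dist(A, C)
    return A[keep]

def update_models(window_desc, O_prev, C_prev, lam_O=0.6, lam_C=0.6):
    """Eqs. (2)-(3): grow the object and context models with the filtered
    descriptors extracted from the current search window."""
    O_new = np.vstack([O_prev, f_lambda(window_desc, O_prev, C_prev, lam_O)])
    C_new = np.vstack([C_prev, f_lambda(window_desc, C_prev, O_prev, lam_C)])
    return O_new, C_new
```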
3.3 Matching object features with the Context

In order to separate object and context features, we consider both the object appearance features and their relation to the context (see section 3.2). We use the matching relationship between the search window feature set $D(W_t; I_t)$, the updated transitive context feature set $C_t$ and the updated object feature set $O_t$. The set of ordered pairs of matched descriptors between the search window features and the object template features is given by:

$$M_{O_t} = R_{\lambda_O}[D(W_t; I_t), O_t, C_t] \tag{4}$$

where $M_{O_t}$ contains a set $\{(i_1, j_1), (i_2, j_2), \dots, (i_n, j_m)\}$ of pairs of descriptor indexes and $R_\lambda$ is defined as:

$$R_\lambda[A, B, C] = \{\, (i, j) \mid d_i \in A,\ d_j \in B \ \wedge\ \|d_i - NN_B(d_i)\| < \lambda \,\|d_i - NN_C(d_i)\| \,\} \tag{5}$$

We also estimate the matching relationship between the search window feature set and the transitive context one, as in equation 4, obtaining:

$$M_{C_t} = R_{\lambda_C}[D(W_t; I_t), C_t, O_t] \tag{6}$$

where, as in equation 5, $M_{C_t}$ contains a set $\{(i_1, j_1), (i_2, j_2), \dots, (i_n, j_m)\}$ of pairs of descriptor indexes computed using the operator $R_\lambda$ defined in equation 5. From equations 4 and 6 we obtain two sets of matching pairs, $M_{O_t}$ and $M_{C_t}$ respectively; we then prune the double assignments, i.e. the cases $(i, j) \in M_{O_t} \wedge (i, k) \in M_{C_t}$ in which the same descriptor $d_i \in D(W_t; I_t)$ is assigned both to the object $O_t$ and to the context $C_t$. We create a new set of matching pairs:

$$M_t = \{\, (i, j) \mid (i, j) \in M_{O_t} \ \wedge\ (i, k) \notin M_{C_t} \,\} \tag{7}$$

where equation 7 holds $\forall i \mid d_i \in D(W_t; I_t)$, $\forall j \mid d_j \in O_t$ and $\forall k \mid d_k \in C_t$. This strategy removes the background features which might be included in the object bounding box. In this way the features that exist in both the object and the context feature sets are removed, together with all the other ambiguous features that do not satisfy equation 7.

Figure 2 shows screenshots of the proposed visual tracking algorithm on 4 benchmark sequences. The three bounding boxes plotted in blue, green and yellow contain, respectively, the search area, the object protection area (no updating is made for the descriptors located in this area), and the object.
The green stars and the red crosses show the matches between the descriptors in the search area and, respectively, the template and the context. As can be observed, the matches with the template (the green stars) which contribute to voting the correct similarity transformation (see subsection 3.4) are contained in the yellow bounding box.
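The matching and pruning steps of equations 4–7 can be sketched in the same NumPy style (again with brute-force nearest-neighbour search; the names r_lambda and object_matches are hypothetical and this is an illustrative sketch, not the exact implementation):

```python
import numpy as np

def r_lambda(A, B, C, lam):
    """Eq. (5): ordered pairs (i, j) such that descriptor A[i] matches its
    nearest neighbour B[j] and passes the ratio test against set C."""
    dAB = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
    dAC = np.sqrt(((A[:, None, :] - C[None, :, :]) ** 2).sum(-1))
    j = dAB.argmin(axis=1)
    ok = dAB.min(axis=1) < lam * dAC.min(axis=1)
    return {(int(i), int(j[i])) for i in np.flatnonzero(ok)}

def object_matches(window_desc, O, C, lam_O=0.6, lam_C=0.6):
    """Eqs. (4), (6), (7): match against object and context models, then drop
    every window descriptor that is assigned to both (ambiguous match)."""
    M_O = r_lambda(window_desc, O, C, lam_O)
    M_C = r_lambda(window_desc, C, O, lam_C)
    ambiguous = {i for (i, _) in M_C}
    return {(i, j) for (i, j) in M_O if i not in ambiguous}
```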
Fig. 2 Screenshot of the proposed Visual tracking algorithm with the sequences Carchase, Car, Jumping, Pedestrian3
3.4 Object geometry estimation

In this stage we robustly estimate multiple-view relations from the point correspondences obtained by equation 7. We use MLESAC (Torr and Zisserman, 2000), which is well suited to estimating complex surfaces and general non-linear relationships from point data. These are relations among corresponding image points from the object template and the current search window. The features in $M_t$ obtained from equation 7 are sampled using a voting scheme based on MLESAC (Maximum Likelihood SAC) (Torr and Zisserman, 2000). The feature locations vote for a similarity transformation model instantiated at each iteration from two correspondences:

$$S(D(W_t, I_t)) : \quad \forall\, (x, d) \in D(I_t),\ x \in W_t \ \wedge\ (x, d) \in M_t \tag{8}$$

Voting is performed according to the MLESAC (Maximum Likelihood SAC) loss function which, differently from RANSAC, has a probabilistic formulation
that provides a soft criterion in the error evaluation (Tordoff and Murray, 2002). The minimized error is the following negative log-likelihood:

$$L(e \mid S) = - \sum_{i \in M_t} \log \left( \gamma \, \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{e_i^2}{2\sigma^2} \right) + (1 - \gamma)\,\frac{1}{\nu} \right) \tag{9}$$

where $\sigma$ is the standard deviation of the error on each coordinate, $\gamma$ is the inlier weight parameter and $\nu$ is a constant taking into account the outlier distribution (i.e. the diameter of the search window). Through equation 9 we estimate the relation $S$ that minimizes this negative log-likelihood. The term $e_i$ in equation 9 is the symmetric transfer error for the $i$-th matched feature $d_i$:

$$e_i^2 = d^2(x_i, S^{-1} x_i') + d^2(x_i', S x_i) \tag{10}$$

where $d(\cdot)$ is the Euclidean distance between the matched feature locations $x_i$ and $x_i'$ in the image and in the object template respectively, and $S$ is the similarity matrix representing the estimated transformation model $S(D(W_t, I_t))$.
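A compact sketch of this estimation step is given below: a 2D similarity is instantiated from two correspondences (treating points as complex numbers), scored with the loss of equation 9 using the symmetric transfer error of equation 10, and the lowest-loss hypothesis is kept. The function names are hypothetical, the values of sigma, gamma and nu are illustrative, and gamma is kept fixed here, whereas MLESAC would estimate it (e.g. via expectation-maximization).

```python
import numpy as np

def similarity_from_two(p, q):
    """Similarity (scale, rotation, translation) mapping the two template points
    p[0], p[1] onto the two window points q[0], q[1], as complex numbers:
    q = a * p + b, with a = s * e^{i*theta}."""
    pz, qz = p[:, 0] + 1j * p[:, 1], q[:, 0] + 1j * q[:, 1]
    a = (qz[1] - qz[0]) / (pz[1] - pz[0])      # assumes the two points are distinct
    b = qz[0] - a * pz[0]
    return a, b

def mlesac_loss(a, b, P, Q, sigma=2.0, gamma=0.5, nu=100.0):
    """Eq. (9) with the symmetric transfer error of Eq. (10): negative
    log-likelihood of a Gaussian-inlier / uniform-outlier mixture."""
    Pz, Qz = P[:, 0] + 1j * P[:, 1], Q[:, 0] + 1j * Q[:, 1]
    e2 = np.abs(Qz - (a * Pz + b)) ** 2 + np.abs(Pz - (Qz - b) / a) ** 2
    inlier = gamma * np.exp(-e2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
    return -np.log(inlier + (1 - gamma) / nu).sum()

def fit_similarity_mlesac(P, Q, iters=200, seed=0):
    """MLESAC-like voting over matched template locations P and window locations Q:
    sample two correspondences per iteration and keep the lowest-loss hypothesis."""
    rng = np.random.default_rng(seed)
    best_a, best_b, best_loss = None, None, np.inf
    for _ in range(iters):
        idx = rng.choice(len(P), size=2, replace=False)
        a, b = similarity_from_two(P[idx], Q[idx])
        loss = mlesac_loss(a, b, P, Q)
        if loss < best_loss:
            best_a, best_b, best_loss = a, b, loss
    return best_a, best_b, best_loss
```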
3.5 Object descriptors update

The transformation model $S(\cdot)$ captures object in-plane rotations and scale changes. Appearance changes, as well as self-occlusions and out-of-plane rotations, are handled by mixing information from the object template, the transformation $S(\cdot)$ and the new object observation in the current frame. Visual tracking with a fixed template is not effective over a long time period, as the object appearance may change significantly. To address these issues we gradually update the local feature descriptors of the object template. Since multiple features uniformly distributed over the object bounding box are available, updating a fraction of them in each frame allows the template to adapt to the appearance change and alleviates the tracking drift problem. Once the new target location is determined, the local descriptors are updated as follows:

$$d_i = d_j \quad \text{if} \quad f_1 \cdot d_{old} < \|d_i - d_j\| < f_2 \cdot d_{old}, \qquad \forall (i, j) \in M_t \ \wedge\ x_j \in S(O_{Rect}) \tag{11}$$

where $d_{old}$ is the previous distance value for the feature $x_i$, and $d_i$, $d_j$ are the feature descriptors located respectively in the object template and in the search area contained in the transformed bounding box $S(O_{Rect})$. $f_1$ and $f_2$ are the forgetting factors, which define the appearance template/context updating rate. In general, descriptors with high dissimilarity correspond to an occlusion or to the background for the template representation. In the light of this, only the matched descriptors with low dissimilarity are updated.
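A sketch of the update rule of equation 11 is given below. The index convention follows equation 4 (i indexes the search-window descriptors, j the template descriptors), in_box flags the matched locations falling inside the transformed bounding box S(O_Rect), and the bookkeeping of the per-feature distance dist_old (storing and refreshing it) is our assumption rather than a detail stated in the text.

```python
import numpy as np

def update_object_template(O, window_desc, matches, in_box, dist_old, f1=0.8, f2=1.2):
    """Eq. (11): replace a template descriptor with its matched search-window
    descriptor only when the dissimilarity lies between f1 and f2 times the
    previously stored distance for that feature (the forgetting factors)."""
    for i, j in matches:                       # i: window index, j: template index
        if not in_box[i]:                      # location must fall inside S(O_Rect)
            continue
        d = np.linalg.norm(O[j] - window_desc[i])
        if f1 * dist_old[j] < d < f2 * dist_old[j]:
            O[j] = window_desc[i]              # update the descriptor ...
            dist_old[j] = d                    # ... and refresh its stored distance
    return O, dist_old
```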
Table 1 Proposed algorithm comparative performance analysis (Precision / Recall / F-measure). Bold numbers indicate the best score; underlined numbers the second best.

Sequence        Frames   TLD (Kalal et al, 2012)   ALIEN (Pernici and DelBimbo, 2014)   Our
David             761    1.00 / 1.00 / 1.00        0.99 / 0.98 / 0.99                   1.00 / 0.96 / 0.98
Jumping           313    0.99 / 0.99 / 0.99        0.99 / 0.87 / 0.92                   1.00 / 1.00 / 1.00
Pedestrian 1      140    1.00 / 1.00 / 1.00        1.00 / 1.00 / 1.00                   0.98 / 0.97 / 0.97
Pedestrian 2      338    0.89 / 0.92 / 0.91        0.93 / 0.92 / 0.93                   0.99 / 0.99 / 0.99
Pedestrian 3      184    0.99 / 1.00 / 0.99        1.00 / 0.90 / 0.95                   0.98 / 1.00 / 0.99
Car               945    0.92 / 0.97 / 0.94        0.95 / 1.00 / 0.98                   0.97 / 0.99 / 0.98
Carchase         9928    0.50 / 0.40 / 0.45        0.73 / 0.68 / 0.70                   0.65 / 0.60 / 0.62
Volkswagen       8576    0.54 / 0.64 / 0.59        0.98 / 0.89 / 0.93                   0.90 / 0.88 / 0.89
Mean                -    0.85 / 0.86 / 0.85        0.94 / 0.90 / 0.92                   0.93 / 0.92 / 0.93
4 Experimental Results

This section evaluates the effectiveness and robustness of the proposed visual tracking method. We implemented the algorithm in Matlab and tested it on a PC equipped with an Intel i7 [email protected] GHz and 8GB RAM. We evaluated it using benchmark video sequences which contain many challenging factors, including drastic illumination changes, pose and scale variations, heavy occlusion, background clutter and fast motion. The benchmark contains: i) 4 sequences including occlusions (total occlusions in most cases) (Pedestrian1, Pedestrian2, Pedestrian3, Car) and long time intervals in which the object is not present in the camera view; ii) a sequence (Jumping) with a large amount of blur and sudden changes of speed and motion direction; iii) two very long sequences (Volkswagen, Carchase), well suited to assessing long-term tracking capabilities; iv) a sequence (David) including shadowing, lighting variations and an object with self-occlusions due to pose changes. The six sequences David, Jumping, Pedestrian1, Pedestrian2, Pedestrian3 and Car were used in (Yu et al, 2008), whereas the Volkswagen and Carchase sequences are used in (Kalal et al, 2012) for the assessment of the Predator-TLD tracker.

For the evaluation of the tracking algorithm we use the overlap ratio between the tracker bounding boxes and the ground truth ones: when the overlap ratio is larger than 0.25, the tracking result of the current frame is considered a success. Table 1 compares the proposed tracking algorithm against Predator-TLD (Kalal et al, 2012) and the ALIEN tracker (Pernici and DelBimbo, 2014) using Precision, Recall and F-measure. The parameters of the proposed tracker are kept fixed for all the experiments. Fig. 3 shows the average success rate of the proposed tracker as a function of the values of $\lambda_C$ and $\lambda_O$; the performance has its maximum around 0.6. For this reason the likelihood-ratio thresholds $\lambda_C$ and $\lambda_O$ of the two nearest neighbour classifiers are set to the same value $\lambda_C = \lambda_O = 0.6$, assigning the same importance to both object and context. We empirically set the forgetting factors $f_1$ and $f_2$ to 0.8 and 1.2 respectively, in order to update only the regions with medium dissimilarity. On average, with this setting, for a target object with 2000 descriptors
Fig. 3 Average success rate for different values of the likelihood ratio threshold λC and λO (both fixed to the same value)
about 60 of them are updated in each frame. The maximum number of features for the object and context classifiers depends strictly on the target object area and on the density of the SIFT descriptors (sampled every 2 pixels in rows and columns). When the number of features exceeds the allowed one, features are removed according to a pruning scheme that eliminates the oldest non-updated ones. The maximum number of frames in which the algorithm may fail to detect the object before a random search starts is fixed to NU = 12.

It can be observed that the proposed tracker, despite its simplicity, yields excellent performance, often close to or even better than state-of-the-art methods. More specifically, our tracker achieves the best results in the sequences Jumping, Pedestrian2, Pedestrian3 and Car, and the second best result in the sequences Pedestrian1, Carchase and Volkswagen. The performance shows that our tracking framework can manage heavy occlusions, background clutter and object appearance changes. The results also highlight that the proposed tracker outperforms Predator-TLD (Kalal et al, 2012) in long sequences such as Carchase and Volkswagen, and is very close to the ALIEN tracker (Pernici and DelBimbo, 2014). The current algorithm implementation, without any software optimization and with the parameter settings explained above, runs at 4 FPS on 320x240 frames. Table 2 summarizes the computational cost of the proposed algorithm compared with the ALIEN tracker (Pernici and DelBimbo, 2014) and Predator-TLD (Kalal et al, 2012).
Fig. 4 Heavy occlusion conditions (Car video): the proposed algorithm correctly tracks the object even when it is almost fully occluded by trees.
It can be noticed that our algorithm is completely implemented in Matlab with no optimization, and its computational cost is higher than that of the other methodologies, which are optimized and partially or entirely implemented in C++ or OpenCV. It is still far from frame-rate performance, but we are confident that with some implementation improvements the algorithm can reach real-time performance.
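For reference, the per-frame success criterion used to build Table 1 can be sketched as below, assuming that the overlap ratio is the usual intersection-over-union of axis-aligned boxes (the oriented boxes produced by the tracker would require a polygon intersection); the function names are illustrative.

```python
def overlap_ratio(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def frame_success(tracked_box, gt_box, threshold=0.25):
    """A frame counts as a tracking success when the overlap exceeds 0.25."""
    return overlap_ratio(tracked_box, gt_box) > threshold
```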
4.1 Occlusion and drift: qualitative results

In figures 4 and 5 we report some screenshots of the car and carchase sequences under conditions that might hinder tracking, together with the way in which the proposed tracker responds.
Fig. 5 Full occlusion conditions (Carchase video): the proposed algorithm tracks the object until it is fully occluded (car under the bridge) and immediately recovers tracking when it re-appears.
In both sequences we indicate the bounding box, the matched features (between the object template and the candidate region), and the extracted search window. Figure 4 shows the portion of the car sequence where a car enters and leaves a zone occluded by trees. The high capability of the proposed tracker to capture the locality of the object appearance, and its robustness against object occlusion, can be observed. Dense SIFT densely samples a great number of regions: the relative spatial information between regions is maintained and thus occlusions are handled well. Tracking drift generally occurs when the target object is heavily occluded; our tracker recovers from it by learning the appearance change. The carchase sequence is challenging as the target is fully occluded in several frames. Figure 5 shows how the proposed algorithm is able, even in highly challenging illumination conditions, to relocate the target after a full occlusion has occurred. As only some regions (the non-occluded ones) are updated at any time instant, the tracking drift problem can be better handled when partial or heavy occlusion occurs.
5 Conclusion and Future Improvements

In this paper we presented a visual tracking method based on dense SIFT descriptors and a nearest neighbour learning algorithm built on template/context matching.
Table 2 Processing time comparison

Algorithm                             Hardware             Implementation   Image Resolution   Processed FPS
TLD (Kalal et al, 2012)               unknown              Matlab/C++       unknown            up to 20
ALIEN (Pernici and DelBimbo, 2014)    Intel [email protected]     Matlab/OpenCV    320x240            11
Our                                   Intel [email protected]     Matlab           320x240            4
The nearest-neighbour classification of dense local features, combined with the contextual descriptor matching, yields a simple and efficient tracking-by-detection algorithm. It handles occlusions, clutter, and significant changes in orientation, scale and appearance. In order to avoid object template contamination, we update its descriptors only when the object is not occluded. Qualitative results showed competitive performance in terms of stability and plasticity, and the experimental results show that the proposed tracker achieves very encouraging performance compared to leading state-of-the-art methods. Immediate future work will include the optimization of the algorithm to reach true real-time performance.
References Avidan S (2007) Ensemble tracking. IEEE Trans Pattern Anal Mach Intell 29(2):261–271, DOI 10.1109/TPAMI.2007.35, http://dx.doi.org/10. 1109/TPAMI.2007.35 Babenko B, Yang MH, Belongie S (2011) Robust object tracking with online multiple instance learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(8):1619–1632, DOI 10.1109/TPAMI.2010.226 Badrinarayanan V, Perez P, Clerc FL, Oisel L (2007) Probabilistic color and adaptive multi-feature tracking with dynamically switched priority between cues. In: 2007 IEEE 11th International Conference on Computer Vision, pp 1–8, DOI 10.1109/ICCV.2007.4408955 Birchfield S (1998) Elliptical head tracking using intensity gradients and color histograms. In: Proceedings. 1998 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No.98CB36231), pp 232– 237, DOI 10.1109/CVPR.1998.698614 Collins RT, Liu Y, Leordeanu M (2005) Online selection of discriminative tracking features. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(10):1631–1643, DOI 10.1109/TPAMI.2005.205 Comaniciu D, Ramesh V, Meer P (2003) Kernel-based object tracking. IEEE Trans Pattern Anal Mach Intell 25(5):564–575, DOI 10.1109/TPAMI.2003. 1195991, http://dx.doi.org/10.1109/TPAMI.2003.1195991 Crispim CF, Bathrinarayanan V, Fosty B, Konig A, Romdhane R, Thonnat M, Br´emond F (2013) Evaluation of a monitoring system for event recognition of older people. In: 10th IEEE International Conference on Advanced Video
and Signal Based Surveillance, AVSS 2013, Krakow, Poland, August 27-30, 2013, pp 165–170, DOI 10.1109/AVSS.2013.6636634, http://dx.doi.org/ 10.1109/AVSS.2013.6636634 Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol 1, pp 886–893 vol. 1, DOI 10.1109/CVPR.2005.177 Du W, Piater J (2008) A Probabilistic Approach to Integrating Multiple Cues in Visual Tracking, Springer Berlin Heidelberg, Berlin, Heidelberg, pp 225–238. DOI 10.1007/978-3-540-88688-4 17, http://dx.doi.org/10. 1007/978-3-540-88688-4_17 Everingham M, Van Gool L, Williams CKI, Winn J, Zisserman A (2010) The pascal visual object classes (voc) challenge. International Journal of Computer Vision 88(2):303–338, DOI 10.1007/s11263-009-0275-4, http: //dx.doi.org/10.1007/s11263-009-0275-4 Ga¨ uz`ere B, Greco C, Ritrovato P, Saggese A, Vento M (2015) Semantic Web Technologies for Object Tracking and Video Analytics, Springer International Publishing, Cham, pp 574–585. DOI 10.1007/978-3-319-27863-6 53, http://dx.doi.org/10.1007/978-3-319-27863-6_53 Grabner H, Leistner C, Bischof H (2008) Semi-supervised on-line boosting for robust tracking. In: Proceedings of the 10th European Conference on Computer Vision: Part I, Springer-Verlag, Berlin, Heidelberg, ECCV ’08, pp 234–247, DOI 10.1007/978-3-540-88682-2 19, http://dx.doi.org/10. 1007/978-3-540-88682-2_19 Gu S, Zheng Y, Tomasi C (2011) Efficient visual object tracking with online nearest neighbor classifier. In: Proceedings of the 10th Asian Conference on Computer Vision - Volume Part I, Springer-Verlag, Berlin, Heidelberg, ACCV’10, pp 271–282, http://dl.acm.org/citation.cfm?id=1964320. 1964348 Hare S, Golodetz S, Saffari A, Vineet V, Cheng MM, Hicks SL, Torr PHS (2016) Struck: Structured output tracking with kernels. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(10):2096–2109, DOI 10. 1109/TPAMI.2015.2509974 Isard M, Blake A (1998) Icondensation: Unifying low-level and high-level tracking in a stochastic framework, Springer Berlin Heidelberg, Berlin, Heidelberg, pp 893–908. DOI 10.1007/BFb0055711, http://dx.doi.org/10. 1007/BFb0055711 Jepson AD, Fleet DJ, El-Maraghi TF (2003) Robust online appearance models for visual tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(10):1296–1311, DOI 10.1109/TPAMI.2003.1233903 Kalal Z, Mikolajczyk K, Matas J (2012) Tracking-learning-detection. IEEE Trans Pattern Anal Mach Intell 34(7):1409–1422, DOI 10.1109/TPAMI. 2011.239, http://dx.doi.org/10.1109/TPAMI.2011.239 Lascio RD, Foggia P, Percannella G, Saggese A, Vento M (2013) A real time algorithm for people tracking using contextual reasoning. Computer Vision and Image Understanding 117(8):892 – 908, DOI http://
dx.doi.org/10.1016/j.cviu.2013.04.004, http://www.sciencedirect.com/ science/article/pii/S1077314213000908 Lee KC, Ho J, Yang MH, Kriegman D (2005) Visual tracking and recognition using probabilistic appearance manifolds. Comput Vis Image Underst 99(3):303–331, DOI 10.1016/j.cviu.2005.02.002, http://dx.doi.org/ 10.1016/j.cviu.2005.02.002 Leibe B, Schindler K, Cornelis N, Van Gool L (2008) Coupled object detection and tracking from static cameras and moving vehicles. IEEE Trans Pattern Anal Mach Intell 30(10):1683–1698, DOI 10.1109/TPAMI.2008.170, http: //dx.doi.org/10.1109/TPAMI.2008.170 Li X, Hu W, Shen C, Zhang Z, Dick A, Hengel AVD (2013) A survey of appearance models in visual object tracking. ACM Trans Intell Syst Technol 4(4):58:1–58:48, DOI 10.1145/2508037.2508039, http://doi.acm.org/10. 1145/2508037.2508039 Liu B, Huang J, Kulikowski C, Yang L (2013) Robust visual tracking using local sparse appearance model and k-selection. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(12):2968–2981, DOI 10.1109/TPAMI.2012.215 Liu C, Yuen J, Torralba A (2011) Sift flow: Dense correspondence across scenes and its applications. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(5):978–994, DOI 10.1109/TPAMI.2010.147, http: //dx.doi.org/10.1109/TPAMI.2010.147 Lowe DG (1999) Object recognition from local scale-invariant features. In: Proceedings of the Seventh IEEE International Conference on Computer Vision, vol 2, pp 1150–1157 vol.2, DOI 10.1109/ICCV.1999.790410 Matthews I, Ishikawa T, Baker S (2004) The template update problem. IEEE Trans Pattern Anal Mach Intell 26(6):810–815, DOI 10.1109/TPAMI.2004. 16, http://dx.doi.org/10.1109/TPAMI.2004.16 Mei X, Ling H (2011) Robust visual tracking and vehicle classification via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(11):2259–2272, DOI 10.1109/TPAMI.2011.66, http://dx. doi.org/10.1109/TPAMI.2011.66 Moreno-Noguer F, Sanfeliu A, Samaras D (2008) Dependent multiple cue integration for robust tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(4):670–685, DOI 10.1109/TPAMI.2007.70727 Pang Y, Ling H (2013) Finding the best from the second bests - inhibiting subjective bias in evaluation of visual tracking algorithms. In: 2013 IEEE International Conference on Computer Vision, pp 2784–2791, DOI 10.1109/ ICCV.2013.346 Park DW, Kwon J, Lee KM (2012) Robust visual tracking using autoregressive hidden markov model. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp 1964–1971, DOI 10.1109/CVPR.2012.6247898 Perez P, Vermaak J, Blake A (2004) Data fusion for visual tracking with particles. In: Proceedings of the IEEE, vol 92, pp 495–513, DOI 10.1109/ JPROC.2003.823147
Pernici F, DelBimbo A (2014) Object tracking by oversampling local features. IEEE Transactions on Pattern Analysis and Machine Intelligence 36(12):2538–2551, DOI 10.1109/TPAMI.2013.250, http://dx.doi.org/ 10.1109/TPAMI.2013.250 Ramanan D, Forsyth DA, Zisserman A (2007) Tracking people by learning their appearance. IEEE Trans Pattern Anal Mach Intell 29(1):65–81, DOI 10.1109/TPAMI.2007.22, http://dx.doi.org/10.1109/TPAMI.2007.22 Ross DA, Lim J, Lin RS, Yang MH (2008) Incremental learning for robust visual tracking. International Journal of Computer Vision 77(1):125–141, DOI 10.1007/s11263-007-0075-7, http://dx.doi.org/10. 1007/s11263-007-0075-7 Salti S, Cavallaro A, Stefano LD (2012) Adaptive appearance modeling for video tracking: Survey and evaluation. IEEE Transactions on Image Processing 21(10):4334–4348, DOI 10.1109/TIP.2012.2206035 Song S, Xiao J (2013) Tracking revisited using rgbd camera: Unified benchmark and baselines. In: Proceedings of the 2013 IEEE International Conference on Computer Vision, IEEE Computer Society, Washington, DC, USA, ICCV ’13, pp 233–240, DOI 10.1109/ICCV.2013.36, http://dx.doi.org/ 10.1109/ICCV.2013.36 Spengler M, Schiele B (2003) Towards robust multi-cue integration for visual tracking. Machine Vision and Applications 14(1):50–58, DOI 10.1007/ s00138-002-0095-9, http://dx.doi.org/10.1007/s00138-002-0095-9 Stenger B, Woodley T, Cipolla R (2009) Learning to track with multiple observers. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp 2647–2654, DOI 10.1109/CVPR.2009.5206634 Supancic JS, Ramanan D (2013) Self-paced learning for long-term tracking. In: Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, IEEE Computer Society, Washington, DC, USA, CVPR ’13, pp 2379–2386, DOI 10.1109/CVPR.2013.308, http://dx.doi.org/10.1109/ CVPR.2013.308 Tordoff B, Murray DW (2002) Guided Sampling and Consensus for Motion Estimation, Springer Berlin Heidelberg, Berlin, Heidelberg, pp 82–96. DOI 10. 1007/3-540-47969-4 6, http://dx.doi.org/10.1007/3-540-47969-4_6 Torr P, Zisserman A (2000) Mlesac: A new robust estimator with application to estimating image geometry. Computer Vision and Image Understanding 78(1):138 – 156, DOI http://dx.doi.org/10.1006/cviu.1999.0832, http:// www.sciencedirect.com/science/article/pii/S1077314299908329 Vedaldi A, Fulkerson B (2010) Vlfeat: An open and portable library of computer vision algorithms. In: Proceedings of the 18th ACM International Conference on Multimedia, ACM, New York, NY, USA, MM ’10, pp 1469–1472, DOI 10.1145/1873951.1874249, http://doi.acm.org/10.1145/1873951. 1874249 Wang Q, Chen F, Xu W, Yang MH (2011) An experimental comparison of online object-tracking algorithms. vol 8138, pp 81,381A–81,381A–11, DOI 10.1117/12.895965, http://dx.doi.org/10.1117/12.895965
Wu B, Nevatia R (2007) Detection and tracking of multiple, partially occluded humans by bayesian combination of edgelet based part detectors. International Journal of Computer Vision 75(2):247–266, DOI 10.1007/ s11263-006-0027-7, http://dx.doi.org/10.1007/s11263-006-0027-7 Wu Y, Lim J, Yang MH (2013) Online object tracking: A benchmark. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition, pp 2411–2418, DOI 10.1109/CVPR.2013.312 Yang H, Shao L, Zheng F, Wang L, Song Z (2011) Recent advances and trends in visual tracking: A review. Neurocomputing 74(18):3823 – 3831, DOI http: //dx.doi.org/10.1016/j.neucom.2011.07.024, http://www.sciencedirect. com/science/article/pii/S0925231211004668 Yilmaz A, Javed O, Shah M (2006) Object tracking: A survey. ACM Comput Surv 38(4), DOI 10.1145/1177352.1177355, http://doi.acm.org/10. 1145/1177352.1177355 Yu Q, Dinh TB, Medioni G (2008) Online tracking and reacquisition using cotrained generative and discriminative trackers. In: Proceedings of the 10th European Conference on Computer Vision: Part II, Springer-Verlag, Berlin, Heidelberg, ECCV ’08, pp 678–691, DOI 10.1007/978-3-540-88688-4 50, http://dx.doi.org/10.1007/978-3-540-88688-4_50