Master's Thesis

Robust Object Tracking using Semi-Supervised Online Boosting

Martin Godec

Graz University of Technology
Erzherzog-Johann-Universität
Institute for Computer Graphics and Vision

Supervisor: Univ.-Prof. Dipl.-Ing. Dr.techn. Horst Bischof
Advisor: Dipl.-Ing. Dr.techn. Helmut Grabner

Graz, October 2008
Abstract

This work presents a detailed analysis and discussion of a new object tracking method using semi-supervised on-line boosting¹. In order to avoid the drifting problem, which presents a challenge to adaptive tracking systems, the new approach incorporates prior knowledge of the tracked object into the tracking process via semi-supervised learning. This makes it possible to distinguish between actual appearance changes of the object and changes that arise from erroneous measurements. Experiments show that the new method is very robust to misaligned object model updates and, moreover, performs at a level comparable to equivalent state-of-the-art object tracking methods. Within this thesis, the presented approach is analysed and optimised, as well as extended to enable real-time performance.
¹ Semi-Supervised On-line Boosting for Robust Tracking, Grabner et al. [20]
Kurzfassung

This thesis presents a detailed analysis and interpretation of a new object tracking method², which is based on semi-supervised on-line boosting. The approach promises to be more robust against so-called drifting, which poses a major challenge in the field of adaptive tracking methods. By incorporating prior knowledge into the learning process via semi-supervised learning, an attempt is made to distinguish between permitted and erroneous adaptations of the object model to the current object. Various experiments show that this approach is able to compete with current, equivalent tracking methods and additionally lives up to the claim of being more robust against erroneous adaptations. In the course of this thesis, the described method was analysed and optimised, as well as extended to enable real-time capability.
² Semi-Supervised On-line Boosting for Robust Tracking, Grabner et al. [20]
Acknowledgements I want to thank Horst Bischof, Helmut Grabner, and Christian Leistner from the Institute for Computer Graphics and Vision for their excellent mentoring and support in technical and administrative issues, Thomas Mauthner for his knowledge and support on particle filters, and Sabine Sternig and Paul Wohlhart for many very imaginative discussions. I also want to thank my lovely family for supporting me over my years of study and my amiable girlfriend, for their patience and endurance of my intermittent confusion.
This work has been supported by the FFG project EVis (813399) under the FIT-IT program.
Graz, in October 2008
Martin Godec
Contents

1 Introduction  1
  1.1 Motivation  1
  1.2 Object Tracking  3
    1.2.1 Challenges  3
    1.2.2 Methods  3
  1.3 Related Work  5
  1.4 Structure of this Thesis  8

2 Boosting and Particle Filtering  9
  2.1 Boosting  9
    2.1.1 Weak Classifiers  11
    2.1.2 Feature Selection via On-line Boosting  11
  2.2 Semi-supervised Learning  15
  2.3 Semi-supervised On-line Boosting for Feature Selection  15
  2.4 Image Features  20
  2.5 Particle Filter  23

3 Robust Object Tracking  26
  3.1 The Tracking Loop  26
  3.2 System Overview  27
    3.2.1 Object Detection  27
    3.2.2 Object Modelling  29
    3.2.3 Location Estimation  29
    3.2.4 Training and Updating the Object Model  30
    3.2.5 Refining of the Confidence Map  30
  3.3 Update Rules for On-line Boosting  31
  3.4 Visual one-shot Learning  32
  3.5 Particle Filter  33

4 Experiments and Results  36
  4.1 Evaluation Data  36
  4.2 Performance Measures  37
  4.3 Parameter Settings  39
  4.4 Parameter Optimisation  40
  4.5 Comparison of Tracking Systems  40
  4.6 Long-Term Drifting Experiment  42
  4.7 Stability to misaligned updates  44
  4.8 Evaluation of Update Rules  45
  4.9 One-shot Training Methods  47
  4.10 Hot-swap of Prior Knowledge  49
  4.11 Particle Filter Speed-up  51

5 Conclusion  53
  5.1 Future Work  53
    5.1.1 Image Features  53
    5.1.2 Boosting Algorithm  54
    5.1.3 Extended Search Space  54
    5.1.4 Multi-view Tracking  54
    5.1.5 Speed Optimisation  54

Bibliography  56
Chapter 1

Introduction

1.1 Motivation
Object tracking is a major task in many computer vision applications and has been well studied in recent decades. Still, there are many problems and tasks for which no perfect solution has been found yet. For instance, video surveillance is a very popular field of application for object tracking. From traffic cameras to monitored public spaces, vast amounts of video material are captured every day and have to be processed. The CCTV User Group (http://www.cctvusergroup.com, 01.10.2008) estimates that there are probably about 1.5 million public space surveillance cameras in the UK, covering well over 1000 UK towns. In order to process this flood of information, partly- and fully-automated analysis techniques have to be developed. To build a robust tracking system, some requirements must be considered.

Robustness: Robustness means that the tracking mechanism should be able to detect and follow the object even under difficult conditions. These difficulties may be cluttered background, changing illumination conditions, occlusions or many other natural occurrences that influence the tracking result.

Adaptivity: In addition to changes in the environment an object is located in, the object itself also undergoes changes. This requires a steady adaptation of the tracking system to the actual object appearance.

Real-time Processing: A system that has to cope with live video streams must be able to process one frame of these videos in a very short time. Thus, a fast and optimised implementation as well as the selection of high-performance algorithms is required. The processing speed depends on the speed of movement of the observed object, but to achieve a
fluent video impression for the human eye, a frame rate of at least 15 frames per second has to be established.

But how does object tracking work? First, similar to humans, the detection process needs a description of the object which should be tracked. This can, for example, be a template image of the object, a shape, texture or color model, or something similar. Building such an initial object model is a very critical and hard task, because the quality of the model directly relates to the quality of the tracking process. Additionally, such a description is not always available to the tracking application before runtime, and thus it may have to be built up during runtime. Even with a good object model available a priori or established after processing some video frames, adaptivity to appearance changes is necessary to achieve robustness. These changes can arise from small rotations or geometric transformations of the object, but also from changing texture, such as an exchanged piece of clothing. To handle such variances, the object model must be adjusted to the new circumstances from time to time. The major difficulty in building such a tracking system is constraining the adaptivity so as to distinguish between actual changes of the object and changes that result from errors. This problem is called the Drifting Problem [33]. One solution which tries to tackle this problem is Semi-Supervised On-line Boosting for Robust Tracking, recently proposed by Grabner et al. [20]. This technique incorporates prior information to prevent drifting as described above. The intention of this thesis is the evaluation and enhancement of this approach to identify the possibilities and limits of this object tracking method.
1.2 Object Tracking
This section gives a short introduction to object tracking, its challenges (see Figure 1.1) and different approaches with their pros and cons.
1.2.1 Challenges
Motion: As motion is the main point of an object tracking application, it should be clear that the motion of an object can be fast and abrupt, and that variations in the scale of the object also belong to this part of the requirements.

Geometric Transformations: Since natural objects can move freely in their environment, they can appear slightly transformed, such as rotated in or out of plane, in the images that are presented to the tracker.

Cluttered Background: The detection algorithm should be able to distinguish between the object foreground and the background. Thus, the composition of the background can cause additional difficulties.

Occlusion: Parts of the object or the whole object can be occluded by the background or even by itself. This self-occlusion often appears when tracking 3D objects like humans.

Appearance changes: It is possible that the object changes its appearance slightly during the runtime of the application. This must be handled by the detection algorithm by adjusting the object model.

Illumination: As the illumination conditions of the object and the scene can vary, this can also affect the detection process.

Similar objects: Another requirement is the robustness of the detection in distinguishing between the original object and similar looking ones. This specification is strongly related to the robustness to cluttered background.
1.2.2 Methods
Object tracking approaches can be separated into three main groups (taken from [55]):

Point Tracking: An object can be represented by a number of points, whose correspondence over consecutive frames is tracked. The points are combined in a kind of model of the object, and the correspondences are evaluated over a number of constraints, such as motion models.
Figure 1.1: Tracking challenges: (a) motion, (b) geometric transformation, (c) cluttered background, (d) occlusion, (e) appearance change, (f) illumination
Examples of point tracking are the Kalman filter [8] and Greedy Optimal Assignment (GOA) [53].

Kernel Tracking: This group of object tracking methods computes the motion of an object in order to track it from one frame to the next. The appearance models can be separated into template-based and density-based models. Popular examples of this method are Template-Tracking [21], Meanshift-Tracking [13], KLT-Tracking [48] and SV-Tracking [1]. The proposed approach can also be assigned to this group, even though there are wide differences to the other mentioned methods.

Silhouette Tracking: The object is tracked via estimation of the object region in each frame. This can be done by shape matching or contour evolution. Recent approaches in this group are [56] for contour evolution based methods and [24] for shape matching.

Each group and each method has its advantages and its special field of application where it can demonstrate its strength. Kernel based tracking methods are very popular due to their simplicity and computational efficiency. Since the described approach uses tracking by detection with an adaptive object model, it fits perfectly into this group.
1.3 Related Work
This section gives an overview and short descriptions of different kernel tracking methods. Certainly, this list is not exhaustive and only covers popular approaches and methods strongly related to the proposed one. The group of kernel tracking algorithms can be separated again into two types using different appearance models of the object, namely density-based and template-based.

Density-based: The object is modelled with one or more probability density functions, such as Gaussians, mixtures of Gaussians, Parzen windows or histograms, that describe the probability of the object appearance.

Template-based: This method uses templates of the object to calculate its appearance probability. Templates have the advantage of including spatial information about the object, but in their original form they can only model a single view of the object. Instead of directly using a template image cropped from the original image as a model, other representations (e.g., color histograms) can also be used.
Figure 1.2: Overview of object tracking methods, focused on kernel tracking
The second distinctive feature of kernel tracking methods is the way of finding the object in the actual image. A very simple approach is the brute-force method, where the image is organised as a set of overlapping patches that are evaluated one by one to find the best-matching object position. This method can also cover transformations such as rotation or scaling, although this enlarges the search space. An optimised algorithm to avoid the brute-force search has been proposed by Schweitzer et al. [47]. Another method of locating the object is the use of the mean-shift algorithm [11] for iteratively maximising the appearance similarity between the image location and the object model. This method calculates the mean-shift vector based on the density gradient and gradually iterates until convergence is achieved. Comaniciu et al. [12] proposed a mean-shift tracker based on
color distribution and scale adaptation; an adapted version called CAMShift (Continuously Adaptive Mean Shift) [6] is used in the popular OpenCV framework (http://sourceforge.net/projects/opencvlibrary/, 01.10.2008). A main drawback of standard mean-shift tracking is that parts of the object must be located within the object area of the previous frame for the method to converge. This problem has been addressed by Porikli and Tuzel [43], who use multiple kernels to extend mean-shift tracking to low frame-rate videos. Using optical flow to improve the search space of a tracking algorithm is another popular method. Tomasi and Kanade [51] used this technique to calculate the translation of a region around an interest point in their popular KLT-Tracker (see [34] for a survey on local descriptors and interest points). Other methods of estimating the object motion are the Kalman filter [23] or the particle filter [32]. Recently, Chen et al. [10] used a Kalman filter for predicting the location of occluded objects while using the mean-shift algorithm to refine the result of the Kalman filter when the object is visible. While the Kalman filter assumes Gaussian noise, particle filters are designed to estimate non-linear processes. Li et al. [28] used a cascaded particle filter for location estimation in low frame-rate videos.

A common way of solving classification problems in computer science is to apply machine learning algorithms. Since object tracking needs to distinguish an object from the background, this is an excellent field of application for machine learning methods such as support vector machines, boosting, neural networks and many others. A few examples of approaches using machine learning are presented here. In [1], a combination of pyramids of support vector machines and optical flow information is used for vehicle tracking. Support vector machines have been successfully used before, e.g., for face detection [39]. Bevilacqua et al. [3] use self-organising maps to implement a vehicle tracking system that claims to be robust against occlusion. The self-organising map is in this case used for clustering features according to their speed and not for the object recognition itself. Enhancing the result of a tracking algorithm by combining a number of weak learning algorithms is a very common method. Avidan [2] uses AdaBoost [16], a popular machine learning algorithm, in combination with the previously mentioned mean-shift algorithm. Another method, using on-line boosting for feature selection to adaptively track an object, has been proposed by Grabner et al. [19] and serves as the basis for this thesis.

Since such (supervised) machine learning strategies need labeled data for training the object model, the quality of the tracking system depends on the number and quality of labeled examples. However, the production of labeled examples always requires human interaction and is therefore expensive. Unlabeled data, on the other hand, is quite easy to get. Thus, it would
make sense to combine labeled and unlabeled data to, on the one hand, enlarge the training set and, on the other hand, achieve better results. This is called semi-supervised learning and has received increasing attention in the field of computer vision in recent years. Mallapragada et al. [31] introduced semi-supervised learning in the field of boosting; Leistner et al. [25] adapted this work to be based on similarity learning. Building on this, Grabner et al. [20] proposed a tracking framework that uses the semi-supervised approach to include prior information in the tracking process.

The use of prior information is driven by the drifting problem [33]. Since the adaptive object model is updated continuously by a patch corresponding to the current object location, which is determined by the current object model, this mechanism is sensitive to misalignments in the object location. These misalignments may come from spatial sub-pixel shifts, which cannot be captured at the used image resolution, from false detections, or from wrong estimations of the object movement. These errors lead the current object model to adapt itself to a misaligned object location and to drift away from the original object. Many approaches have been proposed to tackle this problem, all based on constraints within the adaptation process of the object model; only some of them, related to this work, are touched upon here. With an off-line learned face detector, Minyoung et al. [35] introduce generic facial pose and crop alignment constraints, which are used to drop misaligned detections and avoid drifted updates. In the work proposed by Jepson et al. [22], features are separated into slowly changing ones, for stable object regions, and fast changing ones, like parts that follow gestures or facial expressions. Such a measurement of the steadiness of a feature can be included in the detection process by preferring more stable ones. A very recent approach for robust tracking of objects is the observe-and-explain paradigm introduced by Ryoo and Aggarwal [44], where an object is tracked by multiple hypotheses through a scene. After retrieving enough information, the system chooses the hypothesis path with the highest probability, which enables the tracking of even fully occluded objects at the cost of higher computational effort.

Finally, it should be mentioned that the construction of prior knowledge is a crucial part of such an application, especially when tracking previously unknown objects, where only the first frame can be used for initialisation. Addressing this problem, Levi and Weiss [26] focused on the selection of features for learning tasks with a small number of examples. Shi and Tomasi [48] had also focused on the selection of features before. Another method of increasing the number of training examples would be the creation of virtual samples [36].
1.4 Structure of this Thesis
After having introduced the topic of this thesis and related approaches, Chapter 2 gives an overview of the techniques used in the described approach. Some information on the evaluation of a tracking system can also be found there. Chapter 3 then covers the structure of the implemented tracking system and details on algorithms and functionality. Detailed information on the accomplished experiments and results can be found in Chapter 4. The last chapter includes final conclusions, further considerations and future work.
Chapter 2

Boosting and Particle Filtering

This chapter describes the techniques that are used to establish a robust tracking system. First, a short introduction to boosting is given, which is then refined to semi-supervised boosting for feature selection. The following section gives a short overview of image features, especially the Haar-like features that are used in the application. Subsequently, particle filtering is described with a focus on location estimation for object tracking.
2.1 Boosting
Applying machine learning algorithms to computer vision problems is a common approach. These machine learning techniques require a training set $X$, consisting of the input examples $x_i \in X$ and the desired output value $y_i$ for each of these examples. Out of this data, a hypothesis $H$ is built, assigning an output value or label $y_i$ to each example $x_i$. The hypothesis tries to minimise a certain optimisation criterion, such as the error rate the hypothesis achieves on the training set. Based on the processing order of the training data, machine learning algorithms can be separated into off-line and on-line algorithms. While off-line algorithms have instant access to all training data and are focused on achieving the best possible solution on this data set, on-line algorithms receive the training data piecewise and have to adapt themselves steadily to achieve good results. Ensemble learning [14] is a special branch of machine learning; it combines a number of classifiers $h$ into one strong classifier. Boosting is a special technique of ensemble learning and uses a combination of weak classifiers. A weak classifier can be described as a decision rule that only has to perform slightly better than random guessing. Resulting from this simple requirement, simple and computationally fast learning algorithms are preferred for
weak classifiers. The combination of such (simple) rules can be compared with the natural human behaviour of basing decisions on more than one opinion. Besides boosting, bagging [7] and random forests [5] are other types of ensemble learning; support vector machines [52] are another widely used machine learning method. To understand the formulations of boosting, some terms have to be introduced:

Weak Classifier $h_n$: a weak classifier is able to come to a decision that is correct in more than half of all cases, i.e., it has an error rate of less than 50%. As mentioned, simple and fast algorithms are preferred due to this weak requirement, but it is also possible to use sophisticated and/or complicated learning algorithms.

Strong Classifier $H$: a strong classifier in the domain of boosting is a classifier that is built up from a weighted combination of weak classifiers and achieves a high accuracy on the decision task. The weights $\alpha_n$ are obtained from the error rate of each weak classifier $h_n$. The better the accuracy of the weak classifiers, the lower the number of weak classifiers needed to achieve a certain accuracy of the strong classifier. The weak classifiers are combined in a weighted manner by

$$H(x) = \operatorname{sign}(\operatorname{conf}(x)) \qquad (2.1)$$
$$\operatorname{conf}(x) = \sum_{n=1}^{N} \alpha_n \cdot h_n^{weak}(x). \qquad (2.2)$$
Since this is used for binary classification, the resulting class label is $H(x) = y \in \{-1, 1\}$. The direct result of the weighted combination, $\operatorname{conf}(x)$, can be interpreted as the confidence of the classifier in its decision.

A very popular boosting algorithm is AdaBoost (Adaptive Boosting), as described by Freund and Schapire [16]. AdaBoost was the first practically usable boosting algorithm and was also applied to computer vision (e.g., for optical character recognition). The design of the algorithm consequently focuses on difficult examples that cannot be classified correctly. Therefore, a weight is assigned to each training example at the beginning, and this weight is initialised to $\frac{1}{|X|}$, with $X$ as the set of training examples¹. After training the first weak classifier with all training examples $X$, the weight of the misclassified samples is increased and the weight of correctly classified examples is decreased. The next classifier then automatically focuses on examples that have not been classified correctly by the previous classifier(s) (see Figure 2.1). This procedure is repeated until a maximum number of classifiers or a minimal error rate is reached.

¹ Using a training set with an unequal number of positive and negative examples leads to a slightly different initialisation of the weights.
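To make the weighting scheme concrete, the following is a minimal Python sketch of one AdaBoost round and of the strong-classifier evaluation of Eqs. (2.1) and (2.2); the weak_learner interface and the function names are illustrative assumptions, not part of the thesis framework.

```python
import math

def adaboost_round(weak_learner, examples, labels, weights):
    """Train one weak classifier on the weighted set and reweight the examples.

    `weak_learner(examples, labels, weights)` is assumed to return a
    function h(x) in {-1, +1}; `weights` sum to 1 (initially 1/|X|).
    """
    h = weak_learner(examples, labels, weights)
    # weighted error rate of the new weak classifier
    err = sum(w for x, y, w in zip(examples, labels, weights) if h(x) != y)
    alpha = 0.5 * math.log((1.0 - err) / err)  # voting weight of h
    # increase weights of misclassified, decrease weights of correct examples
    new_weights = [w * math.exp(-alpha * y * h(x))
                   for x, y, w in zip(examples, labels, weights)]
    norm = sum(new_weights)
    return h, alpha, [w / norm for w in new_weights]

def strong_classify(hypotheses, alphas, x):
    """Weighted combination of weak decisions (Eqs. 2.1 and 2.2)."""
    conf = sum(a * h(x) for a, h in zip(alphas, hypotheses))
    return (1 if conf >= 0 else -1), conf
```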
It can be shown that, in the case of binary classification, the training error of AdaBoost drops exponentially fast with the number of boosting rounds, which is directly related to the number of weak classifiers. It has also been shown that the AdaBoost algorithm has good generalisation skills [15]. The training error is calculated from the examples that cannot be classified correctly although they are in the training set. The AdaBoost algorithm can be extended to cover hypotheses that return confidence measures and even multi-class problems [45]. Tieu and Viola [50] introduced an application of off-line boosting to feature selection, which is in their case used for image retrieval. Feature selection deals with the proper selection of image features out of a pool of features (see Section 2.1.2). This can be done with boosting by directly assigning each feature to a weak classifier. On-line versions of bagging and boosting have also been developed by Oza [40], and it is shown that on-line boosting converges towards the off-line results for Naïve Bayes base models if the number of training examples grows towards infinity.
2.1.1 Weak Classifiers
As mentioned before, every classification mechanism that achieves a result better than random guessing can be used as a weak classifier. One of the simplest methods for such a weak classifier is the decision stump as used by Viola and Jones [54]. The threshold $\theta$ of this classifier is calculated as the arithmetic mean of the means $\mu^+$ and $\mu^-$ of the Gaussian distributions of the positive and negative class values. Although much better approaches exist, such as a naive Bayes classifier [4] or brute force, this simple rule can effectively be used for boosting. In the framework used for this thesis, each weak classifier is assigned to a certain image feature to interpret its evaluation result on a certain image region.
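As an illustration, a minimal Python sketch of such a decision stump follows; the incremental mean estimation and the polarity handling are assumed details, since the text only specifies that θ is the mean of μ+ and μ−.

```python
class DecisionStump:
    """Threshold weak classifier: theta is the midpoint of the class means."""

    def __init__(self):
        self.mu_pos, self.n_pos = 0.0, 0
        self.mu_neg, self.n_neg = 0.0, 0

    def update(self, response, label):
        # incrementally track the mean feature response of each class
        if label > 0:
            self.n_pos += 1
            self.mu_pos += (response - self.mu_pos) / self.n_pos
        else:
            self.n_neg += 1
            self.mu_neg += (response - self.mu_neg) / self.n_neg

    def classify(self, response):
        theta = (self.mu_pos + self.mu_neg) / 2.0
        polarity = 1 if self.mu_pos >= self.mu_neg else -1
        return polarity if response >= theta else -polarity
```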
2.1.2 Feature Selection via On-line Boosting
The work of Oza [40] enabled the use of boosting in an on-line manner. This was a very important step, since it allows a boosting classifier to be learned incrementally. One application of on-line boosting is on-line feature selection, introduced by Grabner et al. [19], a task that deals with the optimal selection of image features. Out of the feature set $F$, a subset of features $\{f_1, f_2, ..., f_n\}$ is selected which is optimal for a certain task, such as the discrimination of an object from its background. Grabner et al. introduced selectors, which are used to solve this problem. A selector holds a pool of weak classifiers, and one weak classifier representing the decision of this selector is chosen out of this pool with respect to an optimisation criterion. As optimisation criterion, the error rate of the weak classifiers in
Figure 2.1: Schematic execution of the boosting algorithm: (a) set of samples, (b) determine the first weak hypothesis, (c) reweighing of the samples, (d) determine the second weak hypothesis, (e) reweighing of the samples
Figure 2.2: The selector-structure as proposed by Grabner et al. [19]
the selector pool can be used. In the initialisation phase, the strong classifier is built from a fixed number of selectors $N$, each of which is filled with randomly initialised weak classifiers $h_{n,m}$. The weak classifiers base their decision on the response of simple image features. When a labeled update image is received, each weak classifier in the first selector pool is updated with the new response value of its corresponding image feature. According to the classification success of the weak classifier, this results in a change of the error rate $e_{n,m}$. The best performing classifier is then selected out of the pool. The error rate of this classifier determines the change in the weight $\lambda$ of the training example for the following selector. In this way, the training example is passed through the selector stages and updates the weak classifiers in the corresponding pools. The weight adaptation focuses the following selector stages on hard examples that have not been reliably classified by the preceding stages. When receiving an example for classification, the response of the classifier can be calculated with the weighted combination of the selected weak
classifiers, $\sum_{n=1}^{N} \alpha_n \cdot h_n^{sel}(x)$, as usual.

Algorithm 1 On-line Boosting for Feature Selection
Require: training (labeled) examples $\langle x, y \rangle$, $x \in X$
Require: strong classifier $H$ (initialised randomly)
Require: weights $\lambda^c_{n,m}$, $\lambda^w_{n,m}$ (initialised with 1)
 1: for n = 1 to N do
 2:   λ = 1
 3:   // update weak classifiers and estimate errors
 4:   for m = 1 to M do
 5:     h_{n,m} = update(h_{n,m}, ⟨x, y⟩, λ)
 6:     if h^{weak}_{n,m}(x) = y then
 7:       λ^c_{n,m} = λ^c_{n,m} + λ
 8:     else
 9:       λ^w_{n,m} = λ^w_{n,m} + λ
10:     end if
11:     e_{n,m} = λ^w_{n,m} / (λ^c_{n,m} + λ^w_{n,m})
12:   end for
13:   m^+ = argmin_m(e_{n,m}), e_n = e_{n,m^+}, h^{sel}_n = h_{n,m^+}
14:   if e_n = 0 or e_n > 1/2 then
15:     exit
16:   end if
17:   α_n = 1/2 · ln((1 − e_n) / e_n)
18:   if h^{sel}_n(x) = y then
19:     λ = λ · 1 / (2 · (1 − e_n))
20:   else
21:     λ = λ · 1 / (2 · e_n)
22:   end if
23:   // replace worst weak classifier
24:   m^− = argmax_m(e_{n,m})
25:   h_{n,m^−} = createWeakClassifier()
26:   λ^c_{n,m^−} = 1, λ^w_{n,m^−} = 1
27: end for

2.2 Semi-supervised Learning
To train an accurate classifier for a specific task, a large amount of labeled training data is needed. This training data often has to be produced by humans and is thus very expensive. Semi-supervised learning [9] tries to benefit from unlabeled data to enhance the performance of a system. Thus, the training data is extended to $X = X^l \cup X^u$, containing labeled examples $X^l$ and unlabeled examples $X^u$. An overview of semi-supervised learning techniques can be found in [58]; all methods rely on similar assumptions (e.g., the cluster assumption or the large margin assumption). It has been shown that the performance of the used classifiers can be improved by this simple method using cheap unlabeled sample data.
2.3 Semi-supervised On-line Boosting for Feature Selection
As motivated in the previous Section, it is always desirable to have a large training set. Including unlabeled data into the off-line boosting framework has been recently presented by Mallapragada et al. [31]. Their intention was the improvement of any given supervised learning algorithm with the help of unlabeled data. An efficient implementations of their formulations in the boosting framework has been used to improve the performance of several commonly used supervised learning algorithms. The formulations of Mallapragada et al. are focused on the use of similarities Si,j between different examples. The proposed formulations look very complex, but their pursue is a rather simple goal. The minimisation is constructed to satisfy the following two criteria: • Pairs of labeled xli and unlabeled xuj examples should share the same label as they have a high similarity S(xli , xuj ). • Pairs of unlabeled examples xui and xuj should share the same label as they have a high similarity S(xui , xuj ). Mathematically formulated Fu (xu , S) =
nu X
u
u
Si,j exi −xj
(2.3)
i,j=1
measures the inconsistency between unlabeled examples X u and l
Fl (x , S) =
nl X nu X i=1 j=1
l u
Si,j e−2xi xj
(2.4)
measures the inconsistency between unlabeled $X^u$ and labeled examples $X^l$. These two equations can now be combined to
$$F = F_l + C F_u \qquad (2.5)$$
to formulate the optimisation problem, where the weighting factor $C$ can be replaced by $\frac{1}{|X^l|}$ and $\frac{1}{|X^u|}$ to weight all examples equally. Using an exponential cost (the terms $e^{x_i^u - x_j^u}$ and $e^{-2 x_i^l x_j^u}$) for boosting algorithms not only facilitates the derivation of boosting-based algorithms but also increases the classification margin. This non-linear equation is difficult to optimise, but can be upper-bounded by
$$F = \sum_{i=1}^{n_u} e^{-2\alpha h_t}\, p_i + e^{2\alpha h_t}\, q_i \qquad (2.6)$$
where
$$p_i = \frac{1}{|X^l|} \sum_{j=1}^{|X^l|} S_{i,j}\, e^{-2H(x_i)}\, \delta(y_j, 1) + \frac{1}{|X^u|} \sum_{j=1}^{|X^u|} S_{i,j}\, e^{H(x_j) - H(x_i)} \qquad (2.7)$$
and
$$q_i = \frac{1}{|X^l|} \sum_{j=1}^{|X^l|} S_{i,j}\, e^{2H(x_i)}\, \delta(y_j, -1) + \frac{1}{|X^u|} \sum_{j=1}^{|X^u|} S_{i,j}\, e^{H(x_i) - H(x_j)} \qquad (2.8)$$
with
$$\delta(x, y) = \begin{cases} 1 & x = y \\ 0 & \text{otherwise} \end{cases}$$
and the learned binary classification model $h_t(x)$ of iteration $t$ and $H(x)$ as the combination of the first $T$ classification models
$$H(x) = \sum_{t=1}^{T} \alpha_t h_t(x).$$
This formulation is now minimised by boosting, where $X^l$ is the set of labeled examples and $X^u$ the set of unlabeled ones, respectively. The quantities $p_i$ and $q_i$ can be interpreted as the confidence in classifying the unlabeled example $x_i$ into the positive class and the negative class, respectively. $H(x)$ is the combined hypothesis up to the currently trained classifier $h_n$, and $S_{i,j}$ a similarity measure between examples $x_i$ and $x_j$. Due to a usually very small number of labeled examples, no criterion for minimising the classification error of the final hypothesis over the labeled examples is listed. To embed this into the boosting algorithm, a pseudo-label $z_i$ is assigned to each training example. Like in the normal
AdaBoost algorithm, the voting weight of the weak classifier is calculated from the error rate $\epsilon_n$:
$$z_i = \operatorname{sign}(p_i - q_i) \qquad (2.9)$$
$$\lambda_i = |p_i - q_i| \qquad (2.10)$$
$$\alpha_n = \frac{1}{4} \ln \frac{1 - \epsilon_n}{\epsilon_n} \qquad (2.11)$$
Based on this work, Leistner et al. [25] adapted semi-supervised boosting for use with prior knowledge in the form of learned similarities and extended the loss function with a term measuring the inconsistency within the labeled examples. Subsequently, these ideas and formulations have been adapted to the feature selection task [20]. For the purpose of an on-line adaptive object tracker, the semi-supervised boosting approach has to be modified to fit into the on-line feature selection framework. To switch from an off-line setting, where all examples - labeled and unlabeled - can be accessed at any time, to an on-line setting, some approximations have to be made. Especially $p_i$ and $q_i$ cannot be calculated, because no pairwise similarity can be computed over the labeled and unlabeled examples. Since not all terms of the optimisation criterion can be calculated on-line, the equation must be simplified, which can best be explained step by step.

• Since the number of unlabeled examples grows towards infinity in the on-line case and most of the examples can be assumed to have only a very small similarity to each other, the last term tends towards zero and can be eliminated. This is essential, since the similarity between unlabeled examples cannot be calculated in the on-line case, and with an increasing number of samples the calculation would grow with $O(N^2)$, which disqualifies this term for a large number of unlabeled samples due to its computational complexity.

• Next, the similarity measure $S(x, y)$ is learned by a boosting classifier, which is used to distinguish directly between the positive and negative class $X^+$ and $X^-$. This classifier $H^P$ is inserted into the original equations instead of $S(x, y)$. Now $p_i$ and $q_i$ can be simplified to
$$\tilde{p}_n \approx e^{-H_{n-1}(x)} \sum_{x_i \in X^+} S(x, x_i) \approx e^{-H_{n-1}(x)} H^+(x) \approx \frac{e^{-H_{n-1}(x)}\, e^{H^P(x)}}{e^{H^P(x)} + e^{-H^P(x)}} \qquad (2.12)$$

$$\tilde{q}_n \approx e^{H_{n-1}(x)} \sum_{x_i \in X^-} S(x, x_i) \approx e^{H_{n-1}(x)} H^-(x) \approx \frac{e^{H_{n-1}(x)}\, e^{-H^P(x)}}{e^{H^P(x)} + e^{-H^P(x)}}. \qquad (2.13)$$
The difference between these two terms can be rewritten to build up the so-called pseudo-soft label
$$\tilde{z}_n(x) = \tilde{p}_n(x) - \tilde{q}_n(x) = \tanh(H^P(x)) - \tanh(H_{n-1}(x)), \qquad (2.14)$$
which is used for setting the label of the training examples in the different classifier stages.
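To make Eq. (2.14) concrete, the following minimal Python sketch derives a hard pseudo-label and example weight from the two confidences; the function name and the sign convention for the hard label are illustrative assumptions.

```python
import math

def pseudo_soft_label(h_prior, h_online_prev):
    """Pseudo-soft label (Eq. 2.14): difference of squashed confidences.

    h_prior:       confidence H^P(x) of the fixed prior classifier
    h_online_prev: combined confidence H_{n-1}(x) of the on-line
                   classifier up to the previous selector stage
    """
    z = math.tanh(h_prior) - math.tanh(h_online_prev)
    label = 1 if z >= 0 else -1   # hard label passed to the weak classifier
    weight = abs(z)               # importance weight of the example
    return label, weight
```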
Including these formulations into on-line boosting for feature selection, it can be seen that the current (pseudo-soft) label depends on the summed confidence of the previous classifier stages. Thus, it is possible that the confidence reached in the current stage exceeds the confidence of the prior classifier, so that the label of the training example switches to the opposite class even if it does not belong to that class. Also, the weight of the current example decreases as the confidences of the on-line and the prior classifier converge. This mechanism avoids over-fitting of the object model if a high confidence is already given, but it requires an honest prior, which confesses its uncertainty on the actual example if the difference to the original sample used for training the prior classifier is too big.

As this algorithm is very similar to the on-line boosting for feature selection algorithm (see Algorithm 1), the description will focus on the differences; the remainder of the semi-supervised listing is identical to Algorithm 1. Based on a rather complex theory, the mechanism is simplified in the implementation to a few lines of changed code. The basic intuition of this algorithm is the clever incorporation of prior knowledge into the boosting process to tackle the drifting problem. Since there are simpler methods of doing this, the high complexity has to justify itself with better results. The main change from supervised to semi-supervised learning lies in the label assignment: the pseudo-soft label is assigned if the current example is unlabeled, $x \notin X^l$; otherwise, the algorithm works exactly as the supervised version. As described in Section 2.3, the label is assigned with respect to the confidence of the prior classifier $H^P$ and the confidence of the on-line classifier up to the current stage, $H_{n-1}$, in the form of $\tilde{p}(x) - \tilde{q}(x) = \tanh(3 \cdot H^P) - \tanh(3 \cdot H_{n-1})$. This term is used to determine both the label and the weight of the example. Since this is the major term coming out of the semi-supervised optimisation problem in this case, it deserves further investigation. First of all, the factor 3 is included within the tanh function to use its full bandwidth, since the classifiers themselves only operate within $[-1, 1] \subset \mathbb{R}$. Assuming the prior classifier is highly confident on the first example, the first stage receives
$y_1 = +1$ and $\lambda_1 = 1$. This forces the on-line classifier to update itself in the well-known manner. The weight of the example will increase with increasing $n$, since the confidence of the on-line classifier should increase while summing up more and more weak classifier results. For the next example, the confidence of the prior classifier is assumed to be only moderately high, $H^P \approx 0.5$. Again, similar to the first case, the confidence of the on-line classifier $H_n$ will increase with $n$, but once it reaches a confidence similar to that of the prior classifier, this will lead to a small example weight $\lambda_n$ for this stage. With further increasing confidence of the on-line classifier, the label $y_n$ will switch to the opposite class, and the weak classifier will learn this example with the opposite label, but only with a low weight. Assuming that the weak classifier still classifies the example correctly, which now disagrees with the assigned label, an increased confidence is induced. If the confidence of the on-line classifier falls below the confidence of the prior classifier again, the label switches back to the correct class and the confidence of the classifier increases again. The outcome of this is a convergence of the on-line classifier towards the confidence of the prior classifier.

Discussion

The behaviour of the algorithm can be interpreted as preventing the on-line classifier from achieving high confidence on examples that are not very similar to the original object model. This similarity is measured with the confidence of the prior classifier $H^P$. The on-line classifier is kept from becoming more confident on a training example than the prior classifier is, as visible in Figure 2.3. Switching the label $y_n$ for a specific classifier stage without simultaneously reducing the weight of this particular example would directly destroy the statistics of the learned classifier $h_n$ in this stage, because the example would be fully learned with wrong information. Due to the coupled formulation, high weights only appear if the prior classifier and the on-line classifier have different opinions on a particular example, which should not happen, because the prior classifier should lead the decision of the on-line classifier. If it does happen, the learned statistics of the on-line classifier are corrected by the high impact of this example. Another insight into this algorithm is that an example gets "unlearned" if the confidence of the classifier gets too high, which decreases the impact of learning examples that are already classified correctly. This enforces diversity in the training set by down-weighting examples that have already been trained.
2.4 Image Features
After explaining the functionality of the semi-supervised feature selection mechanism, it should be clearly described what image features are and which type of image features is used in this thesis.
Figure 2.3: An example update sequence of the semi-supervised classifier: (a) confidence evolution of the semi-supervised classifier over selector $n$; (b) weight $\lambda_n$ and label $y_n$ evolution of the semi-supervised classifier over selector $n$
Image features are individual measurable properties of an image being observed. Usually, features reduce heuristic attributes to a numeric value. These features should contain significant information that describes the object to detect and track. There are many different approaches to image features; some common ones are listed here.

Color: Color is one of the most popular image features, because it can be extracted very easily and quickly. Usually, pure color is not used to match against an object model, because it is not robust to measurement errors and illumination changes. Therefore, color histograms of arbitrary object areas are used to build a better object description. Color histograms can be calculated very fast using integral histograms [42], although only for rectangular areas.

Texture: Since the texture of an object is one of its characteristic properties, it is often used for classification and detection tasks. Many approaches exist for this task; an overview of texture features was given by Ojala et al. [37]. Very popular in this context are Local Binary Patterns. Mäenpää [30] described the principal functionality and different extensions of Local Binary Patterns.

Edges: There exist many different approaches to measure edges in an image. One possibility are Haar-like wavelets, which are very popular due to the work of Viola and Jones [54], who proposed the first real-time frontal face detector. Instead of modelling edges with Haar-like features, Levi and Weiss [27] showed that local edge orientation histograms achieve good performance for object detection.

Since this work uses Haar-like features, they will be described in detail now. Haar wavelets [38, 41] can be described as simple filters that encode the difference of average intensities and are closely related to Gabor filters². When calculating the response of Haar-like features, the pixel values under the negative area are averaged and subtracted from the average of the positive area. Due to the various configurations of these white and black areas, Haar-like features can describe, for example, horizontal, vertical or diagonal edges or bars. To cope with different sizes of these image properties, the features can be resized and deformed. To achieve a good description of the image, many features with high positive or negative response are needed.
² A tutorial on Gabor filters: http://mplab.ucsd.edu/tutorials/tutorials.html, 01.10.2008
Haar-like features are very popular in the field of computer vision, because they are simple and can be computed very fast and efficiently by the use of integral images [54]. To cover different structure sizes, they can be resized and deformed. Due to their large number, which grows with $O(H^2 \cdot W^2)$ for an image of size $H \times W$, they build an over-complete set, which means that there exist more features than the number of pixels the observed image patch contains.
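As a sketch of this computation, the following Python code shows how an integral image reduces any rectangle sum, and hence a two-rectangle Haar-like feature response, to a handful of array lookups; the function names are illustrative and not taken from the thesis framework.

```python
import numpy as np

def integral_image(img):
    """Padded integral image: ii[y, x] holds the sum of img[0:y, 0:x]."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.float64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum over the rectangle at (x, y) of size w x h using four lookups."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def haar_horizontal_edge(ii, x, y, w, h):
    """Two-rectangle feature: difference of the summed intensities of the
    top and bottom half (averaging omitted for equal-area rectangles)."""
    top = rect_sum(ii, x, y, w, h // 2)
    bottom = rect_sum(ii, x, y + h // 2, w, h - h // 2)
    return top - bottom
```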
Figure 2.4: Haar-like features available in the used framework: (a) horizontal edge, (b) vertical edge, (c) horizontal line, (d) vertical line, (e) diagonal line, (f) center surrounding
The approach described in this thesis uses a collection of Haar-like features as the object model because of their fast computation, but it is also possible to use another feature type or even to mix different feature types, such as local edge orientation histograms [27] for edges or local binary patterns [30] to cover texture.
2.5 Particle Filter
Particle filtering [32] is a method which is used, among other things, for motion estimation in tracking applications. Related to the Kalman filter, it can efficiently estimate an unknown probability density function. The main advantage of the particle filter over the Kalman filter is the fact that it is not limited to Gaussian distributions. Particle filters are also known as sequential Monte Carlo methods (SMC); they are used to estimate Bayesian models and, with sufficient samples, they approach the Bayesian optimal estimate. From the algorithmic point of view, a particle filter implementation consists of three steps: measurement, resampling and transition. These steps are repeated for every new frame to spread the particles over the image (see Figure 2.5).

Measurement: This is done by evaluating the similarity measure of the tracking system at the actual particle position. The received values are normalised and used as weights of these particles.

Resampling: The particles are sorted in descending order of their weights. Poorly weighted particles may be dropped, whereas particles with high weight may be copied. This leads to survival of the fittest particles. How many particles are dropped and copied, and which of the highly weighted particles are copied, depends on the used resampling strategy.

Transition: The last part, resulting in the new particle positions, is the transition. The motion model estimates the position of the particle in the next frame based on, e.g., the starting and the previous position; additionally, some random noise is added to spread the particles over an appropriate area.
Figure 2.5: Schematic execution of the particle filter algorithm: (a) particle distribution, (b) sample and weight particles, (c) resampling, (d) transition
This results in a swarm of particles moving in an area around the original object position, where particles with greater distance tend to be dropped and good ones in the middle are copied with high probability instead. To calculate the final object position out of this particle swarm, it is often useful to consider some of the fittest particles, not only the best one. As there is a great range of freedom in implementing such a particle filter, regarding the similarity measure, the resampling, the motion model and the spreading during the transition, particle filters can easily be adapted to many fields of application. However, this great freedom is also a huge source of error; especially the resampling method can cause serious problems if it is not suitable for a particular application. A more detailed description of particle filtering can be found in [32].
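The three steps can be condensed into a short sketch, shown below under simplifying assumptions (a plain random-walk motion model and multinomial resampling); the confidence callback stands in for the tracker's similarity measure.

```python
import numpy as np

def particle_filter_step(particles, confidence, motion_noise=5.0,
                         rng=np.random.default_rng()):
    """One iteration of measurement, resampling and transition.

    `particles` is an (N, 2) array of (x, y) positions; `confidence`
    is a callback mapping a position to the tracker's similarity measure.
    """
    # Measurement: evaluate and normalise the response of every particle
    weights = np.array([max(confidence(p), 1e-6) for p in particles])
    weights /= weights.sum()
    # Resampling: draw N particles with probability proportional to weight
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    resampled = particles[idx]
    # Transition: random-walk motion model plus Gaussian spreading noise
    return resampled + rng.normal(scale=motion_noise, size=resampled.shape)
```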
Chapter 3

Robust Object Tracking

This chapter gives a more detailed description of the implemented tracking system, from an overview of the whole system and the basic functional principles at the beginning to implementation details of the different parts.
3.1 The Tracking Loop
The building blocks of an object tracking system as described in this thesis have now been defined; they can be assembled to build a tracking loop as illustrated in Figure 3.1.

Find and select the object: The first step is the localisation of the object in the actual image, using a similarity measure which compares a currently selected patch to the existing object model. The outcome of this processing step should be the most probable object location, or the information that the object is not present in the current frame. If the latter is the case, the system can proceed by trying to find the object in the following frame.

Refining the Confidence Map: The process of detecting the object in the actual image can also be broken down to calculating the probability of the object being located at the current position. The outcome of this operation is called the confidence map. This map can be refined or processed in various ways.

Adapting the Object Model: One of the crucial parts of the tracking loop is the adaptation of the object model. Being too adaptive increases the risk of including erroneous information from misaligned or misclassified image patches into the object model. If, on the other hand, the object model is too inflexible to changes in the appearance of the object, the tracker will fall behind the natural changes of the object.
Constrain the search region: In the last step, the actual object location can be used to constrain the search region for the localisation process. This can be done with the help of motion or velocity models, but it is also possible to use simple heuristics like a rectangular neighbourhood, assuming only small motion in-between two frames.

Figure 3.1: Tracking loop for adaptive kernel tracking methods

Based on these four operations of the tracking system, the main components are structured. Each of these components is implemented in various variants to build a framework for multiple purposes.
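A compact sketch of how these four operations could be chained is given below; classifier, estimator and their methods are hypothetical interfaces standing in for the framework components described in the next section, not names from the actual implementation.

```python
def track(frames, classifier, estimator):
    """One pass over a video following the four steps above.

    `classifier` and `estimator` are assumed interfaces for the object
    model and the search-region/motion model, respectively.
    """
    position = None
    for frame in frames:
        region = estimator.search_region(position)     # constrain search
        conf_map = classifier.evaluate(frame, region)  # find the object
        position = conf_map.best_position()            # refine confidence map
        if position is not None:                       # object found?
            classifier.update(frame, position)         # adapt object model
    return position
```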
3.2 System Overview
In order to tackle all the tasks mentioned in the previous Section, the tracking system is separated into different modules and additional functions.
3.2.1 Object Detection
This module consists of the different classifier implementations used for evaluation and comparison of the new approach using semi-supervised learning. All these classifiers build upon the same scheme using selectors, on-line boosting and simple Haar-like features.

Figure 3.2: A simplified scheme of the tracking system

StrongClassifier: The base of the boosting framework is built upon this class. It defines the functionality for derived implementations.

StrongClassifierStandard: The on-line boosting for feature selection classifier from [18]. The classifier includes a set of selectors that are combined via on-line boosting.

StrongClassifierSemiLearning: The recently proposed classifier from [20]. Similar to the standard classifier, the selectors are combined by boosting. The algorithm uses prior information as a function parameter, which is included into the on-line updating process.

StrongClassifierSummation: A very simple approach to the incorporation of prior knowledge is to simply combine an on-line and an off-line classifier by summation of their results. This classifier is used in the evaluation of the semi-supervised approach, which has to justify its additional complexity with better results.
Selector: The Selector is a data structure that is included in the StrongClassifier class. Each selector manages a pool of weak classifiers that can be selected and replaced.

The rectangular bounding box of the tracked object is modelled with an on-line boosted ensemble of Haar-like features, as proposed by Grabner and Bischof [18]. As described in Section 2.1.2, these features are organised in so-called Selectors, where the response of the Selector corresponds directly to the response of the feature $h_n^{sel}$ with the lowest error rate in its pool. A fixed number of such Selectors is then combined to a strong classifier $H$ by boosting.
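A minimal Python sketch of such a selector is shown below; the weak-classifier interface (update, classify, error_estimate) is an assumption made for illustration, mirroring the error bookkeeping of Algorithm 1.

```python
class Selector:
    """Pool of weak classifiers exposing the currently best one.

    The weak classifiers are assumed to provide update/classify methods
    and a running error estimate, as in Algorithm 1."""

    def __init__(self, weak_pool):
        self.pool = weak_pool                  # M weak classifiers
        self.errors = [0.5] * len(weak_pool)   # estimated error rates e_{n,m}

    def update(self, x, y, lam):
        # update every weak classifier with the weighted example
        for m, h in enumerate(self.pool):
            h.update(x, y, lam)
            self.errors[m] = h.error_estimate()
        return min(self.errors)                # error of the selected classifier

    def classify(self, x):
        best = self.errors.index(min(self.errors))
        return self.pool[best].classify(x)
```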
3.2.2 Object Modelling
WeakClassifier: This is the base class for all weak classifiers; it defines the functionality and interface.

WeakClassifierThreshold: A classifier implementation that uses only a threshold for the decision. The averages of the positive and negative class are estimated, and the threshold is set in between these two values.

WeakClassifierBayes: A more sophisticated classifier that models the probability distributions of the positive and the negative class. These distributions are used to distinguish between the classes.

Feature: This is the base class for all different feature classes and characteristics.

As described in Section 2.1.1, these are the implementations of different weak classifier types. While WeakClassifierThreshold directly relates to the simple decision stump, WeakClassifierBayes corresponds to an implementation of the naive Bayes classifier. The classifiers all work in a one-dimensional space, which influences the selection of the used image features. Haar-like features have shown good performance in object detection tasks and are therefore used as the standard for this application. To extend the flexibility of the framework, other feature types such as orientation histograms and Local Binary Patterns can also be used as input for the weak classifiers.
3.2.3 Location Estimation
Patches: Patches are simple rectangular subsections of the image. They can vary in size, aspect ratio and position.

PatchesRegularScan: This is a set of patches that regularly samples a given image area. All patches have the same size and are placed with a defined overlap.
PatchesParticleFilter: This set of patches includes the functionality of a particle filter (see Section 2.5). The patches only have to be updated with the evaluated confidences and build up the actual particle distribution on their own.

The estimation of the object's location is another major part of the tracking system. Since this influences the robustness of the system to object motion and occlusion, it is important to make the right assumptions about the object behaviour and to choose an adequate method. The better the estimation of the object location, the smaller the area the detection algorithm has to search in. Thus, the right algorithm can additionally speed up the whole tracking process. The original tracking system used a very simple method for choosing the search region in the next image. To improve this, a particle filter has been implemented (see Section 2.5).
3.2.4 Training and Updating the Object Model
This module of the application framework is related to the initialisation and adaptation of the object model. While OneShotTraining includes different methods of preparing transformations of the initial image to generate a bigger training set, UpdateRule is used to compare different methods of selecting the patches that are used to update the classifier during runtime. All used classifiers are updated on-line, but if the selected examples were trained in an off-line manner, the results of the classifier would probably be better. The main advantage of on-line learning here is the short initialisation phase, which lasts less than a second.

OneShotTraining: This class contains all training strategies that are evaluated in Section 4.9.

UpdateRule: This class contains all update rules; the results of the evaluation can be seen in Section 4.8.
3.2.5 Refining of the Confidence Map
The flood of confidences coming out of the detection process has to be further processed to retrieve the actual object position. The simplest way to do this is to select the absolute maximum over the whole map. However, we can assume that the object model is resistant to very small movements of the object, because many of the structures covered by the selected features will not change much if they are shifted by a small number of pixels. Thus, the map will exhibit a small plateau of high confidences around the original object position, which can be used to eliminate outliers. This is done by smoothing the confidence map with a 3 × 3 Gaussian kernel. The kernel size is kept very small so as not to flatten out the whole detection area. Looking forward to the use of a particle filter for the selection of patches to be evaluated by the detection mechanism, this method cannot be applied directly to spatially irregular samples.

Figure 3.3: Raw confidence map (a) and smoothed version (b)
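A minimal sketch of this refinement step, written with OpenCV's C++ interface for brevity (the thesis framework predates this interface, so the calls are illustrative): the map is smoothed with a 3 × 3 Gaussian kernel and the location of the maximum of the smoothed map is returned.

#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>

cv::Point refineConfidenceMap(const cv::Mat& confidence /* CV_32F map */) {
    cv::Mat smoothed;
    // small kernel, so the detection area is not flattened out
    cv::GaussianBlur(confidence, smoothed, cv::Size(3, 3), 0.0);
    cv::Point maxLoc;
    cv::minMaxLoc(smoothed, 0, 0, 0, &maxLoc);
    return maxLoc;  // position of the highest smoothed confidence
}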
3.3 Update Rules for On-line Boosting
Based on Geometry: A straightforward idea for choosing update patches is based on geometric patterns. Usually, patches located near the object are chosen to train the object model to distinguish between the object and the nearby background (see Figure 3.4 a and b).

Based on Confidence: This category of update rules relies on the patch confidences of the last detection period. There are various ways of selecting the right patches, which can also be expressed as a sorting order over the whole set of patches extracted during the search for the object. Especially when working with semi-supervised boosting, where no direct label is given to the update patches, this is an interesting field for experiments. A possible variant is the selection of patches whose confidence is near zero, meaning that they are close to the decision boundary of the detection algorithm and are not well described by the object model. A main drawback of this method is that the confidences of the patches have to be processed, making it more time-consuming than fixed update patterns (see Figure 3.4 d, e, and f).

Randomly chosen: The third variant is the selection of random patches from the neighbourhood of the object for updating the detection algorithm (see Figure 3.4 c).
Different variants of these three groups of update rules have been implemented in the course of this thesis and are evaluated in Section 4.8. Figure 3.4 shows a schematic example of the different update methods.
Figure 3.4: Schematic figure of different update rules: (a) original geometric updates, (b) geometric updates, (c) random updates, (d) prefer near updates, (e) prefer confident updates, (f) update at decision boundary. Brighter rectangles represent patches that have a lower probability of being chosen for updating the object model; the blue rectangle represents the object position.
3.4 Visual one-shot Learning
If the object to track is unknown when starting the application, the prior model must be extracted from the first frame. To establish a more robust object model, virtual samples [36] are created that represent typical appearance changes of the object. Illumination changes are a very common occurrence and can have a big influence on the tracking result. Furthermore, transformations of the object have already been used to increase the performance of a detection system [17]; this can be utilised to enhance the creation of a more robust object model as well. Additionally, this corresponds to the natural movement of objects, especially persons or faces.

Illumination: Since the application uses Haar-like features, changing illumination does not have a strong influence on the classification process. Needless to say, if other features are used, illumination should be considered for training.
Particle
⟨x, y⟩             the current position of the particle
⟨xp, yp⟩           the previous position of the particle
⟨x0, y0⟩           the initial position of the particle
⟨width, height⟩    size of the covered region of this particle
weight             current weight of this particle

Table 3.1: The data structure Particle
Rotation: To get a more stable object model, it should be able to handle limited rotation of the original object. When tracking persons or heads, for example, a slight rotation is normal but limited by physical realities.

Scaling: Although scaling can be handled in a more sophisticated way, the simplest way to include it in the object model is to train the prior model with some extra examples that are slightly scaled. This does not enable the application to detect scaling by itself, but it allows keeping the focus on the centre of the object. A better way would be to expand the object search into scale space.

The main question now is how to combine these transformations with a proper update patch selection to achieve a good prior model with a minimum of update cycles.
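The following sketch shows how such virtual samples could be generated with OpenCV; the concrete angles and scale factors are illustrative assumptions, not the values used in the thesis.

#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <vector>

// Generate slightly rotated and scaled copies of the initial frame around
// the object centre; cv::getRotationMatrix2D covers both transformations.
std::vector<cv::Mat> makeVirtualSamples(const cv::Mat& frame,
                                        cv::Point2f objectCentre) {
    std::vector<cv::Mat> samples;
    const float angles[] = { -5.f, 0.f, 5.f };     // slight rotations (degrees)
    const float scales[] = { 0.95f, 1.f, 1.05f };  // slight scalings
    for (float a : angles)
        for (float s : scales) {
            cv::Mat M = cv::getRotationMatrix2D(objectCentre, a, s);
            cv::Mat warped;
            cv::warpAffine(frame, warped, M, frame.size());
            samples.push_back(warped);  // extra positive training example
        }
    return samples;
}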
3.5 Particle Filter
As described in Section 2.5, a particle filter is used for location estimation in the described tracking system to speed up the execution. To fully describe the functionality of this method, the object type Particle first has to be defined. A Particle is a data structure (see Table 3.1) in which primarily motion information is stored¹. In the implemented version, this comprises the initial, the previous, and the current position of the particle. Using another motion model would require additional information; this could, for example, also include information on scaling and rotation, which can likewise be covered by the particle filter. Another important property of a particle is its weight. The weight is assigned in the measurement phase and used in the resampling algorithm; in between these two steps it is usually normalised.

¹ The values x and y in this case refer to pixel positions in the image, not to a feature value and class label.

The resampling algorithm is the main part of the whole particle filter. According to the used method, outliers and particles with low weight are eliminated and good particles are duplicated.
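Written out as a plain C++ struct, the Particle of Table 3.1 could look as follows (the field names are chosen to match the table):

struct Particle {
    float x,  y;         // current position (pixels)
    float xp, yp;        // previous position
    float x0, y0;        // initial position
    int   width, height; // size of the covered region
    float weight;        // assigned in the measurement phase, then normalised
};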
The described sorted-weights resampling algorithm consists of the following steps:

1. Sort all particles by decreasing weight.
2. For all particles, calculate the number of copies. Particles with a weight lower than 1/N are automatically dropped.
3. Insert that number of copies.
4. If not enough particles have been inserted, copy randomly chosen particles from the already inserted ones until the number of particles has reached N.

Algorithm 3 Particle Filter - Sorted-weights resampling
Require: number of particles N
Require: old particles P^old, |P^old| = N
Require: new particles P^new (empty)
Require: sort(A, attribute), sorts set A by attribute
Require: insert(A, a), inserts element a into set A
Require: random(a, b), returns a random integer x, a ≤ x ≤ b
1: sort(P^old, weight) // sort particles by decreasing weight
2: for i = 1 to N do
3:   // calculate the number of copies
4:   num = weight(P^old_i) · N
5:   // insert copies
6:   while num > 0 and |P^new| < N do
7:     insert(P^new, P^old_i)
8:     num = num − 1
9:   end while
10: end for
11: M = |P^new|
12: // fill up with good particles if needed
13: while |P^new| < N do
14:   r = random(1, M)
15:   insert(P^new, P^new_r)
16: end while
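A compact C++ sketch of Algorithm 3, assuming the Particle struct above and weights normalised to sum to one (so that a weight below 1/N yields zero copies, and at least one particle receives a copy):

#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <vector>

std::vector<Particle> resampleSortedWeights(std::vector<Particle> old) {
    const std::size_t N = old.size();
    // 1. sort particles by decreasing weight
    std::sort(old.begin(), old.end(),
              [](const Particle& a, const Particle& b) {
                  return a.weight > b.weight;
              });
    std::vector<Particle> fresh;
    // 2.-3. insert floor(weight * N) copies of each particle
    for (std::size_t i = 0; i < N && fresh.size() < N; ++i) {
        int copies = static_cast<int>(old[i].weight * N);
        while (copies-- > 0 && fresh.size() < N) fresh.push_back(old[i]);
    }
    // 4. fill up with randomly chosen, already inserted (i.e. good) particles
    const std::size_t M = fresh.size();
    while (fresh.size() < N)
        fresh.push_back(fresh[std::rand() % M]);
    return fresh;
}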
The subsequent part of the particle filter is the transition algorithm. At this point, the transition or motion model is applied to the movement of the individual particles; a portion of Gaussian noise is also added to spread the particles in the image area. The algorithm is included only to visualise the influence of the motion model and the Gaussian noise on the transition process.

Algorithm 4 Particle Filter - Transition
Require: number of particles N
Require: particles P, |P| = N
Require: tX and tY, standard deviation of the transition noise for directions x and y
Require: A1, A2 and B0, parameters of the transition model
Require: randGauss(a), a Gaussian-distributed random number with µ = 0, σ = a
1: for i = 1 to N do
2:   // calculate new location
3:   x = A1 · (P_i.x − P_i.x0) + A2 · (P_i.xp − P_i.x0) + B0 · randGauss(tX) + P_i.x0
4:   y = A1 · (P_i.y − P_i.y0) + A2 · (P_i.yp − P_i.y0) + B0 · randGauss(tY) + P_i.y0
5:   // move values
6:   P_i.xp = P_i.x, P_i.yp = P_i.y
7:   P_i.x = x, P_i.y = y
8: end for
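The transition step of Algorithm 4 translates almost directly into C++; randGauss and its seeding are illustrative:

#include <random>
#include <vector>

float randGauss(float sigma) {
    static std::mt19937 rng{ 42 };  // fixed seed for repeatability
    std::normal_distribution<float> dist(0.f, sigma);
    return dist(rng);
}

void transition(std::vector<Particle>& P, float A1, float A2, float B0,
                float tX, float tY) {
    for (Particle& p : P) {
        // second-order motion model plus Gaussian noise, expressed relative
        // to the particle's initial position
        float x = A1 * (p.x - p.x0) + A2 * (p.xp - p.x0)
                + B0 * randGauss(tX) + p.x0;
        float y = A1 * (p.y - p.y0) + A2 * (p.yp - p.y0)
                + B0 * randGauss(tY) + p.y0;
        p.xp = p.x; p.yp = p.y;  // move values
        p.x = x;    p.y = y;
    }
}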
For the calculation of the result, a number of particles can be chosen whose positions are averaged. This is a kind of smoothing against the Gaussian noise that is added to the particles in the transition step.

Interpretation of the algorithms: The resampling algorithm focuses the particles back onto the detected object, following a survival-of-the-fittest paradigm. This is needed because otherwise the particles would spread over the whole image area and the sampling density would no longer be sufficient to establish a proper detection. The transition algorithm then spreads the particles again over a small image area. According to the previous motion of the particles, a proper next location is predicted. To cover noise and irregular motion, some Gaussian random noise is also included in this prediction, which spreads the particles around the predicted location. The final result is then calculated by averaging over a proper number of good particles; for this, the best 10% of all particles are taken in this implementation.
Chapter 4
Experiments and Results

To illustrate the advantages of the described semi-supervised tracking approach and all other extensions that have been added within this thesis, a number of experiments have been conducted. This chapter includes the descriptions, the setup, and the results of these experiments. All experiments have been run several times and averaged to obtain reliable results.
4.1 Evaluation Data
As described in Section 4.2, it is difficult to evaluate the performance of a tracking system, especially if the system is on-line adaptive. Therefore, several runs over a set of videos have been processed in order to obtain comparable results.

Video “dog”: This video, taken from David Ross¹ [29], shows a toy dog. It contains only small rotations and slow movement of the object (see Figure 4.1).

Video “cat”: The second video has also been taken from David Ross; it shows a toy cat, but includes more rotation, scaling, and additional illumination changes (see Figure 4.2).

Video “vehicle”: As vehicle surveillance is a very popular field of application for tracking systems, a video from the VIVID dataset² was included. This video includes full occlusions and similar objects (see Figure 4.3).

Video “sports”: The last video of the evaluation set shows a scene from a beach-volleyball match. The difficulty in this video lies in the two very similar looking players that cross each other's way within the scene.

¹ David Ross: http://www.cs.toronto.edu/~dross/ivt/, 01.10.2008
² VIVID Dataset: http://www.vividevaluation.ri.cmu.edu/datasets/datasets.html, 2007/06/12
Another challenge is the changing body pose of the tracked player (see Figure 4.4).
Figure 4.1: Video “dog”: (a) initial position, (b)–(c) out-of-plane rotation, (d) in-plane rotation
These videos have been chosen because they include many of the challenges a tracking system is confronted with: rotations in and out of plane, illumination changes, occlusions, and similar objects appearing near the original object to track. Another reason for choosing these videos is their public availability. As the results show, not all of these challenges can be handled perfectly by any of the tested methods, but the results are impressive and competitive.
4.2 Performance Measures
As with every highly dynamic system, it is very difficult to quantify the performance of a tracking system. The large number of parameters and the involved randomness make the interpretation of the results especially difficult.

Tracking Rate: This is the ratio of successfully tracked frames to all frames. It measures the overall tracking success.

Precision: This is a measure commonly used in object detection systems. It measures the ratio between correct detections and all detections.
Figure 4.2: Video “cat”: (a) initial position, (b) out-of-plane rotation, (c) changed view, (d) illumination change
Figure 4.3: Video “vehicle”: (a) initial position, (b) full occlusion, (c) similar objects, (d) partial occlusion
Figure 4.4: Video “sports”: (a) initial position, (b) crossing players, (c) changed aspect ratio, (d) similar objects
Overlap: The overlap measures the alignment of the tracking result with the ground truth. The results are modelled as a Gaussian distribution with mean µ and standard deviation σ.
4.3 Parameter Settings
To guarantee comprehensible and repeatable experiments, some parameters of the classification framework have to be set to fixed values. All experiments use the same settings, except the long-term drifting experiment, which uses a lower number of selectors.

Number of Selectors: The number of used selectors is set to 50, which has been experimentally found to be a reliable but also fast setting.

Classifier Pool Size: The pool size is also set to 50. This results in an overall number of 2500 image features used in the classification framework, a number that has shown a high probability of accomplishing even difficult tracking tasks.

Search Area: The search area has a size of 2.5 times the width and height of the object bounding box, and the evaluated patches have an overlap of 99%, which results in a shift of only 1 pixel for objects whose bounding box is smaller than 100 × 100 pixels.
The best settings for the parameters of the semi-supervised boosting algorithm are evaluated in the following section.
4.4 Parameter Optimisation
To obtain the best possible performance of the semi-supervised on-line boosting algorithm, different parameters have been introduced into the algorithm to refine its function:

y_n = sign( tanh(3 · H^p(x)) − tanh(3 · H^{n−1}(x)) )    (4.1)
λ_n = | tanh(3 · H^p(x)) − tanh(3 · H^{n−1}(x)) |    (4.2)

y_n = sign( ω · tanh(ν · H^p(x)) − (1 − ω) · tanh(ν · H^{n−1}(x)) )    (4.3)
λ_n = | ω · tanh(ν · H^p(x)) − (1 − ω) · tanh(ν · H^{n−1}(x)) |    (4.4)
H^p(x): confidence of the prior classifier
H^{n−1}(x): confidence of the on-line classifier including selectors 1 to n − 1
ω: describes the weighting of the prior and the on-line decision; if ω is greater than 0.5, the on-line classifier is allowed to adapt itself more to the current patch
ν: a factor that speeds up or slows down how quickly the classifier output reaches the saturation of the tanh function

Discussion

The parameters have been studied intensively on the “cat” video of the test set, and it has been shown that the values ω = 0.5 and ν = 3 result in the best performance of the semi-supervised classifier. The major interpretations of these results go along with the discussion and interpretation of the update algorithm (see Section 2.3).
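For illustration, Equations (4.3) and (4.4) can be computed as in the following sketch. Taking the absolute value for λ_n is an assumption made here so that the importance weight is non-negative; all function and variable names are hypothetical.

#include <cmath>

struct PseudoUpdate { int y; float lambda; };

// Pseudo-label and importance weight for an unlabeled patch, given the
// prior confidence Hp and the partial on-line confidence Hn1 = H^{n-1}(x).
PseudoUpdate pseudoLabel(float Hp, float Hn1,
                         float omega = 0.5f, float nu = 3.f) {
    float d = omega * std::tanh(nu * Hp)
            - (1.f - omega) * std::tanh(nu * Hn1);
    PseudoUpdate u;
    u.y      = (d >= 0.f) ? +1 : -1;  // pseudo-label, Eq. (4.3)
    u.lambda = std::fabs(d);          // importance weight, Eq. (4.4)
    return u;
}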
4.5 Comparison of Tracking Systems
Four different classification methods have been evaluated against each other: a fixed classifier, an incrementally learning classifier, and two methods with incorporated prior knowledge.

Off-line Classifier: This static classifier is trained with the best of the one-shot training methods (see Section 4.9) and is not updated any more afterwards. This classifier is also used as prior knowledge for the summation classifier and the semi-supervised classifier.
                   Off-line   On-line   Summation   Semi-supervised
Video “dog”
  Tracking Rate      0.99       0.99      0.99          0.99
  Precision          1          1         1             1
  Overlap µ          0.88       0.85      0.89          0.87
  Overlap σ          0.06       0.07      0.06          0.07
Video “cat”
  Tracking Rate      0.91       0.94      0.93          0.93
  Precision          0.91       0.94      0.93          0.93
  Overlap µ          0.72       0.76      0.78          0.75
  Overlap σ          0.28       0.25      0.26          0.27
Video “vehicle”
  Tracking Rate      0.12       0.12      0.84          0.96
  Precision          0.33       0.11      0.79          0.98
  Overlap µ          0.09       0.09      0.76          0.77
  Overlap σ          0.26       0.27      0.23          0.24
Video “sports”
  Tracking Rate      0.44       0.51      0.31          0.50
  Precision          0.53       0.62      0.38          0.61
  Overlap µ          0.54       0.67      0.35          0.55
  Overlap σ          0.43       0.31      0.45          0.44

Table 4.1: Comparison of different tracking methods based on Boosting
On-line Classifier: The on-line boosting classifier as proposed in [19]. This classifier incrementally learns the target object and is initialised with a simpler one-shot training method, as are all of the following classifiers.

Summation Classifier: This is a very simple method of incorporating prior knowledge into an adaptive classifier: an off-line and an on-line classifier are taken and their confidences summed up.

Semi-supervised Classifier: The classifier as described in Section 2.3. As with the summation classifier, the pre-trained off-line classifier is used as prior knowledge, which enables a fair comparison between the two.

Discussion

All tracking methods have been evaluated five times on the described videos for this comparison (see Table 4.1). The results show that the semi-supervised and the on-line classifier reach better results than the other two. In every case, the semi-supervised classifier achieves at least similar results to the other classifiers, but additionally has some advantages, such as reduced drifting and the ability to recover after a full occlusion of the tracked object.
All things considered, the new semi-supervised tracking approach reaches results comparable to the on-line classifier in every case, with the additional advantage of more robustness against drifting (see Figure 4.5 for difficult scenes in the videos). As the results show, the simple “dog” video does not cause problems for any of the tested tracking approaches; the different methods only fail on some frames where the dog is rotated out of plane. Processing the “cat” video also shows impressive results for all approaches, although this video includes difficult sequences where the toy cat is rotated nearly 90° out of plane. The freely adaptive on-line classifier does not perform that badly on the “vehicle” video, but only because a similar vehicle follows the tracked one and the classifier adapts itself to this new vehicle after the occlusion. The off-line classifier sometimes fails on this video because it jumps away from the target object during the occlusion phase and cannot recover afterwards due to the limited search region. The “sports” video is a very challenging image sequence, since it includes two very similar objects that cross each other's path; additionally, the original object undergoes large transformations. To cover these challenges, rotation and multiple-hypothesis tracking should be considered as extensions to the current tracking framework. In the current implementation, the tracking systems are not suitable for this video.
4.6 Long-Term Drifting Experiment
The semi-supervised on-line tracking algorithm has been introduced by Grabner et al. [20] in order to tackle the drifting problem. This problem is an effect of accumulated errors caused by misaligned update patches; these false updates force the classifier to drift away from the original object position more and more. To illustrate this effect, a long-term video has been processed by the on-line and the semi-supervised classifier. The video shows a very simple scene that does not change over time. To enforce drifting, the image quality is decreased internally and the number of used selectors (see Section 4.3) is also lower. Thus, this experiment does not really reflect real-world conditions, but it shows the problem of updating the object model without constraints.

Discussion

Since this illustrative experiment takes a very long time, only a few images from the sequence are shown for illustration. It can be clearly seen that the on-line classifier drifts away bit by bit, while the semi-supervised classifier stays correctly aligned to the target object. The drift of the on-line classifier increases over time until it has fully left the original object position (see Figure 4.6).
Figure 4.5: Tracking problems. (a) The off-line classifier cannot recognise the object any more due to a changed aspect ratio. (b) The on-line classifier selected another target in the search area as the object position due to changed illumination. (c) The object is not tracked correctly due to the aspect ratio and the illumination conditions. (d) The On-line and Summation Trackers have not recognised that they are lost. (e) The On-line and Summation Trackers are lost because of a wrong search position. (f) The On-line and Summation Trackers are tracking the wrong vehicle. (g) There are two similar objects in the search region of all tracking methods. (h) The On-line and Summation Trackers have been tracking the wrong player and adapted to her. (i) Only the off-line classifier can recover to the original player again; the Semi-supervised Tracker lost the object due to an out-of-plane rotation.
Figure 4.6: Long-term drifting experiment: (a) initial position, (b) frame 2000, (c) frame 4000, (d) frame 7000, (e) frame 10000, (f) frame 15000
4.7 Stability to misaligned updates
Similar to the previous experiment, it is interesting to see how the semi-supervised classifier reacts if update patches are misaligned. Since the update method uses, amongst others, the actual object location for updating, these updates are manually misaligned by 2 pixels to the lower right to simulate such misalignments. The same has been done with an on-line classifier, to clearly show the advantages of the semi-supervised approach. In contrast to the previous experiment, the standard settings have been used for both classifiers.

Discussion

Figure 4.7 clearly shows the stability of the semi-supervised tracker during the experiment.
Figure 4.7: Stability to misaligned updates: (a) initial position, (b) frame 150, (c) frame 300, (d) frame 450
The sensitivity of the on-line classifier to misaligned updates can also be seen; especially at the beginning of the experiment, this effect has a strong influence. It can be assumed that the semi-supervised classifier would also become misaligned if it were updated with misaligned patches only, over a very long time, but such a constant misalignment is rather unlikely in real-world applications.
4.8 Evaluation of Update Rules
The previous experiment evaluated the influence of misaligned update patches on the on-line boosting process. Until now, the update mechanism itself has not been inspected and discussed in detail. The main question is the selection of proper update patches to achieve a stable behaviour of the semi-supervised classifier. Since this is hard to predict due to the high randomness in the selection of the used image features, some simple update rules have been implemented and run on two of the evaluation videos. An illustration of the implemented update rules can be found in Section 3.3, Figure 3.4.

Original geometric updates: This is the standard update rule, also used in the previous work by Grabner et al. [19]. The edges of the search window and the current detection are chosen as update patches. This enables the classifier to train on the near surroundings of the object. In the case of semi-supervised boosting, no label is assigned to these patches by the update rule.
Geometric updates: Enhancing the first method somewhat, a more complex geometric bounding around the actual detection can be chosen for the selection of the update patches. In addition, the actual detection is used for alternating updates.

Random updates: This method uses randomly chosen patches from the search region for updates. It also includes the actual detection in addition.

Prefer near patches: This method sorts the patches according to their distance to the actual detection and prefers those that have a lower distance.

Prefer confident patches: This method works similarly to the last one; only the sorting criterion is exchanged. Patches that have been assigned a high confidence during the last evaluation phase are preferred for updates. These are most likely patches near the detected object location.

Update at decision boundary: This method also sorts the patches, but this time according to the distance of their confidence from zero. This means that patches near the decision boundary, where the classifier is unsure in its decision, are preferred.

Using a method that “prefers” better patches is always combined with sorting all patches by the given criterion and an index transformation. The index i is then calculated by

i = random(1, N)² / N

where random(1, N) is a function that chooses a number between 1 and N, which is the number of available patches. The probability of choosing a patch out of, for example, the best 10% is then about 30%. Figure 4.8 shows the progression of this function.

Discussion

The result table (Table 4.2) clearly shows that geometric updates are a preferable method to achieve high stability and accuracy of the classifier. This can easily be interpreted as prioritising the learning focus on the actual object appearance and the near background. Again, it is advisable to select a fixed strategy that involves no randomness, since randomness may cause unexpected errors. Another reason for taking a fixed geometric rule is the low computational cost, in contrast to confidence- or distance-based methods, which always involve sorting.
Figure 4.8: Preferring better patches for the update process is driven by a transformation from a uniformly distributed random number to the index
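A sketch of this index transformation: a uniformly distributed random number r ∈ [1, N] is squared and divided by N, which biases the resulting index towards the top of the sorted patch list. The clamping to the valid range is an implementation detail assumed here.

#include <cstdlib>

// Returns an index in [1, N] that prefers small values (better-ranked
// patches); with r uniform, P(i <= 0.1 N) = sqrt(0.1) ≈ 30%.
int preferredIndex(int N) {
    int r = 1 + std::rand() % N;  // uniform in [1, N]
    int i = (r * r) / N;          // biased towards low indices
    return (i < 1) ? 1 : i;       // clamp into [1, N]
}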
4.9 One-shot Training Methods
This section evaluates four different strategies for the prior training of a classifier from a single image. In this thesis, the obtained classifier is used for modelling the original object and for incorporating prior knowledge into the semi-supervised on-line classifier, but it can also be used to derive an on-line classifier that can be incrementally improved by further updates. Since an on-line framework is used in this thesis, the trained classifier is not exact and has to work with estimated values. It can be assumed that off-line training would lead to better classification results, with the drawback of a longer training period.
                 Simple   Geom.   Rand.   Dist.   Conf.   Bound.
Video “dog”
  Tracking Rate   0.95     0.99    0.99    0.99    0.93    0.99
  Precision       0.96     1       1       1       1       1
  Overlap µ       0.81     0.87    0.86    0.86    0.81    0.87
  Overlap σ       0.20     0.07    0.07    0.06    0.22    0.07
Video “cat”
  Tracking Rate   0.79     0.90    0.81    0.89    0.82    0.63
  Precision       0.89     0.91    0.82    0.89    0.83    0.96
  Overlap µ       0.59     0.73    0.68    0.64    0.65    0.52
  Overlap σ       0.35     0.29    0.36    0.29    0.35    0.42

Table 4.2: Comparison of different update methods
Nevertheless, some simple strategies (see Figure 4.9) have been evaluated against each other to demonstrate the performance of a classifier within the framework that is trained with fewer than 100 samples in less than a second. Transformations of the original image can increase the training success and the detection rate for features that are not invariant to them. Thereby, a more general training set can be established in a very simple way.
Figure 4.9: One-shot training strategies: (a) Strategy A, fixed rule; (b) Strategy B, fixed rule with transformations; (c) Strategy C, random transformations. The blue rectangle represents the object position.
Strategy A: The training patches are selected in the same way as for the usual updates of the on-line classifier. This means that no transformation is included and only slightly shifted patches are used for negative updates.

Strategy B: Rotation and/or scaling is included in the training process. The update patches are selected in the same way as in the first strategy, but the whole image is rotated or scaled with fixed parameters with respect to the original object location.

Strategy C: To include more diversity in the training examples, each sample is scaled and rotated with randomly chosen parameters. Since the background does not rotate in a usual environment, and for computational reasons, only the positive updates are transformed. The negative updates are randomly chosen from the near neighbourhood of the object location.

Discussion

Table 4.3 shows the results of one-shot trained static classifiers on two of the used evaluation videos. It can clearly be seen that the more diversity is included in the training set, the better the training result gets. However, when choosing between the random transformations and the fixed-rule transformations including scale and rotation, the latter should be preferred
                 Simple   Scale   Rotation   Rot.&Scale   Random
Video “dog”
  Tracking Rate   0.99     0.99    0.99       0.99         0.97
  Precision       0.99     0.99    1          1            0.98
  Overlap µ       0.83     0.84    0.86       0.87         0.82
  Overlap σ       0.07     0.09    0.05       0.06         0.16
Video “cat”
  Tracking Rate   0.93     0.96    0.75       0.97         0.87
  Precision       0.93     0.96    0.75       0.97         0.87
  Overlap µ       0.74     0.78    0.60       0.78         0.66
  Overlap σ       0.26     0.22    0.38       0.20         0.32

Table 4.3: Comparison of different one-shot training methods
due to its avoidance of randomness, which may select good values but offers no guarantee. The experiments clearly showed that after a few iterations with included transformations, the resulting classifier achieves rather good results. Therefore, the training phase can be reduced to less than a second, which corresponds to approximately 100 positive and negative updates in the used test configuration. Off-line training has not been considered due to its high computational cost and time effort.
4.10 Hot-swap of Prior Knowledge
During the experiment phase, the idea came up to switch the prior knowledge during the runtime of the classifier. This interesting question is strongly related to multi-view tracking. Figure 4.10 shows the development of the confidence and the overlap of two switched classifiers. Only the second classifier receives special training on the new target patch when the prior classifier is switched. For this experiment, a simple static scene showing a large texture was used. The two patches used for the prior classifiers are overlapping regions, which is meant to simulate the fact that 3D objects would also lead to overlapping classification views. After 100 frames, the prior knowledge of the semi-supervised classifier was exchanged with the previously trained classifier for the second patch, and the confidence and overlap have been measured.

The insights gained from this experiment have been used to implement a rudimentary multi-view tracking system. A switch of the prior knowledge always goes along with some initial training on the new target, to make use of the very fast adaptation of the classifier towards it. A very crucial point for such an application is the decision when the prior classifier should be exchanged. The implemented system simply evaluated all prior classifiers within the search area of the on-line classifier and switched to the classifier and position with the highest confidence measure.
Figure 4.10: Results of the switching experiment: (a) overlap and (b) confidence of the switched classifiers, with and without special training
                          Per unit   Regular Scan     Particle Filter
Patches                   1          3600 (60 × 60)   720 (20%)
Evaluate                  20 µs      72 ms            15 ms
Resample and Transition   10 µs      –                6 ms
Integral Image            –          20 ms            20 ms
Sum                       –          92 ms            41 ms

Table 4.4: Execution times (approximated)
Combined with this, a short training period on the new highest detection took place to adapt the on-line classifier to the new object appearance. The prior classifiers have been created with the best one-shot training strategy (see Section 4.9) and are therefore very specific to the current task. Although this is a very simple method for initialising and switching the prior knowledge, the results were impressive.
4.11 Particle Filter Speed-up
The main advantage of using a particle filter is the reduction of sampling patches. The time needed to evaluate this huge number of patches normally has a strong influence on the overall execution time. Since the number of patches can be reduced by 50% or more, the additional time for the resampling and transition of the particles is easily compensated. Another benefit of the particle filter is that the sampling effort can be kept constant, independent of the size of the object to track, which is not the case when simply expanding the search region around the object by a fixed factor. Comparing the two methods within the same execution environment shows a frame-rate increase of more than a factor of 2, effectively achieving more than 20 frames per second on a test system with a 3 GHz single-core CPU and 1 GB of memory running Windows XP (see Table 4.4). Figure 4.11 shows the particle filtering algorithm applied to the “cat” video.

Using the particle filtering technique for motion estimation yields results quite similar to those of the original regular scan method (see Table 4.5). The lower tracking rate on the “cat” video results from a lower stability of the particle filtering. With further enhancements, such as a dynamic number of particles, the additional stability can be achieved. This would influence the achieved speed-up, but the particle filter is faster than the regular scan approach in any case.
Figure 4.11: Tracking with particle filtering: (a) initial position and initial particle distribution; (b) tracking a moving object: the particle filter predicts the object position, so the particles are centered in front of the moving object; (c) object lost: particles are spread to extend the search area; (d) false detection: particles are spread on another object
                 Regular Scan   Particle Filter
Video “dog”
  Tracking Rate    0.99           0.99
  Precision        1              0.99
  Overlap µ        0.87           0.85
  Overlap σ        0.07           0.09
Video “cat”
  Tracking Rate    0.93           0.90
  Precision        0.93           0.96
  Overlap µ        0.75           0.69
  Overlap σ        0.27           0.26

Table 4.5: Comparison of regular scan and particle filtering for motion estimation of the tracked object
Chapter 5
Conclusion

The work conducted for this thesis and the experiments have clearly shown the advantages of the semi-supervised tracking approach proposed by Grabner et al. [20]. The use of prior information to constrain the adaptivity of the object model is a very common idea, and many approaches have been proposed in this field of study within the last years. The major advantage of this approach is the simplicity with which the prior knowledge is incorporated into the on-line learning mechanism. Another benefit is the integration of labeled and unlabeled data.

All things considered, this approach is very interesting with respect to two major problems of adaptive object tracking. First, there is always a lack of labeled data, and it is a big advantage to be able to easily include unlabeled data as well. The second big advantage is the robustness to drifting caused by erroneous updates: feeding bad examples into the update mechanism does not considerably influence the quality of the classifier, since they are automatically weighted down by the prior knowledge.

Due to the fact that this tracking approach is based on an incrementally adapted detection mechanism, the described technique is not restricted to object tracking but also works for many other detection applications. This wide applicability offers a promising future for this approach.
5.1 Future Work
This thesis showed the capability of semi-supervised boosting in the field of object tracking. Nevertheless, each of the compared approaches has its specific drawbacks and can be further extended. In the case of the described tracking application, further investigations should be carried out on several aspects.
5.1.1 Image Features
A main drawback of this approach is the randomised feature initialisation. Although the power of diversity and feature exchange during the tracking
period cannot be denied, the feature initialisation phase is based on the assumption that at least a few adequate features are created that are able to track the present object. According to the optimisation criterion, the best features are then selected out of the feature pool in each selector. This method seems convenient for the selection of the features, but it does not provide any control over the quality of the provided feature pool. Especially in the case of fast visual one-shot learning, such control is a required property of the feature initialisation. Another point concerning the used image features is better support for other feature types in the framework; this is currently an ongoing process. An interesting investigation would also be spatially flexible features that are not strictly positioned and fixed in size within the image patch. This would enable a better adaptation of the image features to the observed object, especially in the case of Haar-like features, but it may introduce other problems that would hinder this idea from being included easily.
5.1.2 Boosting Algorithm
A major improvement to the boosting algorithm used in the framework for feature selection would be the extension to real boosting [46]. This means that the weak classifiers are extended to return confidence measures instead of class labels. Another improvement would be to use information on the category of the tracked object, if it is known beforehand. Then, off-line training can be performed to obtain an off-line boosting classifier that can be refined on-line. This is also an option for the initialisation of the feature pool.
5.1.3 Extended Search Space
To further improve the tracking result, the search space should be extended to rotation and scale in order to handle these types of object transformation naturally. This improvement can be combined with the particle filter to preserve the real-time capabilities of the tracking system.
5.1.4 Multi-view Tracking
Based on the results of Section 4.10, further investigations should be done to implement a multi-view tracking application. To enable a better switching mechanism, a multi-view prior classifier should be used (e.g., [57]). This would enable adaptive multi-view tracking.
5.1.5 Speed Optimisation
A very popular method of increasing the evaluation speed of boosting classifiers is cascading, as used in [54]. To incorporate this approach into the boosting framework, considerations regarding the abort criterion are necessary.
How many of the classifier stages have to be evaluated until an image patch can be rejected? A more sophisticated method could be WaldBoost [49]. Using optimised basis functions would be another idea to increase the overall system performance. Thus, an increased use of standard libraries such as OpenCV¹ would be advantageous. This includes, amongst others, the calculation of the integral image, which is used for the evaluation of Haar-like features. Another benefit of using such standard libraries is their optimisation for current multi-core environments, for example with IPP², which is integrated into OpenCV. Analysing the implemented software with a performance analyser such as VTune³ can also lead to much faster execution. Since the evaluation function is one of the most time-consuming parts of the application due to the large number of calls, a small reduction of its execution time results in a considerably improved performance.
¹ OpenCV: http://sourceforge.net/projects/opencvlibrary/, 01.10.2008
² Intel® Integrated Performance Primitives: http://www.intel.com, 01.10.2008
³ Intel® VTune™ Performance Analyzer: http://www.intel.com, 01.10.2008
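As an illustration of the abort criterion discussed in Section 5.1.5, the following sketch evaluates a boosted classifier stage by stage and rejects a patch as soon as the accumulated confidence falls below a per-stage threshold, in the spirit of the cascades of Viola and Jones [54]. The thresholds are assumed to be learned beforehand; all names are hypothetical.

#include <cstddef>
#include <vector>

// Returns true if the patch survives all stages; evaluation aborts early
// as soon as the running confidence drops below the stage's threshold.
bool evaluateCascaded(const std::vector<float>& stageConfidences,
                      const std::vector<float>& rejectThresholds) {
    float H = 0.f;
    for (std::size_t n = 0; n < stageConfidences.size(); ++n) {
        H += stageConfidences[n];
        if (H < rejectThresholds[n])  // abort: patch is rejected early
            return false;
    }
    return true;
}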
Bibliography

[1] Avidan, S. (2004). Support vector tracking. IEEE Trans. on Pattern Analysis and Machine Intelligence, 26:1064–1072.

[2] Avidan, S. (2005). Ensemble tracking. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, volume 2, pages 494–501.

[3] Bevilacqua, A., Stefano, L. D., and Vaccari, S. (2005). Occlusion robust vehicle tracking based on SOM (self-organizing map). In Proc. IEEE Workshop on Motion and Video Computing (WACV/MOTION '05), volume 2, pages 84–89.

[4] Bishop, C. M. (2006). Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA.

[5] Bosch, A., Zisserman, A., and Muñoz, X. (2007). Image classification using random forests and ferns. In Proc. IEEE Intern. Conf. on Computer Vision, pages 1–8.

[6] Bradski, G. (1998). Computer vision face tracking for use in a perceptual user interface. Intel Technology Journal.

[7] Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2):123–140.

[8] Broida, T. J. and Chellappa, R. (1986). Estimation of object motion parameters from noisy images. IEEE Trans. on Pattern Analysis and Machine Intelligence, 8(1):90–99.

[9] Chapelle, O., Schölkopf, B., and Zien, A., editors (2006). Semi-Supervised Learning. MIT Press, Cambridge, MA.

[10] Chen, Y., Yu, S., Sun, W., and Chen, X. (2008). Object tracking using an improved kernel method. In Proc. International Conf. on Embedded Software and Systems, pages 511–515.
[11] Comaniciu, D. and Meer, P. (2002). Mean shift: a robust approach toward feature space analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence, pages 603–619. [12] Comaniciu, D., Ramesh, V., and Meer, P. (2000). Real-time tracking of non-rigid objects using mean shift. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, volume 2, pages 142–149. [13] Comaniciu, D., Ramesh, V., and Meer, P. (2003). Kernel-based object tracking. IEEE Trans. on Pattern Analysis and Machine Intelligence, pages 564–577. [14] Drucker, H., Cortes, C., Jackel, L. D., LeCun, Y., and Vapnik, V. (1994). Boosting and other ensemble methods. Neural Comput., 6(6):1289–1301. [15] Freund, Y. and Schapire, R. E. (1995). A decision-theoretic generalization of on-line learning and an application to boosting. In EuroCOLT ’95: Proceedings of the Second European Conference on Computational Learning Theory, pages 23–37, London, UK. Springer. [16] Freund, Y. and Schapire, R. E. (1999). A brief introduction to boosting. In In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, pages 1401–1406. Morgan Kaufmann. [17] Grabner, H., Beleznai, C., and Bischof, H. (2005). Improving adaboost detection rate by wobble and mean shift. In Proc. Computer Vision Winter Workshop, pages 23–32. [18] Grabner, H. and Bischof, H. (2006). On-line boosting and vision. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 260–267, Washington, DC, USA. IEEE Computer Society. [19] Grabner, H., Grabner, M., and Bischof, H. (2006). Real-time tracking via on-line boosting. In Proc. British Machine Vision Conf., page I:47. [20] Grabner, H., Leistner, C., and Bischof, H. (2008). Semi-supervised online boosting for robust tracking. In Proc. European Conf. on Computer Vision. [21] Hager, G. D. and Belhumeur, P. N. (1996). Real-time tracking of image regions with changes in geometry and illumination. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, page 403, Washington, DC, USA. IEEE Computer Society. [22] Jepson, A., Fleet, D., and El-Maraghi, T. (2003). Robust online appearance models for visual tracking. IEEE Trans. on Pattern Analysis and Machine Intelligence, pages 1296–1311.
[23] Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Transactions of the ASME – Journal of Basic Engineering, pages 35–45.

[24] Kang, J., Cohen, I., and Medioni, G. (2004). Object reacquisition using invariant appearance model. In Proc. Intern. Conf. on Pattern Recognition, pages 759–762, Washington, DC, USA. IEEE Computer Society.

[25] Leistner, C., Grabner, H., and Bischof, H. (2008). Semi-supervised boosting using visual similarity learning. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition.

[26] Levi, K. and Weiss, Y. (2004a). Learning object detection from a small number of examples: the importance of good features. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages II:53–60.

[27] Levi, K. and Weiss, Y. (2004b). Learning object detection from a small number of examples: The importance of good features. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages II:53–60.

[28] Li, Y., Ai, H., Yamashita, T., Lao, S., and Kawade, M. (2008). Tracking in low frame rate video: A cascade particle filter with discriminative observers of different life spans. IEEE Trans. on Pattern Analysis and Machine Intelligence, pages 1728–1740.

[29] Lim, J., Ross, D. A., Lin, R.-S., and Yang, M.-H. (2005). Incremental learning for visual tracking. In Saul, L. K., Weiss, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems 17. MIT Press, Cambridge, MA.

[30] Mäenpää, T. (2003). The local binary pattern approach to texture analysis – extensions and applications. PhD thesis, University of Oulu. Acta Univ. Oul. C 187.

[31] Mallapragada, P. K., Jin, R., Jain, A. K., and Liu, Y. (2007). SemiBoost: Boosting for semi-supervised learning. Technical report, Department of Comp. Science and Engineering, Michigan State University.

[32] Maskell, S. and Gordon, N. (2001). A tutorial on particle filters for on-line nonlinear/non-Gaussian Bayesian tracking. In Target Tracking: Algorithms and Applications (Ref. No. 2001/174), IEE, volume 2, pages 2/1–2/15.

[33] Matthews, L., Ishikawa, T., and Baker, S. (2004). The template update problem. IEEE Trans. on Pattern Analysis and Machine Intelligence, 26(6):810–815.
[34] Mikolajczyk, K. and Schmid, C. (2005). A performance evaluation of local descriptors. IEEE Trans. on Pattern Analysis and Machine Intelligence, 27(10):1615–1630. [35] Minyoung, K., Kumar, S., Pavlovic, V., and Rowley, H. (2008). Face tracking and recognition with visual constraints in real-world videos. Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 1–8. [36] Niyogi, P., Girosi, F., and Poggio, T. (1998). Incorporating prior information in machine learning by creating virtual examples. Proceedings of the IEEE, 86(11):2196–2209. [37] Ojala, T., Pietikainen, M., and Harwood, D. (1996). A comparative study of texture measures with classification based on feature distributions. Pattern Recognition, 29(1):51–59. [38] Oren, M., Papageorgiou, C., Sinha, P., Osuna, E., and Poggio, T. (1997). Pedestrian detection using wavelet templates. Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 193–199. [39] Osuna, E., Freund, R., and Girosi, F. (1997). Training support vector machines: An application to face detection. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 130–136. [40] Oza, N. (2001). Online Ensemble Learning. PhD thesis, University of California, Berkeley. [41] Papageorgiou, C., Oren, M., and Poggio, T. (1998). A general framework for object detection. Proc. IEEE Intern. Conf. on Computer Vision, pages 555–562. [42] Porikli, F. (2005). Integral histogram: A fast way to extract histograms in cartesian spaces. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, volume 1, pages 829–836. [43] Porikli, F. and Tuzel, O. (2005). Multi-kernel object tracking. Multimedia and Expo, 2005. ICME 2005. IEEE International Conference on, pages 1234–1237. [44] Ryoo, M. and Aggarwal, J. (2008). Observe-and-explain: A new approach for multiple hypotheses tracking of humans and objects. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 1–8. [45] Schapire, R. E. and Singer, Y. (1999a). Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297–336. [46] Schapire, R. E. and Singer, Y. (1999b). Improved boosting using confidence-rated predictions. Machine Learning, 37(3):297–336.
[47] Schweitzer, H., Bell, J., and Wu, F. (2002). Very fast template matching. In Proc. European Conf. on Computer Vision, page IV: 358 ff. [48] Shi, J. and Tomasi, C. (1994). Good features to track. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition. [49] Sochman, J. and Matas, J. (2005). Waldboost - learning for time constrained sequential detection. Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, 2:150–156 vol. 2. [50] Tieu, K. and Viola, P. (2000). Boosting image retrieval. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 228–235. [51] Tomasi, C. and Kanade, T. (1991). Shape and motion from image streams: A factorization method part 3 - detection and tracking of point features. In Technical Report, School of Computer Science, Carnegie Mellon University. [52] Vapnik, V. (1999). An overview of statistical learning theory. Neural Networks, IEEE Transactions on, 10(5):988–999. [53] Veenman, C., Reinders, M., and Backer, E. (2001). Resolving motion correspondence for densely moving points. IEEE Trans. on Pattern Analysis and Machine Intelligence, 23(1):54–72. [54] Viola, P. and Jones, M. (2001). Rapid object detection using a boosted cascade of simple features. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, volume 1, pages 511–518. [55] Yilmaz, A., Javed, O., and Shah, M. (2006). Object tracking: A survey. ACM Computer Survey, 38(4):13. [56] Yilmaz, A., Li, X., and Shah, M. (2004). Contour-based object tracking with occlusion handling in video acquired using mobile cameras. IEEE Trans. on Pattern Analysis and Machine Intelligence, 26(11):1531–1536. [57] Zhang, H., Gao, W., Chen, X., Shan, S., and Zhao, D. (2006). Robust multi-view face detection using error correcting output codes. In Proc. European Conf. on Computer Vision, pages IV: 1–12. [58] Zhu, X. (2005). Semi-supervised learning literature survey. Technical Report 1530, Comp. Sciences, University of Wisconsin-Madison.