
Appears in Proceedings of International Conference on Computer Vision, Corfu, Greece, September 1999

A Pattern Classification Approach to Dynamical Object Detection

Constantine Papageorgiou and Tomaso Poggio
Center for Biological and Computational Learning & Artificial Intelligence Laboratory
45 Carleton Street, E25-201, Cambridge, MA 02142

Abstract

Current systems for object detection in video sequences rely on explicit dynamical models such as Kalman filters or hidden Markov models. There is significant overhead in the development of such systems, as well as the a priori assumption that the object dynamics can be described by such a model. This paper describes a new pattern classification technique for object detection in video sequences that uses a rich, overcomplete dictionary of wavelet features to describe an object class. Unlike previous work where a small subset of features was selected from the dictionary, this system does no feature selection and learns the model in the full 1,326 dimensional feature space. Comparisons using different-sized sets of several types of features are given. We extend this representation into the time domain without assuming any explicit model of dynamics. This data-driven approach produces a model of the physical structure and short-time dynamical characteristics of people from a training set of examples; no assumptions are made about the motion of people, only that short sequences characterize their dynamics sufficiently for the purposes of detection. One of the main benefits of this approach is that transient false positives are reduced. The technique compares favorably with the static detection approach and could be applied to other object classes. We also present a real-time version of one of our static people detection systems.

1 Introduction

Object detection in video sequences is of fundamental importance for many areas of image processing. For a face recognition based access system, the face must first be detected before it is recognized; for autonomous navigation, obstacles and landmarks need to be detected and subsequently classified; detecting different classes of objects is paramount for indexing into image and video databases. The basic problem we tackle can be formulated as follows: how can we reliably detect a certain class of objects in video sequences of unconstrained, cluttered scenes? As a testbed for our system, we focus primarily on people detection.

There are two basic angles to this problem: static images or video sequences. If we would like to detect objects in static images, the problem becomes a pure pattern classification task; the system must be able to differentiate between the objects of interest and "everything else". If, on the other hand, the problem is to detect objects in video sequences, there is a richer set of information available, namely the dynamical information inherent in the video sequence. For a general purpose system that does not make limiting assumptions about the objects, we cannot, however, rely exclusively on motion information; what if the particular scene shows a group of people standing at a bus stop? A system that relies on motion to detect people would clearly fail in this case. What we need is a technique that uses a model rich enough to both a) describe the object class to the degree that it can effectively model any of the possible shapes, poses, colors, and textures of the object, and b) harness the dynamical information inherent in video sequences, without making any underlying assumptions about the actual dynamics of the objects.

To achieve these goals, we use a learning-based approach: the system automatically learns what constitutes a person and their dynamics. We build on previous work describing a system for object detection in static images [16] [14] [17] that has shown a high degree of performance. That system is a learning-based approach that uses a rich, overcomplete dictionary of wavelet features; it makes no assumptions about the scene structure and does not use any motion or tracking information. Unlike the previous systems, which reduced the dimensionality of the learning task through a feature selection step, the system we describe here does no feature selection. Instead, the entire set of features (O(10^3)) is used to train a support vector machine (SVM) classifier, which can effectively handle high-dimensional feature spaces populated by small data sets.

We provide an empirical comparison showing how qualitatively different classes of features, as well as different numbers of these features, lead to detection systems with significantly different performance. We then modify this approach to represent dynamic information by extending the static representation into the time domain. One of the most important factors motivating this extension is that for the static system, false positives often appear but are typically transient. With the new representation, the system learns certain dynamical characteristics of people, motion or no motion: it learns what a person looks like and what constitutes valid dynamics over short time sequences, without the need for explicit models of either shape or dynamics. The only assumption we make is that short sequences characterize the motion of people sufficiently for the purposes of detection. We show that this technique indeed reduces the false positive rate.

2 Related work

Much of the previous work in object detection and recognition in video sequences has focused on people detection. Here, we review some of the work relevant to this paper. [20] [11] use simple geometrical models and motion to detect and analyze people. [9] and [18] develop systems using 3D cylindrical models; the latter also uses kinematic motion data. [22] describe a system for real-time tracking of the human body that uses a maximum a posteriori approach to segment a body into blobs and track them, but it assumes that the camera and background are fixed and that there is a single person in the image. [1] uses a combination of probabilistic approaches, Kalman filters, and hidden Markov models to segment and recognize different human motions. [13] describe a system that tracks multiple people in video and automatically detects a face for each person found in the images. [6] present a system for real-time detection and tracking of people; here, they assume knowledge of the background. To track moving objects, [7] use clusters of consistent color; however, an initial manual labeling step is required. This system has been combined with a time delay neural network to detect and recognize pedestrians [8]. [4] present a system for recognizing actions based on time-weighted binarized motion images taken over relatively short sequences; the system is not used for detection, however. All these systems have succeeded to varying degrees but have relied on one or more of the following restrictive features:

- explicit modeling of the domain;
- a stationary camera and a fixed background;
- the assumption that the person is moving;
- assumptions about the scene structure;
- tracking of objects, rather than detection of specific classes.

This work overcomes these problems by describing an example-based approach that learns to recognize patterns in short video sequences and avoids the explicit use of motion and segmentation. An important characteristic of our system is that, even though it uses dynamical information for detection, it makes no assumptions about the underlying dynamics; there is no explicit model.

3 The static detection system

The object detection system we develop does not undergo any feature selection step. This is in contrast to previously reported results [16] [14] [17], where a small subset of features was extracted from a large feature dictionary. In this section we give an overview of the system and highlight the increased accuracy when compared to systems doing feature selection. In addition, we present results of using different qualitative classes of features.

Figure 1. ROC curves for different detection systems. The detection rate is plotted against the false positive rate, measured on a logarithmic scale. The false positive rate is defined as the number of false detections per inspected window.

3.1 Representation

One of the key issues in the development of an object detection system is the representation of the object class. The patterns are 128 × 64 RGB images where the people have been aligned in the center of the images and can be in many different poses: frontal, rear, side walking, side standing. Typical images of people show a great deal of variability in color, texture, and pose, as well as the lack of a consistent background. Our challenge is to develop a representation that achieves high inter-class variability with low intra-class variability.

To motivate our choice of representation, we can start by considering several traditional representations. Pixel-based and color region-based approaches are likely to fail because of the high degree of variability in the color and the number of spurious patterns. Traditional fine-scale edge based representations are also unsatisfactory due to the large degree of noise in these edges. The representation that we use is an overcomplete dictionary of Haar wavelets, in which there is a large set of features that respond to local intensity differences at several orientations. We present an overview of this representation here; details can be found in [12] [19] [16] [14].

For a given pattern, the wavelet transform computes the responses of the wavelet filters over the image. Each of the three oriented wavelets – vertical, horizontal, and diagonal – is computed at several different scales, allowing the system to represent everything from coarse scale features down to fine scale features. In our system for people detection, we use the scales 32 × 32 and 16 × 16. In the traditional wavelet transform, the wavelets do not overlap; they are shifted by the size of the support of the wavelet in x and y. To achieve better spatial resolution and a richer set of features, our transform shifts each wavelet by 1/4 of the size of its support, yielding an overcomplete dictionary of wavelet features. This results in a 1,326 dimensional feature vector for each pattern, which is used as training data for our classification engine. There is certain a priori knowledge embedded in our choice of the wavelets. First, we use the absolute values of the magnitudes of the wavelets; this tells the system that a dark body on a light background and a light body on a dark background have the same information content. Second, we compute the wavelet transform for a given pattern in each of the three color channels and then, for a wavelet of a specific location and orientation, we use the one that is largest in magnitude. This allows the system to use the most visually significant features.

3.2 Support vector machine classification

The support vector machine (SVM) is a technique for training classifiers that is well-founded in statistical learning theory; for details, see [21] [3]. One of the main attractions of using SVMs is that they are capable of learning in sparse, high-dimensional spaces with very few training examples. SVMs accomplish this by minimizing a bound on the empirical error and the complexity of the classifier at the same time. This concept is formalized in the theory of uniform convergence in probability:

    R(\alpha) \le R_{\mathrm{emp}}(\alpha) + \Phi\!\left(\frac{h}{\ell}, \frac{-\log(\eta)}{\ell}\right)    (1)

with probability 1 − η. Here, R(α) is the expected risk, R_emp(α) is the empirical risk, ℓ is the number of training examples, and h is the VC dimension of the classifier being used. This leads directly to the principle of structural risk minimization, whereby we attempt to minimize at the same time both the actual error over the training set and the complexity of the classifier; this bounds the generalization error as in Equation 1. It is exactly this technique that support vector machines approximate. This control of both the training set error and the classifier's complexity has allowed support vector machines to be successfully applied to very high dimensional learning tasks; [10] presents results on SVMs applied to a 10,000 dimensional text categorization problem and [15] show a 283 dimensional face detection system. We will make use of this ability to apply SVMs to very high dimensional classification problems when we describe our dynamic object detection technique later in this paper. Using the SVM formulation, the classification step for a pattern x using a polynomial of degree two is as follows:

    f(x) = \theta\!\left( \sum_{i=1}^{N_s} \alpha_i y_i \, (x \cdot x_i + 1)^2 + b \right)    (2)

where N_s is the number of support vectors (training data points that define the decision boundary) and the α_i are Lagrange parameters.
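As a concrete illustration, the classification step of Equation 2 can be sketched in a few lines of code. The support vectors, labels, multipliers, and bias below are hypothetical toy values in two dimensions, not parameters of the trained 1,326 dimensional system:

```python
# Sketch of the SVM decision function of Equation 2 with a degree-two
# polynomial kernel: f(x) = theta( sum_i alpha_i * y_i * (x . x_i + 1)^2 + b ).
# All values below are hypothetical stand-ins; the real system uses the
# ~1,000 support vectors learned from the wavelet feature vectors.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def svm_classify(x, support_vectors, labels, alphas, b):
    """Return +1 (person) or -1 (non-person) for feature vector x."""
    activation = sum(
        alpha * y * (dot(x, sv) + 1) ** 2
        for alpha, y, sv in zip(alphas, labels, support_vectors)
    ) + b
    return 1 if activation >= 0 else -1  # theta: threshold at zero

# Toy example in 2 dimensions instead of 1,326:
support_vectors = [[1.0, 1.0], [-1.0, -1.0]]
labels = [1, -1]
alphas = [0.5, 0.5]
b = 0.0
print(svm_classify([0.9, 1.1], support_vectors, labels, alphas, b))  # -> 1
```

At detection time this function is evaluated once per candidate window, which is why the number of support vectors and the feature dimensionality dominate the running cost (see Section 5).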

3.3 Detecting people in new images

To detect pedestrians in a new image, we shift the 128 × 64 detection window over all locations in the image. This will only detect pedestrians at a single scale, however. To achieve multi-scale detection, we incrementally resize the image and run the detection window over each of these resized images. This brute force search over the image is quite time consuming; several methods can be used to reduce the computation (see Section 5).
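The 1,326 dimensional feature count follows directly from the geometry described in Section 3.1: wavelets of support 16 × 16 and 32 × 32, shifted by 1/4 of their support over the 128 × 64 window, at three orientations. A small sketch of this enumeration (positions only; the filter responses themselves are omitted):

```python
# Enumerate the overcomplete Haar dictionary for a 128 x 64 window:
# wavelets of support 16x16 and 32x32, shifted by 1/4 of their support,
# at three orientations (vertical, horizontal, diagonal).

WINDOW_H, WINDOW_W = 128, 64
ORIENTATIONS = 3

def dictionary_size(scales=(16, 32)):
    total = 0
    for s in scales:
        step = s // 4                      # overlap: shift by 1/4 of support
        ny = (WINDOW_H - s) // step + 1    # vertical positions
        nx = (WINDOW_W - s) // step + 1    # horizontal positions
        total += ORIENTATIONS * ny * nx
    return total

print(dictionary_size())  # -> 1326, the dimensionality quoted in the text
```

The 16 × 16 wavelets contribute 3 × 13 × 29 = 1,131 features and the 32 × 32 wavelets contribute 3 × 5 × 13 = 195, for the 1,326 total.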

3.4 Feature comparison

In Section 3.1, we discussed the characteristics of the class of features the system extracts. Here, we provide empirical results showing that using all the features leads to a higher-performing system than reducing the dimensionality of the representation through feature selection. To determine the performance of a detection system, it is necessary to analyze a full ROC curve that shows the tradeoff between accuracy and the rate of false positives. The system is trained over a database of 1,848 positive patterns and 7,189 negative patterns. We emphasize that our ROC curves are computed over an out-of-sample test set gathered outdoors and over the Internet. Figure 1 compares the ROC curves of several different incarnations of our system:

- color processing with 29 features using a homogeneous polynomial of degree two;
- color processing with 29 features using a polynomial of degree two;
- color processing with 29 features using a polynomial of degree three;
- grey-level processing with 29 features using a homogeneous polynomial of degree two;
- grey-level processing with 29 features using a polynomial of degree two;
- grey-level processing with 29 features using a polynomial of degree three;
- color processing with all 1,326 features using a polynomial of degree two.

From the ROC curves, it is clear that most of the impact on performance comes from the class of features that are used; the complexity of the classifier is not as significant. As expected, using color features results in a more powerful system. The curve of the system with no feature selection is clearly superior to all the others, indicating that for the best accuracy, using all the features is optimal. When classifying with this full set of features, we pay for the accuracy with a slower system. It may be possible to achieve the same performance as the 1,326 feature system with fewer than the entire set of features; this is an open research question.

4 The dynamic detection system

As stated in the introduction, our goal is to develop a detection system for video sequences that makes as few assumptions as possible; we do not want to develop an explicit model of the shape of people or outwardly model their possible motions in any way. We would like a technique that implicitly generates a model of both the shape and the valid dynamical characteristics of people at the same time from a set of training data. This should be accomplished without assuming that human motion can be approximated by a linear Kalman filter, described by a hidden Markov model, or captured by any other explicit dynamical model. The only assumption we make is that five consecutive frames of an image sequence contain characteristic information regarding the dynamics of how people appear in video sequences. From a set of training data, the system will learn exactly what constitutes a person and how people typically appear in these short time sequences. Instead of using a single 128 × 64 pattern from one image as a training example, our new approach takes the 128 × 64 patterns at a fixed location in five consecutive frames, computes the 1,326 features for each of these patterns, and concatenates them into a single 6,630 dimensional feature vector for use in the support vector training. We use images t−4, t−3, t−2, t−1, t, where the person is aligned in the center of the image at time t. Figure 2 shows several example sequences from the training set. The full training set is composed of 1,379 positive examples and 3,822 negative examples. The extension to detecting people in new images is straightforward: for each candidate pattern in a new image, we concatenate the wavelet features computed for that pattern with the wavelet features computed at that location in the previous four frames. The full feature vector is subsequently classified by the SVM.
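The construction of the dynamic feature vectors is simple concatenation. The sketch below assumes a hypothetical `wavelet_features` routine standing in for the Haar feature extraction of Section 3.1 (here it returns a dummy 1,326-vector):

```python
# Build the 6,630 dimensional dynamic feature vector for a window location
# by concatenating static wavelet features from frames t-4 ... t.
# `wavelet_features` is a hypothetical placeholder for the real Haar
# feature extraction; it returns a dummy 1,326 dimensional vector.

STATIC_DIM = 1326

def wavelet_features(frame, location):
    return [0.0] * STATIC_DIM  # placeholder for the real wavelet responses

def dynamic_features(frames, t, location):
    """Concatenate features over frames t-4 .. t into one vector."""
    vec = []
    for k in range(t - 4, t + 1):
        vec.extend(wavelet_features(frames[k], location))
    return vec

frames = [None] * 5           # five dummy frames
v = dynamic_features(frames, 4, (0, 0))
print(len(v))  # -> 6630
```

The same routine is used at detection time: for each candidate window, the four previous frames' cached feature vectors are reused, so only the current frame's features need to be computed.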
We emphasize that it is the implicit ability of the support vector machine classification technique to handle small sets of data that sparsely populate a very high-dimensional feature space that allows us to tackle this problem. In developing this type of representation, we expect that the following dynamical information will be evident in the training data and therefore encapsulated in the classifier:

- people usually display smooth motion or are stationary;
- people do not spontaneously appear or disappear from one frame to another;
- camera motion is usually smooth or stationary.

Figure 2. Example image sequences that are used to train our dynamic detection system.

One of the primary benefits of this technique is that it extends the rich feature set into the time dimension: the system detects people with high accuracy while reducing the transient false positives that would normally appear when using the static detection system. This is purely a data-driven pattern classification approach to dynamical detection. We compare this approach to the static detection system, trained with the individual images corresponding to frame t in each of the sequences; thus there are 1,379 positive and 3,822 negative 1,326 dimensional feature vectors as training for the static detection system. Both the static and dynamic systems are tested on the same out-of-sample sequence.

Figure 3. ROC curves for the static and dynamic detection systems. The detection rate is plotted against the false positive rate, measured on a logarithmic scale. The false positive rate is defined as the number of false detections per inspected window.

Figure 3 shows the ROC curves for the two systems. From the ROC curves, we see that the system that has incorporated dynamical information performs better than the system that uses only static patterns. This is all the more impressive given that the dynamic system does its classification in a 6,630 dimensional space with only 5,201 training examples. It is important to note that our features are not 3D wavelets in space and time; what we have done is take a set of 2D wavelet features spread through time and use these to develop our model. One extension of our system that we would like to pursue is to use 3D wavelets as features. Such a system would learn the dynamics as a set of displacements and may therefore generalize better.

5 A real, real-time system

As an alternative to dynamic detection strategies, we can use a modified version of our static detection system to achieve real-time performance. This section describes a real, real-time application of our technology as part of a larger system for driver assistance; the combined system, including our people detection module, is currently deployed "live" in a DaimlerChrysler S Class demonstration vehicle. The remainder of this section describes the integrated system.

5.1 Speed optimizations

Our original, unoptimized static detection system for people detection in color images processes sequences at a rate of 1 frame per 20 minutes; this is clearly inadequate for any real-time automotive application. We have implemented optimizations that have yielded several orders of magnitude of speedup.

Subset of 29 features: instead of using the entire set of 1,326 wavelet features, we use just 29 of the more important features, which encode the outline of the body. This changes the 1,326 dimensional inner product in Equation 2 into a 29 dimensional dot product.

Reduced set vectors: from Equation 2, we can see that the computation time is also dependent on the number of support vectors, N_s; in our system, this is typically on the order of 1,000. We use results from [2] to obtain an equivalent decision surface in terms of a small number of synthetic vectors. This method yields a new decision surface that is equivalent to the original one but uses just 29 vectors.

Gray level images: our use of color images is predicated on the fact that the three color channels (RGB) contain a significant amount of information that gets washed out in grey level images of the same scene. This use of color information comes at significant computational cost: the resizing and Haar transform operations are performed on each color channel separately. To improve system speed, we modify the system to process intensity images.
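The combined effect of the first two optimizations can be seen with back-of-the-envelope arithmetic. The multiply counts below are an illustrative cost model (one multiply-add per vector component per kernel evaluation), not measured figures:

```python
# Rough cost of evaluating Equation 2, counting only the multiply-adds in
# the kernel inner products (an illustrative model, not a measurement).

def eval_cost(num_vectors, dim):
    return num_vectors * dim

full = eval_cost(1000, 1326)      # ~1,000 support vectors, 1,326 features
reduced = eval_cost(29, 29)       # 29 reduced set vectors, 29 features
print(full // reduced)            # -> 1576
```

On this model, the two optimizations together cut the per-window kernel arithmetic by a factor on the order of 1,500, before even counting the savings from processing a single intensity channel instead of three color channels.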

5.2 Integration with the DaimlerChrysler Urban Traffic Assistant

The DaimlerChrysler Urban Traffic Assistant (UTA) is a real-time vision system for obstacle detection, recognition, and tracking [5]. The system uses stereo vision to detect and segment obstacles and provides an estimate of the distance to each obstacle. We can use this information as a focus-of-attention mechanism for our people detection system: knowing the location and approximate size of an obstacle allows us to target the people detection system at relatively small regions and just a few sizes of people. The combined system runs at more than 10 Hz, and the portion of the total system time spent in our pedestrian detection module is 15 ms per obstacle. An analysis of how much time is taken by each portion of the pedestrian detection module shows that the smallest amount of time is spent in the SVM classification. This bodes well for improving performance: we should be able to use a much richer set of features than the 29 currently used, perhaps on the order of a few hundred features, without significantly degrading the speed of the system.

6 Conclusion

We have described a new technique for object detection in video sequences. The technique is purely data-driven and makes no explicit assumptions about the dynamics or motion of people, only that short sequences characterize the dynamics sufficiently for the purposes of detection; in this manner, we avoid the need for an explicit underlying model of human motion and implicitly derive the model from a set of training data. The core method uses training data in the form of high dimensional feature vectors generated from an overcomplete dictionary of Haar wavelets. Using traditional classifier training techniques on such a high dimensional set would most likely result in overfitting, so we use a support vector machine classifier, which controls both the training error and the complexity of the classifier. We have shown that the system learns in the presence of noisy features and performs better with the full set of features than when the dimensionality is reduced through a feature selection step.

Figure 4. Processing the “Downtown Ulm” sequence with the people detection system for video sequences. The system uses no explicit dynamical model; it learns the shape and some dynamical characteristics of people through a pure pattern classification approach. Using this technique achieves high detection accuracy while eliminating transient false positives.

To do detection in video sequences, with one of the primary goals being to reduce the rate of false positives, we extend the static representation directly to the time domain. In an out-of-sample test, this dynamical detection method compares quite well with the static detection system. The system learns the important characteristics of the physical structure and dynamics of people as they appear in video sequences; in this manner, we have completely avoided the need for explicit shape or motion models. Since it is purely a learning-based approach, we expect that this same architecture could be applied to other object classes.

References

[1] C. Bregler. Learning and Recognizing Human Dynamics in Video Sequences. In Computer Vision and Pattern Recognition, pages 568–574. IEEE Computer Society Press, 1997.
[2] C. Burges. Simplified Support Vector decision rules. In Proceedings of the 13th International Conference on Machine Learning, 1996.
[3] C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. In U. Fayyad, editor, Proceedings of Data Mining and Knowledge Discovery, pages 1–43, 1998.
[4] J. Davis and A. Bobick. The Representation and Recognition of Human Movement Using Temporal Templates. In Computer Vision and Pattern Recognition, pages 928–934. IEEE Computer Society Press, 1997.
[5] U. Franke, D. Gavrila, S. Goerzig, F. Lindner, F. Paetzold, and C. Woehler. Autonomous driving goes downtown. IEEE Intelligent Systems, pages 32–40, November/December 1998.
[6] I. Haritaoglu, D. Harwood, and L. Davis. W4: Who? When? Where? What? A real time system for detecting and tracking people. In Face and Gesture Recognition, pages 222–227, 1998.
[7] B. Heisele, U. Kressel, and W. Ritter. Tracking Non-rigid, Moving Objects Based on Color Cluster Flow. In Computer Vision and Pattern Recognition, 1997.
[8] B. Heisele and C. Wohler. Motion-Based Recognition of Pedestrians. In Proceedings of the International Conference on Pattern Recognition, 1998 (in press).
[9] D. Hogg. Model-based vision: a program to see a walking person. Image and Vision Computing, 1(1):5–20, 1983.
[10] T. Joachims. Text Categorization with Support Vector Machines. Technical Report LS-8 Report 23, University of Dortmund, November 1997.
[11] M. Leung and Y.-H. Yang. Human body motion segmentation in a complex scene. Pattern Recognition, 20(1):55–64, 1987.
[12] S. Mallat. A theory for multiresolution signal decomposition: The wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7):674–693, July 1989.
[13] S. McKenna and S. Gong. Non-intrusive person authentication for access control by visual tracking and face recognition. In J. Bigun, G. Chollet, and G. Borgefors, editors, Audio- and Video-based Biometric Person Authentication, pages 177–183. IAPR, Springer, 1997.
[14] M. Oren, C. Papageorgiou, P. Sinha, E. Osuna, and T. Poggio. Pedestrian detection using wavelet templates. In Computer Vision and Pattern Recognition, pages 193–199, 1997.
[15] E. Osuna, R. Freund, and F. Girosi. Training support vector machines: An application to face detection. In Computer Vision and Pattern Recognition, pages 130–136, 1997.
[16] C. Papageorgiou. Object and Pattern Detection in Video Sequences. Master's thesis, MIT, 1997.
[17] C. Papageorgiou, M. Oren, and T. Poggio. A general framework for object detection. In Proceedings of the International Conference on Computer Vision, 1998.
[18] K. Rohr. Incremental recognition of pedestrians from image sequences. In Computer Vision and Pattern Recognition, pages 8–13, 1993.
[19] E. Stollnitz, T. DeRose, and D. Salesin. Wavelets for computer graphics: A primer. Technical Report 94-09-11, Department of Computer Science and Engineering, University of Washington, September 1994.
[20] T. Tsukiyama and Y. Shirai. Detection of the movements of persons from a sparse sequence of TV images. Pattern Recognition, 18(3/4):207–213, 1985.
[21] V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, 1995.
[22] C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland. Pfinder: Real-time tracking of the human body. Technical Report 353, MIT Media Laboratory, 1995.
