Dictionary Learning Based Object Detection and Counting in Traffic Scenes

Ravishankar Sivalingam∗ (University of Minnesota, Minneapolis, MN, USA, [email protected])
Guruprasad Somasundaram∗ (University of Minnesota, Minneapolis, MN, USA, [email protected])
Vassilios Morellas (University of Minnesota, Minneapolis, MN, USA, [email protected])
Nikolaos Papanikolopoulos (University of Minnesota, Minneapolis, MN, USA, [email protected])
Osama Lotfallah (Johnson Controls Inc., Milwaukee, Wisconsin, USA, [email protected])
Youngchoon Park (Johnson Controls Inc., Milwaukee, Wisconsin, USA, [email protected])

∗ Contributed equally.

ABSTRACT

The objective of object recognition algorithms in computer vision is to determine the presence or absence of a certain class of objects, e.g., bicycles, cars, or people, which is highly useful in traffic estimation applications. Sparse signal models and dictionary learning techniques can be utilized not only to classify images as belonging to one class or another, but also to detect, with the help of augmented dictionaries, the case when two or more of these classes co-occur. We present results comparing the classification accuracy when different image classes occur together. Practical scenarios where such an approach can be applied include forms of intrusion detection, i.e., where an object of class B should not co-occur with objects of class A; examples are bicyclists riding on prohibited sidewalks, or a person trespassing in a hazardous area. Mixed-class detection, in the sense of determining semantic content, can be performed in a global manner on downscaled versions of images or thumbnails. However, to accurately classify an image as belonging to one class or the other, we resort to higher-resolution images and localized content examination. With the help of blob tracking, this classification method can be used to count objects in traffic videos. The feature extraction method illustrated in this paper is highly suited to images obtained in practical settings, which are usually of poor quality and lack enough texture for the popular gradient-based methods to produce adequate feature points. We demonstrate that by appropriately training different types of dictionaries, we can perform the various tasks required for traffic monitoring.
1. INTRODUCTION
Object classification has diverse applications. The techniques used to perform object classification are equally diverse and often depend on the target application for which
the classification is performed. The accuracy of classification is thus subjective, depending on many parameters and on the available computational resources. In this paper we consider the class of supervised learning methods; the method discussed here falls under this domain. Learning methods usually have a training phase in which visual models of the different object classes are created. There are two key factors in training: the object representation, and the classifier together with its training algorithm. Objects can be represented using several different features, such as color, texture, and shape; this is an area of intensive research in computer vision. As part of this work we examine the effect of learning the dictionaries with a few different feature representations. Popular classifiers are maximum-margin based, like the support vector machine (SVM), and usually achieve high accuracy; SVMs are extremely popular not only in the computer vision domain but also in many other fields. The amount of labeled training data available is also key to accuracy, a direct consequence of machine learning theory. For practical applications we would like to minimize the amount of training required while achieving a high degree of classification accuracy.

In recent times, sparse signal models have become popular for image compression and reconstruction. In this framework, a given signal $x \in \mathbb{R}^n$ is represented as a sparse linear combination $\alpha$ of the atoms of an over-complete dictionary $D \in \mathbb{R}^{n \times k}$. The ideal case is when the signal $x$ admits an exact sparse decomposition $x = D\alpha$, but in practice we allow a maximum approximation error $\epsilon$, such that $\|x - D\alpha\|_2 \leq \epsilon$. The dictionary $D$ can be fixed, e.g., the DCT or wavelet basis, or it can be adapted to suit the application domain. Learning both $D$ and $\alpha$ efficiently has been the focus of much recent research [1, 2]. The underlying theory also makes these models suitable for incorporating discriminative components, wherein the learned dictionaries can be used to tell apart two different classes of signals. Mairal et al. [3] have effectively used this concept for texture classification with inspiring results. In [4], a multi-scale extension of this idea is used to detect edges which are discriminative enough to be used for object classification. In this paper we use these signal models for discriminative learning of different object classes.
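To make the signal model concrete, the following sketch learns an over-complete dictionary from data and sparse-codes a new signal against it using scikit-learn. This is purely illustrative and is not the pipeline used in this paper: scikit-learn's online dictionary learning stands in for K-SVD/MOD, and the data, patch size, dictionary size, and sparsity level are arbitrary placeholder choices.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning, sparse_encode

# Toy training data: M vectors of dimension n = 64 (e.g., flattened 8x8 patches).
# In practice these would be patches sampled from training images.
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 64))

# Learn an over-complete dictionary D with k = 256 atoms (k > n).
# Note: scikit-learn uses online dictionary learning, not K-SVD or MOD.
dico = MiniBatchDictionaryLearning(n_components=256, batch_size=64,
                                   transform_algorithm='omp',
                                   transform_n_nonzero_coefs=10,
                                   random_state=0).fit(X)
D = dico.components_          # shape (k, n); rows are the atoms

# Sparse-code a new signal x with at most L = 10 non-zero coefficients.
x = rng.standard_normal((1, 64))
alpha = sparse_encode(x, D, algorithm='omp', n_nonzero_coefs=10)

# Approximation error ||x - alpha D||_2, which the model requires to be <= epsilon.
residual = np.linalg.norm(x - alpha @ D)
print(f"{np.count_nonzero(alpha)} non-zeros, residual {residual:.3f}")
```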
Figure 1: A comparison of a bicyclist between a dataset image (left) and a tracked blob image (right).
Using the models learned from the individual image classes, we have developed a method for distinguishing composite images, i.e., images composed of multiple classes occurring together. Such distinctions are helpful in many safety applications, such as intrusion or foreign object detection, shielding of hazardous areas, etc. The rest of the paper is organized as follows: in the following section we discuss related work concerning object representation and prominent approaches to object detection and classification, and highlight some of the recent work in sparse signal modeling which served as a precursor to this work. In Section 3, we present the approach of discriminative dictionaries, elaborate on the method used for detecting multiple classes in an image, and describe the use of this method for efficient classification and counting in traffic videos. In Section 4, we show experimental results for the two applications, and we conclude with future directions in Section 5.
2. RELATED WORK
Objects are best represented using features. Features can be computed from the entire image, from interesting regions, or from interesting points. Many researchers have proposed descriptors for point features and image features which are effective for learning object models and matching. Shape context descriptors [5] have been demonstrated to be a viable choice for matching and recognizing digits, letters, and 3D objects. SIFT [6] is one of the most effective interest point methods; it uses scale-space extrema as interest points and provides a localized high-dimensional descriptor. While SIFT yields strong matching results, it lacks the simplicity of certain other features. A detailed study of interest point detectors, focusing on the scale and affine invariance of corner detectors, is given in [7]. Some global image features are more suitable under certain circumstances. The pyramidal Histogram-of-Oriented-Gradients (P-HOG) descriptor provides statistical information about edge directions, which is extremely useful for shape representation in images [8], and also shows good results for human detection. Note that these features depend on the quality of the images and the presence of sufficient gradient information. Techniques that work on standardized dataset images will not necessarily produce good results in practice, where we must deal with problems like poor resolution, motion blur, and occlusion. These problems are exacerbated when the images are blob images obtained from tracking (see Fig. 1).

SVM-based approaches have been prominent among supervised learning methods. Not limited to object classification, SVMs have been an attractive choice for many other classification and regression applications. They have been used successfully with many different object features; a detailed report is available in [9]. A similar approach, which constructs different feature vocabulary sets and investigates the use of probabilistic latent semantic analysis, is presented in [10]. HOG features have been extended to construct part models which accommodate deformations in objects and hence improve accuracy [11], while [12, 13] discuss the use of LDA (Latent Dirichlet Allocation) for object and scene classification and annotation. Most of these methods use benchmark datasets such as Caltech or Graz-02. Sparse signal models are an interesting area of research for a variety of applications. Originally developed for image compression and reconstruction, they have also been applied to texture classification [3]. Sparse models are typically learned using the K-SVD [1] and MOD (method of optimal directions) [2] algorithms. We have derived the theory for our approach from the ideas presented by Mairal et al. [3]. Starck et al. [14] discuss the design of dictionaries for sparse representations of image content as texture and smooth regions, where they propose basis pursuit denoising with augmented dictionaries for separating the two components of image content. We use a similar idea to recognize the different image classes contained in composite images, using image patches.
3. APPROACH

3.1 Theory
The underlying theory in this paper is based on the work by Mairal et al. [3]. (For the convenience of the reader we will stick to their notation.) Given a set of vectors $X = \{x_l\}_{l=1}^{M}$, $x_l \in \mathbb{R}^n$, a reconstructive dictionary $D \in \mathbb{R}^{n \times k}$ is learned adaptively from the data, such that the respective decompositions $\alpha_l$ are sparse (i.e., have no more than $L$ non-zero elements), by solving the optimization problem:

$$\min_{\alpha, D} \; \sum_{l=1}^{M} \|x_l - D\alpha_l\|_2^2 \quad \text{s.t.} \quad \|\alpha_l\|_0 \leq L. \tag{1}$$

The best possible reconstruction error for a given signal $x$ and dictionary $D$ is denoted by $R^*(x, D)$, where

$$R^*(x, D) = \|x - D\alpha^*(x, D)\|_2^2. \tag{2}$$
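As an illustration, $R^*(x, D)$ in Eq. (2) can be approximated with any $L$-sparse coder; below is a minimal sketch using scikit-learn's OMP solver (note that greedy OMP only approximates the optimal $L$-sparse decomposition).

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def reconstruction_error(x, D, L):
    """R*(x, D): squared residual of the best L-sparse code of x under D.

    x : signal, shape (n,); D : dictionary, shape (n, k), columns are atoms.
    """
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=L, fit_intercept=False)
    omp.fit(D, x)            # greedily approximates min ||x - D a||_2 s.t. ||a||_0 <= L
    alpha_star = omp.coef_   # the L-sparse decomposition alpha*(x, D)
    return np.sum((x - D @ alpha_star) ** 2)
```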
Here $\alpha^*(x, D)$ is the optimal $L$-sparse decomposition for the pair $(x, D)$. Both the K-SVD [1] and the MOD [2] approaches solve the dictionary learning problem in Eq. (1) in an iterative fashion, with each iteration consisting of a sparse coding stage and a dictionary update stage. Usually the K-SVD algorithm converges in fewer iterations than MOD, and hence is more commonly used.

Suppose now that we have $N$ different classes of signals, $S_i = \{x_l\}_{l \in S_i}$, and we would like to classify these signals with the help of dictionary learning. Mairal et al. [3] incorporate a discriminative component into the objective function in Eq. (1), and learn separate dictionaries, one per class, so that a signal belonging to one class is reconstructed poorly by a dictionary corresponding to another class. Thus the residual reconstruction error of a signal $x$ under the dictionary belonging to class $i$, $R^*(x, D_i)$, is used as a discriminant for classification. The modified objective function for discriminative dictionary learning is given by:

$$\min_{\{D_j\}_{j=1}^{N}} \; \sum_{i=1}^{N} \sum_{l \in S_i} C_i^{\lambda}\!\left(\{R^*(x_l, D_j)\}_{j=1}^{N}\right) + \lambda \gamma R^*(x_l, D_i), \tag{3}$$

where $C_i^{\lambda}(\cdot)$ is the softmax function, a multi-class extension of the logistic loss function, given by

$$C_i^{\lambda}(y_1, y_2, \ldots, y_N) = \log\left(\sum_{j=1}^{N} e^{-\lambda(y_j - y_i)}\right). \tag{4}$$

The parameter $\lambda$ defines the penalty for the softmax loss function (misclassification cost), while $\gamma$ is the weight given to the reconstructive part of the objective function in Eq. (3). As $\gamma$ increases, the method approaches regular reconstructive dictionary learning; as $\gamma$ decreases, the system gains discriminative power at the expense of poorer overall reconstruction. For the complete algorithm for solving this discriminative dictionary learning problem, the interested reader is referred to [3].

3.2 Detection of mixed classes

The dictionaries are adapted in the above manner to discriminate between objects from $N$ classes. However, when objects from two different classes A and B co-occur in an image, e.g., a person with a bicycle as shown in Fig. 1, neither of the learned dictionaries $D_A$ and $D_B$ will individually provide an efficient reconstruction. The primary goal of this work is to detect and classify such an image as belonging to a class AB, without explicitly training for this case. To this end, we construct a new dictionary $D_{AB}$ by concatenating the dictionaries of the individual classes:

$$D_{AB} = [\, D_A \mid D_B \,]. \tag{5}$$
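Below is a sketch of the dictionary concatenation in Eq. (5), together with the normalized residual error curves on which the dictionaries are compared (cf. Fig. 2). This is a toy illustration: it reuses the hypothetical reconstruction_error helper from the Section 3.1 sketch, sweeps the sparsity level with repeated OMP calls rather than exploiting OMP's single-pass sequential structure, and uses random placeholder dictionaries rather than learned ones.

```python
import numpy as np

def error_curve(x, D, sparsity_levels):
    """Normalized residual reconstruction error vs. sparsity level (cf. Fig. 2)."""
    energy = np.sum(x ** 2)
    return np.array([reconstruction_error(x, D, L) / energy  # Sec. 3.1 helper
                     for L in sparsity_levels])

# D_A and D_B play the role of the learned class dictionaries (atoms as columns).
rng = np.random.default_rng(0)
D_A, D_B = rng.standard_normal((64, 128)), rng.standard_normal((64, 128))

# Combined dictionary of Eq. (5): the atoms of both classes side by side.
D_AB = np.hstack((D_A, D_B))            # shape (n, 2k)

levels = range(1, 11)                   # sweep of sparsity levels L
x = rng.standard_normal(64)             # a test signal (e.g., a flattened patch)
E_a, E_b, E_ab = (error_curve(x, D, levels) for D in (D_A, D_B, D_AB))
```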
For a given signal $x$, the sparse decomposition with respect to each of the three dictionaries is computed using Orthogonal Matching Pursuit (OMP) [15]. The three dictionaries are then used to produce residual reconstruction error curves for varying fractions of sparsity of the decompositions $\alpha_A$, $\alpha_B$, and $\alpha_{AB}$. The OMP algorithm can produce such a curve very efficiently in one run, due to its sequential nature. If the sparsity constraint is instead in terms of the $\ell_1$ norm on $\alpha$, the full solution path can be obtained very efficiently using the LARS-Lasso algorithm [16]. One such set of curves is shown in Fig. 2, for the image on the left of Fig. 1, where we see a person riding a bicycle. Clearly the combined dictionary performs better than the individual dictionaries, showing that it is more useful in reconstructing, and thus detecting, mixed classes. The residual errors are normalized with respect to the signal energy, i.e., the norm of the signal $\|x\|_2$.

Let us denote by $E_a$, $E_b$, and $E_{ab}$ respectively the normalized error curves generated by the dictionaries $D_A$, $D_B$, and $D_{AB}$. We compute the logarithm of the ratio of the differences of the curves $E_a$ and $E_b$ with respect to $E_{ab}$, using either the $\ell_1$ or the $\ell_2$ norm:

$$E_{\ell_1} = \log \frac{\|E_a - E_{ab}\|_1}{\|E_b - E_{ab}\|_1}, \tag{6}$$

$$E_{\ell_2} = \log \frac{\|E_a - E_{ab}\|_2}{\|E_b - E_{ab}\|_2}. \tag{7}$$

Figure 2: Normalized residual reconstruction error versus the fraction of non-zero elements in $\alpha$, for the people, bikes, and combined dictionaries, computed on gray 16×16 data of the image on the left of Fig. 1. Clearly the combined dictionary performs much better than the individual dictionaries, and this is used to detect the presence of a combined (mixed-class) image.

Figure 3: Form of the multiscale dictionary.

If the image contains objects from two classes, as in Fig. 2, then the curves $E_a$ and $E_b$ will both be roughly equally well separated from $E_{ab}$, resulting in $E_{\ell_p}$, $p = 1, 2$ being close to zero. However, if the image contains only one class, say A, then the $E_a$ and $E_{ab}$ curves will be quite similar, whereas $E_b$ will be very distinct from these two. This results in a strongly negative value of $E_{\ell_p}$, $p = 1, 2$. Similarly, if the image contains only an object from class B, this produces a strongly positive value in Eqs. (6, 7). Using either $E_{\ell_1}$ or $E_{\ell_2}$, we can thus distinguish between the occurrence of class A, class B, or both in a given image.
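This three-way decision rule can be sketched as follows; the threshold tau is a hypothetical illustration (the experiments in Section 4 instead use Gaussian classifiers on the $E_{\ell_p}$ values).

```python
import numpy as np

def mixed_class_score(E_a, E_b, E_ab, p=1):
    """E_lp of Eqs. (6)/(7): log-ratio of the curve differences w.r.t. E_ab."""
    num = np.linalg.norm(E_a - E_ab, ord=p)
    den = np.linalg.norm(E_b - E_ab, ord=p)
    return np.log(num / den)

def detect(E_a, E_b, E_ab, tau=0.5, p=1):
    """Near zero -> both classes present; strongly negative -> only class A;
    strongly positive -> only class B. tau is a hypothetical threshold."""
    score = mixed_class_score(E_a, E_b, E_ab, p)
    if score < -tau:
        return "A"
    if score > tau:
        return "B"
    return "AB"
```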
3.3 Classification and counting in traffic videos
As already illustrated by Fig. 1, tracked blob images obtained from traffic videos suffer from motion artifacts such as blur and occlusion, and also have poor resolution. The quality of the blob images depends on the quality of the camera and on the performance of the tracking algorithm, which in turn depends on the background removal routine. Most gradient-based methods will clearly not provide adequate keypoints to build models of the different object classes from such images. Removal of motion blur is not straightforward, as the blur is non-uniform and varies with the type of object, so a single generalized deconvolution kernel cannot be obtained. We have therefore trained discriminative dictionaries (see Section 3.1) on image patches of varying sizes

Figure 4: Block diagram of the complete classification procedure (image features → sparse coding with the people and bike dictionaries → reconstruction error feature ($E_a$, $E_b$) → Fisher LDA classifier → class label).

obtained from the different image classes. For our study we have attempted to distinguish pedestrians from bicyclists. This is particularly challenging, since a bicyclist image contains content from a bicycle as well as a person. The training patches were obtained from bicycle and people images in the Graz-02 dataset. From a set of 357 bicycle images and 294 people images, a random selection of 5120 samples of 16×16 patches, 25600 samples of 8×8 patches, and 32000 samples of 4×4 patches was obtained for training multiscale discriminative dictionaries. The training procedure yields three dictionaries of sizes 256×512 (from 16×16 patches), 64×256 (from 8×8 patches), and 16×128 (from 4×4 patches) for each of the two image classes. These three dictionaries are appended to form a single large multiscale dictionary per image class; the overall schematic of this combined dictionary is illustrated in Fig. 3. Note that the dimensions of the feature vector are appropriately sorted to account for the variation in ordering due to the column-major scanning at the different scales. The reason for using multiscale dictionaries is to capture the discriminative features at the different scales and patch sizes, to better classify the image classes. Note also that the dictionary training images are from the Graz dataset and contain either bicycles or pedestrians.

Candidate test images obtained from tracking, which contain either bicyclists or pedestrians, are decomposed into 16×16 patches, which are individually sparse-coded with the trained multiscale dictionaries using the OMP algorithm. By varying the fraction of sparsity in the decompositions (over 50 different sparsity levels) we obtain a 50-dimensional reconstruction error vector for each dictionary, $E_a, E_b \in \mathbb{R}^{50}$. The difference of the reconstruction errors with respect to the bicycle and pedestrian dictionaries yields a 50-dimensional feature vector. These feature vectors are used to train a Fisher Linear Discriminant Analysis (LDA) classifier, which is applied at the patch level to determine whether a patch originated from a bicycle or a person. In the end, each image patch receives a label.
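A sketch of this patch-level feature extraction and Fisher LDA training follows, shown for the 16×16 scale only and assuming the error_curve helper from the Section 3.2 sketch; scikit-learn's LinearDiscriminantAnalysis stands in for the Fisher LDA classifier.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

N_LEVELS = 50   # 50 sparsity levels -> 50-dimensional error curves

def patch_feature(patch, D_bike, D_person):
    """Difference of the two reconstruction error curves for one 16x16 patch."""
    x = patch.ravel()                       # flatten to a 256-dim signal
    levels = range(1, N_LEVELS + 1)
    E_a = error_curve(x, D_bike, levels)    # helper from the Sec. 3.2 sketch
    E_b = error_curve(x, D_person, levels)
    return E_a - E_b                        # 50-dim feature vector

# X_train: stacked patch features; y_train: 0 = person patch, 1 = bike patch.
# lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
# patch_label = lda.predict(patch_feature(p, D_bike, D_person)[None, :])
```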
Figure 5: Some of the images used for testing of mixed classes. The first row consists of cars with bicycles, the second, persons with bicycles, and the third, persons with cars.
By analyzing the composition of labels, we can classify the image blob. The ratio of the number of patches classified as bike to the total number of patches in the image determines the final class label. A classification threshold of 10% for this fraction was determined empirically: if more than 10% of an image's patches were labeled as bike patches, the image was classified as a bicyclist; otherwise it was labeled as a pedestrian. A block diagram illustrating the complete classification procedure is shown in Fig. 4.
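The blob-level decision then reduces to a vote over patch labels; a minimal sketch using the empirically chosen 10% threshold (lda and patch_feature as in the previous sketch; patches would be the 16×16 foreground patches of the blob):

```python
import numpy as np

BIKE_FRACTION_THRESHOLD = 0.10   # empirically determined threshold

def classify_blob(patches, D_bike, D_person, lda):
    """Label a tracked blob from the votes of its 16x16 foreground patches."""
    feats = np.stack([patch_feature(p, D_bike, D_person) for p in patches])
    labels = lda.predict(feats)              # 1 = bike patch, 0 = person patch
    bike_fraction = np.mean(labels == 1)
    return "bicyclist" if bike_fraction > BIKE_FRACTION_THRESHOLD else "pedestrian"
```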
4. EXPERIMENTS

4.1 Results of Mixed Class Detection
The discriminative dictionary learning framework was used to learn dictionaries for the people (p), cars (c), and bicycles (b) classes from the GRAZ-02 dataset [17, 18]. Three experiments were run, based on pairwise combinations of these classes (N = 2). Test images containing mixed classes were obtained from the GRAZ-01 dataset [19], the INRIA Person dataset [20], and Flickr¹ [21]. Sample images from each mixed class are shown in Fig. 5. Two classes of images are selected for training from the GRAZ-02 dataset, e.g., people and cars. Discriminative dictionaries are learned based on the framework in [3], to yield $D_p$ (for people) and $D_c$ (for cars), and a combined dictionary $D_{pc}$ is created by concatenating the individual dictionaries. For each image in the test set, the residual reconstruction error curves were computed as explained in the previous section, and the quantities $E_{\ell_1}$ and $E_{\ell_2}$ were obtained. The images were classified using univariate Gaussian classifiers on the $E_{\ell_1}$ and $E_{\ell_2}$ values, and the accuracy was evaluated using leave-one-out cross-validation.

Figs. 6(a) and 6(b) show the classification accuracy based on the $\ell_1$ difference between the individual curves and the combined curve. The results show that the dictionaries learned using grayscale and P-HOG features are better at recognizing mixed classes than the dictionaries learned from hue-saturation values. Figs. 7(a) and 7(b) show the corresponding results for the $\ell_2$ difference. Using the $\ell_1$ difference produced better results than using the $\ell_2$ difference. The corresponding confusion matrices for the different class combinations, using the grayscale intensity features, are shown in Table 1.

¹ The images are licensed under a Creative Commons Attribution-Non-commercial license. http://www.flickr.com/creativecommons/

Figure 6: Classification accuracy based on the $\ell_1$ difference between the individual curves and the combined curve, for each feature type (gray 16×16, P-HOG (168), HS-hist 8×8, HS-hist 12×12) and class pair (people & cars, people & bikes, cars & bikes): (a) overall accuracy; (b) classification accuracy on the mixed classes only. The dictionaries learned using grayscale and P-HOG features are better at recognizing the mixed classes compared to the dictionaries learned from hue-saturation values.

Figure 7: Classification accuracy based on the $\ell_2$ difference between the individual curves and the combined curve: (a) overall accuracy; (b) classification accuracy on the mixed classes only. The classification accuracy based on the $\ell_2$ difference is somewhat lower than with the $\ell_1$ difference, and this is especially noticeable in the P-HOG performance.
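As a sketch of this evaluation protocol, a univariate Gaussian classifier on the scalar $E_{\ell_p}$ score with leave-one-out cross-validation can be written as follows; the scores and labels are made-up placeholders, and scikit-learn's GaussianNB on a single feature serves as the per-class univariate Gaussian model.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.naive_bayes import GaussianNB

# scores: E_l1 (or E_l2) value per test image; labels: 0 = class A only,
# 1 = class B only, 2 = mixed class AB (all placeholder values).
scores = np.array([-1.2, -0.9, 1.1, 1.4, 0.1, -0.05, 0.9, -1.0, 0.2])
labels = np.array([0, 0, 1, 1, 2, 2, 1, 0, 2])

# With a single feature, GaussianNB fits one univariate Gaussian per class.
acc = cross_val_score(GaussianNB(), scores[:, None], labels,
                      cv=LeaveOneOut()).mean()
print(f"leave-one-out accuracy: {acc:.2f}")
```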
                 Pred. (E_ℓ1)              Pred. (E_ℓ2)
  Actual     pc      p      c          pc      p      c
  pc         36      1      0          36      1      0
  p           4     96      0           5     95      0
  c           1      0     99           2      0     98

                 Pred. (E_ℓ1)              Pred. (E_ℓ2)
  Actual     pb      p      b          pb      p      b
  pb         72      1      1          70      3      1
  p           2     98      0           3     97      0
  b           3      0     97           3      0     97

                 Pred. (E_ℓ1)              Pred. (E_ℓ2)
  Actual     cb      c      b          cb      c      b
  cb         15      0      0          15      0      0
  c           0    100      0           0    100      0
  b           1      0     99           0      0    100

Table 1: Confusion matrices (predicted vs. actual) for all pairwise combinations of the three classes used in the experiments. p = people, c = cars, b = bicycles.

4.2 Results of Counting

For this experiment, we acquired one hour of video from a university walkway, where the traffic consists predominantly of bicyclists and pedestrians. There were a total of 355 tracked blobs, of which about 82% were pedestrians and 18% were bicyclists. Ground truth for these blobs was obtained by manual labeling, and the blobs were then classified using the algorithm described in Section 3.3. Two sample reconstruction error curves, for patches originating from a bicyclist and from a pedestrian, are shown in Fig. 8. Classification label masks overlaid on sample blob images are shown next to the true blob images in Fig. 9. Note that the labels are for 16×16 patches; white corresponds to pedestrian patches, black to bicycle patches, and grey represents background, which was not considered for classification. The counting accuracy for bicyclists and pedestrians, as well as the overall accuracy, is shown in Table 2.
Figure 8: (a) Residual reconstruction error plots for a patch from a pedestrian image, with respect to the two different multiscale dictionaries; (b) the corresponding plots for a patch taken from the wheel of a bicycle image.

  Class        Bikes     Pedestrians   Overall
  Accuracy     86.11%    97.99%        95.87%

Table 2: Classification accuracy for bikes and pedestrians, and overall accuracy.

Figure 9: Classification label masks for the patches overlaid on the blob images; the corresponding images are shown above the masks. White corresponds to patches from pedestrian images, while black patches are from bicycle parts. Grey patches are not considered, i.e., they belong to the background. An overall structure of a bicycle is visible in the bike classification masks, whereas there are almost no 'bike' patches in pedestrian images.

The results are very encouraging and motivate further research into the use of discriminative dictionaries for similar applications. From the overlaid class labels for each patch, we observe that in some cases patches from the person riding the bicycle are classified as part of the bicycle. This has not biased the classification results; in fact, we observe very few pedestrian patches labeled as bicycle patches. We attribute this to the posture of the person riding the bicycle and to motion artifacts. This method does not resolve the problem of counting pedestrians moving in groups; a rigorous analysis of that problem is presented in [22]. For bicycle counting, we have achieved better accuracy than our previous approach [23], in which we examined feature-based methods such as SIFT and HOG with an SVM performing the classification. Also, this method is purely appearance-based and does not yet use any geometric or motion information.
5. CONCLUSIONS
Sparse signal models have become extremely popular in the past decade, and the use of morphological component analysis to separate texture from smooth regions in images is very appealing. Extending this idea to separating multiple image classes in composite images, we have shown through extensive training and experimentation that dictionaries learned using different feature types achieve good classification accuracies, and that, without any additional learning, the individual dictionaries can be augmented to classify composite images in which the different classes occur together. The results from our experiments confirm this. Current work concerns extending this approach to multiple classes of objects (three or more). Preliminary experimentation with a similar approach has shown good results, but augmenting dictionaries by direct appending is impractical for a large number of classes. Morphological separation of the different classes in an image through adaptively learned dictionaries, as opposed to fixed dictionaries (Starck et al. [14]), is also of interest. We are working on integrating the crowd counting approach presented in [22] with this appearance-based method to improve counting in mixed groups. Another direction of future work is to use multiple cameras, specifically overhead cameras with stereo information, to better separate the moving blobs. This would allow us to improve our counting results on groups of pedestrians and bicyclists.
Acknowledgments

This material is based upon work supported in part by the U.S. Army Research Laboratory and the U.S. Army Research Office under contract #911NF-08-1-0463 (Proposal 55111-CI), and by the National Science Foundation through grants #CNS-0324864, #CNS-0420836, #IIP-0443945, #IIP-0726109, #CNS-0708344, #CNS-0821474, and #IIP-0934327.
6. REFERENCES

[1] M. Aharon, M. Elad, and A. Bruckstein, "K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation," IEEE Transactions on Signal Processing, vol. 54, no. 11, pp. 4311-4322, Nov. 2006.
[2] K. Engan, S. O. Aase, and J. H. Husoy, "Frame Based Signal Compression Using Method of Optimal Directions (MOD)," Proceedings of the 1999 IEEE International Symposium on Circuits & Systems, vol. 4, pp. 1-4, July 1999.
[3] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, "Discriminative Learned Dictionaries for Local Image Analysis," Proceedings of the 2008 IEEE Conference on Computer Vision & Pattern Recognition, pp. 1-8, June 2008.
[4] J. Mairal, M. Leordeanu, F. Bach, M. Hebert, and J. Ponce, "Discriminative Sparse Image Models for Class-Specific Edge Detection and Image Interpretation," Proceedings of the 10th European Conference on Computer Vision, pp. 43-56, Springer-Verlag, 2008.
[5] S. Belongie and J. Malik, "Matching with Shape Contexts," Proceedings of the IEEE Workshop on Content-Based Access of Image & Video Libraries, pp. 20-26, 2000.
[6] D. Lowe, "Object Recognition from Local Scale-Invariant Features," Proceedings of the 7th IEEE International Conference on Computer Vision, vol. 2, pp. 1150-1157, 1999.
[7] K. Mikolajczyk and C. Schmid, "Scale and Affine Invariant Interest Point Detectors," International Journal of Computer Vision, vol. 60, no. 1, pp. 63-86, 2004.
[8] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," Proceedings of the 2005 IEEE Conference on Computer Vision & Pattern Recognition, vol. 1, pp. 886-893, June 2005.
[9] J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid, "Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study," Proceedings of the 2006 Computer Vision & Pattern Recognition Workshop, p. 13, June 2006.
[10] A. Bosch, A. Zisserman, and X. Munoz, "Scene Classification Using a Hybrid Generative/Discriminative Approach," IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 30, no. 4, pp. 712-727, April 2008.
[11] P. Felzenszwalb, D. McAllester, and D. Ramanan, "A Discriminatively Trained, Multiscale, Deformable Part Model," Proceedings of the 2008 IEEE Conference on Computer Vision & Pattern Recognition, pp. 1-8, June 2008.
[12] C. Wang, D. Blei, and L. Fei-Fei, "Simultaneous Image Classification and Annotation," Proceedings of the 2009 IEEE Conference on Computer Vision & Pattern Recognition, pp. 1903-1910, June 2009.
[13] L. J. Li, R. Socher, and L. Fei-Fei, "Towards Total Scene Understanding: Classification, Annotation and Segmentation in an Automatic Framework," Proceedings of the 2009 IEEE Conference on Computer Vision & Pattern Recognition, pp. 2036-2043, June 2009.
[14] J. L. Starck, M. Elad, and D. Donoho, "Image Decomposition via the Combination of Sparse Representations and a Variational Approach," IEEE Transactions on Image Processing, vol. 14, pp. 1570-1582, Oct. 2005.
[15] S. Mallat and Z. Zhang, "Matching Pursuits with Time-Frequency Dictionaries," IEEE Transactions on Signal Processing, vol. 41, pp. 3397-3415, Dec. 1993.
[16] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, "Least Angle Regression," Annals of Statistics, vol. 32, pp. 407-499, 2004.
[17] M. Marszalek and C. Schmid, "Accurate Object Localization with Shape Masks," Proceedings of the 2007 IEEE Conference on Computer Vision & Pattern Recognition, pp. 1-8, June 2007.
[18] A. Opelt, A. Pinz, M. Fussenegger, and P. Auer, "Generic Object Recognition with Boosting," IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 28, pp. 416-431, March 2006.
[19] M. Fussenegger, A. Opelt, A. Pinz, and P. Auer, "Object Recognition Using Segmentation for Feature Detection," Proceedings of the 17th International Conference on Pattern Recognition, vol. 3, pp. 41-44, Aug. 2004.
[20] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," Proceedings of the 2005 IEEE Conference on Computer Vision & Pattern Recognition, vol. 2, pp. 886-893, June 2005.
[21] Flickr, http://www.flickr.com
[22] D. Fehr, R. Sivalingam, V. Morellas, N. Papanikolopoulos, O. Lotfallah, and Y. Park, "Counting People in Groups," Proceedings of the 6th IEEE International Conference on Advanced Video and Signal-Based Surveillance, pp. 152-157, 2009.
[23] G. Somasundaram, V. Morellas, N. Papanikolopoulos, and L. Austin, "Counting Pedestrians and Bicycles in Traffic Scenes," Proceedings of the 2009 IEEE Intelligent Transportation Systems Conference, St. Louis, 2009.