SAR Target Classification Using Sparse Representations and Spatial Pyramids Peter Knee, Jayaraman J. Thiagarajan, Karthikeyan Natesan Ramamurthy and Andreas Spanias School of Electrical, Computer and Energy Engineering SenSIP Center, Arizona State University, Tempe, AZ
[email protected] Abstract— We consider the problem of automatically classifying targets in synthetic aperture radar (SAR) imagery using image partitioning and sparse representation based feature vector generation. Specifically, we extend the spatial pyramid approach, in which the image is partitioned into increasingly fine sub-regions, by using a sparse representation to describe the local features in each sub-region. These feature descriptors are generated by identifying those dictionary elements, created via k-means clustering, that best approximate the local features for each sub-region. By systematically combining the results at each pyramid level, classification is facilitated by approximate geometric matching. Results using a linear SVM for classification along with SIFT, FFT-magnitude and DCT-based local feature descriptors indicate that the use of a single element from the dictionary to describe the local features is sufficient for accurate target classification. Continuing work in both feature extraction and classification is discussed, with emphasis placed on the need for classification amid heavy target occlusion.
I. INTRODUCTION
Synthetic aperture radar (SAR) imaging systems possess the tremendous capability to form near optical quality imagery in a multitude of adverse weather conditions as well as under the cover of night [1]. These capabilities have led to the development of automatic target recognition (ATR) systems that attempt to accurately and efficiently detect and classify pre-determined targets. Statistical techniques developed in the fields of signal and image processing have been unable to address the numerous difficulties encountered in SAR ATR. In addition to the multiplicative noise (speckle) common to all coherent imaging systems, SAR ATR systems must overcome large target signature variabilities from articulating or rotating targets, changing background surfaces, and both intentional and unintentional target obscuration. SAR ATR systems have predominantly relied upon template matching based algorithms for target classification. This method requires the generation of templates at incrementally spaced aspect angles for each known target type [2]. While straightforward, it requires tedious manual intervention during template creation, and classification throughput decreases significantly as targets are added because of the increasing number of template-image correlations. Alleviating the need for manual training, feature-
978-1-4244-8902-2/11/$26.00 ©2011 IEEE
Figure 1. Generation of a single image feature descriptor using the spatial pyramid algorithm for image classification. The inclusion of the sparse representations occurs during local feature encoding.
based approaches use the training data set to discriminatively train the classification system. These typically non-linear approaches, such as the multilayer perceptron [3] and the support vector machine [4], have been shown to provide better generalization capabilities as well as improved classification performance. These classifiers, however, typically require an accurate pose estimation algorithm, which has proven very difficult to develop, in part due to inherent image noise as well as varying imaging scenarios and wide-ranging target signature variabilities. More recently, we presented an approach based on sparse representations that used a local linear approximation to each target class manifold to generate a classification prediction [5]. Small modifications to the manifold, such as target translations, have the potential to make accurate linear approximations
difficult. Working in the high-dimensional image space also required the use of dimensionality reduction techniques, such as random projections, that discarded useful locality information, making occlusion handling very difficult. In this paper we investigate merging the sparse representation framework with the spatial pyramid approach presented by Lazebnik [6] for scene characterization. The spatial pyramid matching (SPM) algorithm can be viewed as a global method that considers aggregates of statistical properties in a holistic fashion over fixed sub-regions in an image. While the original intent of the approach was to recognize scene categories, the algorithm proved to be surprisingly effective at categorizing images containing specific objects. The power to do so was a result of the ability to compute a rough geometric correspondence between local features in the image. A general flowchart of the SPM algorithm is shown in Fig. 1. Image subregions are generated by partitioning the image into increasingly finer spatial resolutions. Local image features are then encoded using a learned codebook. Similar to the approaches by Yang [7] and Wang [8], the inclusion of sparse representations during feature encoding replaces the hard vector quantization in the original SPM work. Known as spatial pooling, the locations of the encoded image features are aggregated across subregions to form a vector representation of the image. The histogram, or average, pooling technique is replaced by a more biologically-inspired adaptive [7] technique known as max pooling, which has the added benefit of increased robustness to spatial translations. Aggregating only the most salient image properties into the image feature vector, a set of linear SVMs is trained to perform image classification. Successful results using three targets from the publicly available Moving and Stationary Target Acquisition and Recognition (MSTAR) database are shown.
The organization of this paper is as follows: in Section II we present a brief review of the spatial pyramid algorithm and the local image features that will be considered. Dictionary generation and feature encoding using both vector quantization and sparse representations are presented in Section III. Spatial feature pooling using both histograms and max pooling is discussed in Section IV. Section V presents the results for three targets from the MSTAR dataset, including results using the baseline histogram approach presented in [6]. Concluding remarks, along with a discussion of extensions and improvements, are presented in Section VI.

II. SPATIAL PYRAMIDS
Spatial pyramid matching is an adaptation of the efficient approximation technique known as pyramid matching [9]. Itself an adaptation of the bag-of-features method, pyramid matching works by computing a rough geometric correspondence between sets, using increasingly coarser grids over the feature space and weighted sums of the matches that occur at each level of resolution. The pyramid match kernels work using an orderless image representation, matching sets in high-dimensional feature space while discarding all spatial information from the original image space. Spatial pyramid matching provides the capability to
Figure 2. Example construction of a three-level pyramid for spatial pyramid matching.
retain spatial information by matching in the partitioned image space and clustering in the high-dimensional feature space. By operating in the image space, rather than the feature space, spatial information is maintained by concatenating "long" feature vectors that consist of the weighted and encoded features at all levels of resolution. While similar, spatial pyramids and multiresolution histograms differ markedly: multiresolution histograms repeatedly subsample the image and compute features, whereas spatial pyramid matching fixes the resolution at which features are computed and varies the spatial resolution at which they are aggregated. To create a spatial pyramid image representation, consider the construction of a three-level (L = 3) pyramid as shown in Fig. 2. At each resolution level l, l = 0, …, L − 1, a grid is constructed such that there are 2^l resolution cells along each dimension. As previously mentioned, each local feature vector is then encoded using an M-element dictionary, or codebook. Feature pooling then provides the means to extract the most salient features from each subregion r at each resolution level l. The pooled vectors are weighted according to the inverse of the cell width at that level, i.e., 2^(l−L), placing more emphasis on matches found at finer resolutions. The final feature vector for the image is then the concatenation of all pooled vectors, a single vector of dimensionality M(4^L − 1)/3 for each image. This is the same general procedure that was introduced by Lazebnik [6]. The implementation details of each stage have been left very general to allow for the modifications of our proposed system. The specifics behind feature encoding and pooling, both in general and within our framework, are discussed in the subsequent sections.

A. Local Image Features
Three image features are used in the experiments of Section V. To characterize the uniqueness of radar imagery, we chose to utilize the "strong features" presented by Lazebnik [6].
These scale-invariant feature transform (SIFT) descriptors have proven to be highly useful in natural imagery [7], [8] due to their invariance to scale, orientation and affine distortions. Rather than identifying points of interest, we chose to extract 16 × 16 pixel patches computed over a dense grid with a spacing of 4 pixels to account for the reduced resolution of the SAR imagery. The choice of a dense grid over interest points is a result of the improved scene classification performance demonstrated by Li and Perona [10].
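The dense-grid extraction just described can be sketched as follows. This is an illustrative reconstruction under the parameters stated in the text (16 × 16 patches, 4-pixel grid spacing), not the authors' code; the function name and the 128 × 128 chip size are assumptions:

```python
import numpy as np

def dense_patches(image, patch_size=16, stride=4):
    """Extract patch_size x patch_size patches over a dense grid.

    The grid spacing (stride) of 4 pixels follows the setup in the text;
    each patch is rasterized into a single row vector.
    """
    h, w = image.shape
    patches = []
    for r in range(0, h - patch_size + 1, stride):
        for c in range(0, w - patch_size + 1, stride):
            patch = image[r:r + patch_size, c:c + patch_size]
            patches.append(patch.ravel())  # 256-dimensional local descriptor
    return np.array(patches)

# A hypothetical 128 x 128 SAR chip yields a 29 x 29 grid of
# 256-dimensional local descriptors.
chip = np.random.rand(128, 128)
X = dense_patches(chip)
print(X.shape)  # (841, 256)
```

The same extraction is reused unchanged for the FFT- and DCT-based descriptors; only the per-patch transform differs.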
In addition, we consider the use of FFT- and DCT-based feature descriptors. Again, we extract 16 × 16 pixel patches using a dense grid with a spacing of 4 pixels. For the FFT, we consider the magnitude of the frequency response, discarding all phase information. The extracted feature matrix at any point in the image is then rasterized, forming a single vector of length 256. The codebook, or dictionary, is generated using a random subsample of the features from the training set. Clustering using k-means provides the ability to develop a dictionary of a specified size. As observed in [6], codebooks of more than 400 elements show no significant advantage over smaller ones. We will therefore consider codebooks of sizes M = 100, 200 and 400.

III. CODE GENERATION
The D-dimensional local descriptors for each subregion are encoded into an M-dimensional code using one of two coding schemes. Given a set X = [x_1, x_2, …, x_N] of local descriptors and a codebook B with M entries, each scheme converts each local descriptor x_i, i = 1, …, N, into an M-dimensional code c_i used to generate the final image descriptor. Vector quantization is used in the original SPM architecture. The integration of sparse representations for code generation improves the robustness of the spatial pyramid framework.

A. Vector Quantization
Traditional SPM uses vector quantization (VQ) to generate the set of codes C = [c_1, c_2, …, c_N] for X. This is accomplished by solving the constrained least squares fitting problem [7]

  min_C Σ_{i=1}^{N} ||x_i − B c_i||²  s.t.  ||c_i||_0 = 1, ||c_i||_1 = 1, c_i ≥ 0, ∀i.  (1)

The l_0-norm constraint ensures that there will be only one non-zero element in each code, while the l_1-norm and non-negativity constraints force the non-zero entry to be 1. Vector quantization can thus be viewed as identifying the nearest-neighbor codebook element for each local descriptor and setting the associated weight to 1.

B. Sparse Representations
Finding a sparse representation for a signal is equivalent to finding a model that yields the most compact and useful signal representation. The use of sparse representations has seen a surge of interest, primarily because a sparse linear approximation with respect to an overcomplete dictionary can be computed efficiently using basic convex optimization techniques [11]. Standard dictionaries such as Fourier and wavelet bases have led to compression technologies that are very useful in computer and media storage. As such, the original goal of sparse representation was signal representation and compression, but the identification of the most compact subset of elements to represent a signal is extremely discriminative and has been shown to be very useful in a variety of classification tasks [5], [11].

Consider our dictionary D of unit-norm elements that are approximately linearly dependent. For a real-valued inner-product signal space, every signal then has an infinite number of best approximations using D. The goal of the sparse representation algorithm is to find a vector α that accurately reconstructs the signal x using the dictionary while minimizing the number of non-zero coefficients, i.e.,

  min_α ||α||_0  s.t.  x = Dα,  (2)

where ||·||_0 denotes the l_0-norm, or the total number of non-zero elements in α. Unfortunately, minimization of the l_0-norm is known to be NP-hard. The motivation for using sparse representations lies in the discovery that if the signal is sufficiently sparse, the solution for α in (2) is equal to the solution of the l_1-minimization problem [12]

  min_α ||α||_1  s.t.  x = Dα.  (3)

More importantly, equation (3) is a convex problem that can be solved in polynomial time using standard linear programming methods [11]. Even more efficient greedy algorithms, such as orthogonal matching pursuit, provide the ability to accurately estimate the sparse representation as well. By making the locally optimal choice at each stage, greedy algorithms provide an estimate of the sparse code for increasing numbers of non-zero coefficients, commonly referred to as the sparsity level, L, of the signal. The results in Section V are generated using both vector quantization and sparse representations for descriptor encoding. Sparsity levels ranging from 1 to 10 are used to demonstrate the utility of sparse representations for feature code generation.
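The two coding schemes can be contrasted in a short sketch. This is a minimal illustration under the paper's stated assumptions (unit-norm codebook columns); the random dictionary, function names, and the simple greedy refit loop are ours, not the authors' implementation:

```python
import numpy as np

def vq_code(x, B):
    """Vector quantization (Eq. (1)): a one-hot code selecting the nearest
    codebook column. With unit-norm columns, maximizing the correlation
    B^T x is equivalent to minimizing the Euclidean distance ||x - b_j||."""
    c = np.zeros(B.shape[1])
    c[np.argmax(B.T @ x)] = 1.0
    return c

def omp_code(x, B, L=1):
    """Greedy orthogonal matching pursuit at sparsity level L: repeatedly
    pick the atom most correlated with the residual, then refit the
    weights of all selected atoms by least squares."""
    c = np.zeros(B.shape[1])
    residual = x.copy()
    support = []
    for _ in range(L):
        support.append(int(np.argmax(np.abs(B.T @ residual))))
        weights, *_ = np.linalg.lstsq(B[:, support], x, rcond=None)
        residual = x - B[:, support] @ weights
    c[support] = weights
    return c

rng = np.random.default_rng(0)
B = rng.standard_normal((256, 100))
B /= np.linalg.norm(B, axis=0)                 # unit-norm dictionary atoms
x = rng.standard_normal(256)
print(np.count_nonzero(vq_code(x, B)))         # exactly 1 non-zero entry
print(np.count_nonzero(omp_code(x, B, L=3)))   # at most 3 non-zero entries
```

Note that even at L = 1 the OMP code differs from the VQ code: its single non-zero entry carries the fitted weight rather than a fixed value of 1, which is the distinction the results in Section V exploit.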
IV. SPATIAL POOLING

Let C_r = [c_{r,1}, c_{r,2}, …, c_{r,N_r}] be the set of feature codes generated for a subregion r. This is the matrix formed by column-stacking the generated codes for each local feature. Spatial pooling provides a means of extracting a single feature vector z_r, of dimensionality M, from the local feature codes for that region. Traditional SPM weights the total number of instances of each feature in the subregion according to the spatial resolution of the region, in effect placing more emphasis on points found at finer resolutions. This is equivalent to

  z_{r,j} = w_l Σ_i c_{r,i,j},  j = 1, …, M,  (4)

where the weight w_l is given by 2^(l−L). The result is a single vector z_r = [z_{r,1}, …, z_{r,M}], where the subscript r is used to denote the subregion associated with the feature descriptor. This method is the same as averaging and normalizing the histograms of the features in each subregion. Max pooling has been shown to not only be more robust to local spatial translations, but also to more closely resemble the
Table II. MSTAR classification results, percent correct (* indicates the best result for each dictionary size). Histogram (VQ) results do not depend on the sparsity level L.

                      M = 100                   M = 200                   M = 400
                  L=1     L=5     L=10      L=1     L=5     L=10      L=1     L=5     L=10
SIFT-Histograms          73.19                     77.14                     82.78
SIFT-Sparse Rep.  77.73   78.32   77.14     80.29   79.27   78.39     81.54   79.27   79.27
FFT-Histograms           84.91                     87.33                     87.91
FFT-Sparse Rep.   88.94*  87.25   80.88     90.26*  86.81   81.69     88.57*  88.13   82.56
DCT-Histograms           80.22                     83.00                     83.81
DCT-Sparse Rep.   79.48   63.96   58.02     87.62   79.27   75.31     87.62   82.05   79.41
processes in the human visual system [13]. Specifically, it has been shown to mimic the cortical response in the human nervous system by identifying only the most important, or salient, feature elements in each subregion [14]. Mathematically, max pooling is equivalent to

  z_{r,j} = max_i c_{r,i,j},  j = 1, …, M.  (5)
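The two pooling rules can be sketched directly. The weight w_l = 2^(l−L) follows the reconstruction given earlier in the text, and the example code matrix is hypothetical:

```python
import numpy as np

def average_pool(C, level, L=3):
    """Histogram-style pooling (Eq. (4)): weighted column sum of the code
    matrix C (M x N_r); finer pyramid levels receive larger weights."""
    return (2.0 ** (level - L)) * C.sum(axis=1)

def max_pool(C):
    """Max pooling (Eq. (5)): keep only the strongest response to each
    codebook element over all local features in the subregion."""
    return C.max(axis=1)

# Hypothetical codes for one subregion: M = 100 codebook elements,
# N_r = 50 local features.
C = np.abs(np.random.rand(100, 50))
z_avg = average_pool(C, level=2)
z_max = max_pool(C)
print(z_avg.shape, z_max.shape)  # (100,) (100,)
```

Because max pooling keeps only the peak response per codebook element, a feature that appears strongly anywhere in the subregion dominates its cell, which is the source of the translation robustness noted above.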
After computation of the subregion feature vector z_r, the feature vectors across all subregions are concatenated into a single image descriptor, i.e., z = [z_1, z_2, …, z_R], where R is the total number of subregions in the spatial pyramid. For the results presented in Section V, max pooling is used in conjunction with sparse representations for the generation of the image descriptor. Vector quantization along with histogram pooling provides the other set of results.

V. RESULTS
Three targets, the T72 and BMP2 tanks and the BTR70 personnel carrier, from the MSTAR [15] data set are used to show classification results. This dataset contains SAR images collected at 15° and 17° depression angles. A single variant of each target type from the 17° set is used for training, while all available variants of each target type collected at the 15° depression angle are used for testing. Testing against the variants of each target type provides an indication of the generalization capabilities of our classification system. The specific breakdown of training and testing images for each target is shown in Table I. Recall that training requires the generation of the codebook along with training of the SVMs. Of the 698 available training images, half are randomly selected to provide training feature vectors for the clustering algorithm. As the results of k-means clustering are highly dependent on the initial selection of the cluster centers, it is expected that the codebook generation and subsequent classification results will differ from run to run. We provide classification performance, in terms of the probability of correct classification, for a single run of the algorithm. A one-versus-one support vector machine (SVM) with a linear kernel is used to provide confidence metrics for each target type. Class designation is based on the SVM that generates the largest confidence metric. Classification results are shown in Table II. The highlighted results for each dictionary size indicate the best
Table I. Training and testing target breakdown.

             Training                           Testing
           T72         BMP2        BTR70     T72    BMP2   BTR70
           (SN_132)    (SN_9563)
# Images    232          233         233      582     587    196
overall classification performance. It is obvious that not only do FFT features outperform SIFT and DCT features, but the use of sparse representations provides a noticeable increase in classification performance over the use of vector quantization. As observed in [6], we also notice no additional benefit in increasing the dictionary size above 200 elements. Surprisingly, we see no benefit in the use of more than one sparse coefficient during encoding. A single non-zero element in the local feature vector generated using sparse representations is still quite different from the single non-zero, unit-valued element produced using vector quantization: the generated code contains an indication of the strength of each codebook element in the local feature vector representation. This similarity value between the nearest-neighbor codebook element and the local feature vectors aids in the training of the SVM, helping to provide additional discrimination information for target classification. Additionally, max pooling has the effect of retaining only the most salient features in each subregion, potentially discarding features that make discrimination between target variants difficult.

VI. CONCLUSIONS
The use of spatial pyramids has proven to be an effective classification technique for SAR imagery. The inability of SIFT and DCT features to provide an adequate image representation illustrates the difficulty of applying classification algorithms developed for natural imagery to SAR. Nonetheless, the ability of basic magnitude-FFT features to provide classification performance nearly equal to that of state-of-the-art classification systems is very promising. It is expected that the development of more useful local image features for SAR will provide the ability to surpass current SAR classification abilities. A complete statistical analysis of the effect of the k-means dictionary generation algorithm on system performance remains to be completed. Emerging dictionary learning algorithms are expected to provide additional performance increases. More pertinent to SAR and its military applications is the assessment of the algorithm's capability to handle clutter and occlusions. While SPM using sparse representations has
proven to generalize well across target variants, its performance when faced with unknown target objects, and its ability to reject them, remains to be seen. Lastly, the ability to formulate an accurate geometric correspondence between target parts may provide the means to handle both intentional and unintentional target occlusions, overcoming the most difficult obstacle for SAR target classification algorithms.

ACKNOWLEDGMENT
This work was supported in part by the SenSIP Consortium at Arizona State University and its industry member Raytheon Missile Systems in Tucson, AZ.

REFERENCES
[1] C. V. Jakowatz, D. E. Wahl, P. H. Eichel, D. C. Ghiglia, and P. A. Thompson, Spotlight-Mode Synthetic Aperture Radar: A Signal Processing Approach. Norwell, MA: Kluwer Academic Publishers, 1996.
[2] G. J. Owirka, S. M. Verbout, and L. M. Novak, "Template-based SAR ATR performance using different image enhancement techniques," in Proceedings of SPIE, vol. 3721, 1999, pp. 302-319.
[3] S. J. Rogers et al., "Neural networks for automatic target recognition," Neural Networks, vol. 8, pp. 1153-1184, 1995.
[4] Q. Zhao et al., "Support vector machines for SAR automatic target recognition," IEEE Transactions on Aerospace and Electronic Systems, vol. 37, no. 2, pp. 643-653, 2001.
[5] J. Thiagarajan, K. Ramamurthy, P. Knee, and A. Spanias, "Sparse representations for automatic target classification in SAR images," in 4th International Symposium on Communications, Control and Signal Processing (ISCCSP), Mar. 2010, pp. 1-4.
[6] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, 2006, pp. 2169-2178.
[7] J. Yang, K. Yu, Y. Gong, and T. Huang, "Linear spatial pyramid matching using sparse coding for image classification," in Computer Vision and Pattern Recognition (CVPR), Miami, FL, 2009, pp. 1794-1801.
[8] J. Wang et al., "Locality-constrained linear coding for image classification," in Computer Vision and Pattern Recognition (CVPR), San Francisco, CA, 2010, pp. 3360-3367.
[9] K. Grauman and T. Darrell, "Pyramid match kernels: Discriminative classification with sets of image features," in Proc. ICCV, 2005, pp. 1458-1465.
[10] F. F. Li and P. Perona, "A Bayesian hierarchical model for learning natural scene categories," in Proc. CVPR, vol. 3, 2005, pp. 524-531.
[11] J. Wright et al., "Robust face recognition via sparse representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 210-227, 2009.
[12] D. L. Donoho, "For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution," Communications on Pure and Applied Mathematics, vol. 59, no. 6, pp. 797-829, 2006.
[13] Y. Boureau, J. Ponce, and Y. LeCun, "A theoretical analysis of feature pooling in visual recognition," in ICML, Haifa, Israel, 2010.
[14] T. Serre, L. Wolf, and T. Poggio, "Object recognition with features inspired by visual cortex," in CVPR, 2005, pp. 994-1000.
[15] E. R. Keydel, "MSTAR extended operating conditions," in Proceedings of SPIE, vol. 2757, 1996, pp. 228-242.