A Novel ATR Classifier Exploiting Pose Information
Jose C. Principe, Qun Zhao, Dongxin Xu
Computational NeuroEngineering Laboratory, University of Florida, Gainesville, FL 32611
[email protected]
1.0 Abstract
This paper describes a new architecture for ATR classifiers based on the premise that the pose of the target is known within a precision of 10 degrees. We recently developed such a pose estimator. The advantage of our classifier is that the pose information decreases the complexity of the input space, which enables fewer features to classify targets with a higher degree of accuracy. Moreover, the training of the classifier can be done discriminantly, which also improves performance. Although our work is very preliminary, performance comparable with the standard template matcher was obtained on the MSTAR database.
2.0 Introduction
ATR classifiers can be broadly divided into two types following the taxonomy in [1]: one class in one network (OCON) and all classes in one network (ACON). Conceptually, the OCON network can be seen as an ACON topology with all the cross-class connections in the upper layers eliminated. Template matchers belong to the first group, while MLP (multilayer perceptron) classifiers belong to the second. The two are developed with very different data models in mind and therefore have very different characteristics. We can also expect different performance, with the edge tilting towards the ACON.
Template matchers create a classifier by taking information exclusively from a single class. Therefore one template is needed per target, and the classifier is linear, followed by a winner-take-all network. It is easy to increase the number of classes by simply developing extra templates; no modification is needed to the existing templates. The big problem, however, is that the more templates we add, the more likely the classifier is to make mistakes. The second class of systems is trained with exemplars of every class using the rules of statistical pattern recognition. These systems are normally nonlinear, such as quadratic classifiers or multilayer perceptrons. The big advantage is that training is discriminant, i.e. one class is trained against all the others. This means that the classifier has to be retrained from scratch if one more class is added to the test suite, but each time the classifier has the possibility of choosing "features" that best individualize each class with respect to all the others. Although performance also degrades with the number of classes, this degradation is slower than with template matchers. There is an intermediate class of systems where some of the properties of the other classes are brought into the training as a penalty term for the template matcher. The minimum average correlation energy (MACE) filter is the best example of such a technique. As we noted in a previous paper [2], the MACE is still a compromise between a pure template matcher and a classifier, and it is not easy to pick criteria to decide how to best train it. Conventional design "guidelines" are sub-optimal.
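The OCON structure just described can be sketched in a few lines of code. This is our own illustrative example on synthetic vectors, not the authors' implementation; the function names and the use of the class mean as the template are our assumptions.

```python
import numpy as np

def train_templates(chips_by_class):
    """OCON training: one template per class, built only from that
    class's own exemplars (here simply their mean chip)."""
    return {label: np.mean(chips, axis=0) for label, chips in chips_by_class.items()}

def classify(chip, templates):
    """Linear matching followed by winner-take-all: score each template
    by normalized correlation with the chip and pick the best."""
    def score(t):
        return float(np.vdot(t, chip)) / (np.linalg.norm(t) * np.linalg.norm(chip) + 1e-12)
    return max(templates, key=lambda label: score(templates[label]))
```

Note how adding a class only requires training one more template, with no change to the existing ones; this is exactly the ease of extension (and the growing confusion risk) discussed above.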
It is important to analyze in more mathematical terms the difference between these two classes of classifiers. Towards this goal we ask: when is a template matcher optimal? The answer is well known in communication theory [3], and can be simply stated as follows: a template matcher is optimal when all the classes have the same covariance. In fact, assuming Gaussian distributed classes [4] with equal a priori probabilities, the discriminant function becomes

g_i(x) = -0.5 (x - u_i)^T Σ_i^{-1} (x - u_i) - 0.5 log|Σ_i| + k    (1)
where u_i is the class mean and Σ_i is the covariance matrix. Therefore the general discriminant is a quadratic function. Notice that when the covariance matrix becomes diagonal, Σ_i = σ^2 I, eq. 1 defaults to

g_i(x) = -||x - u_i||^2 / (2σ^2)    (2)
which is a linear discriminant [4]. Linear discriminants are also called distance classifiers because they make decisions based on Euclidean distances from the class means. We can show that a template matcher is a distance classifier, which means that template matchers are optimum detectors only when every class has the same covariance. Obviously this is not the general case in real world data, so template matchers do not exploit the information contained in the shape of the data clusters, which makes them sub-optimal. The issue is how much better we can do for SAR ATR when the covariance structure of the targets is exploited.
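This reduction is easy to verify numerically: when every class has the covariance σ^2 I, the quadratic discriminant of eq. 1 and the distance discriminant of eq. 2 differ only by a class-independent constant (the log-determinant term), so they rank the classes identically. A minimal sketch (our own illustrative code):

```python
import numpy as np

def quadratic_discriminant(x, mean, cov):
    """Eq. 1 with the constant k dropped: full Gaussian discriminant."""
    d = x - mean
    return -0.5 * d @ np.linalg.inv(cov) @ d - 0.5 * np.log(np.linalg.det(cov))

def distance_discriminant(x, mean, sigma2):
    """Eq. 2: the Euclidean-distance (template matcher) discriminant."""
    return -np.sum((x - mean) ** 2) / (2.0 * sigma2)
```

With unequal class covariances the quadratic rule exploits the cluster shapes and the two rules can disagree, which is precisely the information a template matcher throws away.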
3.0 Discriminant Classifiers for ATR
One of the difficulties we face when developing discriminant classifiers for ATR is the problem of scarce training data and generalization. In our opinion, parametric training is out of the question for SAR-ATR due to our lack of knowledge about the statistics of SAR targets: imposing Gaussian assumptions implies a performance penalty. Purely nonparametric training requires too much data, which is scarce in SAR, so it should also be avoided. One promising alternative is the class of semi-parametric training algorithms based on multilayer perceptrons (MLPs). These mappers are universal, implementing families of discriminant functions depending on their topologies [5]. MLPs can be readily trained with the
backpropagation (BP) algorithm, and they have been applied to SAR ATR with reasonable success [6]. One difficulty with MLPs is that their training does not control generalization [7], which becomes an issue for our application and for their widespread use. Early stopping and/or weight decay are two popular practical training approaches, and a more systematic theory exists based on the VC dimension [7]. A related problem is that MLPs tend to "memorize the training set" when the number of data samples is smaller than the number of free system parameters. Practically, MLPs require that the size of the topology be constrained by the amount of data available, which for image processing applications means that some form of pre-processing for dimensionality reduction (feature extraction) must be utilized. On the other hand, the fewer the features, the higher their discriminant power must be for good performance. Alternatively, one may try to decrease the complexity of the problem domain and thus be able to use fewer features while keeping performance at high levels. Hence, the number of features cannot be treated independently of the task complexity and the size of the network.
3.1 A novel architecture for SAR-ATR classifiers
In this paper we propose a novel architecture for SAR-ATR which uses pose to decrease the complexity of the classification problem. By decreasing the complexity we can utilize fewer features that may not be very discriminant when applied to the full classification problem. Figure 1 depicts our proposed topology. First a pose estimator is applied to the image and determines the approximate pose of the target within a precision of 10 degrees (see a companion paper for details of our pose estimator).
[Figure 1 block diagram: the input image feeds a pose estimator (0 to 180 degrees), which selects one of k pose-specific feature + classifier modules; their outputs (vehicle 1, vehicle 2, ..., vehicle k) feed a selector.]
Figure 1. Classifier architecture
Then for each pose a classifier based on an MLP is developed and trained discriminantly among all the target classes of interest. Note that we have divided the complexity of the problem, because it is much simpler to develop a classifier that performs well when the variability of the input data is reduced to 10 degrees of aspect. However, we still cannot apply the image directly to the MLP, since too many parameters would be required: a 64x64 image chip alone calls for 4,096 inputs! Hence we need some form of pre-processing.
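The pose-gating just described can be sketched as follows. This is a minimal sketch of the routing in Figure 1, assuming a pose estimator that returns an aspect angle in [0, 180) degrees; the function names are ours.

```python
POSE_BIN_DEG = 10  # one classifier per 10-degree aspect slice, as in Figure 1

def route(image, pose_estimator, classifiers):
    """Estimate the pose, pick the classifier trained for that aspect
    slice, and let it label the chip (the 'selector' of Figure 1)."""
    pose_deg = pose_estimator(image)          # assumed to return [0, 180)
    bin_idx = int(pose_deg) // POSE_BIN_DEG   # which aspect slice
    return classifiers[bin_idx](image)
```

Each entry of `classifiers` would be an MLP trained only on chips from its own aspect slice, which is what reduces the input-space variability each network must cope with.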
3.2 Preprocessing for MLPs
Since we have simplified the complexity of the problem by dividing the aspect angle of the targets into slices of 10 degrees, we can also utilize simpler feature extractors without affecting performance. We demonstrate this approach in this paper by designing a very simple preprocessor based on the RCS (radar cross section) in down range. We acknowledge that 2D information is lost when this preprocessing is utilized. In fact, experience shows that RCS is useless for classifying targets when the full 360 degrees of aspect is used. But we want to make the point that when a pose estimator is used to slice the input space into sub-regions, even this suboptimal projection preserves enough information for classification with reasonable accuracy, comparable to that of the template matchers [8]. Figure 2 illustrates the mean RCS downrange profiles of the T72, BMP2 and BTR70 at 0 degrees, obtained from a set of 64x64 SAR images. We observe that the RCS profiles for a given angle are very different across targets. Therefore, if we are able to divide the input space into narrow azimuth regions and derive a classifier for each, even simple features will be able to produce accurate classifiers. To obtain features that are more robust against noise disturbance and image shift, pre-processing steps are needed, including smoothing among neighboring resolution cells, down-sampling and normalization. From the 2D SAR data we can obtain both downrange and cross-range profiles. However, we found that cross-range profiles could not provide results as good as the downrange profiles, so we used the latter in this work.
3.3 Definition of the classification problem
We utilized the MSTAR database to test our proposed classifier architecture, using the problem formulation in [8]. Basically, the goal is to design a classifier for 3 targets, with several confuser vehicles. Although the confuser is a concept that makes sense mostly for template matchers, we still utilized the same idea even though our system is trained discriminantly. We created a 4-output system: one output for each of the three classes and a fourth output for the confusers.
Figure 2. RCS downrange profiles of 3 targets: (a) T72 (b) BTR70 (c) BMP2
In this experiment, three MSTAR targets are employed: T72, BMP2, and BTR70. Each target has one or more serial numbers: the T72 serial numbers are 132, 812, and s7; the BMP2 serial numbers are c21, 9563, and 9566; and the BTR70 serial number is c71. One serial number from each target is used in the training set, while the others are assigned to the testing set. All training occurred at a 17 degree depression angle, while testing was performed at a 15 degree depression angle. The confuser targets are MSTAR's 2S1 and Zil131.
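The preprocessing chain of Section 3.2 (collapse to down range, smooth neighboring cells, down-sample, normalize) can be sketched as follows. The smoothing kernel width and the use of amplitude normalization are illustrative assumptions on our part; only the 64x64 chip size and the 24-point output are taken from the text.

```python
import numpy as np

def rcs_downrange_profile(chip, n_features=24, smooth=3):
    """Reduce a 64x64 magnitude chip to a normalized downrange RCS profile."""
    profile = np.asarray(chip).sum(axis=1)               # collapse cross-range cells
    kernel = np.ones(smooth) / smooth
    profile = np.convolve(profile, kernel, mode="same")  # smooth neighboring cells
    idx = np.linspace(0, profile.size - 1, n_features).round().astype(int)
    profile = profile[idx]                               # down-sample to the MLP input size
    return profile / (np.linalg.norm(profile) + 1e-12)   # amplitude normalization
```

The normalization step is what buys some robustness to overall RCS scaling, while the smoothing reduces sensitivity to small image shifts.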
3.4 Classifier training
We selected an MLP with 24 inputs, one hidden layer, and 4 outputs. A one-hidden-layer MLP has been shown to be a universal mapper for smooth discriminant functions, so it is the one utilized here. The variable in the topology is the number of hidden-layer PEs. There were not enough training patterns to train a 24-input, 4-output MLP with data from poses within 10 degrees, as we would have liked to do for simplicity. We had to increase the aspect-angle slice to 35 degrees to obtain a reasonable training data size (30 examples per target). The training set contains 120 samples from the 3 target classes and 1 confuser class, all at a 17 degree depression angle. The testing set has 160 samples from the 3 target types (different serial numbers) and 1 confuser class, all at a 15 degree depression angle. We trained MLPs with 4 and 5 hidden PEs for the classification. On the training set, both MLPs reached 100% accuracy.
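Since both networks reach 100% training accuracy, overtraining is a concern, and the early stopping mentioned in Section 3.0 is the usual remedy. A generic sketch (our own, not the procedure actually used in these experiments), with the training pass and the held-out error supplied as callables:

```python
def train_with_early_stopping(train_step, validation_error, max_epochs=500, patience=20):
    """Generic early stopping: halt when the held-out error has not improved
    for `patience` consecutive epochs, and report the best epoch found."""
    best_err, best_epoch = float("inf"), -1
    for epoch in range(max_epochs):
        train_step()                 # one backprop pass over the training set
        err = validation_error()     # error on a held-out set
        if err < best_err:
            best_err, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break                    # generalization has stopped improving
    return best_epoch, best_err
```

With only 120 training samples, however, carving out a validation set is itself costly, which is part of why controlling the number of free parameters matters so much here.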
4.0 Results
Table 1 shows the classification results of the two classifiers with different numbers of hidden nodes. Both classifiers were built on the training set within the aspect range of [0, 35] degrees. From these results, we see that the performance is always higher than 65% accuracy, and reaches 100% in two of the three classes for the 5 hidden PE MLP. These results are comparable with those of template matching [8]. This fact suggests that the mean RCS downrange profiles can be used as an effective feature for SAR ATR if the input space is sliced into narrow angles of view, as done with the pose estimator.
Table 1: Testing confusion matrices of network I (24-4-4) and network II (24-5-4)

24-4-4     T72    btr70  bmp2   confuser
T72_2      0.8    0.2    0.     0.
T72_3      0.9    0.1    0.     0.
btr70_1    0.     1.0    0.     0.
bmp2_1     0.25   0.     0.65   0.1
bmp2_2     0.15   0.     0.85   0.

24-5-4     T72    btr70  bmp2   confuser
T72_2      1.0    0.     0.     0.
T72_3      0.95   0.     0.05   0.
btr70_1    0.     1.0    0.     0.
bmp2_1     0.25   0.     0.65   0.1
bmp2_2     0.15   0.     0.85   0.
There are some important conclusions. First, the performance of our classifier for the 35 degree range is comparable to the results of the template matcher, even when the RCS is used as a feature. This shows two things: we have effectively divided the complexity of the input domain to a point where a sub-optimal feature can be used while providing results comparable to the template matcher, and the discriminant training is able to better utilize the RCS feature when compared to a simpler OCON system. Due to the preliminary nature of these results we have not tested other angles, but we have no reason to believe that our classifier will perform worse for other angles.

Second, the MLP approach requires more data for training. We had to extend the size of the pose slices to 35 degrees, even though our pose estimator has an accuracy better than 10 degrees. Our MLP has between 384 and 480 weights, so we cannot train it well with less data (in fact we believe that the system is overtraining, given the disparity between the training-set and test-set performances). We will have to utilize weight sharing [9] to come up with topologies with fewer free parameters, which normally generalize better. Another approach is to use the newly developed technique of boosting [10] to further train the MLP even when the training-set performance is 100%.

Third, in no way are we advocating the use of RCS as the "best" possible feature for our classification scheme. We know that this feature under-utilizes the information in the target images. However, it is a simple, quick way to decrease the number of inputs, and, remarkably, it is a reasonable feature when only part of the input space is used to train the MLP. We plan to study other pre-processors, including our recently proposed method of information-theoretic learning [11], to decrease the number of features, which will allow training the classifier for 20 degree slices. This may be
necessary in more realistic testing conditions where the depression angle changes, as in the MSTAR II database. Our goal is to come up with a feature extractor that will enable testing the classifier for 20 degree azimuth slices (5 degrees of overlap with the previous and following slices). We will be reporting on these experiments during the next year.

Acknowledgments: This work was supported by DARPA grant F33615-97-1-1019.
5.0 References
1. Kung
2. Fisher (SPIE paper)
3. Van Trees (communications book)
4. Duda (pattern rec)
5. Haykin
6. ??? (a good paper on NN for SAR classification)
7. Vapnik
8. Velten
9. Weigend (see in Haykin)
10. boosting (ask Victor)
11. xu