BAYESIAN APPROACHES FOR ROBUST AUTOMATIC TARGET RECOGNITION
by KEITH COPSEY
A report presented for the examination for the transfer of status to the degree of Doctor of Philosophy of the University of London
Department of Mathematics
Imperial College, Huxley Building
180 Queens Gate
London SW7 2BZ
United Kingdom
JULY 2004
To my parents
Abstract

This thesis documents studies into robust automatic target recognition (ATR), with an emphasis on ATR from radar measurements. The underlying aim of the research is to develop ATR systems that are robust to the variety of conditions that will be faced in operational use. Although motivated by ATR applications, the majority of the research is applicable to generic classification (discrimination) problems. A variety of approaches have been proposed, all following a Bayesian formalism.

The initial research focuses on the use of Bayesian mixture modelling to estimate the conditional distributions of the sensor measurements for each class of target. A procedure based on altering hyperprior distributions is investigated as a mechanism for improving the generalisation properties of the mixture model classifiers. Specifically, attempts are made to design a classifier that is robust to changes in target configuration and that can generalise to other targets of the same generic class.

This thesis argues that the development of robust ATR systems requires more than the application of classification algorithms designed using limited amounts of training data and applied to single sensor measurements. Specifically, it is argued that full use must be made of any extra information or data that is available. A Bayesian inverse model based algorithm is developed to enable a classifier designed using data gathered from one sensor to be applied to data gathered from a different sensor (provided that physical and processing models for the sensors are known), thus addressing the issue of insufficient training data for operational sensors. Bayesian networks are proposed for incorporating contextual information and domain specific knowledge alongside the sensor measurements in an ATR system. Finally, a Bayesian framework is developed that utilises measurements from two sensors, separated in time, to classify relocatable targets in a scene.
Acknowledgements

The research in this thesis was conducted under the PRI/IRL (Public Research Institute and Industrial Research Laboratory) part-time PhD scheme operated by Imperial College of Science and Technology. The research was sponsored by the UK MOD Corporate Research programme, and was undertaken while under the employment of QinetiQ Ltd in Great Malvern. The work has been conducted under the helpful supervision of Dr Andrew Webb at QinetiQ Malvern and Prof David Hand at Imperial College, both of whom I thank.

I would like to thank my parents, brother and the rest of my family for their support throughout my education. I am also grateful to my friends, and to the players of Tranmere Rovers, for making it all bearable.

I would like to thank Dr Adrian Britton, Mr Richard Lane, Mr Guy Maskall, Dr Sunil Manchanda and Dr Karl West of QinetiQ for their support and advice on the radar aspects of this work. Thank you also to Dr Chris Holmes of Imperial College for helpful advice on aspects of the work within this thesis. I would also like to thank Dr Stephen Luttrell and Dr Christopher Webber of QinetiQ for providing me with many interesting sources of research while I undertook this PhD. Although the research into self-organising techniques that I have conducted under their supervision is not directly relevant to this thesis, the guidance that I have received from them has been invaluable.

Thank you also to Dr Sally Baker, Mr Mark Briers, Mr David Faulkner, Dr Ashley Ford, Mrs Jenny Green, Mr Roger Herbert, Mr David Howell, Mr Eric Knowles, Dr Alan Marrs, Dr Simon Maskell, Mrs Helen Newsholme, Dr John O’Loghlen, Dr Colin Reed, Dr Jon Salkeld, Mr Hugh Webber and Mr David Whitaker of QinetiQ, Dr Mark Bedworth, Dr Neil Gordon and Mr James Levett, formerly of QinetiQ, Miss Hannah Batty of Warwick University, and many others who have not been listed, for their helpful support, guidance and assistance.
Table of contents

ABSTRACT
ACKNOWLEDGEMENTS
ACRONYMS

1 INTRODUCTION
  1.1 General
  1.2 Motivation
  1.3 Benefits of robust ATR
  1.4 Identification Friend or Foe
  1.5 Requirement for robustness
  1.6 Related work
  1.7 Bayesian motivation
  1.8 Outline
  1.9 Publications

2 LITERATURE SURVEY
  2.1 Introduction
  2.2 Generic pattern recognition
    2.2.1 Introduction
    2.2.2 Generic pattern recognition system
    2.2.3 Classification techniques
    2.2.4 Density estimation approaches
    2.2.5 Motivation for density estimation approach
    2.2.6 Discriminant function approaches
  2.3 Research state-of-the-art in ATR
    2.3.1 Introduction
    2.3.2 Density estimation approaches
    2.3.3 Discriminant function approaches
    2.3.4 General comments

3 BAYESIAN GAUSSIAN MIXTURE MODELS
  3.1 Introduction
    3.1.1 General
    3.1.2 Introduction to mixture models
  3.2 Approach
    3.2.1 Mixture model motivation
    3.2.2 Bayesian motivation
  3.3 Related work
  3.4 Problem formulation
    3.4.1 Introduction and notation
    3.4.2 Model order
    3.4.3 Prior distributions
    3.4.4 Posterior distribution
  3.5 MCMC algorithm
    3.5.1 Introduction
    3.5.2 Outline
    3.5.3 Conditional distributions for the mixture components
    3.5.4 Conditional distributions for the allocation probabilities
    3.5.5 Conditional distributions for the allocation variables
  3.6 Trained classifier
    3.6.1 General
    3.6.2 Classifying the training data
    3.6.3 Classifying future observations
    3.6.4 Rejection of unknowns
  3.7 Application
    3.7.1 Description of the experiments
    3.7.2 Pre-processing
    3.7.3 Model order selection
    3.7.4 Algorithm details
    3.7.5 Convergence of the MCMC sampling algorithm
    3.7.6 Baseline classifier
    3.7.7 Experimental results
  3.8 Summary

4 USE OF HYPERPRIORS
  4.1 Introduction
    4.1.1 General
    4.1.2 Proposed approach
    4.1.3 Chapter outline
  4.2 Problem formulation
  4.3 Bayesian approach
    4.3.1 Prior distributions
    4.3.2 Hyperprior distributions
    4.3.3 Posterior distribution
  4.4 MCMC algorithm
    4.4.1 Outline
    4.4.2 Conditional distributions for the mixture components
    4.4.3 Conditional distributions for the variable hyperparameters
    4.4.4 Conditional distributions for the allocation probabilities
    4.4.5 Conditional distributions for the allocation variables
  4.5 Using the trained classifier
    4.5.1 Classifying the training data
    4.5.2 Classifying future observations
    4.5.3 Incorporating uncertainty in target location
  4.6 Altering the hyperpriors
    4.6.1 Introduction
    4.6.2 Approach for improving generalisation performance
    4.6.3 Prior sensitivity
    4.6.4 Updating the MCMC samples
  4.7 Description of experiments
    4.7.1 Introduction
    4.7.2 Training, validation and test datasets
    4.7.3 First scenario
    4.7.4 Second scenario
    4.7.5 Pre-processing
  4.8 Results for first scenario
    4.8.1 Introduction
    4.8.2 Mixture model classifiers
    4.8.3 Mixture model classifiers used in comparisons
    4.8.4 Comparisons with the baseline classifiers on the validation data
    4.8.5 Comparisons with the baseline classifiers on the test data
  4.9 Results for second scenario
    4.9.1 Introduction
    4.9.2 Validation data
    4.9.3 Test data
    4.9.4 General comment
  4.10 Summary and recommendations

5 BAYESIAN GAMMA MIXTURE MODELS
  5.1 Introduction
    5.1.1 General
    5.1.2 Related work
    5.1.3 Outline
  5.2 Approach
    5.2.1 Mixture model motivation
    5.2.2 Bayesian motivation
    5.2.3 Gamma mixture component motivation
  5.3 Problem formulation
  5.4 Bayesian solution
    5.4.1 Prior distributions
    5.4.2 Posterior distribution
  5.5 MCMC algorithm
    5.5.1 Introduction
    5.5.2 Outline
    5.5.3 The mixture components
    5.5.4 The allocation probabilities
    5.5.5 The allocation variables
  5.6 Trained classifier
    5.6.1 Classifying the training data
    5.6.2 Classifying future observations
    5.6.3 Incorporating uncertainty in target location
  5.7 Application
    5.7.1 Description
    5.7.2 Experimental results
    5.7.3 Computational cost
    5.7.4 Model selection
    5.7.5 Examples of recognition failure
    5.7.6 Performance variation with target orientation
  5.8 Summary

6 GENERALISATION BETWEEN SENSORS
  6.1 Introduction
    6.1.1 General
    6.1.2 Military benefits
    6.1.3 Bayesian inverse model based approach
    6.1.4 Bayesian motivation
    6.1.5 Scope of this chapter
  6.2 Bayesian framework
    6.2.1 Introduction
    6.2.2 Section outline
    6.2.3 DBS data generation
    6.2.4 Required distributions
    6.2.5 Posterior distribution
    6.2.6 Bayesian objective
    6.2.7 Using the ISAR training data
    6.2.8 An alternative classification model
  6.3 MCMC solution
    6.3.1 Introduction
    6.3.2 Sampling from the posterior distribution
    6.3.3 Classification using the ISAR training data
    6.3.4 Extension to the Gibbs sampler algorithm
    6.3.5 Conversion of the ISAR data
  6.4 Simplified problem
    6.4.1 Introduction
    6.4.2 Description of the data
    6.4.3 Sensor measurement models
    6.4.4 Prior distributions
    6.4.5 Variance of the additive sensor noise
    6.4.6 Proposal distributions
    6.4.7 Analytic sampling
  6.5 Results
    6.5.1 Introduction
    6.5.2 Details of the experiments
    6.5.3 Algorithm parameters
    6.5.4 Classification results
    6.5.5 Examination of the results for class 0 in the 1st experiment
    6.5.6 Examination of the results for class 1 in the 1st experiment
    6.5.7 Convergence of the MCMC sampling algorithm
  6.6 Summary

7 CONTEXTUAL INFORMATION
  7.1 Introduction
    7.1.1 General
    7.1.2 Details
    7.1.3 Bayesian networks
    7.1.4 Problem definition
    7.1.5 Related work
    7.1.6 Chapter outline
  7.2 Bayesian network
    7.2.1 Introduction
    7.2.2 Groups of targets
    7.2.3 Terrain
    7.2.4 Classes and locations
    7.2.5 Measurements and locations
    7.2.6 Standard Bayesian classifier
    7.2.7 Terrain estimate
  7.3 Conditional distributions
    7.3.1 General
    7.3.2 Proximity to boundaries
    7.3.3 Clustering
    7.3.4 Local terrain type
    7.3.5 Group effects
  7.4 Inference on the network
    7.4.1 Calculating node probabilities
    7.4.2 Using the probabilities
    7.4.3 Standard Bayesian classifier
  7.5 Simulated example
    7.5.1 Scenario
    7.5.2 Experimental results
  7.6 Summary

8 RELOCATABLE TARGETS
  8.1 Introduction
    8.1.1 General
    8.1.2 Overall framework
    8.1.3 Bayesian motivation
    8.1.4 Chapter outline
    8.1.5 Related work
  8.2 Problem specification
    8.2.1 Introduction
    8.2.2 Targeting detections
    8.2.3 Seeker detections
  8.3 Bayesian solution
    8.3.1 Posterior distribution
    8.3.2 Prior evolver
    8.3.3 Particle filter
    8.3.4 Application of the particle filter
    8.3.5 Likelihood
    8.3.6 Use of the particles
  8.4 Synthetic example
    8.4.1 Description
    8.4.2 Specific example
    8.4.3 Monte Carlo assessment
  8.5 Summary

9 CONCLUSIONS

10 RECOMMENDATIONS

A DISTRIBUTIONS
  A.1 Continuous distributions
  A.2 Discrete distributions

B GAUSSIAN MIXTURE MODEL CLASSIFIER
  B.1 Training
  B.2 Classifying
  B.3 Hyperparameters

C HYPERPRIOR DETAILS
  C.1 Derivations
  C.2 Grouping components
    C.2.1 Introduction
    C.2.2 Starting point
    C.2.3 Integrating the variance analytically
    C.2.4 Integrating the hyperparameters analytically
    C.2.5 General

D HYPERPRIOR EXPERIMENTS
  D.1 Hyperparameter values
  D.2 First scenario results
    D.2.1 Mixture model results
    D.2.2 Comparisons with the baseline classifiers
  D.3 Second scenario
    D.3.1 Effect of using extended training sets

E TABLES FROM THE HYPERPRIOR EXPERIMENTS

F FIGURES FROM THE HYPERPRIOR EXPERIMENTS

G GAMMA MIXTURE MODEL CLASSIFIER
  G.1 Training
  G.2 Classifying
    G.2.1 Predictive density
    G.2.2 Partial Rao-Blackwellisation

H GENERALISATION BETWEEN SENSORS
  H.1 Extension
  H.2 Simplified problem
    H.2.1 Analytic sampling from the distribution for the ISAR data

I BAYESIAN NETWORK DETAILS
  I.1 Introduction
  I.2 Group
  I.3 Allocations
  I.4 Conditional allocations
List of Tables 1
Acronyms - A to L . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
21
2
Acronyms - M to Z . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
22
3.1
Training datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
65
3.2
Test datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
65
3.3
Gaussian mixture model results for training data . . . . . . . . . . . . . . . . . . . .
72
3.4
Gaussian mixture model results for test datasets . . . . . . . . . . . . . . . . . . . .
72
3.5
Correlation-filter results for test datasets . . . . . . . . . . . . . . . . . . . . . . . . .
72
5.1
Summary of ship data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.2
Classification rates for test datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.3
Test results for self-organising map. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.4
Test results for maximum-likelihood gamma mixture model. . . . . . . . . . . . . . . 128
6.1
Experimental parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
6.2
Parameters used in the MCMC algorithm . . . . . . . . . . . . . . . . . . . . . . . . 162
6.3
Classification rates for class 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
6.4
Classification rates for class 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
6.5
Parameters used to investigate convergence of the MCMC algorithm . . . . . . . . . 169
7.1
Group identification performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
7.2
Classification rates for the three scenarios . . . . . . . . . . . . . . . . . . . . . . . . 191
8.1
Classification rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
8.2
Performance estimating location . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
E.1 Training datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247 E.2 Validation datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
14
E.3 Test datasets, part 1 . . . 248
E.4 Test datasets, part 2 . . . 249
E.5 Separating validation datasets into groups reflecting the relationship between validation and training vehicles . . . 249
E.6 Separating test datasets into groups reflecting the relationship between test and training vehicles . . . 249
E.7 Number of datasets in each group and class, for validation and test data . . . 249
E.8 Results for training data, using default hyperpriors . . . 250
E.9 Results for training data, using altered hyperpriors . . . 250
E.10 Results for validation data, using direct estimation . . . 251
E.11 Results for validation data, using Rao-Blackwellisation . . . 251
E.12 Results for group G1 test datasets, using direct estimation . . . 251
E.13 Results for group G2 test datasets, using direct estimation . . . 252
E.14 Results for group G3 test datasets, using direct estimation . . . 252
E.15 Results for group G4 test datasets, using direct estimation . . . 253
E.16 Results for group G5 test datasets, using direct estimation . . . 254
E.17 Results for group G1 test datasets, using Rao-Blackwellisation . . . 254
E.18 Results for group G2 test datasets, using Rao-Blackwellisation . . . 255
E.19 Results for group G3 test datasets, using Rao-Blackwellisation . . . 255
E.20 Results for group G4 test datasets, using Rao-Blackwellisation . . . 256
E.21 Results for group G5 test datasets, using Rao-Blackwellisation . . . 257
E.22 Classification rates for the validation datasets . . . 257
E.23 Classification rates for group G1 test datasets . . . 258
E.24 Classification rates for group G2 test datasets . . . 258
E.25 Classification rates for group G3 test datasets . . . 258
E.26 Classification rates for group G4 test datasets . . . 259
E.27 Classification rates for group G5 test datasets . . . 259
E.28 Results for extended training data of second scenario . . . 260
E.29 Classification rates for the validation datasets, under the 2nd scenario . . . 260
E.30 Classification rates for group G1 test datasets, under the 2nd scenario . . . 260
E.31 Classification rates for group G2 test datasets, under the 2nd scenario . . . 260
E.32 Classification rates for group G3 test datasets, under the 2nd scenario . . . 261
E.33 Classification rates for group G4 test datasets, under the 2nd scenario . . . 261
E.34 Classification rates for group G5 test datasets, under the 2nd scenario . . . 261
H.1 Definition of variables . . . 277
List of Figures

2.1 Pattern recognition system . . . 30
2.2 Taxonomy of statistical classification techniques . . . 31
3.1 Hierarchical structure for the prior distributions of the parameters of the Gaussian mixture models . . . 53
3.2 Outline of the Bayesian Gaussian mixture model training algorithm . . . 56
3.3 Vehicles imaged to create the ISAR dataset . . . 64
3.4 Example of an ISAR image from a training dataset . . . 65
3.5 Collection of the extracted range profiles . . . 67
3.6 Some extracted range profiles . . . 67
3.7 Training data classification rates against model order . . . 68
3.8 10-fold cross-validation classification rates against model order . . . 69
3.9 MCMC samples for a mean vector . . . 70
3.10 MCMC samples for a standard deviation vector . . . 71
3.11 Summary of classification rates for the ISAR test data . . . 73
4.1 Hierarchical structure for the prior distributions of the Gaussian mixture model parameters in previous chapter . . . 79
4.2 Hierarchical structure for the prior distributions of the Gaussian mixture model parameters . . . 79
4.3 Outline of training algorithm for Gaussian mixture models using hyperpriors . . . 83
4.4 Typical vehicle imaged to create the ISAR dataset . . . 95
4.5 Example of an ISAR image from the training dataset . . . 96
5.1 Hierarchical structure for the prior distributions of the parameters of the gamma mixture models . . . 115
5.2 Intensity plots of some training data range profiles . . . 125
5.3 Radar range profiles from training data - i . . . 125
5.4 Radar range profiles from training data - ii . . . 126
5.5 Summary of correct classification rates for the RRP data . . . 128
5.6 Training data classification rate versus model order . . . 129
5.7 Test data misclassification of class 5 as class 4 - i . . . 130
5.8 Test data misclassification of class 5 as class 6 - ii . . . 130
5.9 Overall test classification rate against aspect angle . . . 131
6.1 Example of DBS and ISAR imagery of a battlefield target . . . 133
6.2 DBS data generation . . . 135
6.3 Prior distribution for the RCS . . . 139
6.4 DBS data generation, combining physical and processing models . . . 140
6.5 Bayesian inverse model incorporating the ISAR image . . . 142
6.6 Including the class label in the DBS and ISAR model . . . 143
6.7 MCMC algorithm for DBS inversion . . . 145
6.8 Further sub-division within the MCMC algorithm for DBS inversion . . . 148
6.9 Mean values for the target RCS grids . . . 153
6.10 Sensor one point spread function applied to the mean values for the target RCS grids . . . 154
6.11 Sensor two point spread function applied to the mean values for the target RCS grids - i . . . 154
6.12 Sensor two point spread function applied to the mean values for the target RCS grids - ii . . . 155
6.13 Classification rates . . . 163
6.14 Average classification rates . . . 164
6.15 RCS and sensor one images from class 0 . . . 165
6.16 First and last MCMC RCS samples from the posterior distribution . . . 165
6.17 MCMC RCS samples from the posterior distribution . . . 166
6.18 Analytic mean/MAP RCS grid . . . 166
6.19 RCS and sensor one images from class 1 . . . 167
6.20 First and last MCMC RCS samples from the posterior distribution . . . 167
6.21 MCMC RCS samples from the posterior distribution . . . 168
6.22 Analytic mean/MAP RCS grid . . . 168
6.23 Separated pixels of the MCMC RCS samples for a class 0 image . . . 169
6.24 Separated pixels of the MCMC RCS samples for a class 1 image . . . 170
6.25 Demonstration of correlation between grid positions for the MCMC RCS samples for a class 0 image . . . 170
6.26 Euclidean distances between the MCMC RCS samples and the actual RCS grid . . . 171
7.1 Bayesian network for incorporation of contextual information . . . 176
7.2 Gaussian measurement distributions . . . 188
7.3 Scenario for first group . . . 189
7.4 Scenario for second group . . . 189
7.5 Scenario for third group . . . 190
7.6 Classification rates for the first scenario . . . 191
7.7 Classification rates for the second scenario . . . 192
7.8 Classification rates for the third scenario . . . 192
8.1 Exploitation of targeting information by a weapon seeker . . . 194
8.2 Framework for exploitation of targeting information by a weapon seeker . . . 195
8.3 Example of a DBS seeker image . . . 198
8.4 Example of a DBS image chip . . . 199
8.5 Example set of targeting detections . . . 207
8.6 Example set of relocated targeting detections . . . 208
8.7 Example set of seeker detections . . . 208
8.8 Particles with largest weights . . . 209
8.9 Estimated locations . . . 210
8.10 Estimated locations with Bayesian estimate of error . . . 211
F.1 Training and validation classification rates against model order . . . 262
F.2 Validation dataset classification rates, for mixture model classifiers . . . 263
F.3 Classification rates within groups and classes, for the validation data . . . 263
F.4 Classification rates for group G1 test datasets . . . 264
F.5 Classification rates for group G2 test datasets . . . 264
F.6 Classification rates for group G3 test datasets . . . 265
F.7 Classification rates for group G4 test datasets . . . 265
F.8 Classification rates for group G5 test datasets . . . 266
F.9 Classification rates within groups and classes, for the test data . . . 266
F.10 Change in test data classification rates within groups and classes, for mixture model classifiers compared to nearest neighbour classifier . . . 267
F.11 Change in mixture model test data classification rates within groups and classes, by altering the hyperpriors compared to default hyperpriors . . . 267
F.12 Training and validation classification rates against model order, for the 2nd scenario . . . 267
F.13 Classification rates within groups and classes, for the validation data, under the 2nd scenario . . . 268
F.14 Classification rates for group G1 test datasets, under the 2nd scenario . . . 268
F.15 Classification rates for group G2 test datasets, under the 2nd scenario . . . 269
F.16 Classification rates for group G3 test datasets, under the 2nd scenario . . . 269
F.17 Classification rates for group G4 test datasets, under the 2nd scenario . . . 270
F.18 Classification rates for group G5 test datasets, under the 2nd scenario . . . 270
F.19 Classification rates within groups and classes, for test data, under the 2nd scenario . . . 271
F.20 Change in the mixture model test data classification rates within groups and classes, between the second and first scenarios . . . 271
F.21 Change in the baseline test data classification rates within groups and classes, between the second and first scenarios . . . 271
H.1 Bayesian inverse model . . . 276
I.1 Bayesian network for incorporation of domain specific information . . . 280
Acronyms

ADU  Air Defence Unit
AFRL  U.S. Air Force Research Laboratory
APC  Armoured Personnel Carrier
ARMA  Autoregressive Moving Average
ATR  Automatic Target Recognition
CAD  Computer-Aided Design
CART  Classification and Regression Tree
CC&D  Camouflage, Concealment and Deception
CFAR  Constant False Alarm Rate
CIMA2001  Computational Intelligence: Methods and Applications conference
DAG  Directed Acyclic Graph
DARPA  Defence Advanced Research Projects Agency
DBS  Doppler Beam Sharpened
EM  Expectation-Maximisation
EOC  Extended Operating Condition
FLIR  Forward-looking Infrared
GCNN  Gram-Charlier Neural Network
GPNN  Generalised Probabilistic Neural Network
GUI  Graphical User Interface
HMM  Hidden Markov Model
HRRP  High Range Resolution Radar Profile
IAPR  International Association of Pattern Recognition
ID  Identity
IFF  Identification Friend or Foe
IPB  Intelligent Preparation of the Battlefield
IR  Infra-red
IRL  Industrial Research Laboratory
ISAR  Inverse Synthetic Aperture Radar
LCS  Localised Contour Sequence
LDA  Linear Discriminant Analysis
LHS  Left-hand-side
LVQ  Learning Vector Quantiser

Table 1: Acronyms - A to L
MAP  maximum a posteriori
MARS  Multivariate Adaptive Regression Splines
MBT  Main Battle Tank
MCMC  Markov chain Monte Carlo
M-H  Metropolis-Hastings
MIDAS  Mobile Instrumented Data Acquisition System
MLP  Multilayer Perceptron
MMW  Millimetre Wave
MSTAR  Moving and Stationary Target Acquisition and Recognition
OCR  Omni-directional Reflector
PCA  principal components analysis
PEMS  Predict, Extract, Match, and Search
PRI  Public Research Institute
RBF  Radial Basis Function
RCS  Radar Cross Section
RHS  Right-hand-side
RJMCMC  Reversible jump Markov chain Monte Carlo
ROC  Receiver Operating Characteristic
ROE  Rule of Engagement criteria
ROI  Region of Interest
RRP  Radar Range Profile
SAR  Synthetic Aperture Radar
SNR  Signal-to-Noise Ratio
SOFM  Self-Organising Feature Map
SPNN  Self Partitioning Neural Network
SPR  Statistical Pattern Recognition
SVM  Support Vector Machine
VQ  Vector Quantisation

Table 2: Acronyms - M to Z
CHAPTER 1

INTRODUCTION

1.1 General
This thesis documents studies into robust automatic target recognition (ATR), with an emphasis on ATR from radar measurements. The underlying aim is to develop ATR systems which are robust to the variety of conditions that will be faced in operational use. Although motivated by ATR applications, the research is applicable to generic classification (discrimination) problems. A number of approaches are proposed in this thesis, all following a Bayesian formalism.

One challenge to be addressed is the generalisation issue, in which the training data are not representative of the operating conditions in which the classifier will be applied. For example, given training data characterising a set of targets in specific configurations, how can a classifier be designed that is robust to changes in target configuration and that can generalise to other targets of the same generic class? Bayesian Gaussian mixture model approaches are developed to address this challenge, and a specific procedure utilising hyperprior distributions is proposed, as a mechanism for improving robustness to changes in target configuration, given small amounts of additional information. This is illustrated by considering the classification of mobile battlefield targets from Inverse Synthetic Aperture Radar (ISAR) imagery.

On the basis of physical considerations for radar ATR, the Gaussian mixture model approach is extended to gamma mixture models. This is illustrated by considering the classification of military ship radar range profiles.

However, this thesis argues that the development of robust ATR systems requires more than just application of classification algorithms designed using limited amounts of training data, and applied to single sensor measurements. Specifically, it is argued that full use must be made of any extra information or data that is available.
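The mixture-model decision rule underlying the classifiers described above can be sketched numerically. The following is a minimal, hedged illustration: the hand-set univariate mixture parameters and equal class priors are purely hypothetical, standing in for the quantities that the thesis's Bayesian (MCMC-based) training schemes would estimate.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    # Density of a univariate Gaussian (diagonal models generalise per dimension).
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def mixture_density(x, weights, means, variances):
    # p(x | class) as a weighted sum of Gaussian components.
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))

def classify(x, class_models, priors):
    # Bayes' rule: posterior is proportional to class-conditional density x prior.
    joint = np.array([mixture_density(x, *class_models[c]) * priors[c]
                      for c in range(len(class_models))])
    return int(np.argmax(joint)), joint / joint.sum()

# Hypothetical two-class example: (weights, means, variances) per class,
# hand-set here but in practice estimated from training data.
models = [
    ([0.5, 0.5], [0.0, 2.0], [0.5, 0.5]),   # class 0
    ([0.3, 0.7], [5.0, 7.0], [1.0, 1.0]),   # class 1
]
label, posterior = classify(1.0, models, priors=[0.5, 0.5])
```

Here the measurement 1.0 lies close to the class 0 components, so the posterior mass concentrates on class 0; the fully Bayesian treatment in later chapters replaces these point-estimate parameters with averages over posterior samples.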
The issue of insufficient training data for a given sensor can be partly addressed if procedures are developed that generalise classifiers across sensors. Within this thesis, for situations where physical and processing models for the sensors are known, a Bayesian inverse model based algorithm is developed to enable a classifier designed using data gathered from one sensor to be applied to data gathered from a different sensor. The approach is illustrated for a synthetic problem.

In most ATR scenarios there will be a wealth of contextual information and domain specific knowledge which, if utilised, would enhance the ability to locate and identify targets correctly. This thesis proposes the use of Bayesian networks to incorporate such contextual information and domain specific knowledge alongside the sensor measurements in an ATR system.

In many cases the final sensor measurements will not be the only sensor measurements available on the objects that need to be classified. In such circumstances it is important to utilise all
the available measurements. Within this thesis a Bayesian framework is developed that utilises measurements from two sensors, separated in time, to classify relocatable targets in a scene. The motivating application for the proposed framework is to utilise targeting information in a weapon seeker. The approach is demonstrated on a synthetic example.
1.2 Motivation
In the traditional approach to classifier design, it is often assumed that the training data are representative of the operating conditions in which the classifier will be applied. In many practical situations this may not be so, due to drifts or changes in the population parameters, noise in the operating conditions, inadequate training data, and unknown priors and costs. If automated classifiers are to be applied to real data in real situations (as opposed to the controlled conditions of the laboratory), all these issues must be addressed. It is this realisation which motivates the work in this thesis.

The work in the second part of this thesis is partly motivated by the observation that humans classifying objects take many more factors into account than just sensor measurements on the objects. Specifically, a wealth of contextual information and domain specific knowledge is utilised when humans assign objects to classes. If an automatic classifier is to provide robust classifications, it is likely that such extra information will need to be utilised.
1.3 Benefits of robust ATR
The military needs for efficient and effective means of ATR are readily apparent. Not least, understanding of the battlefield would be greatly enhanced if the sensor systems used could provide automated, reliable classifications of the objects in the areas surveyed. The large volume of data now produced by surveillance sensors makes human interpretation of all the data infeasible, so automated techniques are highly desirable. Furthermore, with modern imaging sensors producing images at higher and higher data rates, real-time human interpretation is virtually impossible.

Many weapon systems would benefit enormously if the weapon seeker had a good autonomous classification ability. For example, anti-armour and anti-ship systems could be programmed to preferentially select high priority targets, without prompting from an external source. Even if the target has been pre-selected using a sensor on the launch platform, the configuration of the targets may have changed during weapon fly-out, thus requiring some form of classification ability in the weapon seeker. A good autonomous classifier on weapon seekers would improve the ability of the weapon to engage the targets designated at launch, while minimising collateral damage.

The ATR scenarios investigated within this thesis concentrate on ATR from radar measurements, reflecting the dominance of radar as an all-weather military sensor across a wide range of military land, sea and air applications.
1.4 Identification Friend or Foe
In the era of joint and coalition operations the requirement for accurate target identification is clear, if one is to avoid fratricide and civilian casualties. Indeed, a recent report by the National Audit Office [29] stresses the importance of combat identification. Although cooperative Identification Friend or Foe (IFF) systems can be used to reduce the risk of fratricide, they do not provide a full solution to the overall aim of reducing both fratricide and civilian casualties. For example, civilian vehicles will not respond to a cooperative IFF system, but should not be treated as hostile. Similarly, equipping all forces in a coalition operation with the same cooperative IFF system is difficult (for both logistical and security reasons). For these reasons, an ATR system providing IFF functionality, possibly working in tandem with a cooperative system, would be of immense benefit.

An ATR system can be designed to provide IFF functionality by classifying objects into the target classes of the ATR system, and then determining friend or foe based on the outputs of the ATR system. There are a number of operational issues that would need to be addressed with such an approach, which are not covered in this thesis. These include the degree to which the target classes of the ATR system should be broken down. If the target classes are too broad (e.g. all enemy fighter aircraft incorporated into a single target class), it might be hard to design an efficient classifier. If the target classes are too specific (e.g. similar enemy models counted as separate targets), then training an ATR system to discriminate between those targets is suboptimal if the end use is to be an IFF system.
If the ATR system provides a confidence score for the classification outputs, there is the issue of whether to assign friend or foe from the most likely potential target, or to aggregate the confidence scores of all friendly target classes and compare with the aggregated score of all enemy target classes. A further approach to non-cooperative IFF, not addressed in this thesis, is through target kinematic behaviour, perhaps through a joint approach to tracking and identification.
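The two decision rules mentioned above (declare from the single most likely class, or compare aggregated friendly and enemy scores) can be contrasted on a small numerical example. The class names and posterior values here are purely illustrative, not from the thesis:

```python
# Hypothetical class posteriors from an ATR system; names are illustrative.
posteriors = {"friendly_tank": 0.30, "friendly_apc": 0.25,
              "enemy_tank": 0.35, "enemy_apc": 0.10}

def iff_top_class(posteriors):
    # Rule 1: declare according to the single most likely target class.
    best = max(posteriors, key=posteriors.get)
    return "friend" if best.startswith("friendly") else "foe"

def iff_aggregate(posteriors):
    # Rule 2: compare the total posterior mass of friendly vs enemy classes.
    friend = sum(p for c, p in posteriors.items() if c.startswith("friendly"))
    return "friend" if friend > 1 - friend else "foe"
```

The example is chosen so that the two rules disagree: the single most likely class is an enemy one (0.35), yet the friendly classes together carry 0.55 of the posterior mass, illustrating why the choice of rule is an operational issue.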
1.5 Requirement for robustness
For an ATR system to be deployed successfully in military operations it must be robust to all the conditions that are likely to be faced. For a single sensor imaging a scene these include:

1. Observation angle variations (which relate to robustness to 3-D rotation).

2. Scale of the target (i.e. the distance from the sensor to the target, together with the sensor resolution).

3. Articulation (e.g. the rotation of a tank's turret with respect to its hull, or reconfiguration of a swing-wing aircraft's wings between supersonic and low-altitude configurations).

4. Partial occlusion (possibly by other targets as well as background objects).

5. Camouflage, Concealment and Deception (CC&D) measures.

6. Changes in target equipment fit (e.g. the presence/absence of missiles).
7. Target configuration (e.g. the status of a tracking radar, or missile launcher).

8. Target operational status (e.g. running engines).

9. Variation between alternative models of a target (e.g. it might not be possible to have access to an enemy's newest equipment, so an out-of-date model needs to generalise reasonably well).

10. Environmental conditions (visible sensors are well known to be affected by light conditions and fog; even all-weather sensors such as radar might provide different target signatures in different weather conditions).

In many ATR research papers, the operating conditions stated above are referred to as "extended operating conditions" (EOCs) [128], while in the neural network community the problem is regarded as one of complex transformation robustness. In the statistical pattern recognition literature, a classifier's robustness (or lack of it) to the conditions documented above is often referred to as the "generalisation properties" of the classifier. However, in some pattern recognition papers the term generalisation is also used to refer to the ability of a classifier to cope with different noise realisations on the same underlying data. This is a considerably easier problem than complex transformation robustness, and can often be solved by penalising a classifier against overtraining on the training data. Thus, care is needed when interpreting the merits of classification techniques that claim to have good generalisation properties.
1.6 Related work
The fact that the conditions during training of a classifier are not necessarily the same as those that will occur during its application is starting to gain wider attention in the pattern recognition community. Examples of how changes in the population to be classified (referred to as population drift) affect classification performance are provided by Kelly, Hand and Adams [88]. Many of the EOCs listed in Section 1.5 (such as varying equipment fit) are types of population drift. Kelly et al [88] present the problem of predicting the credit worthiness of bank customers on the basis of answers provided on application forms. Since the population of applicants requesting credit will change in accordance with the prevailing economic conditions, population drift is a major issue for such a problem. The proposed solution involves adapting the classification rule in accordance with a model which evolves over time, as the populations change. This is shown to lead to improved classification performance. A major issue in translating this approach to ATR applications would lie in specifying the form of the dynamic model.

Kelly and Hand [87] consider changes in class definitions between training and operation of a classifier. For ATR problems this situation might arise if configuration and equipment changes lead to a target being reclassified to a different target type on the basis of its functionality. The problem considered in [87] was one of credit worthiness of bank customers, with the definitions of "good" and "bad" customers changing according to economic conditions and commercial considerations.
Rather than retraining classifiers from scratch according to the new class definitions (which is time-consuming), for two-class problems an approach was proposed which estimated the probability distribution function of a variable which was partitioned to discriminate between the classes. The location of the partition was not used until each classification was to be made, thus allowing changes in the class definitions. The challenge in applying this approach to ATR problems would lie in selecting the partitioning variable to be modelled.

Hand [65] conducts a detailed examination of the differences between idealised classification problems and those encountered in real applications. Issues of population drift, class labelling errors, arbitrariness in class definitions and differing measures of performance are considered. It is demonstrated that when these EOCs occur, simple classification rules tend to perform at least as well as (and often better than) more sophisticated classifiers. This is despite the fact that the sophisticated classifiers have often been shown to provide better performance in the idealised situation where the test data conditions are the same as those for the training data. As an explanation for this effect, it is argued that sophisticated classifiers tend to model details specific to the design conditions of the classifier (in effect a form of overfitting), which are not relevant to the real problem in which test conditions differ from training conditions. For these reasons Hand argues that specific solutions need to be developed, taking into account the particular aspects and EOCs of each problem.
1.7 Bayesian motivation
A Bayesian approach has been adopted throughout this thesis. The main motivation behind a Bayesian approach [10, 98] for robust ATR algorithms lies in the ability of Bayesian statistics to handle limited and possibly conflicting pieces of information in a fully consistent manner. In particular, Bayesian statistics provides a consistent mechanism for manipulating probabilities assigned to observed data. There are a number of different sub-problems tackled within this thesis, all geared towards robust ATR algorithms. The use of a Bayesian formalism allows the different approaches to be combined in a consistent manner. Further generic arguments in favour of Bayesian techniques include the ability to incorporate additional prior information, perhaps elicited from expert knowledge, and the production of confidence intervals and other statistics for the estimated parameters.
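The consistent mechanism referred to here is Bayes' theorem. For a classification problem it combines the class-conditional measurement densities with prior class probabilities to give the posterior probability of each class $c$ given a measurement $x$:

$$P(c \mid x) \;=\; \frac{p(x \mid c)\,P(c)}{\sum_{c'} p(x \mid c')\,P(c')},$$

so that prior knowledge (through $P(c)$) and sensor evidence (through $p(x \mid c)$) enter on an equal, probabilistically consistent footing.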
1.8 Outline
The outline of this thesis is as follows. Chapter 2 provides a brief introduction to statistical pattern recognition, and reviews some of the ATR literature. Chapter 3 formulates a Bayesian Gaussian mixture model approach to classification. Chapter 4 extends the Bayesian Gaussian mixture model approach of Chapter 3, by adding hyperprior distributions to the prior parameters of the mixture models, in an attempt to improve robustness to changes in target configuration, given small amounts of additional information. Based upon a physical consideration of radar returns, Chapter 5 proposes a Bayesian gamma mixture model approach to classification. In Chapter 6 a Bayesian inverse model
based approach to generalising target classification across sensors is formulated. Chapter 7 considers the use of Bayesian networks for incorporating contextual information and domain specific knowledge into an ATR system. Chapter 8 describes a Bayesian framework for utilising measurements from two sensors, separated in time, to classify relocatable targets in a scene. Conclusions and recommendations are provided in Chapters 9 and 10 respectively.
1.9 Publications
The Bayesian Gaussian mixture model work presented in Chapter 3 formed the basis of an oral paper [31] at the 2000 International Association of Pattern Recognition (IAPR) Statistical Pattern Recognition (SPR) workshop, held in Alicante, Spain. The ideas behind the hyperprior work of Chapter 4 have appeared as a poster presentation at the Mixtures 2001 conference, held in Hamburg, Germany. The Bayesian gamma mixture model work of Chapter 5 has been published [34] in the IEEE Transactions on Aerospace and Electronic Systems journal, and has been presented [32] at the 2001 Computational Intelligence: Methods and Applications (CIMA2001) conference held in Bangor, Wales. A paper based on the Bayesian approach to generalising target classification between radar sensors (Chapter 6) [35] has been accepted for oral presentation at the 2004 IAPR SPR workshop, to be held in Lisbon, Portugal. The work of Chapter 7, incorporating domain specific knowledge using Bayesian networks, has been presented as a poster paper [33] at the 2002 IAPR SPR workshop, held in Windsor, Canada.
CHAPTER 2

LITERATURE SURVEY OF ATR PAPERS

2.1 Introduction
This chapter provides a literature survey on the research state-of-the-art in ATR, with special emphasis on the robustness (generalisation abilities) of the techniques. To put the review into perspective, Section 2.2 provides a brief description of statistical pattern recognition techniques. The ATR literature review itself is provided in Section 2.3.
2.2 Generic pattern recognition techniques
2.2.1 Introduction

The field of pattern recognition has developed significantly since the 1960s, and a full review of pattern recognition techniques is beyond the scope of this section. The book by Webb [189] provides an introduction to statistical pattern recognition theory and techniques, along with numerous references and pointers for further study. Most of the techniques introduced in Webb's book [189] could be applied to the target recognition problem and indeed, many have been. However, despite the vast array of research into pattern recognition techniques, the issue of robustness to EOCs remains an open problem. In particular, although all manner of techniques have been used to develop classification systems for specific training data, these techniques are generally not robust to transformations of the objects (targets) in the data. In this section the basic categories of pattern recognition techniques (as seen from a statistical pattern recognition viewpoint) are outlined.
2.2.2 Generic pattern recognition system Figure 2.1 displays a generic pattern recognition system. Initially the sensor measurements (raw data) are fed through to a pre-processor stage. The pre-processor can range from leaving the raw data unaltered through to deriving complicated features from the data (feature extraction). The output of the pre-processor is the set of features to feed into the classifier itself. The output of the classifier is the class of the raw data pattern being examined. Many pattern recognition systems combine the pre-processor and classifier into a single stage.
[Figure 2.1 layout: sensor → raw data → pre-processor → features → classifier → decision.]
Figure 2.1: Pattern recognition system. The separation of a pattern recognition system into a pre-processing feature extraction stage and the actual classification stage highlights one of the difficulties when comparing approaches. Even when common datasets are being used to assess techniques, do performance differences arise from more appropriate pre-processing (i.e. feature extraction), or from a more efficient classifier? Generally, the pattern recognition system is determined during a training phase, during which training data is used to optimise the architecture and free parameters of the system. The training data will consist of examples of sensor measurements from the possible classes of target. It is here that one of the issues of transformation robustness appears. In particular, if the pattern recognition system is to be robust to complex transformations, do all possible combinations of these transformations have to appear in the training data? It is clearly not feasible to collect enough training data to provide exemplars of all possible combinations of the transformations, so approaches that are robust (e.g. designed to become invariant) to transformations of the data are invaluable. The next sub-section examines the classifier stage in more detail. The feature extraction stage is not covered in this overview. The reason is that although some feature extraction techniques, such as principal components analysis (PCA) [12, 26, 189], are generic across measurements, many are very application specific and not appropriate to a general overview. Few of the standard classification techniques discussed in the next sub-section have procedures to increase the robustness to complex transformations. Thus, the feature extraction stage is likely to be the most vital if the overall system is to have any degree of complex transformation robustness. 
Fourier transform [17] pre-processing can be used to provide invariance with respect to global translations of the whole image, and various other features can be defined to provide other degrees of global invariance (e.g. the Boyce and Hossack invariant moments [16] have some degree of translation, scale and intensity invariance). However, their use tends to lead to a rather ad-hoc approach to pattern recognition. In particular, many pattern recognition systems in the literature tend to consist of the selection of various features from a “bag of features” (some of which are invariant to some transformations), followed by the selection of a classifier from a “bag of classifiers”, with both selections occurring without consideration to the data at hand.
2.2.3 Classification techniques A taxonomy of classification techniques (taken from the book by Webb [189]) is provided in Figure 2.2. The first division is into supervised and unsupervised techniques. Supervised techniques make use of the known classes of training data targets to construct the classifier, whereas unsupervised techniques are trained using the sensor measurements only. Thus unsupervised techniques can only cluster the training data into “like” groups, which may or may not correspond to the actual classes of the training data. This can be an advantage, since the lack of constraint from supervisory information may free an unsupervised technique so that it is able to discover important statistical structure that does not correspond directly with a supervisor’s prejudices as to how the data should
be encoded, thus allowing potentially more powerful processing functions to emerge from training. Additionally, techniques developed in the unsupervised field can often be used to discover important statistical structure and encode it in such a way as to allow a less powerful supervised “front end” to present that structure in the format that the supervisor requires. Many supervised techniques are related to unsupervised techniques in this manner.
[Figure 2.2 layout: classification divides into supervised and unsupervised; supervised divides into density estimation (parametric, non-parametric) and discriminant analysis (linear, non-linear); unsupervised leads to clustering.]
Figure 2.2: Taxonomy of statistical classification techniques. Supervised techniques can be divided into those that estimate the class-conditional probability densities of the data (where data can refer to the raw sensor measurements or extracted features), from which classifications can be made using Bayes’ theorem, and those that attempt to construct a decision boundary directly (discriminant analysis).
2.2.4 Density estimation approaches Approaches based on density estimation require modelling of the distribution of the data (sensor measurements or extracted features) for each possible target class. In pattern recognition terminology these distributions are known as the class-conditional probability densities (likelihoods) p(x|j), where x represents the sensor data and j the class. For a particular class j, these are the probability distributions for generating each possible sensor datum x. Given the prior class probabilities p(j), Bayes’ theorem can be used to obtain the posterior probabilities of class membership given the data measurement:

p(j|x) ∝ p(x|j)p(j),   (2.1)

where the constant of proportionality is such that Σ_{j=1}^{J} p(j|x) = 1, with J being the total number of classes. The difficulty lies in knowing or deciding how to specify the form of the class-conditional densities.
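As a concrete illustration of equation (2.1), the posterior calculation can be sketched in Python for a two-class problem with univariate Gaussian class-conditional densities. The means, variances and priors below are illustrative values only, not drawn from any dataset discussed in this thesis.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Univariate Gaussian class-conditional density p(x | j)."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def posterior(x, means, variances, priors):
    """Posterior probabilities p(j|x) ∝ p(x|j) p(j), normalised to sum to one."""
    likelihoods = np.array([gaussian_pdf(x, m, v)
                            for m, v in zip(means, variances)])
    unnorm = likelihoods * np.array(priors)
    return unnorm / unnorm.sum()

# Two classes with equal priors; the measurement lies near class 2's mean.
p = posterior(x=1.8, means=[0.0, 2.0], variances=[1.0, 1.0], priors=[0.5, 0.5])
print(p)  # p[1] > p[0]: the measurement favours the second class
```

The same two-line recipe (multiply likelihoods by priors, normalise) applies unchanged however the class-conditional densities are modelled, which is what makes the density estimation route so flexible.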
Parametric techniques assume a parametric form for the class-conditional densities, with the parameters being estimated during the training phase. How to “best” estimate the parameters is of course a further issue. An example of a parametric technique is the Gaussian classifier, in which the likelihoods are assumed to be Gaussian distributions. Also included in this category are model-based classifiers with noise superimposed. As an example, a model may be proposed for the sensor measurements that might be received from a particular type of target. If a noise model (e.g. Gaussian) is added on top of the ideal sensor measurements then a likelihood for the sensor measurements can be written down, and Bayes’ theorem can then be used to infer the posterior probabilities of class membership. Non-parametric techniques do not impose a parametric form for the densities. Instead they construct the required densities solely from the training data. The popular k-nearest neighbour classifier can be formulated in this manner. In the k-nearest neighbour classifier, for each measurement to be classified we determine its k nearest training data measurements (using an appropriately defined metric) and then assign it to the class with the most representatives within that set of k vectors. However, although its formulation lies within the non-parametric approach to density estimation, many practitioners of the k-nearest neighbour classifier would be unaware of this. The nearest neighbour classifier is a special case of the k-nearest neighbour classifier in which k = 1. Further examples of non-parametric density estimation techniques include histogram approaches, expansion by basis functions and kernel-based methods (details in Webb’s book [189]). The Parzen estimator is a particular form of kernel-based density estimation.
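The k-nearest neighbour rule described above can be sketched in a few lines. The Euclidean metric and the toy training vectors are illustrative assumptions; in practice the metric would be chosen to suit the sensor data.

```python
import numpy as np
from collections import Counter

def knn_classify(x, train_X, train_y, k=3):
    """Assign x to the majority class among its k nearest training vectors."""
    dists = np.linalg.norm(train_X - x, axis=1)   # Euclidean distance to each training vector
    nearest = np.argsort(dists)[:k]               # indices of the k nearest
    votes = Counter(train_y[i] for i in nearest)  # class counts among the neighbours
    return votes.most_common(1)[0][0]             # majority class

# Two toy classes, two training vectors each.
train_X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [1.1, 0.9]])
train_y = np.array([0, 0, 1, 1])
print(knn_classify(np.array([0.9, 1.0]), train_X, train_y, k=3))  # → 1
```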
The density estimates are obtained by calculating the Euclidean distance between the point of interest and all the training data vectors, and then averaging the results (subject to a normalisation constant) of applying a kernel function (e.g. a Gaussian) to each of the calculated distances.
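A minimal sketch of the Parzen estimator with a Gaussian kernel follows; the bandwidth h and the synthetic training sample are illustrative choices, and in practice the bandwidth would be tuned to the data.

```python
import numpy as np

def parzen_density(x, train_X, h=0.5):
    """Parzen density estimate at x: average of a Gaussian kernel applied to
    the distance from x to every training vector."""
    d = train_X.shape[1]
    dists_sq = np.sum((train_X - x) ** 2, axis=1)  # squared Euclidean distances
    kernels = np.exp(-dists_sq / (2.0 * h ** 2))   # Gaussian kernel per training vector
    norm = (2.0 * np.pi * h ** 2) ** (d / 2.0)     # kernel normalising constant
    return kernels.mean() / norm                   # average over the training data

# Synthetic 2-D training sample from a standard normal.
train_X = np.random.default_rng(0).normal(size=(500, 2))
print(parzen_density(np.zeros(2), train_X))      # high density near the data mean
print(parzen_density(np.full(2, 4.0), train_X))  # near zero far from the data
```

With one kernel per training vector, the cost of each query grows linearly with the training set size, which is why the modifications to the Parzen estimator discussed later in this chapter were proposed.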
2.2.5 Motivation for density estimation approach Estimating the posterior probabilities of class membership offers a number of advantages over classifiers that produce class membership decisions only. These advantages include giving a measure of confidence for our class predictions and the opportunity to combine the probabilities with additional information, such as domain knowledge, intelligence reports and contextual information (see Chapter 7). As well as being beneficial for ATR, this is advantageous for other classification applications, such as medical image screening and medical prognosis. Since the density estimation approach produces posterior probabilities, a density estimation classifier can be incorporated readily into Bayesian frameworks for manipulating and combining the probabilities. For example, the information can be fused with other sensor data, or combined with information derived from previous sensors (see Chapter 8). Furthermore, since after classifying a pattern a user needs to decide upon a course of action, the probabilities can be incorporated into a multilevel model that reflects the whole decision making process. For instance, by considering the expected posterior loss of decisions, the different costs involved (and changes in those costs) in making a classification can be taken into account.
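The expected posterior loss argument can be illustrated with a small sketch. The posterior vector and loss matrix below are hypothetical values chosen for illustration; the point is that an asymmetric loss can overturn the maximum-posterior decision.

```python
import numpy as np

# Posterior class probabilities p(j|x) for two classes, e.g. from a
# density estimation classifier (illustrative values).
posterior = np.array([0.7, 0.3])

# Loss matrix: rows are the true class, columns the chosen action.
# Deciding class 0 when the truth is class 1 is made ten times costlier,
# as might happen if class 1 were a hostile target.
loss = np.array([[0.0, 1.0],
                 [10.0, 0.0]])

expected_loss = posterior @ loss         # expected loss of each action
decision = int(np.argmin(expected_loss)) # minimum expected posterior loss
print(expected_loss, decision)           # action 1 wins although class 0 is more probable
```

Here the expected losses are [3.0, 0.7], so the minimum-loss action is class 1 even though class 0 has the larger posterior probability, which is exactly the behaviour a hard decision-only classifier cannot provide.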
2.2.6 Discriminant function approaches The term “discriminant analysis” covers those classification techniques that estimate the decision boundaries directly, without passing through the intermediate stage of estimating the class-conditional probability densities. The approaches can be divided into linear and non-linear techniques. Linear approaches to discriminant analysis build up discriminant functions that depend linearly on the features selected. As an example consider the two class problem with linear discriminant function:

g(x) = w^T x + w_0,   (2.2)

where w is a weight vector parameter, and w_0 is a threshold parameter. The discriminant rule may be, for example, such that we assign a measurement x to class 1 if g(x) > 0, and to class 2 if g(x) < 0. The extension to multiple classes j = 1, . . . , J defines a weight vector w_j and bias w_{0,j} for each class, and then assigns to the class that maximises:

g_j(x) = w_j^T x + w_{0,j}.   (2.3)
The training phase involves the determination of the weight vectors and thresholds, which in turn involves specification of an appropriate error criterion. Examples include Fisher’s linear discriminant, matched filters, and a technique known as linear discriminant analysis (LDA). Further details are given by Webb [189]. Although the application of the classifier might be linear, computations carried out in the training phase are often highly non-linear, with respect to both the parameters and the data vectors. Next, the possibility of multiple linear discriminant functions per class is considered. Template matching techniques set the weight vectors in (2.3) to be the (appropriately normalised) training data measurements, and the biases to zero. The training data measurements are referred to as the templates. There is a separate linear discriminant function for each training data template, and the object being classified is assigned to the class of the template that provides the largest value for the linear discriminant function. Global translation invariance can be incorporated, by shifting the templates to all possible positions relative to the object being classified. When these translations are implemented via the Fourier transform correlation theorem [17], the resulting classifier is referred to as a correlation-filter. In this case the actual implementation is non-linear, even though the underlying approach is linear. An extension to the use of the correlation-filter uses synthetic discriminant functions rather than training data examples for the templates. The forms of the synthetic discriminant functions would be determined using an appropriate training procedure. For the purposes of this literature survey, classifiers based on the minimum distance to class prototypes are counted as linear, since if the Euclidean distance metric is used they can be formulated as scalar products. 
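The multi-class rule of equation (2.3), and its template-matching special case, can be sketched as follows. All vectors here are illustrative; setting the weight vectors to normalised training examples and the biases to zero recovers the template-matching scheme described above.

```python
import numpy as np

def linear_discriminant_classify(x, W, w0):
    """Evaluate g_j(x) = w_j^T x + w_0j for every class j and return the
    index of the maximising class (equation 2.3)."""
    scores = W @ x + w0          # all class scores in one matrix product
    return int(np.argmax(scores))

# Template-matching special case: one normalised training vector
# ("template") per class, and zero biases.
templates = np.array([[1.0, 0.0],
                      [0.0, 1.0]])
templates /= np.linalg.norm(templates, axis=1, keepdims=True)

x = np.array([0.2, 0.9])
print(linear_discriminant_classify(x, templates, w0=np.zeros(2)))  # → 1
```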
However, note that if the prototypes are the original training data measurements, we have the nearest neighbour classifier, which, as a special case of the k-nearest neighbour classifier, is treated as a density estimation based scheme. Typically, within classes, the prototypes will be determined using an unsupervised clustering algorithm, such as k-means or Vector Quantisation (VQ) (which is in effect a form of k-means clustering) [189]. An alternative approach for generating the prototypes is model-based clustering, in which a sensor model is used to generate images of the
targets such as would be expected under the assumptions of the sensor model. A form of nonlinear discriminant function is the generalised linear discriminant function, in which discriminant functions of the form:

g_j(x) = Σ_{i=1}^{M} w_{j,i} φ_i(x, µ_i) + w_{j,0},   (2.4)
are defined for each class, j = 1, . . . , J. A data vector x is assigned to the class that maximises g_j(x). The φ_i are the (nonlinear) basis functions. The free parameters are the weights w_{j,i}, biases w_{j,0} and basis function parameters µ_i, all of which have to be determined during a training phase. The radial basis function (RBF) neural network [12] is a particular example of a generalised linear discriminant function, in which the φ_i are radially symmetric functions (i.e. functions of |x − µ_i|). Support Vector Machines (SVMs) [36] are another example of nonlinear discriminant functions. SVMs use a kernel function to implement an inner product in a higher dimensional space. The idea behind this is that the data measurements can be projected into a high-dimensional feature space, in which they can be linearly separated using the “maximal margin” hyperplane. In a two-class problem, the maximal margin hyperplane maximises the distance between the separating hyperplane and the closest measurements from the two classes. Since the late nineties, SVMs have been receiving much attention in the pattern recognition literature. However, by focusing completely on the decision boundary (at the expense of other areas) it is hard to see how SVMs can be made to be robust to transformations of the data. The multilayer perceptron (MLP) neural network [12, 189] is a ubiquitous technique for nonlinear discrimination. Typically trained by back-propagation (gradient descent of a sum-of-squares supervised error function) the MLP has been used on all varieties of pattern recognition problem. Unfortunately, the standard MLP can only become transformation robust in limited ways, through “brute force” approaches such as using massive amounts of training data to learn examples of the data under all possible combinations of transformations, or by hard-wiring pre-determined transformation invariances into the MLP architecture (by means of “weight sharing” [127, 164, 165]).
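Equation (2.4) with Gaussian radial basis functions can be sketched as below. The centres, width and weights are fixed illustrative values; in an RBF network they would be fitted during the training phase.

```python
import numpy as np

def rbf_discriminants(x, centres, width, W, w0):
    """Generalised linear discriminants (equation 2.4) with Gaussian RBF
    basis functions phi_i(x, mu_i), i = 1..M. Returns g_j(x) for each class."""
    dists_sq = np.sum((centres - x) ** 2, axis=1)  # |x - mu_i|^2 for each centre
    phi = np.exp(-dists_sq / (2.0 * width ** 2))   # radially symmetric basis functions
    return W @ phi + w0                            # linear combination per class

centres = np.array([[0.0, 0.0],     # M = 2 basis function centres mu_i
                    [2.0, 2.0]])
W = np.array([[1.0, 0.0],           # class 0 weights w_{0,i}
              [0.0, 1.0]])          # class 1 weights w_{1,i}

g = rbf_discriminants(np.array([1.9, 2.1]), centres, width=1.0, W=W, w0=np.zeros(2))
print(int(np.argmax(g)))  # → 1: the vector lies close to the second centre
```

The discriminant is nonlinear in x through the basis functions φ_i, but linear in the weights, which is what the name "generalised linear" refers to.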
However, even these tend to be ignored in the ATR research literature. Further nonlinear discriminant techniques are based on classification or decision trees, which construct a complex classification rule from a cascade of simpler rules. Examples of classification trees include the classification and regression tree (CART) model and the multivariate adaptive regression splines (MARS) model. Details are in Webb’s book [189]. In many practical applications a linear discriminant function is only applied to data that have undergone complicated pre-processing and feature extraction procedures. In such cases the actual overall classification system may well be considerably more nonlinear than many nonlinear discriminant function approaches. This highlights the conceptual difficulty in separating a classification system into a feature extraction stage and a classifier (Figure 2.1).
2.3 Research state-of-the-art in ATR systems
2.3.1 Introduction In this section some of the literature for ATR systems is reviewed, and the techniques are related to the taxonomy of Section 2.2. The majority of the papers surveyed concern ATR based on high range resolution radar profiles (HRRPs) or synthetic aperture radar (SAR) images. This reflects the dominance of radar as an all-weather sensor for surveillance and weapons, in use across a wide range of military land, sea, and air applications.
2.3.2 Density estimation approaches Parametric techniques A multivariate Gaussian classifier was used by Olson and Ybarra [144] on features from synthetic SAR images of vehicles. The features were selected from a set of ad-hoc features (Olson and Ybarra’s own description). Unless the observation angle was specified within a small range, the performance was poor. The approach by Zyweck and Bogner [200] applied a linear discriminant analysis dimensionality-reduction technique to HRRPs of commercial aircraft, followed by a Gaussian classifier. The issue of transformation robustness was not addressed. Stewart et al. [173] compared a Gaussian classifier to a synthetic discriminant function correlation-filter classifier using synthetic radar range profiles (RRPs) of simple targets. The data were divided into aspect angle subclasses. The Gaussian classifier performed poorly if there were only a small number of aspect angle subclasses, but was more robust than the correlation-filter in the presence of white Gaussian noise. Dubuisson Jolly et al. [82] implemented a deformable template model with Gaussian noise for vehicle segmentation and classification from visible image sequences of roads. Experiments to assess the robustness to fog and low-light conditions were conducted, with the technique performing well in fog, but breaking down at dusk. Jouny [84] used wavelet decomposition techniques to extract features from radar range profiles. The extracted features were modelled as having a Gaussian mixture distribution with the mixture components determined by manual separation of the training data into angle bins. A similar approach was adopted as a baseline classifier by Stewart et al. [174]. Neither explored the issues of generalisation and robustness. Kuttikkad [93] investigated a number of different statistical distributions (multivariate complex Gaussian, Weibull, K distribution, etc.) for modelling SAR data.
These were used for constant false alarm rate (CFAR) target detection, and pixel classification, but it is hard to see how such simple distributions can lead to robust ATR. Denton and Britton [39] considered the identification of mobile battlefield targets using inverse synthetic aperture radar (ISAR) imagery. Within 5◦ observation angle windows they modelled
the class-conditional probability densities of the radar images as products of independent gamma distributions. The issue of transformation robustness was investigated by assessing the classification performance on vehicles at different depression angles to the training data, and by using test vehicles of different types from the training data vehicles (but within the same generic class). The classification performance was found to degrade significantly in the latter case. Jahangir et al. [80, 79] describe the use of Hidden Markov Models (HMMs) to carry out recognition between three classes of target (personnel, tracked vehicles and wheeled vehicles), using short sequences of radar returns. The approach exploited the time varying nature of radar Doppler data in a manner similar to techniques used in speech recognition (albeit with a modified topology) to distinguish targets with different Doppler characteristics. The algorithm was trained and tested on real radar signatures of multiple examples of moving targets from each class. The performance was shown to be robust to target speed and orientation. However, a correction to the paper [79] hints that a training error may have biased the results towards the specific training and test datasets being used. Hummel [75] describes a model-based approach to ATR of vehicles in SAR images. The approach proposed was the outcome of the Moving and Stationary Target Acquisition and Recognition (MSTAR) programme, which was a five-year project of the U.S. Defence Advanced Research Projects Agency (DARPA) and the U.S. Air Force Research Laboratory (AFRL). The MSTAR programme was specifically tasked to investigate ATR using SAR imagery. To facilitate the research a number of publicly available datasets were created, allowing researchers to investigate techniques using common datasets.
Unfortunately, the publicly available datasets were not divided into suggested training and test data, with the end result that most researchers used different training and test datasets. However, the programme provided a good source of SAR imagery, with various EOCs and complex transformations of the targets. One of the main outcomes of the MSTAR project was a tool for generating synthetic SAR images from parameterised target models. The particular MSTAR system described by Hummel [75] was divided into three stages. The first was a “Focus of Attention Module” in which “regions of interest” (ROIs) were selected. This was done using a CFAR algorithm, together with various unspecified clutter rejection procedures (based on features within the regions identified by the CFAR algorithm). The next stage was an “indexer” for each ROI. This produced a set of hypotheses for the target type within each ROI. The “indexer” stage was operated using a template/prototype-based discriminant function classifier (minimum distance based), which compared the ROIs with training data examples. The comparison was based both on the raw data values and features extracted from the data (the zero-crossings of the “Laplacian of Gaussian” of the imagery magnitude data, and the ridges in the magnitude data). The target hypotheses passing a threshold were then passed to the final classification procedure, known as the “Predict, Extract, Match, and Search” (PEMS) stage. The PEMS stage consisted of the Predictor module, the Matching module and the Search module. The Predictor module took parameterised computer-aided design (CAD) models of targets and generated synthetic SAR images. The Matching module calculated the probability density (likelihoods) of the features of the ROI given a proposed target model, using a “diffusive scattering model” based on the Poisson distribution. 
The Search module was used to refine the parameters of the CAD models of targets based on the values of the Matching module likelihoods. After iterating the PEMS process the search module was used to classify the ROI according to Bayes’ rule. No classification results were provided, although a number
of claims were made about the technique. Under standard operating conditions and some EOCs the technique was said to perform well. EOCs for which the technique was reported to perform well were some articulations (e.g. rotations of a tank’s turret), some configuration changes (e.g. fuel barrels attached to a tank), and vehicles with different serial numbers. Obscuration by walls and closely spaced targets were reported to be handled unsatisfactorily. Chiang et al. [28] also used a model-based approach to ATR of vehicles using SAR images. CAD models of targets were used to predict the features of candidate targets. The likelihood of the measured features for each hypothesis was then calculated by combining the measured features with the predicted features under a Bayesian formalism (taking into account uncertainties in the features, and the possibility of missing features). In their example, the parameters of the target scattering centres were used as features. Lee [99] investigated ATR of commercial vehicles using visible imagery. The probability density for each class was calculated by combining the results from various component detectors (e.g. detectors for the number-plate and the right and left headlights) using a multivariate Gaussian distribution, while taking into account the probabilities of component detection.
Non-Parametric techniques The majority of non-parametric density estimation techniques in the ATR literature make use of the k-nearest neighbour (or more specifically the nearest neighbour) classifier. Botha [14, 15] applied the nearest neighbour classifier to ATR of model aircraft using ISAR images. The nearest neighbour classifier was applied separately to both the raw images, and features extracted from the raw images. Using the raw images gave better performance (although slightly worse than the performance obtained using an MLP neural network on the raw images). Further nearest neighbour classifier approaches have been explored [47, 174, 184, 193, 194]. Qiang [47] used nearest neighbour classification on feature vectors from radar returns. Two types of feature extraction were considered. The first was based on the time domain, and the second was based on multiple transformations of the data (in particular a Fourier transform followed by a Mellin transform [190], with the Mellin transform providing scale invariance). Stewart et al. [174] compared a nearest neighbour classifier to clustering based classifiers for recognition of civilian vehicles from HRRPs. However, the training conditions for the nearest neighbour classifier were different to those of the other classifiers, making the performance comparison unreliable. Van der Heiden and Groen [184] applied the nearest neighbour classifier to HRRPs of military aircraft. The nearest neighbour classifier was found to outperform an RBF neural network classifier in terms of correct classification, but at the expense of computational speed. A condensed nearest neighbour approach (in which the number of training data vectors used in the nearest neighbour search was reduced intelligently) was found to give a better trade-off between speed and correct classification. The classification/speed trade-off illustrates the difficulty in determining a state-of-the-art ATR system. Xiao et al. 
[193] applied a nearest neighbour classifier to the scale invariant features provided by the Mellin transform of radar returns from aircraft. The robustness of the technique to rotation of each aircraft was found to be poor. Xiao et al. [194] also applied the nearest neighbour classifier to polarisation features obtained from radar returns of aircraft. Ulug et al. [182] found that a nearest neighbour classifier
was inferior to an RBF neural network when applied to the recognition of civilian vehicles using features extracted from synthetic SAR imagery. The specific features used were scattering centre parameters, extracted from the SAR imagery using a 2-dimensional Prony model [161]. Rodriguez [159] addressed the computational complexity of the k-nearest neighbour classifier by implementing it in a hierarchical fashion, with test data images being passed down the hierarchy depending on the classification certainty. The technique was applied to handwritten character recognition, and could be adapted to ATR from visible images. Katz and Thrift [86] used a Parzen estimator for the class-conditional densities. The Parzen estimator places a kernel function on each training data vector, and is therefore computationally expensive. For this reason two modifications were proposed. The first trained an MLP neural network to learn training data estimates from a Parzen estimator. The second used Infomax learning (based on Shannon information theory) to reduce the number of training points in the Parzen estimator. The resulting classifiers were applied to the detection of targets in clutter. To allow for robustness to diverse clutter environments the estimators were used for the target class only, with a threshold on the single value being used to declare target or clutter. The presented results did not include a full receiver operating characteristic (ROC) curve of detection probability against false alarm rate, making interpretation of the results difficult. Kim [90] used a Gram-Charlier series expansion known as a Gram-Charlier Neural Network (GCNN) to estimate the class conditional densities. In addition, a Generalised Probabilistic Neural Network (GPNN) was created using a combination of the Gram-Charlier series expansion and a Parzen estimator. The techniques were applied to an artificial problem of radar signal detection in noise. 
Both the GCNN and GPNN outperformed a standard MLP neural network. Although different types of noise in the signal were considered, they were treated separately, so no assessment of the robustness of the techniques was provided. Hilbert [69] used a combination of a Bayesian and neural network classifier for ATR of SAR imagery. A neural network was used to implement a Parzen estimate of the class conditional probability densities. A site map was used to provide prior probabilities for the targets, based on their locations (e.g. the prior probabilities of aircraft were taken to be higher on runways).
2.3.3 Discriminant function approaches Linear techniques As mentioned in Section 2.2.6, one of the difficulties that arises when separating discriminant function approaches into linear and non-linear techniques is taking into account the effect of the feature extraction/reduction techniques. In particular, a highly non-linear feature extraction technique might be applied prior to a linear classifier. Generally, in this description, whether such a technique is counted as linear or nonlinear depends on the form of the feature extraction. Where, for example, a neural network has been used for feature extraction, the technique is counted as nonlinear even though the end classifier might be linear in the extracted features. Many of the linear discriminant function techniques are correlation-filters, which correlate a
test image with reference images, and pick out the correlation peaks. Such filters are by design invariant to translations of the target. In their most basic form the test images are correlated with (training data) examples of the targets, at various different aspect angles and under various different configurations. In more advanced forms synthetic discriminant functions are used in the correlations. Correlation-filter techniques tend to involve a brute force approach to achieving transformation robustness, in that examples of lots of different configurations and transformations are used in the training data. Fortunately there exist efficient methods for calculating the correlations. Some are based on digital hardware implementations [20], while others are based on optical implementations (e.g. using Fourier transform lenses or holograms) [8, 42, 52, 103, 104, 107, 108, 186, 187]. As an example, a group at QinetiQ uses a hybrid optical-digital correlator [107, 108], which is less reliant on accurate alignment than some of the other optical implementations (an important issue if the system is to be used in a military environment). In particular, the Fourier transforms (binary phase only) of the reference images are calculated and stored electronically off-line. The Fourier transform (binary phase only) of the test image is also calculated electronically. Each reference Fourier transform is then multiplied with the complex conjugate of the test image Fourier transform, and the Inverse Fourier Transform calculated through an optical lens, to give the correlation map for each reference image. Using such an approach, over ten-thousand reference images per second can be correlated with a test image. Thus, although theoretically correlation-filter based ATR can be considered to be a quite primitive classification technique, the possible high-speed implementations lead to a highly competitive approach. 
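The Fourier correlation theorem underlying these filters can be sketched digitally with FFTs: the cross-correlation of a test image with a reference equals the inverse FFT of FFT(test) multiplied by the complex conjugate of FFT(reference), and the location of the correlation peak recovers the translation of the target. The toy images are illustrative, and the binary phase-only quantisation used in the hybrid optical-digital correlator is omitted.

```python
import numpy as np

def correlation_peak(test, reference):
    """Circular cross-correlation via the Fourier correlation theorem;
    returns the (row, col) location of the correlation peak."""
    corr = np.fft.ifft2(np.fft.fft2(test) * np.conj(np.fft.fft2(reference)))
    return np.unravel_index(np.argmax(np.abs(corr)), corr.shape)

# A small square "target" template, and a test image containing the same
# target shifted by (7, 11) pixels.
reference = np.zeros((32, 32))
reference[2:5, 2:5] = 1.0
test = np.roll(np.roll(reference, 7, axis=0), 11, axis=1)

print(correlation_peak(test, reference))  # → (7, 11): the recovered shift
```

Because the peak location, not its position in the image frame, carries the match, the classifier is invariant to global translations of the target, which is precisely the property exploited by the correlation-filter approaches surveyed here.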
Various types of pre-processing for optical correlators have been proposed to provide additional invariances, such as the scale and rotation invariance (at the expense of translation invariance) implemented using a log-polar image remapper [52]. Sadovnik [162] proposed another optical approach to ATR. A single Fourier lens was used to implement a scale and rotation invariant transformation of a pre-segmented phase-coded image of a target. The resulting transformed image was then used in a minimum logarithm-distance based classifier. The approach was applied to ATR of visible images of aircraft. It should be noted that the correlations in a correlation-filter do not have to be conducted on raw images. For example, Chen et al. [27] used a wavelet transform to extract features from radar returns, with these features then being used in a correlation-filter classifier. Hudson [73] applied correlation-filters to RRPs of aircraft. As well as using standard reference vectors, a form of synthetic discriminant function was used in which reference vectors were designed to maximise the correlation with a given target, while minimising the correlation with the profiles of all other targets. The data was divided into training and test sets by assigning odd numbered frames to the training set, and even numbered frames to the test set. The similarity of the test data to the training data meant that the results gave little indication of generalisation performance. Mahalanobis [116] provided another example of the use of synthetic discriminant functions in correlation-filters. In particular, a procedure was developed for two-class classification, with multi-class ATR coming from an appropriate combination of the two-class classifiers. The approach was applied to SAR images of vehicles, and was successfully able to reject a confuser vehicle not present in the training set. However, no other tests of robustness were conducted. Stewart et al. 
[173] compared a synthetic discriminant function correlation-filter to a Gaussian classifier, for recognition of synthetic RRPs of simple targets. For the Gaussian classifier the data was divided into aspect angle subclasses. The
correlation-filter outperformed the Gaussian classifier when only a small number of aspect angle subclasses were used, but was less robust in the presence of white Gaussian noise. Tang and Zhu [177] compared correlation-filters to an RBF neural network classifier for HRRP ATR. In particular, a correlation-filter in the space domain was compared to a correlation-filter in the frequency domain, and an RBF network in the frequency domain. The correlation-filter in the space domain had the best performance, but at the cost of computational expense. A further comparison of correlation-filters with neural-network classifiers was provided by Nieuwoudt and Botha [138], this time for recognising simulated aircraft targets from RRPs. Here, MLPs using features from the RRPs were shown to outperform correlation-filters (implemented with synthetic discriminant functions). The features used in the neural network were the relative position, amplitude and width of peaks in the RRPs, after a parametric spectral estimation procedure for forming the RRPs. Simulated radar data was used with the argument that it would reflect peak expected performance, rather than the performance of a current system under adverse conditions. The motivation behind this was that the performance of current systems should increase with technological improvements. Novak [141, 142] describes a three-stage approach to SAR target detection and recognition. In the first stage, a CFAR detector was used to locate potential objects. The second stage aimed to reject the natural clutter false alarms. This was done with target-sized rectangular matched filters, which also estimated the orientation of the object. Unspecified discriminants based on texture features (such as the standard deviation within the target-sized template and the fractal dimensions of the pixels) were then used to reject the natural clutter.
Finally, either correlation-filters with various synthetic discriminant functions, or distance based classifiers, were used to classify the remaining objects (after Fourier transform pre-processing of the inputs). The approach was applied to classification of ground targets in SAR imagery. When the effects of camouflage on the algorithm performance were investigated [141], as would be expected, there was a degradation in target detection probability at a given false alarm rate. Topiwala [180] addressed the issue of recognition of aircraft using radar returns modulated by the jet engines. Correlation in the time domain was found to be ineffective. Instead, a wavelet-transform feature reduction was applied to the radar returns, with empirically selected wavelet coefficients then used in a matched filter. In particular, the pre-processing consisted of a transformation to the cepstral domain (via a Fourier transform of the absolute value of the Fourier transform of the radar returns), followed by a wavelet transform. A further form of linear discriminant function is based on minimum distance to cluster prototypes. An example was provided by Eom and Chellappa [46] in which a multiscale hierarchical autoregressive moving average (ARMA) model was used to extract features from HRRPs. These feature vectors were then used in a clustering algorithm, with test data being assigned to the closest cluster. Gupta [63] proposed a localised approach to ATR, intended to be robust to partial occlusion of the targets. A contour tracking algorithm was used to extract the target boundaries. Local features, referred to as the localised contour sequence (LCS), were derived from the boundary. Parts of the LCS were then used in a form of minimum distance classifier, with built in rotation robustness. The technique was applied to infra-red (IR) images of military targets.
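The CFAR detection stage in the three-stage scheme above can be illustrated with a basic one-dimensional cell-averaging CFAR. The window sizes, threshold multiplier and clutter model below are illustrative assumptions, not values taken from the cited papers.

```python
import numpy as np

def ca_cfar(x, n_train=8, n_guard=2, scale=4.0):
    # Flag a cell if it exceeds `scale` times the mean of the surrounding
    # training cells; guard cells either side of the cell under test are
    # excluded so the target does not contaminate the noise estimate.
    n = len(x)
    detections = np.zeros(n, dtype=bool)
    half = n_train // 2 + n_guard
    for i in range(half, n - half):
        left = x[i - half : i - n_guard]
        right = x[i + n_guard + 1 : i + half + 1]
        noise = np.mean(np.concatenate([left, right]))
        detections[i] = x[i] > scale * noise
    return detections

rng = np.random.default_rng(1)
clutter = rng.exponential(1.0, 200)   # background clutter power
clutter[100] += 40.0                  # a strong point target
hits = ca_cfar(clutter)
```

Because the threshold adapts to the local clutter estimate, the false alarm rate stays roughly constant as the background level varies, which is the property the detector's name refers to.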
He et al. [68] applied a two-pronged approach to the recognition of simulated military vehicles from radar range profiles. In situations where the length of the target in a range profile was sufficiently large, an unspecified rule-based pattern recognition approach was applied to features of the RRPs (based on the scattering centres in the RRPs). For small target profile lengths a matched filter was used directly on the RRPs. The approach of Ikeuchi et al. [77] used a SAR simulator to build prototype templates. Features such as the distances between two points in the SAR image, or the slope of the bisecting line between two line segments were then used in a minimum distance-based classifier. Mahalanobis [117] proposed a minimum distance classifier implemented after a linear transformation of the input data. Classification was based on the minimum distance to the average training image of each class, after application of the linear transform. The approach was termed the Distance Classifier Correlation Filter and was applied to a small subset of the MSTAR public release datasets. A constrained linear discriminant analysis approach to ATR from hyperspectral images has been proposed by Du and Ren [44]. Standard LDA defines a dimensionality reduction projection that maximises the ratio of the between-class covariance matrix to the within-class covariance matrices. The constrained version added an additional constraint that the means of different classes under the projection should be aligned with different directions. The targets to be recognised were panels placed in fields. The approach by Stewart et al. [175] used Self-Organising Feature Maps (SOFMs) followed by a Learning Vector Quantiser (LVQ) to obtain prototypes for a minimum distance classifier. The approach was applied to RRPs of civilian vehicles. Lu and Chang [109] applied the same approach to ISAR imagery of civilian vehicles.
The features input into the classifier were obtained using 4 × 4 averaging windows with 50% overlap. An entropy-based discriminator for comparing test images with prototypes was found to give better performance than Euclidean distance. The robustness of the technique had not been assessed at the time of publication. VQ and LVQ clustering have been compared by Stewart et al. [174] for recognition of civilian vehicles from HRRPs. The LVQ clustering approach was found to give better performance than the standard VQ. Both the VQ and the LVQ had better performance than splitting each class into angle segments and using a Gaussian classifier within the segments. No assessment of robustness was available. Inggs [78] used LVQ clustering for ship ATR based on RRPs. The Fourier-Modified Discrete Mellin Transform was used for feature extraction. When training time was taken into consideration the LVQ clustering approach was considered to be superior to an MLP network (which had similar classification performance). Varad [185] used a SOFM for clustering of radar returns after wavelet-transform feature reduction. The aim was to distinguish re-entry vehicles from surrounding/accompanying objects in the exo-atmosphere. VQ clustering together with a Kohonen self-organising feature network were used to create cluster prototypes by Pham [148]. The approach was applied to synthetic RRPs of aircraft, and outperformed a baseline technique using prototypes selected via division into aspect/observation angle bins. Luttrell [114] used a topographic mapping neural network [92] to determine class prototypes (termed reference vectors) for use in a minimum distance classifier. The approach was applied to the recognition of military ships from HRRPs. The data has been used in Chapter 5 of this
thesis, where comparisons are made. The behaviour of each network was constrained by the use of topological prior knowledge (based on the aspect angle of the target to the radar line of sight). No data was available to assess the transformation robustness of the technique. Menon [126] used an Adaptive Clustering Network to define cluster prototypes. The approach was partially unsupervised in that the clusters were determined by correlations between all training data inputs, regardless of the class of the training data vectors. Thus clusters could contain more than one class. For each cluster, the degree to which the most common class within that cluster dominated the others was calculated, and stored as the “cluster class contrast”. Once trained, a test data vector was classified by finding the k-nearest cluster means, and weighting the most common classes in the k clusters by their cluster class contrast value. Subject to a threshold value, the test data vector was then assigned to the class with the largest weight. The approach was applied to ISAR images of naval ships.
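The LVQ1 prototype update used in several of the clustering approaches above can be sketched very compactly: the nearest prototype is moved towards a training vector if their labels agree, and away from it otherwise. The toy data, learning rate and epoch count are illustrative assumptions of this example.

```python
import numpy as np

def lvq1(X, y, prototypes, proto_labels, lr=0.05, epochs=20):
    # LVQ1: attract the winning prototype on a correct match, repel it
    # on a mismatch; classification is by nearest prototype label.
    P = prototypes.copy()
    for _ in range(epochs):
        for x, label in zip(X, y):
            k = np.argmin(np.linalg.norm(P - x, axis=1))
            step = lr if proto_labels[k] == label else -lr
            P[k] += step * (x - P[k])
    return P

# Two toy classes; prototypes initialised crudely between them.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(2, 0.3, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
P0 = np.array([[0.8, 0.8], [1.2, 1.2]])
P = lvq1(X, y, P0, np.array([0, 1]))
# A test vector is then assigned the label of its nearest prototype.
```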
Non-linear techniques
The vast majority of nonlinear discriminant function techniques in the ATR literature involve the application of MLP neural networks (trained using backpropagation [12]) to a selection of features derived from the sensor images. The literature examining MLPs for ATR is vast, so this survey selects just a few papers for further comment. In none of these cases was any specific effort made towards achieving transformation robustness. Where the issue of generalisation was raised, it generally referred to the standard neural network setting of different noise realisations on the same underlying data, rather than the harder problem of robustness to complex transformations. Baum [6] used an MLP neural network for the classification of laser radar profiles of ships. Benelli [9] investigated the detection of ship prows in X-band real-aperture radar imagery. Two sets of MLPs were used. The first set was used to perform a coarse-resolution segmentation, and the second set to locate prow features within the segmented images. Chakrabarti [24] compared an MLP to a minimum distance classifier for ATR of synthetic radar returns from aircraft. Features were extracted from an ARMA model for the poles of the radar returns. The MLP was found to outperform the minimum distance classifier, particularly as the signal-to-noise ratio (SNR) was decreased. Dror [43] used an MLP network to recognise 3-dimensional shapes from sonar echoes. Botha [14, 15] applied MLP neural networks to ATR of model aircraft using ISAR images. MLPs were applied separately to both the raw images, and features extracted from the raw images. Using the MLP on raw images gave better performance than a nearest neighbour classifier. In the first paper [14] a test of the generalisation properties of the MLP classifier was conducted. In particular, for the feature data the MLP was trained using top views of the aircraft, and tested using bottom views.
Rather surprisingly this gave better results than with the mixed (i.e. containing both top and bottom views) training and testing data. In the later paper [15] simple combination rules were used to combine the results from the nearest neighbour and MLP classifiers. Olson and Ybarra [144] trained an MLP network (using a momentum term) to identify vehicles using features from synthetic SAR images. The approach outperformed a Gaussian classifier on the same features, especially when the range of observation angles was increased. Luneau and Delisle
[111] used feature vectors from radar returns and Doppler signatures as inputs for an MLP network, designed to classify projectiles. Osman [146] applied MLP networks to ATR of military ships from SAR imagery. Three different feature sets were used: reduced resolution versions of the whole SAR images, SAR image range profiles, and SAR image range profiles after a local-statistics noise filtering algorithm. Using the reduced resolution SAR images was found to give the best performance, while noise-filtering the range profiles gave better performance than using standard range profiles. A multi-resolution representation of a SAR image was found to give better performance than single averaging to reduce resolution, but its design seemed somewhat ad hoc (consisting of differing numbers of various different sized blocks of averaged pixels). Remm [155] used an MLP for RRP ATR of vehicles. Pre-processing based on calculating the auto-correlation was found to give worse performance than using the raw signal. As noted earlier, Nieuwoudt and Botha [138] showed that an MLP using features from RRPs of simulated aircraft targets outperformed a correlation-filter implemented with synthetic discriminant functions. Fan et al. [48] used a wavelet transform for feature reduction in a radar ATR problem, with the features being used in an MLP. Fechner [49] investigated ATR from ISAR imagery. Initially a SOFM was used to segment the image into objects and background. Various features from the objects were then used in an MLP classifier.
A similar approach was used by Fiorentini et al. [51] for ship detection in radar images. Osman [145] used a probabilistic winner-take-all neural clustering scheme [146] to segment SAR imagery, and then an MLP network to classify ships in the segmented image. To make the use of neural networks more appealing to radar practitioners, Remm [154] proposed a technique for extracting rules and hints from the hidden layers of an MLP. The rules were extracted by applying a technique known as “optimal brain damage”, to iteratively prune (i.e. reduce the number of neural weights) an over-dimensioned neural network. The technique was applied to RRPs of unspecified targets. Munro et al. [130] used a likelihood ratio weighting function to modify the error term in an MLP network. The aim was to improve the detection abilities of the network for low probability events (such as targets in clutter). Although the technique required much less training data than a standard MLP, it required a high signal-to-clutter ratio. Filippidis [50] considered ATR of aircraft, based on Doppler spectra derived from continuous-wave coherent radar. After linear prediction pre-processing (which effectively used previous samples to smooth the data) an MLP was compared with a decision tree classifier (known as C4.5). The MLP had better performance, but took considerably longer to train. Kim [89] applied gamma kernels to SAR imagery, and used the outputs in an MLP target detector. Nandagopal [133] used an MLP for feature extraction of Doppler radar returns from rotating blades. The resulting features were used in a LVQ clustering classifier (thus such a technique could
be considered to be an example of the use of a linear discriminant function classifier). Ranganath [151, 152] used a self partitioning neural network (SPNN) for target detection of tanks and helicopters in visible and infra-red images. The training procedure for the SPNN measured the degree of cooperation between training data vectors (which consisted of features derived from the images) during training of an MLP network. The degree of cooperation was based on the direction in which the weight vectors changed. The target training vectors were subsequently clustered using a matrix made up from the cooperation scores. Separate MLP networks were then trained for each cluster. The classification system obtained by combining the separate MLP networks was found to have better performance than a single MLP. RBF networks have been used [183, 184] for the classification of HRRPs of military aircraft. In terms of the classification rate [184], the RBF networks were outperformed by a nearest neighbour classifier. However, the RBFs required much less computational effort to implement. Tang [177] also used an RBF neural network classifier for HRRP ATR. Again, although the RBF was outperformed by another technique (in this case a correlation-filter), the RBF network was competitive due to its lower computational expense of implementation. Ulug [182] compared an RBF network to a nearest neighbour classifier and an MLP for recognising civilian vehicles using features extracted from synthetic SAR imagery. The specific features used were scattering centre parameters, extracted from the SAR imagery using a 2-dimensional Prony model. The RBF network had the best performance of the three, and was more robust to noise. However, the RBF network defined on feature vectors had lower classification rate performance than a correlation-filter classifier on the raw input data. 
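As a sketch of the RBF classifier family referred to above: Gaussian basis functions centred on a few training points, with linear output weights fitted by least squares to one-hot targets. This is a common RBF training recipe, not the procedure of any particular cited paper, and the toy data and parameter values are assumptions of the example.

```python
import numpy as np

def rbf_design(X, centres, width):
    # Hidden-layer design matrix of Gaussian basis functions.
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * width**2))

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-1, 0.4, (40, 2)), rng.normal(1, 0.4, (40, 2))])
y = np.array([0] * 40 + [1] * 40)

centres = X[::10]                       # 8 centres taken from the data
H = rbf_design(X, centres, width=1.0)   # hidden-layer activations
T = np.eye(2)[y]                        # one-hot class targets
W, *_ = np.linalg.lstsq(H, T, rcond=None)
pred = np.argmax(rbf_design(X, centres, 1.0) @ W, axis=1)
```

The low cost of this recipe (one linear solve once the centres are fixed) is the "lower computational expense" that makes RBF networks competitive in the comparisons above even when their raw accuracy is slightly worse.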
Ukrainec [181] used an RBF network together with an adaptive cross-polar interference canceller and conventional CFAR processing, for target detection from polarimetric radar. The motivation was the design of a marine navigation system. RBF networks have been used by Xun et al. [195] to identify HRRPs of military aircraft. In particular, Wavelet transforms were used to separate the raw data into low-pass filter features and high-pass filter features. A separate RBF network was trained on both, with the final classification coming from a combination of the two outputs. The approach adopted by Zhao and Bao [196] used RBF networks for ATR of model military aircraft from HRRPs. After an initial averaging of consecutive range profiles, the amplitudes of the Fourier transform of the averaged range profiles were used as inputs to the RBF network. Maskall [120, 121, 119] used RBF neural networks to reduce the dimensionality of ISAR imagery of battlefield vehicles. The motivation was that a suitable nonlinear transformation from the high-dimensional image space to a lower dimensional space might be able to capture large scale characteristic structure in the data, while rejecting small-scale structure resulting from minor differences in target shape and configuration. The resulting features were then used in a nearest neighbour classifier. Both unsupervised and supervised feature reduction were investigated, with supervised feature reduction only providing improved performance when the number of features was constrained to be very small. The datasets used are also investigated within Chapter 4 of this thesis. Transformation robustness was obtained for some transformations, but not all. Sun et al. [176] used a linear interpolation neural network to learn the trajectories of features of RRPs as the observation angle varied. 
The test data was classified by finding the shortest distance between the feature vector of the unknown object and the feature trajectories (as opposed to just specific feature vectors) of all classes of targets. The approach was applied to RRPs of aircraft, and
outperformed a conventional MLP and a nearest neighbour classifier. A Neocognitron was used by Himes and Inigo [70] to recognise targets/objects in simulated visible images. The Neocognitron [53] is a hierarchical neural network architecture that recognises moderately shifted and deformed patterns. The lower levels of the Neocognitron detect local features within the input image, and then as the response passes up the hierarchy these local features are merged into more global features. The approach by Himes and Inigo [70] differed from the standard Neocognitron by manually setting (as opposed to determining via unsupervised training) the weights of the first layer of the network. This manual specification was done by selecting a set of features thought to be useful in classifying the input patterns, and then setting the weights so that they extracted those features. A further modification to the standard Neocognitron was made to make the training more efficient. Pre-processing transformations were used to enable scaling and rotation invariance. Only very limited experiments were conducted, which makes assessment difficult. A further modification of the Neocognitron, to provide rotation-invariance as well as shift-, deformation- and scale-invariance was provided by Li and Wu [101]. Again only very limited experiments were conducted. Another extension of the Neocognitron has been proposed by Hatakeyama and Kakazu [67], and applied to a very simple target detection problem. Using an optical correlator, Chao [25] implemented a multi-layer neural network based on a Neocognitron (with the multilayer processing achieved by iteratively feeding back the output of the optical correlator to the input spatial light modulator and updating the Fourier filters).
Very limited results were presented for the recognition of laser radar images of cones (representing re-entry vehicles) and beachballs (representing decoys), and no indication of the speed advantages of the optical implementation was provided. Guirong [62] proposed a sequential rule-based classification scheme using features derived from HRRPs. Another rule-based pattern recognition approach was proposed by He et al. [68] for the recognition of simulated military vehicles from RRPs. Chen et al. [27] used a wavelet transform to extract features from radar returns from military aircraft. A cascade of exponential correlation associative memory networks was then used to classify the aircraft. This gave better performance than MLP networks. Hughen [74] applied a higher order neural network to HRRP ATR of battlefield vehicles. The approach was found to have better performance than a Gaussian classifier, but no assessment of classifier robustness was conducted. Rizvi [157] proposed a multi-stage procedure for recognising targets from clutter in forward-looking infrared (FLIR) imagery. After initial ROI extraction using model-based target silhouettes, each potential target was divided into five regions (left, right, top, bottom-left and bottom-right). Within each region a PCA was conducted, to extract regional feature vectors. These regional features were used to obtain cluster prototypes in an LVQ classifier. For final classification, matching scores between the regional feature vectors from the target and the cluster prototypes were combined using an MLP. The relative positions of the scattering centres in SAR images were used by Bhanu and Jones [11] as indices to a lookup table of target type and pose. In particular, geometric hashing [192] was used for the look-ups. The approach was applied to images from the MSTAR public datasets (SAR images of military vehicles).
The paper argued that the SAR scattering centres were quasi-invariant to articulation, serial number variations, and small changes in radar depression angle. Results were provided for ATR between an air defence unit (ADU) and main battle tank (MBT). It was shown
that the MBT could be successfully classified with a 45◦ turret rotation, even if the training data only consisted of non-articulated objects. However, it was acknowledged that the articulation robustness might have been an artefact of the large separation between the examples of MBT and ADU. In a separate paper [76], the same approach was applied to synthetic SAR images, as well as the MSTAR datasets. The use of synthetic SAR images allowed the assessment of turret rotations up to 90◦. The approach of Ahn and Bhanu [2] was an extension of that by Bhanu and Jones [11, 76], in that ATR of two-part articulated objects (such as an MBT consisting of a body and the turret) was considered by modelling both parts and estimating the pose of both. The technique used the relative positions of the scattering centres in SAR images, along with a geometric hashing technique. Initially, the type and pose of the major part of the articulated object (e.g. the body of the tank) was recognised. The scattering centres for this major part were then removed before the next stage of the technique, which involved recognising the second part of the articulated object (e.g. the turret of the tank). The approach was applied to synthetic SAR imagery and the MSTAR public datasets. SVMs have been used by Zhao et al. [199, 197, 198] for ATR of SAR images of military vehicles. In particular, SVMs with Gaussian kernels were applied to the MSTAR public datasets. The test data was imaged at a slightly different depression angle to the training data, and some test vehicles had different serial numbers to the training data vehicles. The pose of the vehicles was (assumed) known to within a 30◦ segment, although the authors did propose an initial pose estimation step based on mutual information. The SVM was found to outperform a linear discriminant classifier (implemented using a perceptron), obtaining around 90% correct classification.
A further application of SVM classifiers to the MSTAR data was provided by Bryant and Garber [19], using a polynomial kernel. The approach was competitive with a minimum distance based classifier.
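The Gaussian-kernel decision function underlying these SVM classifiers, f(x) = Σ_i a_i y_i K(x_i, x), can be illustrated with a much simpler kernel perceptron. This is not the SVM training procedure (there is no margin maximisation), and the toy data and kernel width are assumptions of the example; it shows only the form of kernelised decision rule the cited papers rely on.

```python
import numpy as np

def gauss_kernel(A, B, gamma=0.5):
    # Gaussian (RBF) kernel matrix K[i, j] = exp(-gamma * ||A_i - B_j||^2).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-1, 0.4, (40, 2)), rng.normal(1, 0.4, (40, 2))])
y = np.array([-1] * 40 + [1] * 40)

K = gauss_kernel(X, X)
a = np.zeros(len(X))                  # dual coefficients
for _ in range(10):                   # perceptron epochs
    for i in range(len(X)):
        if np.sign(K[i] @ (a * y)) != y[i]:
            a[i] += 1.0               # mistake-driven update
pred = np.sign(gauss_kernel(X, X) @ (a * y))
```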
2.3.4 General comments
The preceding literature survey of ATR techniques has identified a wide variety of different approaches. Some involved quite complicated schemes (such as some of the neural network techniques), while others were fairly simple algorithmically, yet powerful due to their implementation speed (such as the optical correlation-based techniques). However, none of the techniques mentioned appear to have solved the issue of robustness to complex transformations in ATR systems, except, to a limited degree, by a brute-force approach, which places severe limitations on the range and scope of the object transformations that can be handled. Although some progress has been made, such as the model-based technique suggested by the MSTAR programme [75], automatic target recognition that is robust to the majority of EOCs has yet to be demonstrated fully. For this reason, commentators have suggested that the acronym ATR should more properly refer to “aided” target recognition [153]. In fairness, this is due to the immense difficulty of the problem. While some ATR practitioners, such as Licata [102], argue that ATR research must move from an “algorithm centred world” to a “product centred world”, even they accept that this is not appropriate for robust ATR, which still requires scientific breakthroughs.
CHAPTER 3. BAYESIAN GAUSSIAN MIXTURE MODEL APPROACH
3.1 Introduction
3.1.1 General
This chapter introduces a Gaussian mixture model [123, 179] approach to ATR. The generic discrimination problem considered is one where we are given a set of training data consisting of class labelled measurements (and possibly some unlabelled measurements), and then want to assign a previously unseen object to one of the classes, on the basis of the measurements made of that object. The developed algorithm is illustrated with the results of some experiments looking at ATR of Inverse Synthetic Aperture Radar (ISAR) images from three main classes of mobile battlefield targets. The classifier is based on density estimation (i.e. Bayes’ theorem) and is formally a parametric technique (although mixture model density estimation is often referred to as being semi-parametric). The density estimation approach (as described in Chapter 2) provides estimates of the class-conditional probability densities (likelihoods) p(x|j), for measurement data x, and classes j. Estimates for the posterior probabilities of class membership p(j|x) can then be produced using Bayes’ theorem, p(j|x) ∝ p(x|j)p(j), where p(j) are the prior class probabilities. Mixture models are used for the class-conditional probability densities. The motivation for a density estimation approach to classification is stated in Section 2.2.5.
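The Bayes' theorem step p(j|x) ∝ p(x|j)p(j) is a one-line normalisation; the sketch below uses arbitrary illustrative numbers for the likelihoods, which in this chapter would come from the fitted mixture models.

```python
import numpy as np

def posteriors(likelihoods, priors):
    # p(j|x) proportional to p(x|j) p(j), normalised over the classes j.
    unnorm = likelihoods * priors
    return unnorm / unnorm.sum()

# Three classes with equal priors; p(x|j) values are illustrative only.
p_x_given_j = np.array([0.02, 0.005, 0.001])
p_j = np.array([1 / 3, 1 / 3, 1 / 3])
post = posteriors(p_x_given_j, p_j)
# The object is assigned to the class maximising p(j|x).
```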
3.1.2 Introduction to mixture models
Mixture models [123, 179] provide a convenient semi-parametric method for density estimation, and have been used in some form or another in numerous areas, including target tracking, speech recognition and medical imaging. The general form of a mixture model is given by the following distribution:
p(y) = \sum_{i=1}^{g} π_i p_i(y|φ_i),   with   \sum_{i=1}^{g} π_i = 1 and π_i > 0, (i = 1, . . . , g),   (3.1)

where π_i are the mixing probabilities, and p_i(y|φ_i) the parameterised component densities. Typically, for continuous densities, the components are Gaussian. However, there is no restriction to Gaussian
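Evaluating the mixture density of equation (3.1) with Gaussian components is straightforward; the weights, means and standard deviations below are arbitrary illustrative values.

```python
import numpy as np

def gaussian_mixture_pdf(y, weights, means, sds):
    # p(y) = sum_i pi_i N(y; mu_i, sd_i^2), evaluated at each point of y.
    y = np.atleast_1d(y)[:, None]
    comps = np.exp(-0.5 * ((y - means) / sds) ** 2) / (sds * np.sqrt(2 * np.pi))
    return comps @ weights

w = np.array([0.3, 0.7])      # mixing probabilities, summing to one
mu = np.array([-2.0, 1.0])
sd = np.array([0.5, 1.0])
density = gaussian_mixture_pdf(np.linspace(-4, 4, 9), w, mu, sd)
```

Because the mixing probabilities sum to one and each component is a density, the mixture integrates to one, as a quick numerical check confirms.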
components. Indeed, in Chapter 5, based upon physical reasoning, gamma component distributions are used. Given parameterised forms for the component densities, there are two main techniques for estimating the parameters φ_i. The most common is the Expectation-Maximisation (EM) algorithm [122], which estimates the maximum likelihood parameters. The second major technique, which is the Bayesian approach adopted in this chapter, uses Markov chain Monte Carlo (MCMC) algorithms [55, 158, 178] and, with the advent of cheap and massive computational power, is becoming increasingly popular.
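For contrast with the MCMC approach adopted here, a minimal EM fit of a univariate two-component Gaussian mixture can be sketched as follows; the initialisation scheme and toy data are illustrative choices of this example, not part of the thesis method.

```python
import numpy as np

def em_gmm_1d(y, R=2, iters=50):
    # EM for a univariate Gaussian mixture: the E-step computes component
    # responsibilities, the M-step re-estimates weights, means, variances.
    pi = np.full(R, 1.0 / R)
    mu = np.quantile(y, np.linspace(0.25, 0.75, R))  # crude spread-out init
    var = np.full(R, y.var())
    for _ in range(iters):
        # E-step: g[n, i] proportional to pi_i N(y_n; mu_i, var_i)
        g = pi * np.exp(-0.5 * (y[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        g /= g.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted parameter updates
        Nk = g.sum(axis=0)
        pi = Nk / len(y)
        mu = (g * y[:, None]).sum(axis=0) / Nk
        var = (g * (y[:, None] - mu) ** 2).sum(axis=0) / Nk
    return pi, mu, var

rng = np.random.default_rng(5)
y = np.concatenate([rng.normal(-3, 1, 300), rng.normal(3, 1, 700)])
pi, mu, var = em_gmm_1d(y)
```

On well-separated data like this, EM recovers the component means and weights; the unboundedness pathology discussed in Section 3.2.2 arises when a component collapses onto a single data point.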
3.2 Mixture model approach
3.2.1 Mixture model motivation
Frequently, estimation of the class-conditional probability densities p(x|j) in an ATR system is complicated by the high dimensionality of typical measurement data. Non-parametric methods of density estimation, such as kernel-based methods, tend to require unrealistically large amounts of training data for accurate density estimates [166, 189]. However, many parametric methods, such as approximation by a multivariate Gaussian distribution, might impose a specific form on the density that is too rigid for the problem, particularly if the classifier is to be robust. Mixture models provide a parametric method with enough flexibility to model a wide range of possible densities, but with enough structure that they can be trained effectively with only limited amounts of data. This is why mixture model density estimation is often referred to as semi-parametric, and it forms the main motivation for the use of mixture models for the class-conditional probability densities. Further motivations for the mixture model approach to density estimation in ATR applications arise from the following observations:
O1) The probability density function of the sensor measurements for a single target can be expressed as an integral, over the angle of observation, of a conditional density of a simple form (e.g. Gaussian, gamma), with the mixture distribution arising as the approximation of this integral by a finite sum.
O2) Additional effects, such as robustness to offsets in position, can be readily incorporated into such a model.
To illustrate the first observation, suppose that the conditional distribution of the measurement data, x, given the class, j, and the angle of observation, φ, is given by a (simple) distribution p(x|j, φ). Then the overall class-conditional density can be written as:

p(x|j) = \int_Φ p(x, φ|j) dφ = \int_Φ p(φ|j) p(x|j, φ) dφ,   (3.2)
CHAPTER 3. BAYESIAN GAUSSIAN MIXTURE MODELS
where p(φ|j) is the probability density of the angle of orientation. Replacing this integral by a finite sum approximation, based on R components, gives the finite mixture distribution:
p(x|j) = \sum_{r=1}^{R} p_{j,r} \, p(x|j, \phi_r), \quad \text{with} \quad \sum_{r=1}^{R} p_{j,r} = 1.   (3.3)
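To make the finite-sum approximation concrete, the following minimal sketch (hypothetical names; Gaussian conditionals p(x|j, φ_r) are assumed purely for illustration) evaluates a one-dimensional version of the mixture in (3.3):

```python
import numpy as np

def mixture_density(x, weights, means, sigmas):
    """Finite mixture approximation to p(x|j) in (3.3).

    weights: mixing probabilities p_{j,r} (must sum to 1);
    means, sigmas: parameters of the simple conditional densities
    p(x|j, phi_r), here taken to be univariate Gaussians.
    """
    weights = np.asarray(weights, dtype=float)
    assert np.isclose(weights.sum(), 1.0), "mixing probabilities must sum to 1"
    means, sigmas = np.asarray(means), np.asarray(sigmas)
    comps = (np.exp(-0.5 * ((x - means) / sigmas) ** 2)
             / (np.sqrt(2 * np.pi) * sigmas))
    return float(np.dot(weights, comps))

# Two angle-sector components with equal weight.
p = mixture_density(0.0, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

Each component plays the role of one term p_{j,r} p(x|j, φ_r) in the finite sum.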
Note, however, that the mixture model approach offers more than just a segmentation of the data into angle sectors. By allowing the training data within each class to cluster naturally according to the data (subject to the component distribution assumptions), rather than into manually specified angle segments, the overall fit of the data should be better. For instance, if symmetries in the objects lead to the sensor measurements being similar in separated angle segments, this can be handled efficiently through mixture modelling, but not through manual segmentation into angles. To see the second observation, O2, suppose that the target is recentred, within a given field of view, to produce a data vector, y_c. Typically, the component densities would then be represented by p(y_c|ψ), where ψ denotes the component parameters. However, in reality, there are uncertainties in recentring the target, so small shifts in the position of the centre of the target should be considered, producing a whole set of recentred vectors, y_{c_m}, m = 1, …, n_s. The component densities, p(y_c|ψ), can then be replaced by (1/n_s) \sum_{m=1}^{n_s} p(y_{c_m}|\psi).
3.2.2 Bayesian motivation

A Bayesian approach to estimating the Gaussian mixture model parameters is adopted. Bayesian treatments of Gaussian mixture models offer a number of advantages over methods based on maximum likelihood techniques, such as the EM algorithm. Not least is the elimination of the problem of unboundedness of the likelihood function, which is frequently ignored in maximum likelihood techniques. To see this problem, consider the following mixture of univariate Gaussians:

p(y|\pi_1, \ldots, \pi_R, \mu_1, \ldots, \mu_R, \sigma_1^2, \ldots, \sigma_R^2) = \sum_{j=1}^{R} \pi_j N(y; \mu_j, \sigma_j^2) = \sum_{j=1}^{R} \pi_j \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\left( \frac{-(y - \mu_j)^2}{2\sigma_j^2} \right).   (3.4)
If we set µ_1 = y and let σ_1² → 0, then this likelihood tends to infinity, i.e. the likelihood is unbounded. Provided that R ≥ 2, and the parameters of the remaining mixture components are set appropriately, this unboundedness will still be present in the likelihood for a full set of data points, y_1, …, y_n. Thus, an unconstrained maximum likelihood estimation procedure is inappropriate. Within the Bayesian formalism the prior distribution provides the constraints. Another motivation for the use of Bayesian parameter estimation techniques is that the EM algorithm is notorious for its initialisation problems, frequently becoming stuck in different local maxima, depending on the specific initialisation of the algorithm.
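The unboundedness can be demonstrated numerically. The sketch below (hypothetical names, assuming a two-component univariate mixture of the form (3.4)) centres one component on a data point and shrinks its variance; the likelihood grows without bound:

```python
import numpy as np

def gmm_likelihood(y, pis, mus, sigmas):
    """Likelihood of the data y under the univariate Gaussian mixture (3.4)."""
    dens = (np.exp(-0.5 * ((y[:, None] - mus) / sigmas) ** 2)
            / (np.sqrt(2 * np.pi) * sigmas))
    return float(np.prod(dens @ pis))

y = np.array([0.0, 1.0, 2.0])
pis = np.array([0.5, 0.5])
# Centre component 1 on the data point y_1 and shrink its variance:
# the likelihood grows without bound as sigma_1 -> 0.
liks = [gmm_likelihood(y, pis, np.array([y[0], 1.0]), np.array([s, 1.0]))
        for s in (1.0, 0.1, 0.01)]
```

The second component keeps the remaining data points supported, so the spike at y_1 inflates the product of mixture densities, exactly as argued above.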
There are also the standard arguments in favour of Bayesian techniques, such as the ability to incorporate additional prior information, perhaps elicited from expert knowledge, and the production of credibility intervals for the estimated parameters. Some of the advantages and disadvantages of Bayesian treatments for Gaussian mixture models are documented in [40, 55, 124, 139, 149, 150, 160]. A further motivation for the Bayesian parameter estimation approach lies in the potential for using hyperparameters in our prior distributions for the mixture model parameters, to account for differences between training and test data, such as variations in the equipment fit, or different types of target (perhaps later models) from the same generic class. This is investigated in Chapter 4.
3.3 Related work
Hastie and Tibshirani [66] used an EM algorithm [122] to estimate the maximum likelihood parameters of class distributions that are Gaussian mixture models, making the assumption of a common covariance matrix across all the mixture components. Uses of Bayesian Gaussian mixture models for classification have been proposed by Laskey [95], who formulated a Bayesian approach to clustering and classification, but used the EM algorithm to estimate the maximum a posteriori parameter values only, and by Attias [4], who used Variational Bayes methods [7] on mixture models that were then applied to a classification problem. Unlike the other mixture model approaches, which treat the estimation of the class-conditional densities separately and independently, the approaches in this and subsequent chapters estimate the parameters of all class distributions together, thus allowing the ready use of unlabelled training data. The approach presented in this chapter generalises the work of Lavine and West [97], who looked at a Bayesian approach to classification where each class was distributed as a multivariate Gaussian, to a Bayesian approach to classification where each class is distributed as a Gaussian mixture model. In the context of ATR using SAR imagery, O'Sullivan et al. [147] model SAR imagery as a complex Gaussian process conditioned on the target type and pose. The approach adopted in this chapter differs from that of O'Sullivan et al. in that the mixture components in this work do not necessarily relate to specific connected angular regions, and in the use of Bayesian techniques for parameter estimation.
3.4 Problem formulation
3.4.1 Introduction and notation

To emphasise the generic nature of the proposed ATR system, the approach is formulated using a generic discrimination problem: namely, the classification of an object into one of J distinct classes, on the basis of a d-dimensional data measurement of that object. For the military ATR application, the classes are the target types and the data measurements are the sensor returns (e.g. the radar returns from the target).
Our training data, which in general can comprise both labelled and unlabelled measurements¹ from the classes, consist of n observed d-dimensional data samples y = {y_1, …, y_n}, and any corresponding class labels. If the class of the measurement y_i is known it is denoted by Z_i. The combination of y and any known class allocations for the data is denoted by D. In many ATR applications all the training data will have been labelled, either through a controlled collection process (typically referred to as a trial) or via a military analyst. Where we need to distinguish between labelled and unlabelled training data we refer to {i ∈ lab} as the indices of the labelled training data and {i ∈ unlab} as the indices of the unlabelled training data. The probability density function for the d-dimensional data x can be written as:
p(x) = \sum_{j=1}^{J} \theta_j \, p(x|j),   (3.5)

where θ = (θ_1, …, θ_J) is a vector of the prior classification probabilities for each class, with components satisfying:

\sum_{j=1}^{J} \theta_j = 1, \qquad \theta_j > 0 \ \text{for} \ j = 1, \ldots, J,   (3.6)
and p(x|j) is the class-conditional probability density (likelihood) for data from class j. By combining the separate classes into a single mixture distribution, rather than treating each class independently, efficient use can be made of any unlabelled data that may be available. The class-conditional densities are also modelled by mixture distributions, with the j-th class having R_j components, which we refer to as mixture components:

p(x|j) = \sum_{r=1}^{R_j} \pi_{j,r} \, p(x|j, r).   (3.7)
π_j = (π_{j,1}, …, π_{j,R_j}) represents the mixing probabilities within class j; i.e. π_{j,r} > 0 is the mixing probability for the r-th mixture component of the j-th class, satisfying \sum_{r=1}^{R_j} \pi_{j,r} = 1. We denote the complete set by π = {π_j, 1 ≤ j ≤ J}. The distribution p(x|j, r) represents the probability density of the data within a given mixture component r, of a class j. To reduce the number of parameters to be estimated, we make an assumption of independence between the components of the data vector x, conditioned on the class and mixture component. This corresponds to:

p(x|j, r) = \prod_{l=1}^{d} p_l(x_l|j, r),   (3.8)
where x = (x_1, …, x_d). For reasonably large values of d, and a non-trivial number of mixture components, it is likely that there will not be enough training data to obtain good estimates of the mixture component distributions if independence is not assumed. For radar returns, interpreting the mixture components as angle indicators (which, it should be stressed, might not be the case), the independence corresponds to the assumption of independence between the radar cross-section fluctuations of neighbouring range bins, which is reasonable in many circumstances [39]. Note that this independence within each component does not extend to an independence assumption for the mixture distribution as a whole. Interpreting the mixture components as angle indicators, this agrees with radar scatterers moving from one range gate to another as the target rotates. Gaussian distributions are used for the component distributions p_l(x_l|j, r) of (3.8), with means μ_{j,r,l} and variances σ²_{j,r,l}, where l = 1, …, d, i.e.:

p_l(x_l|j, r) = N(x_l; \mu_{j,r,l}, \sigma^2_{j,r,l}).   (3.9)

¹ Unlabelled data may arise due to collection of data in hostile environments where the exact ground truth is unknown.
For a given class and mixture component the collections of means and variances are represented by the vectors µj,r and Σj,r . The sets of all means and variances are represented by µ and Σ, respectively.
3.4.2 Model order

The choice of the number of mixture components, R_j, to use in each mixture model is problematic, since a full Bayesian approach would treat the model order as an unknown random variable. A fully satisfactory treatment of uncertainty in model order for multi-dimensional mixture models has yet to be completed in the Bayesian literature. The high-dimensional nature of the measurement data means that reversible jump MCMC (RJMCMC) [61, 118, 156] techniques would be extremely complicated to operate efficiently, as would the Markov birth-death process approach of Stephens [170, 171]. Thus, currently the number of mixture components is held fixed throughout, with the value chosen as a compromise between having enough components for adequate modelling of the data (as determined by looking at the training data classification rate), and the computational efficiency of smaller model orders. Future research could address techniques for penalising evaluation complexity [37] as well as the more familiar issue of potential training data overfitting if too complex a model is used. Within an RJMCMC framework such penalisation terms would be incorporated into a prior distribution for the model order. To apply a penalisation criterion outside of an RJMCMC framework, the MCMC algorithm would be applied for a variety of fixed model orders. The penalisation criterion would then be used to select from the resulting trained models. Aside from the easier implementation compared to an RJMCMC solution, the advantage of such an approach is that the penalisation criterion can incorporate results from using the trained models to classify data (such as the training data, or possibly validation data), rather than being solely based on the posterior distribution values.
However, such an approach does carry the caveat that although the classification performance (as measured by the classification rate) might be better for a particular model order, the accuracy of the posterior probability estimates might be reduced. This could be important if the class probabilities are to be combined with additional information (such as contextual information as proposed in Chapter 7), or if they are to be used in a multi-level model reflecting the whole decision support process (such as whether or not to engage an imaged target, to re-examine it or bring it to the attention of a human operator).

Figure 3.1: DAG illustrating the hierarchical structure for the prior distributions of the parameters of the Gaussian mixture models. [Figure: fixed hyperparameters h0, ν0, m0, V0, a0, b0 (square boxes) feeding the random variables µ, Σ, θ, π (circular boxes), which in turn generate the data y.]
3.4.3 Prior distributions

A hierarchical structure is assigned for the prior distributions of the mixture model parameters, which is illustrated with a Directed Acyclic Graph (DAG) [96, 156] in Figure 3.1. The random variables are denoted by circular boxes, and the fixed hyperparameters by square boxes. For notational convenience, our equations do not explicitly state the dependence of the distributions on the fixed hyperparameters. The complete prior distribution for the mixture model parameters is:

p(\mu, \Sigma, \theta, \pi) = p(\mu, \Sigma) \, p(\theta) \prod_{j=1}^{J} p(\pi_j).   (3.10)
An assumption of prior independence of the (μ_{j,r}, Σ_{j,r}) over all classes j and mixture components r is made. In addition, the dimensions within each mixture component are taken to be independent, so that:

p(\mu, \Sigma) = \prod_{j=1}^{J} \prod_{r=1}^{R_j} \prod_{l=1}^{d} p(\mu_{j,r,l}, \sigma^2_{j,r,l} | m_{j,r,l,0}, h_{j,r,l,0}, \nu_{j,r,l,0}, V_{j,r,l,0}).   (3.11)

The components μ_{j,r,l} and σ²_{j,r,l} are given independent normal-inverse gamma priors:

(\mu_{j,r,l} | \sigma^2_{j,r,l}) \sim N(m_{j,r,l,0}, \, \sigma^2_{j,r,l}/h_{j,r,l,0}), \qquad \sigma^2_{j,r,l} \sim 1/\mathrm{Ga}(\nu_{j,r,l,0}, V_{j,r,l,0}),   (3.12)
for fixed means m_{j,r,l,0}, precision parameters h_{j,r,l,0}, degrees of freedom ν_{j,r,l,0} and scale parameters V_{j,r,l,0}. The form of parameterisation for the inverse gamma distribution is given in Appendix A. The class and mixture component allocation probabilities are given independent Dirichlet priors, with:

\theta \sim D(a_0), \quad \text{where} \ a_0 = (a_{1,0}, \ldots, a_{J,0}), \quad a_{j,0} > 0,   (3.13)

and:

\pi_j \sim D(b_{j,0}), \quad \text{where} \ b_{j,0} = (b_{j,1,0}, \ldots, b_{j,R_j,0}), \quad b_{j,r,0} > 0.   (3.14)
The hyperparameters a0 and bj,0 are held fixed, with equality between the components within each hyperparameter vector. Thus, there is no prior preference to any class or mixture component.
3.4.4 Posterior distribution

The likelihood function for the problem is written:

p(y|\mu, \Sigma, \theta, \pi) = \prod_{\{i \in \mathrm{unlab}\}} \left[ \sum_{j=1}^{J} \theta_j \sum_{r=1}^{R_j} \pi_{j,r} \prod_{l=1}^{d} N(y_{i,l}; \mu_{j,r,l}, \sigma^2_{j,r,l}) \right] \times \prod_{\{i \in \mathrm{lab}\}} \left[ \sum_{r=1}^{R_{Z_i}} \pi_{Z_i,r} \prod_{l=1}^{d} N(y_{i,l}; \mu_{Z_i,r,l}, \sigma^2_{Z_i,r,l}) \right].   (3.15)
Bayes' rule gives the following relationship between the posterior, prior and likelihood:

p(\mu, \Sigma, \theta, \pi | y) \propto p(y | \mu, \Sigma, \theta, \pi) \, p(\mu, \Sigma, \theta, \pi),   (3.16)

which gives a posterior distribution on which exact analytical inference cannot be made. In the main, this is due to the multiplication of summations in the likelihood function. In particular, the expanded likelihood function is a sum of between \prod_{i=1}^{n} R_{Z_i} and (\sum_{j=1}^{J} R_j)^n terms, depending on the relative proportions of labelled and unlabelled training data. In particular, calculation of the normalisation constant is analytically intractable (regardless of the proportion of unlabelled data), as are calculations of various statistics of interest, such as the means and variances of the parameters. Rather than making some rather dubious simplifications to allow us to make inference on the posterior distribution, a full Bayesian approach to the problem is maintained, by drawing samples from the posterior. All inferences can then be made through consideration of these samples. For instance, given a set of N samples (\mu^{(s)}, \Sigma^{(s)}, \theta^{(s)}, \pi^{(s)}) from the posterior distribution, the mean and covariance matrix of the mean vector for the r-th mixture component of the j-th class can be estimated by:

\hat{E}(\mu_{j,r}) = \frac{1}{N} \sum_{s=1}^{N} \mu^{(s)}_{j,r},   (3.17)

and:

\widehat{\mathrm{Cov}}(\mu_{j,r}) = \frac{1}{N-1} \sum_{s=1}^{N} \left( \mu^{(s)}_{j,r} - \hat{E}(\mu_{j,r}) \right) \left( \mu^{(s)}_{j,r} - \hat{E}(\mu_{j,r}) \right)^T,   (3.18)

respectively.
Unfortunately, it is not possible to sample directly from the posterior distribution, so a Markov chain Monte Carlo (MCMC) algorithm [55, 158, 178] is used. In particular, a Gibbs sampler [21] is developed to draw approximately independent samples from the distribution.
3.5 MCMC algorithm
3.5.1 Introduction

The basic premise behind MCMC techniques [55, 158, 178, 189] is as follows. Suppose that we wish to make inferences based on a distribution p(x) with support χ. Furthermore, suppose that this distribution is too complicated for analytical calculations. Instead, we seek to draw samples from the distribution and use these samples to estimate the characteristics of interest. If the distribution is too complicated to sample from directly, an irreducible and aperiodic Markov chain, with state space χ and equilibrium distribution p(x), is constructed from which it is easy to sample. Then, if the chain is run for long enough, the samples from the Markov chain will be distributed as if they were drawn directly from the required distribution p(x). MCMC techniques provide the transition kernel that defines such a Markov chain. The Gibbs sampler [21] is a means of MCMC sampling from a distribution by splitting it into full conditional distributions. The technique is well illustrated by considering a two variable problem (Ψ_1, Ψ_2), which is such that sampling from the joint distribution p(ψ_1, ψ_2) is too hard, but for which it is possible to sample from the conditional distributions p(ψ_1|ψ_2) and p(ψ_2|ψ_1). In the Gibbs sampler algorithm, the value of ψ_2^{(0)} is set arbitrarily and the algorithm then proceeds iteratively:

• sample ψ_1^{(i)} from the distribution p(ψ_1 | ψ_2 = ψ_2^{(i-1)}),

• sample ψ_2^{(i)} from the distribution p(ψ_2 | ψ_1 = ψ_1^{(i)}).

Then, as i → ∞, the samples (ψ_1^{(i)}, ψ_2^{(i)}) tend to samples from the joint distribution for the variables (Ψ_1, Ψ_2). The extension of the algorithm to more than two variables is trivial, each extra variable adding another conditional sampling step at each iteration of the algorithm.
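As an illustration of the two-variable scheme above, the following sketch (hypothetical names; a zero-mean bivariate normal with correlation ρ is assumed, since its full conditionals ψ_1|ψ_2 ~ N(ρψ_2, 1-ρ²) and ψ_2|ψ_1 ~ N(ρψ_1, 1-ρ²) are known exactly) implements the iteration directly:

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_iter, seed=0):
    """Gibbs sampler for a zero-mean bivariate normal with correlation rho.

    Each step draws directly from one of the two full conditional
    distributions, alternating between psi_1 and psi_2.
    """
    rng = np.random.default_rng(seed)
    s = np.sqrt(1.0 - rho ** 2)    # conditional standard deviation
    psi1, psi2 = 0.0, 0.0          # arbitrary starting value for psi_2
    samples = np.empty((n_iter, 2))
    for i in range(n_iter):
        psi1 = rho * psi2 + s * rng.standard_normal()
        psi2 = rho * psi1 + s * rng.standard_normal()
        samples[i] = psi1, psi2
    return samples

# Run the chain and discard an initial burn-in period.
draws = gibbs_bivariate_normal(rho=0.8, n_iter=20000)[2000:]
```

The retained draws have approximately the target marginals and correlation, illustrating convergence of the chain to its equilibrium distribution.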
3.5.2 Outline

To make use of the Gibbs sampler in our problem, the set of random variables in the posterior distribution p(µ, Σ, π, θ|y) is augmented to include the class allocation variables Z = (Z_1, …, Z_n) and the mixture component allocation variables z = (z_1, …, z_n). These are such that (Z_i = j, z_i = r) implies that the observation indexed by i is modelled to be from mixture component r of class j. Z_i is known for the labelled training data (and should be treated as a constant), whereas z_i is always unknown. Then, rather than sampling from p(µ, Σ, π, θ|y) we sample from the extended posterior distribution p(µ, Σ, π, θ, Z, z|y). The new augmented set of variables is split into three groupings; namely (µ, Σ), (Z, z) and (θ, π). The required constituents for the Gibbs sampler algorithm are then the posterior distributions for each group, conditional on the other two groups, i.e. the distributions:

• p(µ, Σ | θ, π, y, Z, z),

• p(θ, π | y, Z, z, µ, Σ),

• p(Z, z | y, µ, Σ, θ, π).

The outline of the MCMC Gibbs sampler algorithm is given in Figure 3.2:

1. Initialisation. Set i = 1 and determine initial values for (Z^{(0)}, z^{(0)}) from the support of the joint posterior distribution.
   • Elements of Z^{(0)} relating to labelled training data are assigned their true class labels.
   • Elements of Z^{(0)} relating to unlabelled training data are set using very simple (and quick) classifiers, such as nearest class mean.
   • The z^{(0)} are initialised with simple clustering algorithms such as k-means, or, for training data with known angles of observation, by clustering into angle segments.

2. Iteration i.
   • Draw a sample (µ^{(i)}, Σ^{(i)}) from p(µ, Σ | y, θ^{(i-1)}, π^{(i-1)}, Z^{(i-1)}, z^{(i-1)}) using (3.21) and (3.20), with parameters defined in (3.22), (3.23), (3.24) and (3.25) (see Section 3.5.3). This gives an updated set of parameters for the mixture component distributions.
   • Sample (θ^{(i)}, π^{(i)}) from p(θ, π | y, µ^{(i)}, Σ^{(i)}, Z^{(i-1)}, z^{(i-1)}), using (3.28) and (3.31) (see Section 3.5.4). This gives an updated set of the class and mixing probabilities.
   • Sample (Z^{(i)}, z^{(i)}) from p(Z, z | y, µ^{(i)}, Σ^{(i)}, θ^{(i)}, π^{(i)}), using (3.39) and (3.41) (see Section 3.5.5), giving an updated set of class and mixture component allocation variables.

3. i ← i + 1 and go to 2.
Figure 3.2: Outline of the Bayesian Gaussian mixture model training algorithm.

After an initial burn-in period, during which the generated Markov chain reaches equilibrium, the set of parameters (µ^{(i)}, Σ^{(i)}, θ^{(i)}, π^{(i)}, Z^{(i)}, z^{(i)}) can be regarded as dependent samples from the posterior distribution p(µ, Σ, θ, π, Z, z|y). To obtain approximately independent samples, a gap (known as the decorrelation gap) is left between successive samples (i.e. we only retain a sample every l-th iteration of the algorithm, where l ≥ 1). If we are concerned only with ergodic averages, lower variances are obtained if the output of the Markov chain is not sub-sampled. However, if storage of the samples is an issue, it is better to leave a decorrelation gap, so that the full space of the distribution can be explored without having to keep thousands of samples. In our notation these approximately independent samples are relabelled (µ^{(s)}, Σ^{(s)}, θ^{(s)}, π^{(s)}, Z^{(s)}, z^{(s)}), for s = 1, …, N, where N is the number of MCMC samples. Typically, choice of the burn-in period, decorrelation gap and number of samples will involve specifying an initial set of values, and then examining the classification rate for the training datasets. Modification of these initial values can then be made in the light of this classification performance.
This, together with examining sample plots, is likely to provide a better test of convergence than more sophisticated convergence diagnostics [55, 125], which tend to fare poorly for all but the simplest of problems [71, 189].
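The burn-in and decorrelation-gap bookkeeping described above amounts to a simple slice of the stored chain; a minimal sketch (hypothetical names):

```python
import numpy as np

def thin_chain(chain, burn_in, gap):
    """Discard the burn-in iterations, then keep every gap-th sample,
    giving the approximately independent draws used for inference."""
    return chain[burn_in::gap]

chain = np.arange(100)   # stand-in for 100 stored MCMC iterations
kept = thin_chain(chain, burn_in=20, gap=10)
```

In practice `chain` would hold the full parameter sets (µ^{(i)}, Σ^{(i)}, θ^{(i)}, π^{(i)}, Z^{(i)}, z^{(i)}) rather than integers.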
3.5.3 Conditional distributions for the mixture components

Sampling from the conditional distributions for (µ, Σ) makes use of the fact that, for a given set of allocation variables (Z, z), the data y consist of classified independent samples from the k = R_1 + · · · + R_J mixture components, independent of the prior class and mixing probabilities. Additionally, use is made of the fact that, if (r, j) ≠ (r′, j′), our parameters and prior distributions are formulated so that the training data assigned to mixture component r of class j have no influence on the posterior distribution of the component parameters of mixture component r′ of class j′. We define G_{j,r} = {i; (Z_i = j, z_i = r)}, the set of indices of data elements that have been assigned to component r of class j; and g_{j,r} to be the cardinality of G_{j,r}. The set {y_i; i ∈ G_{j,r}} is denoted by y_{G_{j,r}}. In addition, the following are needed:

\bar{y}_{j,r,l} = \frac{1}{g_{j,r}} \sum_{i \in G_{j,r}} y_{i,l} \quad \text{and} \quad S_{j,r,l} = \sum_{i \in G_{j,r}} (y_{i,l} - \bar{y}_{j,r,l})^2.   (3.19)
Some calculations (detailed in Appendix B.1) show that the independent normal-inverse gamma prior distributions give rise to independent normal-inverse gamma posterior distributions:

\mu_{j,r,l} | (\sigma^2_{j,r,l}, y, Z, z) \sim N(\kappa_{j,r,l}, \, \sigma^2_{j,r,l}/h_{j,r,l}),   (3.20)

and:

\sigma^2_{j,r,l} | (y, Z, z) \sim 1/\mathrm{Ga}(\nu_{j,r,l}, V_{j,r,l}),   (3.21)

for j = 1, …, J, r = 1, …, R_j and l = 1, …, d, where:

h_{j,r,l} = h_{j,r,l,0} + g_{j,r},   (3.22)

\kappa_{j,r,l} = (h_{j,r,l,0} m_{j,r,l,0} + g_{j,r} \bar{y}_{j,r,l}) / h_{j,r,l},   (3.23)

\nu_{j,r,l} = \nu_{j,r,l,0} + g_{j,r}/2,   (3.24)

V_{j,r,l} = V_{j,r,l,0} + S_{j,r,l}/2 + h_{j,r,l,0} \, g_{j,r} (\bar{y}_{j,r,l} - m_{j,r,l,0})^2 / (2 h_{j,r,l}).   (3.25)
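The conjugate updates (3.22)-(3.25) and the draws (3.20)-(3.21) can be sketched as follows (hypothetical names; Ga(ν, V) is assumed here to denote a gamma distribution with shape ν and rate V, intended to match the parameterisation of Appendix A):

```python
import numpy as np

def nig_posterior_update(y, m0, h0, nu0, V0):
    """Posterior parameters (3.22)-(3.25) of the normal-inverse gamma
    conjugate update, for the data y assigned to one mixture component
    in one dimension."""
    y = np.asarray(y, dtype=float)
    g = y.size                          # g_{j,r}: component occupancy count
    ybar = y.mean()
    S = ((y - ybar) ** 2).sum()         # S_{j,r,l} from (3.19)
    h = h0 + g                                               # (3.22)
    kappa = (h0 * m0 + g * ybar) / h                         # (3.23)
    nu = nu0 + g / 2.0                                       # (3.24)
    V = V0 + S / 2.0 + h0 * g * (ybar - m0) ** 2 / (2.0 * h) # (3.25)
    return h, kappa, nu, V

def sample_mu_sigma2(rng, h, kappa, nu, V):
    """Draw sigma^2 ~ 1/Ga(nu, V), then mu | sigma^2 ~ N(kappa, sigma^2/h),
    as in (3.20)-(3.21)."""
    sigma2 = 1.0 / rng.gamma(shape=nu, scale=1.0 / V)
    mu = rng.normal(kappa, np.sqrt(sigma2 / h))
    return mu, sigma2
```

Within the Gibbs sampler these updates are applied independently for every class j, component r and dimension l, using only the data currently allocated to that component.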
3.5.4 Conditional distributions for the allocation probabilities

Given the allocation variables (Z, z), the prior class and mixing probabilities (θ, π) will be independent of (y, µ, Σ). Thus, since we are assuming prior independence of the allocation probabilities θ and π:

p(\theta, \pi | y, Z, z, \mu, \Sigma) = p(\theta|Z) \, p(\pi|Z, z).   (3.26)
The conditional posterior distribution for the prior class probabilities is given by:

p(\theta|Z) \propto p(Z|\theta) \, p(\theta)
            \propto \left[ \prod_{i=1}^{n} p(Z_i|\theta) \right] \prod_{j=1}^{J} \theta_j^{a_{j,0}-1}
            \propto \left[ \prod_{i=1}^{n} \theta_{Z_i} \right] \prod_{j=1}^{J} \theta_j^{a_{j,0}-1}
            \propto \prod_{j=1}^{J} \theta_j^{\sum_{r=1}^{R_j} g_{j,r} + a_{j,0} - 1},   (3.27)

with the third line following from the second via p(Z_i|θ) = θ_{Z_i}. This gives a Dirichlet distribution:

\theta | Z \sim D(a),   (3.28)

independently of π, where a = (a_1, …, a_J) with:

a_j = \sum_{r=1}^{R_j} g_{j,r} + a_{j,0} \quad \text{for} \ j = 1, \ldots, J.   (3.29)

The conditional posterior distribution for the within-class mixing probabilities is given by:

p(\pi|Z, z) \propto p(\pi|Z) \, p(z|\pi, Z) = p(\pi) \prod_{i=1}^{n} p(z_i|\pi, Z_i)
            \propto \left[ \prod_{j=1}^{J} \prod_{r=1}^{R_j} \pi_{j,r}^{b_{j,r,0}-1} \right] \prod_{i=1}^{n} \pi_{Z_i, z_i}
            \propto \prod_{j=1}^{J} \prod_{r=1}^{R_j} \pi_{j,r}^{g_{j,r} + b_{j,r,0} - 1},   (3.30)

with the third line following from the second via p(z_i|π, Z_i) = π_{Z_i, z_i}. Thus we have the following independent Dirichlet distributions (for j = 1, …, J):

\pi_j | (Z, z) \sim D(b_j),   (3.31)

where b_j = (b_{j,1}, …, b_{j,R_j}) with:

b_{j,r} = g_{j,r} + b_{j,r,0}, \quad \text{for} \ j = 1, \ldots, J, \ r = 1, \ldots, R_j.   (3.32)
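The conditional draws (3.28) and (3.31) reduce to Dirichlet sampling with count-updated parameters; a minimal sketch (hypothetical names, with equal prior hyperparameter components as assumed in Section 3.4.3):

```python
import numpy as np

def sample_allocation_probs(rng, counts, a0, b0):
    """Draw theta ~ D(a) and pi_j ~ D(b_j) from (3.28) and (3.31).

    counts[j][r] = g_{j,r}, the number of training vectors currently
    assigned to mixture component r of class j; a0 and b0 are the
    (equal) prior hyperparameter components.
    """
    a = np.array([sum(g) for g in counts], dtype=float) + a0     # (3.29)
    theta = rng.dirichlet(a)
    pis = [rng.dirichlet(np.asarray(g, dtype=float) + b0)        # (3.32)
           for g in counts]
    return theta, pis

rng = np.random.default_rng(1)
# Two classes: class 0 has two components, class 1 has three.
theta, pis = sample_allocation_probs(rng, [[10, 5], [3, 2, 1]], a0=1.0, b0=1.0)
```

Each draw is a single step of the Gibbs sampler's second block, conditioned on the current allocation variables.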
3.5.5 Conditional distributions for the allocation variables

Since the data y_i are conditionally independent, given (µ, Σ, θ, π), the allocation variables for each data measurement are conditionally independent given (y, µ, Σ, θ, π). Therefore, each measurement can be considered separately. Thus:

p(Z, z | y, \mu, \Sigma, \theta, \pi) = \prod_{i=1}^{n} p(Z_i, z_i | y_i, \mu, \Sigma, \theta, \pi),   (3.33)

and we can consider:

p(Z_i, z_i | y_i, \mu, \Sigma, \theta, \pi) = p(z_i | Z_i, y_i, \mu, \Sigma, \theta, \pi) \, p(Z_i | y_i, \mu, \Sigma, \theta, \pi).   (3.34)

If Z_i is unknown for the i-th data vector (i.e. the data point is unlabelled):

p(Z_i = j | y_i, \mu, \Sigma, \theta, \pi) \propto p(y_i | Z_i = j, \mu, \Sigma, \theta, \pi) \, p(Z_i = j | \mu, \Sigma, \theta, \pi),   (3.35)

where:

p(Z_i = j | \mu, \Sigma, \theta, \pi) = p(Z_i = j | \theta) = \theta_j,   (3.36)

and:

p(y_i | Z_i = j, \mu, \Sigma, \theta, \pi) = p(y_i | Z_i = j, \mu, \Sigma, \pi) = \sum_{r=1}^{R_j} \pi_{j,r} \, p(y_i | \mu_{j,r}, \Sigma_{j,r}).   (3.37)

For notational ease we have defined:

p(y_i | \mu_{j,r}, \Sigma_{j,r}) = p(y_i | \mu, \Sigma, \theta, \pi, Z_i = j, z_i = r) = \prod_{l=1}^{d} N(y_{i,l}; \mu_{j,r,l}, \sigma^2_{j,r,l}).   (3.38)

Thus:

p(Z_i = j | y_i, \mu, \Sigma, \theta, \pi) \propto \theta_j \sum_{r=1}^{R_j} \pi_{j,r} \, p(y_i | \mu_{j,r}, \Sigma_{j,r}).   (3.39)

For a labelled training data vector, this sampling is skipped, which is equivalent to replacing (3.39) with a unit mass distribution at the labelled class. For the within-class mixture component allocation variable z_i, letting Z_i = j, we have:

p(z_i = r | Z_i, y_i, \mu, \Sigma, \theta, \pi) \propto p(z_i = r | \mu_j, \Sigma_j, \pi_j) \, p(y_i | \mu, \Sigma, \theta, \pi, Z_i = j, z_i = r),   (3.40)

i.e.:

p(z_i = r | Z_i = j, y_i, \mu, \Sigma, \theta, \pi) \propto \pi_{j,r} \, p(y_i | \mu_{j,r}, \Sigma_{j,r}).   (3.41)

Since the allocations to within-class mixture components are initially unknown, this sampling of z_i is needed, regardless of whether the data measurement is labelled.
From a practical viewpoint, it should be noted that since the number of classes and mixture components is finite, it is easy to sample from the distributions defined by both (3.39) and (3.41).
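Sampling the allocation variables from (3.39) and (3.41) is a pair of finite categorical draws; a minimal sketch for a single data vector (hypothetical names, diagonal Gaussian components as in (3.38)):

```python
import numpy as np

def sample_allocations(rng, y_i, theta, pis, mus, sigma2s, label=None):
    """Draw (Z_i, z_i) for one data vector using (3.39) and (3.41).

    mus[j][r], sigma2s[j][r]: d-vectors of component means/variances;
    label: the known class for labelled data (class sampling is then
    skipped, as described above).
    """
    def comp_dens(j, r):
        # Diagonal Gaussian density (3.38).
        mu, s2 = np.asarray(mus[j][r]), np.asarray(sigma2s[j][r])
        return np.prod(np.exp(-0.5 * (y_i - mu) ** 2 / s2)
                       / np.sqrt(2 * np.pi * s2))

    if label is None:
        # Class draw from (3.39), normalised over the finite class set.
        w = np.array([theta[j] * sum(pis[j][r] * comp_dens(j, r)
                                     for r in range(len(pis[j])))
                      for j in range(len(theta))])
        Z = rng.choice(len(theta), p=w / w.sum())
    else:
        Z = label
    # Within-class component draw from (3.41).
    wr = np.array([pis[Z][r] * comp_dens(Z, r) for r in range(len(pis[Z]))])
    z = rng.choice(len(wr), p=wr / wr.sum())
    return Z, z

rng = np.random.default_rng(0)
theta = [0.5, 0.5]
pis = [[1.0], [1.0]]                 # one component per class
mus = [[[0.0]], [[10.0]]]            # 1-d means, well separated
sigma2s = [[[1.0]], [[1.0]]]
Z, z = sample_allocations(rng, np.array([0.0]), theta, pis, mus, sigma2s)
```

With the classes this well separated, the measurement at 0.0 is allocated to class 0 with near certainty.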
3.6 Using the trained classifier
3.6.1 General

This section discusses how the MCMC samples from the augmented posterior distribution can be used to make classifications of both training and future (i.e. operational or test) data.
3.6.2 Classifying the training data

Even if the training data is class labelled, it is useful to assess the performance of the trained classifier on the training data. Amongst other things, this provides an indication of whether the model has been able to learn the distribution of the training data measurements, and whether the MCMC sampler has converged. Furthermore, the classification performance on training data can be used to assist model selection (i.e. choice of the number of mixture components). Posterior class probabilities for a training data vector y_t are obtained using:

p(Z_t | D) \approx \frac{1}{N} \sum_{s=1}^{N} p(Z_t | y_t, \mu^{(s)}, \Sigma^{(s)}, \theta^{(s)}, \pi^{(s)}),   (3.42)

where (3.38) and (3.39) give:

p(Z_t = j | y_t, \mu^{(s)}, \Sigma^{(s)}, \theta^{(s)}, \pi^{(s)}) \propto \theta_j^{(s)} \sum_{r=1}^{R_j} \pi_{j,r}^{(s)} \prod_{l=1}^{d} N(y_{t,l}; \mu_{j,r,l}^{(s)}, \sigma_{j,r,l}^{2(s)}).   (3.43)
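The estimate (3.42)-(3.43) averages per-sample class posteriors over the retained MCMC draws; a minimal sketch (hypothetical names and sample layout):

```python
import numpy as np

def posterior_class_probs(y_t, samples):
    """Estimate p(Z_t | D) via (3.42)-(3.43).

    samples: list of MCMC draws, each a dict with keys 'theta',
    'pi' (per-class mixing vectors), and 'mu'/'sigma2' (per class,
    per component, d-vectors).
    """
    def class_lik(s, j):
        # The class-conditional mixture density from (3.43).
        tot = 0.0
        for r, w in enumerate(s['pi'][j]):
            mu = np.asarray(s['mu'][j][r])
            s2 = np.asarray(s['sigma2'][j][r])
            tot += w * np.prod(np.exp(-0.5 * (y_t - mu) ** 2 / s2)
                               / np.sqrt(2 * np.pi * s2))
        return tot

    J = len(samples[0]['theta'])
    probs = np.zeros(J)
    for s in samples:
        w = np.array([s['theta'][j] * class_lik(s, j) for j in range(J)])
        probs += w / w.sum()    # normalise per sample, then average (3.42)
    return probs / len(samples)

# One illustrative MCMC draw: two 1-d classes centred at 0 and 5.
samples = [{'theta': [0.5, 0.5],
            'pi': [[1.0], [1.0]],
            'mu': [[[0.0]], [[5.0]]],
            'sigma2': [[[1.0]], [[1.0]]]}]
probs = posterior_class_probs(np.array([0.0]), samples)
```

With several hundred retained draws, this averaging smooths out the sample-to-sample variation in the parameters.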
For labelled training data, (3.42) is the natural expression for the posterior class probabilities based upon the trained model. For unlabelled training data, (3.42) is the Rao-Blackwellised [22, 106] estimate. The Rao-Blackwellised estimate approximates the posterior marginalised distribution p(Z_t|D) via an expectation of appropriate conditional distributions. For unlabelled training data, this Rao-Blackwellised estimate of the posterior class probabilities has lower variance than the direct estimation approach:

\hat{p}(Z_t = j | D) \approx \frac{1}{N} \sum_{s=1}^{N} I(Z_t^{(s)} = j),   (3.44)

where I is the indicator function (so I(x = y) = 1 if x = y and 0 otherwise). This lower variance is essentially a consequence of the Rao-Blackwell theorem [100], which can be interpreted as stating that if a conditional expectation produces an estimator, the variance of this new estimator will be less than that of the original estimator used in the conditional expectation.
For our unlabelled training data, the original estimator is given by I(Z_t = j), and the conditional expectation is:

E(I(Z_t = j) | y, \mu, \Sigma, \theta, \pi) = p(Z_t = j | y_t, \mu, \Sigma, \theta, \pi).   (3.45)

This conditional expectation is approximated by (3.42). As a further illustration of Rao-Blackwellisation, suppose that an estimate of the expectation of (µ, Σ) is required. This could be obtained directly by looking at:

\hat{E}(\mu, \Sigma) = \frac{1}{N} \sum_{s=1}^{N} (\mu^{(s)}, \Sigma^{(s)}).   (3.46)

However, an estimate with a lower variance, while still being unbiased, can be obtained by using the Rao-Blackwellised estimate:

\hat{E}(\mu, \Sigma) = \frac{1}{N} \sum_{s=1}^{N} E(\mu, \Sigma | y, \theta^{(s)}, \pi^{(s)}, Z^{(s)}, z^{(s)}).   (3.47)
3.6.3 Classifying future observations

The posterior class probabilities for a previously unseen observation y_f can be written:

P(Z_f = j | D, y_f) \propto P(Z_f = j | D) \, p(y_f | D, Z_f = j).   (3.48)

The term P(Z_f = j|D) can be estimated from prior beliefs on the spread of future data between classes, or can be approximated from our MCMC samples, using:

P(Z_f = j | D) = E(\theta_j | D) \approx \frac{1}{N} \sum_{s=1}^{N} E(\theta_j | D, Z^{(s)}) = \frac{1}{N} \sum_{s=1}^{N} \frac{a_j^{(s)}}{\sum_{j'=1}^{J} a_{j'}^{(s)}},   (3.49)

with a_j^{(s)} as defined in (3.29), with allocations given by (Z^{(s)}, z^{(s)}). This MCMC sample based estimate for P(Z_f = j|D) is only appropriate if the class proportions within the training data are the same as those expected for future data. Depending on the way that the data was collected, this might not be the case, in which case expert knowledge should be used to set the prior class probabilities for future data. The density p(y_f|D, Z_f = j) in (3.48) can be estimated directly as follows:

p(y_f | D, Z_f = j) \approx \frac{1}{N} \sum_{s=1}^{N} \sum_{r=1}^{R_j} \pi_{j,r}^{(s)} \prod_{l=1}^{d} N(y_{f,l}; \mu_{j,r,l}^{(s)}, \sigma_{j,r,l}^{2(s)}).   (3.50)
This approach is referred to as direct estimation from the MCMC samples and gives an equation similar to that used for the training data. However, a theoretically better approach uses Rao-Blackwellised estimates to calculate the predictive density. In this case, the density p(yf |D, Zf = j) in (3.48) is approximated by: N 1 p(yf |D, Zf = j, Z (s) , z (s) ), N s=1
p(yf |D, Zf = j) ≈
(3.51)
where this result is obtained from: p(yf |D, Zf = j)
=
p(yf , Z, z|D, Zf = j) dZ dz p(yf |D, Zf = j, Z, z)p(Z, z|D, Zf = j) dZ dz
= ≈
N 1 p(yf |D, Zf = j, Z (s) , z (s) ), N s=1
(3.52)
with the second line following from the first via the definition of conditional probabilities, and the third line following from the second via p(Z, z|D, Zf = j) = p(Z, z|D) (conditioning on the class of a future observation has no effect on the distribution for the training data allocations), together with the MCMC approximation of a marginal posterior distribution. We write p(yf |D, Zf = j, Z (s) , z (s) ) as a mixture distribution:
p(yf |D, Zf = j, Z (s) , z (s) )
=
Rj
p(yf , zf = r|D, Zf = j, Z (s) , z (s) )
r=1
=
Rj {p(zf = r|D, Zf = j, Z (s) , z (s) ) r=1
×p(yf |D, Zf = j, zf = r, Z (s) , z (s) )},
(3.53)
with the second line following from the first via the definition of conditional probabilities. Now: (s)
p(zf = r|D, Zf = j, Z
(s)
,z
(s)
) = E(πj,r |D, Z
(s)
,z
(s)
bj,r
) = Rj
(s) r =1 bj,r
,
(3.54)
(s)
with bj,r as defined in (3.32), with allocation assignments given by (Z (s) , z (s) ). The density p(yf |D, Zf = j, zf = r, Z (s) , z (s) ) is the predictive density for the data drawn from mixture component r of class j, with parameters determined by the posterior distributions for the mixture components, conditional on the class and mixture component allocations, (Z (s) , z (s) ). Some calculations (Appendix B.2), show that p(yf |D, Zf = j, zf = r, Z (s) , z (s) ) is given by a product of independent Student-t distributions (Appendix A), with the l-th component distribution in the product having:
1) 2ν^{(s)}_{j,r,l} degrees of freedom,

2) location parameter κ^{(s)}_{j,r,l},

3) scale parameter Υ^{(s)}_{j,r,l} = V^{(s)}_{j,r,l} (h^{(s)}_{j,r,l} + 1) / (h^{(s)}_{j,r,l} ν^{(s)}_{j,r,l}).

The parameters h^{(s)}_{j,r,l}, κ^{(s)}_{j,r,l}, ν^{(s)}_{j,r,l} and V^{(s)}_{j,r,l} are defined in (3.22), (3.23), (3.24) and (3.25) respectively, using quantities determined by the allocation variables (Z^{(s)}, z^{(s)}). The Rao-Blackwellisation method therefore provides:
p(y_f | D, Z_f = j, z_f = r, Z^{(s)}, z^{(s)}) = ∏_{l=1}^{d} St(y_{f,l}; κ^{(s)}_{j,r,l}, 2ν^{(s)}_{j,r,l}, Υ^{(s)}_{j,r,l}),   (3.55)
where the Student-t distribution is defined in Appendix A. Substituting (3.55) into (3.53) along with (3.54) gives:
p(y_f | D, Z_f = j, Z^{(s)}, z^{(s)}) = (1 / Σ_{r'=1}^{R_j} b^{(s)}_{j,r'}) Σ_{r=1}^{R_j} b^{(s)}_{j,r} ∏_{l=1}^{d} St(y_{f,l}; κ^{(s)}_{j,r,l}, 2ν^{(s)}_{j,r,l}, Υ^{(s)}_{j,r,l}).   (3.56)
Thus (3.51) becomes:

p(y_f | D, Z_f = j) ≈ (1/N) Σ_{s=1}^{N} (1 / Σ_{r'=1}^{R_j} b^{(s)}_{j,r'}) Σ_{r=1}^{R_j} b^{(s)}_{j,r} ∏_{l=1}^{d} St(y_{f,l}; κ^{(s)}_{j,r,l}, 2ν^{(s)}_{j,r,l}, Υ^{(s)}_{j,r,l}).   (3.57)
It should be noted that although the Rao-Blackwellisation based method gives theoretically lower variance results than direct estimation, the approach is more computationally expensive. This is because the Rao-Blackwellisation based method makes use of the mixture component allocation variables z_i^{(s)} for all n training data vectors and from each MCMC sample. Thus, the computational expense increases as the number of training vectors increases, as well as with the number of MCMC samples. In contrast, the direct estimation approach only requires the variables (π^{(s)}_{j,r}, µ^{(s)}_{j,r}, Σ^{(s)}_{j,r}) from each MCMC sample, whose size depends on the total number of mixture components. Thus, the computational expense increases as the number of mixture components and the number of MCMC samples increases. Since the number of mixture components will be less than the number of training vectors, the direct estimation approach will be less computationally expensive. Thus, the Rao-Blackwellisation method should not necessarily be preferred to the direct estimation approach. Future research could address techniques for penalising such evaluation complexity [37]. Furthermore, the theoretical advantage from Rao-Blackwellisation is based upon an assumption that the specified mixture models form the correct model for the data. Where the models only approximate the data (which will invariably be the case for real data), such theoretical advantages will not necessarily hold.
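As a concrete illustration of the Rao-Blackwellised estimator (3.51), (3.54)–(3.57), the sketch below evaluates the predictive density from stored per-sample component summaries. The container layout and field names are invented for the example, ν is taken constant over the dimensions l for brevity, and the location-scale Student-t is parameterised with Υ playing the role of the squared scale, which is an assumption about the convention of Appendix A:

```python
import numpy as np
from math import lgamma, pi, exp

def st_logpdf(y, loc, df, scale2):
    # Elementwise log density of a location-scale Student-t with `df`
    # degrees of freedom, location `loc` and squared scale `scale2`.
    z2 = (y - loc) ** 2 / scale2
    return (lgamma((df + 1) / 2) - lgamma(df / 2)
            - 0.5 * np.log(df * pi * scale2)
            - (df + 1) / 2 * np.log1p(z2 / df))

def rao_blackwell_predictive(yf, samples):
    # samples[s] holds, for one MCMC draw, arrays over the R_j components:
    #   'b'     (R,)    allocation counts b_{j,r}   -> weights, eq. (3.54)
    #   'kappa' (R, d)  locations
    #   'nu'    (R,)    half degrees of freedom (df = 2*nu)
    #   'Ups'   (R, d)  scales Upsilon = V*(h + 1)/(h*nu)
    total = 0.0
    for s in samples:
        w = s['b'] / s['b'].sum()                   # E(pi_{j,r} | ...), eq. (3.54)
        comp = np.array([exp(np.sum(st_logpdf(yf, s['kappa'][r],
                                              2.0 * s['nu'][r], s['Ups'][r])))
                         for r in range(len(w))])   # product of Student-t, eq. (3.55)
        total += float(np.dot(w, comp))             # mixture for one sample, eq. (3.56)
    return total / len(samples)                     # average over samples, eq. (3.51)
```

Note how the cost scales with the number of retained components and samples, in line with the complexity discussion above.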
3.6.4 Rejection of unknowns

The mixture model classifier has been designed for situations where all the targets belong to one of the J specified classes. In operational use there will be situations where the target is actually from an unknown class. Ideally, in such situations the ATR algorithm should reject the target without making a classification [129]. In some cases, a simple threshold on the maximum posterior probability could be used to reject a target, i.e. declaring the target to belong to the class maximising the posterior probability only if that posterior probability is above a pre-specified threshold. However, since the class probabilities are normalised to sum to one, the posterior class probabilities might strongly favour one class even if the new target is an outlier. A second option is to compute the predictive density of the data:

p(y_f | D) = Σ_{j=1}^{J} P(Z_f = j | D) p(y_f | D, Z_f = j),   (3.58)
using the equations of Section 3.6.3, and reject the target if this value falls below a threshold. The value of the threshold would depend on the required trade off between false rejection (i.e. rejection of a target that would have been correctly classified) and classification of spurious objects (i.e. classifying an object which doesn’t belong to any of the classes). Such a rejection threshold has not been applied in the documented application to mobile battlefield vehicle ATR.
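This second rejection option can be sketched as follows; the helper names are hypothetical, with `predictive_fns[j]` standing in for the per-class predictive density p(y_f | D, Z_f = j) of Section 3.6.3:

```python
import numpy as np

def classify_with_rejection(yf, class_priors, predictive_fns, tau):
    # Mix the per-class predictive densities into p(yf | D), eq. (3.58).
    likes = np.array([f(yf) for f in predictive_fns])
    p_yf = float(np.dot(class_priors, likes))
    if p_yf < tau:
        return None                    # reject: likely from an unknown class
    post = class_priors * likes        # unnormalised posterior probabilities
    return int(np.argmax(post))        # class with the largest posterior
```

The threshold `tau` controls the trade-off between false rejection and classification of spurious objects discussed above.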
3.7 Application to mobile battlefield vehicle ATR

3.7.1 Description of the experiments

The proposed mixture model classifier is illustrated using real Inverse Synthetic Aperture Radar (ISAR) data, consisting of images of vehicles from three classes. Class 1 is a T62 (a type of main battle tank (MBT)), class 2 is a BMP (a type of armoured personnel carrier (APC)) and class 3 is a ZSU (a type of air defence unit (ADU)). Pictures of the three vehicles are provided in Figure 3.3. The task is to classify individual ISAR images into the three classes.
Figure 3.3: Vehicles imaged to create the ISAR dataset (left class 1, middle class 2, right class 3).

The data have been collected during a trial held in 2000, using the MIDAS (Mobile Instrumented Data Acquisition System) radar. The radar produced millimetre wave (MMW) ISAR images of vehicles rotating on a turntable. The radar was operated at a frequency of 94 GHz and processed to a down-range resolution of 0.3 m and a cross-range resolution of 0.35 m, giving almost square image
pixels. The cross-circular polarisation channel was selected to form the images. Each dataset consisted of the ISAR images obtained as the target rotated through 360◦ . A typical dataset consisted of approximately 1200 images (multi-dimensional measurements). An example of an ISAR image from one of the datasets is given in Fig. 3.4. The vertical axis represents cross-range and the horizontal axis range. The intensities in the image are related to the amplitudes of the complex-valued ISAR returns.
Figure 3.4: Example of an ISAR image from a training dataset (range along the horizontal axis, cross-range along the vertical axis).

Table 3.1 and Table 3.2 document the training and test datasets respectively. For these experiments the configurations of the vehicles in the test data were similar to those in the training data. Thus, the generalisation issue of robustness to extended operating conditions does not arise (see Section 1.5). The work in Chapter 4 uses a superset of this data, in which the generalisation issue is addressed.

Dataset   Class   Type        Configuration
13        1       T62 (MBT)
17        2       BMP (APC)
31        3       ZSU (ADU)   radar up

Table 3.1: The training datasets. Default configuration is for hatches and toolboxes to be closed, with guns pointing forward.

Dataset   Class   Type        Configuration
6         1       T62 (MBT)   toolbox 1 open
8         1       T62 (MBT)   toolboxes open
12        1       T62 (MBT)   hatches open
15        1       T62 (MBT)   engine on
16        2       BMP (APC)   hatches open
32        3       ZSU (ADU)   radar up, gun 45° azimuth
35        3       ZSU (ADU)   radar up, gun 45° elevation

Table 3.2: The test datasets. Default configuration is for hatches and toolboxes to be closed, with guns pointing forward.

The angle of the turntable, and therefore the angle of observation of the target, was recorded for each image. Although such information cannot be used for test data, the observation angle for each training data image can be used legitimately for training classifiers. However, in our experiments the angle of observation was only used to initialise the MCMC algorithm (see Section 3.5.2).
3.7.2 Pre-processing

The initial complex ISAR images are 49 by 49 pixels (a dimensionality of 2401). To display and process the images, the absolute value of each complex pixel in the complex ISAR image is taken. Even after elimination of an outer boundary containing only background returns (reducing the images to 40 by 40 pixels), this results in a dimensionality of 1600. Such a dimensionality is too large to use directly within the mixture model formulation, so some form of dimensionality reduction is required. One option is to use a PCA [12, 26, 189] to reduce the dimensionality of the data. A second approach to dimensionality reduction converts the original complex ISAR images into complex range profiles, by averaging over cross-range bins. Real-valued range profiles are obtained by taking the absolute value of the complex number in each range bin. After extraction of the target from the range profile (by determining the target centre), this second approach results in 40-dimensional input data. Creation of range profiles is the pre-processing adopted for the mixture model experiments in this chapter.

Each extracted range profile is normalised by a (linear) scaling factor, so that within each range profile the average amplitude of the extracted range bins (components of the data vector) is one. Such a normalisation is needed because the power output of the radar is unlikely to be constant, leading to variations in the intensities of the radar returns that are unconnected with differences between the vehicles. Furthermore, weather conditions and smoke will alter the power of the return signals received by the radar. A similar normalisation should be applied to the ISAR images if a PCA is to be conducted and, indeed, before any classifications are attempted using the full ISAR images. An intensity plot of the normalised range profiles over a complete rotation of a target is displayed in Figure 3.5.
Figure 3.6 displays two specific normalised range profiles from the same dataset. It is clear that the amount of information in the data is dependent on the aspect angle of the target. Since the mixture components are modelled by Gaussian distributions, the logarithm of each element of the normalised extracted range profile is taken, in an effort to pre-process the data to be more Gaussian-like. In future, more sophisticated pre-processing and feature extraction techniques could be used, such as those proposed by Maskall and Webb [120, 121, 119], in which Radial Basis Functions [12, 189] are used to reduce the dimensionality of ISAR images. If a greater proportion of the relevant discriminating information can be preserved in the pre-processing of the ISAR images, prior to input to the mixture model classifier, then the classification performance should improve.
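The pre-processing chain described above (averaging the complex image over cross-range bins, taking amplitudes, normalising to unit average amplitude, then taking logarithms) can be sketched as follows. The function name and the small epsilon guard are illustrative, and the target-extraction/centring step is omitted:

```python
import numpy as np

def isar_to_log_profile(img):
    # img: complex ISAR image chip, cross-range along axis 0, range along axis 1.
    profile = np.abs(img.mean(axis=0))   # complex cross-range average, then amplitude
    profile = profile / profile.mean()   # unit average amplitude within the profile
    return np.log(profile + 1e-12)       # log transform to be more Gaussian-like
```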
Figure 3.5: A collection of the extracted range profiles from a training dataset. Range is along the horizontal axis, and index related to aspect angle along the vertical axis. The brightness of each pixel is related to the amplitude of the relevant bin in the range profile.
Figure 3.6: Some individual extracted range profiles from a training dataset. Normalised amplitude along the vertical axis, range bin along the horizontal axis. LHS an aspect angle of approximately 0°, RHS an aspect angle of approximately 90°.
3.7.3 Model order selection

To illustrate model order selection for the mixture model classifier, Figure 3.7 presents the training data classification rates for a variety of different model orders. Specifically, the results for mixture models with 3, 6, 12, 18, 24 and 48 components per class are displayed. In each case the MCMC algorithm has been run to draw 400 samples, by sampling every 5th iteration after a burn-in period of 5000 iterations. The classification rate increases with model order, but the sizes of the increases get smaller. For example, there is only a small increase in performance as the model order is increased from 18 mixture components to 24 mixture components (compared to earlier increases). There is a more sizeable training data classification rate increase from 24 mixture components to 48 mixture components, but this is possibly a sign of over-fitting. Indeed, the increase in training data classification rate moving from 3 mixture components to 48 is only just over 8%, which might indicate over-fitting. Over-fitting occurs when the number of mixture components is sufficiently large that specific noise realisations in the training data can be modelled, as opposed to just the general fit. If the training noise is modelled, the classification rate for training data will increase, but the performance
on an independent test set (with different noise realisations) will degrade.

Figure 3.7: Training data classification rates for the mixture model classifiers, against model order (i.e. the number of mixture components for each class).

To assess whether over-fitting is responsible for some of the increases in training data classification rate in Figure 3.7, a 10-fold cross-validation [189] has been performed. In ν-fold cross-validation the training data is (randomly) partitioned into ν roughly equal-sized subsets. Then, for each subset in turn, the classifier is trained using the remaining ν − 1 subsets, and the classification performance assessed on the excluded subset. Combining the results from each of the ν subsets produces an estimate for the training data classification rate that should be less susceptible to over-fitting. Setting ν = 10 means that the amount of training data for each classifier is only reduced by 10%, but does require 10 runs of the training algorithm. The results from the 10-fold cross-validation are plotted in Figure 3.8. It can be seen that (as in Figure 3.7) the classification rate increases with the number of mixture components, with the sizes of the increases decreasing as the number of mixture components increases. The similarity between Figure 3.7 and Figure 3.8 for model orders up to 18–24 mixture components indicates that over-fitting is probably not occurring before those model orders. The increase in classification performance moving from 18 mixture components to 48 mixture components is much less for the 10-fold cross-validation than for the full training run. This would seem to indicate that at least some of the original performance increase moving from 18 to 48 mixture components is due to over-fitting. That the 10-fold classification rates in Figure 3.8 are slightly lower than the corresponding rates in Figure 3.7 is likely to be an artefact of using 10% less training data.
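The ν-fold procedure just described can be sketched generically; `train_fn` and `classify_fn` are placeholders standing in for the mixture-model training and classification routines:

```python
import numpy as np

def cross_val_rate(train_fn, classify_fn, X, labels, nu=10, seed=0):
    # Randomly partition the data into `nu` roughly equal folds, train on
    # nu-1 folds, score on the held-out fold, and pool the results.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), nu)
    correct = 0
    for k in range(nu):
        held_out = folds[k]
        train = np.concatenate([folds[m] for m in range(nu) if m != k])
        model = train_fn(X[train], labels[train])
        correct += sum(classify_fn(model, x) == y
                       for x, y in zip(X[held_out], labels[held_out]))
    return correct / len(X)
```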
Taking into account the lower computational requirements when smaller model orders are used, together with the small size of the performance increase moving from 18 to 48 mixture components (in both Figure 3.7 and Figure 3.8), 18 mixture components have been used in our documented experiments. As noted in Section 3.4.2, this is not an entirely satisfactory treatment for model selection. Not least, because it relies on a subjective choice by the user. Future research could address more sophisticated mechanisms for model order selection.
Figure 3.8: 10-fold cross-validation classification rates for the mixture model classifiers, against model order (i.e. the number of mixture components for each class).
3.7.4 Algorithm details

The values of the hyperparameters for the Bayesian mixture models are detailed in Appendix B.3. The MCMC algorithm has been run to draw 400 samples, by sampling every 5th iteration after a burn-in period of 5000 iterations. The number of samples drawn was rather small for the dimensionality of the distribution being considered. Thus, better modelling of the densities (at the expense of increased computational cost) may be possible if more samples are used. The MCMC algorithm was initialised by clustering the mixture component allocations into angle segments. Results are presented for both the direct estimation and Rao-Blackwellisation based approaches to classification (see Section 3.6). To take into account the uncertainty in extracting the target range profiles (arising due to errors in estimation of the centre of each target within an ISAR image), shifts of up to ±1 in the centre of each target have been applied to the test data. In all cases the classifications have been made by taking the class which gives the largest posterior probability.
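The sampling schedule quoted above (400 retained samples, keeping every 5th iteration after a 5000-iteration burn-in) amounts to the following generic loop; `mcmc_step`, representing one full update of the sampler, is a placeholder:

```python
def thinned_samples(mcmc_step, state, n_samples=400, thin=5, burn_in=5000):
    # Discard the burn-in iterations, then retain every `thin`-th draw.
    for _ in range(burn_in):
        state = mcmc_step(state)
    samples = []
    for _ in range(n_samples):
        for _ in range(thin):
            state = mcmc_step(state)
        samples.append(state)
    return samples
```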
3.7.5 Convergence of the MCMC sampling algorithm

To assess the convergence of the MCMC sampling algorithm we examine some plots of the MCMC samples, rather than attempting a formal convergence assessment. This reflects the fact that full MCMC convergence diagnostics [55, 125] tend to perform badly when applied to mixture models. Furthermore, MCMC convergence diagnostics only provide indications of possible non-convergence [71, 189], rather than a full test of convergence or non-convergence.

Figure 3.9 displays the values of some individual elements from the MCMC samples for the mean of the 1st mixture component for the 1st class. Specifically, the values µ^{(s)}_{j,r,l}, for j = 1, r = 1 and d/2 − 6 ≤ l ≤ d/2 + 5, are displayed. For display purposes the exponential has been taken for each value (reflecting the fact that the logarithm of each range profile was used). Graphs for the other classes and mixture components have a similar form. Similarly, Figure 3.10 displays the values of some individual elements from the MCMC samples for the standard deviation of the 1st mixture component for the 1st class: specifically, the values σ^{2(s)}_{j,r,l}, for j = 1, r = 1 and d/2 − 6 ≤ l ≤ d/2 + 5.
Figure 3.9: MCMC samples for the middle elements of a mixture component mean vector (the middle 12 components from the samples for µ1,1).
Graphs for the other classes and mixture components have a similar form. The jagged nature of the sampled pixel values in Figures 3.9 and 3.10 indicates that the MCMC samples are exploring the posterior distribution, without becoming stuck in local maxima. Note, however, that the issue of label-switching [23, 170, 172] in the mixture model components is being ignored. Label-switching refers to the existence of R! modes in the posterior distribution for a mixture model with R components. A mixture distribution is invariant under a permutation of the mixture component indices. Therefore, since no identification constraint is imposed during specification of the prior distributions in Section 3.4.3, there will be R! modes in the posterior distribution for each mixture model (reflecting the R! possible orderings of the mixture components). The plots in Figure 3.9 and Figure 3.10 seem to show exploration of just a single mode of the posterior distribution, indicating that label-switching is not being properly dealt with. Fortunately, the use of the MCMC mixture component samples within our mixture model classifier (as described in Section 3.6) corresponds to a label-invariant projection of the posterior distribution. Thus, potential issues arising from the lack of an exploration of all the modes of the posterior distribution do not arise. Were interpretation of the individual mixture model components to be important, the MCMC sampler
would have to be adjusted to properly deal with label-switching. This is a current area of research interest within the field of Bayesian statistics, but is not investigated within this thesis.

Figure 3.10: MCMC samples for the middle elements of a mixture component standard deviation vector (the middle 12 components from the samples for σ1,1).
3.7.6 Baseline classifier

The baseline classifier is a correlation-filter (see Section 2.2.6) applied to the original ISAR images. 40 × 40 image chips have been obtained by centring the targets within each full ISAR image, and then taking the amplitude of the complex returns. Before use in the correlation-filter, each image chip has been (individually) scaled so that the mean square of the pixels is one. Each training image was used as a template in the correlation-filter.
3.7.7 Experimental results

The classification matrix for the trained mixture models applied to the training data is provided in Table 3.3. The overall training data classification rate is 98.5%. Table 3.4 documents the classification matrix for the test data (for both the direct estimation and Rao-Blackwellisation approaches to classification). Encouragingly, the majority of classification rates are over 90%, indicating that for the situation where the test data targets are similar to the training data targets (at least), the Bayesian mixture model approach results in good classification performance. The performance of the baseline correlation-filter classifier is documented in Table 3.5.

                          Predicted class (%)
Dataset   Type     T62      BMP      ZSU
13        T62      97.7     2.2      0.1
17        BMP      1.4      98.6     0.0
31        ZSU      0.4      0.7      98.9

Table 3.3: Gaussian mixture model results for training data.

                          Predicted class (%)
Dataset   Class    T62            BMP            ZSU
6         T62      91.0 / 89.2    8.9 / 10.7     0.1 / 0.1
8         T62      88.0 / 87.4    11.8 / 12.5    0.2 / 0.2
12        T62      96.8 / 96.7    3.1 / 3.2      0.1 / 0.1
15        T62      96.0 / 95.8    3.9 / 4.1      0.1 / 0.1
16        BMP      6.4 / 5.8      93.1 / 93.8    0.5 / 0.4
32        ZSU      5.0 / 5.1      2.4 / 2.8      92.6 / 92.1
35        ZSU      0.5 / 0.6      0.4 / 0.5      99.1 / 98.9

Table 3.4: Gaussian mixture model results for the test datasets. For each dataset, within each class, the results to the left are for the direct estimation approach to classification and the results to the right are for the Rao-Blackwellisation based approach to classification.

                          Predicted class (%)
Dataset   Class    T62      BMP      ZSU
6         T62      94.5     1.7      3.9
8         T62      88.2     4.3      7.5
12        T62      98.0     1.2      0.8
15        T62      98.2     0.5      1.3
16        BMP      3.9      91.3     4.8
32        ZSU      8.6      7.1      84.3
35        ZSU      1.4      1.1      97.5

Table 3.5: Correlation-filter results for test datasets.

The overall classification rates are 93.8% and 93.5% for the Bayesian mixture models using the direct estimation and Rao-Blackwellisation based approaches to classification, respectively, and 93.1% for the correlation-filter. Thus, the overall performance of the mixture model approaches is better than that of the baseline correlation-filter. However, since the number of datasets from each class is different, an overall measure of performance can be misleading. Figure 3.11 enables a graphical comparison of the test data performances for the Bayesian mixture models and the correlation filter. To the left are the classification rates for each test dataset, and to the right are the average classification rates for each class. As can be seen, the mixture model approaches provide better performance for the BMP (class 2) and ZSU (class 3), but the correlation filter results in a higher classification rate for the T62 (class 1). Within the Bayesian mixture model
results, it can be seen that the theoretically better posterior probability estimates provided by the Rao-Blackwellisation based approach do not translate into better classification rate performance.

Figure 3.11: Summary of the classification rates for the test data. On the left are the classification results for each dataset, and on the right are the averaged classification rates for each class. For each plot: in red are the rates for the Bayesian Gaussian mixture model classifier using direct estimation; in green the Bayesian Gaussian mixture model classifier using Rao-Blackwellisation; and in blue the correlation-filter.

Although the Bayesian mixture model approaches provide better classification rates for the majority of classes, it should be noted that classification rate does not give a true indication of the overall performance of a classifier [64]. Not least is the fact that it treats all misclassifications with equal weight, when in reality there will be different misclassification costs [1]. Furthermore, it does not take into account confidence scores for the classifications. Where there is overlap between the measurements from distinct classes, it will not be possible to obtain perfect classification. In such cases a good classifier would indicate uncertainty in the overlapping region. This is vital for the work in Chapters 7 and 8. Thus, rather than a classification rate, an assessment of the accuracy of the mixture model estimates of the posterior probabilities is desirable. However, such an assessment is extremely difficult for real data. Furthermore, the baseline correlation-filter does not produce posterior probability estimates.
3.8 Summary and recommendations

This chapter has developed a Bayesian Gaussian mixture model approach to discrimination. The class-conditional densities have been modelled as Gaussian mixtures, and a Bayesian analysis conducted under the assumption of constant model order for the mixtures. This formulation provides the starting point for the next chapter, which introduces hyperprior distributions within the Bayesian modelling for the parameters of our Gaussian mixture models. The parameters of these hyperprior distributions are then varied to match expected deviations of the operating data from the training data, in an attempt to provide a classifier robust to the challenges described in Section 1.5 (e.g. configuration changes compared to the training data, and newer examples of the training vehicles). The use of the algorithm has been demonstrated successfully on real datasets, consisting of ISAR images of mobile battlefield targets. The experiments represent the situation where the configurations of vehicles in the test data are similar to those in the training data. The mixture model approach has been shown to be highly competitive with a template-based correlation-filter.
Aside from the hyperprior modelling introduced in the next chapter, there are many possible areas for further work. For example, the model selection issue of the number of mixture components to use in each mixture model could be addressed. As well as issues with overfitting of the training data, the choice of model order should take into account the increased computational cost of using larger models for target recognition [37]. It would be of interest to assess the effect of different pre-processing feature extraction techniques, prior to application of the mixture model density estimation procedure. Additionally, alternative methods for assessing the performance could be investigated. The ultimate aim is to use the probabilities in a multi-level model incorporating contextual information and reflecting the whole decision making process. Thus, accurate class probability estimates are of more importance than the actual classification rate. The mixture model approach has been designed to incorporate unlabelled (as well as labelled) training data. It would be interesting to see what performance advantages can be gained from using unlabelled training data, and to determine the situations where such unlabelled data would be available.
CHAPTER 4

USE OF HYPERPRIORS IN A BAYESIAN APPROACH TO GAUSSIAN MIXTURE MODELS FOR TARGET RECOGNITION

4.1 Introduction

4.1.1 General

In the traditional approach to classifier design, it is often assumed that the training data are representative of the operating conditions in which the classifier will be applied. In many practical situations this may not be so, due to drifts or changes in the population parameters, noise in the operating conditions, inadequate training data, unknown priors and unknown misclassification costs. To mitigate this problem, we need to take into account the expected differences between training and operating conditions in the classifier design, or design the classifier to be robust to such changes. It is primarily issues in generalisation to changes in population parameters that this chapter addresses. The problem of training data that are unrepresentative of the operational data is particularly serious when developing ATR systems. Section 1.5 outlines some of the extended operating conditions which a typical ATR system will encounter in deployment. Not least is the requirement to develop a classifier that is stable to variations in equipment fit and that exhibits good generalisation to previously unseen targets. However, despite the importance of this robustness issue for ATR, the ATR literature has tended to assume that training conditions are representative of operational conditions (as documented in the literature survey of Chapter 2). That this assumption is not true provides the motivation for the research described in this chapter.
4.1.2 Proposed approach

Building on the work introduced in Chapter 3, a Bayesian Gaussian mixture model approach to ATR is considered. The generic discrimination problem [189] tackled is again one where we are given a set of training data consisting of class labelled measurements (plus possibly some unlabelled measurements) and are then required to assign a previously unseen object to one of the classes, on the basis of measurements made of that object. The approach adopted aims to provide, for measurement data x and classes j, estimates of the Gaussian mixture model class-conditional probability densities (likelihoods) p(x|j). Posterior
probabilities of class membership p(j|x) are then obtained using Bayes' theorem:

p(j|x) ∝ p(x|j) p(j),   (4.1)

where p(j) are the prior class probabilities. Some of the advantages of producing posterior class probabilities via a density estimation approach to classification are documented in Section 2.2.5. As in Chapter 3, the parameters of the mixture models are estimated using a Bayesian formalism, which requires the use of an MCMC algorithm. The extension to the work documented in Chapter 3 lies in a new hierarchical structure for the prior distributions, which then allows generalisation issues to be addressed via alteration of the hyperprior distributions. In particular, the hyperprior distributions are adjusted to match expected deviations of operating data from training data. This requires some vague knowledge of the likely deviations. For the purposes of this thesis, this vague knowledge is taken to be the mean values of some extra measurements that might be expected for operating data. It is assumed that the individual data measurements for this expected data are not available. Thus, it is not possible to incorporate them into the training data (labelled or unlabelled). Furthermore, to maintain realism, the mean values are calculated from datasets different to those used for the actual test data. Thus, the extra information does not cover all the extended operating conditions to be assessed. The approach is applied to ATR of ISAR imagery of mobile battlefield vehicles. The particular aim within this application is to develop a classifier that is stable to variations in equipment fit and that exhibits good generalisation to previously unseen targets from the same generic class.
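Equation (4.1) in code form is a trivial but useful reference point; the normalisation step makes the proportionality explicit:

```python
import numpy as np

def posteriors(likelihoods, priors):
    # p(j|x) is proportional to p(x|j) p(j), normalised over the classes j.
    unnorm = np.asarray(likelihoods, dtype=float) * np.asarray(priors, dtype=float)
    return unnorm / unnorm.sum()
```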
4.1.3 Chapter outline

The outline of this chapter is as follows. The problem is formulated formally in Section 4.2 and the Bayesian model specified in Section 4.3. Section 4.4 introduces the MCMC algorithm used to estimate the parameters of the mixture model, with some of the mathematics being deferred to Appendices C.1 and C.2. How the trained classifier is used appears in Section 4.5. Section 4.6 looks at altering the hyperprior distributions to improve the generalisation properties of the classifier; it thus contains the crux of the chapter. To keep the chapter relatively self-contained, parameters and expressions that were used in the formulation of the original Bayesian Gaussian mixture model algorithm of Chapter 3 are redefined within this chapter. However, for some of the initial justifications the reader is referred back to Chapter 3. Section 4.7 introduces the ISAR data and experiments used in the assessment of the proposed technique, while Section 4.8 and Section 4.9 present the results. Detailed performance discussions are deferred to Appendices D.2 and D.3. The tables and figures displaying the results are in Appendices E and F, respectively. Section 4.10 looks at what conclusions can be drawn from the investigations and gives some recommendations for further work.
CHAPTER 4. USE OF HYPERPRIORS

4.2 Problem formulation
As in Chapter 3 the problem is formulated in a generic manner, in which classification of an object into one of J distinct classes, on the basis of a d-dimensional data measurement of the object, is considered. Our training data, which can comprise both labelled and unlabelled measurements from the classes, consist of n observed d-dimensional data samples, y = {y1 , . . . , yn } and any corresponding class labels. If the class of the measurement yi is known, it is denoted by Zi . The combination of y and any known class allocations for the data is denoted by D. Where there is a need to distinguish between labelled and unlabelled training data, the indices of the labelled training data are referred to as {i ∈ lab} and the indices of the unlabelled training data as {i ∈ unlab}. The probability density function for the d-dimensional data x can be written:
\[
p(x) = \sum_{j=1}^{J} \theta_j\, p(x|j), \qquad (4.2)
\]

where θ = (θ1, . . . , θJ) is a vector of the prior class probabilities, with components satisfying θj ≥ 0 and \(\sum_{j=1}^{J} \theta_j = 1\). The distribution p(x|j) is the class-conditional density (likelihood) for data from class j. The class-conditional densities are also modelled by mixture distributions, with the j-th class having Rj components, which are referred to as mixture components. Thus we have the following mixture models:

\[
p(x|j) = \sum_{r=1}^{R_j} \pi_{j,r}\, p(x|j,r). \qquad (4.3)
\]

Here πj = (πj,1, . . . , πj,Rj) represents the mixing probabilities within class j; in particular, πj,r ≥ 0 is the mixing probability for the r-th mixture component of the j-th class. These satisfy \(\sum_{r=1}^{R_j} \pi_{j,r} = 1\). The complete set is denoted by π = {πj, 1 ≤ j ≤ J}.
The distribution p(x|j, r) represents the probability density of the data within a given mixture component r, of a class j. We make an assumption of independence between the components of the data vector x, conditioned on the class and mixture component. This corresponds to:
\[
p(x|j,r) = \prod_{l=1}^{d} p_l(x_l|j,r), \qquad (4.4)
\]
where x = (x1 , . . . , xd ). Note that this independence assumption for each component does not extend to an independence assumption for the mixture distribution as a whole, due to the summation over mixture components. A discussion on the choice of the number of mixture components Rj to use in each class conditional mixture model is provided in Section 3.4.2. The approach adopted in this thesis holds the number of mixture components fixed throughout the algorithm, with the value chosen as a
compromise between having enough components for adequate modelling of the data (as determined by looking at the training and validation data classification rates) and computational efficiency of smaller model orders. The component distributions pl (xl |j, r) are taken to be univariate Gaussian, with means µj,r,l and 2 variances σj,r,l , where l = 1, . . . , d. For a given class and mixture component these are represented by the vectors µj,r and diagonal matrices Σj,r . The sets of all means and variances are represented by µ and Σ, respectively. Typically, some simple pre-processing is applied to the data to make it more Gaussian-like. For instance, in the case where the data measurements are the absolute values of radar range profiles, we take logarithms of the data. Using the defined notation, the full likelihood function for the training data can be written:
\[
p(y|\mu,\Sigma,\theta,\pi) \propto \prod_{i\in\mathrm{unlab}} \left[ \sum_{j=1}^{J} \theta_j \sum_{r=1}^{R_j} \pi_{j,r} \prod_{l=1}^{d} N(y_{i,l};\,\mu_{j,r,l},\,\sigma^2_{j,r,l}) \right] \times \prod_{i\in\mathrm{lab}} \left[ \sum_{r=1}^{R_{Z_i}} \pi_{Z_i,r} \prod_{l=1}^{d} N(y_{i,l};\,\mu_{Z_i,r,l},\,\sigma^2_{Z_i,r,l}) \right]. \qquad (4.5)
\]
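As an illustration, the semi-supervised likelihood (4.5) can be evaluated by combining (4.2)–(4.4). The sketch below (illustrative Python, not the thesis implementation; array shapes are assumptions) works in log space for numerical stability:

```python
import numpy as np

def log_gauss(y, mu, var):
    """Elementwise log N(y; mu, var) for a univariate Gaussian."""
    return -0.5 * (np.log(2 * np.pi * var) + (y - mu) ** 2 / var)

def log_likelihood(y, labels, theta, pi, mu, var):
    """Log of the semi-supervised likelihood (4.5), up to proportionality.

    y      : (n, d) array of measurements
    labels : length-n sequence; class index Z_i for labelled points, None otherwise
    theta  : (J,) prior class probabilities
    pi[j]  : (R_j,) within-class mixing probabilities
    mu[j], var[j] : (R_j, d) component means and variances for class j
    """
    total = 0.0
    for yi, Zi in zip(y, labels):
        # log p(yi | j) for each class: the mixture (4.3) of diagonal Gaussians (4.4)
        log_pxj = np.array([np.logaddexp.reduce(np.log(pi[j])
                                                + log_gauss(yi, mu[j], var[j]).sum(axis=1))
                            for j in range(len(theta))])
        if Zi is None:
            total += np.logaddexp.reduce(np.log(theta) + log_pxj)  # unlabelled term of (4.5)
        else:
            total += log_pxj[Zi]                                   # labelled term of (4.5)
    return total
```

Note that, as in (4.5), the labelled term contains no θj factor, since the class is observed.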
4.3 Bayesian approach
4.3.1 Prior distributions

The novelty of the approach documented in this chapter lies in the specification of the prior distributions for the mixture model parameters. In particular, some of the hyperparameters of the prior distributions are treated as random variables, depending on hyper-hyperparameters. These prior distributions are best illustrated graphically with DAGs [96, 156]. Figure 4.1 redisplays the DAG describing the hierarchical structure used in Chapter 3, which makes use of fixed hyperparameters. Figure 4.2 shows the new hierarchical structure. In both cases, the random variables are denoted by circular boxes and the fixed hyperparameters by square boxes. The fixed hyperparameters m0 of Figure 4.1 are replaced with a variable set of hyperparameters m in Figure 4.2. The new random variable hyperparameters consist of the d-dimensional vectors {mj,r; j = 1, . . . , J, r = 1, . . . , Rj}, with a separate hyperparameter vector mj,r for each mixture component (where j indexes the class and r indexes the mixture component). The hyperprior distributions for m are defined by the collection of fixed hyper-hyperparameters (m0, Ψ0). The remaining fixed hyperparameters (h0, ν0, V0, a0, b0) have the same form for both models. For the remainder of this chapter, we present the hierarchical prior formulation of Figure 4.2. The complete prior distribution for the random variables can be written mathematically as:

\[
p(\mu,\Sigma,m,\theta,\pi|h_0,m_0,\Psi_0,\nu_0,V_0,a_0,b_0) = p(\mu|\Sigma,m,h_0)\,p(\Sigma|\nu_0,V_0)\,p(\theta|a_0)\,p(\pi|b_0)\,p(m|m_0,\Psi_0). \qquad (4.6)
\]
[Figure 4.1: DAG illustrating the hierarchical structure for the prior distributions of the Gaussian mixture model parameters in the previous chapter. Fixed hyperparameters (square nodes): h0, m0, ν0, V0, a0, b0; random variables (circular nodes): µ, Σ, θ, π and the data y.]
[Figure 4.2: DAG illustrating the hierarchical structure for the prior distributions of the Gaussian mixture model parameters. Fixed hyper-hyperparameters and hyperparameters (square nodes): h0, Ψ0, m0, ν0, V0, a0, b0; random variables (circular nodes): m, µ, Σ, θ, π and the data y.]

This prior distribution differs from that specified in Section 3.4.3 by the addition of the term p(m|m0, Ψ0). Conditional on the hyperparameters m, an assumption of prior independence of the (µj,r, Σj,r) over all classes j and mixture components r is made. Thus:
\[
p(\mu,\Sigma|m,h_0,\nu_0,V_0) = \prod_{j=1}^{J}\prod_{r=1}^{R_j} p(\mu_{j,r},\Sigma_{j,r}|m_{j,r},h_{j,r,0},\nu_{j,r,0},V_{j,r,0}). \qquad (4.7)
\]
In addition, the dimensions within each mixture component are taken to be independent, so that:
\[
p(\mu_{j,r},\Sigma_{j,r}|m_{j,r},h_{j,r,0},\nu_{j,r,0},V_{j,r,0}) = \prod_{l=1}^{d} p(\mu_{j,r,l},\sigma^2_{j,r,l}|m_{j,r,l},h_{j,r,l,0},\nu_{j,r,l,0},V_{j,r,l,0}). \qquad (4.8)
\]
In particular, independent normal-inverse gamma priors for the components µj,r,l and σ²j,r,l are assigned. Thus:

\[
\mu_{j,r,l}|(\sigma^2_{j,r,l},m_{j,r,l},h_{j,r,l,0}) \sim N(m_{j,r,l},\,\sigma^2_{j,r,l}/h_{j,r,l,0}), \qquad (4.9)
\]

and:

\[
\sigma^2_{j,r,l}|(\nu_{j,r,l,0},V_{j,r,l,0}) \sim 1/\Gamma(\nu_{j,r,l,0},V_{j,r,l,0}), \qquad (4.10)
\]

where the parameterisation of the inverse gamma distribution is given in Appendix A. Thus, conditional on the variable hyperparameters m, we have the standard conjugate priors for (µj,r,l, σ²j,r,l).
The fixed hyperparameters hj,r,l,0 are referred to as precision parameters, while νj,r,l,0 and Vj,r,l,0 are referred to as degrees of freedom and scale parameters, respectively. The values of these hyperparameters are chosen, using the training data, to be reasonable (see Section D.1), reducing the risk of inappropriate values being selected. The value of hj,r,l,0 determines the weight of the contribution of the prior information compared to the data (likelihood) in the posterior distribution (see Section 4.6.3). The class and mixture component allocation probabilities are given independent Dirichlet priors, with:

\[
\theta \sim \mathcal{D}(a_0), \quad \text{where } a_0 = (a_{1,0},\ldots,a_{J,0}), \; a_{j,0} > 0, \qquad (4.11)
\]

and:

\[
\pi_j \sim \mathcal{D}(b_{j,0}), \quad \text{where } b_{j,0} = (b_{j,1,0},\ldots,b_{j,R_j,0}), \; b_{j,r,0} > 0. \qquad (4.12)
\]
The hyperparameters a0 and bj,0 are held fixed, with equality between the components within each vector. Thus, there is no prior preference for any class or mixture component. For notational convenience, in subsequent equations the conditioning on fixed hyperparameters is generally left implicit. Appendix D.1 documents the values of the fixed hyperparameters used in the experiments of Sections 4.8 and 4.9.
4.3.2 Hyperprior distributions

The variable hyperparameters mj,r are taken to have prior independence over all classes j and mixture components r, giving the following hyperprior distribution:
\[
p(m|m_0,\Psi_0) = \prod_{j=1}^{J}\prod_{r=1}^{R_j} p(m_{j,r}|m_{j,r,0},\Psi_{j,r,0}). \qquad (4.13)
\]
However, unlike the conditional prior distributions for (µj,r, Σj,r), there is no assumption of independence between the components that make up these hyperparameter vectors. Multivariate Gaussian hyperpriors are assigned:

\[
m_{j,r}|(m_{j,r,0},\Psi_{j,r,0}) \sim N_d(m_{j,r,0},\Psi_{j,r,0}), \qquad (4.14)
\]

where mj,r,0 and Ψj,r,0 are fixed hyper-hyperparameters. It is via the alteration of these hyper-hyperparameters that we attempt to improve the generalisation properties of the proposed mixture model classification scheme.
The hyper-hyperparameters are defined to be the same across all mixture components of a class; i.e. for each class j, we set mj,r,0 = mj,0 and Ψj,r,0 = Ψj,0 , for r = 1, . . . , Rj . This reflects the lack of any a priori labels for the mixture components. Details of how the hyper-hyperparameters (and therefore the hyperpriors) are altered in an attempt to improve the generalisation properties of the mixture model classifier are provided in Section 4.6. The conditioning on fixed hyper-hyperparameters is (for notational convenience) generally left implicit in subsequent equations.
4.3.3 Posterior distribution

Bayes' theorem gives the following relationship between the posterior, prior and likelihood:

\[
p(\mu,\Sigma,\theta,\pi,m|y) \propto p(\mu,\Sigma,\theta,\pi,m)\,p(y|\mu,\Sigma,\theta,\pi), \qquad (4.15)
\]

which gives:

\[
\begin{aligned}
p(\mu,\Sigma,\theta,\pi,m|y) \propto{}& \prod_{j=1}^{J}\prod_{r=1}^{R_j}\prod_{l=1}^{d} \frac{1}{\sigma_{j,r,l}^{2(\nu_{j,r,l,0}+3/2)}} \exp\!\left( \frac{-h_{j,r,l,0}(\mu_{j,r,l}-m_{j,r,l})^{2} - 2V_{j,r,l,0}}{2\sigma_{j,r,l}^{2}} \right) \\
&\times \prod_{j=1}^{J} \theta_{j}^{a_{j,0}-1} \prod_{r=1}^{R_j} \pi_{j,r}^{b_{j,r,0}-1} \\
&\times \prod_{j=1}^{J}\prod_{r=1}^{R_j} \exp\!\left( \frac{-(m_{j,r}-m_{j,r,0})^{T}\Psi_{j,r,0}^{-1}(m_{j,r}-m_{j,r,0})}{2} \right) \\
&\times \prod_{i\in\mathrm{unlab}} \left[ \sum_{j=1}^{J} \theta_{j} \sum_{r=1}^{R_j} \pi_{j,r} \prod_{l=1}^{d} N(y_{i,l};\mu_{j,r,l},\sigma_{j,r,l}^{2}) \right] \\
&\times \prod_{i\in\mathrm{lab}} \left[ \sum_{r=1}^{R_{Z_i}} \pi_{Z_i,r} \prod_{l=1}^{d} N(y_{i,l};\mu_{Z_i,r,l},\sigma_{Z_i,r,l}^{2}) \right]. \qquad (4.16)
\end{aligned}
\]
Calculation of the normalisation constant of the posterior distribution (regardless of the proportion of unlabelled data) is analytically intractable, as are calculations of various statistics of interest, such as the means and variances of the parameters. It is possible to integrate out the hyperparameters mj,r analytically. However, this results in a very complicated marginal posterior distribution for (µ, Σ, θ, π), which is less amenable to future processing than the full posterior distribution. Thus, this possibility is not investigated further. Rather than making some (potentially) rather dubious simplifications to allow us to make inference on the posterior distribution, a full Bayesian approach to the problem is maintained, by drawing samples from the posterior. All inferences can then be made through consideration of these samples. Since it is not possible to sample directly from the distribution given by (4.16), an MCMC algorithm [55, 158, 178] is used. In particular, a Gibbs sampler [21] is developed to draw approximately independent samples from the distribution.
4.4 MCMC algorithm
4.4.1 Outline

To make use of the Gibbs sampler to sample from the posterior distribution, the set of random variables (µ, Σ, π, θ, m) is augmented to include the class allocation variables Z = (Z1, . . . , Zn) and the mixture component allocation variables z = (z1, . . . , zn). These are such that (Zi = j, zi = r) implies that the observation indexed by i is modelled to be from mixture component r of class j. Zi is known for our labelled training data (and should be treated as a constant), whereas zi is always unknown. Then, rather than sampling from p(µ, Σ, π, θ, m|y), we sample from the extended posterior distribution p(µ, Σ, π, θ, m, Z, z|y). The new augmented set of variables is split into four groupings; namely (µ, Σ), m, (Z, z) and (θ, π). Investigations into an alternative formulation, in which the groups (µ, Σ) and m are merged into the single group (µ, Σ, m), are discussed in Appendix C.2. The required constituents for the Gibbs sampler algorithm are the posterior distributions for each group, conditional on the remaining three groups, i.e. the distributions:
• p(µ, Σ|θ, π, m, y, Z, z),
• p(m|θ, π, µ, Σ, y, Z, z),
• p(θ, π|y, Z, z, µ, Σ, m),
• p(Z, z|y, µ, Σ, m, θ, π).
The outline of the MCMC Gibbs sampler algorithm is given in Figure 4.3.
1. Initialisation. Set i = 1 and determine initial values for (m(0), Z(0), z(0)) from the support of the joint posterior distribution.
• Elements of Z(0) relating to labelled training data are assigned their true class labels.
• Elements of Z(0) relating to unlabelled training data are set using very simple (and quick) classifiers, such as nearest class mean.
• The z(0) are initialised with simple clustering algorithms such as k-means, or, for training data with known angles of observation, by clustering into angle segments.
• The mj,r(0) are sampled from the prior distributions given in (4.14).
2. Iteration i:
• Sample (µ(i), Σ(i)) from p(µ, Σ|y, m(i−1), θ(i−1), π(i−1), Z(i−1), z(i−1)), using the equations of Section 4.4.2.
• Sample m(i) from p(m|y, µ(i), Σ(i), θ(i−1), π(i−1), Z(i−1), z(i−1)), using the equations of Section 4.4.3.
• Sample (θ(i), π(i)) from p(θ, π|y, µ(i), Σ(i), m(i), Z(i−1), z(i−1)), using the equations of Section 4.4.4.
• Sample (Z(i), z(i)) from p(Z, z|y, µ(i), Σ(i), m(i), θ(i), π(i)), using the equations of Section 4.4.5.
3. Set i ← i + 1 and go to 2.
Figure 4.3: Outline of the training algorithm for Bayesian Gaussian mixture models using hyperpriors.

After an initial burn-in period, during which the generated Markov chain reaches equilibrium, the set of parameters (µ(i), Σ(i), m(i), θ(i), π(i), Z(i), z(i)) can be regarded as dependent samples from the posterior distribution p(µ, Σ, m, θ, π, Z, z|y). To obtain approximately independent samples, a gap (known as the decorrelation gap) is left between successive samples, i.e. a sample is retained only every l-th iteration of the algorithm, where l ≥ 1. If we are concerned only with ergodic averages, lower variances are obtained if the output of the Markov chain is not sub-sampled. However, if storage of the samples is an issue, it is better to leave a decorrelation gap, so that the full space of the distribution can be explored without having to keep thousands of samples. In our notation, these approximately independent samples are relabelled (µ(s), Σ(s), m(s), θ(s), π(s), Z(s), z(s)), for s = 1, . . . , N, where N is the number of MCMC samples. As in Chapter 3, choice of the burn-in period, decorrelation gap and number of samples involves specifying an initial set of values, and examining the classification rate for the training and validation datasets. Modification of these initial values can then be made in the light of this classification performance. This, together with examination of sample plots, is likely to provide a better test of convergence than more sophisticated convergence diagnostics [55, 125], which tend to fare poorly for all but the simplest of problems [71, 189].
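The sweep of Figure 4.3 is an instance of a generic Gibbs driver: update each variable group in turn from its full conditional, then retain thinned samples after burn-in. A minimal sketch (the group names and update functions are illustrative placeholders, not the thesis implementation):

```python
import numpy as np

def gibbs_sampler(init_state, update_steps, n_iter, burn_in, gap, rng):
    """Generic Gibbs driver mirroring Figure 4.3.

    init_state   : dict mapping group names to initial values, e.g.
                   {'mu_sigma': ..., 'm': ..., 'theta_pi': ..., 'alloc': ...}
    update_steps : ordered list of (name, fn) pairs; fn(state, rng) draws the
                   named group from its full conditional given the other groups
    Returns the retained samples: one every `gap` iterations after `burn_in`,
    approximating independent draws from the joint posterior.
    """
    state = dict(init_state)
    samples = []
    for i in range(n_iter):
        for name, fn in update_steps:        # one sweep over the variable groups
            state[name] = fn(state, rng)
        if i >= burn_in and (i - burn_in) % gap == 0:
            samples.append(dict(state))      # record the current draw (thinned)
    return samples
```

As a toy usage example, two conditionally Gaussian variables with correlation ρ can be sampled with `update_steps = [('x', ...), ('y', ...)]`, each drawing from N(ρ·other, 1 − ρ²).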
4.4.2 Conditional distributions for the mixture components

Given the allocation variables (Z, z), the data y consist of labelled independent samples from the k = R1 + · · · + RJ mixture component distributions, independent of the prior class and mixing probabilities. Furthermore, if (r, j) ≠ (r′, j′), the parameters and prior distributions are formulated so that the training data assigned to mixture component r of class j have no influence on the posterior distribution of the parameters of mixture component r′ of class j′. The set of indices of data elements that have been assigned to component r of class j is defined to be Gj,r = {i; (Zi = j, zi = r)}. The cardinality of Gj,r is represented by gj,r, and the set {yi; i ∈ Gj,r} is denoted by yGj,r. In addition, the following are needed:

\[
\bar{y}_{j,r,l} = \frac{1}{g_{j,r}} \sum_{i\in G_{j,r}} y_{i,l} \quad\text{and}\quad S_{j,r,l} = \sum_{i\in G_{j,r}} (y_{i,l} - \bar{y}_{j,r,l})^{2}. \qquad (4.17)
\]
Thus:

\[
p(\mu,\Sigma|\theta,\pi,m,y,Z,z) = \prod_{j=1}^{J}\prod_{r=1}^{R_j} p(\mu_{j,r},\Sigma_{j,r}|m_{j,r},y_{G_{j,r}}). \qquad (4.18)
\]
Some algebra (similar to that detailed in Appendix B.1 for the original mixture model formulation) shows that under this conditioning, the components (µj,r,l, σ²j,r,l) have independent normal-inverse gamma posterior distributions:

\[
\mu_{j,r,l}|(\sigma^2_{j,r,l},m_{j,r,l},y_{G_{j,r}}) \sim N(\kappa_{j,r,l},\,\sigma^2_{j,r,l}/h_{j,r,l}), \qquad (4.19)
\]

and:

\[
\sigma^2_{j,r,l}|(m_{j,r,l},y_{G_{j,r}}) \sim 1/\Gamma(\nu_{j,r,l},V_{j,r,l}), \qquad (4.20)
\]

where:

\[
h_{j,r,l} = h_{j,r,l,0} + g_{j,r}, \qquad (4.21)
\]
\[
\kappa_{j,r,l} = \frac{h_{j,r,l,0}\, m_{j,r,l} + g_{j,r}\,\bar{y}_{j,r,l}}{h_{j,r,l}}, \qquad (4.22)
\]
\[
\nu_{j,r,l} = \nu_{j,r,l,0} + g_{j,r}/2, \qquad (4.23)
\]
\[
V_{j,r,l} = V_{j,r,l,0} + \frac{S_{j,r,l}}{2} + \frac{g_{j,r}\, h_{j,r,l,0}}{2 h_{j,r,l}} (\bar{y}_{j,r,l} - m_{j,r,l})^{2}. \qquad (4.24)
\]
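A sketch of the updates (4.21)–(4.24) and the draws (4.19)–(4.20) for one mixture component, with the d dimensions vectorised (illustrative code, not the thesis implementation):

```python
import numpy as np

def nig_posterior(y_group, m, h0, nu0, V0):
    """Posterior normal-inverse gamma parameters (4.21)-(4.24) for one
    mixture component (j, r), vectorised over the d dimensions.

    y_group     : (g, d) data currently allocated to the component
    m           : (d,) current hyperparameter vector m_{j,r}
    h0, nu0, V0 : (d,) fixed hyperparameters for this component
    """
    g = y_group.shape[0]
    ybar = y_group.mean(axis=0)
    S = ((y_group - ybar) ** 2).sum(axis=0)                   # as in (4.17)
    h = h0 + g                                                # (4.21)
    kappa = (h0 * m + g * ybar) / h                           # (4.22)
    nu = nu0 + g / 2.0                                        # (4.23)
    V = V0 + S / 2.0 + g * h0 * (ybar - m) ** 2 / (2.0 * h)   # (4.24)
    return h, kappa, nu, V

def sample_mu_sigma(h, kappa, nu, V, rng):
    """Draw (mu, sigma2) from the posteriors (4.19)-(4.20)."""
    sigma2 = 1.0 / rng.gamma(shape=nu, scale=1.0 / V)  # inverse gamma draw
    mu = rng.normal(kappa, np.sqrt(sigma2 / h))
    return mu, sigma2
```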
4.4.3 Conditional distributions for the variable hyperparameters

By following an approach similar to that of Section 4.4.2, the posterior conditional distributions for the hyperparameters mj,r can be written:
\[
p(m|\theta,\pi,\mu,\Sigma,y,Z,z) = \prod_{j=1}^{J}\prod_{r=1}^{R_j} p(m_{j,r}|\mu_{j,r},\Sigma_{j,r},y_{G_{j,r}}). \qquad (4.25)
\]
Some algebra (detailed in Appendix C.1) shows that the conditional posterior distributions for the mj,r are multivariate Gaussian:

\[
m_{j,r}|(\mu_{j,r},\Sigma_{j,r},y_{G_{j,r}}) \sim N_d(m^{*}_{j,r},\Psi^{*}_{j,r}), \qquad (4.26)
\]

where:

\[
\Psi^{*}_{j,r} = \left(\Psi_{j,r,0}^{-1} + h_{j,r,0}\,\Sigma_{j,r}^{-1}\right)^{-1}, \qquad (4.27)
\]

and:

\[
m^{*}_{j,r} = \Psi^{*}_{j,r}\left(\Psi_{j,r,0}^{-1}\, m_{j,r,0} + h_{j,r,0}\,\Sigma_{j,r}^{-1}\,\mu_{j,r}\right), \qquad (4.28)
\]

with hj,r,0 a d × d diagonal matrix with l-th diagonal element hj,r,l,0, and Σj,r a d × d diagonal matrix with l-th diagonal element σ²j,r,l, for l = 1, . . . , d.
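The Gaussian full conditional (4.26)–(4.28) can be computed as below; the function names are illustrative, and the diagonal structure of hj,r,0 and Σj,r is exploited so that only Ψj,r,0 needs a general matrix inverse:

```python
import numpy as np

def m_conditional(mu, sigma2, m0, Psi0, h0_diag):
    """Mean and covariance (4.27)-(4.28) of the full conditional for m_{j,r}.

    mu, sigma2 : (d,) current component means and variances
    m0         : (d,) hyper-hyperparameter mean m_{j,r,0}
    Psi0       : (d, d) hyper-hyperparameter covariance Psi_{j,r,0}
    h0_diag    : (d,) diagonal of the precision-parameter matrix h_{j,r,0}
    """
    Psi0_inv = np.linalg.inv(Psi0)
    # h_{j,r,0} Sigma_{j,r}^{-1} is diagonal, with entries h0_l / sigma2_l
    Psi_star = np.linalg.inv(Psi0_inv + np.diag(h0_diag / sigma2))   # (4.27)
    m_star = Psi_star @ (Psi0_inv @ m0 + (h0_diag / sigma2) * mu)    # (4.28)
    return m_star, Psi_star

def sample_m(mu, sigma2, m0, Psi0, h0_diag, rng):
    """Draw m_{j,r} from the multivariate Gaussian (4.26)."""
    m_star, Psi_star = m_conditional(mu, sigma2, m0, Psi0, h0_diag)
    return rng.multivariate_normal(m_star, Psi_star)
```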
4.4.4 Conditional distributions for the allocation probabilities

Given the allocation variables (Z, z), the prior class and mixing probabilities (θ, π) will be independent of (y, µ, Σ, m). Thus, we have the same equations as Section 3.5.4, and the mixing probabilities are updated using independent Dirichlet distributions. For the prior class probabilities:

\[
\theta|Z \sim \mathcal{D}(a), \qquad (4.29)
\]

independently of π, where a = (a1, . . . , aJ) with:

\[
a_j = \sum_{r=1}^{R_j} g_{j,r} + a_{j,0}. \qquad (4.30)
\]

For the within-class mixing probabilities:

\[
\pi_j|(Z,z) \sim \mathcal{D}(b_j), \qquad (4.31)
\]

where bj = (bj,1, . . . , bj,Rj) with:

\[
b_{j,r} = g_{j,r} + b_{j,r,0}. \qquad (4.32)
\]
4.4.5 Conditional distributions for the allocation variables

The conditional distributions for the allocation variables are determined in the same manner as in Section 3.5.5. Specifically, each measurement can be considered separately, so that:

\[
p(Z,z|y,\mu,\Sigma,m,\theta,\pi) = \prod_{i=1}^{n} p(z_i|Z_i,y_i,\mu,\Sigma,\theta,\pi)\, p(Z_i|y_i,\mu,\Sigma,\theta,\pi). \qquad (4.33)
\]
If Zi is unknown for the i-th data vector (i.e. the data point is unlabelled):

\[
p(Z_i = j|y_i,\mu,\Sigma,\theta,\pi) \propto \theta_j \sum_{r=1}^{R_j} \pi_{j,r} \prod_{l=1}^{d} N(y_{i,l};\mu_{j,r,l},\sigma^2_{j,r,l}). \qquad (4.34)
\]

For a labelled training data vector, this sampling is skipped, which is equivalent to replacing (4.34) with a unit mass distribution at the labelled class. For the within-class mixture component allocation variable zi, letting Zi = j, we have:

\[
p(z_i = r|Z_i = j, y_i,\mu,\Sigma,\theta,\pi) \propto \pi_{j,r} \prod_{l=1}^{d} N(y_{i,l};\mu_{j,r,l},\sigma^2_{j,r,l}). \qquad (4.35)
\]
Since the allocations to within class mixture components are initially unknown, this sampling of zi is needed, regardless of whether the data vector is labelled. Since the number of classes and mixture components is finite, it is easy to sample from the distributions defined by both (4.34) and (4.35).
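A sketch of sampling the allocation variables via (4.34)–(4.35) for a single observation, working in log space to avoid underflow when d is large (illustrative code, not the thesis implementation):

```python
import numpy as np

def logsumexp(a):
    """Stable log(sum(exp(a)))."""
    m = a.max()
    return m + np.log(np.exp(a - m).sum())

def sample_allocation(y_i, theta, pi, mu, var, rng, Z_i=None):
    """Draw (Z_i, z_i) for one observation via (4.34)-(4.35).

    If Z_i is supplied (labelled data), only the component z_i is sampled.
    mu[j], var[j] : (R_j, d) component means / variances for class j.
    """
    def log_comp(j):
        # log prod_l N(y_l; mu_{j,r,l}, sigma2_{j,r,l}) for each component r
        return -0.5 * (np.log(2 * np.pi * var[j])
                       + (y_i - mu[j]) ** 2 / var[j]).sum(axis=1)

    if Z_i is None:
        # (4.34): p(Z_i = j | ...) propto theta_j * sum_r pi_{j,r} * prod_l N(...)
        log_w = np.array([np.log(theta[j]) + logsumexp(np.log(pi[j]) + log_comp(j))
                          for j in range(len(theta))])
        w = np.exp(log_w - log_w.max())
        Z_i = rng.choice(len(theta), p=w / w.sum())
    # (4.35): p(z_i = r | Z_i, ...) propto pi_{Z_i,r} * prod_l N(...)
    log_u = np.log(pi[Z_i]) + log_comp(Z_i)
    u = np.exp(log_u - log_u.max())
    z_i = rng.choice(len(u), p=u / u.sum())
    return Z_i, z_i
```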
4.5 Using the trained classifier
This section discusses how the MCMC samples from the augmented posterior distribution can be used to make classifications of both training and future (i.e. operational or test) data.
4.5.1 Classifying the training data

Even if the training data is class labelled, it is useful to assess the performance of the trained classifier on the training data. Amongst other things, this provides an indication of whether the mixture models have been able to learn the distribution of the training data measurements, and whether the MCMC sampler has converged. Furthermore, the classification performance on training data can be used to assist model selection. Posterior classification probabilities for a training data vector yt are obtained using:

\[
p(Z_t|D) \approx \frac{1}{N}\sum_{s=1}^{N} p(Z_t|y_t,\mu^{(s)},\Sigma^{(s)},\theta^{(s)},\pi^{(s)}), \qquad (4.36)
\]

where p(Zt = j|yt, µ(s), Σ(s), θ(s), π(s)) is evaluated using (4.34), i.e.:

\[
p(Z_t = j|y_t,\mu^{(s)},\Sigma^{(s)},\theta^{(s)},\pi^{(s)}) \propto \theta_j^{(s)} \sum_{r=1}^{R_j} \pi_{j,r}^{(s)} \prod_{l=1}^{d} N(y_{t,l};\mu_{j,r,l}^{(s)},\sigma_{j,r,l}^{2(s)}). \qquad (4.37)
\]
For unlabelled training data, these are the Rao-Blackwellised estimates [22, 106]. Details are the same as in Section 3.6.2.
4.5.2 Classifying future observations

The posterior class probabilities for a previously unseen observation yf can be written:

\[
p(Z_f = j|D, y_f) \propto p(Z_f = j|D)\, p(y_f|D, Z_f = j). \qquad (4.38)
\]

If the class proportions in test or operational data are expected to be the same as those in the training data, the term p(Zf = j|D) can be approximated from the MCMC samples, using:

\[
p(Z_f = j|D) \approx \frac{1}{N}\sum_{s=1}^{N} \frac{a_j^{(s)}}{\sum_{j'=1}^{J} a_{j'}^{(s)}}, \qquad (4.39)
\]

with aj(s) as defined in (4.30), with the gj,r determined by the allocation variables (Z(s), z(s)). The derivation of (4.39) is the same as that in Section 3.6.3. If the class proportions in test or operational data are expected to be different from those in the training data, the distribution defined by p(Zf = j|D) should be estimated from prior beliefs on the spread of future data between classes. As in Section 3.6.3, the density p(yf|D, Zf = j) can be estimated directly as follows:

\[
p(y_f|D, Z_f = j) \approx \frac{1}{N}\sum_{s=1}^{N} \sum_{r=1}^{R_j} \pi_{j,r}^{(s)} \prod_{l=1}^{d} N(y_{f,l};\mu_{j,r,l}^{(s)},\sigma_{j,r,l}^{2(s)}). \qquad (4.40)
\]

This approach is referred to as direct estimation from the MCMC samples and gives an equation similar to that used for the training data. However, a theoretically better approach uses Rao-Blackwellised estimates to calculate the predictive density. Unfortunately, it turns out that (using calculations similar to those in Appendix C.2, detailing attempts to combine the groups m and (µ, Σ) within the Gibbs sampler) it is not possible to marginalise analytically over the full conditional posterior distribution for each (µj,r, Σj,r, mj,r). Instead, the Rao-Blackwellisation based solution is based on predictive densities over the posterior distributions for (µj,r, Σj,r) only. The integration over the mj,r occurs through direct summation over the MCMC sampled values mj,r(s). Specifically:

\[
\begin{aligned}
p(y_f|D, Z_f = j) &= \int p(y_f, m, Z, z|D, Z_f = j)\, dm\, dZ\, dz \\
&= \int p(y_f|D, Z_f = j, m, Z, z)\, p(Z, z, m|D, Z_f = j)\, dm\, dZ\, dz \\
&= \int p(y_f|D, Z_f = j, m, Z, z)\, p(Z, z, m|D)\, dm\, dZ\, dz, \qquad (4.41)
\end{aligned}
\]
with the second line following from the first via the definition of conditional probabilities and the third from the second via the observation that the class of a future measurement has no effect on the marginal posterior distribution for (m, Z, z). Then, using the MCMC approximation of the marginal
posterior distribution:

\[
\begin{aligned}
p(y_f|D, Z_f = j) &\approx \frac{1}{N}\sum_{s=1}^{N} p(y_f|D, Z_f = j, m^{(s)}, Z^{(s)}, z^{(s)}) \\
&\approx \frac{1}{N}\sum_{s=1}^{N}\sum_{r=1}^{R_j} p(z_f = r|D, Z_f = j, m^{(s)}, Z^{(s)}, z^{(s)})\; p(y_f|D, Z_f = j, z_f = r, m^{(s)}, Z^{(s)}, z^{(s)}), \qquad (4.42)
\end{aligned}
\]

where:

\[
p(z_f = r|D, Z_f = j, m^{(s)}, Z^{(s)}, z^{(s)}) = E(\pi_{j,r}|D, Z^{(s)}, z^{(s)}) = \frac{b_{j,r}^{(s)}}{\sum_{r'=1}^{R_j} b_{j,r'}^{(s)}}, \qquad (4.43)
\]

with bj,r(s) as defined in (4.32), with the gj,r determined by the allocation variables (Z(s), z(s)). The density p(yf|D, Zf = j, zf = r, m(s), Z(s), z(s)) is the predictive density for data drawn from mixture component r of class j, with parameters determined by the posterior distributions for the mixture components, conditional on the subset of MCMC samples (m(s), Z(s), z(s)). Some calculations (similar to those detailed in Appendix B.2) show that this is a product of independent Student-t distributions (see Appendix A), with the l-th component distribution in the product having:

1) 2νj,r,l(s) degrees of freedom,
2) location parameter κj,r,l(s),
3) scale parameter Υj,r,l(s).

The parameters νj,r,l(s) and κj,r,l(s) are defined in (4.23) and (4.22) respectively, using quantities determined by the allocation variables (Z(s), z(s)) and the hyperparameter vector mj,r(s). The parameter Υj,r,l(s) is defined to be:

\[
\Upsilon_{j,r,l}^{(s)} = V_{j,r,l}^{(s)}\,\bigl(h_{j,r,l}^{(s)} + 1\bigr)\big/\bigl(h_{j,r,l}^{(s)}\,\nu_{j,r,l}^{(s)}\bigr), \qquad (4.44)
\]

with the parameters hj,r,l(s), νj,r,l(s) and Vj,r,l(s) defined in (4.21), (4.23) and (4.24) respectively, again using quantities determined by the allocation variables (Z(s), z(s)) and the hyperparameter vector mj,r(s). In summary:

\[
p(y_f|D, Z_f = j, z_f = r, m^{(s)}, Z^{(s)}, z^{(s)}) = \prod_{l=1}^{d} \mathrm{St}\bigl(y_{f,l};\,\kappa_{j,r,l}^{(s)},\,2\nu_{j,r,l}^{(s)},\,\Upsilon_{j,r,l}^{(s)}\bigr), \qquad (4.45)
\]

where the Student-t distribution is defined in Appendix A. Substituting (4.45) and (4.43) into (4.42) gives:

\[
p(y_f|D, Z_f = j) \approx \frac{1}{N}\sum_{s=1}^{N} \frac{1}{\sum_{r'=1}^{R_j} b_{j,r'}^{(s)}} \sum_{r=1}^{R_j} b_{j,r}^{(s)} \prod_{l=1}^{d} \mathrm{St}\bigl(y_{f,l};\,\kappa_{j,r,l}^{(s)},\,2\nu_{j,r,l}^{(s)},\,\Upsilon_{j,r,l}^{(s)}\bigr). \qquad (4.46)
\]
It should be noted that although the Rao-Blackwellisation based method gives theoretically lower variance results than direct estimation (see Section 3.6), the approach is more computationally expensive (see Section 3.6.3). Thus, the Rao-Blackwellisation method should not necessarily be preferred to the direct estimation approach. Future research could address techniques for penalising such evaluation complexity [37].
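The Rao-Blackwellised predictive density (4.44)–(4.46) can be sketched as follows; the dictionary layout holding the per-sample quantities is an assumption for illustration, not taken from the thesis code:

```python
import math

def log_student_t(x, loc, df, scale):
    """Log density of a location/scale Student-t with `df` degrees of freedom
    and squared scale `scale` (the parameterisation assumed in Appendix A)."""
    z = (x - loc) ** 2 / (df * scale)
    return (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
            - 0.5 * math.log(df * math.pi * scale)
            - (df + 1) / 2 * math.log1p(z))

def predictive_density(y_f, samples):
    """Approximate p(y_f | D, Z_f = j) via (4.46) for one class j.

    samples : list over MCMC draws s of dicts with keys 'b' (length R_j) and
              'kappa', 'nu', 'V', 'h' (each R_j x d nested sequences), holding
              the quantities of (4.32) and (4.21)-(4.24).
    """
    total = 0.0
    for smp in samples:
        b = smp['b']
        inner = 0.0
        for r in range(len(b)):
            log_p = 0.0
            for l in range(len(y_f)):
                h, nu, V = smp['h'][r][l], smp['nu'][r][l], smp['V'][r][l]
                ups = V * (h + 1) / (h * nu)                       # (4.44)
                log_p += log_student_t(y_f[l], smp['kappa'][r][l],
                                       2 * nu, ups)                # (4.45)
            inner += b[r] * math.exp(log_p)                        # (4.43) weighting
        total += inner / sum(b)
    return total / len(samples)                                    # average over s, (4.46)
```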
4.5.3 Incorporating uncertainty in target location

Before application of the mixture model classifier to new data, an attempt must be made to align the targets consistently with the training data. For a radar target recognition application, this is required because the range bins/cells occupied by the target will differ, depending on the distance of the radar to the target. The standard way of doing this is to centre the training data, using an appropriate consistent procedure, and then centre each subsequent target using the same procedure. The centred data is then treated as being consistently aligned. However, there will always be uncertainty in the exact location of the target centre, and therefore of the position of the target itself. Some account should be made of this uncertainty. This can be handled readily in the Bayesian framework. For ease of explanation, the approach adopted is illustrated for radar range profile data. The extension of the approach to full ISAR images is trivial. The training and test data are all recentred about the extracted target centres. For each profile, a d-dimensional data vector, where d = 2p, is created by taking p range bins either side of the extracted centre, producing recentred data y(c). For previously unseen profiles (i.e. test data or operational data collected under uncontrolled conditions), the possibility that the extracted centre might be a few range bins away from the centre of appropriately aligned training data is taken into account, by considering small shifts in the centre position. This produces data yf(c + s), where s is the (integer) shift of the position of the centre (and can be positive or negative). If we allow for errors of −sm ≤ s ≤ sm in the extracted centre, our single probability density:

\[
p(y_f|D, Z_f = j) = p(y_f(c)|D, Z_f = j), \qquad (4.47)
\]

can be replaced with expressions of the form:

\[
p(y_f|D, Z_f = j) \approx \frac{1}{2 s_m + 1} \sum_{s=-s_m}^{s_m} p(y_f(c+s)|D, Z_f = j). \qquad (4.48)
\]

Each expression p(yf(c + s)|D, Zf = j) is treated as if yf(c + s) were the consistently aligned, previously unseen data. As sm increases, unless the data is badly recentred, the terms p(yf(c + s)|D, Zf = j) with |s| large will be negligible compared to the terms with |s| small. Thus, in practice, only small values of sm need to be considered. An alternative approach to shifting the range profiles would be to attempt to use shift-invariant features (for both training and test data), such as those provided by a Fourier transform of the range profiles.
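The shift-averaging of (4.48) amounts to a few lines; the windowing convention below (p bins either side of the candidate centre) is a simplified reading of the recentring described above, and `density_fn` stands in for any of the density estimators of Section 4.5.2:

```python
def shift_averaged_density(profile, centre, p, s_max, density_fn):
    """Average a class-conditional density over small centring errors, as in (4.48).

    profile    : 1-D sequence of range-bin values
    centre     : extracted target-centre bin
    p          : half-width; each candidate vector takes p bins either side
                 of the candidate centre (so d = 2p)
    density_fn : callable evaluating p(y_f(c+s) | D, Z_f = j) on a 2p-vector
    """
    total = 0.0
    for s in range(-s_max, s_max + 1):
        c = centre + s
        window = profile[c - p : c + p]   # the recentred vector y_f(c + s)
        total += density_fn(window)
    return total / (2 * s_max + 1)        # the 1/(2 s_m + 1) factor of (4.48)
```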
4.6 Altering the hyperprior distributions
4.6.1 Introduction

This section details how the hyperprior distributions are adjusted in an attempt to improve the generalisation properties of the mixture model classification system. In particular, changes are made to reflect the possible differences between training and operational/test data. The discussion starts with some details on the alterations that have been proposed and evaluated. It then goes on to describe a technique that could, potentially, be used to update the MCMC samples from a posterior distribution with one set of hyper-hyperparameters (defining the hyperprior distributions), to MCMC samples from a posterior distribution with a different set of hyper-hyperparameters, without re-running the MCMC training algorithm from scratch.
4.6.2 Approach for improving generalisation performance

The hyperprior distributions (see Section 4.3.2) are given by independent multivariate Gaussian distributions:

\[
m_{j,r}|(m_{j,0},\Psi_{j,0}) \sim N_d(m_{j,0},\Psi_{j,0}), \qquad (4.49)
\]

for j = 1, . . . , J and r = 1, . . . , Rj, where mj,0 and Ψj,0 are fixed hyper-hyperparameters. The hyperprior means mj,0 are set to be the appropriate class means, estimated from the labelled training data. The hyperprior covariance matrices Ψj,0 are set to be:

\[
\Psi_{j,0} = B_j + \mathrm{Diag}(\psi^2_{j,0}), \qquad (4.50)
\]

where Diag(ψ²j,0) denotes a d × d diagonal matrix with l-th diagonal element ψ²j,l,0, and the matrix Bj is related to the between-'subclasses' covariance matrix within class j. Here, 'subclasses' refers to different examples of a class: e.g. for the application to mobile battlefield targets, different vehicles from the same generic class; the same vehicle under different levels of equipment fit; or the same vehicle imaged under different conditions. Specifically, suppose that there are ηj 'subclasses' in the training data from class j, with mean vectors ζj,1, . . . , ζj,ηj. Then we set:

\[
B_j = \sum_{m=1}^{\eta_j} (\zeta_{j,m} - m_{j,0})(\zeta_{j,m} - m_{j,0})^{T}. \qquad (4.51)
\]
Note that (4.51) is not the 'standard' between-'subclasses' matrix, since there is no division by ηj. This reflects a desire to model the extra uncertainty due to different 'subclasses', which will be larger than that evidenced in the mean vectors alone. However, if there are a large number of 'subclasses' in the training data, a scaling factor will be required to prevent the matrix from becoming too large. Since each Bj is likely to be singular (the number of subclasses being less than the dimensionality of the data), the Diag(ψ²j,0) term is introduced to ensure that Ψj,0 is invertible. In practice, the terms ψ²j,l,0 are set to be the diagonal elements of the between-classes covariance matrix. However, this is a somewhat ad hoc choice.
For the purposes of this thesis, if there is additional information on the measurements that might be expected from operating data, then it is assumed to be in the form of the mean values from measurements of the data. To illustrate how this information could be used, suppose that there are nj additional mean vectors, vj,1, . . . , vj,nj, from class j, with each vector corresponding to a different 'subclass'. Then the hyperprior mean for this class is updated to:

\[
m_{j,0} = \frac{1}{n_j + 1}\left( m_{j,0} + \sum_{m=1}^{n_j} v_{j,m} \right), \qquad (4.52)
\]

where the mj,0 on the right-hand side denotes the default hyperprior mean, and the matrix Bj is updated as follows (with mj,0 now denoting the updated hyperprior mean):

\[
B_j = \sum_{m=1}^{\eta_j} (\zeta_{j,m} - m_{j,0})(\zeta_{j,m} - m_{j,0})^{T} + \sum_{m=1}^{n_j} (v_{j,m} - m_{j,0})(v_{j,m} - m_{j,0})^{T}. \qquad (4.53)
\]
Similarly to (4.51), if large numbers of 'subclasses' are available in the training data, or nj is large, (4.53) will need to be scaled (to prevent the matrix from becoming too large). The matrix Diag(ψ²j,0) required to ensure that the matrix Ψj,0 = Bj + Diag(ψ²j,0) is invertible is unchanged. Where such vague additional information is available and the hyper-hyperparameters are updated, the resulting hyperprior distributions are referred to as altered hyperprior distributions. The hyperprior distributions formed without using any of the additional information are referred to as the default hyperprior distributions. It is important to note that this particular hyper-hyperparameter updating scheme is only one of many possibilities for the given additional information. Furthermore, in real use the actual changes that could be made would, of course, depend entirely on what sort of extra information is available. In many circumstances the extra information is likely to be far superior to just the mean values of measurements from operating data. In the extreme case of the extra information consisting of actual data measurements, it would be more appropriate to incorporate this data into the training data rather than use it to alter the hyperpriors.
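The construction of the default and altered hyper-hyperparameters via (4.50)–(4.53), for a single class, can be sketched as below (illustrative code; the scaling safeguards discussed above are omitted, and the deviations in (4.53) are taken about the updated mean):

```python
import numpy as np

def default_hyperprior(subclass_means, psi2_diag):
    """Default (m_{j,0}, Psi_{j,0}) of (4.50)-(4.51) for one class j.

    subclass_means : (eta_j, d) mean vectors of the training 'subclasses'
    psi2_diag      : (d,) diagonal regularisation terms psi^2_{j,l,0}
    """
    m0 = subclass_means.mean(axis=0)       # class mean from labelled training data
    dev = subclass_means - m0
    B = dev.T @ dev                        # (4.51): note no division by eta_j
    return m0, B + np.diag(psi2_diag)      # (4.50)

def altered_hyperprior(subclass_means, extra_means, psi2_diag):
    """Altered (m_{j,0}, Psi_{j,0}) of (4.52)-(4.53), folding in the n_j
    additional 'subclass' mean vectors expected from operating data."""
    m0 = subclass_means.mean(axis=0)
    n_j = extra_means.shape[0]
    m0_new = (m0 + extra_means.sum(axis=0)) / (n_j + 1)        # (4.52)
    dev = np.vstack([subclass_means, extra_means]) - m0_new
    B = dev.T @ dev                                            # (4.53)
    return m0_new, B + np.diag(psi2_diag)
```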
4.6.3 Prior sensitivity

In Chapter 3 the values of the (fixed) hyperparameters were chosen so that the prior distributions carried little weight compared to the data (the likelihood) in the posterior distributions (see Appendix B.3). If the majority of these values are used in the new formulation, then changing the values of the hyper-hyperparameters would have little effect on the final posterior distributions. Thus, the hyperparameters must be set so that the exact forms of the prior distributions do have a noticeable effect on the final posterior distribution. However, this is not without its difficulties, because it makes finding appropriate values for the fixed hyperparameters more vital, since it is no longer possible to rely on the data dominating the priors. The most immediate way of increasing the sensitivity of the posterior distribution (and therefore the mixture model classifier) to the variable hyperparameters $m$ is to increase the size of the precision parameters $h_{j,r,l,0}$, since this has the effect of narrowing the prior distributions for the means $\mu_{j,r,l}$ (by
decreasing the variance, as can be seen in (4.9)). By examining the top line of the posterior density function $p(\mu, \Sigma, \theta, \pi, m|y)$ given in (4.16), it can be seen that increasing the value of $h_{j,r,l,0}$ assigns greater probability to values of $\mu_{j,r,l}$ close to $m_{j,r,l}$. Additionally, (4.22) shows that increasing $h_{j,r,l,0}$ has the effect of assigning more weight to the hyperparameters $m_{j,r,l}$ in the conditional posterior distribution for $\mu_{j,r,l}|(\sigma_{j,r,l}^2, m_{j,r,l}, y_{G_{j,r}})$.
The potential difficulty with increasing the values of the $h_{j,r,l,0}$ is that if they are set too large the hyperpriors might dominate the data (likelihood). This could be a particular problem if the number of mixture components $R_j$ for a class $j$ is large, since in this case there might not be much training data assigned to a particular mixture component at an iteration of the MCMC algorithm. In such cases $g_{j,r}$ would be small in (4.22), and therefore the hyperprior mean would dominate the data.
4.6.4 Updating the MCMC samples

Introduction

In the experiments conducted to date (and documented in Sections 4.8 and 4.9), the altered hyperprior results have been obtained from MCMC samples generated using the training algorithm detailed in Section 4.4. However, if a set of MCMC samples generated using the default hyperprior distributions has already been obtained, it may be possible to modify these samples to become samples from the posterior distribution with the altered hyperpriors, without re-running the entire MCMC algorithm. This would considerably reduce the computational expense of training the classifier for different hyperpriors. One potential technique for this more computationally efficient resampling is importance sampling [54, 168]. Importance sampling enables us to take samples from one distribution and convert them to samples from a second distribution. Thus, it enables us to take samples from a posterior distribution with a given prior distribution, and use them to obtain samples from a posterior distribution with a different prior.
Brief description of Importance Sampling

This section provides a brief description of importance sampling (justifications are provided in Section 8.3.3). The description supposes that we have a set of $n$ independent samples $\psi^{(1)}, \dots, \psi^{(n)}$ from a probability distribution with probability density function proportional to $g(\psi)$, but we are actually interested in making inference on a probability distribution with density function proportional to $f(\psi)$. If a set of unnormalised importance weights:
$$ w^{(s)} = f(\psi^{(s)})/g(\psi^{(s)}), \qquad (4.54) $$
is defined, the expectation of a function a(ψ) with respect to the distribution defined by f (ψ) can
92
CHAPTER 4. USE OF HYPERPRIORS
4.6. ALTERING THE HYPERPRIORS
be estimated by: a ¯f =
n
w(s) a(ψ (s) )
s=1
n
w(s) .
(4.55)
s=1
The accuracy of the estimate $\bar{a}_f$ depends on the variability of the importance weights. Degeneracy problems can arise if the weights vary too widely, since essentially this leads to the estimate being based on only the few points with the largest weights. Thus, for the technique to work well, the distribution defined by $g(\psi)$ is required to be a fairly good approximation to that defined by $f(\psi)$. Consider now the special case where the distribution defined by $f(\psi)$ is the posterior distribution for a likelihood function $l(x|\psi)$ and prior distribution $\pi_f(\psi)$, while the distribution defined by $g(\psi)$ is the posterior distribution for the same likelihood function, but a different prior distribution, $\pi_g(\psi)$. Then, the importance weights defined in (4.54) become:
$$ w^{(s)} = \pi_f(\psi^{(s)})/\pi_g(\psi^{(s)}). \qquad (4.56) $$
Thus, we have a method for adapting samples from a posterior distribution with one prior distribution, to samples from a posterior distribution with a different prior distribution. However, the original posterior distribution does need to be a good approximation to the new posterior distribution for the technique to work well, i.e. the two prior distributions should be similar.
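The prior-swap reweighting of (4.54)–(4.56) can be illustrated with a one-dimensional toy example. This is a sketch under stated assumptions: both priors are taken as unit-variance Gaussians differing only in mean, and the "posterior" samples are simply stand-in draws; all numerical values are illustrative.

```python
import numpy as np

def gauss_pdf(x, mean, var):
    """Univariate Gaussian density."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

# Stand-in samples psi^(1),...,psi^(n) from a posterior built with prior pi_g.
rng = np.random.default_rng(0)
psi = rng.normal(loc=1.0, scale=0.5, size=5000)

# (4.56): unnormalised weights pi_f(psi)/pi_g(psi) -- here pi_f = N(0.8, 1)
# and pi_g = N(1.0, 1), so the likelihood terms cancel.
w = gauss_pdf(psi, 0.8, 1.0) / gauss_pdf(psi, 1.0, 1.0)

# (4.55): weighted estimate of E_f[psi]; shifting the prior mean downwards
# pulls the estimate below the unweighted sample mean of roughly 1.0.
a_hat = np.sum(w * psi) / np.sum(w)
```

Because the two priors here are similar, the weights stay well behaved; the degeneracy discussion below concerns the case where they do not.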
Use of Importance Sampling when altering the hyperprior distributions

If the hyperprior distribution is changed from $p(m|m_0, \Psi_0)$ to $p(m|m_0', \Psi_0')$, then the samples obtained from the posterior distribution $p(\mu, \Sigma, \theta, \pi, m, Z, z|y)$ with the first hyperprior distribution can be used to make inferences on the posterior distribution with the second hyperprior distribution, by using the unnormalised importance weights:
$$ w^{(s)} = \frac{p(m^{(s)}|m_0', \Psi_0')}{p(m^{(s)}|m_0, \Psi_0)}, \qquad (4.57) $$
i.e.
$$ w^{(s)} = \prod_{j=1}^{J} \prod_{r=1}^{R_j} \frac{N(m_{j,r}^{(s)}; m_{j,r,0}', \Psi_{j,r,0}')}{N(m_{j,r}^{(s)}; m_{j,r,0}, \Psi_{j,r,0})}, \qquad (4.58) $$
where $m^{(s)} = \{m_{j,r}^{(s)}; 1 \le j \le J, 1 \le r \le R_j\}$, for $s = 1, \dots, N$, are the MCMC samples for the hyperparameters, drawn from the posterior distribution using the original hyperprior distribution. Using (4.36), the expression for the posterior classification probabilities for a single training data vector, under the original hyperprior distribution, can be expressed as:
$$ p(Z_t = j|D) \approx \frac{1}{N} \sum_{s=1}^{N} p(Z_t = j|y_t, \mu^{(s)}, \Sigma^{(s)}, \theta^{(s)}, \pi^{(s)}), \qquad (4.59) $$
where (4.37) gives:
$$ p(Z_t = j|y_t, \mu^{(s)}, \Sigma^{(s)}, \theta^{(s)}, \pi^{(s)}) = \frac{\theta_j^{(s)} \sum_{r=1}^{R_j} \pi_{j,r}^{(s)} \prod_{l=1}^{d} N(y_{t,l}; \mu_{j,r,l}^{(s)}, \sigma_{j,r,l}^{2\,(s)})}{\sum_{j'=1}^{J} \theta_{j'}^{(s)} \sum_{r=1}^{R_{j'}} \pi_{j',r}^{(s)} \prod_{l=1}^{d} N(y_{t,l}; \mu_{j',r,l}^{(s)}, \sigma_{j',r,l}^{2\,(s)})}. \qquad (4.60) $$
Using (4.55) together with (4.59), the training data classification probabilities for the mixture model approach using the altered hyperpriors can be approximated by:
$$ p'(Z_t = j|D) \approx \frac{\sum_{s=1}^{N} w^{(s)} p(Z_t = j|y_t, \mu^{(s)}, \Sigma^{(s)}, \theta^{(s)}, \pi^{(s)})}{\sum_{s=1}^{N} w^{(s)}}, \qquad (4.61) $$
where $p(Z_t = j|y_t, \mu^{(s)}, \Sigma^{(s)}, \theta^{(s)}, \pi^{(s)})$ is unchanged from (4.60) and $w^{(s)}$ is as defined in (4.58). The relevant expressions for classifying future (previously unseen) observations are obtained by modifying the equations given in Section 4.5.2. Using (4.38), the posterior classification probabilities under the altered hyperprior distributions can be approximated by:
$$ p'(Z_f = j|D, y_f) \propto p'(Z_f = j|D)\, p'(y_f|D, Z_f = j). \qquad (4.62) $$
The importance sampling expression for the first term of (4.62) is given by:
$$ p'(Z_f = j|D) \approx \frac{\sum_{s=1}^{N} w^{(s)} E(\theta_j|D, Z^{(s)})}{\sum_{s=1}^{N} w^{(s)}} \approx \frac{\sum_{s=1}^{N} w^{(s)}\, a_j^{(s)} \big/ \sum_{j'=1}^{J} a_{j'}^{(s)}}{\sum_{s=1}^{N} w^{(s)}}, \qquad (4.63) $$
or (usually) more appropriately is estimated from prior beliefs on the spread of future data between classes. The direct estimation approach to classifying future data uses the following importance sampling expression for the second term of (4.62):
$$ p'(y_f|D, Z_f = j) \approx \frac{\sum_{s=1}^{N} w^{(s)} \sum_{r=1}^{R_j} \pi_{j,r}^{(s)} \prod_{l=1}^{d} N(y_{f,l}; \mu_{j,r,l}^{(s)}, \sigma_{j,r,l}^{2\,(s)})}{\sum_{s=1}^{N} w^{(s)}}. \qquad (4.64) $$
The Rao-Blackwellisation based approach to classifying future data uses the following importance sampling approximation:
$$ p'(y_f|D, Z_f = j) \approx \frac{\sum_{s=1}^{N} w^{(s)} p(y_f|D, Z_f = j, m^{(s)}, Z^{(s)}, z^{(s)})}{\sum_{s=1}^{N} w^{(s)}}, \qquad (4.65) $$
with $p(y_f|D, Z_f = j, m^{(s)}, Z^{(s)}, z^{(s)})$ taken from (4.46). Both (4.64) and (4.65) use the expression for $w^{(s)}$ defined in (4.58).
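Given the per-sample class probabilities of (4.60) and the weights of (4.58), the reweighted classification probability (4.61) reduces to a weighted average over the MCMC samples. A minimal NumPy sketch, with a hypothetical function name and array layout:

```python
import numpy as np

def reweighted_class_probs(per_sample_probs, w):
    """Sketch of (4.61): importance-weighted posterior class probabilities.

    per_sample_probs : (N, J) array, row s holding the per-sample class
        probabilities p(Z_t = j | y_t, ...) from (4.60)
    w : (N,) unnormalised importance weights from (4.58)
    """
    w = np.asarray(w, dtype=float)
    # Weighted average of the per-sample probabilities, normalised by sum(w).
    return (w[:, None] * per_sample_probs).sum(axis=0) / w.sum()
```

With uniform weights this recovers the default-hyperprior estimate (4.59); non-uniform weights tilt the estimate towards samples favoured by the altered hyperprior.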
Degeneracy problems

The major problem with the importance sampling technique for producing samples from the posterior distribution defined using altered hyperpriors, given samples from the posterior defined using default hyperpriors, is the tendency to produce degenerate weights if the two hyperprior distributions are not similar enough. Specifically, when the importance weights are normalised, it may turn out that one weight tends to one, while the rest are all close to zero. It might be hoped that changes to the hyper-hyperparameters will be relatively small, leading to only a small change in the hyperprior distributions and therefore non-degenerate importance sampling. However, it will often be the case that the large dimensionality of the data measurements (and therefore the hyperprior vectors) is such that the weights will be degenerate. If this is the case (which will be readily apparent from an examination of the importance weights), then an entire re-run of the MCMC algorithm for the altered hyperpriors might be the simplest solution. An alternative approach lies in tempered transition [134, 135] and annealed importance sampling [136, 137] techniques, in which moves between two distributions are made via a sequence of appropriate intermediate distributions.
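One common way to quantify this degeneracy, beyond simple inspection of the weights, is the effective sample size of the normalised weights. The thesis itself does not name this diagnostic, so the sketch below is a standard supplement rather than the author's method:

```python
import numpy as np

def effective_sample_size(w):
    """ESS = 1 / sum(w_norm^2) for unnormalised importance weights w.

    Values near 1 indicate a single dominating weight (degeneracy);
    values near len(w) indicate well-behaved weights.
    """
    w = np.asarray(w, dtype=float)
    w_norm = w / w.sum()
    return 1.0 / np.sum(w_norm ** 2)
```

An ESS that is a tiny fraction of the number of MCMC samples would signal that a re-run of the MCMC algorithm, or a tempered/annealed scheme, is needed.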
4.7 Description of experiments
4.7.1 Introduction

The classifier is illustrated using real ISAR data, consisting of images of vehicles from three main types of mobile battlefield target: main battle tanks (MBTs); armoured personnel carriers (APCs); and air defence units (ADUs), referred to as classes 1, 2 and 3 respectively. The task is to classify images into one of the three broad/generic classes, based on limited training data examples from the classes. The datasets are a superset of those used in Chapter 3. An example of a vehicle from one of the datasets is given in Fig. 4.4.
Figure 4.4: Typical vehicle imaged to create the ISAR data set.

An example of an ISAR image from one of the training datasets is given in Fig. 4.5. The vertical axis represents cross-range and the horizontal axis range. The intensities in the image are related
to the amplitudes of the complex-valued ISAR returns.
Figure 4.5: Example of an ISAR image from the training dataset (range along the horizontal axis, cross-range along the vertical axis).

The data have been collected during two separate trials, held in 1997 and 2000. In both trials, images were collected for a range of target types within the generic classes. Additionally, many of the target types were imaged with a wide range of different configurations. The same radar, namely the MIDAS (Mobile Instrumented Data Acquisition System) radar, was used in both trials. However, the data from the year 2000 trial is considered to be of a higher quality than that generated in the 1997 trial. The radar produced millimetre wave (MMW) ISAR images of vehicles rotating on a turntable. The radar was operated at a frequency of 94 GHz, and the data were processed to a down-range resolution of 0.3 m and a cross-range resolution of 0.35 m, giving almost square image pixels. The cross-circular polarisation channel was selected to form the images. Each dataset consisted of the ISAR images obtained as the target rotated through 360◦. A typical dataset consisted of approximately 1200 images (multi-dimensional measurements). The angle of the turntable, and therefore the angle of observation of the target, was recorded for each image. Although such information cannot be used for test data, the observation angle for each training data image can be used legitimately for training classifiers. However, in our experiments the angle of observation was only used to initialise the MCMC algorithm (see Section 4.4.1).
4.7.2 Training, validation and test datasets

The datasets were separated into training, validation and test data [189]. Which subsets of these separate groups of data were actually used to train and then test the classifiers changed, depending on the nature of the problem being addressed. Tables E.1 and E.2 describe the training and validation datasets respectively. Tables E.3 and E.4 describe the test datasets.
4.7.3 First scenario

In the first scenario the training data consisted of three datasets, with one from each of the three generic classes. The actual datasets used were dataset 13 (MBT of type MBT1), dataset 17 (APC
of type APC1) and dataset 31 (ADU of type ADU1), all collected during the year 2000 trial. The validation and test datasets were split into five different groups, with assignments of datasets to the groups depending on the relationship of the vehicle, model and configuration to the training data vehicles (where, for the purposes of these experiments, the same vehicles imaged in different trials were treated as different models). The groups can be described as:

G1. Same models and types of vehicles, relatively similar configurations.
G2. Same models and types of vehicles, larger configuration differences.
G3. Different models or types of vehicles, relatively similar configurations.
G4. Different models or types of vehicles, larger configuration differences.
G5. Different radar depression angle (default was 10◦).

Assignment to these groups is now illustrated for ADUs imaged at the same radar depression angle as the training data. In particular, the case where our training data example of an ADU is an ADU of type ADU1 from the year 2000 trial, with the tracking radar ready for operation, is considered. Datasets of the same vehicle, in the same configuration, would be placed into group G1. The same vehicle, but with the radar down, would be placed into group G2. Different types of ADU, or an ADU of type ADU1 from the 1997 trial, would be placed into groups G3 or G4, with G3 being chosen if the vehicle was imaged in the same configuration as the training data example (with the radar ready for operation), and G4 being chosen otherwise. The splitting of the validation and test datasets into the aforementioned groups is given in Tables E.5 and E.6 respectively. In Table E.6, some effort has been made to group together the datasets that correspond to the same vehicle, or, in the case of the ADU, those with the radar up and those with the radar down.
It goes without saying that there is ambiguity in whether particular configuration differences should lead to a dataset being placed into group G2 rather than group G1 (in the case of the same type of vehicle), or group G4 rather than group G3 (in the case of a different type of vehicle). However, as long as this is borne in mind when interpreting the results, it is not a problem. For the test datasets assigned to group G5, it turns out that if the radar depression angle were the same as that used for the training data, they would all belong to group G3. Thus, it is not valid to judge the effects of changes in depression angle by considering the results for group G5 alone. At this stage, it is important to note that most standard classifier assessments would concentrate on group G1 test vehicles only. This is particularly the case for ATR applications (see Chapter 2). However, by introducing group G2 we are able to assess the more realistic scenario of different configurations of the training data vehicles. The introduction of groups G3 and G4 is related to the military problem of trying to classify a new enemy vehicle, in situations where the training data consist of older examples/models of vehicles only. A difficulty with the datasets is that they are not evenly divided between the three generic classes of vehicles. This can be seen in Table E.7, which documents the number of datasets from each class, within each of the groups, for both the validation and test sets. It is readily apparent that there are
a lot more MBTs (class 1) than any other class. Thus, single measures of classification performance may not be useful; e.g. in the extreme case of a group containing datasets from only one class (such as group G5), the best classification performance for that group would come from a classifier that classifies all objects as that single class, regardless of measurements, which is not desirable.
4.7.4 Second scenario

In the second scenario a few extra datasets have been added into the training data, in addition to the original training datasets of the first scenario. Thus, the second scenario is also referred to as experiments with the extended training set. Specifically, an extra example of each generic vehicle class was added to the training data. These are dataset 38 (an ADU of type ADU1, but with the radar down), dataset 77 (an MBT of type MBT3), and dataset 113 (an APC of type APC2). The validation and test datasets for this second scenario could be re-arranged into groups that reflect their relationships to the extended training data (i.e. groups G1’, G2’, G3’, G4’ and G5’, following the same rules as the previous groups). However, for comparison purposes with the first scenario, the allocation of datasets to groups remains the same as before.
4.7.5 Pre-processing

The pre-processing for the mixture model experiments is the same as that in Chapter 3, and is discussed in Section 3.7.2. Specifically, the original complex ISAR images are converted into complex range profiles by averaging over cross-range bins. Real-valued range profiles are obtained by taking the absolute value of the complex number in each range bin. After extraction of the target from the range profile (by determining the target centre), this results in 40-dimensional input data. Plots of the data are provided in Section 3.7.2. Each extracted range profile is normalised by a (linear) scaling factor, so that within each range profile the average amplitude of the extracted range bins (components of the data vector) is one. Such a normalisation is needed because the power output of the radar is unlikely to be constant, leading to variations in the intensities of the radar returns unconnected with differences between the vehicles. Furthermore, weather conditions and smoke will alter the power of the return signals received by the radar. Since the mixture components are modelled by Gaussian distributions, the logarithm of each element of the normalised extracted range profile is taken, in an effort to pre-process the data to be more Gaussian-like.
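The pipeline described above (cross-range averaging, amplitude extraction, mean-amplitude normalisation and log transform) might be sketched as follows. The function name is hypothetical, and the 40-bin target extraction step is omitted for brevity:

```python
import numpy as np

def preprocess_isar(image, eps=1e-12):
    """Sketch of the pre-processing pipeline, assuming a complex ISAR
    image with cross-range along axis 0 and range along axis 1.
    Target extraction/recentring is omitted.
    """
    # Complex range profile: average over cross-range bins, then take
    # the absolute value in each range bin.
    profile = np.abs(image.mean(axis=0))
    # Normalise so the average amplitude across range bins is one.
    profile = profile / profile.mean()
    # Log-transform towards Gaussianity (eps guards against log(0),
    # an assumption not specified in the thesis).
    return np.log(profile + eps)
```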
4.8 Experimental results for first scenario
4.8.1 Introduction

This section presents the classification results for the first scenario experiments introduced in Section 4.7. Detailed discussions are deferred to Appendix D.2.
As baseline comparisons, two standard classifiers are also assessed. These are a nearest neighbour classifier [189] applied to the extracted range profiles, and an independent component Gaussian classifier (i.e. a Gaussian classifier with diagonal covariance matrices) [189] applied to the original ISAR images (after recentring of each target). The Gaussian classifier has been applied using diagonal covariance matrices because of the high dimensionality of the ISAR images. In the interests of brevity, for the baseline classifiers, full misclassification vectors are not shown for each test dataset. Instead, only the correct classification percentages are provided (which, in a slight abuse of notation, are referred to as classification rates). For the first scenario, full misclassification vectors are provided for the mixture model classifiers. Performance assessment is based on classification rate (and therefore, equivalently, misclassification rate). However, it should be noted that classification rate is only a very simple indicator of performance [64], not least because it treats all misclassifications with equal weight. In practice, misclassification costs will rarely be equal and often will be unknown for the operating conditions.
4.8.2 Mixture model classifiers

Introduction

The performance of the developed Bayesian Gaussian mixture model classifier has been assessed, using both default and altered hyperprior distributions (see Section 4.6.2). The additional information used to create the altered hyperprior distributions came from four additional training datasets collected during the year 2000 trial. These were datasets 1, 21, 26 and 38. In particular, this information consisted of an example of the training data MBT in a different configuration, two examples of an MBT of type MBT2 (in different configurations) and the training data ADU in a different configuration (with the tracking radar down). Unfortunately, no additional APC training datasets from the year 2000 trial were available. For all four additional datasets, the information used was the mean of the dataset, in the manner described in Section 4.6.2. The individual measurements from the additional datasets were not made available, so could not be used as extra training data (labelled or unlabelled).

For both hyperprior distributions, 18 mixture components have been used for each class mixture distribution. A justification for this choice of model order can be obtained from the graphs in Fig. F.1, which present training and validation dataset classification results for the proposed mixture model approach (using default hyperpriors), for a variety of different model orders. Specifically, the results for mixture models with 3, 6, 12, 15, 18, 24 and 48 components per class are displayed. The top left graph shows the average training data classification rate against model order. The remaining three graphs show the validation classification rates against model order, within the first two groups (G1 and G2), separated by class. The top right graph corresponds to the MBT of group G1, the bottom left to the APC of group G2 and the bottom right to the ADU of group G2.
The validation dataset results have been obtained using Rao-Blackwellisation based classification, with no shifts in the recentred data. It can be seen that by the time the model order has been increased to 18 mixture components per class, there is little improvement in performance as the model order increases further. Thus, since smaller model orders are more computationally efficient, 18 mixture components have been used in our documented experiments. As noted in Section 3.4.2 (discussing
model order selection for standard Gaussian mixture models), this is not an entirely satisfactory treatment of model selection, not least because it relies on a subjective choice by the user. Future research could address more sophisticated mechanisms for model order selection. For both the default and altered hyperprior approaches, 400 samples were drawn from the posterior distribution, by sampling every 5th iteration of the MCMC algorithm after a burn-in period of 5000 iterations. The number of samples drawn was rather small for the dimensionality of the distribution being considered, and better modelling of the densities (at the expense of increased computational cost) may be possible if more samples are used. The values of the fixed hyperparameters are detailed in Appendix D.1. Initialisation of the MCMC algorithm was based upon the angle of observation of the targets. To reduce the effect of random MCMC sampling effects, each experiment has been repeated five times. The presented results are the average of these five sets of results.

In Tables E.8 and E.9, the training data misclassification matrices for the Bayesian Gaussian mixture model classifiers are documented, using default and altered hyperpriors respectively. The method of classification is that described in Section 4.5.1. The overall classification rates are 98.4% using the default hyperprior distributions and 98.3% using the altered hyperpriors. This indicates that the use of altered hyperpriors has not had a significant adverse effect on the training of the mixture models. It should be noted that we would not expect the use of altered hyperpriors to give better training set classification performance than default hyperpriors, since the information used to change the hyperpriors is, by its nature, different from the information contained in the training data.
Posterior class probabilities for the validation and test data have been calculated using both direct estimation from the MCMC samples and the Rao-Blackwellisation based method. In all cases, classifications have been made by assigning to the class with the largest posterior probability, i.e. assuming equal losses for all misclassifications. Two different magnitudes of shifts in the recentred data (see Section 4.5.3) have been used: none, and up to ±1, where the zero shift assumes that the extracted target centre is completely correct. In total, for the validation and test data, there are therefore eight sets of mixture model results presented (all using 18 mixture components for each class). These are:

MM1) Default hyperpriors, using direct estimation, with no shifts in the recentred data.
MM2) Default hyperpriors, using direct estimation, with shifts up to ±1 in the recentred data.
MM3) Altered hyperpriors, using direct estimation, with no shifts in the recentred data.
MM4) Altered hyperpriors, using direct estimation, with shifts up to ±1 in the recentred data.
MM5) Default hyperpriors, using Rao-Blackwellisation based approach, with no shifts in the recentred data.
MM6) Default hyperpriors, using Rao-Blackwellisation based approach, with shifts up to ±1 in the recentred data.
MM7) Altered hyperpriors, using Rao-Blackwellisation based approach, with no shifts in the recentred data.
MM8) Altered hyperpriors, using Rao-Blackwellisation based approach, with shifts up to ±1 in the recentred data.
Validation set results

The misclassification vectors for the mixture model classifiers applied to the validation datasets are documented in Tables E.10 and E.11 for the direct estimation and Rao-Blackwellisation based approaches, respectively. In Fig. F.2, the individual classification rates for each of the mixture model classifiers applied to the validation datasets are displayed, separated by group. With the exception of dataset 153, an MBT of type MBT5 in group G4, the validation set results are all quite good. This is particularly the case for the first two groups and dataset 29, an MBT of type MBT2 in group G3. The poor performance on dataset 153 is not surprising, since the type MBT5 vehicle being imaged is radically different to the training data example of an MBT. Unfortunately, the validation datasets do not cover all groups and classes, so a full performance assessment is not possible. In terms of classification rate, it would appear that using Rao-Blackwellisation does not give a performance advantage over the direct estimation approach. Indeed, for datasets 79 and 153 in particular (which are the datasets with the lowest classification performance) the classification rates degrade using Rao-Blackwellisation. Although this result appears surprising at first, it should be noted that the theory underlying the Rao-Blackwellisation approach is based on the assumption that the training data is truly representative of the operational/test data (i.e. group G1). Thus, degraded performance on datasets from groups G3 and G4 (which include datasets 79 and 153) is not a sign of a theoretical shortcoming. Furthermore, it should be noted that the performance assessment is based on classification rates only, and not on the accuracy of the class probability estimates. Integrating over shifts in the recentred data, to allow for uncertainty in the position of the extracted target, improves the classification rates for most of the datasets, but not for all.
This is particularly evident in group G3, for which dataset 29 has improved classification performance when shifts are used, while dataset 153 has degraded performance. It is likely that for some datasets the recentring procedure produces accurately aligned measurements, and therefore any shifting will only degrade the performance, while for the majority of datasets the recentring procedure introduces misalignments, whose effect can be reduced by integrating over shifts. Using altered hyperprior distributions rather than default hyperpriors is seen to give no performance advantage for groups G1 and G2. However, this is unsurprising, since the configuration and vehicle changes used to alter the hyperpriors are not present in the validation data from groups G1 and G2. For dataset 29 (the MBT of type MBT2 in group G3) there is a slight improvement in classification performance using the altered hyperpriors, especially for the Rao-Blackwellisation based approach. However, since this corresponds to a vehicle that was used to alter the hyperpriors, the insignificant size of the improvement is disappointing. Furthermore, for the direct estimation approach, this improvement comes at the expense of a degradation in performance for dataset 79 (the MBT of type MBT3 in group G3). However, for the Rao-Blackwellisation based approach, the performance on dataset 79 is better using the altered hyperpriors rather than the default hyperpriors. There is a larger improvement when altered hyperpriors are used for dataset 153, the MBT of type
MBT5 in group G4. However, even the improved performance is very poor, with the target being classified as an APC for the majority of measurements.
Test set results

The misclassification vectors for the mixture model classifiers applied to the test datasets are documented, for the direct estimation approach, in Tables E.12 through to E.16 (separated by test data group). For the Rao-Blackwellisation based approach, Tables E.17 through to E.21 document the misclassification vectors (again separated by test data group). The results in Tables E.12 through to E.21 allow examination of the full classification performance of the mixture model classifiers, but are too dense for meaningful comparisons. Furthermore, the performance varies from dataset to dataset, which makes overall conclusions hard to draw. However, some examination of performance can be conducted, and this is detailed in Appendix D.2.1. Generally, the results are good for groups G1 and G2, but fall away for some of the different vehicles in groups G3, G4 and G5. Typically, the MBTs with the worst performance in groups G3, G4 and G5 are manufactured in countries different to the origin of the training data MBT. Rao-Blackwellised classification would appear to degrade classification performance for groups G3, G4 and G5, as compared to direct estimation. This agrees with the earlier comment that the Rao-Blackwellisation approach is based on the assumption that the training data is truly representative of the operational/test data. The more sophisticated estimation of the posterior class probabilities offered by the Rao-Blackwellisation based approach applied to test data similar to the training data does not appear to translate into superior classification results for test data vehicles that differ from the training data. However, it is important to bear in mind that although Rao-Blackwellisation leads (generally) to poorer classification rate performance, the estimates of the posterior class probabilities might actually be more accurate [64].
Disappointingly, the use of altered hyperprior distributions does not lead to a significant classification performance increase, compared to the use of default hyperpriors. Although there are classification rate increases for some of the vehicle and configuration changes, there are also some examples in which the procedure leads to a decrease in performance. Furthermore, even when the configuration and vehicle changes are similar to those present in the data used to alter the hyperprior distributions, the increases in classification performance are small.
4.8.3 Mixture model classifiers used in comparisons
Ideally, using the validation datasets, a single “best” mixture model classifier with default hyperpriors and a single “best” mixture model classifier with altered hyperpriors would be selected for test data comparisons with each other and the baseline classifiers. However, the nature of the validation datasets and their results makes this inappropriate, not least because, for groups G3 and G4, we would be biasing our choice towards classifiers which are good on the MBT class. Instead, four mixture model classifiers have been selected for future comparisons:
CHAPTER 4. USE OF HYPERPRIORS
4.8. RESULTS FOR FIRST SCENARIO
C1) Default hyperprior Bayesian mixture model classifier, using direct estimation, with shifts of ±1 in the recentred data (i.e. MM2).
C2) Default hyperprior Bayesian mixture model classifier, using the Rao-Blackwellisation based approach, with shifts of ±1 in the recentred data (i.e. MM6).
C3) Altered hyperprior Bayesian mixture model classifier, using direct estimation, with shifts of ±1 in the recentred data (i.e. MM4).
C4) Altered hyperprior Bayesian mixture model classifier, using the Rao-Blackwellisation based approach, with shifts of ±1 in the recentred data (i.e. MM8).
The choice of shifts of up to ±1 in the recentred data (rather than no shifts) reflects the validation data observation that shifting the recentred data improves classification performance for most of the validation datasets.
4.8.4 Comparisons with the baseline classifiers on the validation data
Table E.22 documents the validation dataset classification rates for the four selected mixture model classifiers and the two baseline classifiers (independent component Gaussian classifier applied to the recentred ISAR images and nearest neighbour classifier applied to the extracted range profiles). Figure F.3 displays the results averaged for each class within each group. The label G#1-#2 corresponds to class #2 of group G#1, where #1 ∈ {1, 2, 3, 4, 5} and #2 ∈ {1, 2, 3}. Thus, the label G1-1 corresponds to the MBT class of group G1, the label G2-2 corresponds to the APC class of group G2, and the label G2-3 corresponds to the ADU class of group G2, etc. Empty classes within a group are ignored. With one exception (namely Rao-Blackwellisation based classification with default hyperpriors for the MBT of group G4), the independent component Gaussian classifier is outperformed by the mixture model classifiers, in terms of average class results. The nearest neighbour validation set performance is very good for the validation datasets from groups G1 and G2, but falls away (on average) for the MBTs from groups G3 and G4. This hints at possible poor generalisation performance for the nearest neighbour classifier. Within the mixture model results, the use of altered hyperpriors rather than default hyperpriors only leads to a significant increase in performance for the MBT from group G4. However, even the improved performance is poor.
4.8.5 Comparisons with the baseline classifiers on the test data
In Tables E.23 to E.27, the classification rates for the test datasets are documented, separately for each group. These results are displayed pictorially in Figures F.4 to F.8. It should be noted that in the figures the scale of the vertical axis displaying the classification rate changes with test data group. Thus, comparisons between groups are awkward. However, since the different groups reflect different questions being asked of the classifiers, comparison between groups would be dangerous anyway.
It is apparent that the relative performances of the classifiers vary with the datasets. Based on the results, it is hard to make performance judgements between the classifiers for groups G1 and G2. Furthermore, the results for groups G3 and G4 vary rapidly with vehicle and configuration changes. However, Appendix D.2.2 attempts to make some performance comparisons between the mixture model classifiers and the two baseline classifiers. In Fig. F.9, the results have been averaged for each class within each group. Again, G1-1 corresponds to the MBT class of group G1, G1-2 corresponds to the APC class of group G1 and G2-3 corresponds to the ADU class of group G2, etc. It should be cautioned that such an averaging of results can be misleading since, as detailed in Appendix D.2.2, the results vary rapidly from dataset to dataset. However, it can provide a rough guideline to overall performance. With the exception of the MBT class of group G2 (G2-1) and the ADU class of group G3 (G3-3), the best mixture model classifiers can be seen to have better averaged class performance than the independent component Gaussian classifier, for groups G1, G2 and G3. The poorer mixture model classifier results for the MBT class of group G2 (G2-1) are a result of their very poor performance on dataset 3 (in Appendix D.2 we postulate that this might be due to configuration changes above the turret being mistaken for the tracking radar of an ADU). The mixture model classifiers actually have worse performance than the independent component Gaussian classifier for the MBT and ADU classes of group G4 (G4-1 and G4-3 respectively). However, for the ADU class of group G4, it is important to note that all the classifiers are performing poorly.
Based on these results, it would appear that with the exception of configurations and vehicles which are very different to the training data, the use of mixture models on the range profiles is preferable to an independent component Gaussian classifier on the full ISAR images. Comparison of the mixture model approach with the nearest neighbour classifier is made easier by studying Figure F.9 in conjunction with Figure F.10. Figure F.10 documents the improvement in the average class results for each group, for the mixture model classifiers using the direct estimation approach (default and altered hyperpriors), compared to the nearest neighbour classifier. A similar graph is obtained if the Rao-Blackwellisation based approach is used for the mixture model classifiers. However, the justification for considering only the direct estimation approach is provided by the validation set observation that the Rao-Blackwellisation based approach does not lead to an increase in classification performance, as compared to direct estimation. The relative performances of the mixture model classifiers compared to the nearest neighbour classifier vary depending on the group and class. In terms of classification rate, for each class there exists a group for which the mixture model classifiers outperform the nearest neighbour classifier and also a group for which the nearest neighbour classifier outperforms the mixture model classifiers (although the latter statement is marginal for the APC class). Similarly, within groups G1, G2 and G4, there is a class for which the mixture model classifiers outperform the nearest neighbour classifier and also a class for which the nearest neighbour classifier outperforms the mixture model classifiers. The mixture model classifiers outperform the nearest neighbour classifier for all three classes in group G3 and also the single MBT class of group G5. 
Thus, with the exception of the MBT class of group G2 (due to the poor mixture model performance for dataset 3), it would appear
that the mixture model classifiers should be preferred to the nearest neighbour classifier, in terms of average classification rate. However, the caution should be added that this averaging process can be misleading, since the performance varies from dataset to dataset. The preference for the mixture model classifiers is strengthened when it is noted that the mixture model classifiers produce estimates of the posterior probabilities of class membership, whereas the nearest neighbour classifier produces class membership decisions only. Advantages in producing posterior class probabilities are covered in Section 2.2.5. Figure F.11 displays the improvements in the average class results for each group, using altered hyperpriors rather than default hyperpriors. The graph on the left-hand side displays the results using the direct estimation approach to classification, while the graph on the right-hand side is for the Rao-Blackwellisation based approach. For the direct estimation approach, the altered hyperpriors lead to improved classification performance for the ADU class (across all four groups that contain an example of an ADU). The general trend for the MBT class is also for improved performance. No extra information was made available from APC datasets to use in altering the hyperprior distributions, which might explain why using altered hyperpriors results in poorer performance for the APC class (across all three groups which contain it). With the exception of group G3, the comparison between altered hyperpriors and default hyperpriors using the Rao-Blackwellisation based approach is similar to that using direct estimation. However, for group G3 the ADU has worse performance using altered hyperpriors, while the APC actually has better performance (despite the fact that no APC datasets were used to alter the hyperprior distributions).
Although, on balance, it could be argued that the use of altered hyperpriors leads to improved classification performance, the results from the comparison between altered and default hyperpriors are somewhat disappointing. In particular, where there are increases in performance, they are generally small. Furthermore, as the in-depth examination of Appendix D.2.1 shows, there are no examples where the use of altered hyperprior distributions leads to a qualitative change in performance (e.g. from average to good).
4.9 Experimental results for second scenario
4.9.1 Introduction
The classification performance is now assessed for the extended training set of the second scenario (see Section 4.7.4). Since the scope of the training data has been increased, there is no longer a need to alter the hyperprior distributions, so only a single default set of hyperpriors is assessed. The values for the hyper-hyperparameters defining the hyperpriors are chosen by following the prescription provided in Section 4.6.2 (however, to allow for the larger number of ‘subclasses’ in the data, the terms in (4.51) are normalised by the ‘subclass’ sizes ηj). As discussed in Section 4.5.2, the Rao-Blackwellisation based approach to classification increases in computational complexity as the number of training vectors increases. Therefore, for this increased training data scenario, only
the direct estimation approach to classification has been used. The extra training datasets will lead to increased within-class variability in the training data. Therefore, in the initialisations of the MCMC algorithm (see Section 4.4.1) we have used a k-means clustering algorithm for the mixture components, rather than an approach based on the angle of observation. Only very limited experiments have been conducted to assess the number of mixture components to use in each class mixture distribution. As would be expected, to allow for the greater variability in the training data, larger numbers of mixture components were required (compared to the first scenario). The left-hand graph of Fig. F.12 documents the average extended training data classification rate against model order, based on 24, 48 and 72 mixture components for each class. The right-hand graph of Fig. F.12 documents the validation classification rates, for each class in the first two groups (groups G1 and G2). No shifts have been used for the recentred data. Based on these results, and a desire to choose smaller model orders if possible, 48-component class mixture distributions have been selected. For the 48-component mixture distributions, 250 MCMC samples have been drawn, by sampling every 20th iteration of the MCMC algorithm, after a burn-in period of 5000 iterations. The smaller number of MCMC samples drawn compared to the first scenario reflects not only the increased computational cost, but also the increased memory and storage costs of a larger model order (48 components compared to 18 in the first scenario). The fixed hyperparameter values used follow the prescription provided in Appendix D.1. Table E.28 displays the training data misclassification matrix, with results separated by dataset. The overall classification rate is 96.2%.
For the subset of training data used in the first scenario, the classification rate is 96.9%, which is slightly lower than the training set results that were obtained for the first scenario (98.4% and 98.3% for the default and altered hyperprior mixture model classifiers, respectively). This reflects the fact that training a classifier is harder for a more diverse set of training data. To classify future data (i.e. test data), shifts of up to ±1 in the recentred data have been used. The baseline classifiers are again an independent component Gaussian classifier on the recentred ISAR images and a nearest neighbour classifier on the range profiles. Both are trained using the extended training set. However, we also compare the performance with two of the mixture model classifiers from the first scenario (i.e. using the original training set). The first of these mixture model classifiers uses default hyperpriors, while the second uses altered hyperpriors. Both use the direct estimation approach to classification, with shifts of ±1 in the recentred data.
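As a concrete illustration of the sampling scheme described above (burn-in of 5000 iterations, retaining every 20th draw to give 250 samples), the following sketch shows the thinning arithmetic, together with a Monte Carlo average of likelihood values over the retained samples. The function names are hypothetical, and the averaging is only one plausible reading of how posterior quantities are estimated from MCMC samples, not the exact implementation used in this work.

```python
def thin_samples(chain, burn_in, thin):
    """Discard a burn-in period, then keep every `thin`-th draw."""
    return chain[burn_in::thin]

def mc_average_likelihood(per_sample_likelihoods):
    """Monte Carlo estimate of p(x|j): the average of the class-conditional
    likelihood evaluated at each retained MCMC parameter sample."""
    return sum(per_sample_likelihoods) / len(per_sample_likelihoods)

# A 10000-iteration chain with burn-in 5000 and thinning 20 (as in the text)
# retains (10000 - 5000) / 20 = 250 samples.
retained = thin_samples(list(range(10000)), 5000, 20)
```

With these settings the retained-sample count matches the 250 samples quoted in the text.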
4.9.2 Validation data
Table E.29 documents the classification rates for the validation datasets, while Figure F.13 shows the results averaged for each class within each group, using the same notation as Figure F.3. By looking at the three mixture model classifiers we can see that the extra training datasets have led to degraded performance for the majority of validation datasets. The only dataset with improved performance is dataset 79 of group G3, an MBT of type MBT3. This is an example of a dataset that
belongs to group G1’ under the extended training sets, so the dramatic improvement in performance is not surprising. Since we have already conducted a performance comparison between the mixture model classifiers and the independent Gaussian and nearest neighbour classifiers on the first scenario, there is little to be gained from studying the relative validation dataset performances for the three classifiers trained on the extended training sets.
4.9.3 Test data
In Tables E.30 to E.34 the classification rates on the test datasets are documented, separately for each group. These results are displayed pictorially in Figures F.14 to F.18. In Figure F.19 the results have been averaged for each class within each group. As in the first scenario, the relative performances of the mixture model classifier, independent component Gaussian classifier and nearest neighbour classifier vary from dataset to dataset, so there is little to be gained from another analysis of the results. Instead, for this second scenario, we concentrate on the change in performance obtained using the extended training sets compared to the original training sets. A detailed examination of the changes in classification rates for the mixture model classifiers is provided in Appendix D.3. The graphs in Figure F.20 display the change in average class performance, within each group, for the extended training set mixture model classifier compared to the original training set mixture model classifiers (the left-hand side using default hyperprior distributions and the right-hand side using altered hyperprior distributions). It can be seen that only the APC class of group G3 (i.e. G3-2), the ADU class of group G4 (G4-3) and the MBT class of group G5 (G5-1) have improved averaged class performance using the extended training sets. However, the averaging over datasets removes much of the relevant information, so the reader is referred to Appendix D.3 for a full discussion. It would appear that use of an extended training set has not led to improved generalisation performance. Specifically, although there are some performance increases (especially when the performance for individual datasets is examined), these correspond to vehicles (and configurations) added to the training data, and are therefore not examples of improved generalisation.
The left-hand side of Figure F.21 shows the improvement in average class performance, within each group, for the independent component Gaussian classifier using the extended training set, compared to the independent component Gaussian classifier using the original training set. The right-hand side of Figure F.21 shows the equivalent graph for the nearest neighbour classifier. Although we are not interested in delving into the results for these baseline classifiers, we can see that, as with the mixture model classifiers, there is generally a degradation in performance on the first two groups when the extended training set is used. For the independent component Gaussian classifier this degradation is quite large for some classes. This reflects the inability of an independent component Gaussian classifier, with its small number of free parameters, to model complicated structures in a diverse training set. Where the test data contain a vehicle type that was added to the extended training set (such as the APC class of group G3) there is, as would be expected, a dramatic increase in performance.
4.9.4 General comment
Perhaps the most pertinent point to take away from the results presented for this second scenario is the difficulty of the target recognition problem that has been tackled. Even with an increased training set, if the test configurations or vehicles are different to those used in the training data, classification into generic classes based on a single ISAR image (or a feature such as a range profile derived from the image) is extremely difficult. In some cases this difficulty is likely to be due to the fact that different ‘subclasses’ of a generic class are very different. This is particularly relevant if the ‘subclasses’ correspond to vehicles from different countries (which was the case for some of the presented datasets).
4.10 Summary and recommendations
This chapter has reported research attempting to incorporate generalisation procedures for situations where training data are incomplete (in that they do not characterise the expected operating conditions) into a Bayesian Gaussian mixture model approach to discrimination. The procedure adopted consists of varying the hyperprior distributions of some of the parameters of the mixture models, to match expected deviations of operating data from training data. This requires vague knowledge of some of the likely measurements from the operating data. For the purposes of this work, mean values of new examples of the classes have been used. The approach has been applied to ISAR images of mobile battlefield vehicles, with the aim of classifying each (suitably pre-processed) image into the generic classes MBT, APC and ADU. The wide range of different configurations and vehicles covered by the datasets makes it hard to come up with definitive performance statements for the mixture model technique. However, on balance, the mixture model classifiers (applied to range profiles extracted from the imagery) provide better classification performance than both a nearest neighbour classifier on the range profiles and an independent component Gaussian classifier on the original ISAR images. Although the use of altered hyperpriors can lead to improved classification performance in some circumstances, there is no major performance improvement. In particular, where there are increases in performance, they are generally small. Furthermore, no examples were found for which the use of altered hyperprior distributions provides a qualitative change in performance (e.g. from average to good). The generalisation performance across to different types of vehicle within the same generic class was often disappointing. This fact forms part of the motivation for the work presented in Chapters 6, 7 and 8. 
Each of those chapters investigates how extra information or data can be used to assist the classification of targets, as opposed to relying solely on single sensor measurements. Chapter 6 looks at how extra training data can be utilised if procedures can be developed to generalise target classification between sensors. Chapter 7 looks at how sensor measurements can be supplemented by contextual information and domain specific knowledge. Chapter 8 looks at how prior targeting information can be used to aid later classifications. There are many possible avenues for future work, based upon the research presented in this
chapter. The hyperprior distributions were altered using the mean values of new examples of the classes (under the assumption that this was the only available information). Future work could look at using different extra information to alter the parameters of the hyperprior distributions, and possibly different distributional forms for the hyperpriors. It is likely that the very conservative extra information that has been used is one of the causes of the disappointingly small classification rate increases obtained using altered hyperprior distributions. More efficient techniques for updating samples from the posterior distribution defined using one hyperprior distribution to become samples from the posterior distribution defined using a different hyperprior distribution would also be an interesting area for further study. Future work could address the model selection issue of the number of mixture components to use in each mixture model. As well as issues with overfitting of the training data, the choice of model order should take into account the increased computational cost of using larger models for target recognition. It would also be of interest to assess the effect of different pre-processing feature extraction techniques, prior to application of the mixture model density estimation procedure. The mixture model approach has been designed to incorporate unlabelled (as well as labelled) training data. However, no assessment of the benefits of the use of unlabelled data has been presented. It would be interesting to see what performance advantages can be gained from using unlabelled training data, and to determine the situations where such unlabelled data would be available. To date, the experiments have concentrated on classifying the ISAR data into the generic classes of MBT, APC and ADU. It would be of interest to assess the performance of the mixture models when classifying into specific types within these generic classes (which should be an easier problem).
Additionally, alternative methods for assessing the performance could be investigated. The ultimate aim is to use the probabilities in a multi-level model incorporating contextual information and reflecting the whole decision making process. Thus, accurate class probability estimates are of more importance than the actual classification rate.
CHAPTER 5
BAYESIAN GAMMA MIXTURE MODEL APPROACH

5.1 Introduction
5.1.1 General
This chapter formulates a gamma mixture model approach to ATR of military targets from radar returns. Specifically, a density estimation approach to classification is considered, with the class-conditional likelihoods given by gamma mixture models. The justification for using gamma rather than Gaussian mixture models (which were used in previous chapters) arises from a physical consideration of radar returns. The approach is applied to ATR of military ships from radar range profiles (RRPs). Despite the radar ATR justification for the use of gamma mixture models, the technique is not limited to this specific application, or indeed solely to ATR problems. As with previous chapters, the generic discrimination problem [189] considered is one where we are given a set of training data consisting of class labelled measurements (and possibly some unlabelled measurements), and are then required to assign a previously unseen object to one of the classes on the basis of measurements made of that object. The approach adopted aims to provide, for measurement data x and classes j, Bayesian estimates of the parameters of gamma mixture model class-conditional probability densities (likelihoods) of the data, p(x|j). Posterior probabilities of class membership, p(j|x), are then obtained using Bayes’ theorem, p(j|x) ∝ p(x|j)p(j), where p(j) are the prior class probabilities. The advantages that a density estimation approach to classification can bring are covered in Section 2.2.5.
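The Bayes' theorem step described above can be sketched in a few lines; the log-space normalisation and the example numbers below are illustrative additions, not taken from this work.

```python
import math

def class_posteriors(log_likelihoods, priors):
    """Combine class-conditional log-likelihoods log p(x|j) with prior
    class probabilities p(j) via Bayes' theorem, p(j|x) ∝ p(x|j) p(j).
    Computed in log space for numerical stability, then normalised."""
    log_post = [ll + math.log(p) for ll, p in zip(log_likelihoods, priors)]
    m = max(log_post)                       # subtract max to avoid underflow
    unnorm = [math.exp(lp - m) for lp in log_post]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Hypothetical three-class example (e.g. three ship types), equal priors:
probs = class_posteriors([-10.0, -12.0, -15.0], [1/3, 1/3, 1/3])
```

The returned probabilities sum to one, and the class with the largest likelihood receives the largest posterior under equal priors.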
5.1.2 Related work
Gamma mixture models for ATR have previously been proposed by Webb [188], who uses the EM algorithm to estimate the parameters of class distributions that are gamma mixture models. Results from Webb’s approach are presented in this chapter.
5.1.3 Outline
The outline of the chapter is as follows. Section 5.2 motivates the mixture model approach. The problem is formulated in Section 5.3, and the Bayesian model is specified in Section 5.4. Section 5.5 introduces the MCMC algorithm used to estimate the parameters of the mixture model, with some details deferred to Appendix G.1. The use of the trained classifier is described in Section 5.6, with details given in Appendix G.2. The results from an application to military ship ATR are given in Section 5.7, and compared with those obtained from two previously published techniques, namely a self-organising map and a maximum-likelihood gamma mixture model classifier. Finally, conclusions are presented in Section 5.8.
5.2 Mixture model approach
5.2.1 Mixture model motivation
Estimation of the class-conditional probability densities, p(x|j), in a radar target recognition context, is frequently complicated by the relatively high dimensionality of typical RRP data. Non-parametric methods of density estimation, such as kernel-based methods, tend to require unrealistically large amounts of training data for accurate density estimates [166, 189]. However, many parametric methods, such as approximation by a multivariate Gaussian distribution, might impose a specific form on the density that is too rigid for the problem. Mixture models provide a parametric method with enough flexibility to be able to model a wide range of possible densities, including multi-modal densities, but with enough structure that they can be trained with only limited amounts of data. Further motivations for the mixture model approach for the specific application to military ship ATR arise from the following observations:
1. Provided that the pitch and roll of the ship are not too large, the probability density function of the radar returns for a single target can be expressed as an integral over the aspect angle (from the radar to the ship) of a conditional density of a simple form (e.g. in this case gamma)1. The mixture distribution comes from approximating this integral by a finite sum.
2. Additional effects, such as robustness to offsets in position, can be readily incorporated into such a model.
To illustrate the first observation, suppose that the conditional distribution of the measurement data x, given the class j and the aspect angle of observation φ, is given by a (simple) distribution p(x|j, φ). Then the overall class-conditional density can be written as:

p(x|j) = \int_\Phi p(x, \phi|j) \, d\phi = \int_\Phi p(\phi|j) \, p(x|j, \phi) \, d\phi,   (5.1)
1 If all other factors remain constant, for a specific example of a target the radar reflectivity is an analytic function of illumination angle (where illumination angle takes into account the pitch and roll). However, conditioning on the aspect angle only results in a conditional distribution for the radar reflectivity as a function of the angle.
where p(φ|j) is the probability density of the aspect angle. We then replace this integral by a finite sum approximation, based on R components, giving the finite mixture distribution:

p(x|j) = \sum_{r=1}^{R} p_{j,r} \, p(x|j, \phi_r),   with   \sum_{r=1}^{R} p_{j,r} = 1.   (5.2)
We note, however, that the mixture model approach offers more than just a segmentation of the data into angle sectors. By allowing the training data within each class to cluster naturally according to the data, rather than into manually specified angle segments, the overall fit to the data should be better. For instance, if symmetries in the objects lead to the data being similar in separated angle segments, this can be handled efficiently through mixture modelling, but not through manual segmentation into angles. The incorporation of offsets in position (the second observation) is covered in Section 5.6.3.
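The finite-sum approximation (5.2) is simply a weighted sum of per-sector densities. A minimal sketch, in which the two exponential 'sector' densities and all numerical values are purely hypothetical:

```python
import math

def mixture_density(x, weights, component_pdfs):
    """Evaluate the finite-sum approximation
    p(x|j) ≈ sum_r p_{j,r} * p(x | j, phi_r);
    the weights play the role of the aspect-angle probabilities p_{j,r}."""
    assert abs(sum(weights) - 1.0) < 1e-9, "mixing weights must sum to 1"
    return sum(w * pdf(x) for w, pdf in zip(weights, component_pdfs))

# Two hypothetical 'aspect sector' densities (exponentials, for illustration):
pdf_a = lambda x: 1.0 * math.exp(-1.0 * x)
pdf_b = lambda x: 2.0 * math.exp(-2.0 * x)
val = mixture_density(0.5, [0.7, 0.3], [pdf_a, pdf_b])
```

In the full model the component densities would be the (multi-dimensional) gamma component distributions introduced below, and the weights would be learnt from the data rather than fixed by hand.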
5.2.2 Bayesian motivation
A Bayesian approach to estimating the parameters of the gamma mixture models offers a number of advantages over methods based on maximum likelihood techniques, such as the EM algorithm. Not least is the elimination of the problem of unboundedness of the likelihood function, which is frequently ignored in maximum likelihood techniques. The EM algorithm is also notorious for its initialisation problems, frequently becoming stuck in different local maxima depending on the specific initialisation of the algorithm. There are also the standard arguments in favour of Bayesian techniques, such as the ability to incorporate additional prior information, perhaps elicited from expert knowledge, and the production of confidence intervals and other statistics for the parameters estimated. Some of the advantages and disadvantages of Bayesian treatments for Gaussian mixture models are documented in the Bayesian literature [40, 55, 124, 139, 160]. Although these papers all concentrate on Gaussian mixture models, the general comments are similar for gamma mixture models. Similarly to Chapter 4, there is also the potential for using hyperparameters in our prior distributions for the mixture model parameters, to account for differences between training and test data; for example, variations in the equipment fit, or different types of target (perhaps later models) from the same generic class. This has not been investigated for gamma mixture models to date.
5.2.3 Gamma mixture component motivation
A motivation for the choice of gamma, rather than Gaussian, mixture component distributions (which are much more commonly used in density estimation) arises from physical consideration of radar returns. The gamma distribution has been empirically observed to model radar returns [167] (albeit for the case of an unresolved target under conditions of a random illumination direction), and has as special cases the Swerling 1, 2, 3 and 4 models, representing fluctuating targets (shape parameter equal to 1 in the first two cases, and 2 in the last two) [167]. Additionally, in the limit as the shape parameter tends to infinity, the gamma distribution provides the model for a non-fluctuating target. The benefits to be gained from the use of the more physically appropriate gamma mixture components, as compared to Gaussian mixture components, are likely to be larger if the training datasets are small. This is because, if the training datasets are large enough, it should be possible to use Gaussian mixture distributions (with larger numbers of components) to approximate the gamma mixture distributions.
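The special cases mentioned above are easy to check numerically. The sketch below uses the standard gamma density with shape α and inverse scale λ; the parameter values are illustrative only.

```python
import math

def gamma_pdf(x, shape, inv_scale):
    """Gamma density Ga(x; alpha, lambda), parameterised by shape alpha
    and inverse-scale lambda (mean alpha/lambda)."""
    return (inv_scale ** shape) * (x ** (shape - 1)) \
        * math.exp(-inv_scale * x) / math.gamma(shape)

# Shape alpha = 1 reduces to the exponential density lambda * exp(-lambda * x),
# the Swerling 1/2 fluctuation model; alpha = 2 corresponds to Swerling 3/4.
x, lam = 0.8, 1.5
swerling12 = gamma_pdf(x, 1, lam)
```

Increasing the shape parameter (with the mean held fixed) concentrates the density around its mean, which is consistent with the non-fluctuating-target limit noted above.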
5.3 Problem formulation
The problem is formulated in a generic manner, namely classification of an object into one of J distinct classes, on the basis of a d-dimensional data measurement of that object. For the military ship ATR from RRPs application, the classes are the types of ship and the data measurements the radar returns from the target. Our training data, which in general can comprise both labelled and unlabelled measurements2 from the classes, consist of n observed d-dimensional data samples y = {y1 , . . . , yn }, and any corresponding class labels. If the class of the measurement yi is known it is denoted by Zi . The combination of y and any known class allocations for the data is denoted by D. Where there is a need to distinguish between labelled and unlabelled training data we refer to {i ∈ lab} as the indices of the labelled training data and {i ∈ unlab} as the indices of the unlabelled training data. For the purposes of the military ship ATR from RRPs application all the training data is labelled. The probability density function for the d-dimensional data x can be written as:
p(x) = \sum_{j=1}^{J} \theta_j \, p(x|j),   (5.3)
where θ = (θ_1, ..., θ_J) is a vector of the prior classification probabilities for each class, with components satisfying θ_j > 0 and \sum_{j=1}^{J} θ_j = 1, and p(x|j) is the class-conditional probability density (likelihood) for data from class j. The class-conditional densities are also modelled by mixture distributions, with the j-th class having R_j components, which we refer to as mixture components:
p(x|j) = \sum_{r=1}^{R_j} \pi_{j,r} \, p(x|j, r).   (5.4)
π_j = (π_{j,1}, ..., π_{j,R_j}) represents the mixing probabilities within class j; in particular, π_{j,r} > 0 is the mixing probability for the r-th mixture component of the j-th class, satisfying \sum_{r=1}^{R_j} π_{j,r} = 1. The complete set is denoted by π = {π_j, 1 ≤ j ≤ J}.

²Unlabelled data may arise due to the collection of data in hostile environments where the exact ground truth is unknown.
The distribution p(x|j, r) represents the probability density of the data within a given mixture component r, of a class j. We make an assumption of independence between the components of the data vector x, conditioned on the class and mixture component. This corresponds to:
p(x|j, r) = \prod_{l=1}^{d} p_l(x_l | j, r),   (5.5)
where x = (x_1, ..., x_d). Note that this independence assumption for each component does not extend to an independence assumption for the class-conditional mixture distribution as a whole. We take gamma distributions for the component distributions p_l(x_l|j, r) of (5.5). The single-dimensional gamma component distributions Ga(x_l; α_{j,r,l}, λ_{j,r,l}) that are used have shape parameters α_{j,r,l} and inverse-scale parameters λ_{j,r,l} (see Appendix A). Notationally, we define α_{j,r} = (α_{j,r,1}, ..., α_{j,r,d}) and λ_{j,r} = (λ_{j,r,1}, ..., λ_{j,r,d}). The sets of all α_{j,r} and λ_{j,r} are denoted by α and λ respectively. The α_{j,r,l} are restricted to be positive integers for a minor convenience in the later algorithm. A modification of the algorithm to cope with any α_{j,r,l} would be trivial and would add little to the model, but would possibly increase the computational cost of the training procedure. The comments of Section 3.4.2 apply with regard to the choice of the number of mixture components R_j to use in each mixture model. In implementations to date, the number of mixture components has been held fixed, with the value chosen as a compromise between having enough components for adequate modelling of the data (as determined by looking at the training data classification rate) and the computational efficiency of smaller model orders.
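As a concrete illustration of (5.3)-(5.5), the following sketch evaluates a gamma mixture density for a single measurement vector. All parameter values are hypothetical, and note that `scipy.stats.gamma` takes a shape parameter `a` and a scale equal to 1/λ:

```python
import numpy as np
from scipy.stats import gamma

def component_density(x, alpha, lam):
    """p(x|j,r) of (5.5): product of d independent gamma densities."""
    # scipy's gamma is parameterised by shape a and scale = 1/lambda
    return np.prod(gamma.pdf(x, a=alpha, scale=1.0 / lam))

def class_conditional(x, pis, alphas, lams):
    """p(x|j) of (5.4): mixture over the R_j components of class j."""
    return sum(pi * component_density(x, a, l)
               for pi, a, l in zip(pis, alphas, lams))

def density(x, theta, pis, alphas, lams):
    """p(x) of (5.3): mixture over the J classes."""
    return sum(th * class_conditional(x, pis[j], alphas[j], lams[j])
               for j, th in enumerate(theta))
```

With J = 2 single-component classes, `density` reduces to the θ-weighted sum of the two class-conditional densities, mirroring (5.3) directly.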
5.4 Bayesian solution
5.4.1 Prior distributions

A hierarchical structure for the prior distributions of the mixture model parameters is assigned. This is illustrated by the DAG in Fig. 5.1. The random variables are denoted by circular boxes and the fixed hyperparameters by square boxes. For notational convenience, our equations do not explicitly state the dependence of the distributions on the fixed hyperparameters. The complete prior distribution for the mixture model parameters is taken to be:

p(\alpha, \lambda, \theta, \pi) = p(\theta) \prod_{j=1}^{J} \left[ p(\pi_j) \prod_{r=1}^{R_j} \prod_{l=1}^{d} p(\alpha_{j,r,l}) \, p(\lambda_{j,r,l}) \right],   (5.6)
where we have assigned prior distributions for (αj,r , λj,r ) that are mutually independent over all classes j and mixture components r. The parameters defining each gamma component distribution are also taken to be independent.
[Figure 5.1: A DAG illustrating the hierarchical structure for the prior distributions of the parameters of the gamma mixture models. The fixed hyperparameters φ_0, ν_0, ψ_0, a_0 and b_0 are parents of the random variables α, λ, θ and π, which in turn are parents of the data y.]

The shape parameters are assigned independent Poisson priors:

\alpha_{j,r,l} - 1 \sim \text{Poisson}(\phi_{j,r,l,0}),   (5.7)
and the inverse-scale parameters are assigned independent gamma priors:

\lambda_{j,r,l} \sim \text{Ga}(\nu_{j,r,l,0}, \psi_{j,r,l,0}).   (5.8)
The class and mixture component allocation probabilities are given independent Dirichlet priors, with:

\theta \sim \mathcal{D}(a_0), \quad \text{where } a_0 = (a_{1,0}, \ldots, a_{J,0}), \; a_{j,0} > 0,   (5.9)

and:

\pi_j \sim \mathcal{D}(b_{j,0}), \quad \text{where } b_{j,0} = (b_{j,1,0}, \ldots, b_{j,R_j,0}), \; b_{j,r,0} > 0.   (5.10)
The hyperparameters (φ_{j,r,l,0}, ν_{j,r,l,0}, ψ_{j,r,l,0}, a_0, b_{j,0}) are held fixed at 'reasonable' values, chosen so that the prior distributions fit the general spread of the data. In particular, in our experiments we have set:

\phi_{j,r,l,0} = 2.0,   (5.11)

\nu_{j,r,l,0} = 1 + \epsilon \phi_{j,r,l,0}, \quad \psi_{j,r,l,0} = \epsilon \bar{y}_l,   (5.12)

a_0 = (1.0, \ldots, 1.0), \quad b_{j,0} = (1.0, \ldots, 1.0),   (5.13)

where ε > 0 is small and weights the importance of the prior distributions relative to the likelihood function, and \bar{y}_l is the l-th component of the mean of the training data vectors. In our experiments we set ε = 0.001.
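Under the hyperparameter choices of (5.11)-(5.13), a draw from the full prior (5.6) can be sketched as follows. Array shapes and function names are illustrative, and a common number of components R across classes is assumed for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_prior(J, R, d, ybar, eps=0.001, phi0=2.0):
    """Draw (alpha, lambda, theta, pi) from the priors (5.6)-(5.13)."""
    # (5.7): alpha - 1 ~ Poisson(phi0), so alpha is a positive integer
    alpha = 1 + rng.poisson(phi0, size=(J, R, d))
    # (5.12): nu0 = 1 + eps*phi0, psi0_l = eps * ybar_l
    nu0 = 1.0 + eps * phi0
    psi0 = eps * ybar                              # shape (d,)
    # (5.8): lambda ~ Ga(nu0, psi0); numpy's gamma uses scale = 1/rate
    lam = rng.gamma(shape=nu0, scale=1.0 / psi0, size=(J, R, d))
    # (5.9)-(5.10) with a_0 = b_{j,0} = (1, ..., 1): flat Dirichlets
    theta = rng.dirichlet(np.ones(J))
    pi = np.stack([rng.dirichlet(np.ones(R)) for _ in range(J)])
    return alpha, lam, theta, pi
```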
5.4.2 Posterior distribution

The likelihood function is given by:

p(y|\alpha, \lambda, \theta, \pi) \propto \prod_{i \in \text{unlab}} \sum_{j=1}^{J} \theta_j \sum_{r=1}^{R_j} \pi_{j,r} \prod_{l=1}^{d} \frac{\lambda_{j,r,l}^{\alpha_{j,r,l}} y_{i,l}^{\alpha_{j,r,l}-1} \exp[-\lambda_{j,r,l} y_{i,l}]}{(\alpha_{j,r,l}-1)!}
\times \prod_{i \in \text{lab}} \sum_{r=1}^{R_{Z_i}} \pi_{Z_i,r} \prod_{l=1}^{d} \frac{\lambda_{Z_i,r,l}^{\alpha_{Z_i,r,l}} y_{i,l}^{\alpha_{Z_i,r,l}-1} \exp[-\lambda_{Z_i,r,l} y_{i,l}]}{(\alpha_{Z_i,r,l}-1)!}.   (5.14)
Applying Bayes' theorem gives the following posterior distribution for the mixture model parameters:

p(\alpha, \lambda, \theta, \pi | D) \propto \prod_{j=1}^{J} \left[ \theta_j^{a_{j,0}-1} \prod_{r=1}^{R_j} \pi_{j,r}^{b_{j,r,0}-1} \prod_{l=1}^{d} \frac{\lambda_{j,r,l}^{\nu_{j,r,l,0}-1} \exp[-\psi_{j,r,l,0} \lambda_{j,r,l}] \, \phi_{j,r,l,0}^{\alpha_{j,r,l}-1}}{(\alpha_{j,r,l}-1)!} \right]
\times \prod_{i \in \text{unlab}} \sum_{j=1}^{J} \theta_j \sum_{r=1}^{R_j} \pi_{j,r} \prod_{l=1}^{d} \frac{\lambda_{j,r,l}^{\alpha_{j,r,l}} y_{i,l}^{\alpha_{j,r,l}-1} \exp[-\lambda_{j,r,l} y_{i,l}]}{(\alpha_{j,r,l}-1)!}
\times \prod_{i \in \text{lab}} \sum_{r=1}^{R_{Z_i}} \pi_{Z_i,r} \prod_{l=1}^{d} \frac{\lambda_{Z_i,r,l}^{\alpha_{Z_i,r,l}} y_{i,l}^{\alpha_{Z_i,r,l}-1} \exp[-\lambda_{Z_i,r,l} y_{i,l}]}{(\alpha_{Z_i,r,l}-1)!}.   (5.15)
Calculation of the normalisation constant of the posterior distribution (regardless of the proportion of labelled data) is analytically intractable, as are calculations of various statistics of interest, such as the means and variances of the parameters. It is proposed to maintain a full Bayesian approach to the problem by drawing samples from the posterior distribution. All inferences can then be made through consideration of these samples. Unfortunately, it is not possible to sample directly from the distribution, so an MCMC algorithm [55, 158, 178] is used.
5.5 MCMC algorithm
5.5.1 Introduction

A hybrid MCMC algorithm, known as Metropolis-within-Gibbs [158], is used. Specifically, a Gibbs sampler is used, with the modification that for one of the component distributions (which is hard to sample from exactly), a single Metropolis-Hastings step is used rather than exact sampling. The Metropolis-Hastings algorithm for sampling from a distribution p(ψ) works by defining a proposal distribution q(ψ'|ψ) for updating the value of ψ to ψ'. At each iteration, a sample ψ' is drawn from the distribution q(ψ'|ψ), where ψ is the current value of the variable. The proposed value ψ' is accepted with probability:

a(\psi, \psi') = \min(1, r(\psi', \psi)),   (5.16)

where:

r(\psi', \psi) = \frac{p(\psi') \, q(\psi|\psi')}{p(\psi) \, q(\psi'|\psi)}.   (5.17)
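A generic Metropolis-Hastings step implementing (5.16)-(5.17) might look as follows. It is written in log space for numerical stability, and the target, proposal sampler and proposal density are supplied by the caller (names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def mh_step(psi, log_p, propose, log_q):
    """One Metropolis-Hastings update, (5.16)-(5.17), in log space."""
    psi_new = propose(psi)
    # log of the ratio r(psi', psi) of (5.17)
    log_r = (log_p(psi_new) + log_q(psi, psi_new)
             - log_p(psi) - log_q(psi_new, psi))
    if np.log(rng.uniform()) < min(0.0, log_r):
        return psi_new          # accept with probability min(1, r)
    return psi                  # otherwise keep the current value
```

For example, with a standard normal target and a symmetric Gaussian random walk, `log_q` contributes nothing to the ratio, and the chain's long-run samples follow the target.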
5.5.2 Outline

To utilise the Gibbs sampler, the set of random variables (α, λ, π, θ) is extended to include the class allocation variables, Z = (Z_1, ..., Z_n), and the mixture component allocation variables, z = (z_1, ..., z_n). These are such that Z_i = j and z_i = r imply that the observation indexed by i is modelled as coming from component r of class j. Z_i is known for our labelled training data (and should be treated as a constant), whereas z_i is always unknown. The algorithm proceeds as follows:

1. Initialisation. Set i = 1 and determine initial values for (α^{(0)}, Z^{(0)}, z^{(0)}) from the support of the joint posterior distribution.
   • Elements of Z^{(0)} relating to labelled training data are assigned their true class labels.
   • Elements of Z^{(0)} relating to unlabelled training data are set using very simple (and quick) classifiers, such as nearest class mean.
   • The z^{(0)} are initialised with simple clustering algorithms such as k-means, or, for training data with known angles of observation, by clustering into angle segments.
   • The α_{j,r}^{(0)} are sampled from the prior distribution.
2. Iteration i:
   • Sample the mixture component parameters (α^{(i)}, λ^{(i)}) from the conditional distribution p(α, λ | y, θ^{(i-1)}, π^{(i-1)}, Z^{(i-1)}, z^{(i-1)}). This is covered in Section 5.5.3.
   • Sample the allocation probabilities (θ^{(i)}, π^{(i)}) from the conditional distribution p(θ, π | y, α^{(i)}, λ^{(i)}, Z^{(i-1)}, z^{(i-1)}). This is covered in Section 5.5.4.
   • Sample the allocation variables (Z^{(i)}, z^{(i)}) from the conditional distribution p(Z, z | y, α^{(i)}, λ^{(i)}, θ^{(i)}, π^{(i)}). This is covered in Section 5.5.5.
3. i ← i + 1 and go to 2.
After an initial burn-in period, during which the generated Markov chain reaches equilibrium, the sets of parameters (α^{(i)}, λ^{(i)}, θ^{(i)}, π^{(i)}, Z^{(i)}, z^{(i)}) can be regarded as dependent samples from the posterior distribution p(α, λ, θ, π, Z, z|y). To obtain approximately independent samples, a gap (known as the decorrelation gap) is left between successive samples. If we are only concerned with ergodic averages, lower variances are obtained if the output of the Markov chain is not sub-sampled. However, if storage of the samples is an issue, it may be better to leave a decorrelation gap, so that the full space of the distribution can be explored without having to keep thousands of samples. In our notation, these approximately independent samples are relabelled (α^{(s)}, λ^{(s)}, θ^{(s)}, π^{(s)}, Z^{(s)}, z^{(s)}), for s = 1, ..., N, where N is the number of MCMC samples. For our application, the choice of these parameters (and therefore the termination of the algorithm) involved specifying a set of values and then examining the classification rate for the training data.
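The loop of Section 5.5.2, together with the burn-in and decorrelation gap just described, can be sketched as follows. The individual step functions are placeholders for the block updates of Sections 5.5.3-5.5.5:

```python
def run_sampler(state, step_fns, n_samples, burn_in=5000, gap=25):
    """Metropolis-within-Gibbs loop of Section 5.5.2 (illustrative).

    `state` is a dict holding (alpha, lam, theta, pi, Z, z); each
    function in `step_fns` updates one block of `state` in place,
    conditional on the rest.
    """
    samples = []
    total = burn_in + n_samples * gap
    for i in range(total):
        for step in step_fns:     # alpha/lambda, theta/pi, then Z/z
            step(state)
        # discard the burn-in, then keep every `gap`-th draw
        if i >= burn_in and (i - burn_in) % gap == 0:
            samples.append({k: v for k, v in state.items()})
    return samples
```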
Good training data classification performance indicated convergence of the algorithm.
5.5.3 The mixture components

To sample from the conditional distributions for (α, λ), we make use of the fact that, for a given set of allocation variables (Z, z), the data y consist of classified independent samples from the k = R_1 + ... + R_J mixture component distributions, independent of the prior class and mixing probabilities. Furthermore, our parameters and prior distributions are formulated so that if (r, j) ≠ (r', j'), the training data assigned to mixture component r of class j have no influence on the posterior distribution of the component parameters of mixture component r' of class j'. We define G_{j,r} = {i : Z_i = j, z_i = r}, the set of indices of data elements that have been assigned to component r of class j, and g_{j,r} to be the cardinality of G_{j,r}. The set {y_i : i ∈ G_{j,r}} is denoted by y_{G_{j,r}}. Thus:

p(\alpha, \lambda | \theta, \pi, y, Z, z) = \prod_{j=1}^{J} \prod_{r=1}^{R_j} p(\alpha_{j,r}, \lambda_{j,r} | y_{G_{j,r}}).   (5.18)
Since the mixture component likelihood and prior distributions have independent components, we have:

p(\alpha_{j,r}, \lambda_{j,r} | y_{G_{j,r}}) = \prod_{l=1}^{d} p(\alpha_{j,r,l}, \lambda_{j,r,l} | y_{G_{j,r}}),   (5.19)

where:

p(\alpha_{j,r,l}, \lambda_{j,r,l} | y_{G_{j,r}}) \propto p(\alpha_{j,r,l}, \lambda_{j,r,l}) \prod_{i \in G_{j,r}} p(y_{i,l} | \alpha_{j,r,l}, \lambda_{j,r,l}),   (5.20)
for l = 1, ..., d. Therefore:

p(\alpha_{j,r,l}, \lambda_{j,r,l} | y_{G_{j,r}}) \propto \lambda_{j,r,l}^{g_{j,r}\alpha_{j,r,l} + \nu_{j,r,l,0} - 1} \exp\left[-\lambda_{j,r,l}\left(\sum_{i \in G_{j,r}} y_{i,l} + \psi_{j,r,l,0}\right)\right] \times \frac{\left(\phi_{j,r,l,0} \prod_{i \in G_{j,r}} y_{i,l}\right)^{\alpha_{j,r,l}-1}}{((\alpha_{j,r,l}-1)!)^{g_{j,r}+1}},   (5.21)
for l = 1, ..., d. It is not clear how to sample from such a distribution, so (5.20) is rewritten as the following:

p(\alpha_{j,r,l}, \lambda_{j,r,l} | y_{G_{j,r}}) = p(\lambda_{j,r,l} | \alpha_{j,r,l}, y_{G_{j,r}}) \, p(\alpha_{j,r,l} | y_{G_{j,r}}),   (5.22)

where:

p(\lambda_{j,r,l} | \alpha_{j,r,l}, y_{G_{j,r}}) \propto \lambda_{j,r,l}^{g_{j,r}\alpha_{j,r,l} + \nu_{j,r,l,0} - 1} \exp[-\psi_{j,r,l} \lambda_{j,r,l}],   (5.23)

i.e.:

\lambda_{j,r,l} | (\alpha_{j,r,l}, y_{G_{j,r}}) \sim \text{Ga}(g_{j,r}\alpha_{j,r,l} + \nu_{j,r,l,0}, \psi_{j,r,l}),   (5.24)
and, for notational ease, we have defined:

\psi_{j,r,l} = \sum_{i \in G_{j,r}} y_{i,l} + \psi_{j,r,l,0}.   (5.25)
By integrating over λ_{j,r,l} in (5.21), the following marginal distribution for α_{j,r,l} is obtained:

p(\alpha_{j,r,l} | y_{G_{j,r}}) \propto \frac{\Gamma[g_{j,r}\alpha_{j,r,l} + \nu_{j,r,l,0}]}{\psi_{j,r,l}^{g_{j,r}\alpha_{j,r,l} + \nu_{j,r,l,0}}} \times \frac{\left(\phi_{j,r,l,0} \prod_{i \in G_{j,r}} y_{i,l}\right)^{\alpha_{j,r,l}-1}}{((\alpha_{j,r,l}-1)!)^{g_{j,r}+1}}.   (5.26)
Given a value for α_{j,r,l} it is easy to sample from (5.24). However, sampling efficiently from (5.26) is considerably more difficult, since the distribution does not belong to any well-known family of distributions. Therefore, to sample (α_{j,r,l}, λ_{j,r,l}) from (5.22), a Metropolis-Hastings step is used. This should not be seen as a drawback, since in the case of sampling from a vector made up of discrete random variables it has been proved [105] that utilising Metropolis-Hastings steps within a modified Gibbs sampler can actually increase the statistical efficiency compared to the standard Gibbs sampler. For α_{j,r,l} > 1, the proposal distribution is made up of the random walk:

q(\alpha'_{j,r,l} | \alpha_{j,r,l}) = \begin{cases} \xi & \alpha'_{j,r,l} = \alpha_{j,r,l} \pm 1, \\ 1 - 2\xi & \alpha'_{j,r,l} = \alpha_{j,r,l}, \\ 0 & \text{otherwise}, \end{cases}   (5.27)
along with exact sampling from p(λ'_{j,r,l} | α'_{j,r,l}, y_{G_{j,r}}):

\lambda'_{j,r,l} | \alpha'_{j,r,l} \sim \text{Ga}(g_{j,r}\alpha'_{j,r,l} + \nu_{j,r,l,0}, \psi_{j,r,l}).   (5.28)
If α_{j,r,l} = 1, the random walk part of the proposal distribution is altered so that:

q(\alpha'_{j,r,l} | \alpha_{j,r,l} = 1) = \begin{cases} \xi & \alpha'_{j,r,l} = 2, \\ 1 - \xi & \alpha'_{j,r,l} = 1, \end{cases}   (5.29)

i.e. so that the random walk does not attempt to decrease α_{j,r,l} to zero. ξ ≤ 0.5 is fixed prior to running the algorithm. In our documented experiment, ξ = 0.5. If the α_{j,r,l} are not restricted to be positive integers, and are instead allowed to take any real value (subject to α_{j,r,l} ≥ 1), then for a suitable choice of prior distribution the second factor on the right-hand side of (5.26) would be replaced by a slightly more complicated expression. In this case, a continuous random walk (e.g. a uniform distribution) would be used for the Metropolis-Hastings sampling step for α_{j,r,l}. The acceptance probability for each Metropolis-Hastings step for updating (α_{j,r,l}, λ_{j,r,l}) to (α'_{j,r,l}, λ'_{j,r,l}) is given by:
a_{j,r,l} = \min(1, r_{j,r,l}),   (5.30)
where:

r_{j,r,l} = \frac{\Gamma[g_{j,r}\alpha'_{j,r,l} + \nu_{j,r,l,0}]}{\Gamma[g_{j,r}\alpha_{j,r,l} + \nu_{j,r,l,0}]} \left(\frac{\phi_{j,r,l,0} \prod_{i \in G_{j,r}} y_{i,l}}{\psi_{j,r,l}^{g_{j,r}}}\right)^{\alpha'_{j,r,l} - \alpha_{j,r,l}} \left(\frac{(\alpha_{j,r,l}-1)!}{(\alpha'_{j,r,l}-1)!}\right)^{g_{j,r}+1}.   (5.31)

This is independent of the values λ_{j,r,l} and λ'_{j,r,l}. Details of the calculation of this acceptance probability are given in Appendix G.1. Computation of (5.31) is, of course, considerably eased when use is made of the fact that the random walk is such that α'_{j,r,l} = α_{j,r,l} - 1 or α'_{j,r,l} = α_{j,r,l} + 1.
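The update of a single (α_{j,r,l}, λ_{j,r,l}) can be sketched as follows: the integer random walk of (5.27)/(5.29), the acceptance ratio (5.31) evaluated in log space (using `scipy.special.gammaln`, with (α-1)! = Γ(α)), and an exact draw of λ from (5.24). Function and variable names are illustrative:

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(2)

def update_alpha_lambda(alpha, g, y_sum, logy_sum, nu0, psi0, phi0, xi=0.5):
    """One Metropolis-within-Gibbs update of (alpha, lambda).

    g        : number of points allocated to the component (g_{j,r})
    y_sum    : sum of the allocated data values
    logy_sum : sum of their logs (so log prod y = logy_sum)
    """
    psi = y_sum + psi0                              # (5.25)
    # integer random walk proposal, (5.27) and (5.29)
    if alpha == 1:
        alpha_new = 2 if rng.uniform() < xi else 1
    else:
        u = rng.uniform()
        alpha_new = alpha + 1 if u < xi else (alpha - 1 if u < 2 * xi else alpha)
    if alpha_new != alpha:
        # log of the acceptance ratio (5.31)
        log_r = (gammaln(g * alpha_new + nu0) - gammaln(g * alpha + nu0)
                 + (alpha_new - alpha) * (np.log(phi0) + logy_sum
                                          - g * np.log(psi))
                 + (g + 1) * (gammaln(alpha) - gammaln(alpha_new)))
        if np.log(rng.uniform()) < min(0.0, log_r):
            alpha = alpha_new
    # exact draw of lambda from (5.24); numpy's gamma uses scale = 1/rate
    lam = rng.gamma(shape=g * alpha + nu0, scale=1.0 / psi)
    return alpha, lam
```

On synthetic data drawn from Ga(3, 2), repeated application of this update concentrates α around 3 and λ around 2, as expected.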
5.5.4 The allocation probabilities

Given the allocation variables (Z, z), the prior class and mixing probabilities (θ, π) are independent of (y, α, λ), and are updated using independent Dirichlet distributions. The conditional distribution for the prior class probabilities is given by:

p(\theta|Z) \propto p(Z|\theta) \, p(\theta) \propto \left[\prod_{i=1}^{n} p(Z_i|\theta)\right] \prod_{j=1}^{J} \theta_j^{a_{j,0}-1} \propto \prod_{j=1}^{J} \theta_j^{\sum_{r=1}^{R_j} g_{j,r} + a_{j,0} - 1},   (5.32)

which gives:

\theta | Z \sim \mathcal{D}(a),   (5.33)

independently of π, where a = (a_1, ..., a_J) with:

a_j = \sum_{r=1}^{R_j} g_{j,r} + a_{j,0}.   (5.34)
The conditional distribution for the within class mixing probabilities is given by:
p(\pi|Z,z) \propto p(\pi|Z) \, p(z|\pi,Z) = p(\pi) \prod_{i=1}^{n} p(z_i|\pi,Z_i) \propto \left[\prod_{j=1}^{J}\prod_{r=1}^{R_j} \pi_{j,r}^{b_{j,r,0}-1}\right] \prod_{i=1}^{n} \pi_{Z_i,z_i} \propto \prod_{j=1}^{J}\prod_{r=1}^{R_j} \pi_{j,r}^{g_{j,r}+b_{j,r,0}-1},   (5.35)

which gives the following independent distributions:

\pi_j | (Z, z) \sim \mathcal{D}(b_j),   (5.36)

where b_j = (b_{j,1}, ..., b_{j,R_j}) with:

b_{j,r} = g_{j,r} + b_{j,r,0}.   (5.37)
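The conjugate updates (5.32)-(5.37) reduce to counting allocations and drawing from Dirichlet distributions, as the following sketch shows (illustrative function names, flat priors a_{j,0} = b_{j,r,0} = 1 as defaults):

```python
import numpy as np

rng = np.random.default_rng(3)

def update_allocation_probs(Z, z, J, R, a0=1.0, b0=1.0):
    """Draw theta and pi from (5.33) and (5.36).

    Z, z are integer arrays of class / component allocations.
    """
    # g_{j,r}: counts of points assigned to component r of class j
    g = np.zeros((J, R))
    for j, r in zip(Z, z):
        g[j, r] += 1
    theta = rng.dirichlet(g.sum(axis=1) + a0)                     # (5.33)-(5.34)
    pi = np.stack([rng.dirichlet(g[j] + b0) for j in range(J)])   # (5.36)-(5.37)
    return theta, pi
```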
5.5.5 The allocation variables

Since the data y_i are conditionally independent given (α, λ, θ, π), the allocation variables for each data measurement are conditionally independent. Each measurement can therefore be considered separately. Thus:

p(Z_i, z_i | y, \alpha, \lambda, \theta, \pi) = p(z_i | Z_i, y_i, \alpha, \lambda, \theta, \pi) \, p(Z_i | y_i, \alpha, \lambda, \theta, \pi).   (5.38)
If Z_i is unknown for the i-th data vector, we have:

p(Z_i = j | y_i, \alpha, \lambda, \theta, \pi) \propto \theta_j \sum_{r=1}^{R_j} \pi_{j,r} \, p(y_i | \alpha_{j,r}, \lambda_{j,r}),   (5.39)
where for notational ease we have defined:

p(y_i | \alpha_{j,r}, \lambda_{j,r}) = p(y_i | \alpha, \lambda, \theta, \pi, Z_i = j, z_i = r) = \prod_{l=1}^{d} \text{Ga}(y_{i,l}; \alpha_{j,r,l}, \lambda_{j,r,l}).   (5.40)
For a labelled training data vector, this sampling is skipped, which is equivalent to replacing (5.39) with a unit mass distribution at the labelled class. For the within-class mixture component allocation variable z_i, we have:

p(z_i = r | Z_i = j, y_i, \alpha, \lambda, \theta, \pi) \propto \pi_{j,r} \, p(y_i | \alpha_{j,r}, \lambda_{j,r}).   (5.41)
Since the number of classes and mixture components is finite, it is easy to sample from the distributions defined by both (5.39) and (5.41).
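Sampling the allocation variables via (5.39)-(5.41) for a single observation can be sketched as follows (illustrative names; `scipy.stats.gamma` takes a shape `a` and scale = 1/λ; passing a known `Zi` reproduces the labelled case, where only z_i is drawn):

```python
import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(4)

def sample_allocations(yi, theta, pi, alphas, lams, Zi=None):
    """Draw (Z_i, z_i) from (5.39) and (5.41) for one observation."""
    def comp_pdf(j, r):
        # p(y_i | alpha_{j,r}, lambda_{j,r}) of (5.40)
        return np.prod(gamma.pdf(yi, a=alphas[j][r], scale=1.0 / lams[j][r]))

    if Zi is None:                                  # unlabelled: sample Z_i
        w = np.array([theta[j] * sum(pi[j][r] * comp_pdf(j, r)
                                     for r in range(len(pi[j])))
                      for j in range(len(theta))])
        Zi = rng.choice(len(theta), p=w / w.sum())  # (5.39)
    v = np.array([pi[Zi][r] * comp_pdf(Zi, r) for r in range(len(pi[Zi]))])
    zi = rng.choice(len(v), p=v / v.sum())          # (5.41)
    return Zi, zi
```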
5.6 Using the trained classifier
5.6.1 Classifying the training data

Even if the training data is class labelled, it is useful to assess the performance of the trained classifier on the training data. Posterior classification probabilities for the training data, y, are obtained using Rao-Blackwellised [22, 106] estimates. These approximate the posterior marginalised distribution via an expectation of appropriate conditional distributions, rather than via direct estimation. We obtain:

p(Z|D) \approx \frac{1}{N} \sum_{s=1}^{N} p(Z | y, \alpha^{(s)}, \lambda^{(s)}, \theta^{(s)}, \pi^{(s)}),   (5.42)
where p(Z_i = j | y_i, α^{(s)}, λ^{(s)}, θ^{(s)}, π^{(s)}) is evaluated using (5.39). For unlabelled data this has lower variance than the direct estimation approach:
p(\hat{Z}_i = j | D) \approx \frac{1}{N} \sum_{m=1}^{N} I(Z_i^{(m)} = j),   (5.43)
where I is the indicator function, (so I(x = y) = 1 if x = y and 0 otherwise). This lower variance is essentially a consequence of the Rao-Blackwell theorem [100] which can be interpreted as stating that if a conditional expectation produces an estimator, the variance of this new estimator will be less than that of the original estimator used in the conditional expectation.
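The variance advantage of (5.42) over (5.43) can be seen in a small simulation. The per-sample conditional probabilities here are synthetic, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)

def estimate_variances(n_reps=2000, N=100):
    """Compare the spread of (5.42) and (5.43) on synthetic probabilities."""
    rb, direct = [], []
    for _ in range(n_reps):
        # fake per-sample conditional probabilities p(Z_i = 1 | y, params^(s))
        p_s = rng.uniform(0.5, 0.9, size=N)
        rb.append(p_s.mean())                       # Rao-Blackwellised (5.42)
        z = rng.uniform(size=N) < p_s               # indicator draws Z_i^(s)
        direct.append(z.mean())                     # direct estimate (5.43)
    return np.var(rb), np.var(direct)
```

The direct estimator carries the extra Bernoulli sampling noise of the indicator draws on top of the variation in the conditional probabilities, so its variance is strictly larger, consistent with the Rao-Blackwell theorem.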
5.6.2 Classifying future observations

The posterior class probabilities for a previously unseen observation y_f can be written:

P(Z_f = j | D, y_f) \propto P(Z_f = j | D) \, p(y_f | D, Z_f = j).   (5.44)
The term P(Z_f = j | D) can be estimated from prior beliefs on the spread of future data between classes, or can be approximated from our MCMC samples, using:

P(Z_f = j | D) = E(\theta_j | D) \approx \frac{1}{N}\sum_{s=1}^{N} E(\theta_j | D, Z^{(s)}) = \frac{1}{N}\sum_{s=1}^{N} \frac{a_j^{(s)}}{\sum_{j'=1}^{J} a_{j'}^{(s)}},   (5.45)

with a_j^{(s)} as defined in (5.34), with allocation variables (Z^{(s)}, z^{(s)}).
We could estimate p(y_f | D, Z_f = j) directly as follows:

p(y_f | D, Z_f = j) \approx \frac{1}{N}\sum_{s=1}^{N}\sum_{r=1}^{R_j} \pi_{j,r}^{(s)} \prod_{l=1}^{d} \text{Ga}(y_{f,l}; \alpha_{j,r,l}^{(s)}, \lambda_{j,r,l}^{(s)}).   (5.46)
However, a theoretically better approach considers the predictive density via Rao-Blackwellisation. Unfortunately, remembering that speed is a crucial issue in military environments where classifications are to be made, a full Rao-Blackwellised treatment of the predictive density is not computationally feasible. This is due to the rather complicated form of the terms in (5.26) (see Appendix G.2 for details). The situation would be even worse if the restriction that the α_{j,r,l} are positive integers were to be relaxed. As a compromise, we do not attempt to marginalise over the full posterior distribution for each (α_{j,r}, λ_{j,r}), and instead use a Rao-Blackwellised solution on predictive densities over the posterior distributions for λ_{j,r} only. The integration over the α_{j,r} occurs through direct summation over the MCMC sampled values α_{j,r,l}^{(s)}. We set (see Appendix G.2 for details):
p(y_f | D, Z_f = j) \approx \frac{1}{N}\sum_{s=1}^{N}\sum_{r=1}^{R_j} p(y_f | D, Z_f = j, z_f = r, \alpha^{(s)}, Z^{(s)}, z^{(s)}) \, p(z_f = r | D, Z_f = j, Z^{(s)}, z^{(s)}),   (5.47)

where:

p(z_f = r | D, Z_f = j, Z^{(s)}, z^{(s)}) = \frac{b_{j,r}^{(s)}}{\sum_{r'=1}^{R_j} b_{j,r'}^{(s)}},   (5.48)

and:

p(y_f | D, Z_f = j, z_f = r, \alpha^{(s)}, Z^{(s)}, z^{(s)}) = \prod_{l=1}^{d} \frac{y_{f,l}^{\alpha_{j,r,l}^{(s)}-1} \, (\psi_{j,r,l}^{(s)})^{g_{j,r}\alpha_{j,r,l}^{(s)}+\nu_{j,r,l,0}}}{(y_{f,l} + \psi_{j,r,l}^{(s)})^{(g_{j,r}+1)\alpha_{j,r,l}^{(s)}+\nu_{j,r,l,0}} \, \beta[\alpha_{j,r,l}^{(s)}, g_{j,r}\alpha_{j,r,l}^{(s)}+\nu_{j,r,l,0}]}.   (5.49)

The parameters b_{j,r}^{(s)} and ψ_{j,r,l}^{(s)} are as given in (5.37) and (5.25) respectively, with allocation variables given by (Z^{(s)}, z^{(s)}).
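A sketch of the partially Rao-Blackwellised predictive density (5.47)-(5.49), computed in log space per dimension using `scipy.special.betaln` for the beta function. The per-sample summary structure (dicts holding per-class `alpha`, `psi`, `g` and `b` arrays) is an assumption for illustration:

```python
import numpy as np
from scipy.special import betaln

def predictive_term(yf, alpha, g, psi, nu0):
    """log of one factor of (5.49) for a single dimension l."""
    A = g * alpha + nu0
    return ((alpha - 1) * np.log(yf) + A * np.log(psi)
            - ((g + 1) * alpha + nu0) * np.log(yf + psi)
            - betaln(alpha, A))

def predictive_density(yf, samples, nu0):
    """p(y_f | D, Z_f = j) via (5.47)-(5.49) for one class j.

    `samples` is a list of per-MCMC-draw dicts for class j with keys
    'alpha' ((R,d) ints), 'psi' ((R,d)), 'g' ((R,)), 'b' ((R,)).
    """
    total = 0.0
    for s in samples:
        w = s['b'] / s['b'].sum()                       # (5.48)
        for r in range(len(w)):
            log_p = sum(predictive_term(yf[l], s['alpha'][r, l],
                                        s['g'][r], s['psi'][r, l], nu0)
                        for l in range(len(yf)))
            total += w[r] * np.exp(log_p)               # (5.47)
    return total / len(samples)
```

For a single sample, single component and d = 1, the resulting density in y_f integrates to one, as it should for a (compound-gamma) predictive distribution.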
5.6.3 Incorporating uncertainty in target location

Before application of the classifier to new data, an attempt must be made to align the targets consistently with the training data. This is because the range bins occupied by the target will differ depending on the distance of the radar to the target. The standard way of doing this is to centre the training data, using an appropriate consistent procedure, and then centre each subsequent target using the same procedure. The centred data is then treated as being consistently aligned. However, there will always be uncertainty in the exact location of the target centre, and therefore of the target itself. This uncertainty should be taken into account. In the case of our RRP data, we centre a previously unseen profile about the extracted target centre c, producing centred data y_f(c). The possibility that the extracted centre might be a few range bins different from the centre of appropriately aligned training data is taken into account by considering small shifts in the centre position. This produces data y_f(c + s), where s is the (integer) shift of the position of the centre (and can be positive or negative). If we allow for errors of -s_m ≤ s ≤ s_m in the extracted centre, our single probability density, p(y_f | D, Z_f = j), can be replaced with expressions of the form:

p(y_f | D, Z_f = j) \approx \frac{1}{2s_m + 1} \sum_{s=-s_m}^{s_m} p(y_f(c+s) | D, Z_f = j).   (5.50)

Each expression p(y_f(c + s) | D, Z_f = j) is treated as if y_f(c + s) were the consistently aligned, previously unseen data. As s_m increases, unless the data is poorly aligned, the terms p(y_f(c + s) | D, Z_f = j) with |s| large will be negligible compared to the terms with |s| small. Thus, in practice, we need only consider small values of s_m.
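The centre-shift marginalisation (5.50) can be sketched as follows. The window width and the per-class density function are supplied by the caller; names are illustrative:

```python
import numpy as np

def shifted_density(profile, centre, s_max, density_fn, width=70):
    """Average of (5.50): class-conditional density marginalised over
    small integer errors in the extracted target centre."""
    vals = []
    for s in range(-s_max, s_max + 1):
        c = centre + s
        # centred `width`-bin window about the (shifted) centre
        window = profile[c - width // 2: c + width // 2]
        vals.append(density_fn(window))
    return np.mean(vals)
```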
5.7 Application to military ship ATR
5.7.1 Description

We now consider application of the algorithm to RRPs of military ships from seven class types. There are 19 datasets, each of which contains range profiles of a ship recorded as the ship turns through 360°. The aspect angle of the ship varies smoothly as one steps through the file, and the centroid of the ship drifts across range gates as the ship turns. Each range profile consists of 130 measurements of radar returns from 130 range gates of 3 m in extent. The radar operating frequency was 10 GHz. The datasets are divided into 11 training files and 8 test files, as shown in Table 5.1. We note that there are no test data available for class 7. All the training datasets are treated as labelled training data. The training and test datasets were collected at the same trial, and within each class use the same underlying target. The experiment therefore represents tightly controlled situations where the test data closely resemble the training data. In reality this will not be the case, and operational performance would suffer a further degradation due to differences in equipment fits (see Section 1.5). Thus the test datasets do not contain the variations from the training data that motivated the work documented in Chapter 4.

Class label   Ship type                   Training files   Test files
1             Light Aircraft Carrier      18               17
2             Spruance Class Destroyer    9, 10            8
3             Type 42 Destroyer           5, 12, 14        13, 19
4             Leander Class Frigate       2, 16            1, 15
5             Type 21 Frigate             4                3
6             Type 22 Frigate             6                7
7             Ticonderoga Class Cruiser   11               -

Table 5.1: Summary of ship data.

These datasets have been used by Luttrell [114] and Webb [188]. Luttrell [114] used a self-organising map to model each class separately. In particular, for each class a topographic mapping neural network [92] was trained on the range profiles from the class, with the behaviour of each network being constrained by the use of topological prior knowledge [112]. Webb [188] (similarly to this work) used a gamma mixture model to model each class likelihood. However, rather than using Bayesian parameter estimation techniques, the approach used the EM algorithm to estimate the maximum likelihood parameters for each class. In the documented experiment, to train the model a subset of 1200 samples over 360° is taken from each of the training set files. As an example, Fig. 5.2 shows intensity plots of the training data range profiles for classes 5 and 6. Further examples of the actual range profiles are provided in Fig. 5.3 and Fig. 5.4.
[Figure 5.2: Intensity plots of the class 5 (LHS) and class 6 (RHS) ship training data range profiles (range bin along horizontal axis, profile number along the vertical axis).]
[Figure 5.3: Some (normalised) ship radar range profiles from the training data. Left column at 0° bow-on to the radar, right column at 30° bow-on to the radar. Top row class 1, 2nd class 2, 3rd class 3, bottom class 4.]
[Figure 5.4: Some (normalised) ship radar range profiles from the training data. Left column at 0° bow-on to the radar, right column at 30° bow-on to the radar. Top row class 5, middle class 6, bottom class 7.]

When examining the figures it is important to note that, without a tracking radar to give an estimate of the ship pose, for each test RRP we are unable to select the training data examples at the corresponding aspect angle. Thus we are unable to make use of discriminability at specific angles. The classifier is tested on files with the ship orientated so that it is in the range ±40° bow-on or stern-on to the radar. This training and test formulation is the same as that used previously [114, 188], enabling performance comparisons to be made with those techniques. There are 10869 test patterns. Some pre-processing needs to be applied to the range profiles (both training and test) to ensure that the ships are consistently aligned. In the documented experiment this is done by extracting the target from each of the range profiles (using simple thresholding) and then calculating the centre of mass of the extracted target. Each profile is then replaced by a 70-dimensional vector, consisting of the range returns centred about the extracted target centre. The reduction in the range length of each profile may occasionally result in a loss of the measurements at the extremities of a target. However, this negative effect is likely to be outweighed by the more efficient training of the classifier. When the classifier is applied to test data, we need to account for the uncertainty in the exact location of the target centre, and therefore of the target itself. To do this we sum over small shifts of the extracted target centre, as described in Section 5.6.3. Non-target related variations (between profiles) in the strengths of the radar returns also need to be taken into account.
These can occur due to fluctuations in the power output from the radar and the collection of data under different environmental conditions, such as mist. To remove these variations each of our 70-dimensional vectors is normalised to have an average amplitude over range
gates of one. The mixture model classifier has been designed for situations where all the targets belong to one of the J specified classes. In operational use there will be situations where the target is actually from an unknown class. Ideally, in such situations the ATR algorithm should reject the target without making a classification [129]. The techniques described in Section 3.6.4 for rejecting targets within the framework of a Gaussian mixture model classifier apply to the gamma mixture model classifier, but have not been implemented for the documented example.
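The alignment and normalisation steps described above can be sketched as follows. The threshold fraction is a hypothetical placeholder (the documented experiment specifies only "simple thresholding"), and function names are illustrative:

```python
import numpy as np

def preprocess_profile(profile, width=70, threshold=0.1):
    """Centre a range profile and normalise it, as in Section 5.7.1."""
    # extract the target by simple thresholding (threshold is illustrative)
    mask = profile > threshold * profile.max()
    bins = np.nonzero(mask)[0]
    # centre of mass of the extracted target
    com = int(round(np.average(bins, weights=profile[bins])))
    # width-dimensional window of range returns about the extracted centre
    lo = com - width // 2
    window = profile[max(lo, 0): lo + width]
    # normalise to an average amplitude of one over the range gates
    return window / window.mean()
```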
5.7.2 Experimental results

Results are presented for the Bayesian gamma mixture model algorithm with 25 mixture components per class (a discussion on the use of different numbers of mixture components is included in Section 5.7.4). Two hundred samples have been drawn from the posterior distribution, by sampling every 25th iteration of the algorithm, after a burn-in period of 5000 iterations. The number of samples drawn is rather small for the dimension of the distribution being considered, and better modelling of the densities (at the expense of increased computational cost) may be possible if more samples are used. However, by examining the training data classification rates we are able to see that the MCMC sampler has converged enough to produce samples that model the data. Classifications have been made by assigning to the class with the largest posterior probability. The vector of classification rates for each class of the training data was (92.8%, 95.2%, 89.2%, 90.3%, 92.3%, 92.5%, 93.0%), where the i-th entry represents the training classification rate for the i-th class.

True class                    Predicted class
               1       2       3       4       5       6       7
    1        97.8     2.0     0.0     0.1     0.0     0.1     0.0
    2         1.4    93.5     1.1     1.1     0.0     0.1     2.8
    3         1.1     2.7    86.7     1.2     0.3     7.6     0.5
    4         7.3     1.2     3.5    78.9     4.5     1.7     2.9
    5         2.7     2.0     6.2     8.9    63.6    15.4     1.1
    6         7.4     2.4    15.9     0.0     0.0    73.2     1.1

Table 5.2: Classification rates for test datasets.

Table 5.2 documents the classification rates for the test data. To classify this data, shifts of up to ±2 range bins about the extracted target centres were considered. Table 5.3 and Table 5.4 present the test results for the self-organising map [114] and the maximum-likelihood gamma mixture model [188] approaches respectively.

True class                    Predicted class
               1       2       3       4       5       6       7
    1        72.0    11.3     9.3     0.6     0.2     2.5     4.2
    2         4.4    70.7    10.0     3.0     2.3     5.9     3.6
    3         2.9    10.1    67.7     1.8     3.3     9.8     4.4
    4         5.2     4.8    10.0    59.6     9.0     4.5     6.7
    5         1.7     2.8    24.6     4.4    57.9     6.2     2.3
    6         6.0     8.5    20.2     0.7     2.3    59.7     2.5

Table 5.3: Test results for self-organising map.
True class                    Predicted class
               1       2       3       4       5       6       7
    1        84.4    12.6     0.6     1.0     0.8     0.3     0.3
    2         8.9    86.1     0.2     2.0     0.0     0.0     2.9
    3         8.5     5.8    68.9     5.5     5.3     2.9     3.1
    4         4.2     6.0     3.4    73.1     8.8     1.7     2.8
    5         2.3     0.5     5.5    17.0    57.1    11.3     6.3
    6        10.3    10.7    23.2    11.4     0.9    37.7     5.8

Table 5.4: Test results for maximum-likelihood gamma mixture model.

The overall classification rates are summarised in Fig. 5.5. We can see that the Bayesian gamma mixture model classifier outperforms the other two classifiers in terms of classification rate, for each of the test classes. It should be noted, however, that classification rate is only a very simple indicator of performance [64], not least because it treats all misclassifications with equal weight.

[Figure 5.5: A summary of the correct classification rates for the RRP test data (classification rate against class label). The solid (red) line shows the rates for the Bayesian gamma mixture model classifier, the dashed (green) line the maximum-likelihood gamma mixture model classifier, and the dotted (blue) line the self-organising map.]

In addition, the pre-processing of the data is likely to have been different for the other two classifiers (e.g. different techniques for aligning the targets, and different dimensionality reduction of the profiles). Thus, the improvement in classification rate may not be wholly due to the underlying algorithm of the Bayesian gamma mixture model approach.
5.7.3 Computational cost

The Bayesian gamma mixture model approach is more computationally expensive than the other two classifiers. Both the self-organising network approach and the maximum-likelihood gamma mixture model classifier are considerably quicker to train than the MCMC algorithm. Applied to test data, the Bayesian approach is also more computationally expensive, with the computational cost relative to the maximum-likelihood gamma mixture model classifier increasing linearly with the
number of MCMC samples used (in the case of only a single MCMC sample, the same number of mixture components and the use of (5.46) to estimate the predictive density, the two approaches would have equivalent computational cost). However, the procedure for classification of test data is readily parallelisable, so this is not a serious practical limitation to its implementation.
5.7.4 Model selection

As stated in Section 5.3 there is an issue regarding the number of mixture components to use: as well as avoiding overfitting the training data, we need to consider the computational efficiency of smaller model orders. Fig. 5.6 displays the training data classification rates for a variety of model orders (3, 6, 12, 15, 20, 25, 30, 35), based on only short runs of the MCMC algorithm at each model order. It can be seen that, purely in terms of training data performance, the algorithm continues to improve at model orders beyond 25. However, the improvements are getting smaller (the graph is flattening), so when computational issues are taken into account, the use of 25 mixture components is reasonable. With increases in computational power, more components could be used, with a likely increase in classification performance.
Figure 5.6: The training data classification rate versus the model order (number of mixture components for each class).
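The flattening-curve argument can be captured by a crude elbow rule: pick the smallest model order whose short-run training rate is within a tolerance of the best rate observed. This is an illustrative heuristic, not the procedure used in the thesis:

```python
def select_model_order(rates, tol=0.01):
    """Smallest model order whose (short-run) training classification rate
    is within `tol` of the best observed rate.
    `rates` maps model order -> training classification rate."""
    best = max(rates.values())
    return min(k for k, r in rates.items() if r >= best - tol)
```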
5.7.5 Examples of recognition failure

In Fig. 5.7 and Fig. 5.8 we provide two examples of recognition failure for the Bayesian gamma mixture model technique. Both figures concern test and training data from class 5. In both figures the left-hand RRP is a misclassified target from the test data, while the right-hand RRP is the training data example from the correct class at approximately the same aspect angle. The figures highlight the need for RRP amplitude normalisation, since some of the differences between the test and training profiles can be removed by normalisation. However, it is clear that normalisation will not remove all variations, e.g. in Fig. 5.7 we note that the first major peak is higher than the second major peak for the training data RRP, but lower for the test data RRP.
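One common choice of amplitude normalisation is to scale each profile to unit norm; this is an illustration of the idea rather than the specific normalisation alluded to above:

```python
import numpy as np

def normalise_rrp(profile):
    """Scale a radar range profile to unit Euclidean norm. This removes
    overall gain differences between test and training profiles, but not
    shape differences such as the relative heights of peaks."""
    p = np.asarray(profile, dtype=float)
    return p / np.linalg.norm(p)
```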
Figure 5.7: Example of recognition failure for class 5. On the LHS a test RRP misclassified as class 4. On the RHS the training data RRP from class 5 at the corresponding aspect angle (approx. 143.5◦ ).
Figure 5.8: Example of recognition failure for class 5. On the LHS a test RRP misclassified as class 6. On the RHS the training data RRP from class 5 at the corresponding aspect angle (approx. 204.5◦ ).
5.7.6 Performance variation with target orientation

A further issue is how the performance of the Bayesian gamma mixture model technique varies with target orientation, since the “information content” of the range profiles will vary greatly with aspect angle, even within the ±40◦ bow-on or stern-on data that has been used for testing. To assess this, the restricted angle test data has been sub-divided into 10◦ segments, and the overall classification rate across all classes has been determined within each segment.³ These results are displayed in Fig. 5.9 (the 95% highest posterior density credible regions [189] for the classification rates within each segment give approximately ±1.5% error bars on the classification rates). That the classification performance is slightly worse for aspect angles within ±10◦ bow-on or stern-on (i.e. the 5◦, 175◦, 185◦ and 355◦ midpoint aspect angle bins) than it is for aspect angles within the next 20◦ of the bow and stern (i.e. the 15◦, 25◦, 155◦, 165◦, 195◦, 205◦, 335◦ and 345◦ midpoint aspect angle bins) is presumably due to larger shadowing effects when the ship is almost bow-on or stern-on. This effect has also been observed by researchers using different classification techniques on alternative (but similar) sets of ship radar range profile data [94].

³ When interpreting these results a note of caution should be added: by using the overall classification rate, we are biasing against classification performance on class 7, and in favour of those classes with multiple test files.
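The binning procedure can be sketched as follows; the helper is illustrative and assumes labelled test data with known aspect angles:

```python
import numpy as np

def rate_by_aspect_bin(aspect_deg, correct, bin_width=10.0):
    """Overall classification rate within fixed-width aspect-angle bins.
    Returns a dict mapping bin midpoint (degrees) -> fraction correct."""
    aspect = np.asarray(aspect_deg, dtype=float) % 360.0
    correct = np.asarray(correct, dtype=float)
    bins = np.floor(aspect / bin_width).astype(int)
    return {float(b * bin_width + bin_width / 2): float(np.mean(correct[bins == b]))
            for b in np.unique(bins)}
```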
Figure 5.9: The overall classification rate on test data as a function of the aspect angle (divided into bins of width 10◦ ).
5.8 Summary and recommendations
This chapter has developed a Bayesian gamma mixture model approach to ATR, in which the measurements from each class of target are modelled by gamma mixture distributions. The use of mixture models for target recognition has been justified from their ability to estimate complex and possibly multi-modal probability densities, and also specifically for ATR applications. The use of gamma component distributions arises from physical consideration of radar returns. The algorithm developed is not restricted solely to ATR problems and is applicable to the generic discrimination problem. Furthermore, the approach can make use of both labelled and unlabelled training data. For computational reasons, an assumption of a fixed model order has been made for each of the mixture distributions. Prior distributions for the mixture model parameters are defined and a MCMC algorithm, known as Metropolis-within-Gibbs, is used to draw samples from a posterior distribution augmented with the classification labels and the mixture component allocation variables. These MCMC samples are then used to classify future data in a manner which allows us to cope with uncertainty in target location within our data measurements. The algorithm has been applied to a problem of classification of RRPs from seven types of military ships. The technique has been found to compare very favourably with two previously published techniques; namely a self-organising map and a maximum-likelihood gamma mixture model classifier. As in the previous two chapters, there are many areas for possible future work. These include the model selection issue, of how many components to use in each mixture model, and the use of alternative methods for assessing the performance, perhaps taking into account the different costs that are involved in making a classification.
CHAPTER 6

BAYESIAN APPROACH TO GENERALISING TARGET CLASSIFICATION BETWEEN RADAR SENSORS

6.1 Introduction
6.1.1 General

This chapter proposes a Bayesian inverse model based solution to the problem of generalising target classification between sensors. The generic aim is to develop procedures that enable a classifier designed using data gathered from one sensor to be applied to data gathered from a different sensor. This can be used to address the issue of insufficient training data for an operational sensor, by providing a mechanism for operational sensors to utilise classifiers based upon sensors for which it is easier to obtain training data under extended operating conditions. The approach requires physical and processing models for the sensors to be known. The focus of the chapter is to enable objects imaged by a weapon’s seeker to be classified using ATR systems trained on more readily available ground-based sensor data. The immediate motivating application is to use ATR systems trained on readily available ISAR data to classify objects imaged by a Doppler Beam Sharpened (DBS) radar seeker. This is a non-trivial problem, since key differences between the measurements from different platforms arise from differences in sensor technology, spatial resolution, polarisation, imaging geometry and target motion. More specifically, the aim is to invert DBS images similar to the example displayed on the left-hand side of Figure 6.1 to an underlying radar cross section (RCS) from which a simulated ISAR image can be generated. The model generated ISAR images can then be classified using an ATR system trained using ISAR data similar to that on the right-hand side of Figure 6.1. The advantage of such a procedure is that it is relatively easy to collect high quality ISAR training data (in a variety of different configurations) by imaging targets on turntables.
6.1.2 Military benefits

In the increasingly digital battlefield, the deployment of multiple assets to target and engage threats is acquiring a high profile. The use of ATR technology on one sensor platform, using data derived
Figure 6.1: Example of a DBS (LHS) and ISAR (RHS) image of a battlefield target (range along the horizontal axis, cross range along the vertical axis).

from another, would enhance the range of options open to the military user. With ever expanding target sets, this efficient use of training data is likely to become increasingly vital. Ultimately, the work in this chapter will allow a weapon seeker ATR system to be trained on more readily available (and cheaper) data from a second sensor (in the motivating application an ISAR sensor). By utilising larger amounts of training data covering more varied extended operating conditions, this should lead to an improved autonomous classification ability for the weapon seeker. The outcome of this should be a weapon seeker that is better able to identify and react to changes in the configuration of targets during weapon fly-out. The methodology of the research is applicable to a range of sensors and platforms, but the focus is on radar. This is due to the ability of radar to be deployed in all weather conditions and its use across a wide range of military land, sea and air applications.
6.1.3 Bayesian inverse model based approach

A Bayesian inverse model based solution has been developed that uses sensor measurement models to invert the measurements from a particular type of sensor (e.g. DBS sensor) to a generic representation (the underlying RCS). Typically, the calculations required for the Bayesian solution are analytically intractable, so MCMC sampling is used. This MCMC sampling produces samples distributed as if they were drawn directly from the probability distribution for the underlying RCS. These samples can then be mapped onto the domain of the second type of sensor (e.g. ISAR), so that they appear as images from the second sensor. An ATR system designed for the second sensor can then be used to classify these model generated sensor images. The classifications for these model generated sensor images can then be combined to provide a classification for the original sensor image, meeting the desired aim of using training data from one sensor (in this case ISAR) to classify data from a second sensor (DBS).
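The stages described above can be sketched end-to-end; all helper functions here are hypothetical stand-ins for the MCMC sampler over the RCS posterior, the ISAR physical/processing model, and the ISAR-trained classifier:

```python
import numpy as np

def classify_dbs_via_isar(dbs_image, draw_rcs_samples, rcs_to_isar,
                          isar_class_probs, n_samples=100):
    """Sketch of the inverse-model pipeline:
    1. draw RCS samples from p(RCS | dbs) via MCMC,
    2. map each sample through the ISAR sensor model,
    3. classify each model-generated ISAR image,
    4. average the posterior class probabilities."""
    probs = [isar_class_probs(rcs_to_isar(sigma))
             for sigma in draw_rcs_samples(dbs_image, n_samples)]
    # Combine per-sample classifications into one answer for the DBS image
    return np.mean(probs, axis=0)
```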
6.1.4 Bayesian motivation

The main motivation behind the Bayesian approach to the problem lies in the unique ability of Bayesian statistics to handle limited and possibly conflicting pieces of information in a fully consistent manner. In particular, Bayesian statistics provides a consistent mechanism for manipulating probabilities assigned to observed data. Bayesian approaches are particularly well suited to inverse problems, because of the ability to incorporate prior information into the prior distributions. Furthermore, the Bayesian solution readily handles any “many-to-one” aspects in the inversion.
6.1.5 Scope of this chapter

This chapter establishes the theoretical framework for the Bayesian inverse model based approach and illustrates the process with a simplified example. In particular, simplified sensor measurement models are considered for the imaging processes and low-dimensional data is used. Section 6.2 introduces the proposed Bayesian framework. Section 6.3 provides a description of an MCMC approach to estimating the parameters of the Bayesian model. Section 6.4 describes a simplified problem, and details the MCMC algorithm used. Section 6.5 presents the results for the simplified problem. Conclusions and recommendations for the work documented in this chapter are in Section 6.6.
6.2 Bayesian framework
6.2.1 Introduction

This section develops the proposed Bayesian framework for applying a classifier designed using data gathered from one sensor to data gathered from a different sensor. To aid clarity, the explanation considers the specific scenario of applying an ATR system for ISAR images to data gathered with a DBS seeker. The generic representation of the data is taken to be the underlying RCS, with the aim being to invert the DBS data to this generic representation (an inverse model problem) before mapping onto the ISAR domain.
6.2.2 Section outline

Section 6.2.3 formulates the Bayesian model for inverting a DBS measurement to the underlying RCS, with the required prior and conditional distributions defined in Section 6.2.4. This information is then used in Section 6.2.5 to obtain the posterior distribution (6.11) for the RCS given a DBS measurement. This posterior distribution encapsulates our knowledge about the underlying RCS for a given DBS measurement, and is used in all subsequent calculations (see Section 6.2.6). How the posterior distribution for the RCS given a DBS measurement can be used to enable an ATR system for ISAR images to classify the DBS measurement is explained in Section 6.2.7. In particular, (6.18) provides the posterior class probabilities for a DBS measurement, based on a classifier trained on
ISAR data. Finally, Section 6.2.8 briefly discusses an alternative Bayesian classification model for using ISAR training data to classify DBS measurements.
6.2.3 DBS data generation

Specifications

The Bayesian framework used to model the DBS measurement process is illustrated by the DAG given in Figure 6.2.

[Figure 6.2: DBS data generation. The RCS σ and the parameters θ feed through the physical model into the raw sensor measurements z, which feed through the processing model into the DBS image dbs.]

We assume that a physical model is available that transforms the RCS σ to a set of raw sensor measurements z. The model depends on parameters θ, which may be unknown. The parameters θ will contain information on: a) aircraft dynamics (speed, acceleration, roll etc), b) sensor characteristics (beam shape, pulse width etc). Note, however, that θ does not contain information specific to the target being imaged, such as the aspect angle of the target relative to the sensor beam. Such information is implicit within the RCS σ. It will be assumed that, excluding random noise, such a physical model completely describes the transformation of the RCS to the raw sensor measurements. If this turns out to be inappropriate, the discrepancies between the model and reality may need to be taken into account by a semiparametric modelling approach [56]. For example, the discrepancies between a model and reality could be expressed as a weighted sum of basis functions, where the distribution of weights is learned by an optimisation process. The second part of the DBS measurement process is the processing model that takes raw sensor data z and converts it to a DBS image dbs. It is assumed that such a model is available.
The overall model can be used in a generative way or for inference. The generative mode of operation simulates DBS images given σ and θ. Used for inference, the aim is to “invert” a DBS image to obtain (a distribution of) the underlying RCS. There are three possible scenarios:

S1) Infer σ, given dbs, with θ fixed at the ‘correct’ value (a focused problem).
S2) Infer σ, given dbs, with θ fixed at an ‘incorrect’ value (a defocused problem).
S3) Infer σ and θ, given dbs (an autofocus problem).

There is much research effort into autofocus techniques in the radar research community. For example, Luttrell [113] proposes a maximum likelihood solution for autofocussing SAR imagery. Mathematically, the DAG in Figure 6.2 can be expressed as the joint distribution:

p(σ, θ, dbs, z) = p(dbs|z) p(z|σ, θ) p(σ) p(θ).    (6.1)
The components of this distribution are the prior distribution for the RCS, p(σ), the prior distribution for θ, p(θ), the conditional distribution for the raw sensor data given the measurement model parameters and the RCS, p(z|σ, θ), and the conditional distribution of the DBS image given the raw sensor data, p(dbs|z). These distributions are examined in more detail below.
6.2.4 Required distributions

Conditional distributions with additive noise

The form of the conditional distribution p(z|σ, θ) depends on the physical model for the sensor data generation and the noise in that physical model. Under the assumption that the raw sensor data is given by a (known) deterministic function of the RCS together with additive noise, a representation of the raw sensor data is given by:

z = fph(σ, θ) + ε,    (6.2)

where fph(σ, θ) represents the analytic physical model, and ε is the additive random noise, drawn from a distribution p1(ε). Typically, this random noise will be based on the Gaussian distribution. The conditional distribution is then represented by:

p(z|σ, θ) = p1(z − fph(σ, θ)).    (6.3)
A similar procedure can be adopted for the conditional distribution p(dbs|z). However, since this distribution represents the processing within a sensor, it will typically be taken to be a point mass distribution, representing the situation where the formation of a DBS image dbs is completely deterministic given z.
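Under (6.2) and (6.3) with iid zero-mean Gaussian noise, the log-likelihood takes a simple closed form; a sketch, with `f_ph` a hypothetical stand-in for the analytic physical model:

```python
import numpy as np

def log_lik_additive_gaussian(z, sigma, theta, f_ph, noise_std):
    """log p(z | sigma, theta) under the additive-noise model (6.2)-(6.3),
    assuming iid zero-mean Gaussian noise with standard deviation noise_std.
    f_ph is the (assumed known) analytic physical model."""
    eps = np.asarray(z, dtype=float) - np.asarray(f_ph(sigma, theta), dtype=float)
    n = eps.size
    return float(-0.5 * np.sum(eps ** 2) / noise_std ** 2
                 - n * np.log(noise_std * np.sqrt(2.0 * np.pi)))
```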
Conditional distributions with noise in the imaging function

The form of the conditional distribution p(z|σ, θ) is more complicated if there is non-additive noise in the physical model for the sensor data generation. Such noise might occur if there is corruption of data in the sensor, prior to its recording as the raw sensor measurements z. This non-additive noise could be used to account for inadequacies in the physical and sensor model, such as unknown parameters that are not modelled by θ. However, in most circumstances such parameters should (for clarity) be incorporated into θ. If there is both additive and non-additive noise, a representation of the raw sensor data is given by:

z = fph(σ, θ, η) + ε.    (6.4)

As in Section 6.2.4, ε is the additive random noise, drawn from a distribution p1(ε). The function fph represents the deterministic physical model for the raw sensor data, given the non-additive noise realisation η. This noise realisation is taken to be drawn from a (known) distribution p2(η|σ, θ). The conditional distribution is then given by:

p(z|σ, θ) = ∫ p(z, η|σ, θ) dη = ∫ p(z|σ, θ, η) p(η|σ, θ) dη = ∫ p1(z − fph(σ, θ, η)) p2(η|σ, θ) dη.    (6.5)
In some circumstances it will be possible to evaluate the integral analytically, and we will be able to proceed in the same manner as in the purely additive noise case. In cases where the analytic form of the marginal distribution cannot be calculated, the noise realisation η needs to be estimated as a random variable. In such situations, η would become an extra node in the Bayesian network of Figure 6.2, with parents (i.e. conditioned on) σ and θ, and influencing (i.e. a parent of) z.
Prior distribution for the RCS

The form of the prior distribution for σ will depend on the representation of σ. Perhaps the simplest way to represent σ is as a multi-dimensional (typically two-dimensional) grid of values. A second form of representation is as a fixed number of scatterers with variable positions and strengths. A physically correct extension to this second form is to vary the number of scatterers as well as their positions and strengths. In this case, the dimensionality of the random variables comprising σ would itself be a random variable. This would complicate the parameter estimation in the Bayesian model, and most likely require the use of an RJMCMC algorithm [30, 61, 156]. Although the second representation allows the dimensionality of the variables defining the RCS σ to be lower than in the grid-based representation, it is likely to complicate calculation of the sensor images. This is because the imaging operator will need to be re-computed for each set of scatterer locations. For this reason, this chapter considers the first case of a two-dimensional grid of RCS
values, which provides an approximation to the more physically correct variable number of scatterers model. The prior for a grid-based representation of σ is taken to be made up of independent one-dimensional priors. In particular, denoting σ by:

σ = {σi,j , 1 ≤ i ≤ dσ1 , 1 ≤ j ≤ dσ2},    (6.6)

where dσ1 × dσ2 is the dimensionality of the grid of points, the prior distribution is taken to be:

p(σ) = ∏_{i=1}^{dσ1} ∏_{j=1}^{dσ2} p(σi,j).    (6.7)
Simple distributions such as the gamma, exponential or log-normal distribution can then be used for the component prior distributions p(σi,j ). Ideally, the parameters of these prior distributions would be based upon the RCS grids that would be expected for the range of targets to be identified (i.e. determined using expert knowledge). Better prior distribution modelling is possible if the parameters of the prior distribution are also treated as random variables, ς. The variables ς are termed hyperparameters, and their prior distributions are known as hyperpriors. Training data can then be used to determine a posterior distribution for these hyperparameters. This training data could consist of ISAR images of targets on a turntable and possibly DBS images of targets and clutter. Figure 6.3 illustrates the model for σ given a set of representative DBS training images dbs1 , . . . , dbst (a fuller model, incorporating the ISAR training data, is provided in Appendix H). For each of the training images dbs1 , . . . , dbst there is an underlying RCS grid σ 1 , . . . , σ t , respectively. For simplicity we have incorporated the raw sensor data z, z1 , . . . , zt into the DBS images. The prior distributions for the RCS grids depend probabilistically on the set of hyperparameters ς. There are several possible models that could be implemented: a) ς could represent the parameters of an exponential or gamma distribution model for a general RCS. In its simplest form, it would be the parameters of independent distributions, as in (6.7). b) The model could be a mixture model comprising components appropriate for modelling clutter and different targets. The use of training data allows the variable hyperparameters ς to specify the prior distribution for the RCS more accurately, since the larger posterior probabilities for the hyperparameters will correspond to values which model the RCS well.
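As an illustration, with independent exponential component priors the log prior of (6.7) reduces to a simple sum over grid cells; the rate hyperparameter here is purely illustrative, and in practice could be learned from training data as described above:

```python
import numpy as np

def log_prior_rcs_grid(sigma_grid, rate=1.0):
    """Log prior for an RCS grid under (6.7) with independent exponential
    component priors p(sigma_ij) = rate * exp(-rate * sigma_ij)."""
    s = np.asarray(sigma_grid, dtype=float)
    return float(np.sum(np.log(rate) - rate * s))
```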
Prior distribution for the imaging model parameters

A prior distribution p(θ) is required for the imaging model parameters θ. This will depend on the exact form of θ, which is determined by the physical model transforming the RCS to the raw sensor
[Figure 6.3: Prior distribution for the RCS. The hyperparameters ς feed into the RCS grids σ1, . . . , σt and σ, which in turn feed into the DBS images dbs1, . . . , dbst and dbs respectively.]

measurements (Section 6.2.3). In many cases each of the variables that comprise θ (e.g. pulse width) will be known to within a tolerance, so independent Gaussian distributions would be appropriate.
6.2.5 Posterior distribution

Assuming that dbs (the measured DBS image) is observed, the posterior distribution of interest is:

p(σ, θ|dbs).    (6.8)

This posterior distribution defines completely the inversion of the sensor measurement dbs to the underlying RCS σ. Using the definition of conditional probabilities we have:

p(σ, θ|dbs) ∝ p(σ, θ, dbs),    (6.9)

which can be expressed as:

p(σ, θ|dbs) ∝ ∫_z p(σ, θ, dbs, z) dz ∝ ∫_z p(dbs|z) p(z|σ, θ) p(σ) p(θ) dz,    (6.10)

where the second line follows from the first using (6.1). Equation (6.10) is conditional on the final DBS image only. However, we note that the raw sensor measurements z should also be available within the sensor, so it is possible to condition on z as well as dbs. For this reason, along with the fact that the generation of dbs should be deterministic given z (see Section 6.2.4), we incorporate the raw sensor measurements into the DBS image variable dbs. Figure 6.2 can then be simplified to Figure 6.4, with the posterior distribution of interest given by:

p(σ, θ|dbs) ∝ p(dbs|σ, θ) p(σ) p(θ).    (6.11)
[Figure 6.4: DBS data generation, combining physical and processing models. The RCS σ and parameters θ feed directly into the DBS image dbs.]

Equation (6.11) is essentially an application of Bayes’ theorem, with the prior distribution p(σ, θ) split into the factors p(σ)p(θ) due to the prior independence of σ and θ (displayed graphically in Figure 6.2 and Figure 6.4). The posterior distribution requires the specification of the conditional distribution for the sensor measurements introduced in Section 6.2.4. Calculation of the normalisation constant in (6.11) is unlikely to be tractable, and for most physical and processing models, statistics of interest (such as the mean and covariance) will not be available analytically. Rather than making some rather dubious simplifications to allow analytic inference on the posterior distribution, a full Bayesian approach to the problem is maintained by drawing samples from the posterior. All inferences can then be made through consideration of these samples. For example, given samples {σ(s), s = 1, . . . , N} from the posterior distribution for σ, the posterior mean for σ can be approximated by:

E(σ|dbs) ≈ (1/N) ∑_{s=1}^{N} σ(s).    (6.12)
In most circumstances it will not be possible to sample directly from the distribution, so an MCMC algorithm [55, 158, 178] is used. The MCMC algorithm is detailed in Section 6.3.
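As a concrete (much simplified) illustration of the kind of sampler involved, a random-walk Metropolis scheme targeting the unnormalised posterior of (6.11), with θ held fixed for brevity; this is a generic sketch, not the specific MCMC algorithm of Section 6.3:

```python
import numpy as np

def metropolis_rcs(log_post, sigma0, n_samples, step=0.1, seed=0):
    """Random-walk Metropolis sketch for drawing samples from an
    unnormalised log posterior over an RCS grid."""
    rng = np.random.default_rng(seed)
    sigma = np.asarray(sigma0, dtype=float)
    lp = log_post(sigma)
    samples = []
    for _ in range(n_samples):
        proposal = sigma + step * rng.standard_normal(sigma.shape)
        lp_prop = log_post(proposal)
        # Accept with probability min(1, posterior ratio)
        if np.log(rng.random()) < lp_prop - lp:
            sigma, lp = proposal, lp_prop
        samples.append(sigma.copy())
    return samples
```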
6.2.6 Bayesian objective

At this stage it is useful to stress that within a Bayesian framework it is the entire posterior distribution that is used to provide a solution, not just a single estimate of the “best-fitting” parameters. Furthermore, it should be noted that the Bayesian inverse model based approach is not attempting to minimise the Euclidean distance to the actual RCS. The DBS measurement process has a stochastic element (e.g. the addition of Gaussian noise), so many different RCS grids could give rise to a given DBS measurement. Which of these RCS grids maximises the posterior probability (and is therefore ‘favoured’ by the Bayesian model) is governed by the combination of the likelihood of the measurement (for each RCS) and the prior distribution for each RCS. The likelihood of the measurement for a given RCS relates to the probability density of the noise realisation required to produce the measured DBS image from the RCS. In essence, it provides a measure for how ‘likely’ a given RCS is to have produced the measured DBS image. Unless the actual noise realisation when generating the measured image from the actual RCS corresponds to
the largest probability density of the measurement noise (for the case of Gaussian noise this is the mean of the Gaussian distribution) it is likely that there will be alternative RCS grids that lead to larger values for the likelihood of the measurement (i.e. ‘better fit’ the measured image). As an example, consider the case where the sensor measurement process consists only of the addition of zero mean Gaussian noise (i.e. the underlying physical and sensor measurement process leaves the RCS unchanged). Under such a model the actual RCS σ might give rise to a sensor measurement of x after the addition of Gaussian noise. The RCS maximising the likelihood of the measurement will be x, since this corresponds to the noise realisation with highest probability density (i.e. zero). Thus, an approach based on maximising the likelihood will not converge to the actual RCS σ, and will therefore not minimise the Euclidean distance to the actual RCS. The prior distribution can be used to increase the posterior probability at the actual RCS (as compared to the RCS grids that give rise to larger likelihoods), but this relies on the actual RCS having a larger prior probability than the other RCS grids, which is not guaranteed. Thus, the actual RCS is unlikely to be the same (nor due to varying noise realisations should it be) as the RCS that maximises the posterior probability.
6.2.7 Using the ISAR training data

With the exception of the hyperparameter modelling in Appendix H, the proposed model has not made use of the training data ISAR measurements of targets on turntables, denoted by {isar}. To make use of these ISAR measurements, or more specifically an ATR system trained using ISAR data, we need to determine the distribution of the ISAR image isar that would have been obtained from the target that gave rise to our DBS image dbs. We determine this distribution via the intermediate RCS variable σ (whose posterior distribution was determined in Section 6.2.5). Figure 6.5 extends the Bayesian model in Figure 6.4 to include the ISAR image variable. The variable φ represents the parameters of the physical and processing model for transforming from the underlying RCS to an ISAR image. This model for transforming the RCS to an ISAR image is analogous to the model for transforming the RCS to a DBS image. Whether φ should be treated as a random variable depends on our knowledge about the parameters for the training ISAR data. Ideally, the training ISAR data would all have the same known parameters which we would then use as the value of φ for our model generated ISAR data. If the values of φ are not known for our training data then they should be estimated off-line, using an autofocus procedure.

[Figure 6.5: Bayesian inverse model incorporating the ISAR image. The RCS σ feeds into both the DBS image dbs (via parameters θ) and the ISAR image isar (via parameters φ).]

Mathematically the required distribution can be written as p(isar|φ, dbs). This distribution can be obtained by marginalisation of the joint distribution of the ISAR image and the underlying RCS:

p(isar|φ, dbs) = ∫_σ p(isar, σ|φ, dbs) dσ.    (6.13)

This can be written as:

p(isar|φ, dbs) = ∫_σ p(isar|σ, φ, dbs) p(σ|φ, dbs) dσ = ∫_σ p(isar|σ, φ) p(σ|dbs) dσ,    (6.14)
with the bottom line following from the top by noting that the distribution of the ISAR image is defined by the underlying RCS together with the ISAR model parameters, and that the underlying RCS is unaffected by the model parameters for the ISAR image. The distribution p(isar|σ, φ) is obtained in a similar manner to p(dbs|σ, θ) in Section 6.2.4, with the physical model and processing for DBS images being replaced by the appropriate physical model and processing for ISAR images. The distribution p(σ|dbs) is obtained from (6.11), using the following marginalisation:

p(σ|dbs) = ∫_θ p(σ, θ|dbs) dθ.    (6.15)
However, as stated in Section 6.2.5, this distribution is unlikely to be available analytically. To proceed, we note that typically when making inference based on the model generated ISAR image we will be interested in the expectation of a function g(isar) of the ISAR image. Such an expectation is given by:

E[g(isar)|φ, dbs] = ∫_isar g(isar) p(isar|φ, dbs) d isar = ∫_isar g(isar) ∫_σ p(isar|σ, φ) p(σ|dbs) dσ d isar.    (6.16)
As an example, if g(isar) = isar, (6.16) would give the expected value of the ISAR image, given the DBS image and the ISAR imaging parameters. Now suppose that using our ISAR training data we have been able to design an ATR system for ISAR images (perhaps based upon the mixture model approaches of Chapters 3, 4 and 5). In particular, suppose that we have a mechanism for calculating the posterior class probabilities, p(C = j|isar, {isar}), for j = 1, . . . , J, where J is the total number of classes, isar is the ISAR image to be classified and {isar} denotes the training ISAR data (together with any class labels).
Then, defining J functions gj(isar) by:

gj(isar) = p(C = j|isar, {isar})    for j = 1, . . . , J,    (6.17)

and substituting each of these functions into (6.16), we obtain the following:

E[p(C = j|isar, {isar})|φ, dbs] = ∫isar p(C = j|isar, {isar}) ∫σ p(isar|σ, φ) p(σ|dbs) dσ d isar.    (6.18)
The expectations in (6.18) provide an estimate of the posterior class probabilities for the DBS image, based on a classifier trained using ISAR data. Analytical evaluation of (6.18) is unlikely to be possible, so Section 6.3.3 shows how to evaluate the function using the MCMC samples for σ, drawn from the posterior distribution given in (6.11).
6.2.8 An alternative classification model

None of the Bayesian models described so far in this chapter have included the class variable (i.e. the target type) of the object being imaged. This would appear to be sub-optimal, since the prior distribution for σ would ideally be conditioned on the class of the target. Figure 6.6 displays a DAG containing both the class variable C and the ISAR and DBS images.

Figure 6.6: Including the class label in the DBS and ISAR model (a DAG in which C is a parent of σ, with σ and θ the parents of dbs, and σ and φ the parents of isar).

The joint distribution in Figure 6.6 is written mathematically as:

p(C, σ, dbs, isar, θ, φ) = p(dbs|σ, θ) p(isar|σ, φ) p(σ|C) p(C) p(θ) p(φ).    (6.19)
In the operational stage of the current scenario, a DBS measurement of a target is received, but
no ISAR measurement is available. Thus, we have:

p(C|dbs) = ∫σ ∫isar ∫θ ∫φ p(C, σ, isar, θ, φ|dbs) dσ d isar dθ dφ,    (6.20)

which using the definition of conditional probabilities together with (6.19) becomes:

p(C|dbs) ∝ ∫σ ∫isar ∫θ ∫φ p(dbs|σ, θ) p(isar|σ, φ) p(σ|C) p(C) p(θ) p(φ) dσ d isar dθ dφ.    (6.21)
Now using the facts that:

∫isar p(isar|σ, φ) d isar = 1    and    ∫φ p(φ) dφ = 1,    (6.22)

(6.21) can be simplified to:

p(C|dbs) ∝ p(C) ∫σ ∫θ p(σ|C) p(dbs|σ, θ) p(θ) dσ dθ.    (6.23)
The distribution p(C) represents the prior class probabilities, while p(dbs|σ, θ) represents our combined physical and processing model for the DBS image (see Section 6.2.3 and Section 6.2.4), and p(θ) represents the prior distribution for the model parameters. The research question to be answered under this model is therefore whether we can estimate the conditional distributions p(σ|C) given some ISAR training data images, and possibly a limited number of DBS training data images. One approach would be to invert the ISAR (and any DBS) training data from a given class into underlying RCS data, and then use that data to build a density estimate for the required distribution. This would require the specification of a parametric model for p(σ|C), such as the mixture models of previous chapters. The advantage of such an approach is that it would lead to a sensor-independent representation of the targets. A disadvantage is that efficient evaluation of the integral defined in (6.23) would most likely require some form of inversion of the operational DBS data. Thus, both the training phase and operational use would require the potentially time-consuming inversion procedure. Furthermore, by building the classifier in the RCS domain, we are unable to utilise already existing classification algorithms for ISAR data. Nevertheless, the inclusion of the class variable in the Bayesian model provides an interesting potential area for further study.
6.3 MCMC solution

6.3.1 Introduction

This section describes how to use MCMC sampling techniques to obtain samples from the posterior distribution for the RCS given a DBS measurement. These samples are then used to estimate posterior class probabilities for the DBS measurement, using a classifier trained on ISAR data. The
sampling from the posterior distribution is covered in Section 6.3.2 and the use of the ISAR classifier is covered in Section 6.3.3. A hybrid MCMC algorithm known as Metropolis-within-Gibbs [158] is used. Specifically, a Gibbs sampler is used with the modification that, for each component distribution that is hard to sample from, a single Metropolis-Hastings (M-H) step is used rather than exact simulation. The M-H algorithm for sampling from a distribution p(ψ) works by defining a proposal distribution q(ψ′|ψ) for updating the value of ψ to ψ′. This proposal distribution should be easy (and hopefully computationally cheap) to sample from. At each iteration, a sample ψ′ is drawn from the distribution q(ψ′|ψ), where ψ is the current value of the variable. The proposed value ψ′ is accepted with probability a(ψ, ψ′) = min(1, r(ψ′, ψ)), where r(ψ′, ψ) = [p(ψ′) q(ψ|ψ′)] / [p(ψ) q(ψ′|ψ)].
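The single M-H update just described can be sketched as follows. The functions `log_p`, `q_sample` and `q_logpdf` are hypothetical stand-ins for an unnormalised log target and a proposal distribution, not models from this chapter, and the acceptance ratio r(ψ′, ψ) is computed on the log scale for numerical stability.

```python
import numpy as np

def mh_step(psi, log_p, q_sample, q_logpdf, rng):
    """One Metropolis-Hastings update of psi targeting the (unnormalised)
    density exp(log_p). q_sample draws psi' given psi; q_logpdf(a, b)
    evaluates log q(a | b)."""
    psi_prop = q_sample(psi)
    # log r(psi', psi) = log[p(psi') q(psi|psi')] - log[p(psi) q(psi'|psi)]
    log_r = (log_p(psi_prop) + q_logpdf(psi, psi_prop)
             - log_p(psi) - q_logpdf(psi_prop, psi))
    if np.log(rng.uniform()) < log_r:   # accept with probability min(1, r)
        return psi_prop, True
    return psi, False

# Toy target: standard Gaussian, with a symmetric random-walk proposal.
rng = np.random.default_rng(0)
log_p = lambda x: -0.5 * x ** 2
q_sample = lambda x: x + 0.5 * rng.standard_normal()
q_logpdf = lambda a, b: -0.5 * ((a - b) / 0.5) ** 2  # symmetric: cancels in log_r

x, accepts, samples = 0.0, 0, []
for _ in range(5000):
    x, acc = mh_step(x, log_p, q_sample, q_logpdf, rng)
    accepts += acc
    samples.append(x)
```

For a symmetric random walk the proposal terms cancel, recovering the plain Metropolis acceptance probability min(1, p(ψ′)/p(ψ)).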
6.3.2 Sampling from the posterior distribution

Outline

This section describes how to draw samples from the posterior distribution p(σ, θ|dbs) given in (6.11). The outline of the algorithm is given in Figure 6.7.

1) Initialisation. Set s = 1 and determine initial values (σ(0), θ(0)) from the support of the joint posterior distribution.
2) Iteration s:
   • Sample the underlying RCS σ(s) from the conditional distribution p(σ|θ(s−1), dbs).
   • Sample the DBS physical and processing model parameters θ(s) from the conditional distribution p(θ|σ(s), dbs).
3) Set s ← s + 1, and go to 2.

Figure 6.7: MCMC algorithm for DBS inversion.

Initialisation of the algorithm requires specification of initial values (σ(0), θ(0)) for the variables (σ, θ). Theoretically, these could be sampled from the prior distribution. However, faster convergence of the MCMC algorithm is more likely if the variables are initialised to values with high posterior probability. The RCS σ(0) could potentially be initialised using an alternative (as yet unspecified) technique for DBS inversion, and the parameters θ(0) could be initialised to their expected values. Intelligent initialisation of the MCMC sampler is likely to reduce the computational expense of the MCMC approach, since the Markov chain will not take as long to converge.

After an initial burn-in period, during which the generated Markov chain reaches equilibrium, the sets of parameters (σ(s), θ(s)) can be regarded as dependent samples from the posterior distribution p(σ, θ|dbs). To obtain approximately independent samples a gap (known as the decorrelation gap) is left between successive samples. If we are only concerned with ergodic averages, lower variances are obtained if the output of the Markov chain is not sub-sampled. However, if storage of the samples
is an issue, it may be better to leave a decorrelation gap, so that the full space of the distribution can be explored without having to keep thousands of samples. In our notation these approximately independent samples are relabelled (σ(s), θ(s)), for s = 1, . . . , N, where N is the number of MCMC samples. Typically, the choice of burn-in period, decorrelation gap and number of samples will involve specifying an initial set of values and then, if possible, examining the classification rate for the training datasets. These initial values can then be modified in the light of the classification performance. This, together with examination of sample plots, is likely to provide a better test of convergence than more sophisticated convergence diagnostics [55, 125], which tend to fare poorly for all but the simplest of problems [71, 189].
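The burn-in and decorrelation-gap post-processing can be sketched as follows. The AR(1) chain below is a synthetic stand-in for raw MCMC output (it is not one of the thesis models); the burn-in length and gap are illustrative choices.

```python
import numpy as np

def thin_chain(chain, burn_in, gap):
    """Discard the burn-in portion of the chain, then keep every gap-th
    sample (the decorrelation gap) to obtain approximately independent
    draws."""
    return chain[burn_in::gap]

# Synthetic AR(1) chain standing in for raw MCMC output.
rng = np.random.default_rng(1)
chain = np.empty(20000)
chain[0] = 5.0                       # a deliberately poor starting value
for s in range(1, chain.size):       # heavily autocorrelated updates
    chain[s] = 0.9 * chain[s - 1] + rng.standard_normal()

kept = thin_chain(chain, burn_in=1000, gap=20)
```

The thinned sequence `kept` has a far smaller lag-one autocorrelation than the raw chain, at the cost of retaining fewer samples, matching the storage-versus-variance trade-off described above.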
The underlying RCS

The conditional distribution p(σ|θ, dbs) is given by:

p(σ|θ, dbs) ∝ p(σ, θ|dbs) ∝ p(dbs|σ, θ) p(σ) p(θ) ∝ p(dbs|σ, θ) p(σ),    (6.24)
where the first line follows from the mathematical definition of a conditional distribution, the second line uses (6.11), and the third line is a consequence of ignoring the normalisation constant. In some circumstances it might be possible to sample directly from the conditional distribution defined in (6.24). However, in most cases this will not be possible, so a Metropolis-Hastings (M-H) [158] step is used. The simplest form of M-H step uses a random walk proposal distribution, such as a multivariate Gaussian distribution centred on the current value for σ. More sophisticated proposal distributions can also be incorporated, which move the sampler to regions of interest quickly. These proposal distributions could be based on the results obtained from other (as yet unspecified) techniques for inverting DBS images to the underlying RCS, and do not need to depend on the current value for σ. The exact form of these more sophisticated proposal distributions would depend on the technique upon which they are based. Possible examples include a Gaussian distribution centred on the estimate obtained from an alternative technique, or mixture models with component densities and component probabilities derived from the results obtained using other techniques. As an example, consider the situation where an inversion technique produces weightings w1, . . . , wr for the degree to which σ1, . . . , σr, respectively, could give rise to the measured DBS image. Assuming that the weightings are positive, a mixture model proposal distribution could be devised as follows:

q(σ′|σ) = q(σ′) ∝ Σ_{i=1}^{r} wi N(σ′; σi, τ),    (6.25)
where τ is an appropriately defined common covariance matrix and the first line emphasizes that
the proposal distribution does not depend on the current value σ. The danger of using a proposal distribution that is not a random walk is that the irreducibility of the Markov chain might be lost (irreducibility refers to the ability of the Markov chain transition kernel to move over the entire state space, regardless of the starting point). To mitigate this problem, an adapted M-H step can be used, in which we choose randomly between two possible proposal distributions [178]. The first proposal distribution, chosen with probability λσ (fixed prior to running the algorithm), is the proposal distribution based on results from alternative inversion procedures. The second, chosen with probability 1 − λσ, is the random walk proposal distribution, which by perturbing about the current value ensures the irreducibility of the Markov chain, whatever the choice for the first proposal distribution. The random walk proposal distribution can be tuned to increase (or decrease) the acceptance rate for moves by decreasing (or increasing) the variance of the proposed random walk.
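The adapted proposal scheme can be sketched as below. The candidate grids, weightings, and scale parameters are all hypothetical stand-ins; only the drawing of the proposal is shown, and in a full M-H step the acceptance ratio would have to evaluate the mixture density q(σ′) of (6.25) whenever the independence branch is used (the mixture proposal is not symmetric, so its terms do not cancel).

```python
import numpy as np

rng = np.random.default_rng(2)

def adapted_proposal(sigma, lam, centres, weights, tau, rw_scale):
    """Draw a proposal for sigma. With probability lam, use an independence
    mixture proposal in the spirit of (6.25), built from candidate RCS
    grids supplied by an alternative inversion technique; otherwise use a
    random walk about the current value, which keeps the chain irreducible."""
    if rng.uniform() < lam:
        i = rng.choice(len(weights), p=weights / weights.sum())
        return centres[i] + tau * rng.standard_normal(sigma.shape)
    return sigma + rw_scale * rng.standard_normal(sigma.shape)

# Hypothetical candidate grids sigma_1, sigma_2 with weightings w_1, w_2.
centres = [np.zeros(4), np.ones(4)]
weights = np.array([0.7, 0.3])
sigma = np.full(4, 0.5)
prop = adapted_proposal(sigma, lam=0.5, centres=centres,
                        weights=weights, tau=0.1, rw_scale=0.2)
```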
DBS measurement model parameters

The conditional distribution p(θ|σ, dbs) is given by:

p(θ|σ, dbs) ∝ p(σ, θ|dbs) ∝ p(dbs|σ, θ) p(σ) p(θ) ∝ p(dbs|σ, θ) p(θ),    (6.26)
where the first line follows from the mathematical definition of a conditional distribution, the second line uses (6.11), and the third line is a consequence of ignoring the normalisation constant. For some prior distributions and sensor measurement processes it will be possible to sample directly from the conditional distribution defined in (6.26). However, in most cases this will not be possible, so, as with the sampling of p(σ|θ, dbs), an M-H step is used. Similarly to the conditional sampling of the RCS, two types of proposal distribution can be used. The first aims to move to regions of interest quickly (e.g. a Gaussian distribution centred on values of the parameters estimated from other techniques and measurements) and the second is a random walk proposal (e.g. a Gaussian perturbation about the current value).
Further sub-divisions

The Gibbs sampling of (σ, θ) from p(σ, θ|dbs) has been split into two steps: the sampling of p(σ|θ, dbs) and the sampling of p(θ|σ, dbs). M-H steps have been proposed for situations where these conditional distributions cannot be sampled from directly. If the dimensionality of either σ or θ is high, these M-H steps will be highly inefficient. In particular, if the dimensionality is large, the random walk proposal moves will only have reasonable acceptance rates when the proposed perturbations are very small. If this is the case, it will take an infeasibly long time to explore the interesting areas of the posterior distribution.
One approach to countering this problem is to divide the variables into smaller component parts. For example, the RCS σ could be divided into its one-dimensional component parts σi,j:

σ = {σi,j, 1 ≤ i ≤ dσ1, 1 ≤ j ≤ dσ2},    (6.27)

as in (6.6). Defining σ−i,−j to be the set of values of σ with σi,j removed, and more specifically:

σ(s)−i,−j = {σ(s)i′,j′, 1 ≤ i′ < i, 1 ≤ j′ ≤ dσ2} ∪ {σ(s)i,j′, 1 ≤ j′ < j} ∪ {σ(s−1)i,j′, j < j′ ≤ dσ2} ∪ {σ(s−1)i′,j′, i < i′ ≤ dσ1, 1 ≤ j′ ≤ dσ2},    (6.28)

iteration s of Figure 6.7 is replaced by the procedure outlined in Figure 6.8.
• Sample the component σ(s)1,1 of the underlying RCS from the conditional distribution p(σ1,1|σ(s)−1,−1, θ(s−1), dbs).
• . . .
• Sample the component σ(s)i,j of the underlying RCS from the conditional distribution p(σi,j|σ(s)−i,−j, θ(s−1), dbs).
• . . .
• Sample the component σ(s)dσ1,dσ2 of the underlying RCS from the conditional distribution p(σdσ1,dσ2|σ(s)−dσ1,−dσ2, θ(s−1), dbs).
• Sample the DBS physical and processing model parameters θ(s) from the conditional distribution p(θ|σ(s), dbs).

Figure 6.8: Further sub-division within iteration s of the MCMC algorithm for DBS inversion.

A similar procedure could be applied to the components of θ. Indeed, if the parameters of the DBS physical and processing model reflect different (and uncorrelated) physical characteristics, such a procedure might actually be preferred to sampling θ in one go. Furthermore, in such cases the sub-division of the variables might result in simpler conditional distributions compared to the original conditional distribution p(θ|σ, dbs). In some cases, calculation of the component conditional probability distributions p(σi,j|σ−i,−j, θ, dbs) will be very complicated. Fortunately, an M-H step does not actually require calculation of any new conditional distributions, since:

p(σi,j|σ−i,−j, θ, dbs) ∝ p(σi,j, σ−i,−j|θ, dbs) ∝ p(σ|θ, dbs),    (6.29)

with the first line being a simple application of the definition of a conditional probability, and the second line following from the reformation of σ from its constituent elements. A similar result holds
for the components of θ (and indeed any random variable broken into its constituent elements). Unfortunately, the technique of splitting σ into its constituent elements leads to increased correlations between the variables in the Gibbs sampler. This is a potential problem because correlation between variables has a tendency to decrease the convergence rate of a Gibbs sampler [189].
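A single-site sweep exploiting the observation in (6.29) can be sketched as follows: each component's acceptance ratio is evaluated through the joint density alone, so no component conditional distribution is ever computed explicitly. The independent-Gaussian target is a toy assumption used purely for illustration.

```python
import numpy as np

def single_site_sweep(sigma, log_joint, step, rng):
    """One sweep of single-component random-walk M-H updates over the grid.
    As (6.29) notes, each acceptance ratio only requires the joint density
    p(sigma | theta, dbs), supplied here as log_joint(sigma)."""
    for i in range(sigma.shape[0]):
        for j in range(sigma.shape[1]):
            old = sigma[i, j]
            log_p_old = log_joint(sigma)
            sigma[i, j] = old + step * rng.standard_normal()
            if np.log(rng.uniform()) >= log_joint(sigma) - log_p_old:
                sigma[i, j] = old        # reject: restore the old value
    return sigma

# Toy joint density (assumed for illustration): independent unit Gaussians.
rng = np.random.default_rng(3)
log_joint = lambda s: -0.5 * np.sum(s ** 2)
sigma = np.zeros((3, 3))
draws = []
for _ in range(2000):
    sigma = single_site_sweep(sigma, log_joint, step=1.0, rng=rng)
    draws.append(sigma.copy())
```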
6.3.3 Classification using the ISAR training data

Section 6.2.7 describes how ISAR training data can be used to classify targets being imaged by a DBS sensor. In particular, expectations of the posterior class probabilities are provided by (6.18). Since analytical evaluation of the posterior distribution p(σ, θ|dbs) is possible only in exceptional circumstances, the expression in (6.18) will also be analytically available in limited circumstances only. This is due to its dependence on p(σ|dbs), which is given by:

p(σ|dbs) = ∫θ p(σ, θ|dbs) dθ.    (6.30)
The posterior class probabilities therefore need to be evaluated using the MCMC samples from p(σ, θ|dbs). In particular, we approximate the distribution p(σ|dbs) by:

p(σ|dbs) ≈ (1/N) Σ_{s=1}^{N} δ(σ − σ(s)),    (6.31)
where δ(x − y) indicates a point mass for x at the value y. To maintain generality we first describe the modification to the expression for the expectation of a function g(isar) of the ISAR image simulated from the DBS image. In particular, using (6.31), equation (6.16) becomes:

E[g(isar)|φ, dbs] ≈ ∫isar g(isar) ∫σ p(isar|σ, φ) (1/N) Σ_{s=1}^{N} δ(σ − σ(s)) dσ d isar
                  = (1/N) Σ_{s=1}^{N} ∫isar g(isar) p(isar|σ(s), φ) d isar.    (6.32)
Even after this first approximation it is unlikely that the required integration can be evaluated analytically. The expectation is therefore evaluated using probabilistic sampling techniques. Specifically, we draw M samples from each of the N distributions p(isar|σ(s), φ). These samples are represented by isar(s,m). Each distribution p(isar|σ(s), φ) is then approximated by:

p(isar|σ(s), φ) ≈ (1/M) Σ_{m=1}^{M} δ(isar − isar(s,m)).    (6.33)
Sampling from p(isar|σ (s) , φ) corresponds to generating ISAR images from an underlying RCS
σ(s) using a forward model for the ISAR image generation process (i.e. the physical and processing models for the ISAR sensor). Substituting (6.33) into (6.32) gives the following approximation for the expectation of the function g(isar):
E[g(isar)|φ, dbs] ≈ (1/N) Σ_{s=1}^{N} ∫isar g(isar) (1/M) Σ_{m=1}^{M} δ(isar − isar(s,m)) d isar
                  = (1/(N M)) Σ_{s=1}^{N} Σ_{m=1}^{M} g(isar(s,m)).    (6.34)
Equation (6.18) for the expectations of the posterior class probabilities then becomes:

E[p(C = j|isar, {isar})|φ, dbs] ≈ (1/(N M)) Σ_{s=1}^{N} Σ_{m=1}^{M} p(C = j|isar(s,m), {isar}).    (6.35)
The expressions in (6.35) provide an approximation to the posterior class probabilities for the DBS image, based on a classifier trained on ISAR data. The DBS image can then be assigned (classified) to the class that maximises the posterior class probability. We note, however, that the classification does not need to be based on the expected posterior class probabilities, nor does the ISAR classifier need to produce posterior class probabilities. If the only output of the ISAR classifier is the classification itself, then we would simply saturate the posterior class probabilities p(C = j|isar, {isar}) to 1 for the winning class and 0 for all other classes. The maximum value of (6.35) would then indicate the winning class for this classifier. If the ISAR classifier provides classification scores, then one approach is to normalise these scores to have the characteristics of a posterior class probability. The exact form of this normalisation would depend on the classifier used. However, the advantage of using proper posterior class probabilities in the ISAR classifier is that the resulting expectations in (6.35) can be manipulated consistently as probabilities using Bayes' theorem.
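The double average in (6.35) can be sketched as below. The scalar "RCS", the additive-noise forward model, and the two-class logistic classifier are all toy stand-ins (assumptions, not the thesis models); the structure of the computation — simulate M ISAR images per RCS sample, classify each, and average the posterior class probabilities — is what is being illustrated.

```python
import numpy as np

def dbs_posterior_class_probs(sigma_samples, simulate_isar, classify, M, rng):
    """Approximate (6.35): average the ISAR classifier's posterior class
    probabilities over M simulated ISAR images for each of the N RCS
    samples sigma^(s) drawn from p(sigma | dbs)."""
    probs = np.mean([classify(simulate_isar(s, rng))
                     for s in sigma_samples for _ in range(M)], axis=0)
    return probs, int(np.argmax(probs))

# Toy stand-ins: scalar "RCS", additive-noise forward model, and a
# two-class classifier whose logistic score is treated as P(C = 1).
rng = np.random.default_rng(4)
simulate_isar = lambda s, rng: s + 0.1 * rng.standard_normal()
def classify(isar):
    p1 = 1.0 / (1.0 + np.exp(-10.0 * isar))
    return np.array([1.0 - p1, p1])

sigma_samples = 1.0 + 0.05 * rng.standard_normal(50)   # N = 50 posterior draws
probs, winner = dbs_posterior_class_probs(sigma_samples, simulate_isar,
                                          classify, M=4, rng=rng)
```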
6.3.4 Extension to the Gibbs sampler algorithm

If M = 1, the sampling from p(isar|σ(s), φ) is equivalent to using Figure 6.5 and extending the MCMC algorithm to draw samples from p(σ, θ, isar|dbs, φ) rather than p(σ, θ|dbs). In the current work it is assumed that φ is a known variable, having been estimated separately using the ISAR training data. Under this extended model, samples from p(σ, θ, isar|dbs, φ) are obtained by writing:

p(σ, θ, isar|dbs, φ) = p(σ, θ|dbs, φ) p(isar|σ, θ, dbs, φ),    (6.36)

using the mathematical definition of a conditional probability. The marginal distribution p(σ, θ|dbs, φ)
is given in (6.11) and the conditional distribution can be simplified to:

p(isar|σ, θ, dbs, φ) = p(isar|σ, φ).    (6.37)

Each iteration of the Gibbs sampler routine of Section 6.3.2 can then be extended to also sample from p(isar|σ, φ), with the sampling from p(σ, θ|dbs, φ) occurring as before. However, nothing is gained by incorporating the sampling of p(isar|σ, φ) into each Gibbs sampler iteration, so in practice the samples are drawn outside of the Gibbs algorithm. Note also that the proposed procedure is not the same as adding the variable isar to the variables (σ, θ) to be sampled in a standard Gibbs sampler, since the distribution p(σ, θ|isar, dbs, φ) is not used.
6.3.5 Conversion of the ISAR data

Currently, it is proposed that a measured DBS image should be converted into the ISAR domain, whereupon a classifier trained using ISAR data can be applied. This takes advantage of the greater availability of ISAR data for training a classifier. Typically, the procedure requires an MCMC technique to invert each DBS image to an underlying RCS, which is then mapped onto the ISAR domain. This might turn out to be prohibitively computationally intensive for operational use, in which case the procedure could be reversed, and all the training ISAR data converted into synthetic DBS images. A classifier could then be trained using the model-generated DBS data and applied directly to operational DBS data as it arrives. This would move the computational cost from the operational domain to the training procedure.

The proposed framework readily handles this possibility, provided that we can determine the distribution p(isar|σ, φ) in the same manner as the distribution p(dbs|σ, θ). However, issues might arise with the DBS physical and processing model parameters θ, which would need to be defined each time an RCS is mapped onto the DBS domain. Under the current formulation θ is treated as a separate random variable for each DBS image, with the posterior distribution p(θ|dbs) being re-estimated each time. If this distribution differs by large amounts from DBS image to DBS image, it will be necessary to re-map the ISAR data onto the DBS domain each time. This would remove the computational advantage of generating model-based DBS images from the ISAR training data.

A third approach would involve converting both the measured ISAR and the measured DBS data to the RCS domain, and then classifying based on the underlying RCS (with the classifier on the RCS data being trained using the inverted ISAR training images). Such a procedure would have the advantage of producing a sensor-independent classifier able to make use of both DBS and ISAR training data.
However, the MCMC algorithm would still need to be applied to each test data image, so in operation there would be no computational advantage.
6.4 Simplified problem
6.4.1 Introduction

A simplified problem has been developed to illustrate the proposed procedure. An underlying representation of an object is denoted by the variable σ (the RCS of the previous sections). This object can be imaged by two separate sensors, with sensor one producing images x (the DBS images of the previous sections) and sensor two producing images y (the ISAR images of the previous sections). The measurement models for both sensors are assumed known, although some allowance can be made for unknown parameters defining the models. Both measurement models operate on the underlying representation σ. The initial requirement is to take a measurement x from sensor one, invert it to the underlying representation σ, and then convert it to have the form of a measurement y from sensor two. This is done within the described Bayesian framework. The model-generated images are then classified using a classifier trained with data from sensor two. To avoid the need to redefine the equations of Section 6.2 and Section 6.3, the images from sensor one are referred to as DBS images, and the images from sensor two are referred to as ISAR images, even though the sensor measurement models do not represent the processes leading to the formation of DBS and ISAR images.
6.4.2 Description of the data

A two-class problem is considered, with the targets defined on a d × d RCS grid, where d = 5. To define the distributions of the data, the rows of the RCS grid and all images are concatenated into single vectors (for the RCS grid the concatenated vector is of dimension d²). However, for ease of explanation the vectors and statistics of the distributions are displayed with the rows re-partitioned. The RCS values for the realisations of the targets are taken to be multivariate Gaussian, with the mean depending on the class. The mean values for the target RCS grids are:
μ0 =
0.0 0.0 0.0 0.0 0.0
0.0 1.0 1.0 1.0 0.0
0.0 1.0 0.0 1.0 0.0
0.0 1.0 1.0 1.0 0.0
0.0 0.0 0.0 0.0 0.0    (6.38)

and:

μ1 =
0.0 0.0 0.0 0.0 0.0
0.0 1.0 0.0 1.0 0.0
0.0 0.0 1.0 0.0 0.0
0.0 1.0 0.0 1.0 0.0
0.0 0.0 0.0 0.0 0.0    (6.39)

for classes 0 and 1 respectively.
then the conditional posterior distribution p(ψ1²|x, σ) introduced in (6.26) is given by:

p(ψ1²|x, σ, psf1) ∝ [ Π_{i=1}^{d} Π_{j=1}^{d} (2πψ1²)^(−1/2) exp( −(xi,j − psf(σ, psf1)i,j)² / (2ψ1²) ) ]
                  × (1/ψ1^{2(ν+1)}) exp( −V/ψ1² ),    (6.46)

which can be simplified to:

p(ψ1²|x, σ, psf1) ∝ exp( (−1/ψ1²) [ V + (1/2) Σ_{i=1}^{d} Σ_{j=1}^{d} (xi,j − psf(σ, psf1)i,j)² ] ) × 1/ψ1^{2(ν+d²/2+1)}.    (6.47)

By comparison with the probability density function for an inverse-gamma distribution, we see that:

ψ1²|(x, σ, psf1) ∼ 1/Γ(ν′, V′),    (6.48)

where:

ν′ = ν + d²/2,    (6.49)
V′ = V + (1/2) Σ_{i=1}^{d} Σ_{j=1}^{d} (xi,j − psf(σ, psf1)i,j)².    (6.50)
Thus, the conditional posterior distribution for ψ1² can be sampled from directly, rather than using a Metropolis-Hastings step (see Section 6.3.2). Furthermore, provided that the sensor measurement noise is additive Gaussian, and that the prior distribution for the variance of the noise is inverse-gamma, this direct sampling remains possible even if the point spread function sensor measurement model is replaced by a more complicated sensor model. When θ consists of variables θ′ in addition to ψ1², the procedure outlined in Section 6.3.2 for sub-dividing the components of θ would be applied, with the parameters split into ψ1² and θ′. The Gibbs sampler would then sample from p(ψ1²|σ, dbs, θ′) followed by p(θ′|σ, dbs, ψ1²). Sampling from p(ψ1²|σ, dbs, θ′) would be direct from the inverse-gamma distribution, while sampling from p(θ′|σ, dbs, ψ1²) would probably require a Metropolis-Hastings step. The advantage of this sub-division of θ would come from reducing the dimensionality of the variables to be sampled using Metropolis-Hastings steps.
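The direct inverse-gamma draw of (6.48)-(6.50) can be sketched as follows. The synthetic image, its noise level, and the prior hyperparameters ν and V are illustrative assumptions; the update itself follows the posterior shape and scale derived above, using the standard fact that the reciprocal of a gamma variate is inverse-gamma distributed.

```python
import numpy as np

def sample_noise_variance(x, mean_image, nu, V, rng):
    """Direct Gibbs draw of the DBS noise variance from its inverse-gamma
    conditional (6.48), with shape nu' = nu + d^2/2 (6.49) and scale
    V' = V + 0.5 * (sum of squared residuals) (6.50)."""
    nu_post = nu + x.size / 2.0
    V_post = V + 0.5 * np.sum((x - mean_image) ** 2)
    # If G ~ Gamma(shape=nu', rate=V'), then 1/G ~ 1/Gamma(nu', V').
    return 1.0 / rng.gamma(nu_post, 1.0 / V_post)

# Sanity check on synthetic data with true noise variance 0.2^2 = 0.04.
rng = np.random.default_rng(5)
d = 20
mean_image = np.zeros((d, d))            # stands in for psf(sigma, psf_1)
x = mean_image + 0.2 * rng.standard_normal((d, d))
draws = np.array([sample_noise_variance(x, mean_image, nu=1.0, V=0.1, rng=rng)
                  for _ in range(4000)])
```

With d² = 400 residuals the posterior concentrates tightly, so the draws should cluster near the true variance 0.04.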
6.4.6 Proposal distributions

Since the parameters θ are treated as fixed, the only proposal distribution required in the MCMC algorithm is for σ. A single random walk distribution is used. In particular, the random walk distribution is a product of independent Gaussians with zero mean and common variance ψσrw².
The variance is specified so that the percentage of accepted moves is reasonable (i.e. not too large and not too small). For the experiments presented in Section 6.5 roughly 55% of the proposed moves were accepted. In practice, the random walk variance can be set by monitoring the acceptance rate for proposed moves during a subset of the initial burn-in period. If the acceptance rate falls too low the variance is reduced, while if it rises too high the variance is increased. Re-setting the random walk variance during the burn-in period (and indeed after the burn-in period) is equivalent to starting a new Markov chain initialised with the last sample from the previous chain. Thus, provided that the random walk variance is not continually altered throughout the algorithm, the samples from the Metropolis-within-Gibbs sampler algorithm will be distributed as if they were drawn directly from the posterior distribution.
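The tuning loop described above can be sketched as follows. The target acceptance band and the adjustment factor are assumptions for illustration (the thesis only reports that roughly 55% of moves were accepted), and the monotone toy model of the acceptance rate stands in for monitoring actual M-H moves during a stretch of burn-in.

```python
def tune_rw_scale(acc_rate, scale, low=0.2, high=0.6):
    """Adjust the random-walk step size from the acceptance rate monitored
    during burn-in: shrink the steps when too few moves are accepted,
    enlarge them when too many are. The band [low, high] and the factor
    0.7 are assumed values, not from the thesis."""
    if acc_rate < low:
        return scale * 0.7      # smaller steps raise the acceptance rate
    if acc_rate > high:
        return scale / 0.7      # larger steps lower the acceptance rate
    return scale

scale = 5.0
history = [scale]
for _ in range(20):
    acc_rate = 1.0 / (1.0 + scale)   # crude monotone stand-in for monitoring
    scale = tune_rw_scale(acc_rate, scale)
    history.append(scale)
```

Once the rate enters the band the scale is left alone, consistent with the requirement above that the variance not be continually altered throughout the run.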
6.4.7 Analytic sampling

Conditional RCS distribution

For this simplified example the distribution p(σ|x, θ) can be expressed analytically as a multivariate Gaussian distribution. Thus, the sampling can be done directly, rather than using a Metropolis-Hastings step. This is due to the form of the prior distribution for σ combined with the sensor measurement model, and would not be the case for sensor physical models that (prior to the addition of Gaussian noise) perturb the RCS in a non-linear manner. After concatenating the rows of the DBS image into a vector, the measurement model:

x = psf(σ, psf1) + ε,    (6.51)

where ε is the additive Gaussian noise, can be expressed as:

x(c) = P1 σ(c) + ε(c),    (6.52)

where the superscript (c) indicates the concatenated rows of the images, and P1 is a d² × d² matrix replicating the effect of the point spread function. For our example, d = 5, and the matrix P1 can be represented in block form by:
P1 =
A1    B1    05,5  05,5  05,5
B1    A1    B1    05,5  05,5
05,5  B1    A1    B1    05,5
05,5  05,5  B1    A1    B1
05,5  05,5  05,5  B1    A1    (6.53)
where 05,5 is a 5 × 5 matrix of zeros, the matrix A1 is given by:

A1 =
1.0 0.3 0.0 0.0 0.0
0.3 1.0 0.3 0.0 0.0
0.0 0.3 1.0 0.3 0.0
0.0 0.0 0.3 1.0 0.3
0.0 0.0 0.0 0.3 1.0    (6.54)

and the matrix B1 is given by:

B1 =
0.3 0.3 0.0 0.0 0.0
0.3 0.3 0.3 0.0 0.0
0.0 0.3 0.3 0.3 0.0
0.0 0.0 0.3 0.3 0.3
0.0 0.0 0.0 0.3 0.3    (6.55)
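The structure of (6.53)-(6.55) can be generated programmatically. The sketch below builds the convolution matrix for a row-concatenated image from a 3 × 3 point spread function kernel with centre weight 1.0 and neighbour weight 0.3 (the kernel implied by the tridiagonal structure of A1 and B1, an inference rather than a value stated in the text), with zero padding at the image edges.

```python
import numpy as np

def psf_matrix(d, kernel):
    """Build the d^2 x d^2 matrix that applies a (2k+1) x (2k+1) point
    spread function to a row-concatenated d x d image, with zero padding
    at the image edges, as in (6.53)."""
    k = kernel.shape[0] // 2
    P = np.zeros((d * d, d * d))
    for i in range(d):
        for j in range(d):
            for di in range(-k, k + 1):
                for dj in range(-k, k + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < d and 0 <= jj < d:
                        P[i * d + j, ii * d + jj] = kernel[di + k, dj + k]
    return P

# 3x3 kernel implied by the tridiagonal structure of A1 and B1 above.
kernel = np.array([[0.3, 0.3, 0.3],
                   [0.3, 1.0, 0.3],
                   [0.3, 0.3, 0.3]])
P1 = psf_matrix(5, kernel)
A1 = P1[:5, :5]       # diagonal block of (6.53)
B1 = P1[:5, 5:10]     # off-diagonal block of (6.53)
```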
The distribution for the additive Gaussian measurement noise ε (Section 6.4.3) can be represented by:

ε(c) ∼ N(0, Σψ1),    (6.56)

where Σψ1 = ψ1² Id²,d² for our simplified problem, with Id²,d² the d² × d² identity matrix. Thus, using (6.52), the likelihood distribution is given by:

p(x|θ, σ) = N(x(c); P1σ(c), Σψ1).    (6.57)
We can represent the prior distribution for σ (Section 6.4.4) by:

σ(c) ∼ N(κσ(c), Σς),    (6.58)

where Σς = ςσ² Id²,d² for our simplified problem. The conditional posterior distribution p(σ|θ, x) given in (6.24) can then be written:
p(σ|θ, x) ∝ (1 / ((2π)^(d²/2) |Σψ1|^(1/2))) exp( −(x(c) − P1σ(c))ᵀ Σψ1⁻¹ (x(c) − P1σ(c)) / 2 )
          × (1 / ((2π)^(d²/2) |Σς|^(1/2))) exp( −(σ(c) − κσ(c))ᵀ Σς⁻¹ (σ(c) − κσ(c)) / 2 ).    (6.59)
By completing the square for σ(c) in the exponent, the distribution in (6.59) can be simplified to the Gaussian distribution:

σ|(θ, x) ∼ N(κ′, Σ′),    (6.60)

where:

Σ′ = (P1ᵀ Σψ1⁻¹ P1 + Σς⁻¹)⁻¹,    (6.61)

and:

κ′ = Σ′ (P1ᵀ Σψ1⁻¹ x(c) + Σς⁻¹ κσ(c)).    (6.62)
Using (6.60), (6.61) and (6.62) we can sample directly from the conditional distribution p(σ|θ, x). Indeed, under the assumption that θ is known, this distribution defines the RCS posterior distribution completely. Thus, an MCMC approach is not needed for the simplified problem.
Distribution for the ISAR data

If θ is known, the distribution p(isar|φ, dbs) can be evaluated analytically for this simplified example. Thus, rather than passing RCS posterior samples through the ISAR sensor measurement model, as in Section 6.3.3, the sampling can be done exactly. Similarly to the DBS data, the rows of the ISAR image are concatenated into a vector y(c). The sensor measurement model can then be expressed as:

y(c) = P2 σ(c) + ε2(c),    (6.63)

where P2 is a d² × d² matrix replicating the effect of the point spread function (defined analogously to (6.53), (6.54) and (6.55)), and:

ε2(c) ∼ N(0, Σψ2),    (6.64)

with:

Σψ2 = ψ2² Id²,d².    (6.65)
Defining:

Σ̃ = (P2ᵀ Σψ2⁻¹ P2 + Σ′⁻¹)⁻¹,    (6.66)
Σ′′ = (Σψ2⁻¹ − Σψ2⁻¹ P2 Σ̃ P2ᵀ Σψ2⁻¹)⁻¹,    (6.67)
κ′′ = Σ′′ Σψ2⁻¹ P2 Σ̃ Σ′⁻¹ κ′,    (6.68)

with Σ′ and κ′ defined in (6.61) and (6.62) respectively, the required conditional probability distribution can be expressed as a multivariate Gaussian distribution:

y(c)|(φ, x(c), θ) ∼ N(κ′′, Σ′′).    (6.69)
The derivation is provided in Appendix H.2.1. Samples drawn directly from (6.69) can then be used in the manner described in Section 6.3.3. Specifically, the samples can be used to calculate the expectations of the posterior class probabilities (given in (6.35)). However, it should be stressed that such an analytical evaluation is unlikely to be possible for more complicated sensor measurement models.
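Since y(c) = P2σ(c) + ε2(c) with σ(c)|x(c) Gaussian, the predictive moments must reduce, by the matrix inversion lemma, to mean P2κ′ and covariance Σψ2 + P2Σ′P2ᵀ. The sketch below evaluates (6.66)-(6.68) for small illustrative matrices and confirms this reduction numerically; P2, the noise level, and the posterior moments standing in for (6.60)-(6.62) are all assumptions, not thesis values.

```python
import numpy as np

rng = np.random.default_rng(7)
d2 = 9                                   # a 3x3 grid stand-in
# Assumed illustrative matrices: P2 tridiagonal, Sigma_psi2 = psi_2^2 I,
# and a posterior N(kappa_p, Sig_p) for sigma standing in for (6.60)-(6.62).
P2 = np.eye(d2) + 0.3 * (np.eye(d2, k=1) + np.eye(d2, k=-1))
Sigma_psi2 = 0.1 ** 2 * np.eye(d2)
Sig_p = 0.2 * np.eye(d2)
kappa_p = rng.standard_normal(d2)

# (6.66)-(6.68): the covariance and mean of y(c) | x(c).
iS2 = np.linalg.inv(Sigma_psi2)
Sig_t = np.linalg.inv(P2.T @ iS2 @ P2 + np.linalg.inv(Sig_p))          # (6.66)
Sig_pp = np.linalg.inv(iS2 - iS2 @ P2 @ Sig_t @ P2.T @ iS2)            # (6.67)
kappa_pp = Sig_pp @ iS2 @ P2 @ Sig_t @ np.linalg.inv(Sig_p) @ kappa_p  # (6.68)
```

Checking `Sig_pp` against Σψ2 + P2 Sig_p P2ᵀ and `kappa_pp` against P2 kappa_p provides a useful numerical self-test of the appendix derivation.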
6.5 Results
6.5.1 Introduction

This section presents some results for the simplified problem introduced in Section 6.4. Specifically, both the MCMC sampling and the direct sampling approaches are assessed for the described example. Performance comparisons are made with a baseline approach and with idealised classifiers.
6.5.2 Details of the experiments

Experiments have been conducted for a range of variances ψσ² for the multivariate Gaussian RCS realisations (see Section 6.4.2). In particular, results are presented for the values given in Table 6.1.

Parameter | Experiment 1 | Experiment 2 | Experiment 3
ψσ  | 0.1  | 0.5  | 1.0
ψ1  | 0.1  | 0.1  | 0.1
ψ2  | 0.1  | 0.1  | 0.1
ntr | 100  | 100  | 100
nte | 1000 | 1000 | 1000
Table 6.1: Experimental parameters.

The sensor noise standard deviations, ψ1 and ψ2, given in Table 6.1, are assumed known in our experiments. For the third experiment, we note that the standard deviation ψσ = 1.0 is a factor of two larger than the standard deviation ςσ = 0.5 on the prior distribution for σ (see Section 6.4.4). The actual RCS realisations are therefore likely to lie in the tails of the prior distribution. Thus, the prior distribution is likely to be poor at constraining the posterior distribution to have high probability at the actual RCS. This might lead to a degradation in algorithm performance, since it is likely that each observed DBS image can be obtained from many different underlying RCS grids.

For each experiment, ntr training data RCS realisations from each class have been generated. Using the sensor two measurement model, these realisations have been used to create ntr images from sensor two for each class. In operational use, only the sensor two training data would be available, as opposed to having the underlying RCS realisations. These training data images have been used to train a multivariate Gaussian classifier (using full covariance matrices) for the sensor two data. More complicated classifiers could be used for real data, such as those proposed in Chapters 3, 4 and 5. The Gaussian classifier has been applied in a standard manner, with classifications made to the class maximising the posterior class probability under the assumption of equal class priors.

In operational use the classifier would have to cope with situations where the target is actually from an unknown class. Ideally, in such situations the ATR algorithm should reject the target without making a classification [129]. In some cases, a simple threshold on the maximum posterior probability could be used to reject a target, i.e.
declaring the target to belong to the class maximising the posterior probability only if that posterior probability is above a pre-specified threshold. However, since the
class probabilities are normalised to sum to one, even if the new target is an outlier, the posterior class probabilities might strongly favour one class. A second option is to reject the target if the value for each (Gaussian) class likelihood distribution falls below a threshold. The value of the threshold would depend on the required trade-off between false rejection (i.e. rejection of a target that would have been correctly classified) and classification of spurious objects (i.e. classifying an object which doesn’t belong to any of the classes). When used after the Bayesian inversion procedure, this classifier reject option could identify the sensor one measurements for which the inversion procedure has failed to generate an appropriate sensor two image. Such failure could occur if the MCMC algorithm has not converged and also if the target observation angles for the sensor one measurement are outside the range covered by the sensor two training data. Note, however, that the latter case would be a problem with the sensor two training data rather than the Bayesian inversion procedure. The test data for each experiment is obtained by generating a further nte RCS realisations from each class. These test data realisations have then been imaged using the sensor one measurement model, creating nte sensor one test data images for each class. Again, it is important to note that in operational use only the sensor one test data would be available, rather than the RCS realisations. These test data images have then been assigned to classes using the proposed Bayesian inverse model based approach. Both the MCMC and the analytic sampling approaches from Sections 6.2, 6.3 and 6.4 have been used. Thus there are two sets of Bayesian inverse model results for each experiment, referred to as the MCMC and direct sampling approaches. To assess the performance of the Bayesian inverse model based approaches, four additional sets of classification results have been obtained. 
All are based on multivariate Gaussian classifiers using full covariance matrices. The first applies a classifier for the sensor two (ISAR) data directly to the sensor one (DBS) data (i.e. a classifier trained using ISAR data but tested on DBS data). The results from this baseline classifier indicate the performance that would be obtained were the change in sensor between test and training conditions to be ignored by the ATR algorithm. None of the remaining three classifiers are expected to be available for a real system, so their classification results do not represent a baseline performance. The classifiers are: C1) A classifier applied directly to the sensor one (DBS) data. The training data for the sensor one domain classifier was obtained by converting the RCS training data to sensor one measurements using the sensor one measurement model. Thus, this classifier is different to that suggested in Section 6.3.5, which proposed use of the Bayesian inversion procedure to convert training ISAR data to the underlying RCS representation, before mapping onto the DBS domain (where it could be used to train a classifier to operate directly on DBS data). C2) A classifier applied directly to the sensor two (ISAR) data. The test data for the sensor two domain classifier was obtained by converting the RCS test data to sensor two measurements using the sensor two measurement model. This differs from the Bayesian procedure, in that the actual RCS grids are converted to sensor two data (in the Bayesian inverse model based algorithm it is the inverted DBS images that are converted to the ISAR domain). C3) A classifier applied directly to the RCS data. The training and test data for the RCS domain classifier was the original RCS data, rather than inverted sensor one or sensor two measurements (thus, this classifier is different to the RCS domain classifier suggested in Section 6.3.5).
161
CHAPTER 6. GENERALISATION BETWEEN SENSORS
6.5. RESULTS
The results from classifier C1 indicate the performance that would be obtained were DBS training data to be available (it is expected that there will not be enough DBS data to train a classifier for operational scenarios, so this option wouldn’t be available in practice). The results from classifier C2 indicate the performance that would be obtained if the seeker weapon could use ISAR sensors (due to the nature of ISAR sensors, requiring target motion, this is unlikely in practice). Finally, the results from classifier C3 indicate the performance that would be obtained in the idealised scenario, where the measurement processes do not distort the underlying RCS in any manner.
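As an illustration of the kind of classifier and reject option described above, the following sketch implements a full-covariance Gaussian classifier with a likelihood threshold. The class structure and names are illustrative assumptions, not the thesis code.

```python
import numpy as np

class GaussianClassifier:
    """Multivariate Gaussian classifier with full covariance matrices and a
    likelihood-based reject option. Illustrative sketch only."""

    def fit(self, X_by_class):
        # X_by_class: list of (n_j, d) arrays, one per class
        self.means = [X.mean(axis=0) for X in X_by_class]
        self.covs = [np.cov(X, rowvar=False) for X in X_by_class]
        return self

    def log_likelihoods(self, x):
        out = []
        for m, S in zip(self.means, self.covs):
            d = len(m)
            diff = x - m
            _, logdet = np.linalg.slogdet(S)
            out.append(-0.5 * (d * np.log(2 * np.pi) + logdet
                               + diff @ np.linalg.solve(S, diff)))
        return np.array(out)

    def classify(self, x, log_reject_threshold=None):
        ll = self.log_likelihoods(x)
        # Reject if every class likelihood falls below the threshold
        if log_reject_threshold is not None and np.all(ll < log_reject_threshold):
            return None  # reject: object assigned to no class
        return int(np.argmax(ll))  # equal class priors assumed
```

With equal class priors, maximising the class likelihood is equivalent to maximising the posterior class probability; the threshold on the likelihoods (rather than the normalised posteriors) allows outliers to be rejected even when the posteriors strongly favour one class.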
6.5.3 Algorithm parameters

In all documented experiments the MCMC algorithm has been run using the parameters detailed in Table 6.2. The RCS samples are initialised to the measured DBS image. The direct sampling approaches drew 500 samples from the posterior distribution p(isar|dbs, φ) described in (6.69).

Parameter | Description | Value
N    | No. of MCMC samples                          | 500
Nb   | MCMC burn-in period                          | 25000
g    | MCMC decorrelation gap                       | 50
M    | No. of ISAR samples per RCS sample           | 1
ψσrw | Standard deviation of the random walk for σ  | 0.02
Table 6.2: Parameters used in the MCMC algorithm for generalising target classification between sensors.
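The roles of the parameters in Table 6.2 (burn-in period Nb, decorrelation gap g, random-walk standard deviation) can be illustrated with a generic random-walk Metropolis sampler. This is a hedged sketch for an arbitrary log posterior log_post, not the exact thesis algorithm.

```python
import numpy as np

def random_walk_metropolis(log_post, sigma0, N=500, Nb=25000, g=50,
                           psi_rw=0.02, rng=None):
    """Collect N decorrelated samples: discard a burn-in of Nb iterations,
    then keep every g-th state. psi_rw is the random-walk proposal
    standard deviation (cf. Table 6.2)."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = np.asarray(sigma0, dtype=float)
    lp = log_post(sigma)
    samples = []
    total_iters = Nb + N * g
    for it in range(total_iters):
        prop = sigma + psi_rw * rng.standard_normal(sigma.shape)
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:   # Metropolis accept step
            sigma, lp = prop, lp_prop
        if it >= Nb and (it - Nb) % g == g - 1:    # thin by the decorrelation gap
            samples.append(sigma.copy())
    return np.array(samples)
```

A long burn-in discards samples taken before the chain has reached its stationary distribution, while the decorrelation gap reduces the correlation between successive retained samples.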
6.5.4 Classification results

Figure 6.13, Table 6.3 and Table 6.4 display the classification rates obtained for the Bayesian inverse model based approaches, the baseline classifier and the three idealised classifiers. In each case the results have been separated by class and RCS standard deviation ψσ. The results are summarised in Figure 6.14, which displays the averaged classification rates.

Classifier      | ψσ = 0.1 | ψσ = 0.5 | ψσ = 1.0
MCMC sampling   | 100.0 | 95.3 | 75.5
Direct sampling | 100.0 | 95.8 | 75.6
Baseline        | 100.0 | 99.1 | 66.6
Classifier C1   | 100.0 | 97.8 | 71.2
Classifier C2   | 100.0 | 99.3 | 72.7
Classifier C3   | 100.0 | 97.8 | 73.4

Table 6.3: Classification rates for class 0 (RCS s.d. ψσ).

The poorer performance of the baseline classifier relative to the other classifiers (including the Bayesian inverse model based approaches) indicates that, if ignored, the change in sensor between test and training conditions does (as would be expected) reduce the classifier performance. In contrast, both the MCMC and direct sampling Bayesian inverse model based approaches can be seen to provide a mechanism to maintain classifier performance. The results for the MCMC sampling approach are very similar to those for the direct sampling approach, indicating that the MCMC
[Figure 6.13: Classification rates separated by class and RCS standard deviation, {j, ψσ} (y-axis: classification rate). From left to right, within each cluster of bars, the results are for the MCMC approach, the direct sampling approach, the baseline classifier, and classifiers C1, C2 and C3 respectively.]

Classifier      | ψσ = 0.1 | ψσ = 0.5 | ψσ = 1.0
MCMC sampling   | 100.0 | 96.9 | 77.5
Direct sampling | 100.0 | 97.2 | 77.5
Baseline        |  74.9 | 77.8 | 68.2
Classifier C1   | 100.0 | 93.1 | 74.8
Classifier C2   | 100.0 | 95.1 | 73.4
Classifier C3   | 100.0 | 95.5 | 74.1

Table 6.4: Classification rates for class 1 (RCS s.d. ψσ).

algorithm is operating efficiently. Comparing the results for classifier C3 (the classifier using the actual RCS data) with those for classifiers C1 and C2 shows that the measurement processes do not have much of an adverse effect on the overall classification results. For ψσ = 0.1 the two classes are well separated, with the result that all but the baseline classifier classify all the test data correctly. For ψσ = 0.5, the classes are still quite well separated and there is little to choose between the Bayesian inverse model based classifiers and the three idealised classifiers. Specifically, the Bayesian inverse model based approach gives slightly worse performance than the three idealised classifiers for class 0, but better performance for class 1. Although the performance of the baseline classifier for class 0 is higher than that of the other classifiers when ψσ = 0.5, this is at the expense of much reduced performance for class 1. The classes are less separated when ψσ = 1.0, which is reflected by the lower classification rates for both the Bayesian inverse model based classifiers and the three idealised classifiers. For both classes, when ψσ = 1.0 the Bayesian inverse model based approach gives slightly better performance than the three idealised classifiers. The most likely reason for this surprising result is that the distortion in the model generated sensor two images arising from the Bayesian inversion procedure happens to separate the two classes. Although leading to better performance for this two class
problem, if such distortion is occurring it is likely to lead to reduced performance as the number of classes increases. A possible cause for any inversion distortion is (as noted in Section 6.5.2) that the RCS realisations for ψσ = 1.0 will lie in the tails of the prior distribution p(σ), which will therefore be poor at constraining the posterior distribution. Thus the inversion performance for ψσ = 1.0 is likely to be sub-optimal. However, the difference in classification rates is not large and, for the simplified problem at least, the Bayesian inverse model based approach appears to enable a classifier designed using data from one sensor to be applied to data gathered from a second sensor.

[Figure 6.14: Average classification rates for each RCS standard deviation ψσ (y-axis: classification rate). From left to right, within each cluster of bars, the results are for the MCMC approach, the direct sampling approach, the baseline classifier, and classifiers C1, C2 and C3 respectively.]
6.5.5 Examination of the results for class 0 in the first experiment

This section takes a closer look at a specific set of results for class 0 in the first experiment of Table 6.1. In particular, the MCMC samples from the posterior distribution for the RCS, given a specific sensor one image from class 0, are examined. Figure 6.15 displays an actual RCS grid from class 0 alongside a corresponding image from sensor one. The first and last MCMC RCS samples from the resulting posterior distribution (after the burn-in period of Table 6.2) are displayed in Figure 6.16. To allow comparisons, the relationship between the values and the intensities in the RCS plots is the same in both figures. It can be seen that the MCMC RCS samples have picked up the general shape of the underlying RCS, although for both samples there are differences. The mean value of the MCMC RCS samples is displayed on the left-hand side of Figure 6.17. This mean of the MCMC RCS samples appears to recreate the underlying target shape well. Also plotted in Figure 6.17 is the ‘closest’ MCMC RCS sample to the actual RCS grid (with ‘closest’ defined as the minimum Euclidean distance between the concatenated rows of the RCS grids). This ‘closest’ sample compares well with the actual RCS grid. Since the analytically calculated posterior distribution for the RCS (see (6.60), (6.61) and (6.62)) is a multivariate Gaussian distribution, the maximum a posteriori (MAP) value is the same as the mean.
Parameter | Description | Value
N    | No. of MCMC samples                          | 2000
Nb   | MCMC burn-in period                          | 0
g    | MCMC decorrelation gap                       | 25
M    | No. of ISAR samples per RCS sample           | 1
ψσrw | Standard deviation of the random walk for σ  | 0.02

Table 6.5: Parameters used to investigate the convergence of the MCMC algorithm.
[Figure 6.23: Separated pixels of the MCMC RCS samples from the posterior distribution based upon a sensor one image from class 0 (generated from an RCS sampled with standard deviation 0.1). The red dotted lines correspond to the values for the actual RCS. Only the pixels from an inner 3 × 3 grid are displayed.]

6.6 Summary

This chapter has developed a Bayesian inverse model based algorithm that enables a classifier designed using data gathered from one sensor to be applied to data gathered from a different sensor. The approach requires physical and processing models for the sensors to be known. A Bayesian approach has been adopted that uses the sensor measurement models to invert the sensor data to generic representations. Ultimately, in this manner, classifiers designed using ISAR images of targets on a turntable will be able to be applied to radar seeker data. This is advantageous, since it is easier to collect ISAR images of targets under a variety of extended operating conditions than it is for radar seekers.

A theoretical framework for the sensor inversion procedure has been proposed and has been illustrated for a simplified problem. The results for this simplified problem are encouraging, although much work still needs to be done before the approach could be used in a more realistic scenario. In particular, accurate sensor measurement models need to be developed and used. Furthermore, the ability of the approach to scale to larger dimensions needs to be investigated. Ultimately the approach should be assessed on real ISAR and DBS data. To allow for complicated sensor measurement models, the proposed Bayesian approach uses MCMC sampling techniques. Results based on these MCMC sampling techniques were found to be in agreement with analytical approaches that were available for the simplified problem. This bodes well for the use of the approach in real situations.
[Figure 6.24: Separated pixels of the MCMC RCS samples from the posterior distribution based upon a sensor one image from class 1 (generated from an RCS sampled with standard deviation 0.1). The red dotted lines correspond to the values for the actual RCS. Only the pixels from an inner 3 × 3 grid are displayed.]
[Figure 6.25: Demonstration of correlation between grid positions for the MCMC RCS samples from the posterior distribution based upon a sensor one image from class 0. The sample points were obtained by plotting the MCMC RCS samples from grid position (i, j) = (2, 3) as the first coordinate, and from (i, j) = (3, 2) as the second coordinate. The red line is the path obtained from joining the first 1000 samples (corresponding to a burn-in period), while the blue line is the path from the last 1000 samples.]
[Figure 6.26: Euclidean distances between the MCMC RCS samples and the actual RCS grid (generated with standard deviation 0.1). Left: an example from class 0; right: an example from class 1.]
CHAPTER 7

BAYESIAN NETWORKS FOR INCORPORATION OF CONTEXTUAL INFORMATION IN TARGET RECOGNITION SYSTEMS

7.1 Introduction
7.1.1 General

Previous chapters in this thesis have concentrated solely on classifying isolated targets, based upon single sensor measurements. This chapter proposes an approach that takes into account the range of contextual information and domain specific knowledge that will be available. The focus is on a land-based scenario, although the approach could be adapted to naval and air scenarios. A probabilistic approach is adopted that attempts to recognise multiple military targets, given measurements on the targets, knowledge of the likely groups of targets and measurements on the terrain in which the targets lie. This allows us to take into account such factors as the clustering of targets, the preference for hiding next to cover at the extremities of fields and the ability to traverse different types of terrain. Bayesian networks are used to formulate the uncertain causal relationships underlying such a scheme.
7.1.2 Details

In a realistic land-based ATR scenario, after an initial detection phase, there will be a set of multiple locations which have been identified as possibly occupied by targets/vehicles. Appropriate measurements (e.g. radar, sonar, infra-red) will then be taken at each of these locations, so that classifications can be made. Some of the measurements might be from real targets, while others will be false alarms from clutter objects, or just background noise. Most standard ATR techniques [189] consider each potential target independently, and thus fail to take into account any contextual information and domain specific knowledge that could be brought to the problem. This chapter examines how, in a military scenario, the posterior probabilities of class membership at each location can be combined with additional domain specific knowledge. This reflects the fact that a human assigning classes to measurements would take into account contextual information as well as the data measurements themselves. The use of this sort of additional contextual information by a human operator might be stronger than just having a closer look at the data measurements in
certain locations; it may tip the balance towards (or away from) certain classes. Thus, two nearly identical measurements may actually be assigned to different classes, depending on their respective contextual information. The type of contextual information that can be incorporated could include the proximity of other vehicles, recognising that military targets will often travel in groups. A human operator might also pay more attention to the extremities of fields close to hedges and woodland edges, reflecting the fact that military commanders would consider their vehicles exposed in the centre of a field and might choose to get them as close to cover as possible. Further domain specific knowledge that could be brought to the problem includes the type of terrain that surrounds a potential target. For example, some vehicles, such as main battle tanks (MBTs), are able to traverse difficult ground, while other vehicles have to detour around it. There might also be some knowledge about the likely deployment of military vehicles. For instance, based on an Intelligence Preparation of the Battlefield (IPB) estimate of likely enemy activity, it might be known that an air defence unit (ADU) will always be accompanied by an armoured personnel carrier (APC). Furthermore, vehicles might travel, or assemble, in formation. This is particularly the case for air defence sites without covert communications, whose vehicles may be constrained in their deployment by the need to have hard-wired connections between the command vehicle and the individual ADUs. For the extension of the proposed approach to naval applications, the formations of targets might convey a lot of target identity information (e.g. naval escorts surrounding an aircraft carrier). For applications to the aviation domain, possible destinations of an aircraft might provide contextual information that is useful in determining the class of the aircraft.
Examples include bombers heading towards high value targets, or refuelling aircraft heading towards a known rendezvous point. If additional time tracks of targets are available, then information on the weights and maximum speeds of vehicles could be useful. The weight of a vehicle will determine the type of bridges it can cross, indicating that a certain vehicle would be unlikely to have followed a specific route. If an object has travelled from one location to another faster than the top speed of a potential class of vehicle, then it might be possible to exclude that class from consideration. However, it will not always be desirable to treat such exclusions as definitive. For instance, it may be that the slower vehicle was actually being transported by another vehicle. Similarly, we would not want to exclude a theoretically fast class of target from consideration just because the object has been travelling slowly, since fast vehicles do not always manoeuvre at top speed and may travel slowly to keep pace with accompanying vehicles. Devising hard rules based upon the contextual information is not appropriate, since there are almost always going to be uncertainties. For example, our estimates of the surrounding terrain might be in error. The most appropriate formalism for handling the possibly conflicting pieces of information in a consistent manner is probabilistic. Thus, conventional expert systems [110] are not appropriate. However, a Bayesian network [81, 83], based on the causal relationships leading to a deployment of targets within a region, can be used in a probabilistic way to integrate domain specific knowledge with the actual data measurements.
7.1.3 Bayesian networks

Bayesian networks [81, 83] can be used to model situations in which causality plays a role, but where there is a need to describe things probabilistically, since our understanding of the processes occurring is incomplete or, indeed, the processes themselves have a random element. Bayesian networks are frequently referred to by other names, including belief networks, Bayesian belief networks, Bayesian graphical models and probabilistic expert systems. A Bayesian network can be illustrated graphically as a series of nodes, which may assume particular states, connected by directional arrows. Figure 7.1 shows such a network. The states of a node can be discrete or continuous. The set of nodes originating the arrows that point to any given node is said to be the “parent set” of the node. The set of nodes terminating the arrows leading from any given node is said to be the “daughter set” of the node. Every node has an associated conditional probability table or density, specifying the probability or probability density of the node being in a certain state given the states of the parent nodes. Nodes with no parents are assigned prior distributions over their states. Given the directed graph, all the information which is known about the causality and behaviour of the system is contained in the prior distributions (in the case of parentless nodes), conditional probability tables (in the case of nodes with discrete states) and conditional densities (in the case of nodes with continuous states). Given observations on some of the nodes, beliefs are propagated through the network to obtain posterior probabilities for the unobserved nodes. Having introduced the terminology of Bayesian networks, it is important to note that they are simply a way of representing multivariate probability distributions, with both discrete and continuous parts.
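As a toy illustration of belief propagation via Bayes' rule, consider a hypothetical two-node network: a parentless "class" node with a "measurement" daughter. The node names, states and probabilities below are invented for illustration; this is not the network of Figure 7.1.

```python
# Toy discrete Bayesian network: Class -> Measurement.
# The parentless node gets a prior distribution; its daughter gets a
# conditional probability table (CPT). Observing the daughter yields a
# posterior over the parent's states.

prior = {"target": 0.3, "clutter": 0.7}          # p(Class)
cpt = {                                          # p(Measurement | Class)
    "target": {"strong": 0.8, "weak": 0.2},
    "clutter": {"strong": 0.1, "weak": 0.9},
}

def posterior(observed_measurement):
    joint = {c: prior[c] * cpt[c][observed_measurement] for c in prior}
    z = sum(joint.values())                      # normalising constant
    return {c: joint[c] / z for c in joint}

post = posterior("strong")
# p(target | strong) = 0.3*0.8 / (0.3*0.8 + 0.7*0.1) = 0.24/0.31 ≈ 0.774
```

Exact inference in larger networks generalises this enumeration over the joint distribution; efficient propagation schemes avoid enumerating every state combination.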
7.1.4 Problem definition

This chapter focuses on the situation where (after an initial target/object detection phase) there is a set of objects at estimated locations, with each object being a potential target/vehicle. Each of these objects needs to be assigned to a class, i.e. either a specific class of target or just clutter. A single multi-dimensional measurement of each object (i.e. at each location) is available, as well as estimates of the terrain of the region surrounding the objects. This terrain estimate consists of the division of the overall region into sub-regions, each with an associated local terrain (field, marsh, etc). These sub-regions are separated by boundaries, which are of unspecified type. The boundaries are allowed to split sub-regions of the same local terrain type, so fields split by hedges or walls are treated as separate sub-regions. This work considers only a subset of the contextual or domain specific information that can be incorporated into such an ATR problem. In particular, the focus is on the proximity of other vehicles, the distances to boundaries, the immediate type of terrain and known groupings of targets.
7.1.5 Related work

A related approach to the work documented here is given by Blacknell [13], who looked at the incorporation of contextual information in Synthetic Aperture Radar target detection, by altering the prior probabilities of targets depending on the terrain, the distances to boundaries and the proximity of other potential targets. Although providing a mechanism to incorporate contextual information, the approach did not provide a mechanism to cope with uncertainty in the contextual information. The use of Bayesian networks to exploit contextual information for vehicle detection in infrared linescan imagery has been reported by Ducksbury et al. [45]. Musman and Lehner use Bayesian networks to schedule a shipboard self-defence mechanism [131] and also for scheduling sensors searching for targets [132].
7.1.6 Chapter outline

The structure of this chapter is as follows. Section 7.2 introduces the form of the proposed Bayesian network and specifies the nodes. Section 7.3 shows how one of the conditional distributions in the network can be used to encode the majority of the contextual information. How to make inference from the specified network is briefly covered in Section 7.4, with the mathematical details deferred to Appendix I. Section 7.5 assesses the technique on a simulated example. Finally, Section 7.6 summarises the work and gives some suggestions for possible future extensions.
7.2 Bayesian network
7.2.1 Introduction

It is proposed that a Bayesian network can be used to model the causal relationships leading to a deployment of targets within a region. Together with any clutter objects, these deployed targets lead to the measurements on our initial (detected) objects. The proposed Bayesian network is illustrated in Fig. 7.1. The nodes denote states of affairs and the arcs can be interpreted as causal connections. The “groups” node represents the collections/groups that targets are likely to be deployed from, while the “terrain” node represents the terrain over the region of interest. The “terrain estimate” node is made up of our observations of the terrain in the region of interest, based, for example, on SAR image analysis. The node labelled “classes” and “locations” represents the classes and locations of the objects, whereas the node labelled “measurements” and “estimated locations” contains our measurements of the objects and our estimates of the locations. In the scenario, after an initial detection phase, there is a set of nl potential target locations, l̂ = {l̂i ; 1 ≤ i ≤ nl }, with corresponding data measurements x = {xi ; 1 ≤ i ≤ nl }. The region covered by the sensors, within which the locations can lie, is denoted R. There are J ≥ 1 target types (labelled j = 1, . . . , J), supplemented by a noise/clutter class (labelled j = 0). Thus, in total there are J + 1 classes. The measurements at the potential target
locations can, therefore, be referred to as coming from a collection of nl objects, each of which belongs to one of the J + 1 classes.

[Figure 7.1: Bayesian network formulation for the incorporation of contextual information in an ATR system. Nodes: “groups”, “terrain”, “classes and locations”, “terrain estimate”, “measurements and estimated locations”.]
7.2.2 Groups of targets

The majority of deployed targets are taken to come from specific collections/groups of targets, which are denoted by the discrete random variable G. The set of possible states of G is the set of possible groups, with cardinality denoted by nG . This set of possible groups is assigned using expert knowledge, which may change with each scenario being considered. This would lead to an additional “scenario” node on the Bayesian network of Fig. 7.1, leading into the “groups” node. A given state g can be defined by quoting how many objects from each target type are present. Specifically, we define cg = (cg,1 , . . . , cg,J ), where cg,j is the number of objects from the j-th class contained in group g. Thus, the number of targets making up the state/group g is ng = Σ_{j=1}^{J} cg,j . If ng < nl (i.e. the number of targets within a group is less than the number of potential target locations), then the group is augmented with a set of nl − ng clutter/noise targets. The prior probabilities for the states of G would ideally be assigned using intelligence information and IPB estimates for each scenario. However, in the absence of such information a flat prior can be assigned, namely p(G = g) = 1/nG .
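The bookkeeping for group states can be sketched as follows; the group compositions and the values of J and nl are invented for illustration:

```python
import numpy as np

J = 3            # number of target types -- illustrative value
n_l = 6          # number of potential target locations -- illustrative value

# Each row is c_g = (c_g1, ..., c_gJ), the per-type counts for one group.
# The compositions are made up for illustration.
groups = np.array([
    [2, 1, 0],
    [1, 1, 1],
    [0, 2, 2],
])

n_g = groups.sum(axis=1)          # n_g = sum_j c_gj, targets in each group
clutter = n_l - n_g               # clutter/noise objects padding each group
prior_G = np.full(len(groups), 1.0 / len(groups))   # flat prior p(G=g) = 1/nG
```

Each group is thus padded to exactly nl objects, so that a state of G always accounts for every potential target location.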
7.2.3 Terrain

The random variable representing the terrain, denoted T , is a broad variable covering many aspects and is made up of both continuous and discrete elements. In our case, this includes the positions of
the boundaries of the sub-regions of the area R and the local terrain types within each sub-region. The overall region is defined by a set of vertices and edges, (V, E), which divide R into nT sub-regions. The terrain types within these sub-regions are denoted by the random variables R1 , . . . , RnT . We suppose that there are nτ specific types of local terrain, which we denote by τ1 , . . . , τnτ . These could include “field”, “urban”, “marsh” etc. A natural form for the prior distribution is:

p(T ) = p(V, E) p(R1 , . . . , RnT |V, E).   (7.1)
The distribution p(R_1, …, R_{n_T} | V, E) could be defined to take into account relationships between neighbouring sub-regions, e.g. it is unlikely that an urban area would be completely surrounded by marshes. Although certain rules could be built up to capture these relationships, this sort of assignment of terrain type probabilities for all the allowed sets of vertices and edges would soon become very awkward and time consuming. Thus, we assume independence between the local terrain types of the sub-regions:

p(R_1, …, R_{n_T} | V, E) = ∏_{i=1}^{n_T} p(R_i),    (7.2)

with:

p(R_i = r) = p(τ_r),  r = 1, …, n_τ,  1 ≤ i ≤ n_T,    (7.3)

where p(τ_r) ≥ 0 is the prior probability for the r-th type of local terrain and ∑_{r=1}^{n_τ} p(τ_r) = 1.
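A minimal sketch of the factorised prior of (7.2)–(7.3); the terrain types and the probabilities p(τ_r) below are illustrative placeholders:

```python
# Sketch of the independent terrain-type prior of (7.2)-(7.3).
# Terrain types and their prior probabilities are illustrative only.
from math import prod

p_tau = {"field": 0.5, "urban": 0.2, "marsh": 0.3}   # p(tau_r), sums to 1

def prior_terrain_types(types):
    # p(R_1, ..., R_{n_T} | V, E) = prod_i p(R_i) under independence
    return prod(p_tau[r] for r in types)

# e.g. three sub-regions labelled field, field, marsh
p = prior_terrain_types(["field", "field", "marsh"])   # 0.5 * 0.5 * 0.3
```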
Attempting to assign appropriate prior distributions for (V, E) is a complicated task, since not only is the number of vertices a random variable, but also the positions of the vertices and edges joining them are random variables. Assigning flat priors is possible, but this makes later inference awkward, since the domain of the random variables has to be integrated over. Fortunately, progress can be made if our conditional distribution for the observation of the terrain given the actual terrain is restrictive. This is covered in Section 7.2.7.
7.2.4 Classes and locations

The random variable C records the class at each of the n_l potential target locations. A state of C consists of an n_l-dimensional vector, c = (c_1, …, c_{n_l}), whose elements are the classes for each of the n_l objects. The class allocations variable is coupled with the locations variable, L, which contains the actual locations of the objects. A state of L consists of an n_l-dimensional vector, l = (l_1, …, l_{n_l}), whose elements are the locations for each of the n_l objects. A pair (C, L) = (c, l) is referred to as an allocation of targets. If all possible allocations of targets are allowed, then there will be (J + 1)^{n_l} possible states of C. Thus, the computational load of evaluating the posterior node probabilities increases exponentially with n_l. This can be reduced if the possible states of C are restricted based on the constituents of the groups, G; e.g. if none of the states of G has more than one ADU, we might choose to restrict the states of C to have at most one ADU.
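The enumeration and restriction of the states of C can be sketched as follows; the values of J and n_l, and the use of class 1 as the scarce class (standing in for the "at most one ADU" example), are illustrative assumptions:

```python
# Sketch: enumerating the states of C over n_l locations with classes
# 0 (clutter) .. J, then pruning states incompatible with the groups G
# (here: at most one object of class 1, an illustrative restriction).
from itertools import product

J = 2      # target classes 1..J; class 0 is clutter (illustrative)
n_l = 3    # potential target locations (illustrative)

all_states = list(product(range(J + 1), repeat=n_l))   # (J+1)^{n_l} states

restricted = [c for c in all_states if c.count(1) <= 1]

n_states = len(all_states)   # 27 for J = 2, n_l = 3
```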
The conditional distribution for this node of the Bayesian network is, for ease of notation, denoted by p(c, l|g, t). Its specification (Section 7.3) allows incorporation of contextual information and domain specific knowledge into the classifier (ATR system).
7.2.5 Measurements and locations

The measurements x = (x_1, …, x_{n_l}) and estimated locations l̂ = (l̂_1, …, l̂_{n_l}) depend on the actual classes and locations (C, L) = (c, l) and the terrain T = t, via the conditional distribution p(x, l̂ | c, l, t). The pairs of data measurements and estimated locations (x_i, l̂_i) are taken to be independent given (c, l, t). Under the assumption that the accuracy of a location estimate is independent of the class of the object at the location, we can then write:

p(x, l̂ | c, l, t) = ∏_{i=1}^{n_l} p(x_i | c_i, l_i, t) p(l̂_i | l_i, t).    (7.4)
This assumption is a reasonable first approximation, but it should be noted that some types of target will be easier to locate than others, in which case the independence assumption will not necessarily hold. The distribution p(x_i | c_i, l_i, t) is the measurement distribution for the class c_i, in the terrain t, at location l_i. Specification of these distributions is the focus of chapters 3, 4, 5 and 6, and is the subject of much research interest, as evidenced by the literature survey of chapter 2. The dependence of the measurement distributions on the terrain is likely to be greatest for the noise/clutter class. However, for ease of implementation, in our documented experiments we do not vary the measurement distributions across the terrain. Thus we set:

p(x_i | c_i, l_i, t) = p(x_i | c_i),  i = 1, …, n_l,    (7.5)

removing the arc between object measurements and terrain in Fig. 7.1. In general, this experimental simplification would not be made. The distribution p(l̂_i | l_i, t) is generally taken to be a δ-function, so that the measured locations are the same as the actual locations. Specifically:

p(l̂_i | l_i, t) = δ(l̂_i − l_i),    (7.6)
where δ(x − y) indicates a point mass for x at the value y. In theory, a non-trivial distribution of simple perturbations about the actual locations could be used. For example, in the continuous case, a multivariate Gaussian distribution could be used for each object, with mean vector set to be the actual location and small variance terms in the covariance matrix. In the discrete (i.e. pixelised) case, small grids could be used, centred about the actual locations. However, the conditional probability distributions p(c, l | g, t) (covered in Section 7.3) would have to be recalculated for all possible values of l. This is not feasible in the continuous case and soon becomes computationally challenging in
the discrete case. This difficulty in specifying the conditional distributions is in addition to any problems with propagating evidence in the network, once observations (measurements) are made.
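The factorised likelihood (7.4), under the simplifications (7.5) and (7.6), can be sketched as below; the one-dimensional Gaussian class-conditional densities are invented stand-ins for the measurement distributions of chapters 3–6:

```python
# Sketch of the factorised likelihood (7.4) under the simplifications
# (7.5)-(7.6): terrain-independent measurement densities and
# delta-function location estimates.  The 1-D Gaussian class densities
# are purely illustrative stand-ins for p(x_i | c_i).
from math import exp, pi, sqrt

def gauss(x, mu, sigma):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

# illustrative p(x | c) for classes 0 (clutter), 1 and 2
class_density = [
    lambda x: gauss(x, 0.0, 2.0),
    lambda x: gauss(x, 3.0, 1.0),
    lambda x: gauss(x, -3.0, 1.0),
]

def likelihood(x, l_hat, c, l):
    # p(x, l_hat | c, l, t) = prod_i p(x_i | c_i) * delta(l_hat_i - l_i)
    if l_hat != l:                     # delta-function of (7.6)
        return 0.0
    p = 1.0
    for x_i, c_i in zip(x, c):
        p *= class_density[c_i](x_i)   # terrain-independent, as in (7.5)
    return p
```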
7.2.6 Standard Bayesian classifier

The standard Bayesian classifier considered in the previous chapters of this thesis (and the majority of the ATR literature, as documented in Chapter 2) comprises only the distributions p(x_i | c_i), along with some very simple prior probabilities for the classes, p(c_i). Classifications are made to the maximum a posteriori (MAP) class, as determined by the posterior class probabilities from Bayes’ rule:

p(c_i | x_i) ∝ p(x_i | c_i) p(c_i).    (7.7)
Thus, the standard formulation does not take into account any of the extra contextual information and domain specific knowledge that is available.
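A minimal sketch of the standard MAP classifier of (7.7), with invented likelihood and prior values:

```python
# Sketch of the standard MAP classifier (7.7): posterior over classes
# from per-class likelihoods and priors, normalised for readability.
# The numerical values are illustrative.

def map_classify(likelihoods, priors):
    # p(c_i | x_i) proportional to p(x_i | c_i) p(c_i)
    post = [lik * pr for lik, pr in zip(likelihoods, priors)]
    z = sum(post)
    post = [p / z for p in post]
    return max(range(len(post)), key=lambda c: post[c]), post

c_hat, posterior = map_classify([0.05, 0.40, 0.10], [0.6, 0.2, 0.2])
```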
7.2.7 Terrain estimate

The random variable T̂ represents our estimate of the terrain. As in the discussion of the terrain in Section 7.2.3, the terrain estimate consists of the positions and boundaries of sub-regions, along with their respective local terrain types. Specifically, we introduce the estimated vertices and edges, (V̂, Ê), which divide the region R into n̂_T sub-regions, of local terrain types R̂_1, …, R̂_{n̂_T}.
The distribution of the terrain estimate T̂ depends on the actual terrain T = t, via the conditional distribution p(T̂ | T). A natural conditional dependence between the components of T̂ and T is given by:

p(T̂ | T) = p(V̂ | V) p(Ê | V̂, V, E) p(R̂_1, …, R̂_{n̂_T} | V̂, Ê, V, E, R_1, …, R_{n_T}),    (7.8)

where we note that there is no requirement for the estimated number of sub-regions to be equal to the actual number of sub-regions, or indeed for the number of estimated vertices to be the same as the actual number of vertices. Note, however, that (7.8) does not take into account the fact that edges and vertices will be easier to locate for some terrain type boundaries than others. A full treatment of the possible measurement errors is not feasible, because of the requirement to specify the distribution p(c, l | g, t) for each allowable terrain t. It might seem reasonable to propose that the number of vertices is correctly estimated, with simple perturbations (Gaussian in the continuous case and grid-based in the discrete case) for their positions. Under this scheme, provided that the perturbations are not too large, it would be reasonable to assume that the existence of each edge is correctly determined. However, even after this simplification there would be vast computational consequences. For example, a 3 × 3 discrete grid centred about each of 5 vertices would result in 9^5 = 59049 allowable (non-zero posterior probability) terrain states from the first two terms of (7.8) alone. This, in turn, would require the conditional distributions p(c, l | g, t) (introduced in Section 7.2.4 and specified in Section 7.3) to be determined for at least 59049 different terrain states, which soon makes inference on the network impracticable. To circumvent this problem, in this thesis it is assumed that the number of sub-regions is observed/determined correctly, as are the boundaries of these sub-regions. This gives:

n̂_T = n_T,  and  (v̂, ê) = (v, e).    (7.9)

The third term on the RHS of (7.8) then simplifies to:

p(R̂_1, …, R̂_{n̂_T} | V̂, Ê, V, E, R_1, …, R_{n_T}) = ∏_{i=1}^{n_T} p(R̂_i | R_i),    (7.10)

with:

p(R̂_i = r | R_i = κ) = p(τ_r | τ_κ),  r, κ = 1, …, n_τ,  1 ≤ i ≤ n_T,    (7.11)
where p(τ_r | τ_κ) is the conditional probability that a local terrain estimate is τ_r when the actual local terrain is τ_κ. The n_τ × n_τ matrix B of these conditional probabilities (defined so that B_{i,j} = p(τ_j | τ_i)) would be determined by consultation with human experts, who could take into account the techniques used to estimate local types of terrain (and their knowledge of the likely errors). In summary, for a region R, our simplification assumes that the number of sub-regions is correctly estimated, as are the boundaries of these sub-regions. However, the local terrain types within these boundaries are only estimates, which can be erroneous.
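A sketch of the simplified terrain-estimate model (7.10)–(7.11); the confusion matrix entries below are invented, whereas in practice they would come from expert elicitation:

```python
# Sketch of the terrain-estimate model (7.10)-(7.11): a confusion matrix
# B with B[i][j] = p(tau_j estimated | tau_i actual).  The entries are
# illustrative placeholders, not elicited values.
from math import prod

terrain = ["field", "urban", "marsh"]
B = [
    [0.8, 0.1, 0.1],   # actual field
    [0.1, 0.8, 0.1],   # actual urban
    [0.2, 0.1, 0.7],   # actual marsh
]

def p_terrain_estimate(est, actual):
    # p(R_hat_1..n | R_1..n) = prod_i p(R_hat_i | R_i),
    # with sub-region boundaries assumed correctly observed
    idx = {t: k for k, t in enumerate(terrain)}
    return prod(B[idx[a]][idx[e]] for e, a in zip(est, actual))

p = p_terrain_estimate(["field", "marsh"], ["field", "urban"])   # 0.8 * 0.1
```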
7.3 Conditional distributions of target allocations
7.3.1 General

This section explains how the conditional distributions of the target allocations, given the group and terrain, can be defined so that they incorporate the domain specific knowledge and contextual information. There are, of course, many possible ways of coding up the contextual information into suitable conditional distributions p(c, l | g, t). Each method will have free parameters, whose values will determine the relative effects of the pieces of contextual information being considered. Ideally, these parameters should be learnt from exemplar target deployments in regions of interest. Techniques exist for learning the conditional distributions for the nodes of Bayesian networks given suitable training data [81, 83]. However, the availability of such data is limited for our application. Standard collections of training data for ATR experiments consist of measurements on isolated targets, reflecting the nature of standard ATR approaches. Collecting controlled data under realistic conditions is prohibitively expensive, and the resulting data would no doubt be security classified. Use of real surveillance data would suffer from both high security classification and a lack of accurate ground-truth (i.e. the data would be unlabelled). In the absence of suitable training data and, indeed, to complement such data if it exists, expert opinion is highly useful for determining appropriate values for the free parameters. How to convert such expert opinion into suitable prior and conditional distributions is the subject of much research within the Bayesian statistics community. In particular, work on prior elicitation [85, 143] is relevant.
The conditional distribution used for the documented example is expressed as a product of weights:

p(c, l | g, t) ∝ w_bndry(c, l, t) × w_clust(c, l, t) × w_terr(c, l, t) × w_grp(c, l, g, t),    (7.12)

where w_bndry(c, l, t) is a factor related to the distances of objects from boundaries, w_clust(c, l, t) is a factor related to the clustering of objects, w_terr(c, l, t) is a factor related to the local types of terrain at the object locations, and w_grp(c, l, g, t) is a factor relating the allocation defined by (c, l) to the group g. The forms of each of these weightings are detailed over the following four sub-sections.
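The product-of-weights construction (7.12) can be sketched as follows; the placeholder weight functions are purely illustrative, with the actual forms given in the following sub-sections:

```python
# Sketch of combining the contextual weights of (7.12) into an
# unnormalised conditional distribution p(c, l | g, t), then normalising
# over a candidate set of allocations.  The weight functions here are
# trivial placeholders for the forms of sections 7.3.2 onwards.

def p_allocation(candidates, g, t, w_bndry, w_clust, w_terr, w_grp):
    unnorm = {
        (c, l): w_bndry(c, l, t) * w_clust(c, l, t)
                * w_terr(c, l, t) * w_grp(c, l, g, t)
        for (c, l) in candidates
    }
    z = sum(unnorm.values())
    return {cl: w / z for cl, w in unnorm.items()}

# placeholder weights (illustrative): group factor favours more targets
flat = lambda c, l, t: 1.0
grp = lambda c, l, g, t: 1.0 + sum(c)
cands = [((0, 1), (0, 1)), ((1, 1), (0, 1))]
post = p_allocation(cands, g=None, t=None,
                    w_bndry=flat, w_clust=flat, w_terr=flat, w_grp=grp)
```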
7.3.2 Proximity to boundaries

This section attempts to account for the fact that military vehicles are unlikely to be found exposed in the centre of fields, and will tend to be located near hedges or the edges of woodland, because of the extra cover provided. The approach adopted is similar to that proposed by Blacknell [13]. For each location l_i the (Euclidean) distance d_{b,i} to the closest boundary is calculated. The weight factor is then taken to be:

w_bndry(c, l, t) = ∏_{i | c_i ≠ 0} exp[−ψ d_{b,i}²] × ∏_{i | c_i = 0} exp[−ψ d_{b,0}²],    (7.13)

where the objects that are classified as targets (and therefore have c_i ≠ 0) have been separated from those that are classified as noise/clutter (and therefore have c_i = 0). The factor ψ can be used to determine the relative weights for different distances of targets from boundaries. The larger the value of ψ, the more locations that are close to a boundary are favoured. Clutter objects are expected to be randomly located, so the weighting for clutter measurements is assigned using the fixed distance d_{b,0}, regardless of the distance of the clutter object to the boundary. Given a value for ψ, the extra weight assigned to targets close to boundaries is therefore determined by the choice of d_{b,0}. For our experiments, in the absence of suitable training data, we have used representative values for the terms ψ and d_{b,0}. The chosen values are based on the distribution of the minimum squared distance between a uniformly drawn point in the region R and the sub-region boundaries. Denoting this random variable by Υ, we set:

ψ = 1/E(Υ)  and  d_{b,0}² = Median(Υ).    (7.14)
Note that these values have been selected in a somewhat ad hoc manner (for demonstration purposes only), and should not be taken to be definitive values. In practice, they should be determined by consultation with military experts. Analytical calculation of the two terms in (7.14) gets quite complicated for non-trivial sub-region boundaries. Therefore, the values are approximated by calculating the minimum squared distance to a sub-region boundary for each point in a grid approximation of R.
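A sketch of the boundary weighting (7.13), with ψ and d_{b,0} set via the grid approximation described above; the toy region with a single straight boundary, and the grid resolution, are invented for illustration:

```python
# Sketch of the boundary weight (7.13) with psi and d_{b,0} set as in
# (7.14) from a grid approximation of the squared distance to the
# nearest boundary.  Region, boundary and layout are illustrative.
from math import exp
from statistics import mean, median

def sq_dist_to_boundary(p, boundary_x):
    # toy region: a single vertical boundary at x = boundary_x
    return (p[0] - boundary_x) ** 2

# grid approximation of Upsilon over a 10 x 10 square, boundary at x = 5
grid = [(x + 0.5, y + 0.5) for x in range(10) for y in range(10)]
upsilon = [sq_dist_to_boundary(p, 5.0) for p in grid]
psi = 1.0 / mean(upsilon)            # psi = 1 / E(Upsilon)
d2_b0 = median(upsilon)              # d_{b,0}^2 = Median(Upsilon)

def w_bndry(classes, locs, boundary_x=5.0):
    # targets (c_i != 0) weighted by distance to boundary; clutter by d_{b,0}
    w = 1.0
    for c, l in zip(classes, locs):
        d2 = sq_dist_to_boundary(l, boundary_x) if c != 0 else d2_b0
        w *= exp(-psi * d2)
    return w
```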
7.3.3 Clustering

The clustering weight w_clust(c, l, t) attempts to account for the fact that military vehicles will tend to travel and assemble in clusters. As a consequence of this, targets are more likely to be found closely spaced than isolated. For each object that has been assigned to be a target, the proximity of other objects that have also been assigned to be targets is determined. Specifically, for each object location l_i for which c_i ≠ 0, we calculate:

d_{l,i} = min_{j | j ≠ i, c_j ≠ 0} d(l_i, l_j),    (7.15)
where d(a, b) is the Euclidean distance between the locations a and b. In the event of there being no other targets among the object class assignments, so that the set {j | j ≠ i, c_j ≠ 0} is empty, d_{l,i} is set to a default value of d_{l,0}. The weight factor is taken to be:

w_clust(c, l, t) = ∏_{i | c_i ≠ 0} exp[−ϕ d_{l,i}²] × ∏_{i | c_i = 0} exp[−ϕ d_{l,0}²],    (7.16)
where (as in Section 7.3.2) the objects that are classified as targets have been separated from those that are classified as noise/clutter. The factor ϕ determines the relative weights for different degrees of isolation of targets (as measured by squared Euclidean distance), while d_{l,0} determines the weighting for both clutter measurements and single targets. Given a value for ϕ, the extra weight assigned to clustered targets, rather than single targets or clutter, is therefore determined by d_{l,0}. A human expert looking for clusters of targets may well look for local objects, rather than only objects that are also assigned to be targets. In this case the quantities d_{l,i} are replaced by:

d_{l,i} = min_{1 ≤ j ≤ n_l, j ≠ i} d(l_i, l_j).    (7.17)
However, since the justification for this second formulation is only that it is quicker to compute, it is (7.15) that is used in the experiments in this chapter. For our experiments, representative values for the terms ϕ and d_{l,0} have been used. To determine reasonable values, n_l points, denoted (Q_1, …, Q_{n_l}), are drawn independently from a uniform distribution on the region R. The variable χ is defined to be the minimum squared distance between any of these n_l points: χ =
min
1≤i≤nl ,i