UNIVERSITY OF CINCINNATI

Date: July 13, 2007

I, Hatim Alqadah, hereby submit this work as part of the requirements for the degree of:
Master of Science (MS)
in:
Electrical Engineering
It is entitled:
Optimized Time-Frequency Classification Methods for Intelligent Automatic Jettisoning of Helmet-Mounted Display Systems

This work and its defense approved by:
Chair: Dr. Howard Fan
Dr. Raj Bhatnagar
Dr. Carla Purdy
Optimized Time-Frequency Classification Methods for Intelligent Automatic Jettisoning of Helmet-Mounted Display Systems
A thesis submitted to the Division of Graduate Studies and Research of the University of Cincinnati in partial fulfillment of the requirements for the degree of Master of Science in the Department of Electrical & Computer Engineering and Computer Science of the College of Engineering at the University of Cincinnati.

July 2007

By Hatim Alqadah
Bachelor of Science in Computer Engineering, University of Cincinnati, 2005

Thesis Advisor and Committee Chairperson: Dr. Howard Fan
Abstract

Helmet-Mounted Display Systems (HMDS) improve the effectiveness of air force pilots in combat. The embedded display systems provide the pilot with vital aircraft information, "look and shoot" weapon cueing, and night vision capability. However, the added weight and the shift in center of gravity due to these systems significantly increase the probability of a neck injury to the pilot in the event of an ejection or crash. The Air Force Research Laboratory (AFRL) developed a mechanical system to automatically release the HMDS based on the measured acceleration/force at the head. However, the acceleration/force measurements during a pilot's normal air combat maneuvering (ACM) can be near the peak accelerations seen during a crash or ejection. The acceleration signals measured in the two environments can be modeled as non-stationary random signals with time-varying and unknown statistics. Time-frequency distributions have been shown to be an effective tool in the analysis and classification of non-stationary signals. In this thesis we present two basic approaches to the optimization of time-frequency distributions with the objective of discriminating between sets of non-stationary signals. The approaches are independent of any statistical model of the environments. Experimental results are based on real-life data supplied by the AFRL.
Acknowledgments

I would like to give my deep thanks to my advisor Dr. Howard Fan for his constant support, guidance, and encouragement throughout my studies. He has always managed to find time to provide me with invaluable advice. It was my honor as well as my pleasure to work with him. I would also like to thank Dr. Carla Purdy and Dr. Raj Bhatnagar for serving on my committee and providing me with insightful comments. I would like to express my gratitude to my family: I am forever thankful to my father and mother for their endless support, and I thank my brother and sister, who are also colleagues of mine, for their intelligent insight as well. Finally, very special thanks to my fiancée, who has seen me through the hard times and been there to share the good times as well. Her love has carried me through this remarkable journey.
Table of Contents

LIST OF FIGURES ------------------------------------------------------------ viii
LIST OF TABLES --------------------------------------------------------------- xi
CHAPTER 1 INTRODUCTION --------------------------------------------------------- 1
  1.1 SIGNAL DETECTION AND CLASSIFICATION OVERVIEW ------------------------------ 1
  1.2 NON-STATIONARY SIGNALS ---------------------------------------------------- 2
  1.3 INTRODUCTION TO JOINT TFRS ------------------------------------------------ 3
  1.4 THESIS OUTLINE ------------------------------------------------------------ 4
CHAPTER 2 THEORETICAL BACKGROUND OF TFR ---------------------------------------- 5
  2.1 TIME-FREQUENCY FUNDAMENTALS ----------------------------------------------- 5
  2.2 SHORT-TIME FOURIER TRANSFORM ---------------------------------------------- 8
  2.3 QUADRATIC TFRS ------------------------------------------------------------10
    2.3.1 General Properties ----------------------------------------------------11
    2.3.2 Noise and Cross Terms -------------------------------------------------12
    2.3.3 Kernel Functions ------------------------------------------------------13
    2.3.4 The Ambiguity Function ------------------------------------------------14
    2.3.5 Cohen Class TFRs ------------------------------------------------------14
    2.3.6 Discrete Formulation of Quadratic TFR ---------------------------------15
CHAPTER 3 THE USE OF TFR IN CLASSIFICATION OF SIGNALS --------------------------18
  3.1 USE OF AMBIGUITY FUNCTION FOR CLASSIFICATION ------------------------------19
  3.2 DIMENSIONALITY REDUCTION FOR TFR CLASSIFICATION ---------------------------21
  3.3 FINAL FORMULATION OF TFR CLASSIFIER ---------------------------------------22
  3.4 DISTANCE MEASURES ---------------------------------------------------------23
  3.5 PROBABILITY OF ERROR ESTIMATION -------------------------------------------25
CHAPTER 4 PARAMETRIC KERNEL DESIGN FOR CLASSIFICATION --------------------------28
  4.1 KERNEL ANALYSIS -----------------------------------------------------------28
    4.1.1 Choi-Williams Kernel --------------------------------------------------29
    4.1.2 Radially Gaussian Kernel (RGK) ----------------------------------------29
  4.2 PARAMETRIC OPTIMIZATION PROCEDURE -----------------------------------------31
CHAPTER 5 NON-PARAMETRIC KERNEL DESIGN FOR CLASSIFICATION ----------------------34
  5.1 PREVIOUS FORMULATION OF NON-PARAMETRIC APPROACH ---------------------------34
  5.2 AN IMPROVED APPROACH TO NON-PARAMETRIC OPTIMIZATION -----------------------35
  5.3 DETERMINING THE OPTIMAL NUMBER OF POINTS ----------------------------------37
  5.4 ACCOUNTING FOR CORRELATION ------------------------------------------------38
  5.5 NON-PARAMETRIC OPTIMIZATION PROCEDURE -------------------------------------39
CHAPTER 6 EXPERIMENTAL RESULTS -------------------------------------------------42
  6.1 ACM DATA ACQUISITION ------------------------------------------------------42
  6.2 EJECTION/CRASH DATA ACQUISITION -------------------------------------------45
  6.3 SIMULATION SETUP ----------------------------------------------------------48
  6.4 PARAMETRIC APPROACH RESULTS -----------------------------------------------48
  6.5 NON-PARAMETRIC APPROACH RESULTS -------------------------------------------63
CHAPTER 7 CONCLUSIONS AND FUTURE WORK ------------------------------------------74
  7.1 CONCLUSIONS ---------------------------------------------------------------74
  7.2 FUTURE WORK ---------------------------------------------------------------76
BIBLIOGRAPHY -------------------------------------------------------------------77
List of Figures

Figure 1.1   Typical signal classification system .......................................... 2
Figure 2.1   A linear chirp signal swept from DC to 150 Hz ................................. 9
Figure 2.2   STFT of linear chirp signal using Hanning window length 25 ms ................. 9
Figure 2.3   STFT of linear chirp signal using Hanning window length 75 ms ................. 9
Figure 2.4   WVD of linear chirp signal ................................................... 10
Figure 3.1   Gaussian pulses occurring within a 256 ms buffer ............................. 19
Figure 3.2   WVD of the Gaussian pulses shown in fig. 3.1 ................................. 20
Figure 3.3   The magnitude response of the 3 Gaussian pulses is identical, regardless of time shift ... 21
Figure 4.1   Choi-Williams kernel with σ = 6.2 and σ = 0.6 respectively ................... 29
Figure 4.2   Polar coordinates of ambiguity domain ........................................ 30
Figure 4.3   One example of an RGK ........................................................ 31
Figure 4.4   Live mode system block diagram ............................................... 33
Figure 5.1   Non-parametric system block diagram .......................................... 40
Figure 6.1   Helmet acceleration axis definition .......................................... 42
Figure 6.2   ACM data for the +Z direction ................................................ 43
Figure 6.3   Example ACM signal from the +Z direction ..................................... 44
Figure 6.4   WVD of the ACM signal ........................................................ 44
Figure 6.5   Ambiguity function of ACM signal ............................................. 45
Figure 6.6   VDT test apparatus ........................................................... 46
Figure 6.7   VDT pulse in +Z direction .................................................... 46
Figure 6.8   WVD of the VDT signal in the +Z direction .................................... 47
Figure 6.9   Ambiguity function of VDT signal ............................................. 47
Figure 6.10  Mean VDT class singular values ............................................... 49
Figure 6.11  Mean F15 class singular values ............................................... 49
Figure 6.12  Detection rate versus the number of singular values .......................... 51
Figure 6.13  Plot of error probability versus σ ........................................... 52
Figure 6.14  Normal plot of e12 and e21 using correlation distance measure and Choi-Williams kernel ... 53
Figure 6.15  Choi-Williams kernel probability density of e12 .............................. 54
Figure 6.16  Choi-Williams kernel probability density of e21 .............................. 54
Figure 6.17  Normal plot of e12 and e21 using Euclidean distance measure .................. 55
Figure 6.18  Choi-Williams kernel probability density of e12, Euclidean distance .......... 55
Figure 6.19  Choi-Williams kernel probability density of e21, Euclidean distance .......... 56
Figure 6.20  Normal plot of e12 and e21 using quadratic discriminant function distance measure ... 56
Figure 6.21  Choi-Williams kernel probability density of e12 using quadratic discriminant .. 57
Figure 6.22  Choi-Williams kernel probability density of e21 using quadratic discriminant .. 57
Figure 6.23  Normal plot of e12 and e21 using correlation distance measure and RGK ........ 59
Figure 6.24  RGK probability density of e12 using correlation distance .................... 59
Figure 6.25  RGK probability density of e21 using correlation distance .................... 60
Figure 6.26  Normal plot of e12 and e21 using Euclidean distance measure for RGK .......... 60
Figure 6.27  RGK probability density of e12 using Euclidean distance ...................... 61
Figure 6.28  RGK probability density of e21 using Euclidean distance ...................... 61
Figure 6.29  Normal plot of e12 and e21 using quadratic discriminant function ............. 62
Figure 6.30  RGK probability density of e12 using quadratic discriminant .................. 62
Figure 6.31  RGK probability density of e21 using quadratic discriminant .................. 63
Figure 6.32  Selecting the number of features using quadratic discriminant ................ 64
Figure 6.33  Selecting the number of features using quadratic discriminant ................ 65
Figure 6.34  Selecting the number of features using Euclidean distance .................... 66
Figure 6.35  Selecting the number of features using correlation distance .................. 66
Figure 6.36  Normal plot of e12 and e21 using quadratic discriminant distance measure ..... 67
Figure 6.37  Non-parametric kernel probability density of e12 using quadratic discriminant . 68
Figure 6.38  Non-parametric kernel probability density of e21 using quadratic discriminant . 68
Figure 6.39  Normal plot of e12 and e21 using Euclidean distance for non-parametric kernel . 69
Figure 6.40  Non-parametric kernel probability density of e12 using Euclidean distance ..... 69
Figure 6.41  Non-parametric kernel probability density of e21 using Euclidean distance ..... 70
Figure 6.42  Normal plot of e12 and e21 using correlation distance for non-parametric kernel ... 70
Figure 6.43  Non-parametric kernel probability density of e12 using correlation distance ... 71
Figure 6.44  Non-parametric kernel probability density of e21 using correlation distance ... 71
List of Tables

Table 2.1  List of some well-known kernel functions ....................................... 15
Table 6.1  WVD classification results (no kernel applied); the predicted probabilities were computed by (3.22) ... 50
Table 6.2  Optimized Choi-Williams kernel results ......................................... 52
Table 6.3  Results for optimized RGK ...................................................... 58
Table 6.4  Results for optimized non-parametric kernel .................................... 67
Table 6.5  Results for using 24 features and correlation distance ......................... 68
Chapter 1 Introduction
1.1 Signal Detection and Classification Overview
Signal detection is a common problem throughout many signal processing fields such as radar, sonar, communications, and biomedical signal processing [1], [2], [3]. Detection of ejection/crash (EC) events against a background of normal air-combat maneuvering (ACM) interference can also be formulated as a signal detection problem: determine, in an environment of interference, whether a signal is present or not. This can be posed as a two-hypothesis problem, in which H_0 is the hypothesis that only interference exists in the received signal and H_1 is the hypothesis that a signal is present:

H_0: r(t) = n(t), \qquad H_1: r(t) = s(t) + n(t)        (1.1)

where n(t) is the noise or interference component. The classical solution to this problem is the matched filter, which is known to maximize the signal-to-interference ratio (SIR) when the statistics of the interference are known and stationary. When the interference is non-stationary, however, the filter must itself become time-varying, and when the interference has unknown statistics the matched filter is no longer a viable solution [4].
Another way to approach this problem is from a signal classification point of view. This approach derives a set of features of the input signal that are used to determine which hypothesis
class the particular signal belongs to. Let h1 and h2 be two signal classes, defined as follows:

h1: the measured acceleration is an EC-type signal.
h2: the measured acceleration is an ACM-type signal.

A typical signal classification system is shown below in figure 1.1.
Figure 1.1: Typical signal classification system.
Our objectives then become: how best to generate a set of features, how to select the features that best distinguish the two classes, and how to design a classifier that decides which class a given set of features belongs to [5]. Another important aspect of signal classification is the distinction between supervised and unsupervised classification. For this problem we took a supervised approach, in which training data from both classes were available.
1.2 Non-Stationary Signals
Researchers at the AFRL have found that acceleration/force responses from ACM and EC events differ in rise times and pulse shapes. These differences between the two environments are manifest more clearly in the signal's frequency domain and can serve as features for the classifier. However, like the majority of signals that occur in the physical world, the ACM and EC acceleration signals are random non-stationary signals that may appear for only a short period of time. That is, the statistics of the signal of interest are time-varying, and are often not known or are too complex to model. The time-varying statistics reveal themselves in the frequency domain as time-varying spectra. The time-dependent nature of the signals' frequency content cannot be analyzed using standard Fourier tools. Fourier analysis is an excellent tool that decomposes a time-domain signal into a spectrum of frequency components and is commonly used in a number of engineering applications. To understand why Fourier analysis cannot handle signals with time-varying spectra, we first give the definition of the Fourier transform for a given signal s(t):

S(\omega) = \frac{1}{2\pi} \int_{-\infty}^{\infty} s(t)\, e^{-j\omega t}\, dt        (1.2)
The Fourier transform as given by (1.2) maps the signal from a one-dimensional time representation to another one-dimensional representation that is a function of the frequency variable ω only. This is because the computation of the Fourier transform completely integrates out the time information. As a result, the Fourier transform reveals which frequency components are present but not when these components occurred. For signals whose frequency components change with time, a full analysis requires a two-dimensional function of t and ω, as we will discuss in the next subsection.
1.3 Introduction to Joint TFRs
The notion of a time-varying spectrum leads naturally to the search for a representation that can take this into account. By representing a signal jointly in time and in frequency we can bring out the non-stationary behavior of a signal in a discernible fashion. Research over the past 60 years has yielded a good number of tools that compute the energy distribution jointly in time and frequency [6]. The main problem in computing these time-frequency distributions is the concept of resolution. As one may imagine, time and frequency are not independent quantities: the laws of physics restrict how well we can measure time and frequency simultaneously [7]. The concept
is very similar to the uncertainty principle in quantum mechanics, where position and momentum are the quantities of interest. There exist two main methods of computing time-frequency representations (TFRs): linear TFRs and quadratic TFRs. The differences between the two methods will be explained more thoroughly in the next chapter.
1.4 Thesis Outline
Having considered the basic formulation of our problem, this thesis can be summarized as follows. In chapter 2 we discuss the basic theory upon which TFRs are built; this chapter lays down the basic properties and tools that provide the foundation for our approaches. In chapter 3 we discuss the use of TFRs in the context of classification; specifically, we illustrate the problem of designing TFRs with the specific objective of using them for classification. In chapter 4 we discuss a so-called parametric approach to designing optimal TFRs for classification, and in chapter 5 a non-parametric approach. In chapter 6 we discuss the data that were collected, the experiments, and the MATLAB simulations used to contrast the two approaches laid out in chapters 4 and 5. Finally, in chapter 7 we give our conclusions and discuss future research involving optimal TFR design for detection/classification.
Chapter 2 Theoretical Background of TFR
This chapter provides a brief introduction to the theory behind time-frequency analysis. Section 2.1 deals with fundamental ideas regarding TFRs; section 2.2 describes a linear TFR, the popular short-time Fourier transform; and section 2.3 introduces the computation of quadratic transforms and how they can be derived from the Wigner-Ville distribution.
2.1 Time-Frequency Fundamentals
We know that the total energy [6] contained in a signal can be computed equally from its time representation s(t) or its frequency representation S(ω):

Energy = \int |s(t)|^2\, dt = \int |S(\omega)|^2\, d\omega        (2.1)

The integrand quantities in (2.1), |s(t)|^2 and |S(\omega)|^2, define the energy densities in time and frequency respectively. From these densities we can compute useful quantities, the first of which is the fractional energy in a small time or frequency interval:

|s(t)|^2\, \Delta t = fractional energy in the differential time interval \Delta t        (2.2)

|S(\omega)|^2\, \Delta\omega = fractional energy in the differential frequency interval \Delta\omega        (2.3)

The average time and average frequency are given by

t_{mean} = E[t] = \int t\, |s(t)|^2\, dt        (2.4)

\omega_{mean} = E[\omega] = \int \omega\, |S(\omega)|^2\, d\omega        (2.5)
We can also define the time and frequency standard deviations, which we term the time duration and bandwidth of the signal respectively. The deviations are defined as

\sigma_t^2 = \int (t - t_{mean})^2\, |s(t)|^2\, dt = E[t^2] - E[t]^2        (2.6)

and

\sigma_\omega^2 = \int (\omega - \omega_{mean})^2\, |S(\omega)|^2\, d\omega = E[\omega^2] - E[\omega]^2        (2.7)
A large σ t would indicate that the energy density in time is widely spread around the mean time; a small σ t would indicate that the signal has narrow energy density in time. Similarly the frequency deviation σ ω gives some indication about how wide or narrow the energy density in frequency is. We now introduce the concept of representing the signal energy jointly in time and frequency simultaneously.
The basic idea of time-frequency analysis is to devise a function that measures how the energy density of a signal is distributed jointly in time and frequency. Such a representation may be useful for a number of purposes; for example, in the context of a music signal we would be able to derive pitch (frequency) information as well as tempo (time) information. A joint time-frequency distribution P(t, ω) should ideally satisfy some basic properties, defined below [7].
1. Given the joint distribution P(t, ω) we can compute the instantaneous energy at a particular time or frequency:

\int P(t, \omega)\, d\omega = |s(t)|^2        (2.8)

\int P(t, \omega)\, dt = |S(\omega)|^2        (2.9)

2. The total energy should be conserved:

Energy = \iint P(t, \omega)\, d\omega\, dt = \int |s(t)|^2\, dt = \int |S(\omega)|^2\, d\omega        (2.10)

3. A TFR would allow us to compute the expected value of any function of time and frequency:

E[x(t, \omega)] = \iint x(t, \omega)\, P(t, \omega)\, d\omega\, dt        (2.11)
Equations (2.6) and (2.7) defined the time duration and frequency bandwidth. The two quantities, however, are related to each other by the uncertainty principle, which we state here in terms of time and frequency (the proof can be found in [7]):

\sigma_t\, \sigma_\omega \ge \frac{1}{2}        (2.12)

From (2.12) we can easily see that if the energy density in time is narrow, corresponding to a short-duration signal, the energy density in frequency will be wide, corresponding to a large bandwidth, and vice versa. This forms, although indirectly, a relationship to the concept of time-frequency concentration. It has been noted that in the literature concentration is often mistaken for resolution [8], [9]; the two are not the same. Concentration is a measure of how densely the signal energy is spread over a specific area in time and frequency. Let us define a time-frequency cell as \Delta t\, \Delta\omega, so that the energy contained in this cell is

P(t, \omega)\, \Delta t\, \Delta\omega = the fractional energy contained in a time-frequency cell        (2.13)
The concentration of energy in the cell is given by (2.13) and is governed by the size of the cell together with (2.12). Time-frequency resolution, by contrast, refers to how well we can resolve separate components in the time-frequency plane; a more thorough treatment of the concept can be found in [8]. The concentration of a TFR can be used to assess how "good" a time-frequency distribution it is. It is desirable to have a distribution that is well concentrated in both time and frequency simultaneously.
2.2 Short-Time Fourier Transform
The short-time Fourier transform (STFT) is a popular tool in time-frequency analysis; the reason for its popularity is that its computation is simple yet powerful. The STFT is based on the assumption that a non-stationary signal can be broken up into a number of segments that exhibit stationary behavior. These segments can then be analyzed using standard Fourier tools to bring out the frequency response. The computation is straightforward:

P(t, \omega) = \int s(\tau)\, h(\tau - t)\, e^{-j\omega\tau}\, d\tau        (2.14)

where h(t) is a window function of finite duration applied to the signal s(t). We can localize the signal in time by choosing a window of short duration; however, the more we localize the signal in time, the more, by equation (2.12), the frequency uncertainty increases, and the less localized the signal is in frequency. To illustrate this concept consider a linear chirp signal given as

s(t) = \mathrm{Re}\{e^{-j\omega(t)\, t}\}, \qquad \omega(t) = \omega_o + \beta t        (2.15)

where β is a constant that defines the slope of the frequency function ω(t). Figure 2.1 illustrates a plot of this function.
Figure 2.1: A linear chirp signal swept from DC to 150 Hz
The STFT of this signal is illustrated below using Hanning windows of different lengths.
Figure 2.2: STFT of linear chirp signal using Hanning window length 25 ms.
Figure 2.3: STFT of linear chirp signal using Hanning window length 75 ms.
As we can see from figures 2.2 and 2.3, the more we shorten the duration of the window, the less concentrated the energy is in frequency within each cell Δt Δω. This is a limitation of the technique itself and not a consequence of the signal's uncertainty properties [7]: by chopping the signal into smaller segments we are altering the signal.
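To make the window-length trade-off concrete, the following sketch computes STFT magnitudes of the chirp in (2.15) for two window lengths, mirroring figures 2.2 and 2.3. This is a minimal Python/NumPy illustration (the thesis experiments themselves used MATLAB); the 1 kHz sampling rate and hop size are assumptions chosen for the example, not values taken from the thesis.

```python
import numpy as np

def stft_mag(s, win, hop):
    """Magnitude of the discrete STFT (2.14): slide a window over s and FFT each segment."""
    frames = [s[i:i + len(win)] * win
              for i in range(0, len(s) - len(win) + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1))

fs = 1000.0                                  # assumed sampling rate (Hz)
t = np.arange(0, 1.0, 1.0 / fs)
chirp = np.cos(2.0 * np.pi * 75.0 * t * t)   # instantaneous frequency 150t: DC to 150 Hz
short_win = stft_mag(chirp, np.hanning(25), hop=5)  # 25 ms window: sharp in time
long_win = stft_mag(chirp, np.hanning(75), hop=5)   # 75 ms window: sharp in frequency
print(short_win.shape, long_win.shape)       # (frames, frequency bins)
```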
2.3 Quadratic TFRs
This class of TFRs offers better time-frequency resolution characteristics than the STFT approach. We begin by introducing the Wigner-Ville distribution (WVD), defined as

P(t, \omega)_{WVD} = \int s^*\!\left(t - \frac{\tau}{2}\right) s\!\left(t + \frac{\tau}{2}\right) e^{-j\omega\tau}\, d\tau        (2.16)

If we define

R_t(\tau) = s^*\!\left(t - \frac{\tau}{2}\right) s\!\left(t + \frac{\tau}{2}\right)        (2.17)

as the local autocorrelation function at time t, then we can rewrite the WVD as

P(t, \omega)_{WVD} = \int R_t(\tau)\, e^{-j\omega\tau}\, d\tau        (2.18)

The TFR is labeled quadratic because the signal enters the computation twice. Figure 2.4 shows the WVD of the linear chirp signal described in the previous section.
Figure 2.4: WVD of linear chirp signal
Figure 2.4 shows the superior time and frequency concentration of the WVD as compared to the standard STFT.
2.3.1 General Properties

We now describe some important properties that the WVD exhibits [10]:

1. The WVD is always real. The WVD is a real distribution even if the signal s(t) is complex:

P^*(t, \omega)_{WVD} = \int s\!\left(t - \frac{\tau}{2}\right) s^*\!\left(t + \frac{\tau}{2}\right) e^{j\omega\tau}\, d\tau = \int s^*\!\left(t - \frac{\tau}{2}\right) s\!\left(t + \frac{\tau}{2}\right) e^{-j\omega\tau}\, d\tau = P(t, \omega)_{WVD}        (2.19)

2. The WVD satisfies the marginals. We can obtain the spectral or temporal energy density as follows:

\int P(t, \omega)_{WVD}\, d\omega = |s(t)|^2        (2.20)

\int P(t, \omega)_{WVD}\, dt = |S(\omega)|^2        (2.21)

3. Time-shift invariance. If we shift the signal s(t) by an amount t_o, the WVD shifts in time by the same amount:

if\ s(t) \to s(t - t_o)\ then\ P(t, \omega)_{WVD} \to P(t - t_o, \omega)_{WVD}        (2.22)

4. Frequency-shift invariance. Similarly, if we shift the spectrum of s(t) by an amount ω_o, the WVD shifts in frequency by the same amount:

if\ s(t) \to e^{j\omega_o t}\, s(t)\ then\ P(t, \omega)_{WVD} \to P(t, \omega - \omega_o)_{WVD}        (2.23)
2.3.2 Noise and Cross Terms

The high time-frequency resolution of the WVD comes at the cost of noise enhancement and the introduction of cross-term interference. Consider a signal s(t) composed of a signal component x(t) and an additive noise component n(t):

s(t) = x(t) + n(t)        (2.24)

The WVD of s(t) is then computed as

WVD_s = \int s^*\!\left(t - \frac{\tau}{2}\right) s\!\left(t + \frac{\tau}{2}\right) e^{-j\omega\tau}\, d\tau
      = \int \left[ x^*\!\left(t - \frac{\tau}{2}\right) x\!\left(t + \frac{\tau}{2}\right) + x^*\!\left(t - \frac{\tau}{2}\right) n\!\left(t + \frac{\tau}{2}\right) + n^*\!\left(t - \frac{\tau}{2}\right) x\!\left(t + \frac{\tau}{2}\right) + n^*\!\left(t - \frac{\tau}{2}\right) n\!\left(t + \frac{\tau}{2}\right) \right] e^{-j\omega\tau}\, d\tau
      = WVD_x + WVD_{xn} + WVD_{nx} + WVD_n        (2.25)

The cross Wigner distributions WVD_{xn} and WVD_{nx} are complex; however, we notice from (2.25) that WVD_{nx} = WVD^*_{xn}, so the sum of the two terms is a real quantity. Equation (2.25) can therefore be simplified to

WVD_s = WVD_x + WVD_n + 2\,\mathrm{Re}\{WVD_{nx}\}        (2.26)
As we see, the quadratic nature of the WVD causes noise terms to appear at times and frequencies that were not present in the original signal. The same phenomenon occurs for signals that are a combination of other signal components. To illustrate this, consider a signal s(t) composed of two sinusoidal tones:

s(t) = s_1(t) + s_2(t) = e^{-j\omega_1 t} + e^{-j\omega_2 t}

Using a similar argument as in equation (2.25), the WVD of s(t) can be shown to be

WVD_s = WVD_1 + WVD_2 + 2\,\mathrm{Re}\{WVD_{12}\}
      = \delta(\omega - \omega_1) + \delta(\omega - \omega_2) + 2\,\mathrm{Re}\left\{ \int e^{j\omega_1(t - \tau/2)}\, e^{-j\omega_2(t + \tau/2)}\, e^{-j\omega\tau}\, d\tau \right\}
      = \delta(\omega - \omega_1) + \delta(\omega - \omega_2) + 2\cos((\omega_2 - \omega_1)t)\, \delta(\omega - (\omega_1 + \omega_2)/2)        (2.27)
As we know from Fourier analysis, the transform of two sinusoidal components results in impulse functions at those two frequencies, which is what the WVD reveals. But the WVD also contains an additional term, as indicated in equation (2.27). Such terms are called cross terms and are a major problem when trying to accurately model the TFR of a signal; the desired terms in the TFR are labeled auto terms.
2.3.3 Kernel Functions

The previous section illustrated the interference problem that plagues the WVD. One way to bring out the true TFR of the signal is to apply a 2-D filter to the WVD, much as in image processing. The 2-D filter, which in the TFR domain is termed a kernel function, operates on the WVD as a 2-D convolution. The resulting TFR can be expressed as

P(t, \omega) = \iiint s^*\!\left(u - \frac{\tau}{2}\right) s\!\left(u + \frac{\tau}{2}\right) \phi(\theta, \tau)\, e^{-j\theta t - j\tau\omega + j\theta u}\, du\, d\tau\, d\theta        (2.28)
where φ(θ, τ) is the kernel function and the variables τ and θ are known as the time lag and Doppler frequency respectively, while u is a dummy integration variable indexing time. The kernel function is typically designed with the objective of retaining the auto terms while smoothing undesired noise and cross terms.
2.3.4 The Ambiguity Function

We first define the ambiguity function as

A(\theta, \tau) = \int s^*\!\left(t - \frac{\tau}{2}\right) s\!\left(t + \frac{\tau}{2}\right) e^{j\theta t}\, dt = \int R_t(\tau)\, e^{j\theta t}\, dt        (2.29)

This is similar in structure to the WVD, except that we now take the Fourier transform over the time variable instead of the time-lag variable. The representation is a function of Doppler frequency and time lag, and can in fact be thought of as a 2-D inverse Fourier transform of the WVD. The application of the kernel is now a multiplicative operation on the ambiguity function:

P(t, \omega) = \iint A(\theta, \tau)\, \phi(\theta, \tau)\, e^{j\theta t - j\tau\omega}\, d\theta\, d\tau        (2.30)
The representation also gives a global view of the signal structure: the τ axis represents global frequency information while the θ axis represents global time information, and all other points reveal the non-stationary character of the signal. It was pointed out in [11] that the ambiguity function is a very useful representation to work with, especially in the case of multi-component signals. It has been shown in [11] that in the ambiguity domain auto components tend to lie near the origin, while cross components tend to appear farther from the origin. This property makes the design of kernel functions much simpler.
2.3.5 Cohen Class TFRs
With the application of a kernel to the ambiguity function of a signal, the TFR is obtained by performing a 2-D Fourier transform on the resulting ambiguity function. The TFR will be a smoothed version of the WVD, with the smoothing governed by the structure of the kernel; in this way an infinite number of TFRs may be computed. The WVD is in fact the special case in which the kernel function φ(θ, τ) = 1, so that no filtering occurs. Table 2.1 illustrates some common kernel structures.

Name               Kernel φ(θ, τ)
General class      φ(θ, τ)
Wigner-Ville       1
Margenau-Hill      cos(θτ/2)
Kirkwood           e^{jθτ/2}
Born-Jordan        sin(θτ/2) / (θτ/2)
Page               e^{jθ|τ|}
Choi-Williams      e^{-θ²τ²/σ}
Spectrogram        ∫ h*(u - τ/2) h(u + τ/2) e^{-jθu} du
Zhao-Atlas-Marks   g(τ) |τ| sin(aθτ) / (aθτ)

Table 2.1: List of some well-known kernel functions
It is interesting to note that the spectrogram, defined as the magnitude squared of the STFT, is related to the quadratic TFR class through the spectrogram kernel listed in table 2.1.
2.3.6 Discrete Formulation of Quadratic TFR

All the theory presented thus far has been in continuous time, assuming the signal s(t) is of infinite duration. For practical computational purposes, however, we need discrete TFRs. For the WVD, replacing the integral in equation (2.16) with a summation, sampling at t = nT, and making the change of variable l = τ/2, we can write the discrete-time WVD for a signal of L samples as

P(n, \omega)_{WVD} = \sum_{l=-L}^{L} s^*(n - l)\, s(n + l)\, e^{-j2\omega l}        (2.31)
From (2.31) we can easily see that the discrete WVD is π-periodic instead of 2π-periodic. This implies that the minimal sampling rate needed for an alias-free representation is four times the bandwidth of the signal, or twice the Nyquist rate [23]. We can also represent the TFR in matrix form, starting with the local autocorrelation matrix R, whose nth row and lth column are defined as

R_{nl} = s^*(n - l)\, s(n + l)        (2.32)

The time variations run along the rows of the matrix and the time lag across the columns. The WVD matrix P then has its (n, ω)th element defined as

P_{n\omega} = FT_{l \to \omega}\{R\} = \sum_{l=0}^{L-1} R_{nl}\, e^{-jl\omega/L}        (2.33)

where FT{·} is the Fourier transform operator; this is another way of expressing the computation in (2.31). The ambiguity function matrix A is computed in a similar fashion, with elements

A_{\eta l} = IFT_{n \to \eta}\{R\} = \sum_{n=0}^{L-1} R_{nl}\, e^{jn\eta/L}        (2.34)

where IFT{·} is the inverse Fourier transform operator. Finally, we can compute any TFR from the ambiguity function matrix by applying a kernel matrix Φ as an element-by-element multiplication, denoted by the operator "∘", and then applying a double Fourier transform:

P_{n\omega} = FT_{\eta \to n,\, l \to \omega}\{A \circ \Phi\} = \sum_{\eta=0}^{L-1} \sum_{l=0}^{L-1} [A_{\eta l}\, \Phi_{\eta l}]\, e^{-j(l\omega - n\eta)/N}        (2.35)
Equation (2.35) is the discrete equivalent of equation (2.30), showing the direct relation between the ambiguity function and the time-frequency plane. Let us define the quantity A ∘ Φ as

A^\Phi = A \circ \Phi = \text{the ambiguity function after applying the kernel}        (2.36)
To avoid confusion, for the remainder of this thesis references to the ambiguity function mean the definition in (2.36).
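The matrix computations (2.32)-(2.36) can be sketched as follows. This is a minimal NumPy illustration of the one-sided matrix formulation above and not the thesis's MATLAB implementation; lag-symmetry and FFT-normalization conventions vary between implementations, so treat it as a sketch rather than a reference.

```python
import numpy as np

def local_acf(s):
    """Local autocorrelation matrix R[n, l] = s*(n - l) s(n + l), per (2.32).
    Index pairs falling outside the finite record are treated as zero."""
    L = len(s)
    R = np.zeros((L, L), dtype=complex)
    for n in range(L):
        for l in range(L):
            if n - l >= 0 and n + l < L:
                R[n, l] = np.conj(s[n - l]) * s[n + l]
    return R

def wvd(s):
    """Discrete WVD matrix: FFT of R along the lag axis, as in (2.33)."""
    return np.fft.fft(local_acf(s), axis=1)

def ambiguity(s):
    """Ambiguity function matrix: inverse FFT of R along the time axis, as in (2.34)."""
    return np.fft.ifft(local_acf(s), axis=0)

def cohen_tfr(s, Phi):
    """Cohen-class TFR (2.35): mask the ambiguity matrix with a kernel, per (2.36),
    then transform back to the time-frequency plane."""
    A_phi = ambiguity(s) * Phi          # element-wise product, eq. (2.36)
    R_phi = np.fft.fft(A_phi, axis=0)   # undo the IFFT over time (eta -> n)
    return np.fft.fft(R_phi, axis=1)    # lag -> frequency (l -> omega)
```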
Chapter 3 The Use of TFR in Classification of Signals
In this chapter we discuss the methodology of using TFRs for the classification of signals. We begin by defining a discrete-time signal s(n) of finite length L that is to be classified into one of a finite set of classes h_i, where i = 1, 2, ..., N. We also assume the availability of a training set for each class, with M signals per class, all of length L. Previous research [12] stated the decision rule for classifying an unknown signal s(n) as

\hat{i} = \arg\min_{i=1,2,...,N} d(P^\Phi_{s(n)}, \bar{P}^\Phi_i)        (3.1)

where d(·,·) is a distance measure, P^\Phi_{s(n)} is the TFR matrix of the unknown signal computed via the kernel Φ, and \bar{P}^\Phi_i is the average TFR matrix estimated from the training set. The average TFR matrix is a representative of class h_i, and its elements are defined as

\bar{P}^\Phi_i(n, \omega) = \frac{1}{M} \sum_{k=1}^{M} P^\Phi_{s(n)_k}(n, \omega)        (3.2)
The performance of the classifier is rated primarily on the probability of making a decision error. The decision rule stated in (3.1) is written explicitly in terms of the TFR matrices of the unknown signal and the mean class representative. We propose two changes to the rule stated in (3.1): the first is to classify the signal in the ambiguity function domain as defined in (2.36) rather than explicitly in the time-frequency domain; the second is to apply a dimensionality reduction algorithm to the ambiguity function matrices to reduce the
computational complexity of making a decision. Both of these changes will be discussed in more detail below.
3.1 Use of Ambiguity Function for Classification
In this section we justify our use of the ambiguity function matrix. Although (2.35) shows that the time-frequency domain and the ambiguity function domain are theoretically equivalent, there is an advantage to using the ambiguity function. The rule stated in (3.1) assumes that all the signals in the training set, as well as the unknown signal, have their coordinate systems and origins perfectly aligned [13]. For a system operating in real time this may not be the case, since we will be buffering the signal into a memory buffer of finite length, and the part of the signal of interest may occur at any time within this buffer. To illustrate, suppose the class of signals we wish to detect consists of Gaussian pulses of a certain pulse width. Figure 3.1 shows three signals from the same class that differ only by their location in the observation window.
Figure 3.1: Gaussian pulses occurring within a 256 ms buffer.
In figure 3.1 we see that the signals are identical in structure except that they are separated by random time shifts. The WVD of these three pulses will exhibit the same time shifts, as a result of the time-shift invariance property listed in chapter 2. The resulting WVD is shown in figure 3.2; notice the three pulses corresponding to their time plots in figure 3.1.
Figure 3.2: WVD of the Gaussian pulses shown in fig. 3.1

When computing a distance measure between these signals, the measure will be unnecessarily large due to the time shift, contributing to an increased probability of error. Of course, if we knew the amount of shift n_o the signals experienced from the origin, we could simply shift the signals back; in practice the amount of shift is unknown. To overcome this problem we propose working with the ambiguity function instead of the TFR explicitly. We claim that the energy distribution in the ambiguity domain is the same regardless of any time shift. This can be seen as follows:

if\ s(n) \to s(n + n_o)\ then\ P(n, \omega) \to P(n + n_o, \omega)        (3.3)

by the time-shift property; the magnitude of the ambiguity function of the shifted signal is therefore

|A'(\eta, k)| = |IFT_{n \to \eta}\{P(n + n_o, \omega)\}| = |e^{-j\eta n_o} A(\eta, k)| = |A(\eta, k)|        (3.4)
Thus the time shift n_o corresponds to a phase shift in the ambiguity domain, but the magnitude response is identical to that of the original ambiguity function of the signal s(n). This is confirmed by figure 3.3, which illustrates the ambiguity functions of the three Gaussian pulses shown in fig. 3.1.
Figure 3.3: The magnitude response of the 3 Gaussian pulses is identical, regardless of time shift.

Working with the ambiguity function directly also saves computation, since we do not have to transform back to the time-frequency plane after applying the kernel, as in equation (2.35) for the discrete-time case or (2.30) for the continuous-time case.
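The shift-invariance claim in (3.4) is easy to verify numerically. The sketch below is a compact NumPy version of the local-ACF computation from section 2.3.6, applied to two identical Gaussian pulses at different positions in a 256-sample buffer; the pulse width and positions are arbitrary illustrative choices, not values from the thesis.

```python
import numpy as np

def amb_mag(s):
    """Ambiguity-function magnitude via the local ACF (2.32) and an inverse FFT
    over the time index, as in (2.34)."""
    L = len(s)
    n, l = np.meshgrid(np.arange(L), np.arange(L), indexing="ij")
    valid = (n - l >= 0) & (n + l < L)   # zero outside the finite record
    R = np.where(valid, np.conj(s[(n - l) % L]) * s[(n + l) % L], 0.0)
    return np.abs(np.fft.ifft(R, axis=0))

t = np.arange(256)
pulse = lambda t0: np.exp(-0.5 * ((t - t0) / 8.0) ** 2)  # Gaussian pulse at t0
# Magnitudes agree although the pulses sit at different positions in the buffer.
print(np.allclose(amb_mag(pulse(80)), amb_mag(pulse(150)), atol=1e-8))  # True
```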
3.2 Dimensionality Reduction for TFR Classification
The second modification to (3.1) deals with reducing the number of features needed to make a decision. For a signal of length L, the TFR matrix or ambiguity matrix maps the signal into a
representation that is L x L in size. The number of features is then L², which represents a significant increase in computational complexity, and often the majority of these features contribute little or nothing to the classification of a class of signals. It was pointed out in [5] that M/m, the ratio of the number of signals in the training set to the number of features, plays a role in how well we can estimate the probability of error of the classifier; the estimate improves as this ratio becomes higher. Since a large training set is often undesirable, to keep the classification error small we need to reduce the number of features, i.e., discard those that contribute little or nothing to the classification objective. There are many methods for reducing the number of features, one of which is principal component analysis (PCA), as suggested by [4]. This operation involves performing an eigendecomposition on the covariance matrix of a feature vector x. In the context of an ambiguity function matrix, we can construct an L² x 1 feature vector by scanning the ambiguity function matrix column by column [4]. However, the covariance matrix of such a vector is quite large (L² x L²), and the computation becomes quite complex. Another method, suggested in [7], is based on a singular value decomposition (SVD) of the ambiguity function matrix: only the dominant singular values are retained, and they serve as the reduced-size feature vector. We applied this method for our parametric-kernel classifier described in the next chapter. For the non-parametric kernel classifier we do not need dimensionality reduction, as will be described in chapter 5.
3.3 Final Formulation of TFR Classifier
After applying the changes proposed in sections 3.1 and 3.2, the decision rule for an unknown signal s(n) can be rewritten as

\hat{i} = \arg\min_{i=1,2,...,N} d(x^\Phi, \bar{x}^\Phi_i)        (3.5)

where the feature vector x^\Phi is written so as to emphasize that it is still parameterized by a kernel, and \bar{x}^\Phi_i is the representative average feature vector for class h_i. The feature vector is generated by computing the local autocorrelation matrix R of the signal by (2.32); the ambiguity function matrix A^\Phi is then computed by (2.34) and (2.36). The feature vector is obtained by first computing the SVD of the matrix A^\Phi,

A^\Phi = U \Lambda V^T        (3.6)

The m-dimensional feature vector x then has as its elements the first m entries of the diagonal of the L x L matrix Λ:

x_i = \mathrm{diag}\{\Lambda\}_{i=1,2,...,m}, \qquad m \le L        (3.7)
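A minimal sketch of the feature extraction in (3.6)-(3.7), assuming the kernel-masked ambiguity matrix has already been formed:

```python
import numpy as np

def svd_features(A_phi, m):
    """Reduced feature vector per (3.6)-(3.7): the m largest singular values
    of the kernel-masked ambiguity matrix A_phi."""
    # np.linalg.svd returns singular values sorted in descending order, so the
    # first m entries are exactly the dominant values retained in (3.7).
    return np.linalg.svd(A_phi, compute_uv=False)[:m]
```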
The decision rule in (3.5) has two design parameters that we can control and that directly affect classifier performance. The first is the type of kernel: infinitely many kernels can be applied. We mentioned a few in table 2.1; however, those kernels were designed to represent a signal's energy in the time-frequency plane accurately. We are not concerned here with the appearance or accuracy of the TFR; we would like a kernel that is optimal for the purpose of classification, and in this thesis we present two approaches to this task, described in the next two chapters. The second parameter that affects classifier performance is the choice of distance measure.
3.4 Distance Measures
In this thesis we analyze the performance of three common distance measures: the correlation distance, the quadratic discriminant function, and the well-known Euclidean distance. We first define the correlation distance as

d_{corr}(x, x_i) \triangleq 1 - \frac{2\, x^T x_i}{\|x\|^2 + \|x_i\|^2}        (3.8)
This measure gives a distance on the interval [0, 1]. If we assume our feature vector is m-dimensional and normally distributed for each class h_i, we can derive a distance measure called the quadratic discriminant function from the probability density function (pdf) of each class. The pdf of a multivariate normally distributed feature vector x from class h_i is given as

p(x \mid h_i) = \frac{1}{(2\pi)^{m/2} |\Sigma_i|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right)        (3.9)
where µ_i and Σ_i are the mean vector and covariance matrix corresponding to class h_i. If we assume equiprobable classes, it can be shown that the optimal classifier for classes so distributed can be expressed as

d_{quad}(x, x_i) = (x - x_i)^T \Sigma_i^{-1} (x - x_i) + \ln|\Sigma_i|        (3.10)

This is known as the quadratic discriminant function; it involves estimating the mean vector and covariance matrix from the training set, which can be accomplished using maximum likelihood estimation. Although the Gaussian assumption is reasonable in a variety of applications, it may not always hold; even so, the quadratic discriminant often performs well when the data are not normally distributed. If we further assume that the
classes are not only equiprobable but also share the same covariance matrix, we may ignore the last term in equation (3.10), which yields the Mahalanobis distance

d_{Mahalanobis}(x, x_i) = (x - x_i)^T \Sigma^{-1} (x - x_i)        (3.11)

Furthermore, if the covariance matrix is of the form Σ = σ²I, where I is the identity matrix, then the quadratic distance measure reduces to the square of the well-known Euclidean distance

d_{Euclid}(x, x_i) = \|x - x_i\|        (3.12)
We then search for the distance measure that gives the best classifier performance.
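For reference, the three measures can be sketched as follows; the quadratic discriminant assumes that the class mean vector and covariance matrix have already been estimated from the training set.

```python
import numpy as np

def d_corr(x, xi):
    """Correlation distance (3.8)."""
    return 1.0 - 2.0 * np.dot(x, xi) / (np.dot(x, x) + np.dot(xi, xi))

def d_quad(x, mu_i, Sigma_i):
    """Quadratic discriminant function (3.10) for class i."""
    diff = x - mu_i
    _, logdet = np.linalg.slogdet(Sigma_i)  # numerically safer than log(det(.))
    return diff @ np.linalg.solve(Sigma_i, diff) + logdet

def d_euclid(x, xi):
    """Euclidean distance (3.12)."""
    return np.linalg.norm(x - xi)
```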
3.5 Probability of Error Estimation
In this section we describe a model, introduced in [12], that estimates the probability of the classifier making an error. We present this model because we will interact with it directly when searching for the optimal kernel. The model depends on two parameters: the type of kernel used to derive the feature vector, and the distance measure. We therefore denote the probability-of-error estimate as \hat{P}_e(\Phi, d) to reflect the dependence on these two parameters. Let us assume a two-class problem in which the classes occur with equal probability, i.e. P(i=1) = P(i=2) = 1/2. The total probability of error can then be expressed as

\hat{P}_e(\Phi, d) = P(\hat{i}=1 \mid i=2)\, P(i=2) + P(\hat{i}=2 \mid i=1)\, P(i=1)        (3.13)

which simplifies to

\hat{P}_e(\Phi, d) = \frac{1}{2}\left[ P(\hat{i}=1 \mid i=2) + P(\hat{i}=2 \mid i=1) \right]        (3.14)

Furthermore, let us define two random variables d_ij and e_ij as follows:

d_{ij} = d(x^\Phi, \bar{x}^\Phi_j)\ \text{given that}\ x \in h_i        (3.15)

and

e_{ij} = d_{ij} - d_{ii} = d(x^\Phi, \bar{x}^\Phi_j) - d(x^\Phi, \bar{x}^\Phi_i), \qquad i \ne j        (3.16)

We can now rewrite equation (3.14) in terms of d_ij and e_ij:

\hat{P}_e(\Phi, d) = \frac{1}{2}\left[ P(d_{12} < d_{11}) + P(d_{21} < d_{22}) \right] = \frac{1}{2}\left[ P(e_{12} < 0) + P(e_{21} < 0) \right]        (3.17)

Therefore we can estimate the probability of error if we know the distributions of e_12 and e_21. We can argue that the random variable d_ij is the sum of many random components, as seen in the distance measure equations (3.8)-(3.12); these involve summations over a random input feature vector x, so by a central-limit-theorem argument it is reasonable to model d_ij, and therefore e_ij, as Gaussian [12]. With this assumption it follows that

P(e_{ij} < 0) = \int_{-\infty}^{0} \frac{1}{(2\pi \sigma_{ij}^2)^{1/2}} \exp\!\left(-\frac{1}{2}\left[\frac{u - m_{ij}}{\sigma_{ij}}\right]^2\right) du = Q\!\left(\frac{m_{ij}}{\sigma_{ij}}\right)        (3.18)

where m_{ij} = E[e_{ij}] and \sigma_{ij}^2 = \mathrm{VAR}[e_{ij}]. We can estimate these parameters from the training set using the sample mean and sample variance, with features x^\Phi_k derived from the kth signal of classes i and j respectively:

\hat{m}_{ij} = \frac{1}{M} \sum_{k=1}^{M} \left[ d(x^\Phi_k, \bar{x}^\Phi_j) - d(x^\Phi_k, \bar{x}^\Phi_i) \right]        (3.20)

\hat{\sigma}_{ij}^2 = \frac{1}{M} \sum_{k=1}^{M} \left[ d(x^\Phi_k, \bar{x}^\Phi_j) - d(x^\Phi_k, \bar{x}^\Phi_i) - \hat{m}_{ij} \right]^2        (3.21)

The final expression for the error probability model is therefore

\hat{P}_e(\Phi, d) = \frac{1}{2}\left[ Q\!\left(\frac{\hat{m}_{12}}{\hat{\sigma}_{12}}\right) + Q\!\left(\frac{\hat{m}_{21}}{\hat{\sigma}_{21}}\right) \right]        (3.22)
Our reason for deriving this model is to obtain a better estimate of the probability of error than simple error counting within the training set provides. Our objective now is to minimize this function by choosing the kernel (with its associated parameters) and the distance measure.
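A sketch of evaluating the model (3.22) from training-set samples of e12 and e21; the sample mean and standard deviation play the roles of (3.20) and (3.21) (NumPy's np.std uses the 1/M normalization by default, matching (3.21)).

```python
import numpy as np
from scipy.stats import norm

def p_error_model(e12, e21):
    """Modeled error probability (3.22) from arrays of training-set samples of
    the error statistics defined in (3.16)."""
    q = lambda e: norm.sf(np.mean(e) / np.std(e))  # Q(x) = 1 - Gaussian CDF(x)
    return 0.5 * (q(np.asarray(e12)) + q(np.asarray(e21)))
```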
Chapter 4 Parametric Kernel Design for Classification
In chapter 3 we formulated the classification problem in terms of the ambiguity function matrix and noted that the decision rule (3.5) depends on the kernel and the distance measure. The next two chapters are concerned with how to choose a kernel that gives the best possible performance, as governed by the performance measure in (3.22).
4.1 Kernel Analysis
As stated in the previous chapter, the feature vector x has an explicit dependence on the kernel function. While the traditional goal of kernel design, as explained above, is to eliminate cross terms completely and preserve auto terms, this may not be appropriate when a kernel is designed for classification: cross terms may serve as distinguishing features between the signal classes and may need to be preserved. A truly optimal kernel would pick the points in the ambiguity domain (cross terms or auto terms) that yield the best classifier performance and eliminate the rest that do not contribute or that degrade performance; however, computing such a kernel is quite a difficult problem. The philosophy of the parametric approach is to define a set of kernel structures controlled by parameters, optimize each kernel with respect to its parameters for the best possible classifier performance, and then pick the kernel from the set that gives the overall best performance. While this approach is clearly sub-optimal, the resulting kernel will be optimal within the set. For this thesis we focus on two kernels that are exponential in form, the Choi-Williams kernel and the radially Gaussian kernel (RGK). These kernels were chosen because of their effective structure in filtering cross terms [14], [15], and their relatively small number of parameters, which makes searching for the optimal kernel simpler.
4.1.1 Choi-Williams Kernel

We begin by defining the structure of the Choi-Williams kernel:

\Phi(\eta, k) = e^{-\eta^2 k^2 / \sigma}        (4.1)
Here the only parameter of the Choi-Williams kernel is σ , which controls the spread of the kernel about the origin. To illustrate, figure 4.1 shows the kernel with σ =6.2 and σ =0.6.
Figure 4.1: Choi-Williams kernel with σ = 6.2 and σ = 0.6 respectively.

The smaller σ is, the more smoothing is applied to the signal TFR; the optimization goal is to search for the value of σ that yields the lowest probability of error.
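Sampling (4.1) on a discrete grid might look as follows. Centering the (η, k) indices about the origin is one possible convention and is an assumption of this sketch; it must match how the ambiguity matrix is indexed in a given implementation.

```python
import numpy as np

def choi_williams(L, sigma):
    """Choi-Williams kernel (4.1) sampled on an L x L grid of centered
    Doppler (eta) and lag (k) integer indices."""
    idx = np.arange(L) - L // 2
    eta, k = np.meshgrid(idx, idx, indexing="ij")
    return np.exp(-(eta.astype(float) ** 2) * (k ** 2) / sigma)
```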
4.1.2 Radially Gaussian Kernel (RGK)

The RGK is given as

\Phi(\eta, k) = e^{-\frac{\rho^2}{2\sigma^2(\psi)}}        (4.2)

where \rho = \sqrt{\eta^2 + k^2} and \psi = \tan^{-1}(\eta / k) are the polar coordinates of (\eta, k).
Figure 4.2: Polar Coordinates of Ambiguity Domain

The contour function σ(ψ) should be π-periodic in order to correspond to a real-valued TFR [12]. It has therefore been proposed in [12] that σ(ψ) be represented as a truncated Fourier series, where p_max is the maximum number of terms in the series:

\sigma(\psi) = a_0 + \sum_{p=1}^{p_{max}} \left[ a_p \cos(2p\psi) + b_p \sin(2p\psi) \right]        (4.3)

Here we see that the parameters affecting the shape of the kernel can be collected into the vectors

a = [a_0\ a_1\ \dots\ a_{p_{max}}]^T        (4.4)

and

b = [b_0\ b_1\ \dots\ b_{p_{max}}]^T        (4.5)
As one can see, there is far more freedom in the kernel shapes achievable with the RGK structure, so the search for the optimal kernel is more computationally complex. One example of an RGK is shown below in figure 4.3.
Figure 4.3: One example of an RGK.
4.2 Parametric Optimization Procedure

As explained above, the parametric approach is to find the kernel and distance measure that minimize equation (3.22). The basic outline of the procedure is as follows.
1) Choose a distance measure (3.8)-(3.12).
2) Choose a kernel structure (4.1)-(4.2).
3) Determine the kernel parameters that yield the minimal error for the given kernel structure and distance measure.
The procedure is repeated until all the distance measures and kernel structures in the set have been run through; we then simply choose the kernel and distance measure that gave the best performance. The kernel optimization in step 3 can be accomplished by a numerical algorithm; for this thesis we used the Nelder-Mead direct search algorithm [16], for which MATLAB has an implementation in the function fminsearch (a code sketch of this search follows figure 4.4). The algorithm requires repeated evaluation of equation (3.22), which we compute by the following procedure:
1) Compute the representative feature vectors \bar{x}^\Phi_1 and \bar{x}^\Phi_2 by computing the ambiguity function matrix of each signal in the training sets of classes h1 and h2, performing the SVD on each ambiguity function matrix, and averaging the chosen singular values for each class.
2) Compute the feature vector of each signal in classes h1 and h2 by computing its ambiguity function matrix, performing the SVD, and retaining the chosen singular values.
3) Compute the distances d11, d12, d21, d22 using equation (3.15) for the chosen distance measure.
4) Compute the error statistics e12 and e21 using equation (3.16).
5) Compute \hat{m}_{12}, \hat{m}_{21}, \hat{\sigma}_{12}, and \hat{\sigma}_{21} by equations (3.20) and (3.21).
6) Compute \hat{P}_e(\Phi, d) using equation (3.22).
As one can see, the limitation of this approach is that we can consider only a few kernel structures with small numbers of parameters; otherwise the search procedure outlined above becomes too computationally expensive. Following the optimization procedure, the system is set to live mode, in which it classifies unknown signals. The block diagram of the system is shown in figure 4.4 below.
Figure 4.4: Live mode system block diagram.
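Step 3 of the procedure can be sketched with SciPy's Nelder-Mead solver, the counterpart of MATLAB's fminsearch used in the thesis. The objective below is a smooth placeholder standing in for steps 1-6 above so that the snippet runs on its own; in the real system it would evaluate the modeled error (3.22) on the training data.

```python
import numpy as np
from scipy.optimize import minimize

def p_error_hat(params):
    """Placeholder objective: maps kernel parameters to a modeled error.
    A toy bowl with its minimum at sigma = 2 stands in for steps 1-6."""
    log_sigma = params[0]                         # optimize log(sigma) so sigma > 0
    return (log_sigma - np.log(2.0)) ** 2 + 0.05

res = minimize(p_error_hat, x0=[0.0], method="Nelder-Mead")
print("sigma* =", np.exp(res.x[0]), "modeled P_e =", res.fun)
```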
Chapter 5 Non-Parametric Kernel Design for Classification
The approach undertaken in chapter 4 is a sub-optimal technique for the design of TFRs for classification. The problem is that by restricting the kernel to predefined structures we are making assumptions about which points (matrix entries indexed by η, k) in the ambiguity function domain to filter. As pointed out before, cross terms may provide additional classifier performance, yet the kernels considered all have structures designed to eliminate cross terms. The topic of this chapter is a method to find the points of the ambiguity function matrix that give the best classifier performance and discard all other points that do not contribute or that degrade performance. We first present a previously used approach to finding this non-parametric kernel, and then present an approach that was shown to give superior performance.
5.1 Previous Formulation of Non-Parametric Approach
As before, we assume the availability of a training set consisting of M examples of each class. The previous approach [17] was based on designing a kernel that maximizes the mean squared distance between the classes in terms of the ambiguity function. For a two-class problem the kernel is designed to optimize the criterion

\Phi_{opt} = \arg\max_{\Phi} \|\Phi \circ \bar{A}_1 - \Phi \circ \bar{A}_2\|_F^2        (5.1)

and \Phi_{opt} is constrained by

\|\Phi_{opt}\|_F^2 = 1        (5.2)
Here \|\cdot\|_F is the Frobenius norm and \bar{A}_i is the average ambiguity function matrix for class h_i, computed by averaging the ambiguity function matrices of the signals s(n)_k, k = 1, ..., M, in the training set:

\bar{A}_i = \frac{1}{M} \sum_{k=1}^{M} A_{s(n)_k}        (5.3)
It was shown in [17] and [18] that the solution to this maximization problem resulted in a value of “1” placed at a single point in the η, k plane where the average ambiguity matrices were most separated, and a value of “0” elsewhere.
5.2 An Improved Approach to Non-Parametric Optimization
The previous approach to non-parametric kernel design made no assumptions about the structure of the kernel and was designed with the sole purpose of maximizing the distance between the class means. For real-world applications, however, the signal classes often have a wide range of within-class variance that must be taken into account [18]. This approach is explained in detail in [18] and can be summarized as follows. Let K be the number of non-zero points in the optimized kernel. By the constraint presented in (5.2), the values of the kernel are then restricted to

\Phi_{opt}(\eta, k) \in \{0,\ K^{-1/2}\}        (5.3)
Equation (5.3) states that the values of the kernel can only be 0 or K^{-1/2}. Recall that the application of a kernel, stated in (2.36), is A^\Phi = A \circ \Phi; the resulting ambiguity function matrix will therefore contain only K non-zero points. We show a small example below for a 4 x 4 ambiguity function matrix and a kernel matrix containing only K = 2 non-zero values. Let

A = \begin{bmatrix} a_{1,1} & a_{1,2} & a_{1,3} & a_{1,4} \\ a_{2,1} & a_{2,2} & a_{2,3} & a_{2,4} \\ a_{3,1} & a_{3,2} & a_{3,3} & a_{3,4} \\ a_{4,1} & a_{4,2} & a_{4,3} & a_{4,4} \end{bmatrix}

and

\Phi = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & \frac{1}{\sqrt{2}} & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & \frac{1}{\sqrt{2}} & 0 \end{bmatrix}

Then the filtered ambiguity function matrix is

A^\Phi = A \circ \Phi = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & \frac{a_{2,2}}{\sqrt{2}} & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & \frac{a_{4,3}}{\sqrt{2}} & 0 \end{bmatrix}        (5.4)
In essence this can be seen as a feature selection procedure, since the entries of the ambiguity matrix that correspond to zero-valued entries of the kernel are effectively zeroed out, as we saw in the example presented in (5.5); only the entries that correspond to the non-zero entries of the kernel pass through. We can then discard all the zeroed elements, retain the K non-zero points, and construct a K-dimensional feature vector x directly, without performing any of the dimension reduction algorithms described in chapter 3. This is a computational advantage over the method described in chapter 4.
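To make this masking step concrete, the following is a small illustrative sketch in NumPy (our actual experiments were run in MATLAB; the function name and the toy matrices here are hypothetical):

```python
import numpy as np

def apply_kernel_and_extract(A, mask, K):
    """Hadamard product of an ambiguity matrix A with a binary kernel
    mask, then extraction of the K retained points as a feature vector,
    with no separate dimension-reduction step."""
    Phi = mask / np.sqrt(K)                # non-zero kernel values are K**-0.5
    A_filtered = A * Phi                   # element-wise product, as in (2.36)
    return A_filtered[mask.astype(bool)]   # keep only the K non-zero entries

# Toy 4x4 example mirroring (5.5): K = 2 points retained.
A = np.arange(16, dtype=float).reshape(4, 4)
mask = np.zeros((4, 4))
mask[1, 1] = mask[3, 2] = 1.0
x = apply_kernel_and_extract(A, mask, K=2)  # 2-dimensional feature vector
```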
The statistical criterion for choosing the entries was proposed in [18] to be based on Fisher's discriminant ratio (FDR), which can be interpreted as a kind of signal-to-interference ratio (SIR). Essentially we wish to choose the K points in the ambiguity matrix that maximize the FDR matrix F, whose elements are given by

F(\eta, k) = \frac{ | \bar{A}(\eta,k)_1 - \bar{A}(\eta,k)_2 |^2 }{ \left( A^{\sigma}(\eta,k)_1 + A^{\sigma}(\eta,k)_2 \right)^2 }        (5.6)

where A^{\sigma}(\eta,k)_i is the standard deviation (spread) of the ambiguity function for class h_i, estimated from the training set by

A^{\sigma}(\eta,k)_i = \frac{1}{M} \sum_{m=1}^{M} | A(\eta,k)_{s(n)_m} |^2 - | \bar{A}(\eta,k)_i |^2        (5.7)
The underlying assumption when employing the FDR to choose points is that each ambiguity matrix entry is normally distributed. If this assumption is satisfied, the FDR is indeed maximized when the separation between the means is large and the within-class variance is small [5]. If each entry is normally distributed with equal variance across classes and K = 1, this approach reduces to the previous approach described in section 5.1. In practice we compute the matrix F and rank order its entries from maximum to minimum; the first K points are chosen as the kernel locations in the η, k plane with non-zero energy, and the remaining points are set to zero.
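As an illustration of this step, a sketch of the FDR computation under the reconstruction of (5.6)-(5.7) above might look as follows in NumPy; the array shapes and the small denominator guard are our own assumptions:

```python
import numpy as np

def fdr_matrix(A1, A2, eps=1e-12):
    """FDR matrix of (5.6): A1 and A2 are stacks of ambiguity matrices,
    shape (M, L, L), one stack per class."""
    m1, m2 = np.mean(np.abs(A1), axis=0), np.mean(np.abs(A2), axis=0)
    # spread estimates per (5.7): mean squared magnitude minus squared mean
    s1 = np.mean(np.abs(A1) ** 2, axis=0) - m1 ** 2
    s2 = np.mean(np.abs(A2) ** 2, axis=0) - m2 ** 2
    return np.abs(m1 - m2) ** 2 / (s1 + s2 + eps) ** 2  # eps avoids 0/0

# Rank candidate kernel points from best to worst FDR:
# order = np.argsort(fdr_matrix(A1, A2), axis=None)[::-1]
```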
5.3 Determining the Optimal Number of Points
As previously mentioned, one of the advantages of the non-parametric approach over the parametric approach is that the feature selection stage is accomplished by the kernel itself, therefore bypassing the dimension reduction computations. The question then becomes how to determine K. We know that K can range from 1 to L², where the latter corresponds to the default ambiguity matrix with no kernel applied (the ambiguity matrix corresponding to the WVD). It was proposed in [18] to determine K experimentally by counting the number of errors on the training set and choosing the K that gave the fewest errors. We propose instead employing the probability-of-error model developed in section 3.4: we choose the K that results in the smallest probability of error as modeled by equation (3.22).
5.4 Accounting for Correlation
The rank-ordering procedure described above does not take into account the correlations that may exist between the features. It is well known in the field of pattern recognition that there is little information gain from features that are highly correlated with one another, resulting in no improvement, or even degradation, of classifier performance [5]. Therefore we would like to incorporate correlation information when rank ordering the points. First we should look at the structure of the ambiguity function: from the properties of TFRs we know that the ambiguity function is symmetric along both the η and k axes. Therefore K need only take a value between 1 and L²/4, and when computing the FDR we should only consider the quarter plane of the ambiguity function. We then apply an "ad hoc" technique described in [5] that incorporates correlation information when performing the rank ordering. Let a_{mk}, m = 1, 2, ..., M and k = 1, 2, ..., L²/4, denote the kth feature (ambiguity function entry) of the mth signal in the training set. The cross-correlation coefficient between any two ambiguity entries is given by
\rho_{ij} = \frac{ \sum_{m=1}^{M} a_{mi} a_{mj} }{ \sqrt{ \sum_{m=1}^{M} a_{mi}^2 \sum_{m=1}^{M} a_{mj}^2 } }        (5.8)

One can see that | \rho_{ij} | \leq 1. We can then rank order the entries by the following procedure:
1. Compute the F matrix as above and rank order its entries in descending order; we label this ranking f(j).
2. Choose the point location corresponding to the maximal value of the F matrix as the first feature entry; call it x_{i_1}.
3. To select the second feature point, we compute the cross-correlation coefficient between entry x_{i_1} and the remaining features, and choose the entry x_{i_2} according to the criterion

i_2 = \arg\max_{j} \{ \alpha_1 f(j) - \alpha_2 | \rho_{i_1 j} | \}, \quad \text{for all } j \neq i_1        (5.9)

Here \alpha_1 and \alpha_2 are weighting factors for the relative importance of the two terms. The procedure iterates to find x_{i_k}, k = 3, ..., L²/4, so that

i_k = \arg\max_{j} \left\{ \alpha_1 f(j) - \frac{\alpha_2}{k-1} \sum_{r=1}^{k-1} | \rho_{i_r j} | \right\}, \quad \text{for all } j \neq i_r, \; r = 1, 2, ..., k-1        (5.10)
In other words, we weight and subtract the average correlation with all previously selected features (a sketch of this selection loop is given below). After completing this procedure we apply our previous procedure for finding the optimal number of entries K, and these entries make up the feature vector

x = [ x_{i_1}, x_{i_2}, x_{i_3}, \ldots, x_{i_K} ]^T        (5.11)

The entries that correspond to locations where the kernel has a value of zero are discarded.
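A sketch of this correlation-penalized selection loop, in NumPy under the same hypothetical naming as the earlier sketches (F_flat holds the FDR values of the candidate points, feats the corresponding training-set entries):

```python
import numpy as np

def rank_with_correlation(F_flat, feats, alpha1=1.0, alpha2=1.0):
    """Greedy rank ordering per (5.9)-(5.10): FDR score traded against
    average |cross-correlation| with already-selected points.
    feats: (M, P) array, P candidate entries for each of M signals."""
    G = feats.T @ feats                        # sums of products over m
    norms = np.sqrt(np.diag(G))
    rho = np.abs(G / np.outer(norms, norms))   # |rho_ij| of (5.8)
    order = [int(np.argmax(F_flat))]           # best-FDR point first
    while len(order) < feats.shape[1]:
        avg_rho = rho[:, order].mean(axis=1)   # mean |rho| with chosen set
        score = alpha1 * F_flat - alpha2 * avg_rho
        score[order] = -np.inf                 # never re-pick a chosen point
        order.append(int(np.argmax(score)))
    return order                               # i_1, i_2, ..., i_P
```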
5.5 Non-Parametric Optimization Procedure
In this section we summarize the optimization procedure for finding the non-parametric optimal kernel. Our goal is to extract features from the ambiguity function matrix and form the feature vector x, which is then given to the decision rule described in the previous chapter. Again, the choice of distance measure plays a role in how well the classifier performs; we note that if the underlying Gaussian assumption of the FDR approach is satisfied, there is reason to believe that the quadratic discriminant function is the measure of choice. The procedure thus far can be summarized as follows:
1) Choose a distance measure.
2) Compute the FDR for the two classes.
3) Rank order the FDR entries while taking correlation into account.
4) Begin with K = 1 entry, iterate through K = L²/4 entries, and classify the training signals using the selected distance measure.
5) Compute \hat{P}_e(\Phi, d) using equation (3.22) and select the K that yields the minimum of this function.
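Steps 4) and 5) amount to a simple sweep over K; a hypothetical sketch, where modeled_error stands in for the probability-of-error model of equation (3.22):

```python
import numpy as np

def choose_K(order, feats_train, labels, modeled_error, K_max):
    """Sweep K = 1 ... K_max and keep the K whose modeled error is
    smallest; `order` is the correlation-aware ranking from above."""
    best_K, best_err = 1, np.inf
    for K in range(1, K_max + 1):
        x = feats_train[:, order[:K]]     # top-K kernel points as features
        err = modeled_error(x, labels)    # estimated P_e(Phi, d) per (3.22)
        if err < best_err:
            best_K, best_err = K, err
    return best_K, best_err
```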
Following the optimization procedure we switch the system to a live mode where it classifies unknown signals. The block diagram of the system is shown in figure 5.1 below.
Figure 5.1: Non-parametric system block diagram.

We note that when the non-parametric optimized kernel is applied, the resulting TFR may not have a visually satisfying appearance. Since we would ideally like the number of non-zero points K to be as small as possible, to avoid a high-dimensional feature vector, the resulting TFR will be neither an accurate nor a visually pleasing TFR; it is, however, optimized for the purpose of classification.
Chapter 6 Experimental Results
As explained in the introduction, we applied the TFR classification techniques to the problem of releasing an HMDS during an ejection/crash event while avoiding false alarms such as the accelerations encountered during normal ACM. We stated that this problem can be viewed as a classification problem with two classes. In this chapter we describe the data used as the basis of our experiments, apply the two approaches we described, and check the validity of our probability-of-error model against the actual error.
6.1 ACM Data Acquisition
In this section we describe the ACM data used in our experiment. ACM data, as mentioned before, represents the case where the HMDS should not be released. The AFRL provided acceleration data taken from a 20-minute air combat exercise. The pilot's helmet was fitted with tri-axial accelerometers that recorded acceleration measurements along the +Z, +Y, and +X axes, which are aligned as shown in figure 6.1.
Figure 6.1: Helmet Acceleration Axis Definition.
The data was recorded at a sample rate of 1 kHz; the following figure shows the three signals that were recorded. The data was divided into 256-point frames (0.256 s) and separated into a training set and a test set. From our initial analysis of the VDT signals (described in section 6.2) we found that only the +Z axis acceleration is critical for discriminating between EC signals and ACM signals. This is a reasonable assumption because for most ejection seat systems the major component of acceleration during the initial ejection sequence is in the +Z direction. Although different ejection seat systems are not completely vertically aligned with the +Z direction, and their tilt may give rise to some +X and +Y components, these are not as dominant as the +Z component. For this thesis we therefore focus entirely on the +Z accelerations.
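The framing itself is straightforward; a minimal sketch (z_accel being a placeholder for the recorded +Z samples):

```python
import numpy as np

def frame_signal(z_accel, frame_len=256):
    """Split a 1 kHz acceleration stream into non-overlapping
    256-point frames (0.256 s each), one frame per row."""
    n_frames = len(z_accel) // frame_len
    return z_accel[:n_frames * frame_len].reshape(n_frames, frame_len)
```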
Figure 6.2: ACM data for the +Z direction.

Below is an example ACM data frame taken at one of the peaks shown in figure 6.2.
Figure 6.3: Example ACM signal from the +Z direction.

Figure 6.3 shows a time frame in which the acceleration in the +Z direction is near 8 G, which for the current mechanical system would set off the release system for the HMDS.
Figure 6.4: WVD of the ACM signal.

The WVD shows that within the time frame the frequency content is around DC, which indicates that over a time frame of 0.256 s, normal ACM acceleration magnitude does not change very rapidly.
Figure 6.5: Ambiguity Function of ACM signal
6.2 Ejection/Crash Data Acquisition
Actual ejection/crash data is very difficult to acquire, so for this study we focused on a test that the AFRL performs, named the vertical deceleration tower (VDT) test. The apparatus consists of a sled attached to a vertical tower approximately 15 m in height. The subject is strapped to a seat on the sled, and the sled is accelerated toward the ground at a known acceleration. Upon impact at the bottom of the VDT, a strong instantaneous upward acceleration occurs, which simulates the sudden upward acceleration experienced in an ejection [19]. A photograph of the test apparatus is shown in the figure below.
Figure 6.6: VDT test apparatus.

The purpose of this test is to produce a pulse, mainly in the +Z direction, that is characteristic of the pulses seen during pilot ejection using an ACES II ejection seat system. Numerous tests were conducted using manikins of varying weights and sizes, as well as human subjects, with the subjects wearing helmets of different weight and center-of-gravity configurations. The data was also acquired at a sampling rate of 1 kHz, and each test was fit into a 256-point window similar to the ACM data. Shown below is an example of a VDT signal taken from one of the experiments.
Figure 6.7: VDT pulse in +Z direction
The WVD and ambiguity function of the pulse in figure 6.7 are shown below.

Figure 6.8: WVD of the VDT signal in the +Z direction.

Here we can see that, in contrast to the ACM signal, the VDT pulse is narrower, reflecting a higher bandwidth at the time the pulse occurs.
Figure 6.9: Ambiguity function of the VDT signal.

The pulse appears similar to a Gaussian pulse. From our own analysis we saw that the pulse width and maximum acceleration magnitude are affected by a number of parameters, including the rate at which the sled is accelerated toward the bottom and the size/weight of the subject.
6.3 Simulation Setup
We tested our approaches by running simulations in MATLAB. The two approaches introduced in this thesis both make use of a training set. For our experiments we gathered a training set consisting of 100 signals from VDT tests and 100 frames of normal ACM taken from a typical pilot flight in an F15 aircraft. The signals in the training set are all sampled at 1 kHz and are 256 points in length. When the optimization/training procedures are completed, the resulting kernel is tested on a larger test data set; for our experiment we used 1345 signals from each class. In the next sections we present the results of our approaches. We define class 1 as the EC class of signals, where the HMDS should be released, and class 2 as the ACM class of signals, where the HMDS should not be released. Thus, as defined in chapter 3, the two statistics e12 and e21 are defined as follows:
e12: the distance measure error statistic given class 1 (EC class).
e21: the distance measure error statistic given class 2 (ACM class).
6.4 Parametric Approach Results
As mentioned in section 3.2, in order to keep the feature size small we perform an SVD on the ambiguity function matrix of the signal and retain only a number of the singular values as our feature vector. To determine the number of singular values to use, we computed the average ambiguity function matrix of each class as given by equation (5.3), and then performed an SVD on the ACM class matrix and the VDT class matrix. Figures 6.10 and 6.11 present the results.
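The feature-extraction step itself reduces to a truncated set of singular values; a minimal NumPy sketch of what we compute (the function name is ours):

```python
import numpy as np

def singular_value_features(A, n_sv=10):
    """Feature vector = the n_sv largest singular values of an
    ambiguity matrix A (np.linalg.svd returns them in descending order)."""
    s = np.linalg.svd(A, compute_uv=False)
    return s[:n_sv]
```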
Figure 6.10: Mean VDT class singular values
Figure 6.11: Mean F15 class singular values
The figures shown above indicate that the singular values become very small beyond the first few; the first 10 singular values contain a significant portion of the energy of the ambiguity function matrix. We first present the results when no kernel is applied, corresponding to the WVD. Table 6.1 shows the detection results using the distance measures presented in chapter 3.

Distance Measure         Pred. Det.   Pred. FA   Pred. Error   Act. Det.   Act. FA   Act. Error   Features
Correlation Distance     82.3%        0.4%       18.1%         78.0%       3.5%      25.5%        10
Euclidean Distance       84.1%        0.1%       16.0%         78.2%       2.2%      18.2%        10
Quadratic Discriminant   95.6%        0.04%      4.4%          94.1%       1.3%      7.2%         10

Table 6.1: WVD classification results (no kernel applied). "Pred." columns are the detection, false alarm, and error probabilities predicted by (3.22); "Act." columns are the rates measured on the test set; "Features" is the feature size.

Clearly from these results the quadratic discriminant distance measure is the optimal choice. We now explore the effect of choosing a different feature size, i.e. a different number of singular values. Figure 6.12 shows the effect of the number of retained singular values on detection performance: beyond 10 singular values the performance does not improve further, justifying the choice of 10 as the feature size.
Figure 6.12: Detection rate versus the number of singular values.

We now present the results of applying our optimization procedure to the Choi-Williams kernel. We first determined the optimal kernel spread, represented by σ in equation (3.8); as figure 6.13 illustrates, the optimal spread value is σ = 0.1.
Figure 6.13: Plot of error probability versus σ.

We see a clear local minimum at σ = 0.1, which is therefore the optimal value for the Choi-Williams kernel. Table 6.2 shows the classification performance of the optimized Choi-Williams kernel; compared with Table 6.1, it is clear that applying the Choi-Williams kernel significantly improves the classifier performance.

Distance Measure         Pred. Det.   Pred. FA   Pred. Error   Act. Det.   Act. FA   Act. Error   Features
Correlation Distance     95.6%        2.7%       7.13%         96.1%       1.8%      5.7%         10
Euclidean Distance       98.8%        0.4%       1.6%          97.0%       1.8%      4.8%         10
Quadratic Discriminant   99.5%        0%         0.5%          98.2%       1.6%      3.4%         10

Table 6.2: Optimized Choi-Williams kernel results (columns as in Table 6.1).
To get some measure of the validity of the estimated error model, we show normal probability plots of the statistics e12 and e21, along with their probability density functions (PDFs), for each of the selected distance measures. A normal probability plot is a graphical method of testing how well data fit a normal distribution: if the data follow a normal distribution, they align along a straight line. The plots were generated using the MATLAB function normplot().
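For readers reproducing this check outside MATLAB, scipy.stats.probplot produces an equivalent plot; a sketch assuming e12 and e21 are available as 1-D arrays:

```python
from scipy import stats
import matplotlib.pyplot as plt

def normal_plot(e12, e21):
    """Normal probability plots of the two error statistics, analogous
    to MATLAB's normplot()."""
    _, axes = plt.subplots(1, 2, figsize=(8, 3))
    for ax, e, name in zip(axes, (e12, e21), ("e12", "e21")):
        stats.probplot(e, dist="norm", plot=ax)  # data points vs. fitted line
        ax.set_title(name)
    plt.show()
```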
Figure 6.13: Normal plot of e12 and e21 using the correlation distance measure and Choi-Williams kernel.

As seen in the plot above, the statistic e21 falls close to the line with some outliers, indicating a distribution close to normal. The normal probability plot of e12, however, is shaped like an "S", indicating that its distribution may in fact be bimodal. The following figures of the PDFs give some insight into the distributions.
Figure 6.14: Choi-Williams Kernel Probability Density of e12 .
Figure 6.15: Choi-Williams Kernel Probability Density of e21 .
We now examine the results when applying the Euclidean distance measure.
Figure 6.16: Normal Plot of e12 and e21 using Euclidean distance measure.
Figure 6.17: Choi-Williams Kernel Probability Density of e12 using Euclidean distance.
Figure 6.18: Choi-Williams Kernel Probability Density of e21 using Euclidean distance.

Finally, we present the results for the quadratic discriminant function.
Figure 6.19: Normal Plot of e12 and e21 using quadratic discriminant function distance measure.
Figure 6.20: Choi-Williams Kernel Probability Density of e12 using quadratic discriminant.
Figure 6.21: Choi-Williams Kernel Probability Density of e21 using quadratic discriminant.
It is seen from these plots that, because the probability densities of the Choi-Williams kernel with the quadratic discriminant distance measure are the most Gaussian-like, satisfying the assumption this distance measure requires, it yields the least error and the best performance.

We now present results for the RGK. In this case the optimization is done with respect to the vectors a and b rather than a scalar as in the Choi-Williams case. For this experiment we chose a and b to be 5-dimensional in order to keep the number of parameters low; otherwise the search for the minimum becomes too computationally exhaustive. The resulting 10 parameters were optimized with the Nelder-Mead algorithm, implemented in MATLAB by the function fminsearch(). Table 6.3 presents the results for the optimized RGK.

Distance Measure         Pred. Det.   Pred. FA   Pred. Error   Act. Det.   Act. FA   Act. Error   Features
Correlation Distance     95.9%        0.0%       4.1%          79.2%       0.2%      21.0%        10
Euclidean Distance       96.4%        0.0%       3.6%          77.5%       0.14%     22.64%       10
Quadratic Discriminant   99%          1.0%       2.0%          98.5%       1.2%      2.7%         10

Table 6.3: Results for the optimized RGK (columns as in Table 6.1).

Again we analyze the validity of the error model for this kernel; the normal plots for e12 and e21 are shown below.
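As an implementation note, the 10-parameter search above can be reproduced with SciPy's Nelder-Mead routine (the counterpart of fminsearch()); predicted_error is a hypothetical stand-in for the error model of (3.22) evaluated on a kernel built from the candidate parameters:

```python
import numpy as np
from scipy.optimize import minimize

def optimize_rgk(predicted_error, a0, b0):
    """Nelder-Mead search over the two 5-dimensional RGK parameter
    vectors a and b (10 parameters total)."""
    x0 = np.concatenate([a0, b0])
    res = minimize(lambda p: predicted_error(p[:5], p[5:]),
                   x0, method="Nelder-Mead")
    return res.x[:5], res.x[5:], res.fun   # optimal a, b, and modeled error
```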
Figure 6.22: Normal Plot of e12 and e21 using correlation distance measure and RGK.
Figure 6.23: RGK Probability Density of e12 using correlation distance.
Figure 6.24: RGK Probability Density of e21 using correlation distance.
Figure 6.25: Normal Plot of e12 and e21 using Euclidean distance measure for RGK.
Figure 6.26: RGK Probability Density of e12 using Euclidean distance.
Figure 6.27: RGK Probability Density of e21 using Euclidean distance.
Figure 6.28: Normal Plot of e12 and e21 using quadratic discriminant function.
Figure 6.28: RGK Probability Density of e12 using quadratic discriminant.
Figure 6.29: RGK Probability Density of e21 using quadratic discriminant.

As the plots above show, the statistic e12 is not always normally distributed as we hypothesized; for the correlation and Euclidean distances it exhibits two distinct peaks, which explains the large discrepancy between the predicted and actual detection rates. The statistic e21 fits the normal distribution better, although it is sometimes skewed. Of the three distances, the quadratic discriminant measure always produces error probability densities closest to Gaussian, thus yielding the best results.
6.5 Non-Parametric Approach Results
As explained in section 5.5, we first need to determine the number of features to extract from the ambiguity function matrix. Figure 6.30, shown below, illustrates step 3 of our procedure, rank ordering the FDR values: it shows the calculated ranked FDR between the two classes.
Figure 6.30: Ranked FDR values between the two classes.

Following the procedure we outlined, we now show the results of selecting the optimal number of features. The plot below shows the results using the quadratic discriminant distance measure.
Figure 6.31: Selecting the number of features using quadratic discriminant.

We see that the lowest probability of error corresponds to selecting the features with the 13 top FDR values. An interesting note is that beyond 30 features the estimated probability of error increases sharply. Upon investigating this observation we found that the estimated covariance matrix used in the quadratic discriminant function, given in equation (3.10), becomes singular once more than 30 features are chosen; because equation (3.10) requires the inverse of this matrix, the modeled probability of error jumps suddenly. Why should adding a feature make the covariance matrix singular? The criterion for choosing points given in equation (5.6) does not take the physical significance of a feature into account, only its measure of separation between the two classes. It is possible that there are only about 29 physically significant and distinct features, so features added beyond that point become highly correlated with those already chosen, making the feature covariance matrix of (3.10) singular.
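As an aside (not part of the original experiment), the onset of this singularity can be monitored by checking the conditioning of the estimated covariance before inverting it:

```python
import numpy as np

def covariance_is_invertible(feats, tol=1e10):
    """feats: (M, K) block of training features for one class.  A huge
    condition number signals a (numerically) singular covariance, as
    happens here once highly correlated features are added."""
    C = np.cov(feats, rowvar=False)
    return np.linalg.cond(C) < tol
```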
Using the correlation and Euclidean distance measures, the following two plots show that the optimal number of features must be determined separately for each distance measure, since the probability of error is directly impacted by how many features are chosen. At the same time we would like to keep the number of features as small as possible, so in these plots we look for the smallest number of features that gives an acceptable probability of error.
Figure 6.32: Selecting the number of features using Euclidean distance.
Figure 6.33: Selecting the number of features using correlation distance.

We now present the classifier performance results using the non-parametric optimization approach.

Distance Measure         Pred. Det.   Pred. FA   Pred. Error   Act. Det.   Act. FA   Act. Error   Features
Correlation Distance     91.8%        0.0%       8.2%          79.7%       0.0%      20.3%        2
Euclidean Distance       93.9%        0.0%       6.9%          72.3%       0.0%      27.7%        15
Quadratic Discriminant   95%          0.0%       5%            100%        0.0%      0.0%         13

Table 6.4: Results for the optimized non-parametric kernel (columns as in Table 6.1).

Looking at figure 6.33 and table 6.4, with the correlation distance we observe a local minimum at 2 features, which is why we chose 2 features for the test set. However, figure 6.33 also shows that choosing, for example, 24 features gives a similar estimated performance, with only a slight expected reduction. We tested this hypothesis, and the results are presented in table 6.5.

                   Predicted   Actual
Detection Rate     89.17%      77.5%
False Alarm Rate   0.0%        0.44%
Total Error        10.83%      22.94%

Table 6.5: Results using 24 features and the correlation distance.

Although the predicted performance does not match the actual performance, the prediction that 2 features would perform slightly better than 24 was correct. This observation supports the hypothesis that 2 features is the optimal number of points for the correlation distance with this particular non-parametric kernel. As before, we show the normal plots and the PDFs of the error model for each distance measure.
Figure 6.34: Normal Plot of e12 and e21 using quadratic discriminant distance measure.
Figure 6.35: Non-parametric kernel Probability Density of e12 using quadratic discriminant.
Figure 6.36: Non-parametric kernel Probability Density of e21 using quadratic discriminant.
Figure 6.37: Normal Plot of e12 and e21 using Euclidean distance for non-parametric kernel.
Figure 6.38: Non-parametric kernel Probability Density of e12 using Euclidean distance.
Figure 6.39: Non-parametric kernel Probability Density of e21 using Euclidean distance.
Figure 6.40: Normal Plot of e12 and e21 using correlation distance for non-parametric kernel.
Figure 6.41: Non-parametric kernel Probability Density of e12 using correlation distance.
Figure 6.42: Non-parametric kernel Probability Density of e21 using correlation distance.
As we see, the quadratic discriminant function is again clearly the best distance measure. As with the previous kernels, the PDFs of e12 and e21 are not always normally distributed; the optimization procedure nevertheless yielded good results, as shown in table 6.4.
Chapter 7 Conclusions and Future Work
We introduced the basics of how time-frequency theory can be applied to the classification/detection of non-stationary random signals, and identified two major approaches to optimizing a TFR for maximal classifier performance. The parametric approach is based on the idea that we can find the parameters of a finite set of kernel structures that provide the best possible performance given a distance measure. The non-parametric approach is based on intelligently choosing the entries of the ambiguity function matrix that give the best classifier performance; the statistic used in this thesis was the FDR, which provides a measure of the SIR. If the number of chosen features is small, no dimensionality reduction algorithm is required, which saves computational cost.
7.1 Conclusions
For the parametric approach we see an improvement in performance compared with applying no kernel. However, the probability-of-error model does not always satisfy its Gaussian distribution assumption. Although there is an increase in performance, there is reason to believe that the optimized kernels are still sub-optimal and can be improved further. If we knew the distributions of e12 and e21, equation (3.25) could easily be modified to reflect them, and we would expect the predicted results to match the actual classification results more closely. The parametric kernels, even though designed with the objective of classification, are still restricted by our assumptions about the smoothing structure, although the RGK offers more freedom in how the kernel is shaped. The search for the point of minimal estimated error is difficult because of the number of parameters involved; the error surface is far from quadratic, so distinguishing a local minimum from the global minimum is a further restriction. The optimization of the Choi-Williams kernel is relatively simple, because its parameter is a scalar; however, we are then more restricted in determining the shape of the kernel.
Another problem is that the data is unbalanced in the sense that the number of features is significantly greater than the number of available examples. It is known that the variance of the error probability estimate is inversely proportional to the ratio of the number of examples to the feature size [5]; with a small number of data sets available, the directly measured error probability will have a large variance.
We clearly observe that regardless of the kernel, the best performing distance measure is the quadratic discriminant. This is in line with our hypothesis: even though the underlying assumption of the quadratic discriminant function is that the feature vector is jointly Gaussian, the classifier still lends itself to providing a good partition of the feature space. Accounting for the feature covariance matrix takes into account the variances and cross-covariances of the features, whereas the other distance measures take only the means into account.
The non-parametric optimization approach, when combined with the quadratic discriminant distance measure, gave the best overall results. We also note that the non-parametric kernel did not require dimensionality reduction, which makes it the more attractive approach for implementing this system on a real-time microprocessor. This is important because the detection of a crash and the release of the HMDS should be accomplished in as little time as possible, to avoid the possibility of pilot injury.
7.2 Future Work
There is good potential for future research in a number of areas.
1) Research into other kernel structures that can give better control over how much smoothing is applied to the ambiguity function matrix.
2) Research into a more accurate error model. The more accurately an error model represents the true error, the closer we can come to achieving an optimal kernel using either the parametric or non-parametric method.
3) Exploration of other non-parametric optimization techniques and their comparison with the FDR methodology.
4) Applying these techniques to other sets of experimental data that can represent ejection or crash. In this thesis we focused on VDT tests and ACM data taken from a normal pilot session; the AFRL performs other tests to study ejection acceleration responses.
5) Real-time implementation and performance analysis: extending the algorithm onto an FPGA or DSP chip interfaced to an actual HMDS exposed to actual ejection/crash environments. Detection performance and speed can then be analyzed to determine whether this is a feasible real-time algorithm.
6) Extension of these algorithms to other engineering problems where non-stationary signals are encountered, such as audio, radar, sonar, or biomedical signal detection.
Bibliography
[1] T.K. Bhattacharya, S. Haykin, "Neural Network-Based Radar Detection for an Ocean Environment," IEEE Transactions on Aerospace and Electronic Systems, vol. 33, April 1997.
[2] B. Boashash, P. O'Shea, "A Methodology for Detection and Classification of Some Underwater Acoustic Signals Using Time-Frequency Analysis Techniques," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 38, November 1990.
[3] T.P. Wang, M. Sun, C. Li, A.H. Vagnucci, "Classification of Abnormal Cortisol Patterns by Features from Wigner Spectra," Proceedings of the 10th International Conference on Pattern Recognition, June 1990.
[4] S. Haykin, T.K. Bhattacharya, "Modular Learning Strategy for Signal Detection in a Non-Stationary Environment," IEEE Transactions on Signal Processing, vol. 45, June 1997.
[5] S. Theodoridis, K. Koutroumbas, Pattern Recognition, Elsevier, 2006.
[6] J.G. Proakis, D.G. Manolakis, Digital Signal Processing: Principles, Algorithms, and Applications, Prentice-Hall, 1996.
[7] L. Cohen, Time-Frequency Analysis, Prentice-Hall, 1995.
[8] D.L. Jones, T.W. Parks, "A Resolution Comparison of Several Time-Frequency Representations," IEEE Transactions on Signal Processing, vol. 40, February 1992.
[9] P.M. Oliveira, V. Barroso, "Uncertainty in the Time-Frequency Plane," Proceedings of the Tenth IEEE Workshop on Statistical Signal and Array Processing, August 2000.
[10] L. Cohen, "Time-Frequency Distributions: A Review," Proceedings of the IEEE, vol. 77, no. 7, July 1989.
[11] P. Flandrin, "Some Features of Time-Frequency Representations of Multicomponent Signals," IEEE International Conference on Acoustics, Speech, and Signal Processing, March 1984.
[12] M. Davy, C. Doncarli, G.F. Boudreaux-Bartels, "Improved Optimization of Time-Frequency-Based Signal Classifiers," IEEE Signal Processing Letters, vol. 8, February 2001.
[13] D.B. Malkoff, L. Cohen, "A Neural Network Approach to the Detection Problem Using Joint Time-Frequency Distributions," International Conference on Acoustics, Speech, and Signal Processing, April 1990.
[14] H.I. Choi, W.J. Williams, "Improved Time-Frequency Representation of Multicomponent Signals Using Exponential Kernels," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, June 1989.
[15] R.G. Baraniuk, D.L. Jones, "A Signal-Dependent Time-Frequency Representation: Optimal Kernel Design," IEEE Transactions on Signal Processing, vol. 44, November 1996.
[16] J.M. Ortega, W.C. Rheinboldt, Iterative Solution of Nonlinear Equations in Several Variables, New York: Academic Press, 1970.
[17] J. McLaughlin, "Applications of Operator Theory to Time-Frequency Analysis and Classification," Ph.D. dissertation, University of Washington, Seattle, 1997.
[18] B.W. Gillespie, L.E. Atlas, "Optimizing Time-Frequency Kernels for Classification," IEEE Transactions on Signal Processing, vol. 49, March 2001.
[19] C.E. Perry, J.R. Buhrman, "Effect of Helmet Inertial Properties on the Biodynamics of the Head and Neck During +Gz Impact Accelerations," SAFE Journal, vol. 26, July 1996.
[20] S.S. Abeysekera, B. Boashash, "Methods of Signal Classification Using Images Produced by the Wigner-Ville Distribution," Pattern Recognition Letters, vol. 12, 1991, pp. 717-729.
[21] B. Boashash, P. O'Shea, "A Methodology for Detection and Classification of Some Underwater Acoustic Signals Using Time-Frequency Analysis Techniques," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 38, November 1990.
[22] D.L. Jones, T.W. Parks, "A Resolution Comparison of Several Time-Frequency Representations," IEEE Transactions on Signal Processing, vol. 40, February 1992.
[23] B. Boashash, P.J. Black, "An Efficient Real-Time Implementation of the Wigner-Ville Distribution," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 35, November 1987.
[24] A. Leon-Garcia, Probability and Random Processes for Electrical Engineering, Addison-Wesley, 1994.