Particle Filtering Methods for Acoustic Source Localisation and Tracking
Eric André Lehmann
Dipl.El.-Ing. (ETHZ), M.Phil.Eng. (ANU)
July 2004
A thesis submitted for the degree of Doctor of Philosophy of The Australian National University
Department of Telecommunications Engineering Research School of Information Sciences and Engineering The Australian National University
Declaration
The contents of this thesis are the result of original research and have not been submitted for a similar or higher degree to any other university or institution. Adequate statements in the text indicate which parts of this thesis are based upon research carried out jointly with others. The majority, approximately 90%, of this work is my own.
Eric André Lehmann
July 2004
Acknowledgements
This thesis, and the three years of work that led to it, were made a lot easier thanks to several people who provided a great deal of help in various ways. Once again, I would like to address many thanks to my supervisor, Professor Robert Williamson, whose continuous support, guidance and encouragement throughout the years were crucial to the successful achievement of this thesis. Also, this work would simply not exist without his involvement in providing financial support for it.
I am also grateful to Dr. Darren Ward for many interesting and productive discussions on several topics elaborated in this research. His valuable collaboration on parts of this work has been an excellent experience and a real pleasure for me. His assistance and financial support during two visits to Imperial College in London, UK, were also greatly appreciated.
A very special thank-you is addressed to my partner Cressida, who provided help and assistance in many regards. Her loving and understanding presence by my side, through all the good times and some bad times, has been invaluable over the years. I am deeply grateful for everything she has done for me.
There are many people to whom I am particularly grateful for the moral support provided during the course of this work. In this regard, I would like to thank my parents and the rest of my family in Switzerland. Their unconditional understanding and constant support were really important to me. To all my friends back home, thank you for your supportive friendship and for keeping in touch despite the distances. All my Australian friends, and especially the Wilson family in Yass, also deserve many thanks for always being there for me.
Finally, I would like to acknowledge the help of Kris Modrak who did a great job at hacking the Linux operating system for the purpose of various practical implementations related to this research.
My gratitude also goes to the Research School of Information Sciences and Engineering for funding my overseas travels.
Abstract
The task of acoustic source tracking plays an important role in many practical speech acquisition systems. This research presents an extensive study of sequential Monte Carlo methods applied to the source localisation problem, based on the signals received at an array of microphones. A general framework for acoustic source localisation using particle filtering is proposed, and four different algorithms that fit within this framework are subsequently developed. To assess the performance of these new methods, statistical simulations are carried out using both synthetic and real-life samples of audio data. The simulation results demonstrate the superiority of an approach based on sequential estimation. The resulting particle filters are shown to drastically outperform traditional acoustic source localisation methods.
Further developments attempt to improve the basic particle filtering technique. Three different methods using the concept of sequential importance sampling are proposed, and their respective performance is also tested experimentally. The practical results demonstrate the strengths of the new approach. Using importance sampling, the valuable property of reinitialisation is integrated at a low algorithm level. Despite yielding a slightly lower tracking accuracy, these methods are able to automatically recover from complete track losses, detect new targets entering the acoustic scene, and switch between alternating talkers. It is found that particle filters based on the importance sampling principle are better suited for practical applications than the filters developed previously.
This work also presents a theoretical performance analysis of acoustic source tracking methods. Theoretical limits on the estimation error are derived based on the posterior Cramér-Rao bound. To this purpose, two mathematical observation models are developed that describe how source localisation measurements are obtained in practice. These models are derived from statistical room acoustics principles. The influence of the correlation existing between sound intensity values measured in a diffuse sound field is investigated in detail. A comparison of two generic particle filters with respect to the derived lower error bound is presented. Whereas the performance of the tested algorithms is clearly influenced by the level of sound intensity correlation, simulation results show that the posterior Cramér-Rao bound is not affected by it. These results hence point out that this type of estimation error bound may not be fully appropriate for a practical consideration of the acoustic source localisation problem.
Contents
List of Figures
List of Algorithms
1 Introduction
  1.1 Motivation
  1.2 Research Focus
  1.3 Thesis Structure
  1.4 Publications
  1.5 Notation
2 Basics of Acoustic Source Tracking and Particle Filtering
  2.1 Acoustic Source Tracking
    2.1.1 Generic Problem Definition
    2.1.2 Signal Model
  2.2 Traditional Approaches to ASL
    2.2.1 Time Delay Estimation
    2.2.2 Steered Beamforming
    2.2.3 Discussion of Traditional ASL Methods
  2.3 Basics of Particle Filtering
    2.3.1 Definition of the Bayesian Filtering Problem
    2.3.2 Bayesian Solution
    2.3.3 Particle Filtering Concepts
    2.3.4 Discussion
  2.4 Target Dynamics Model
  2.5 Experimental Setup
    2.5.1 Practical Setup
    2.5.2 Image Method Setup
3 Particle Filter Algorithms for Acoustic Source Tracking
  3.1 Introduction
  3.2 General PF Framework for ASL
  3.3 Source Dynamics
  3.4 Localisation Function
    3.4.1 TDE Localisation Function
    3.4.2 Direct Localisation Function
  3.5 Likelihood Function
    3.5.1 Gaussian Likelihood
    3.5.2 Pseudo-Likelihood
    3.5.3 Discussion
  3.6 Proposed Algorithms
    3.6.1 GCC Localisation with Gaussian Likelihood
    3.6.2 GCC Localisation with Pseudo-Likelihood
    3.6.3 SBF Localisation with Gaussian Likelihood
    3.6.4 SBF Localisation with Pseudo-Likelihood
    3.6.5 Discussion of the Proposed Algorithms
  3.7 Analysis of the Tracking Accuracy
    3.7.1 Root Mean Square Error
    3.7.2 Mean Standard Deviation
    3.7.3 Frame Convergence Ratio
  3.8 Image Method Simulations
    3.8.1 Simulation Setup
    3.8.2 Simulation Results
    3.8.3 Discussion
  3.9 Real Audio Experiments
    3.9.1 Experimental Hardware Setup
    3.9.2 Experimental Software Setup
    3.9.3 Experimental Results
    3.9.4 Discussion
  3.10 Conclusions
4 Particle Filter Design using Sequential Importance Sampling
  4.1 Introduction
  4.2 SIS Theory Review
  4.3 SIS Particle Filter Design for ASL
    4.3.1 Choice of Likelihood Function
    4.3.2 Choice of Importance Function
    4.3.3 Importance Function using Steered Beamforming
    4.3.4 TDE-based Importance Function
  4.4 Revised SIS Algorithm for ASL
    4.4.1 Velocity Component of the Importance Particles
    4.4.2 Importance Weights Computation
    4.4.3 Practical Updates
    4.4.4 Final SIS Algorithm
  4.5 Practical Experiments
    4.5.1 Experimental Setup
    4.5.2 Plots of Statistical Results
    4.5.3 Results
  4.6 Results Analysis
    4.6.1 Discussion of Image Method Results
    4.6.2 Discussion of Real Audio Results
    4.6.3 Discussion of Computational Load Assessment
  4.7 Conclusions
5 Lower Bound on Estimation Error
  5.1 Introduction and Motivation
  5.2 Review of PCRB Theory
    5.2.1 PCRB Recursion
    5.2.2 Generic PCRB Computation Procedure
  5.3 System Definition
    5.3.1 Basic Parameters
    5.3.2 System Equation
    5.3.3 System-related PCRB Computations
  5.4 Simple Observation Model
    5.4.1 Model Derivation
    5.4.2 Computation of $P_{H_1}$
    5.4.3 Observation-related PCRB Computations
    5.4.4 Special Case: Noiseless System
    5.4.5 PCRB Results
    5.4.6 Result Analysis
  5.5 Correlated Observations Model
    5.5.1 Motivation
    5.5.2 Theoretical Developments
    5.5.3 Generating Correlated $I_r^{(i)}(k)$ Values
    5.5.4 Simulation Results
    5.5.5 Result Analysis
  5.6 Comparison with PF Performance
    5.6.1 PF Algorithms
    5.6.2 Simulation Results
    5.6.3 Result Analysis
  5.7 Concluding Remarks
6 Conclusion
  6.1 Overview
  6.2 Directions for Future Research
    6.2.1 Enhancement of Basic Principles
    6.2.2 Better Handling of Speech Signals
    6.2.3 Multiple Target Tracking
A Relationship Between SBF and CCF Approaches to ASL
  A.1 Mathematical Developments
  A.2 Practical Example
  A.3 Discussion
  A.4 Implications on ASL Methods
B Theoretical Derivation of $P_{H_1}$
C Real-Time PF Implementation for Acoustic Source Tracking
  C.1 PF Algorithm
  C.2 Practical Setup
  C.3 Practical Results
  C.4 Conclusion
D CD Contents
  D.1 Thesis Files
  D.2 Data and Other Documentation Files
Bibliography
List of Figures
2.1 Symbolic formulation of the ASL problem
2.2 Percentage of anomalous GCCF-based TDEs versus reverberation time
2.3 Practical example of SBF output function
2.4 Symbolic representation of the particle filtering principle
2.5 Recording environment setup used for experimental simulations
3.1 Typical microphone signal simulated with the image method
3.2 Example of tracking results for two classical and two PF-based ASL methods
3.3 RMS error versus $T_{60}$ for five ASL methods
3.4 Example of microphone signal recorded in a real office room
3.5 Example of SBF-based pseudo-likelihood function
3.6 Tracking result example for SBF-PL algorithm
3.7 Experimental results for ASL methods using real audio data
4.1 DSB beampattern at various operating frequencies
4.2 Example of importance function derived from SBF measurements
4.3 Example of ambiguous SBF importance function
4.4 Histogram example with corresponding boxplot
4.5 Tracking result example obtained with an SIS method
4.6 SIS tracking results with alternating conversation scenario
4.7 Experimental image method results for RMSE parameter
4.8 Experimental image method results for MSTD parameter
4.9 Experimental image method results for FCR parameter
4.10 Experimental real audio results for RMSE parameter
4.11 Experimental real audio results for MSTD parameter
4.12 Experimental real audio results for FCR parameter
4.13 SIS-SBF tracking performance results for various grid spacing values used in the importance function computations
5.1 Cross-sections in a beamformer response
5.2 Spatial correlation coefficients for sound pressure and intensity
5.3 Detection probability $P_{H_1}$ versus reverberation time $T_{60}$
5.4 Typical PCRB results for simplified observation model
5.5 Steady-state PCRB versus $P_{H_1}$, simple observation model
5.6 Steady-state PCRB versus $\eta$, simple observation model
5.7 Generating correlated SBF observations: detail of some internal procedure results
5.8 Realisations of the correlated observation process
5.9 Steady-state PCRB versus $\dot{x}_s$, correlated observation model
5.10 Performance results for bootstrap PF algorithm, with correlated and non-correlated observation models
5.11 Performance results for SIS-like algorithm, with correlated and non-correlated observation models
A.1 Plots of CCF term involved in the DSB response computations
A.2 Delay-and-sum beamformer response
C.1 Example of graphical output from a real-time PF algorithm for acoustic source tracking
C.2 CPU power required by the real-time PF application as a function of the number of particles
List of Algorithms
2.1 Bootstrap particle filter
3.1 Generic PF algorithm for acoustic source tracking
4.1 Generic SIS algorithm
4.2 Updated SIS algorithm
Chapter 1 Introduction
1.1 Motivation
The effects of acoustic reverberation in closed spaces are taken for granted by most of us. Because this phenomenon occurs constantly in our daily lives, it becomes a natural process and goes largely unnoticed by our brains. Given the surroundings (size of the room, objects in it, carpeting, etc.), we unconsciously “expect” a corresponding level of acoustic reflection without paying particular attention to it. Thus, the reverberation effects only become noticeable when there is a discrepancy between what we expect to hear and what we actually perceive. Having a conversation in a sound-absorbent environment is a distinctly unusual experience because our brains are used to a specific reverberation intensity given the size of the room, but our ears do not register any acoustic reflections.
Similarly, one only becomes aware of the detrimental influence of reverberation in extreme cases. (In normal situations, the reverberation process helps us “understand” the acoustic scene around us, e.g. by providing some useful information about the range of sound sources.) Most of us experience very few difficulties in understanding a speaker located a few metres away in a moderately reverberant setting, say a typical office or seminar room. But block one ear and the speech signal suddenly becomes less intelligible: the speaker consequently becomes much harder to understand. By forcing our brain to operate only in “single-channel processing mode”, the multiple coherent echoes picked up by our ears become more apparent and contribute to the drastic degradation of the acoustic scene we perceive.
We are also all familiar with the challenges posed by highly reverberant environments such as large concert halls, indoor swimming pools, or even smaller enclosures like tiled bathrooms and classrooms with no acoustical treatment. The reflected sounds add up to create a high level of diffuse noise, and the ratio of the direct signal level to the reverberant noise level at the listener’s position can potentially drop to relatively small values. Typically, speech becomes increasingly difficult to understand for reverberation times exceeding about two seconds, and an unaided speaker becomes virtually impossible to understand for reverberation times of four seconds or more.
Apart from these extreme situations, our familiarity with the effects of acoustic reverberation may lead us to think that its influence is relatively benign. On the contrary, for numerous practical applications involving sound pick-up in reverberant enclosures, multiple reflections of the desired signal in the microphone output constitute a real challenge to overcome. Complex and fast-changing acoustic channels in reverberant environments have particularly detrimental consequences in the following application examples.
The advances of modern technologies enable the quasi-instantaneous exchange of increasingly large amounts of audio-visual information across the globe. The counterpart is that the quality of the acquired data is usually also expected to increase accordingly. Consequently, new approaches for enhancing the quality of recorded speech are constantly being formulated in various contexts. These include, for instance, hands-free telephones and teleconferencing systems [72, 117], so-called “smart” conference rooms for distributed meetings [29, 76], and any application involving multimedia information processing or collaborative human–machine interaction [20, 62]. As the limits of mobile telecommunication are being extended each year, speech acquisition systems have to cope with increasingly difficult conditions, such as those found in car environments [75, 84, 124]. In the context of algorithms used for speaker or audio-visual object localisation (e.g. for the purpose of steering a video camera in surveillance applications [50]), a significant research effort is invariably being put into improving the level of robustness against noise and reverberation [25, 31, 88, 107, 109, 116].
A range of applications also indirectly rely on the availability of clean speech recordings. For instance, the performance of automatic speech recognition or speaker identification algorithms [30, 36, 77, 89, 123] is known to be relatively sensitive to the level of noise and reverberation in the audio input. Degraded speech signals also have unfavourable effects for hearing aid systems [47].
Several methods dealing with reverberation have been developed over the last few decades, and have been used with varying levels of success. These include, for instance, methods for sound equalisation (dereverberation) [18, 40, 41, 94], acoustic echo cancellation [9, 19, 73], acoustic source separation [5, 35, 68], microphone array beamforming [2, 46, 63, 86, 119, 123], and various other signal processing methods (see e.g. [14]).
1.2 Research Focus
In the various domains of application mentioned above, the task of acoustic source localisation and tracking usually constitutes a central aspect of counteracting the effects of reverberation. The more or less exact knowledge of the speaker position is the key to acquiring clean speech signals using such tools as beamforming or equalisation principles. Because this specific task is so important as a basis for many speech processing algorithms, the concept of acoustic source tracking will constitute the main focus of the present research work.
Whereas the detection of an acoustic source can be seen as trivial in ideal environments (i.e. anechoic and noiseless), multipath sound propagation between the speaker and the sensors in practical situations is usually extremely difficult to deal with. Methods used traditionally for speaker localisation (based on either cross-correlation or beamforming computations) usually fail even at relatively low reverberation levels. This high sensitivity to the acoustic environment typically comes from the fact that the source location estimates are delivered based on the signals received at the current time only.
In recent years, the concept of particle filtering (sequential Monte Carlo methods) has received enormous attention in the signal processing community [33, 55, 56]. Particle filtering appears to be a promising candidate for solving the practical problems related to acoustic source tracking. With this approach, the source location is estimated on the basis of a series of past observations rather than a single measurement obtained during the current time frame (see Chapter 2). Also, this particular method involves a dynamics model describing how the speaker position is likely to evolve over time, which means that gross detection errors can be efficiently filtered out of the observation sequence.
This research presents an extensive study of sequential estimation methods applied in the specific context of acoustic source localisation and tracking. The different methods developed to approach this problem will make use of the signals received at an array of acoustic sensors distributed across the considered environment. A range of new particle filters will be developed and explained in detail for this type of application. Subsequently, the overall efficiency of these new algorithms will be assessed using several performance criteria, allowing for a comparison of these methods with each other. The analysis will be carried out from a purely theoretical point of view as well as on the basis of practical simulations, using both synthetic and real-life samples of noisy and reverberant audio data.
Despite a tremendous collective effort and many years of research in the field of microphone arrays, this technology has not yet managed to deliver widely available commercial products for personal everyday use. For instance, computer screens are not fitted by default with a set of built-in cheap electret microphones. Neither do popular car brands offer a cost-effective option to have a microphone array built into the dashboard as part of the car’s navigation or hands-free telephony system. Given the importance of the mobile and multimedia information processing markets at the present time, just imagine how much revenue could potentially be made from such implementations! These two examples clearly demonstrate that the technology is not yet advanced enough for practically viable applications. Instead, practical systems based on microphone arrays seem to be limited to either high-end and expensive products, or experimental prototypes built for research purposes only.
This fact is demonstrated by the substantial number of scientific publications describing prototype implementations related to hands-free and human–machine interaction systems, while the technology does not seem to have emerged yet on the low-cost application market. The fact that so much research did not manage to be more successful commercially can be viewed with varying levels of pessimism, see for instance [37] and [112]. This can be seen as the consequence of two major issues:
i) Success rate: due to the high complexity of typical real-life situations, the methods currently used to deal with microphone array applications still show significant weaknesses in practice. Despite recent advances, the success rate of such algorithms generally fails to reach the performance levels which allow for a potentially viable commercial product.
ii) Implementation complexity: because of the large amount of data involved in multi-channel systems, techniques developed for real-time array signal processing require more computational power than is usually available in practice. The ever-increasing performance of modern computers, however, contributes to reducing these limitations.
While the present research work does not claim to solve these issues completely, it will address both of them. The various practical results described in the following chapters demonstrate how the performance of traditional methods can be drastically improved when integrated within the framework of sequential estimation. This yields algorithms for acoustic source localisation with improved robustness against noise and capable of dealing successfully with increased levels of reverberation. Also, the implementation of a real-time particle filter on a standard personal computer using an eight-sensor array will provide a demonstration that the processing speed of modern computing equipment is largely sufficient for such an application.
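The sequential estimation principle outlined in this section (estimating the source location from a series of past observations combined with a dynamics model, rather than from the current frame alone) can be illustrated with a minimal bootstrap particle filter. The sketch below is only a toy illustration under stated assumptions and is not one of the algorithms developed in this thesis: the state is a one-dimensional position, the dynamics are a simple random walk, and the likelihood mixes a Gaussian term with a uniform clutter term to account for gross detection errors. All names and parameter values (`simulate`, `bootstrap_pf`, `q_std`, `r_std`, `outlier_prob`) are hypothetical.

```python
import math
import random


def simulate(n_frames=100, r_std=0.1, outlier_prob=0.3, room=(0.0, 4.0), seed=1):
    """Toy data: a source drifting from 1 m to 3 m, observed with gross errors."""
    rng = random.Random(seed)
    lo, hi = room
    truth = [1.0 + 2.0 * k / (n_frames - 1) for k in range(n_frames)]
    obs = []
    for x in truth:
        if rng.random() < outlier_prob:
            obs.append(rng.uniform(lo, hi))        # spurious location estimate
        else:
            obs.append(x + rng.gauss(0.0, r_std))  # noisy but valid estimate
    return truth, obs


def bootstrap_pf(observations, n_particles=500, q_std=0.05, r_std=0.1,
                 outlier_prob=0.3, room=(0.0, 4.0), seed=0):
    """Bootstrap particle filter for a 1-D position (assumed toy model)."""
    rng = random.Random(seed)
    lo, hi = room
    # No prior knowledge: spread the particles uniformly over the room.
    particles = [rng.uniform(lo, hi) for _ in range(n_particles)]
    gauss_norm = 1.0 / (r_std * math.sqrt(2.0 * math.pi))
    estimates = []
    for y in observations:
        # 1. Prediction: propagate particles through the random-walk dynamics.
        particles = [min(hi, max(lo, x + rng.gauss(0.0, q_std))) for x in particles]
        # 2. Update: Gaussian likelihood around each particle plus uniform clutter.
        weights = [
            (1.0 - outlier_prob) * gauss_norm * math.exp(-0.5 * ((y - x) / r_std) ** 2)
            + outlier_prob / (hi - lo)
            for x in particles
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
        # 3. Estimate: weighted mean of the particle set.
        estimates.append(sum(w * x for w, x in zip(weights, particles)))
        # 4. Resampling (multinomial) to concentrate particles in likely regions.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return estimates


def rmse(a, b):
    """Root mean square error between two equally long sequences."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)) / len(a))


truth, obs = simulate()
est = bootstrap_pf(obs)
print("RMSE of raw observations:", rmse(obs, truth))
print("RMSE of filtered track:  ", rmse(est, truth))
```

In this toy setting, a spurious observation far from the particle cloud leaves the weights almost uniform, so the estimate stays close to the track predicted by the dynamics model; this is the mechanism by which gross detection errors are filtered out of the observation sequence.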
1.3 Thesis Structure
A review of the basic theoretical concepts used throughout this work is given in Chapter 2. This includes a problem overview of acoustic source tracking in reverberant environments, and a review of the basics of particle filtering. This chapter also presents a detailed description of an office room environment which is used as a generic practical setting for experimental simulations in the rest of the study.
On the basis of these reviews, the particle filtering principle is elaborated in the specific context of acoustic source localisation in Chapter 3. A general framework for particle filtering is developed and four different algorithms that fit within this framework are presented. Specific tracking accuracy parameters are described and used subsequently to assess the performance of these new methods. To this purpose, statistical simulations are carried out using both synthetic and real-life samples of noisy and reverberant audio data.
The developments of Chapter 4 attempt to improve the particle filtering technique used in the preceding chapter. The concept of sequential importance sampling is introduced and the original particle filter is updated with this new principle. The resulting algorithm is also updated to reflect some more practical issues related to the specific problem under consideration. The final method is then also tested experimentally using both simulated and real audio recordings.
Chapter 5 is concerned with a theoretical performance analysis of the methods developed in the context of acoustic source localisation. The research presented in this chapter is motivated by the need to determine the extent to which the performance of the developed source tracking algorithms can be theoretically improved. Theoretical limits on the estimation error are derived on the basis of a modified version of the well-known Cramér-Rao bound, the so-called posterior Cramér-Rao bound. A review of this lower bound theory is first presented and subsequently applied to the problem of sound source localisation. This requires mainly the derivation of an equation describing how practical observations are obtained from the current state of the acoustic target. To this purpose, two different observation models are developed. First, a simplified model is derived on the basis of simple statistical room acoustics considerations, and then used as an example for the lower bound computations. A second, more elaborate observation equation is then derived to determine the influence of the correlation existing between sound intensity values measured in a diffuse sound field. This chapter also presents a comparison of the performance obtained with two generic particle filter representatives with respect to the lower error bound.
Chapter 6 finally concludes this work with a review of the new concepts introduced in this study and summarises the main contributions made to the field of acoustic source tracking in reverberant environments. Future directions for a possible continuation of this research are also briefly considered in this chapter.
The appendix sections contain additional information related to this work. Appendix A and Appendix B present the mathematical derivation of two theoretical concepts used elsewhere in this work. Appendix C gives an overview of the real-time implementation of a particle filtering algorithm for acoustic source tracking. This practical realisation makes use of a multi-channel audio acquisition and processing system specifically developed for the purpose of testing particle filtering methods in real-life conditions. Finally, Appendix D describes the contents of two data CDs included in this thesis where a number of electronic files relevant to this research can be found.
1.4 Publications
The two papers listed below have been published as a direct result of this research work. Both publications report the main results of Chapter 3 with different levels of detail.
◦ Darren B. Ward, Eric A. Lehmann, and Robert C. Williamson. Particle filtering algorithms for tracking an acoustic source in a reverberant environment. IEEE Transactions on Speech and Audio Processing, 11(6):826–836, November 2003.
◦ Eric A. Lehmann, Darren B. Ward, and Robert C. Williamson. Experimental comparison of particle filtering algorithms for acoustic source localization in a reverberant room. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 5, pages 177–180, Hong Kong, China, April 2003.
1.5
Notation
Effort was made to keep the notation consistent with previous literature works as well as across the different parts of this thesis. The notational conventions will be introduced in the sequel where appropriate. However, for quick reference, the following list summarises the most important symbols used throughout this work. It can be seen that some of the symbols are used to denote two different concepts. However, this does not lead to misunderstandings as the context makes it clear which definition is being used.
Standard Operators, Functions and Symbols
E{·}        statistical expectation
Pr[·]       probability
[·]^T       matrix transpose
[·]_(i,j)   (i, j)-th matrix element
(·)*        complex conjugation
⊛           convolution
|·|         absolute value
⌈·⌉         ceiling function (rounding towards +∞)
≜           generic correspondence or definition
‖·‖         vector 2-norm
Specific Functions and Variables
κ           wave number
ξ, ξ        general-purpose variable and vector (no specific meaning)
ρ           correlation coefficient
σ           standard deviation
τ, τ̂        time delay and time delay estimate
ω           continuous circular frequency variable
c           acoustic wave propagation velocity
e           Euler’s number
F           Fourier transform
f           continuous frequency variable
I           sound intensity
J           Fisher information matrix
j           imaginary unit: j = √−1
ℓ, ℓ̇        location vector and velocity vector
l           discrete time index
N           Gaussian density function
P           steered beamformer function
p           probability density function or sound pressure function
q           importance density function
R           cross-correlation function
r           spatial distance
s, S        signal function and corresponding Fourier transform
T           raw data transform
T60         reverberation time
t           continuous time variable
U           uniform density function
w           particle weight
X           state vector
x, y, z     location variables
ẋ, ẏ, ż     velocity variables
Y, Y        observation vector and scalar observation variable
Y           raw data matrix
Indexing Variables
i, j        general-purpose counters
k, K        frame index and total number of frames
l, L        discrete time index and total number of samples per frame
m, M        sensor index and total number of sensors
n, N        particle index and total number of particles
p, P        sensor pair index and total number of sensor pairs
s           source label (usually used as subscript)
Set and Vector Notation
{a_i}_{i=1}^{N}   set of discrete values: {a_1, a_2, . . . , a_N}
{1, . . . , N}    interval of discrete values
X_{1:k}           sequence (set) of elements: {X_1, X_2, . . . , X_k}
[a, b]            closed interval of continuous values: {ξ | a ≤ ξ ≤ b}
[x y z]           generic vector notation (line vector)
(x, y)            pair of elements
Abbreviations
1D, 2D, 3D  one, two and three-dimensional
ACF         autocorrelation function
AEDA        adaptive eigenvalue decomposition algorithm
ASL         acoustic source localisation
CCF         cross-correlation function
CDF         cumulative distribution function
CRB         Cramér-Rao bound
DFT         discrete Fourier transform
DSB         delay-and-sum beamformer
FCR         frame convergence ratio
FFT         fast Fourier transform
GCCF        generalised cross-correlation function
GL          Gaussian likelihood
MMSE        minimum mean square error
MSE         mean square error
MSTD        mean standard deviation
PCRB        posterior Cramér-Rao bound
PDF         probability density function
PF          particle filter
PL          pseudo-likelihood
RMSE        root mean square error
SBF         steered beamformer
SIS         sequential importance sampling
SMC         sequential Monte Carlo
SNR         signal-to-noise ratio
TDE         time delay estimate
TDOA        time delay of arrival
Other Notational Conventions
In addition to the above notations, the following conventions will be used throughout this work.
◦ For brevity, the notation $p(X)$ will be used to denote the probability density function of the random variable $X$, i.e. $p_X(\cdot)$. The notation $p_X(\cdot)$ will be used only where necessary, i.e. in cases where the shortened form may lead to confusion between the subscript variable and the function argument.
◦ The expression $\mathcal{N}(\mu, \sigma^2)$ is a generic term representing a Gaussian density with mean $\mu$ and standard deviation $\sigma$. The specific form $\mathcal{N}(\xi; \mu, \sigma^2)$ denotes the same function evaluated at $\xi$. The same notation is valid for other special distributions, such as the uniform density function on the interval $[a, b]$, denoted with $\mathcal{U}(a, b)$ and evaluated as $\mathcal{U}(\xi; a, b)$.
◦ Any signal $s(t)$ delivered by an acoustic sensor is assumed to be a real-valued energy signal, i.e. having finite energy and zero power:
$$ \int_{-\infty}^{\infty} s^2(t)\, dt < \infty , \qquad \lim_{T \to \infty} \frac{1}{T} \int_{-T/2}^{T/2} s^2(t)\, dt = 0 . $$
This assumption is of course motivated by the fact that in practice, signals of interest are all limited to finite time intervals.
◦ For practically oriented implementations, the signals received at the array microphones are typically processed in a succession of frames. Using this approach, the output of a delay-and-sum beamformer (steered to the location $\boldsymbol{\ell}$ and for a specific time frame $k$) is then typically defined as:
$$ \mathcal{P}_k(\boldsymbol{\ell}) = \int_{(k-1)T}^{kT} \left[ \frac{1}{M} \sum_{m=1}^{M} s_m\big(t - \tau_m(\boldsymbol{\ell})\big) \right]^2 dt , \tag{1.1} $$
where $M$ is the number of sensors, $\tau_m(\boldsymbol{\ell})$ is the steering delay for the signal $s_m(\cdot)$ received at the $m$th sensor, and $T$ corresponds to the length of the considered time frame. According to the above energy signal assumption, Eq. (1.1) hence corresponds to the measurement of an energy value, and the factor $1/T$ should be theoretically introduced in front of the integral to transform the result into a power measurement. Due to the finite and time-invariant value of $T$ however, the concepts of power and energy are identical in practice, up to the proportionality constant $1/T$. Consequently, the term “power” might also be used when referring to a beamformer output function defined in a manner similar to Eq. (1.1). This slight abuse of terminology does not lead to confusion and is allowed here in order to keep the notation consistent with previous literature.
Chapter 2
Basics of Acoustic Source Tracking and Particle Filtering
This chapter presents a review of the various concepts constituting the basis of the work presented in this thesis. Where appropriate, the developments of the next chapters will then simply refer to the principles described here instead of reproducing the theoretical derivations. The first sections describe the general problem of acoustic source localisation, and review three classical methods used traditionally in the literature to tackle this kind of problem. The concept of Bayesian filtering is then introduced as a means of improving the performance of these classical algorithms. The particle filtering principle (sequential Monte Carlo method) is described in detail as an approximation to the Bayesian filtering solution. Because this method involves the definition of a specific model describing the source motion, a section of this chapter presents the dynamics model that will be used throughout this work. The various simulations (both practical experiments and software simulations) carried out in the context of this research are all based on one particular example of a reverberant environment. The general room and microphone setup used for this purpose is finally described in detail in the last part of the chapter.
2.1
Acoustic Source Tracking
The problem of locating and tracking a wideband acoustic source in a multipath environment arises in several fields including sonar, seismology, and speech. The present research is particularly concerned with speech, where applications include automatic camera steering for video-conferencing, discriminating between individual talkers in multisource environments, and providing steering information for microphone arrays [14]. This section gives an overview of the most important concepts relevant to the problem of acoustic source localisation and tracking.
Figure 2.1: Generic two-dimensional formulation of the acoustic source localisation problem. (Diagram labels: source signal s(t), direct path, reverberation, sensor array with received signals s1(t), s2(t), s3(t).)
2.1.1
Generic Problem Definition
The general problem of acoustic source localisation (ASL) and tracking can be defined in a fairly straightforward manner, as depicted in Figure 2.1. The problem definition used in this work follows that given in some previous publications, see e.g. [116], and is described as follows. It is assumed that a single acoustic source is present in a reverberant (and possibly noisy) environment. This work hence deals with single-target tracking problems. In the following chapters, the source is typically defined as a person producing speech signals, or alternatively as a generic source of band-limited white noise. For the purpose of this research, the environment will be assumed to be a small, mostly regularly shaped enclosure of typical office room dimensions (see Section 2.5). Also, the medium within this enclosure is assumed to be homogeneous and at rest. The velocity c of sound waves is hence assumed constant with respect to space and time, and its magnitude is defined for the rest of this work as c = 343 m/s. A number of microphones are set up at random locations in this environment and are used to acquire (noisy and reverberant) signals in parallel. Furthermore, the sensor positions are considered fixed and known exactly. Given this problem setup, the aim is to localise and, following its detection, track the acoustic source based solely on the data recorded by the microphone array.
Most of the research presented in this thesis deals with a purely two-dimensional (2D) problem setting. In other words, only an estimate of the source location in the horizontal (x, y)-plane is of interest. The position of the source in the third (vertical) dimension can be seen as either known (typically assumed to be identical to the height of the array sensors) or of no significant importance to this problem.ᵃ
ᵃ However, the various derivations in this work are readily generalised to handle the third dimension if necessary.
2.1.2
Signal Model
Consider a collection of $M$ sensors positioned arbitrarily and located in a multipath environment. Assuming a single source, the discrete-time signal $s_m(\cdot)$ received at the $m$th sensor, $m \in \{1, \dots, M\}$, can be expressed, as a function of the discrete time variable $l = 1, 2, \dots$, as:
$$ s_m(l) = h_{\mathrm{tot},m}(l) \circledast s_s(l) + v(l) , $$
where $h_{\mathrm{tot},m}(l)$ is the complete impulse response from the source to the $m$th sensor, $s_s(l)$ is the source signal, and $v(l)$ is additive white noise (assumed to be uncorrelated with the source signal and from sensor to sensor). In the above equation, the symbol $\circledast$ denotes convolution. The impulse response from the source to any sensor can be separated into direct path and multipath terms, giving:
$$ s_m(l) = \frac{1}{4\pi \|\boldsymbol{\ell}_s - \boldsymbol{\ell}_m\|}\, s_s(l - \tau_m) + s_s(l) \circledast h_{\mathrm{rev},m}(l) + v(l) , \tag{2.1} $$
where $\boldsymbol{\ell}_s = [x_s \; y_s]^T$ is the source location in Cartesian coordinates, $\boldsymbol{\ell}_m$ is the sensor location, $h_{\mathrm{rev},m}(l)$ is the component of the impulse response between the source and the $m$th sensor due to multipath (reverberation) only, and $\|\cdot\|$ denotes the vector 2-norm. The delay from the source to the $m$th sensor is:
$$ \tau_m = \frac{\|\boldsymbol{\ell}_s - \boldsymbol{\ell}_m\|}{c} , $$
where $c$ is the speed of sound wave propagation.
Assume that the data at each sensor are collected over a frame of $L$ samples, and let the vector $\mathbf{s}_m(k)$ denote the vector of data received at the $m$th array sensor for a specific time frame $k$, i.e.:
$$ \mathbf{s}_m(k) = \big[\, s_m(kL - L + 1) \;\cdots\; s_m(kL) \,\big] . $$
Then, stack the sensor frames to form the $M \times L$ matrix:
$$ \mathbf{Y}_k = \begin{bmatrix} \mathbf{s}_1(k) \\ \vdots \\ \mathbf{s}_M(k) \end{bmatrix} , \tag{2.2} $$
which represents the data received at the array during time frame $k$. The matrix $\mathbf{Y}_k$ will be referred to as the raw data. The problem is hence to estimate the current location $\boldsymbol{\ell}_s$ of the acoustic source from the raw data $\mathbf{Y}_k$ for $k = 1, 2, \dots$.
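As an illustration of the signal model, the following minimal Python sketch generates the direct-path and noise terms of Eq. (2.1) for a set of sensors and extracts the raw data matrix of Eq. (2.2). It rests on several assumptions that are not part of the text: the reverberant term is omitted, delays are rounded to whole samples, and the function names, the sampling frequency argument `fs` and the noise level are hypothetical.

```python
import numpy as np

def simulate_sensor_signals(source, src_pos, mic_pos, fs, c=343.0, noise_std=0.01, seed=0):
    """Direct-path part of Eq. (2.1): attenuated, delayed source plus white noise.

    The reverberant term s_s * h_rev,m is deliberately left out of this sketch.
    """
    rng = np.random.default_rng(seed)
    dists = np.linalg.norm(mic_pos - src_pos, axis=1)    # ||l_s - l_m|| for each sensor
    delays = np.round(dists / c * fs).astype(int)         # tau_m in samples (rounded)
    out = np.zeros((len(mic_pos), len(source)))
    for m, (d, tau) in enumerate(zip(dists, delays)):
        # delay by tau samples and apply the 1 / (4 pi d) spherical attenuation
        out[m, tau:] = source[:len(source) - tau] / (4 * np.pi * d)
    out += noise_std * rng.standard_normal(out.shape)     # additive white noise v(l)
    return out, delays

def raw_data_frame(signals, k, L):
    """Y_k of Eq. (2.2): samples (k-1)L+1 .. kL of every sensor (frames numbered from 1)."""
    return signals[:, (k - 1) * L : k * L]
```

Each row of the returned frame corresponds to one sensor vector of Eq. (2.2), so a recording of total length KL yields K consecutive raw data matrices.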
2.2
Traditional Approaches to ASL
Classical source localisation algorithms attempt to determine the current source location using the data Yk obtained at the current time k only. There are essentially two types of algorithms used: i) time delay estimation (TDE) methods, which estimate location based on the time delay of arrival of signals at the receivers; and ii) direct methods, such as steered beamforming (SBF), which avoid the two-step approach of a TDE-based principle. Each method transforms the raw data into a function that exhibits a peak in the location corresponding to the source. In the sequel, such a function will be referred to using the generic term of localisation function. Due to their different working principles, these two types of approach usually also yield different levels of tracking performance and robustness in the presence of noise and reverberation. The next subsections give a review of these classical methods.
2.2.1
Time Delay Estimation
Many conventional source localisation algorithms partition the sensors into pairs, and attempt to find a time delay estimate (TDE) for each pair of sensors. These TDEs can be obtained using a variety of techniques, including the adaptive eigenvalue decomposition algorithm (AEDA) [8] and the well-known generalised cross-correlation function (GCCF) [66] and its variants. These techniques act to transform the raw data into another functional form (localisation function) from which time delays can be estimated. The rest of this subsection presents a brief description of the two main TDE principles considered in this work, i.e. the GCCF and AEDA approaches.
Generalised Cross-Correlation (GCC)
Let $s_m(l)$ denote the discrete-time signal received at the $m$th sensor. With $\mathcal{F}\{\cdot\}$ denoting the Fourier transform, $S_m(\omega) = \mathcal{F}\{s_m(l)\}$ represents the frequency domain data received during a specific time frame. With the $p$th sensor pair consisting of the $i$th and $j$th microphones, the GCC function is defined as:
$$ R_p(\tau) = \int W_R(\omega)\, S_i(\omega)\, S_j^*(\omega)\, e^{j\omega\tau}\, d\omega , \qquad p \in \{1, \dots, P\} , \tag{2.3} $$
where $W_R(\omega) \in \mathbb{R}^+$ is a frequency weighting factor, $(\cdot)^*$ denotes complex conjugation, and $P$ is the total number of sensor pairs considered in the array. The integration in Eq. (2.3) is computed over the frequency range of interest, typically for $f \in [300, 3000]$ Hz in the context of speech processing applications.ᵇ Reference [66] gives an exhaustive description of several possible weighting schemes related to the term $W_R(\omega)$, and discusses their advantages and drawbacks in detail. One particular definition used commonly in the literature (and chosen in this work) is:
$$ W_R(\omega) = \frac{1}{|S_i(\omega)\, S_j^*(\omega)|} , $$
which results in the well-known phase transform (PHAT) localisation scheme [66]. The GCCF $R_p(\tau)$ of Eq. (2.3) can be seen as a localisation function that transforms the raw data $\mathbf{Y}_k$ into a measure that exhibits a peak corresponding to the true source location. Note that for this specific method, there exists a total of $P$ such localisation functions, one per sensor pair. The TDE (i.e. an estimate of the time delay of arrival) for pair $p$ is then determined as:
$$ \hat{\tau}_p = \arg\max_{\tau} R_p(\tau) . \tag{2.4} $$
ᵇ Frequencies below about 300 Hz generally contain a significant amount of noise in acoustically untreated environments such as those considered in this work.
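The GCC-PHAT delay estimate of Eqs. (2.3) and (2.4) can be sketched numerically as follows. This is an illustrative discretisation only, under assumptions that are not taken from the text: the integral is replaced by an inverse FFT on zero-padded frames, the search is restricted to integer sample lags within a user-supplied `max_delay`, and the function name and the small constant guarding against division by zero are hypothetical.

```python
import numpy as np

def gcc_phat_tde(si, sj, fs, max_delay, f_lo=300.0, f_hi=3000.0):
    """Return the delay (in seconds) that maximises a discrete GCC-PHAT function."""
    n = 2 * len(si)                                   # zero-pad to avoid circular wrap-around
    Si, Sj = np.fft.rfft(si, n), np.fft.rfft(sj, n)
    cross = Si * np.conj(Sj)                          # S_i(w) S_j*(w)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    band = (freqs >= f_lo) & (freqs <= f_hi)          # frequency range of interest
    weighted = np.zeros_like(cross)
    # PHAT weighting: divide by the cross-spectrum magnitude inside the band
    weighted[band] = cross[band] / np.maximum(np.abs(cross[band]), 1e-12)
    r = np.fft.irfft(weighted, n)                     # R_p at integer lags, circular order
    max_lag = int(np.floor(max_delay * fs))
    lags = np.arange(-max_lag, max_lag + 1)
    r_lags = r[lags % n]                              # reorder to lags -max_lag .. +max_lag
    return lags[np.argmax(r_lags)] / fs               # one-dimensional search of Eq. (2.4)
```

With the sign convention of Eq. (2.3), a positive result means that the signal at sensor i lags the signal at sensor j. Setting `max_delay` to the sensor spacing divided by c restricts the search to physically possible delays.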
Determining the TDEs for all pairs hence requires $P$ one-dimensional searches over the scalar space of possible time delays.
Adaptive Eigenvalue Decomposition (AED)
Let the $L$-element vector $\mathbf{h}_{\mathrm{tot},m}(k)$ denote the impulse response from the source to the $m$th sensor during time frame $k$:ᶜ
$$ \mathbf{h}_{\mathrm{tot},m}(k) = [\, h_{\mathrm{tot},m}(1) \;\cdots\; h_{\mathrm{tot},m}(L) \,]^T . $$
The covariance matrix of two signals received at the $i$th and $j$th sensors (defining the $p$th pair) is:
$$ \mathbf{R}_p = \begin{bmatrix} E\{[\mathbf{s}_i(k)]^T \mathbf{s}_i(k)\} & E\{[\mathbf{s}_i(k)]^T \mathbf{s}_j(k)\} \\ E\{[\mathbf{s}_j(k)]^T \mathbf{s}_i(k)\} & E\{[\mathbf{s}_j(k)]^T \mathbf{s}_j(k)\} \end{bmatrix} . $$
The AED approach is based on the fact that the matrix $\mathbf{R}_p$ has a single eigenvector corresponding to:
$$ \boldsymbol{\xi}_p = \begin{bmatrix} \mathbf{h}_{\mathrm{tot},i}(k) \\ -\mathbf{h}_{\mathrm{tot},j}(k) \end{bmatrix} . $$
This technique hence attempts to provide a time delay estimate $\hat{\tau}_p$ on the basis of the impulse responses resulting from the eigenvalue decomposition of the signal covariance matrix $\mathbf{R}_p$. In [8], Benesty develops an adaptive method (the AED algorithm, AEDA) that detects the direct path components of the two transfer functions, from which a TDE for the $p$th sensor pair can be finally computed. The AEDA results presented in this work make use of the implementation described in [8] with a 10-fold subsample interpolation of the time delays. Also, due to the specific working principle of this algorithm, the resulting TDEs are not necessarily guaranteed to be within the range of physically allowable time delays, given the microphone spacing and sample rate used. The AEDA implementation is hence defined to simply discard such erroneous TDEs for the corresponding time frame.
It should be noted that the above AEDA developments are based on the “total” impulse response, which theoretically also accounts for the effects of reverberation. The GCCF is developed on the basis of a simplified signal model in which only the direct component in Eq. (2.1) is accounted for. Thus, it is generally expected that AEDA will provide more robust TDEs than a GCC approach due to this more accurate signal model assumption. AEDA however also differs from GCC in that it returns a single time delay estimate, whereas GCC produces a function which has the TDE as the independent variable. Technically speaking, the localisation function for AEDA is therefore a delta function, which makes it unsuitable for a particle filtering application (see Chapter 3).
ᶜ It is assumed that the source position, and consequently the source-to-sensor transfer function, remains constant over the entire duration of each frame.
Final Location Estimate
For each pair of sensors (and hence for each TDE), the locus of potential source locations in a two-dimensional setting is a hyperbola.ᵈ An estimate of the 2D source position $\boldsymbol{\ell}_s = [x_s \; y_s]^T$ is then given as the location which best fits the potential source loci across all sensor pairs. There has been a large amount of work in the literature regarding how to find this best fit, see e.g. [24, 51, 60, 106]. In the present research, the estimated source location $\hat{\boldsymbol{\ell}}_s$ for TDE methods (i.e. both GCCF and AEDA) is defined as that minimising the sum of the distances to each intersection point of the loci with each other. In other words, once a total of $N_{\mathrm{int}} = P(P-1)/2$ intersection points $\{\boldsymbol{\ell}_{\mathrm{int},i}\}_{i=1}^{N_{\mathrm{int}}}$ have been computed (on the basis of the TDEs obtained for $P$ pairs), the estimated source location results as:
$$ \hat{\boldsymbol{\ell}}_s = \arg\min_{\boldsymbol{\ell}} \sum_{i=1}^{N_{\mathrm{int}}} \|\boldsymbol{\ell} - \boldsymbol{\ell}_{\mathrm{int},i}\| . \tag{2.5} $$
In practice, estimating the source location by minimising some least-squares criterion can be potentially very complex. For the TDE methods implemented in this work, the locus of potential source locations given a specific TDE $\hat{\tau}_p$ is approximated by a straight line passing through the middle of the $p$th sensor pair (plane wave assumption). Also, intersection points lying outside the considered room boundaries are discarded, which provides an effective way of diminishing the contribution of outliers in the TDE measurements. This method is similar to that proposed in [17] and has shown a good performance for the present research compared to other variants proposed in the literature. In this work, the total number of pairs considered for TDE methods is set to $P = 4$ (see Section 2.5).
ᵈ As mentioned in Section 2.1.1, this work will focus on a purely 2D problem definition, where the height of the source in the enclosure is considered known.
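The final minimisation of Eq. (2.5) can be sketched as follows. The point minimising a sum of Euclidean distances is the geometric median of the intersection points; the sketch below uses Weiszfeld's fixed-point iteration to find it, which is an assumption made here for illustration since the text does not state which numerical minimiser is used. The function name, the rectangular room bounds and the iteration settings are likewise hypothetical.

```python
import numpy as np

def location_from_intersections(points, room_min, room_max, n_iter=200, eps=1e-9):
    """Minimise sum_i ||l - l_int,i|| (Eq. (2.5)) with Weiszfeld iterations."""
    pts = np.asarray(points, dtype=float)
    # discard intersection points lying outside the (rectangular) room boundaries
    inside = np.all((pts >= room_min) & (pts <= room_max), axis=1)
    pts = pts[inside]
    if len(pts) == 0:
        return None                                   # no usable intersection in this frame
    est = pts.mean(axis=0)                            # start from the centroid
    for _ in range(n_iter):
        # distances to the current estimate, clipped to avoid division by zero
        d = np.maximum(np.linalg.norm(pts - est, axis=1), eps)
        new = (pts / d[:, None]).sum(axis=0) / (1.0 / d).sum()
        if np.linalg.norm(new - est) < eps:
            break
        est = new
    return est
```

Compared with a least-squares fit, the sum-of-distances criterion gives less weight to isolated intersection points, which complements the removal of points outside the room.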
Discussion of TDE Approaches
In practice, there are two major drawbacks related to the implementation of a traditional TDE approach: i) although the true source location will usually correspond to a peak in the TDE measurements, in the presence of multipath it may not always be the global maximum; and ii) in the presence of noise there is usually no single point at which the source loci from different sensor pairs intersect. These two drawbacks have been addressed recently in [48] using the notion of realisable delay vectors. Note that TDE methods constitute an indirect approach to source localisation since they rely on a two-stage algorithm, i.e. first determine the TDEs for different pairs, then combine these time delays to find the source location.
2.2.2
Steered Beamforming
In contrast to time delay estimation, direct methods attempt to estimate the source location vector without recourse to pairwise TDEs. As an example of a direct localisation function, consider the frequency-averaged output power of a steered beamformer (SBF) [14, 31, 109]. Let $S_m(\omega) = \mathcal{F}\{s_m(l)\}$ denote the frequency domain data received at the $m$th sensor during a given time frame $k$. For a beamformer steered to the location $\boldsymbol{\ell}$, the SBF localisation function is:
$$ \mathcal{P}(\boldsymbol{\ell}) = \int W_{\mathcal{P}}(\omega) \left| \sum_{m=1}^{M} H_m(\boldsymbol{\ell}, \omega)\, S_m(\omega) \right|^2 d\omega , \tag{2.6} $$
where the integration is computed over the frequency range of interest, here again typically for $f \in [300, 3000]$ Hz. The term $W_{\mathcal{P}}(\omega) \in \mathbb{R}^+$ is a frequency weighting factor that is typically used to emphasise low frequencies, e.g. when dealing with speech signals. The complex-valued beamformer weighting term (steering delay) on the $m$th sensor is defined as:
$$ H_m(\boldsymbol{\ell}, \omega) = \alpha_m \exp\!\big( j\omega c^{-1} ( \|\boldsymbol{\ell} - \boldsymbol{\ell}_m\| - d_{\mathrm{ref}} ) \big) , $$
with $\alpha_m \in \mathbb{R}$ the gain applied to the $m$th sensor signal, $\boldsymbol{\ell}_m$ the sensor location, and $d_{\mathrm{ref}}$ the distance from the origin of the coordinate system to some reference point (typically chosen as the centre of the sensor array). If $\alpha_m = 1/M$, $\forall m \in \{1, \dots, M\}$, then $\mathcal{P}(\boldsymbol{\ell})$ corresponds to a conventional delay-and-sum beamformer (DSB). It was shown in [22] that if the sensor gain $\alpha_m$ is chosen according to the signal gain level of the source at the $m$th sensor, then the frequency-averaged steered beamformer output is the optimal maximum likelihood solution to locate wideband signals. It is also stated in [22] that there is no significant performance degradation if the gain is chosen as unity or is modelled by the direct path attenuation only. The conventional DSB weights $\alpha_m = 1/M$ will be used throughout the rest of this work. In the case of the above defined SBF localisation function, the source location is finally estimated as:
$$ \hat{\boldsymbol{\ell}}_s = \arg\max_{\boldsymbol{\ell}} \mathcal{P}(\boldsymbol{\ell}) . $$
Note that the position estimate is here directly obtained from a maximisation of the SBF localisation function (hence the direct method terminology). With a TDE approach, the results from maximising the localisation functions (see Eq. (2.4)) must be subsequently processed in a second stage, as shown for instance in Eq. (2.5) (two-step technique). Although direct methods do not require the calculation of intermediate time delays, a multi-dimensional search over the vector space of possible source locations is required, which is potentially very demanding from a computational point of view.
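A discrete version of the SBF localisation function of Eq. (2.6), evaluated over a set of candidate locations, can be sketched as follows. The sketch makes several assumptions for illustration: the frequency weighting is set to one, the DSB gains 1/M are used, the reference distance is set to zero (being the same for all sensors, it only contributes a common phase that vanishes in the squared magnitude), and the integral is replaced by a sum over FFT bins. The function name and the candidate grid are hypothetical.

```python
import numpy as np

def sbf_map(frames, mic_pos, grid, fs, c=343.0, f_lo=300.0, f_hi=3000.0):
    """Evaluate a discrete version of Eq. (2.6) at every candidate location in `grid`."""
    M, L = frames.shape
    S = np.fft.rfft(frames, axis=1)                        # S_m(omega) for each sensor
    freqs = np.fft.rfftfreq(L, d=1.0 / fs)
    band = (freqs >= f_lo) & (freqs <= f_hi)               # frequency range of interest
    omega = 2 * np.pi * freqs[band]
    d_ref = 0.0                                            # common phase reference (assumption)
    power = np.empty(len(grid))
    for g, loc in enumerate(grid):
        dist = np.linalg.norm(mic_pos - loc, axis=1)       # ||l - l_m||
        # steering terms H_m(l, omega) with alpha_m = 1/M
        H = np.exp(1j * np.outer(dist - d_ref, omega) / c) / M
        power[g] = np.sum(np.abs(np.sum(H * S[:, band], axis=0)) ** 2)
    return power
```

The location estimate is then `grid[np.argmax(power)]`. The loop over all candidate locations is the multi-dimensional search mentioned above, and its cost grows with the grid resolution.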
2.2.3
Discussion of Traditional ASL Methods
It is well known that the classical source localisation methods described in this section are quite sensitive to the level of reverberation in the considered environment. This fact is clearly demonstrated in the following two examples. Figure 2.2 (reproduced from [21]) illustrates the performance of time delay estimation in the presence of room reverberation. This graph shows the percentage of anomalous TDEs resulting from maximising the GCC function of Eq. (2.3), as defined in Eq. (2.4), with a slightly different frequency weighting factor $W_R(\omega)$ (see [21] for details). These results were obtained using the image method [3, 69] in order to simulate different values of reverberation time T60 with band-limited white noise as source signal. A time delay estimate is considered anomalous if not lying within a small domain around the correct time delay of arrival. This plot clearly shows how the localisation accuracy of the GCCF method sharply decreases past a certain level of reverberation. This threshold effect was also mentioned in [54].
Figure 2.2: Percentage of anomalous TDEs resulting from GCCF computations versus reverberation time T60. Vertical bars represent the 95% confidence intervals for each data point.ᵉ (Axes: T60 from 0 to 0.5 s; anomalous TDEs from 0 to 100%.)
As a second example of the influence of reverberation on classical localisation methods, Figure 2.3 shows a typical example of a practical steered beamformer output function. This measurement was obtained by scanning the acoustic field in two dimensions in a real office room (with frequency-averaged reverberation time T60 = 0.39 s) with an eight-sensor acoustic array. A single source emitting a speech signal was present in the enclosure at the position [x y] = [0.75 m 2.3 m]. This SBF result was computed according to Eq. (2.6) for one single frame of signal. In this specific example, the acoustic source generates a distinct peak in the SBF output function. Because of the reverberation effects in the room, a number of spurious peaks (with amplitudes similar to that of the true source peak) also appear at various other locations in the state space. Without prior knowledge about the source location, it is consequently impossible to guarantee that choosing the largest peak in the SBF output will deliver the correct source position. Because of this, classical methods based on the SBF principle are hence also likely to fail even for moderate levels of reverberation.
ᵉ This figure is reproduced from [21].
Figure 2.3: Practical example of SBF output for a frame of audio data recorded in an office room (reverberation time T60 ≈ 0.4 s). The peak generated by the acoustic source appears at the position [x y] = [0.75 m 2.3 m], other peaks are due to spurious reverberation effects. (Surface plot over x-axis (m) and y-axis (m).)
It should however be noted that, despite their weaknesses, classical ASL methods have been the object of a lot of research in the last few decades. TDE methods based on the GCCF principle have been used in a number of applications (see e.g. [16, 87]), whereas steered beamforming is the basic principle elaborated in works such as [24, 109]. In particular, a large number of publications describe various principles developed in an attempt to enhance the tracking performance of these traditional methods, see e.g. [15, 31, 51, 60, 102, 106, 108]. The present thesis will however not deal with these many different variants. The basic principles previously described in Sections 2.2.1 and 2.2.2 will be consistently used instead, for classical ASL methods and also as a basis for the particle filtering developments of the next chapters. Hence, it should be kept in mind throughout this work that some of the principles derived in the above references might have the potential to further improve the experimental results depicted in this research (for classical methods as well as particle filters).
Finally, observe the similarity between the SBF and GCC methods, in Eqs. (2.6) and (2.3) respectively. A detailed study of these two methods is given in Appendix A, and shows that the SBF principle involves a sum of GCC functions computed for every possible sensor pair in the array. It is hence expected that from a tracking point of view, a steered beamforming approach will outperform TDE methods that do not consider all sensor pairs.
2.3 Basics of Particle Filtering
This section introduces the particle filtering (PF) terminology and notation used in the rest of this work. It first presents a brief summary of the Bayesian filtering problem, and the bootstrap particle filter is then described in detail as a numerical approximation of the Bayesian solution. The weaknesses of this PF algorithm are also discussed.
2.3.1
Definition of the Bayesian Filtering Problem
Consider the following problem. One would like to give an estimate of (or more generally speaking, obtain some sort of information about) a state variable $\mathbf{X}_k$ at discrete points in time, $k = 1, 2, \dots$. Assuming that the system under consideration is defined by a Markov process, the dynamics of this state variable can be described using a transition equation of the form:
$$ \mathbf{X}_k = g(\mathbf{X}_{k-1}, u_k) , \tag{2.7a} $$
where $g(\cdot)$ is a known function describing the state transition model with process noise term $u_k$. Note that no specific assumptions are made here regarding this model, meaning that $g(\cdot)$ may be nonlinear and $u_k$ may be non-Gaussian. At each time step $k$, an observation $\mathbf{Y}_k$ becomes available in the form of a (noisy) measurement of the hidden state $\mathbf{X}_k$. This measurement is described by the observation equation:
$$ \mathbf{Y}_k = h(\mathbf{X}_k, v_k) , \tag{2.7b} $$
where $h(\cdot)$ is a potentially nonlinear function of the state $\mathbf{X}_k$ and an observation noise term $v_k$ (possibly non-Gaussian).
The ultimate aim is to compute for each time step an estimate $\hat{\mathbf{X}}_k$ of the true state variable $\mathbf{X}_k$ as accurately as possible. This defines the so-called Bayesian filtering problem, for which no closed-form solution exists in most practical cases, as shown in the next subsection.
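The state-space formulation of Eqs. (2.7a) and (2.7b) can be made concrete with a short simulation. The sketch below is purely illustrative: the scalar transition and observation functions and the noise distributions are hypothetical choices made here to show a nonlinear, non-Gaussian case, and are not the models used in this work.

```python
import numpy as np

def simulate_state_space(g, h, x0, n_steps, draw_u, draw_v):
    """Generate states and observations from Eqs. (2.7a) and (2.7b)."""
    states, observations = [], []
    x = x0
    for _ in range(n_steps):
        x = g(x, draw_u())             # X_k = g(X_{k-1}, u_k)
        y = h(x, draw_v())             # Y_k = h(X_k, v_k)
        states.append(x)
        observations.append(y)
    return np.array(states), np.array(observations)

rng = np.random.default_rng(0)
# Hypothetical scalar model: nonlinear transition, Laplacian observation noise.
g = lambda x, u: 0.9 * x + np.sin(x) + u
h = lambda x, v: x ** 2 / 5.0 + v
X, Y = simulate_state_space(g, h, x0=0.1, n_steps=50,
                            draw_u=lambda: rng.normal(0.0, 0.5),
                            draw_v=lambda: rng.laplace(0.0, 0.2))
```

Only the sequence `Y` would be available to a filter; the hidden sequence `X` is what the filter attempts to estimate.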
2.3.2
Bayesian Solution
One approach to the above problem is the use of Bayesian statistics to approximate the probability density function (PDF) $p(\mathbf{X}_k | \mathbf{Y}_{1:k})$ of the state variable $\mathbf{X}_k$ given the set $\mathbf{Y}_{1:k}$ of all measurements received up to time $k$. Here, $\mathbf{Y}_{1:k} = \{\mathbf{Y}_1, \dots, \mathbf{Y}_k\}$ simply denotes the concatenation of the measurements obtained at each time step. The PDF $p(\mathbf{X}_k | \mathbf{Y}_{1:k})$ is known as the posterior density and contains all the statistical information available regarding the condition of the state variable $\mathbf{X}_k$ at the current time $k$. The estimate $\hat{\mathbf{X}}_k$ then follows naturally from the analysis of $p(\mathbf{X}_k | \mathbf{Y}_{1:k})$ (e.g. as the mean or the mode of this PDF).
In order to compute $p(\mathbf{X}_k | \mathbf{Y}_{1:k})$, the Bayesian solution consists in using a recursive approach based on two basic iteration steps. First, assume that the posterior PDF $p(\mathbf{X}_{k-1} | \mathbf{Y}_{1:k-1})$ at time $k-1$ is known. The posterior density $p(\mathbf{X}_k | \mathbf{Y}_{1:k})$ for the current time $k$ can then be computed as follows:
i) Prediction step: by making use of the transition PDF $p(\mathbf{X}_k | \mathbf{X}_{k-1})$, which is derivable from the transition equation Eq. (2.7a), the prior PDF $p(\mathbf{X}_k | \mathbf{Y}_{1:k-1})$ is defined by:
$$ p(\mathbf{X}_k | \mathbf{Y}_{1:k-1}) = \int p(\mathbf{X}_k | \mathbf{X}_{k-1})\, p(\mathbf{X}_{k-1} | \mathbf{Y}_{1:k-1})\, d\mathbf{X}_{k-1} . \tag{2.8} $$
ii) Update step: the prior PDF can be subsequently updated with the so-called likelihood (measurement density) $p(\mathbf{Y}_k | \mathbf{X}_k)$ to deliver the desired posterior at time $k$, up to proportionality:
$$ p(\mathbf{X}_k | \mathbf{Y}_{1:k}) \propto p(\mathbf{Y}_k | \mathbf{X}_k)\, p(\mathbf{X}_k | \mathbf{Y}_{1:k-1}) . \tag{2.9} $$
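The two steps of Eqs. (2.8) and (2.9) can be illustrated numerically by discretising a scalar state on a uniform grid, so that the integral becomes a matrix-vector product. This grid-based sketch is an assumption made here for illustration only and is feasible just for low-dimensional states; the function names and the Gaussian helper are hypothetical.

```python
import numpy as np

def bayes_step(grid, posterior_prev, transition_pdf, likelihood, y):
    """One prediction/update cycle of Eqs. (2.8)-(2.9) on a uniform 1D grid."""
    dx = grid[1] - grid[0]
    # Prediction, Eq. (2.8): integrate p(x_k | x_{k-1}) p(x_{k-1} | y_{1:k-1}) over x_{k-1}
    trans = transition_pdf(grid[:, None], grid[None, :])    # trans[i, j] = p(x_i | x_j)
    prior = trans @ posterior_prev * dx
    # Update, Eq. (2.9): multiply by the likelihood, then normalise to unit area
    post = likelihood(y, grid) * prior
    return post / (post.sum() * dx)

def gauss(x, mean, std):
    """Gaussian density N(x; mean, std^2)."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))
```

For example, `bayes_step(grid, p0, lambda xk, xp: gauss(xk, xp, 0.5), lambda y, x: gauss(y, x, 0.5), y=1.0)` propagates a density `p0` through a random-walk transition and a direct noisy observation. In this linear Gaussian case the result can be checked against the Kalman filter, while the same function accepts arbitrary nonlinear or non-Gaussian densities.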
Eqs. (2.8) and (2.9) constitute the Bayesian solution to Eq. (2.7) for all possible functions $g(\cdot)$ and $h(\cdot)$, and for any noise distributions $p_{u_k}(\cdot)$ and $p_{v_k}(\cdot)$. For the specific case where the filtering problem is purely linear with additive Gaussian noise, the exact Bayesian solution can be computed analytically. The resulting posterior PDF is also purely Gaussian and the well-known Kalman filter [4] can be used to propagate its mean and variance over time, conditioned on the measurements. For the nonlinear and/or non-Gaussian cases, which often occur in practice,[f] the Kalman filter might however fail due to poor assumptions (linearity) in the model representation. Also, the Bayesian solution requires solving several integrals for each update of the posterior density. These integrations are usually impossible to solve analytically, and there exists no closed-form solution for Eqs. (2.8) and (2.9) in practice. Thus, practical implementations of the recursion must inevitably be approximate.

[f] For the acoustic source tracking problem, the practical observation equation (Eq. (2.7b)) is nonlinear as a result of the specific localisation process (SBF, GCC or AEDA), and the observation noise is typically non-Gaussian due to the reverberation effects.

Several approximation methods have been proposed as solutions of the nonlinear non-Gaussian filtering problem, including mainly the extended Kalman filter, grid-based algorithms, and Gaussian sum methods. An overview of such suboptimal solutions is given in [6].

One approximation of special interest is particle filtering (PF). The use of sequential Monte Carlo (SMC) simulation offers a powerful and flexible numerical approximation of the Bayesian solution, and since the great increase in low-cost computational power in the late 1980s, particle filtering has been the object of a great deal of research. This principle is based on a discrete representation of the posterior PDF using a set of state samples with corresponding likelihood weights, effectively replacing integrations with discrete sums in the general Bayesian solution. Moreover, this technique is not restricted to problems involving linear functions and Gaussian noise. It also allows the representation of multi-modal (and hence non-Gaussian) densities. This is of particular importance in the case of ASL, where the PDFs have to reflect the multi-hypothesis character of the problem: each of the peaks in the localisation function, or alternatively none of them, may be due to the true source (see Chapter 3).

Tutorial introductions to the particle filtering technique can be found, for instance, in [6, 32, 57, 97]. More detailed theoretical developments of SMC methods are presented in various works, including [33, 34, 65, 70, 83]. The next section presents a review of the main particle filtering concepts.
2.3.3 Particle Filtering Concepts

The basic idea behind particle filtering is to approximate the density of interest with a discrete distribution using a set of $N$ random samples of the state space $\{X_k^{(n)}\}_{n=1}^{N}$ (the particles) with associated likelihood weights $\{w_k^{(n)}\}_{n=1}^{N}$. The set of particles and weights $\{X_k^{(n)}, w_k^{(n)}\}_{n=1}^{N}$ can be seen as a random draw from the posterior density $p(X_k | Y_{1:k})$, hence representing a weighted discrete approximation from which the latter can be reconstructed [105].[g] In effect, particle filtering methods implement a sequential Monte Carlo simulation (i.e. a numerical simulation) of this set of particles and weights, so that the density they represent corresponds to an approximation of the true posterior PDF at every time step. This type of algorithm hence allows a numerical recursion over time as new observation data become available online.

[g] This approximation can be made arbitrarily accurate as $N \to \infty$; see [27, 28] for a survey of convergence results on particle filtering methods.

[Figure 2.4: diagram of one PF iteration. From top to bottom: $\{X_{k-1}^{(n)}, w_{k-1}^{(n)}\}_{n=1}^{N} \sim p(X_{k-1} | Y_{1:k-1})$; resampling step; $\{\tilde{X}_{k-1}^{(n)}, 1/N\}_{n=1}^{N} \sim p(X_{k-1} | Y_{1:k-1})$; prediction step; $\{X_k^{(n)}, 1/N\}_{n=1}^{N} \sim p(X_k | Y_{1:k-1})$; update step using $p(Y_k | X_k)$; $\{X_k^{(n)}, w_k^{(n)}\}_{n=1}^{N} \sim p(X_k | Y_{1:k})$.]
Figure 2.4: Symbolic representation of one particle filtering iteration. The particles are represented as circles, the size of which denotes the corresponding likelihood weight. The state space corresponds to the horizontal axis. At each iteration step, the set of particles and weights is an approximate representation of a specific PDF (given on the right-hand side).

The simplest variant of particle filtering is most certainly the so-called bootstrap filter presented by Gordon et al. in [44]. In a manner similar to the theoretical Bayesian solution of the filtering problem (see Section 2.3.2), the bootstrap algorithm relies on the two main steps of prediction and update to propagate the discrete representation of the posterior PDF from one time step to the next. Figure 2.4 contains a symbolic representation of this principle, whereas Algorithm 2.1 presents the general procedure followed by this basic PF method.

The details of the particle filtering process are as follows. Assume that a set of $N$ state samples $\{X_{k-1}^{(n)}\}_{n=1}^{N}$ with corresponding likelihood weights $\{w_{k-1}^{(n)}\}_{n=1}^{N}$ represents a discrete approximation of the desired posterior density $p(X_{k-1} | Y_{1:k-1})$ at time $k-1$. In the prediction step, each particle is relocated in the state space
according to the dynamics of the system under consideration. Since the new state samples are generated from the previous states $X_{k-1}^{(n)}$ and conditioned on the transition density $p(X_k | X_{k-1})$, the new set of particles $\{X_k^{(n)}\}_{n=1}^{N}$ following this process represents an approximation of the prior density $p(X_k | Y_{1:k-1})$. Upon receipt of the current measurement (observation) $Y_k$, the update step determines a likelihood weight for each particle based on the density $p(Y_k | X_k)$, known as the likelihood function. It follows that the resulting set of particles and weights $\{X_k^{(n)}, w_k^{(n)}\}_{n=1}^{N}$ is an approximation of the true posterior PDF $p(X_k | Y_{1:k})$. Valuable information can subsequently be derived from this density approximation, such as its mean, variance, mode, etc., which can be used for instance to determine an estimate of the current state $X_k$.

Assumption: The set of particles and weights $\{X_{k-1}^{(n)}, w_{k-1}^{(n)}\}_{n=1}^{N}$ is a discrete representation of the posterior distribution $p(X_{k-1} | Y_{1:k-1})$ at time $k-1$.

Iteration: At time $k$, do the following for $n = 1, \ldots, N$:

1. (Prediction) Determine the new particle position using the transition equation:
$$X_k^{(n)} = g\big(X_{k-1}^{(n)}, u_k\big).$$

2. (Update) Compute the unnormalised likelihood weight corresponding to the current particle:
$$\tilde{w}_k^{(n)} = p\big(Y_k | X_k^{(n)}\big).$$

Finally, normalise the likelihood weights so that they add up to unity:
$$w_k^{(n)} = \frac{\tilde{w}_k^{(n)}}{\sum_{i=1}^{N} \tilde{w}_k^{(i)}}.$$

Result: The set of particles and weights $\{X_k^{(n)}, w_k^{(n)}\}_{n=1}^{N}$ is approximately distributed as the posterior density $p(X_k | Y_{1:k})$.

Algorithm 2.1: Bootstrap particle filter.
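The bootstrap recursion can be sketched in a few lines. The following is a minimal illustration, assuming a hypothetical scalar model (linear transition, additive Gaussian noise) chosen only so that the example runs; the function names `bootstrap_step`, `transition` and `likelihood` are introduced here and do not come from the thesis. Each iteration resamples first, as drawn in Figure 2.4, then applies the prediction and update steps of Algorithm 2.1.

```python
import numpy as np

def bootstrap_step(particles, weights, y, transition, likelihood, rng):
    """One bootstrap PF iteration: resample, predict, update."""
    n = len(particles)
    # Resampling: draw N particles (with replacement) according to the weights.
    particles = particles[rng.choice(n, size=n, p=weights)]
    # Prediction (step 1): X_k^(n) = g(X_{k-1}^(n), u_k).
    particles = transition(particles, rng)
    # Update (step 2): unnormalised weights from the likelihood, then normalise.
    w = likelihood(y, particles)
    return particles, w / w.sum()

# Hypothetical scalar model, for illustration only:
#   X_k = 0.9 X_{k-1} + u_k,  Y_k = X_k + v_k,  u_k, v_k ~ N(0, 0.5^2).
def transition(x, rng):
    return 0.9 * x + rng.normal(0.0, 0.5, size=x.shape)

def likelihood(y, x):
    return np.exp(-0.5 * ((y - x) / 0.5) ** 2)

rng = np.random.default_rng(0)
N = 500
particles = rng.normal(0.0, 1.0, size=N)
weights = np.full(N, 1.0 / N)
x_true = 0.0
for k in range(30):
    # Simulate the hidden state and its noisy measurement.
    x_true = 0.9 * x_true + rng.normal(0.0, 0.5)
    y = x_true + rng.normal(0.0, 0.5)
    particles, weights = bootstrap_step(particles, weights, y,
                                        transition, likelihood, rng)

estimate = np.sum(weights * particles)   # posterior mean as the state estimate
```

Here the posterior mean is used as the state estimate; the mode or another statistic of the weighted particle set could be used instead.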
As depicted in Figure 2.4, an additional resampling step can be introduced, in which $N$ new particles are drawn (with replacement) from the existing particle set $\{X_k^{(n)}\}_{n=1}^{N}$ according to the likelihood weights $\{w_k^{(n)}\}_{n=1}^{N}$.[h] The likelihood weights are subsequently reset to a uniform value, $w_k^{(n)} = 1/N$, $\forall n \in \{1, \ldots, N\}$, which delivers an improved PDF approximation by reducing the variance of the particles' weights. This procedure is proposed in the literature in an attempt to reduce the so-called degeneracy problem known to affect the basic particle filter described here. There exist a number of different resampling schemes that can be used in this additional step, see e.g. [44, 70, 113]. The PF algorithms described in Chapter 3 implement the bootstrap method described in Algorithm 2.1 with the systematic resampling scheme developed in [65].
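A common formulation of systematic resampling is sketched below, as an illustration only; it is not claimed to reproduce the exact scheme of [65]. A single uniform draw positions $N$ evenly spaced pointers on the cumulative weight distribution, and each pointer selects one particle. The function name `systematic_resample` is introduced here for the example.

```python
import numpy as np

def systematic_resample(weights, rng):
    """Return N particle indices drawn with one uniform random offset."""
    n = len(weights)
    # N evenly spaced pointers in [0, 1), shifted by a single random offset.
    positions = (rng.random() + np.arange(n)) / n
    cumulative = np.cumsum(weights)
    cumulative[-1] = 1.0          # guard against floating-point round-off
    # Each pointer selects the first particle whose cumulative weight reaches it.
    return np.searchsorted(cumulative, positions)

rng = np.random.default_rng(0)
weights = np.array([0.1, 0.2, 0.3, 0.4])
indices = systematic_resample(weights, rng)
# After resampling, the selected particles all carry the uniform weight 1/N.
uniform_weights = np.full(len(weights), 1.0 / len(weights))
```

With this scheme, a particle with weight $w$ is selected either $\lfloor N w \rfloor$ or $\lceil N w \rceil$ times, which keeps the variance introduced by resampling low compared with $N$ independent draws.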
2.3.4 Discussion

Despite the relative novelty of the particle filtering principle (see [32, 113] for a historical review of PF methods), the literature currently existing on this topic contains a large number of enhanced PF variants that attempt to improve the basic bootstrap method. These include for instance the auxiliary particle filter [92], the hybrid bootstrap filter [45], the fast weighted bootstrap filter [7], and the unscented particle filter [114], to name but a few. The PF methods developed in the next chapter are based on the bootstrap implementation, so it should be kept in mind that some of the above-mentioned variants might improve on the tracking results obtained with a purely bootstrap approach.

The PF algorithm described in the previous subsection has the major disadvantage that during the prediction step, the particles are relocated in the state space based on the previous states $X_{k-1}^{(n)}$ only, and without consideration of the current measurement. This can potentially lead to some regions of the state space with high posterior likelihood being omitted in the final posterior representation (due to a lack of particles in these regions). This typically decreases the efficiency of the particle filter. To circumvent this problem, the principle of sequential importance sampling (SIS) can be applied in the particle filtering context. The development of a PF algorithm using the SIS principle is the topic of Chapter 4.

[h] Since this resampling step does not change the PDF represented by the particles and their weights, this step can be carried out either at the beginning or at the end of the PF iteration.
2.4 Target Dynamics Model
As can be seen in the previous section (see e.g. iteration step 1 in Algorithm 2.1), the particle filtering approach requires a general model describing the target dynamics for the specific case of ASL. In order to match the problem definition used in previous research [116], the model chosen in this work is defined as a so-called Langevin process, which is typically used to characterise many types of stochastic motion. It requires the state variable $X_k$ to be defined as:
$$X_k = \begin{bmatrix} \ell_k \\ \dot{\ell}_k \end{bmatrix},$$
with $\ell_k = [x_k \; y_k]^T$ denoting the position and $\dot{\ell}_k = [\dot{x}_k \; \dot{y}_k]^T$ denoting the velocity of the target in the state space. Note that only $\ell_k$, i.e. the top half of the state variable, is of interest for the calculation of the source position estimate $\hat{\ell}_s$. The velocity component is only included in the state variable in order to improve the representation of the considered dynamics model.

For the x-coordinate variable, the Langevin process is defined as follows:
$$\dot{x}_k = a_x \dot{x}_{k-1} + b_x u_x, \tag{2.10a}$$
$$x_k = x_{k-1} + T_u \dot{x}_k, \tag{2.10b}$$
where $u_x \sim \mathcal{N}(0, 1)$ is a Gaussian variable with zero mean and unit variance, $T_u$ is the time interval between two consecutive updates of the state vector, and
$$a_x = \exp(-\beta_x T_u), \qquad b_x = \bar{v}_x \sqrt{1 - a_x^2},$$
with $\bar{v}_x$ the steady-state velocity parameter and $\beta_x$ the rate constant. The source motions in the x and y dimensions are assumed to be independent and identically distributed, which yields identical $a$, $b$, $\bar{v}$ and $\beta$ parameter values for the dynamics of the remaining y-coordinate variable. Together with Eq. (2.10), this assumption leads to the following matrix form of the transition equation:
$$X_k = \underbrace{\begin{bmatrix} 1 & 0 & aT_u & 0 \\ 0 & 1 & 0 & aT_u \\ 0 & 0 & a & 0 \\ 0 & 0 & 0 & a \end{bmatrix}}_{G} \cdot X_{k-1} + u_k, \tag{2.11a}$$
which corresponds to the form defined in Eq. (2.7a), with the zero-mean multivariate Gaussian noise variable:
$$u_k \sim \mathcal{N}\left( \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \underbrace{\begin{bmatrix} b^2 T_u^2 & 0 & 0 & 0 \\ 0 & b^2 T_u^2 & 0 & 0 \\ 0 & 0 & b^2 & 0 \\ 0 & 0 & 0 & b^2 \end{bmatrix}}_{Q} \right). \tag{2.11b}$$

The numerical value used in the practical experiments for the rate parameter is $\beta = 10$ Hz (as suggested by [116]). Other model parameters will be defined in later sections where necessary. In the sequel, Eq. (2.11) will be used to model the source dynamics.
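As a rough sketch of how a particle set could be propagated with this model, the function below applies Eq. (2.10) to both coordinates of every particle. It uses the scalar recursion directly rather than drawing $u_k$ from Eq. (2.11b). The function name `langevin_step` and the values chosen for $T_u$ and $\bar{v}$ are assumptions made for the example; only $\beta = 10$ Hz comes from the text.

```python
import numpy as np

def langevin_step(state, T_u, beta, v_bar, rng):
    """Propagate particles of shape (N, 4), columns [x, y, x_dot, y_dot]."""
    a = np.exp(-beta * T_u)
    b = v_bar * np.sqrt(1.0 - a ** 2)
    u = rng.standard_normal((state.shape[0], 2))   # u_x, u_y ~ N(0, 1)
    velocity = a * state[:, 2:] + b * u            # Eq. (2.10a)
    position = state[:, :2] + T_u * velocity       # Eq. (2.10b)
    return np.hstack([position, velocity])

rng = np.random.default_rng(0)
particles = np.zeros((100, 4))        # all particles start at rest at the origin
particles = langevin_step(particles,
                          T_u=0.032,  # assumed update interval in seconds
                          beta=10.0,  # rate constant from the text
                          v_bar=1.0,  # assumed steady-state velocity in m/s
                          rng=rng)
```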
2.5 Experimental Setup
The different experimental simulations described in the following chapters (for both synthetic reverberant data and real audio recordings) are all based on the same practical setup, which corresponds to the layout of a typical office room fitted with an eight sensor array. The next section gives a detailed description of this practical environment, and Section 2.5.2 describes a few simplifying assumptions made regarding this setup for simulations based on the image method.
2.5.1 Practical Setup

A typical office room was used as experimental recording environment for the purpose of testing various ASL methods. A near-to-scale diagram of the room layout is presented in Figure 2.5, showing various measurements of interest. The room dimensions are roughly 2.9 m × 3.83 m × 2.7 m, with slightly irregular boundaries (window frames, door, column, etc.). Several items of furniture were also present in the room during the real audio recordings. In this environment, a total of M = 8 omnidirectional electret microphones are positioned as depicted in Figure 2.5 and at the constant height of 1.464 m. This setup defines four sensor pairs, one on each wall of the room, which corresponds to the pairing considered in the implementation of the classical TDE methods described in Section 2.2.1.

[Figure 2.5: plan view of the room (2900 mm × 3830 mm, with x and y axes) showing microphones 1 to 8, the furniture (desks, chairs, filing cabinets, shelves, computer, pin board), the windows and door, and the start and end points of an example source trajectory.]
Figure 2.5: Practical two-dimensional layout of the recording environment used for experimental simulations, showing a typical example of source trajectory (dashed line). The microphones are numbered from 1 to 8 and positioned at a height of 1.464 m. The height of the room is 2.7 m.[i]

[i] This figure was originally created by Kris Modrak.

The level of reverberation was experimentally measured by means of a loudspeaker emitting a high-level white noise signal. Measuring the 60 dB decay period of the sound pressure level after the source signal is switched off, for a number of speaker and microphone positions, provided the frequency-averaged reverberation time T60 = 0.39 s. For this T60 measurement, the frequency averaging was carried out over the range $f \in [0, 22050]$ Hz. The level of noise in the room was comparable to typical office noise levels, generated mainly by a computer fan and an air-conditioning vent in the ceiling. The average SNR recorded at the microphones for the various experiments was calculated to be approximately 9.4 dB (ranging in value from 6.8 dB for the noisiest experiment to 16.4 dB for the best).

The sound source was a loudspeaker in an upright position emitting the desired sound signal and moving along a predefined path at a constant height of 1.464 m (distance from the floor to the centre of the speaker cone). For practical reasons, the source trajectory was always a straight line, showing a variety of lengths and orientations (mainly from one side or corner of the room towards the other, within the unused floor area). The dashed line in Figure 2.5 shows a typical example of such a source trajectory. Due to the practical method used to move the loudspeaker in the room, a small source of error may have been introduced when monitoring the position of the speaker for the duration of the recording. The maximum deviation of the actual speaker path from the desired source trajectory was estimated to be less than 0.1 m in every direction. More specific details regarding this practical recording environment can be found in [78].
2.5.2 Image Method Setup

In the next chapters, ASL algorithms are put to the test using audio data generated with the image method [3, 69]. This approach provides a relatively quick and easy way of simulating realistic room impulse responses with a varying level of reverberation. The implicit assumption of this method is that of an empty and purely rectangular enclosure. It is hence impossible to model exactly the practical setup depicted in Figure 2.5. All the model variables required by the image method are however defined to match this setup as closely as possible. The dimensions of the image method enclosure are set to 2.9 m × 3.83 m × 2.7 m, and the coordinates of each sensor location are defined as shown in Figure 2.5 with a constant height of 1.464 m. Finally, the absorption coefficient of each surface defining the room boundaries is set to a uniform value determined by the desired T60 level to be achieved in this "virtual" room.
Chapter 3
Particle Filter Algorithms for Acoustic Source Tracking∗

This chapter formulates a general framework for tracking an acoustic source in a reverberant environment, using particle filtering methods. Four specific algorithms that fit within this framework are proposed, and their performance is assessed based on both simulated reverberant data and real audio data recorded in a typical office room. The experimental results indicate that the proposed family of algorithms is able to accurately track a moving acoustic source in a moderately reverberant room. It is also found that steered beamforming methods have improved performance over correlation-based approaches.
3.1 Introduction

As mentioned in the previous chapter, traditional approaches to the ASL problem collect audio signals from several microphones and use a frame of data (obtained at the current time only) to estimate the current source location. These traditional approaches can be broadly categorised into time delay estimation (TDE) and steered beamforming (SBF) methods. Each technique transforms the received frame of data into a localisation function exhibiting a peak at the location corresponding to the source. The practical disadvantage of these traditional ASL methods is that reverberation causes spurious peaks to occur in the localisation function, as clearly demonstrated in Section 2.2.3. Moreover, these spurious peaks may have greater amplitude than the peak due to the true source, so that simply choosing the global maximum to estimate the source location may not give accurate results.

∗ Most of the research presented in this chapter is the result of a joint collaborative work with Dr. Darren Ward, initiated during a visit to Imperial College, London, UK.

A different technique that overcomes these drawbacks is to use the state-space approach of particle filtering (PF) [116, 120].[a] As described in Section 2.3, the key advantage of this new method is that it incorporates a dynamical model that the true source peak follows from frame to frame, whereas there usually exists no temporal consistency to the spurious peaks. Based on a sequential Monte Carlo technique, particle filters are used to recursively estimate the probability density of the unknown source location conditioned on all received data up to and including the current frame. Related work on using particle filters to track multiple moving targets can be found in [53].

A generic framework for acoustic source localisation (ASL) using particle filters is formulated in this chapter. It assumes the presence of a single acoustic source in a reverberant environment, in which the sensor positions and the speed of wave propagation are known (and constant).

The chapter is organised as follows. The general framework proposed for acoustic source localisation using particle filters is described in the following four sections. In Section 3.6 a detailed summary of four different algorithms that fit within this framework is presented, including those proposed in [116, 120]. Section 3.7 gives a description of the different parameters used to assess the tracking accuracy of each algorithm. Finally, Sections 3.8 and 3.9 present a series of experiments to test the PF algorithms and compare them with classical source localisation approaches. These experiments involve both simulations based on the image method [3, 69] and data obtained from recordings performed in a real office room.
3.2 General PF Framework for ASL

In this section, a generic technique for source localisation based on particle filtering (PF) is formulated. The PF algorithm described in Section 2.3.3 (Algorithm 2.1) is first adapted to fit within this specific framework, and various components of the resulting algorithm are subsequently developed in detail.

[a] See [6] for an overview of other (suboptimal) algorithms based on a state-space approach.
As described in Section 2.4, a first-order model of the source dynamics is considered. The state variable at time $k$ is hence defined as:
$$X_k = \begin{bmatrix} \ell_k \\ \dot{\ell}_k \end{bmatrix}, \tag{3.1}$$
where $\ell_k = [x_k \; y_k]^T$ is the source location in Cartesian coordinates, and $\dot{\ell}_k = [\dot{x}_k \; \dot{y}_k]^T$ is the source velocity. At time $k$, assume that a measurement $Y_k$ of the unobserved state becomes available. This measurement is described by the state-space equation:
$$Y_k = h(X_k, v_k), \tag{3.2}$$
where $h(\cdot)$ is an unknown, not necessarily linear, function of the state $X_k$ and a noise term $v_k$. Assume also that the state is a Markov process, which can be modelled by the state transition relation:
$$X_k = g(X_{k-1}, u_k),$$
where $g(\cdot)$ is a known, not necessarily linear, function of the previous state and a noise term $u_k$.

Physically, the measurement $Y_k$ is obtained through a transformation of some raw data:
$$Y_k = T(\mathcal{Y}_k). \tag{3.3}$$
In the case of acoustic source tracking using a sensor array, the raw data $\mathcal{Y}_k$ typically consists of a series of signal frames obtained from each microphone channel (see Section 2.1.2), and the raw data transformation $T(\cdot)$ is usually based on a localisation function. One should note that Eq. (3.2) is a state-space equation that describes the measurements as a function of the unobserved state, whereas Eq. (3.3) describes how the measurements are physically obtained from the raw data.

Let $Y_{1:k} = \{Y_1, \ldots, Y_k\}$ denote the concatenation of all measurements up to time $k$. The aim is then to recursively estimate the conditional probability density $p(X_k | Y_{1:k})$, from which the source location can be estimated as e.g. the mean or mode. The concept of particle filtering will be used to recursively estimate this PDF, as described in Section 2.3. Within this framework, the general algorithm proposed for source tracking using PF methods is detailed in Algorithm 3.1. This is a standard particle filtering algorithm, and only steps 2, 3 and 4 are specific to
the source tracking problem. In implementing this method, there are hence three algorithmic choices to be made: i) what model to use for the source dynamics in iteration step 2, ii) what localisation function to use in iteration step 3, and iii) how to calculate the likelihood function in iteration step 4. Note that there is also a choice to be made in deciding the precise implementation of the particle filter, although this work will not deal with the many variants of PF methods (refer to [6, 33] for an exhaustive review of these algorithms).

Initialisation: Form an initial set of particles $\{X_0^{(n)}\}_{n=1}^{N}$ and give them uniform weights $\{w_0^{(n)} = N^{-1}\}_{n=1}^{N}$.

Iteration:

1. Resample the particles from the previous frame $\{X_{k-1}^{(n)}\}$ according to their weights $\{w_{k-1}^{(n)}\}$ to form the resampled set of particles $\{\tilde{X}_{k-1}^{(n)}\}_{n=1}^{N}$.

2. Predict the new set of particles $\{X_k^{(n)}\}$ by propagating the resampled set $\{\tilde{X}_{k-1}^{(n)}\}$ according to the dynamical source model.

3. Transform the raw data into localisation measurements: $Y_k = T(\mathcal{Y}_k)$.

4. On the basis of the observation $Y_k$, form the likelihood function $p(Y_k | X_k)$.

5. Weight the new particles according to the likelihood function:
$$w_k^{(n)} = p\big(Y_k | X_k^{(n)}\big),$$
and normalise so that $\sum_n w_k^{(n)} = 1$.

6. Store the particles and their respective weights $\{X_k^{(n)}, w_k^{(n)}\}_{n=1}^{N}$.

Result: Compute the current source location estimate $\hat{\ell}_s$ as the weighted sum of the particle locations:
$$\hat{\ell}_s \triangleq \mathrm{E}\{\hat{\ell}_k\} = \sum_{n=1}^{N} w_k^{(n)} \ell_k^{(n)}.$$

Algorithm 3.1: Generic PF algorithm for acoustic source tracking.
3.3 Source Dynamics

Several dynamical models could be used theoretically to represent the time-varying location of a person moving in a typical room, see e.g. [57]. One that is reasonably simple but has been shown to work well in practice is the Langevin model used in [116], in which the source motion in each of the Cartesian coordinates is assumed to be independent. This model has been described in detail in Section 2.4, and in the sequel, the transition equation defined in Eq. (2.11) will be used in step 2 of the PF algorithm.

The use of this dynamics model calls for a discussion of the specific velocity of the particles in the state space. From the definition of the state variable given in Eq. (3.1), it follows that each particle vector can be sub-divided into location and velocity components:
$$X_k^{(n)} = \begin{bmatrix} \ell_k^{(n)} \\ \dot{\ell}_k^{(n)} \end{bmatrix}, \quad n \in \{1, \ldots, N\}.$$
Typically, the position $\ell_k^{(n)}$ of the nth particle can be regarded as an indication of potential source position in the state space. As shown in Algorithm 3.1, an estimate of the true source location is computed, at the end of each PF iteration, as the weighted average $\ell_k^{(\cdot)}$ value over the set of all particles.

The same does not, however, apply to the velocity component $\dot{\ell}_k^{(n)}$. For the particle filter to efficiently capture any type of motion defined by the dynamical model, the particles must typically be relocated in the state space according to the "most extreme" target deviation allowed by the dynamics from one time step to the next. During propagation in step 2 of Algorithm 3.1, the velocity component of each particle hence results from a random process (see e.g. Eq. (2.10)) with a variance generally much larger than the current velocity of the source. As a result, the average $\dot{\ell}_k^{(\cdot)}$ value computed over the entire particle set is typically approximately zero, regardless of how fast the target is moving. Thus, despite the Langevin dynamics involving both location and velocity information, the specific velocity component of the particles does not necessarily correspond to any quantity of practical interest.
3.4 Localisation Function
The localisation function should be chosen such that it has a maximum corresponding to the true source location. It is likely that due to multipath, the localisation function may also have peaks at false locations. It is also likely that the true source location may not always correspond to the global maximum. There are two classes of possible localisation function, corresponding to the two methods used for conventional source localisation, i.e. TDE and direct methods.
3.4.1 TDE Localisation Function

In TDE localisation, the sensor array is partitioned into $P$ pairs, and for each of them, an estimate $\hat{\tau}_p$, $p \in \{1, \ldots, P\}$, of the time delay of arrival of the source signal can be determined. The generalised cross-correlation function (GCCF) will be considered as a representative localisation function for TDE methods.[b] As defined in Section 2.2.1, the GCCF between the signals obtained from the ith and jth sensors (defining pair $p$) is given as:
$$R_p(\tau) = \int W_R(\omega) \, S_i(\omega) \, S_j^*(\omega) \, e^{j\omega\tau} \, d\omega, \tag{3.4}$$
with the frequency weighting factor $W_R(\omega)$, and $S(\cdot)$ the frequency data corresponding to the time signal $s(\cdot)$. The TDE for that pair is then determined by the global maximum of $R_p(\tau)$:
$$\hat{\tau}_p = \arg\max_{\tau} R_p(\tau).$$

[b] Note that AEDA differs from GCC in that it returns a single time delay estimate, whereas GCC produces a function which has the TDE as the independent variable. The localisation function for AEDA is therefore a delta function, and hence it cannot be used with the pseudo-likelihood function described in Section 3.5, although it could be used with the Gaussian likelihood function.
With this approach, the observation $Y_k$ is defined as a vector of $P$ time delay estimates, which can be expressed as follows for the GCCF example:
$$Y_k \triangleq \Big[ \arg\max_{\tau} R_1(\tau) \;\; \cdots \;\; \arg\max_{\tau} R_P(\tau) \Big]. \tag{3.5}$$

Generalising this concept, assume that for each GCCF $R_p(\tau)$, $p \in \{1, \ldots, P\}$, a maximum of $N_\kappa$ possible TDEs are obtained instead of the single-value measurements of Eq. (3.5). These potential TDEs are typically obtained as the $N_\kappa$ largest local maxima of $R_p(\tau)$, and the κth TDE for the pth sensor pair will be denoted by $\hat{\tau}_p^{(\kappa)}$, $p \in \{1, \ldots, P\}$, $\kappa \in \{1, \ldots, N_\kappa\}$. Eq. (3.5) then becomes:
$$Y_{\mathrm{tde},k} = T_{\mathrm{tde}}(\mathcal{Y}_k) \triangleq \begin{bmatrix} \hat{\tau}_1^{(1)} & \hat{\tau}_2^{(1)} & \cdots & \hat{\tau}_P^{(1)} \\ \hat{\tau}_1^{(2)} & \hat{\tau}_2^{(2)} & \cdots & \hat{\tau}_P^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ \hat{\tau}_1^{(N_\kappa)} & \hat{\tau}_2^{(N_\kappa)} & \cdots & \hat{\tau}_P^{(N_\kappa)} \end{bmatrix}. \tag{3.6}$$

The observation defined in Eq. (3.6) will be used for TDE methods in step 3 of the PF algorithm. Note that because the AED algorithm only delivers one TDE per pair, this technique will not be implemented using the PF concept, which is typically designed to deal efficiently with multi-modal observations. However, the TDE observation of Eq. (3.6) can be defined quite generally and can potentially incorporate the results from several different TDE algorithms simultaneously. For example, one of the $N_\kappa$ potential TDEs could be computed using AEDA, with the remaining $N_\kappa - 1$ TDEs obtained from the peaks of the GCC.
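One simple way of extracting the $N_\kappa$ candidate delays from a sampled GCCF is sketched below, as an illustration only. The helper `largest_local_maxima`, the lag axis and the synthetic two-peak function standing in for $R_p(\tau)$ are all assumptions made for the example.

```python
import numpy as np

def largest_local_maxima(values, n_kappa):
    """Indices of up to n_kappa largest local maxima, by decreasing height."""
    values = np.asarray(values, dtype=float)
    interior = np.arange(1, len(values) - 1)
    # A sample is a peak if it exceeds its left neighbour and is not
    # smaller than its right neighbour.
    is_peak = (values[interior] > values[interior - 1]) & \
              (values[interior] >= values[interior + 1])
    peaks = interior[is_peak]
    order = np.argsort(values[peaks])[::-1]       # highest peak first
    return peaks[order[:n_kappa]]

# Synthetic stand-in for one sampled GCCF: a main peak and a weaker one.
lags = np.linspace(-1.0, 1.0, 201)                # hypothetical lag axis in ms
R_p = np.exp(-((lags - 0.3) / 0.05) ** 2) \
    + 0.6 * np.exp(-((lags + 0.5) / 0.05) ** 2)
tau_candidates = lags[largest_local_maxima(R_p, n_kappa=2)]
```

Applying the same function to each of the $P$ sensor pairs fills one column of the observation matrix in Eq. (3.6).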
3.4.2 Direct Localisation Function

As an example of a localisation function obtained from a direct method, the output $\mathcal{P}(\ell)$ of a steered beamformer will be considered, with:
$$\mathcal{P}(\ell) = \int \left| \sum_{m=1}^{M} W_P(\omega) \, H_m(\ell, \omega) \, S_m(\omega) \right|^2 d\omega, \tag{3.7}$$
where $W_P(\omega)$ is a frequency weighting factor and $H_m(\ell, \omega)$ is the steering delay for the mth sensor (see Section 2.2.2). Using this technique, the observation $Y_k$ typically results from a maximisation of the SBF output function:
$$Y_k \triangleq \arg\max_{\ell} \mathcal{P}(\ell).$$

Here again, assume that from the SBF measurement, $N_\kappa$ potential source location vectors are obtained as the largest local maxima of $\mathcal{P}(\ell)$, and denote the κth potential location as $\hat{\ell}^{(\kappa)}$. The raw data transformation in step 3 of Algorithm 3.1 for a direct method hence becomes:
$$Y_{\mathrm{sbf},k} = T_{\mathrm{sbf}}(\mathcal{Y}_k) \triangleq \Big[ \hat{\ell}^{(1)} \;\; \cdots \;\; \hat{\ell}^{(N_\kappa)} \Big].$$
3.5 Likelihood Function
For a given state X k , the likelihood function measures the probability of receiving the data Y k . The likelihood function should be chosen to reflect the fact that peaks in the localisation function correspond to likely source locations. It should also reflect the fact that occasionally there may be no peak in the localisation function corresponding to the true source location (such as when the source is silent). The position of the peak may also have slight errors due to noise and sensor calibration errors. The following two classes of likelihood function are proposed.
3.5.1 Gaussian Likelihood

The Gaussian likelihood function developed here is essentially identical to that used in [116]. If $N_\kappa$ potential locations have been obtained from the localisation function, then the Gaussian likelihood function is formed by assuming that either one of these potential locations is due to the true source location corrupted by additive Gaussian noise, or none of the potential locations is due to the true source location. There are two possible ways of forming the likelihood function, depending on whether a TDE or direct localisation function is used.

i) For a direct localisation function, the likelihood function is formed as:
$$p(Y_k | X_k) = q_0 + \sum_{\kappa=1}^{N_\kappa} q_\kappa \, \mathcal{N}\left( \ell_k; \hat{\ell}^{(\kappa)}, \begin{bmatrix} \sigma^2 & 0 \\ 0 & \sigma^2 \end{bmatrix} \right), \tag{3.8}$$
where $\mathcal{N}(\xi; \mu, \Sigma)$ denotes the distribution of a multi-dimensional Gaussian variable with mean vector $\mu$ and covariance matrix $\Sigma$ evaluated at $\xi$. The $N_\kappa$ potential locations obtained from the localisation function are denoted by $\hat{\ell}^{(\kappa)}$, $\kappa \in \{1, \ldots, N_\kappa\}$, and $\ell_k$ is the localisation parameter corresponding to the state $X_k$.

The value of $q_0 \in [0, 1]$ is the prior probability that none of the potential locations is due to the true source,[c] and $q_\kappa \in [0, 1]$ is the prior probability that the κth potential location is the true source location. Without prior knowledge of likely source locations, one would typically choose:
$$q_\kappa = \frac{1 - q_0}{N_\kappa}, \quad \kappa \in \{1, \ldots, N_\kappa\}.$$
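Under the uniform choice of $q_\kappa$ given above, Eq. (3.8) can be evaluated for a whole particle set at once. The following is a minimal sketch; the function name `gaussian_likelihood_direct` and the numerical values of $\sigma$, $q_0$ and the candidate locations are assumptions made for the example.

```python
import numpy as np

def gaussian_likelihood_direct(particle_locs, candidates, sigma, q0):
    """Evaluate Eq. (3.8) for N particle locations and N_kappa candidates."""
    particle_locs = np.atleast_2d(particle_locs)    # shape (N, 2)
    candidates = np.atleast_2d(candidates)          # shape (N_kappa, 2)
    q_kappa = (1.0 - q0) / len(candidates)          # uniform prior over candidates
    diff = particle_locs[:, None, :] - candidates[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)             # shape (N, N_kappa)
    # 2-D Gaussian density with covariance sigma^2 * I.
    gauss = np.exp(-0.5 * sq_dist / sigma ** 2) / (2.0 * np.pi * sigma ** 2)
    return q0 + q_kappa * gauss.sum(axis=1)

# Hypothetical values: two candidate peaks (metres) and three particles.
candidates = np.array([[1.0, 2.0], [2.5, 0.5]])
particles = np.array([[1.0, 2.0], [1.1, 1.9], [0.0, 3.5]])
weights = gaussian_likelihood_direct(particles, candidates, sigma=0.1, q0=0.25)
weights /= weights.sum()                            # normalise as in step 5
```

A particle far from every candidate still receives the floor value $q_0$, so it is not eliminated outright when the true source peak is temporarily missing.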
ii) For a TDE localisation function, the likelihood function for the pth sensor pair is defined as:
$$f_p(Y_k | X_k) = q_0 + \sum_{\kappa=1}^{N_\kappa} q_\kappa \, \mathcal{N}\big( \tau_{k,p}; \hat{\tau}_p^{(\kappa)}, \sigma^2 \big), \tag{3.9}$$
where $\hat{\tau}_p^{(\kappa)}$, $p \in \{1, \ldots, P\}$, $\kappa \in \{1, \ldots, N_\kappa\}$, is the κth potential TDE obtained from the pth sensor pair. With the ith and jth sensors constituting the pth sensor pair, the TDE $\tau_{k,p}$ corresponding to the state $X_k$ is:
$$\tau_{k,p} = c^{-1} \big( \|\ell_k - \ell_i\| - \|\ell_k - \ell_j\| \big). \tag{3.10}$$
Assuming that the measurements across sensor pairs are independent, the complete likelihood function becomes:
$$p(Y_k | X_k) = \prod_{p=1}^{P} f_p(Y_k | X_k). \tag{3.11}$$

3.5.2 Pseudo-Likelihood
The idea behind this approach is that the localisation function itself is typically a continuous function which could be used directly as the basis of a pseudo-likelihood. As a result, a pseudo-likelihood function is usually not properly normalised, which is not a problem as long as it is understood that this type of function represents a density and not a probability. A lower bound is included to allow for the case where no peak in the localisation function corresponds to the true source location. Again, there are two possible ways of forming the likelihood function, depending on whether a TDE or direct localisation function is used.

Footnote c: A larger value of $q_0$ indicates that the true source location is often not among the $N_\kappa$ candidates. This is likely in cases where there is a high level of reverberation, or the source is often silent.

i) For a localisation function obtained from the SBF principle, the proposed pseudo-likelihood function is given by:
\[
p(\mathbf{Y}_k | \mathbf{X}_k) = \left( \max \left\{ P(\boldsymbol{\ell}_k), q_0 \right\} \right)^r, \tag{3.12}
\]
where $\boldsymbol{\ell}_k$ is the state location vector, $q_0 > 0$, and $r \in \mathbb{R}^+$. The purpose of $r$ is to help shape the localisation function to make it more amenable to recursive estimation, and the design parameter $q_0$ fulfils here a role similar to that of the same parameter in the Gaussian likelihood (i.e. background probability that none of the peaks in $P(\boldsymbol{\ell})$ is due to the true source).

ii) For a TDE localisation function, the pseudo-likelihood function is defined as follows:
\[
p(\mathbf{Y}_k | \mathbf{X}_k) = \prod_{p=1}^{P} f_p(\mathbf{Y}_k | \mathbf{X}_k), \tag{3.13}
\]
where the pseudo-likelihood function for the $p$th sensor pair is defined in a manner similar to Eq. (3.12):
\[
f_p(\mathbf{Y}_k | \mathbf{X}_k) = \left( \max \left\{ R_p(\tau_{k,p}), q_0 \right\} \right)^r, \tag{3.14}
\]
with $\tau_{k,p}$ given by Eq. (3.10), $q_0 > 0$, and $r \in \mathbb{R}^+$. Here, $q_0$ also ensures that the resulting pseudo-likelihood function remains non-negative, since e.g. the GCCF might deliver negative values.
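As a minimal sketch of Eqs. (3.12) to (3.14), the following Python fragment evaluates the pseudo-likelihood at a single particle. The function names and the numerical values are assumptions chosen for illustration only; the localisation-function values (beamformer power or GCCF samples) are taken as given inputs rather than computed here.

```python
import math

def pseudo_likelihood(value, q0, r):
    """Eqs. (3.12)/(3.14): clip a localisation-function value at q0, then raise to r."""
    # The text states q0 > 0; q0 = 0 is tolerated here because the clipped
    # value is still non-negative in that case.
    if q0 < 0 or r <= 0:
        raise ValueError("q0 must be non-negative and r strictly positive")
    return max(value, q0) ** r

def tde_pseudo_likelihood(gcc_values, q0, r):
    """Eq. (3.13): product of the per-pair pseudo-likelihoods over all sensor pairs."""
    return math.prod(pseudo_likelihood(v, q0, r) for v in gcc_values)

# Hypothetical GCCF values R_p(tau_{k,p}) for P = 3 sensor pairs at one particle.
gcc_at_particle = [0.8, -0.2, 0.35]
weight = tde_pseudo_likelihood(gcc_at_particle, q0=0.01, r=0.5)
```

In this example the negative GCCF value of the second pair is replaced by the floor q0 before exponentiation, which is what keeps the product well defined for a non-integer r.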
3.5.3
Discussion
The Gaussian likelihood has the advantage that it treats all peaks as being equally likely, which is a valid practical assumption: with no prior knowledge regarding the true source position, it is theoretically not possible to determine which peak in the localisation function really corresponds to the source. A relatively important
drawback of the Gaussian likelihood function is that it requires a search over the localisation function to find a series of local peaks. Since only a pointwise evaluation of the likelihood function is necessary in the PF implementation (numerical likelihood values are only to be computed at the current positions of the particles, see e.g. step 5 in Algorithm 3.1), this constitutes unnecessary computations that will ultimately decrease the overall efficiency of the tracking algorithm. The pseudo-likelihood does not require such a search (likelihood values are computed directly from the localisation function), but it imposes a weighting on the possible source positions according to the peaks in the localisation function. In other words, a larger peak will be implicitly treated as a more likely source location than a smaller peak, which may not always be correct or advantageous.
3.6
Proposed Algorithms
The framework developed so far is rather general and can be implemented using a wide range of TDE or direct localisation schemes. To clarify this development, four specific PF algorithms are now summarised corresponding to each of the likelihood–localisation pairs proposed in the previous subsections.
3.6.1
GCC Localisation with Gaussian Likelihood (GCC-GL)
The sensor array is organised into P pairs. The PF method in Algorithm 3.1 is then implemented with iteration steps 3 and 4 as follows:

Step 3: for each sensor pair, calculate the GCCF according to Eq. (3.4) across a set of candidate time delays. Find the $N_\kappa$ largest local maxima in the GCCF and denote the corresponding time delays as $\hat{\tau}_p^{(\kappa)}$, $p \in \{1, \ldots, P\}$, $\kappa \in \{1, \ldots, N_\kappa\}$.

Step 4: for each resampled state $\mathbf{X}_k^{(n)}$, calculate the likelihood function $p(\mathbf{Y}_k | \mathbf{X}_k^{(n)})$ using Eqs. (3.9), (3.10) and (3.11).
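Step 4 of this algorithm can be sketched in Python as follows. This is an illustrative fragment only: the function names, the value used for the speed of sound and the two-dimensional tuple representation of positions are assumptions, and the candidate TDEs of Step 3 are supplied as inputs rather than extracted from a GCCF. The uniform choice of the prior weights follows the rule given for the Gaussian likelihood, and each sensor pair is assumed to provide at least one candidate.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for c in Eq. (3.10)

def gaussian_pdf(x, mean, sigma):
    """One-dimensional Gaussian density N(x; mean, sigma^2)."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def tde_for_state(source, mic_i, mic_j, c=SPEED_OF_SOUND):
    """Eq. (3.10): time delay implied by a candidate source position."""
    return (math.dist(source, mic_i) - math.dist(source, mic_j)) / c

def gcc_gl_likelihood(source, mic_pairs, tde_candidates, q0, sigma):
    """Eqs. (3.9) and (3.11): product over sensor pairs of a Gaussian mixture plus q0."""
    likelihood = 1.0
    for (mic_i, mic_j), candidates in zip(mic_pairs, tde_candidates):
        tau = tde_for_state(source, mic_i, mic_j)
        q_kappa = (1.0 - q0) / len(candidates)  # uniform prior over the candidates
        f_p = q0 + sum(q_kappa * gaussian_pdf(tau, t_hat, sigma)
                       for t_hat in candidates)
        likelihood *= f_p
    return likelihood

# Hypothetical example: one sensor pair and a particle equidistant from both sensors.
pairs = [((0.0, 0.0), (1.0, 0.0))]
particle = (0.5, 2.0)
value = gcc_gl_likelihood(particle, pairs, [[0.0]], q0=0.4, sigma=1.5e-4)
```

The function would be called once per particle, so its cost grows with the number of particles, the number of sensor pairs and the number of candidates per pair.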
3.6.2
GCC Localisation with Pseudo-Likelihood (GCC-PL)
The sensor array is organised into P pairs. The PF method in Algorithm 3.1 is then implemented with iteration steps 3 and 4 as follows:

Step 3: for each sensor pair, calculate the GCCF according to Eq. (3.4) only at the time delays corresponding to the resampled states $\mathbf{X}_k^{(n)}$, where these time delays are found using Eq. (3.10).

Step 4: for each resampled state $\mathbf{X}_k^{(n)}$, calculate the likelihood function $p(\mathbf{Y}_k | \mathbf{X}_k^{(n)})$ using Eqs. (3.13) and (3.14).
3.6.3
SBF Localisation with Gaussian Likelihood (SBF-GL)
The PF method in Algorithm 3.1 is implemented with iteration steps 3 and 4 as follows:

Step 3: calculate the steered beamformer output power of Eq. (3.7) over a grid of candidate source locations. Find the $N_\kappa$ largest local maxima in the output power function and denote the corresponding location vectors as $\hat{\boldsymbol{\ell}}^{(\kappa)}$, $\kappa \in \{1, \ldots, N_\kappa\}$.

Step 4: for each resampled state $\mathbf{X}_k^{(n)}$, calculate the likelihood function $p(\mathbf{Y}_k | \mathbf{X}_k^{(n)})$ using Eq. (3.8).
3.6.4
SBF Localisation with Pseudo-Likelihood (SBF-PL)
The PF method in Algorithm 3.1 is implemented with iteration steps 3 and 4 as follows:

Step 3: calculate the steered beamformer output power only at the set of location vectors corresponding to the resampled states $\mathbf{X}_k^{(n)}$.

Step 4: for each resampled state $\mathbf{X}_k^{(n)}$, calculate the likelihood function $p(\mathbf{Y}_k | \mathbf{X}_k^{(n)})$ using Eq. (3.12).
3.6.5
Discussion of the Proposed Algorithms
Algorithm GCC-GL requires the calculation of P separate GCC functions across a set of time delays. It also requires P one-dimensional searches to find the candidate TDEs. This algorithm is essentially that used in [116]. Note that the GCC can be implemented efficiently using the fast Fourier transform (FFT), although in this case the time delays would be restricted to a specific set of values (determined by the sampling frequency and the number of points in the FFT).
Algorithm GCC-PL requires the calculation of P separate GCC functions only at the specific time delays corresponding to the resampled states. It is not necessary to perform any searches. Because the time delays are determined by the resampled states, however, the FFT would not be efficient when computing the GCC with this method.

Algorithm SBF-GL requires the calculation of a steered beamformer response over the set of all possible source locations (this set is potentially very large). Furthermore, it requires a multi-dimensional search to find the candidate source locations. This particular PF algorithm is most likely to be too computationally demanding to be viable.

Algorithm SBF-PL requires the calculation of the steered beamformer response only at the locations corresponding to the resampled states, and hence no multi-dimensional searches are required. This algorithm was used in [120].
3.7
Analysis of the Tracking Accuracy
The following two sections present the results from a series of experiments using simulated as well as real audio data. These experiments were performed in order to determine the performance of the various algorithms proposed in Section 3.6, together with traditional algorithms presented in Section 2.2. These tests allow for a comparative assessment of the tracking ability of each method when used in a moderately reverberant and noisy environment. It must be emphasized that the PF results given in Sections 3.8 and 3.9 are obtained from the algorithms operating in tracking mode only. Higher level problems (such as dealing with long speech pauses or determining the initial distribution of the particle set) are of course important for a functional system, but these issues are not examined in the present work. Consequently, the particle set for each PF algorithm is initialised by placing all the particles at the start location of the sound source in the room. This way, the unpredictable effects of a uniform initial particle distribution are reduced to a negligible level, and the performance results ultimately provide a measure of the algorithms’ ability to track a moving source only. In Chapter 4, a PF algorithm will be developed which is able to deal with problems such as automatic track acquisition (target detection) and random initialisation. Three parameters have been implemented in order to provide a reproducible
and algorithm-independent measure of tracking performance. These parameters are briefly described in this section. Only the first parameter (root mean square error) is applicable to the traditional localisation methods described in Section 2.2; the other two assessment parameters are based on the specific distribution of the particles for PF algorithms.
3.7.1
Root Mean Square Error (RMSE)
For each frame of raw data $\mathbf{Y}_k$ received from the sensors, the tracking algorithm delivers an estimate $\hat{\boldsymbol{\ell}}_s \triangleq \hat{\boldsymbol{\ell}}_{s,k}$ of the current source location. The square error $\varepsilon_k$ for time frame $k$ is computed as:
\[
\varepsilon_k = \left\| \boldsymbol{\ell}_s - \hat{\boldsymbol{\ell}}_s \right\|^2 ,
\]
where $\boldsymbol{\ell}_s = [x_s \; y_s]^T$ corresponds to the true source position. The RMSE parameter then corresponds to the square root of the average value of the variable $\varepsilon_k$ over the total number of frames $K$ in the audio sample:
\[
\mathrm{RMSE} \triangleq \sqrt{\bar{\varepsilon}_k} = \sqrt{ \frac{1}{K} \sum_{k=1}^{K} \varepsilon_k } .
\]
This parameter gives an indication about how much the source location estimate deviates from the true source position. A high RMSE value hence always reflects an inaccurate tracking ability.
3.7.2
Mean Standard Deviation (MSTD)
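For each time frame $k$, the standard deviation $\varsigma_k$ of a particle set around its estimate $\hat{\boldsymbol{\ell}}_s$ is defined as follows:
\[
\varsigma_k = \sqrt{ \sum_{n=1}^{N} w_k^{(n)} \left\| \boldsymbol{\ell}_k^{(n)} - \hat{\boldsymbol{\ell}}_s \right\|^2 } .
\]
In a manner similar to the RMSE parameter, the MSTD value corresponds to the variable $\varsigma_k$ averaged over the total number of frames processed by the algorithm:
\[
\mathrm{MSTD} \triangleq \bar{\varsigma}_k = \frac{1}{K} \sum_{k=1}^{K} \varsigma_k .
\]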
The MSTD parameter is a measure quantifying the accuracy of the estimated source position delivered by the particle filter. A large $\varsigma_k$ value means that the position estimate $\hat{\boldsymbol{\ell}}_s$ results from a widely spread particle set, indicating a low level of estimation certainty.
3.7.3
Frame Convergence Ratio (FCR)
The term convergence is first defined as follows. For time frame $k$, a particle filter is said to be converging toward the true source position $\boldsymbol{\ell}_s$ if this latter lies within one standard deviation $\varsigma_k$ from the estimated source location $\hat{\boldsymbol{\ell}}_s$. In other words, a particle filter is convergent if the following inequality is verified:
\[
\left\| \boldsymbol{\ell}_s - \hat{\boldsymbol{\ell}}_s \right\| \leq \varsigma_k + \delta , \tag{3.15}
\]
where $\delta$ accounts for the inaccuracy of the source position measurements during the process of recording the audio samples. The parameter FCR is defined as the percentage of frames for which the particle filter has been found to converge, over the entire audio sample length. In mathematical terms, this can be expressed as:
\[
\mathrm{FCR} \triangleq \frac{1}{K} \sum_{k=1}^{K} \xi_k , \qquad \text{with} \quad
\xi_k = \begin{cases} 1 & \text{if } \left\| \boldsymbol{\ell}_{s,k} - \hat{\boldsymbol{\ell}}_{s,k} \right\| \leq \varsigma_k + \delta , \\ 0 & \text{otherwise} . \end{cases}
\]
It must be noted here that the FCR value depends indirectly on the MSTD parameter. If the particles are widely spread around the source location estimate, the probability of the true source lying within one standard deviation of the estimate is higher, which in turn implies a higher FCR value. This implicitly allows the convergence criterion to become less strict when the level of estimation confidence of the particle filter decreases. It should be kept in mind however that as a result, a high FCR percentage may partly result from a large MSTD value.
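The three assessment parameters can be computed together from the per-frame tracking output, as in the following Python sketch. The function name, the argument layout (lists of two-dimensional positions, with one list of particle locations and normalised weights per frame) and the example values are assumptions made for illustration. The FCR is returned as a fraction between 0 and 1 and would be multiplied by 100 to obtain a percentage.

```python
import math

def tracking_metrics(true_pos, est_pos, particles, weights, delta=0.0):
    """Return (RMSE, MSTD, FCR) over K frames of two-dimensional tracking results."""
    num_frames = len(true_pos)
    sq_err_sum, std_sum, converged = 0.0, 0.0, 0
    for k in range(num_frames):
        err = math.dist(true_pos[k], est_pos[k])
        sq_err_sum += err ** 2
        # Weighted spread of the particle set around the estimate (Section 3.7.2).
        std_k = math.sqrt(sum(w * math.dist(p, est_pos[k]) ** 2
                              for p, w in zip(particles[k], weights[k])))
        std_sum += std_k
        # Convergence test of Eq. (3.15).
        if err <= std_k + delta:
            converged += 1
    rmse = math.sqrt(sq_err_sum / num_frames)
    return rmse, std_sum / num_frames, converged / num_frames

# Hypothetical two-frame example with two equally weighted particles per frame.
truth = [(0.0, 0.0), (1.0, 1.0)]
estimates = [(0.1, 0.0), (1.0, 1.5)]
particle_sets = [[(0.1, 0.2), (0.1, -0.2)], [(1.0, 1.5), (1.0, 1.5)]]
weight_sets = [[0.5, 0.5], [0.5, 0.5]]
rmse, mstd, fcr = tracking_metrics(truth, estimates, particle_sets, weight_sets, delta=0.1)
```

In this example the first frame satisfies Eq. (3.15) while the second does not, because the particle set has collapsed onto an estimate that lies 0.5m away from the true position.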
3.8
Image Method Simulations
This section presents a few examples of the tracking results obtained using synthetic audio data for the classical GCC, AEDA and SBF approaches. Results obtained
from algorithms SBF-PL and GCC-GL are shown as generic representatives of the PF methods. A purely two-dimensional (2D) tracking problem is considered, where the height of the source is defined to be the same as the height of all the microphones.
3.8.1
Simulation Setup
For all the results presented in this section, the audio data at each sensor was obtained using the image method for simulating small-room acoustics [3] for a set of reverberation times T60 ranging from 0 to 0.79s. The largest tested value of T60 = 0.79s was determined more or less arbitrarily based on various constraints of the image method’s software implementation. The approximation of the source-to-sensor transfer functions delivered by this technique might start to break down when setting the parameter T60 to large numerical values (while keeping the volume of the room constant). The simulation times required to generate the audio samples also become impractical as a consequence of the increasingly large number of image sources involved in the computations. In any case, this maximum T60 value corresponds to a substantial reverberation level for typical office rooms.

The simulation setup was defined to match the experimental setup used in Section 2.5.1 as closely as possible: the room dimensions were set to 2.9m × 3.83m × 2.7m, and 8 microphones were used in total, the positions of which are defined as shown in Figure 2.5 at the constant height of 1.464m.

A single sample of simulated audio data was used to obtain the results presented in this section. The corresponding source trajectory was a straight line oriented from one corner of the room towards the opposite corner, with a total distance of approximately 1.6m. The source signal used was the speech utterance “Draw every outer line first, then fill in the interior” pronounced by a male speaker and looped twice, yielding a length of about 7.2s. The source signal was split into 120 frames along the source path, resulting in a frame length of about 60.4 · 10^−3 s. The data received at each sensor was obtained by convolving these frames of source signal with the corresponding impulse responses computed with the image method between the source and sensor positions. After recombining the convolution results, random Gaussian noise was finally added to each microphone signal yielding an SNR level of about 20dB. Figure 3.1 shows a typical signal generated with the image method using this setup, for a reverberation time of approximately T60 = 0.13s.
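The final noise-addition step can be illustrated with a short Python sketch. It assumes that the SNR is defined as the ratio of the average signal power to the noise variance over the whole sample, which the text does not state explicitly; the function name, the sinusoidal test signal and the fixed random seed are likewise hypothetical choices.

```python
import math
import random

def add_noise_at_snr(signal, snr_db, rng):
    """Add zero-mean white Gaussian noise so that the resulting SNR is about snr_db."""
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise_std = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, noise_std) for s in signal]

rng = random.Random(0)  # fixed seed so that the sketch is repeatable
# Dummy stand-in for a reverberant microphone signal: one second of a 440 Hz tone.
clean = [math.sin(2.0 * math.pi * 440.0 * n / 8000.0) for n in range(8000)]
noisy = add_noise_at_snr(clean, snr_db=20.0, rng=rng)
```

Because the noise is random, the SNR measured on a finite signal only approximates the requested value, which is in line with the "about 20dB" wording used above.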
Figure 3.1: Typical example of microphone signal simulated with the image method for T60 = 0.13s. (Horizontal axis: time (s).)
3.8.2
Simulation Results
Each of the methods under investigation in this section was simulated for a variety of T60 values and using the setup described above. Figure 3.2 shows typical tracking results obtained for some of the algorithms for T60 = 0.13s. The tracking quality of the classical methods (especially GCC and SBF) rapidly degrades for increasing reverberation times, and plots of the tracking results for these methods become practically unusable for larger T60 values. Figure 3.3 gives a good insight into this kind of behaviour. It shows the RMSE obtained for each algorithm plotted as a function of the T60 value resulting from the image method computations. For the two PF algorithms (SBF-PL and GCC-GL), the RMSE plotted in this graph corresponds to the average RMS error resulting from 100 algorithm runs using the same audio sample and simulation setup.
3.8.3
Discussion
As clearly depicted in Figures 3.2 and 3.3, methods based on a sequential Monte Carlo principle show a distinct improvement in tracking ability compared to more traditional source localisation methods. It can be seen in Figure 3.3 that for practically relevant levels of performance (i.e. for RMSE values close to zero), PF-based methods are able to deal with reverberation levels two to three times higher than classical localisation methods. Also, as has been previously reported [8], AEDA usually gives the best results among the classical techniques. It must be noted however that both the classical AEDA and GCC methods deliver unusually large RMSE values for T60 → 0 (including the anechoic case T60 = 0). This result is the
direct consequence of the two-step concept behind these algorithms. Despite the GCCF and AEDA providing accurate time delay estimates for each sensor pair, the process of combining these TDEs to determine the estimated source location generates additional noise in the results. This clearly illustrates another weakness of TDE methods where the inaccuracies of both stages in the algorithm add up to generate a larger overall tracking error.

Figure 3.2: Example of tracking results for two classical methods (top plots) and two particle filters (bottom plots). Solid lines are estimates of the source location, dotted lines represent the true source position (in x and y dimensions). The audio sample was simulated with the image method for a reverberation time T60 = 0.13s. (Panels: classical GCC method, classical AEDA method, GCC-GL method and SBF-PL method; each panel shows x-position (m) and y-position (m) versus time (s).)

As shown in Figure 3.2, the presence of outliers in the observations computed from the raw data appears as spurious peaks in the tracking results of the classical methods. The frequency of these peaks increases dramatically as the reverberation time becomes larger, which results in a deterioration of the overall tracking ability. On the other hand, PF-based methods provide an efficient way of filtering these
peaks out and hence prove to be consistently more robust against reverberation and noise compared to the traditional source localisation methods.

Figure 3.3: RMS error for three classical and two PF-based source localisation methods, plotted versus reverberation time T60. The audio data was simulated with the image method. (Axes: RMSE (m) versus T60 (s); legend: SBF, GCC, AEDA, SBF-PL, SBF-GL.)
3.9
Real Audio Experiments
In the previous section, the tracking accuracy of classical and PF-based methods was determined using synthetic audio data. In this section, real audio samples recorded in a typical office room are used to assess the tracking performance of these algorithms when used in a moderately reverberant environment. Again, only a two-dimensional tracking problem is considered.
3.9.1
Experimental Hardware Setup
The recording environment used in these experiments is described in detail in Section 2.5.1. To summarise the main simulation parameters, the size of the enclosure was roughly 2.9m × 3.83m × 2.7m, with some irregular encased and protruding shapes (window frames, door, column, etc.). A number of office furniture objects were also present in the room during the recordings. The practically measured, frequency-averaged reverberation time was T60 = 0.39s. The sensor array consisted of 8 microphones organised as one pair per wall, as depicted in Figure 2.5. Due to various noise sources in this environment (computers, air vents, etc.), the average
SNR level was measured to be approximately 9.4dB in the resulting recordings. Because of the practical method used to move the sound source in the room, a small source of error may have been introduced when monitoring the position of the speaker for the duration of the recording [78]. The maximum deviation of the actual speaker path from the desired source trajectory was estimated to be less than 10cm in every direction, and the measurement inaccuracy parameter δ in Eq. (3.15) was therefore set to 0.1m. The sensor signals were all sampled at 8kHz and band-pass filtered between 300 and 3000Hz prior to source localisation processing. The audio samples used as source signals were speech utterances by male speakers with a sample length varying from 3.6s to 7.5s. Figure 3.4 shows a typical example of sensor signal recorded as the loudspeaker was being moved with constant velocity across the room. The source signal was the speech sample “Draw every outer line first, then fill in the interior” (taken from the TIMIT database) pronounced by a male speaker.

Figure 3.4: Example of microphone signal recorded in a real office room (with frequency-averaged reverberation time T60 ≈ 0.4s). (Horizontal axis: time (s).)
3.9.2
Experimental Software Setup
The main objective of the simulations presented in this section is to give a comparison of the classical and PF-based methods described previously with each other. In order to ensure a fair algorithm comparison, the parameters of each PF algorithm were independently tuned using a reference audio sample to achieve the best particle filter performance. This process was done empirically by running each algorithm a number of times with varying parameters until satisfactory performance was achieved. Table 3.1 presents the parameter settings chosen for each PF algorithm. Because SBF measurements can never become negative, the threshold parameter q0 is not practically relevant for the case of SBF-PL and this value was
hence simply set to zero. For algorithm SBF-GL, the beamformer output was computed over a set of candidate locations distributed on a grid across the room with a uniform spacing of 0.15m.

Algorithm   N    q0     r     Nκ   σ
SBF-PL      30   0      3     –    –
GCC-PL      30   0.01   0.5   –    –
SBF-GL      25   0.5    –     4    0.25
GCC-GL      30   0.4    –     3    1.5 · 10^−4

Table 3.1: Parameter settings defined for each PF algorithm.

In each algorithm (classical methods included), the incoming sensor signals were split into frames of L = 512 samples (corresponding to a frame length of 64 · 10^−3 s) multiplied by a Hamming window, and data processing was carried out using a frame overlapping factor of 50%. For the classical SBF method, the output power function of the steered beamformer was computed on a uniformly spaced grid of points across the room with a 0.02m spacing. Given the relatively small dimensions of the room, the steady-state velocity variable v̄ in the source dynamics model (see Section 2.4) was set to 0.7m/s.
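The framing described above can be sketched in Python as follows. The function names are hypothetical, the window uses the common 0.54/0.46 Hamming definition (an assumption, since the exact window coefficients are not given in the text), and trailing samples that do not fill a complete frame are simply dropped in this sketch.

```python
import math

def hamming(length):
    """Symmetric Hamming window of the given length (0.54 / 0.46 definition)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]

def split_into_frames(signal, frame_len=512, overlap=0.5):
    """Split a signal into overlapping frames, each multiplied by a Hamming window."""
    hop = int(frame_len * (1.0 - overlap))  # 256 samples for a 50% overlap
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append([signal[start + n] * window[n] for n in range(frame_len)])
    return frames

fs = 8000                    # sampling frequency in Hz, as stated for the recordings
frame_duration = 512 / fs    # 0.064 s per frame
frames = split_into_frames([1.0] * fs)  # one second of a dummy constant signal
```

With a 50% overlap, a new frame starts every 256 samples, so that the localisation function is updated every 32 · 10^−3 s even though each frame spans 64 · 10^−3 s.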
3.9.3
Experimental Results
To illustrate some of the simulation results, some typical plots obtained from algorithm SBF-PL using a sample of real audio data are first presented. Figure 3.5 shows an example of the function used as pseudo-likelihood plotted for one signal frame over the entire two-dimensional state-space (footnote d). This plot shows clearly the multi-hypothesis character of the observation: the peak associated with the true source is located at the position [x y] = [0.75 2.3] (in m); the other peaks are clutter measurements due to reverberation. Figure 3.6 presents the tracking result in the x and y coordinates for a 3.8s run of algorithm SBF-PL. It demonstrates the ability of this method to accurately track the sound source across the room despite the relatively high level of reverberation. This kind of result typically yields tracking quality values of RMSE = 0.114m, MSTD = 0.094m and FCR = 95%.

Footnote d: Note that this figure is shown for illustration purposes only; with algorithm SBF-PL, the likelihood function is evaluated only at the particles’ positions.

Figure 3.5: Example of beamformer output function used as pseudo-likelihood in algorithm SBF-PL, for one signal frame. The peak generated by the acoustic source is located at the position [x y] = [0.75m 2.3m]. (Axes: x-axis (m) and y-axis (m).)
Finally, results from a comparative assessment of the different classical and PF-based methods are depicted in Figure 3.7. These results have been obtained in the following manner. Each of the four particle filters under test was run 100 times with each one of six different samples of real audio data (which implicitly corresponds to a variety of source signals and trajectories). Since a different level of performance is usually achieved for different source signals or different paths, the results obtained for each of the audio samples are given separately. Figure 3.7 presents the results obtained for the performance assessment parameters averaged over the 100 real audio simulations. As for the classical results presented in this figure (i.e. SBF, GCC and AEDA), a single run of each algorithm has been used to generate the RMSE value for each audio sample. Contrary to PF-based methods (where the resampling and prediction steps introduce some degree of randomness), these classical methods generate the exact same tracking results when applied twice to the same audio sample.
3.9.4
Discussion
In Figure 3.7, the noticeable differences in the overall performance results from one audio sample to the other reflect a variable degree of tracking difficulty for the algorithms. This typically results from the specific quality of the audio signals and trajectory of the sound source. As expected, these comparative results also
demonstrate the major tracking improvement of PF-based methods versus classical source localisation algorithms. When comparing PF methods only, results from Figure 3.7 tend to show that algorithm SBF-PL generally works better than the other methods, yielding on average lower RMSE and higher FCR values. This confirms the simulation results obtained in Section 3.8, despite the fact that more simulations using real audio data may be required here in order to fully verify this statement.

Figure 3.6: Example of tracking results for one run of algorithm SBF-PL using real audio data (dotted line: true source trajectory; solid red line: estimated source position). Green lines represent ± one standard deviation of the particle set about the source location estimate. (Panels: x-position (m) and y-position (m) versus time (s).)

The MSTD values shown in Figure 3.7 are more or less constant for each PF-based algorithm, reflecting the fact that this value mainly results from the specific parameter setting chosen for each of these algorithms. This parameter is typically influenced by variables such as σ and r that determine the “mobility” of the particles in the state space. Consequently, the MSTD parameter does not strongly depend on the considered audio sample.

To give an indication of the computational complexity of each algorithm, the CPU time required to process a single audio sample (corresponding to about 7.25s of audio data) was also measured (footnote e). For PF methods, both of the pseudo-likelihood algorithms (SBF-PL and GCC-PL) took 24s, whereas the Gaussian-likelihood algorithms took 99s for GCC-GL, and 306s for SBF-GL. The classical methods required 53s for GCC, 93s for AEDA, and several hours for SBF. As well as providing the best tracking performance as depicted in Figure 3.7, the SBF-PL algorithm is also one of the most computationally efficient.

Footnote e: Note that the algorithms were not specifically optimised for execution speed. Also, the CPU times reported here are based on Matlab implementations.

Figure 3.7: Experimental results for each of six different samples of real audio data, showing the average performance measures RMSE, MSTD and FCR for various ASL methods. (Panels: RMSE (m), MSTD (m) and FCR (%) versus audio sample; legend: SBF, GCC, AEDA, SBF-PL, SBF-GL, GCC-PL, GCC-GL.)
3.10
Conclusions
Carrying out acoustic source tracking in the practical environment of a moderately reverberant office room is not a trivial task. Even low levels of reverberation or background noise can rapidly become detrimental to classical TDE-based or
beamforming methods. Under such adverse conditions, the use of sequential Monte Carlo methods proves to be of advantage compared to these more traditional algorithms. The enhanced tracking performance of particle filters mainly results from the fact that these methods deliver a source location estimate based on the set of all past observations rather than the current one only. Another important aspect of the sequential estimation methodology allows particle filters to deal efficiently with multi-modal observations, unlike other algorithms such as the Kalman filter. Whereas traditional ASL methods are easily misled by spurious peaks, PF algorithms are able to draw on any information related to the acoustic source, even if the true source peak is not the global maximum in the localisation function.

This chapter has developed a framework for source tracking using particle filters and discussed four specific PF-based algorithms, each of them differing from the others in the nature of the observations or in the way the measurement likelihood is computed. Results obtained from three traditional source localisation methods have also been investigated and used as reference for an overall comparison of each algorithm’s tracking ability. Using synthetic audio data as well as audio samples recorded in a real office room, sequential Monte Carlo methods have been shown to possess a much higher degree of robustness against reverberation and background noise compared to these classical algorithms.
Chapter 4
Particle Filter Design using Sequential Importance Sampling

The principle of sequential importance sampling (SIS) is reviewed and investigated in detail in this chapter. The basic SIS algorithm is adapted to the specific example of acoustic source tracking, and it is subsequently used to “upgrade” the particle filters (PF) of Chapter 3. Experimental simulations are carried out in order to test this new method, and practical results demonstrate that SIS algorithms are able to successfully avoid cases of complete track loss, which constitutes the biggest drawback of bootstrap methods when disturbance levels become too high. It is found that the importance sampling technique is the method that should be favoured in practice, despite exhibiting a slightly lower average tracking performance compared to the best bootstrap algorithm.
4.1
Introduction
The type of particle filter used in Chapter 3 and in other papers in the literature [118, 120] for the specific problem of acoustic source localisation (ASL) is principally based on the simple bootstrap method introduced by Gordon, Salmond and Smith [44]. This basic sequential Monte Carlo method presents the indisputable advantage of being conceptually very simple, leading to straightforward practical implementations and moderate computational requirements. These properties are of special interest for first-try implementations and real-time systems (see [80, 82]). A more general PF implementation makes use of the principle of sequential importance sampling. The bootstrap method is in effect a simplified version of this
method, which leads to the significant advantages mentioned above. However, due to the very nature of this simplification, each iteration of the bootstrap algorithm relocates the particles in the state space without taking account of the current observations. The only region in which the bootstrap PF might generate new particles is defined solely by the position of the state samples at the previous time step, meaning that some important regions of the state space might be omitted when searching for a potential target. This mainly precludes the PF from reinitialising after a target disappears or becomes occluded for a short period of time. Despite showing promising results, this algorithm consequently still lacks some important characteristics necessary for a smooth operation in practical scenarios, such as the automatic detection of new targets and the ability to recover from track loss.

In the following research, the bootstrap principle is upgraded using the concept of importance sampling, which provides the resulting algorithm with the important property of reinitialisation. Importance sampling further allows the combination of different types of observation in a global statistical framework. Thus, three kinds of importance sampling algorithms are investigated here, drawing on additional information from measurements based on steered beamforming (SBF), the adaptive eigenvalue decomposition algorithm (AEDA), and the generalised cross-correlation function (GCCF). These methods are tested and compared with bootstrap implementations, using both simulated reverberant data and samples of real audio data recorded in a typical office room. The development of a robust acoustic source tracking algorithm involving the concept of importance sampling is the main motivation behind the research described in this chapter.
Section 4.2 reviews the general theory of sequential importance sampling, and the design of particle filtering methods within this framework is discussed in Section 4.3 for the specific problem definition of acoustic source localisation. The resulting SIS algorithm and details of its implementation are presented and explained in Section 4.4. Practical experiments are then developed to assess the performance (acquisition and tracking ability) of three different SIS-based PF algorithms. The results from these simulations are depicted and analysed in detail in Section 4.5. For comparison purposes, the results obtained from previous bootstrap and classical ASL methods (see Chapter 2 and Chapter 3) are also given in this results section. Finally, a general discussion of the algorithms and some concluding remarks are given in the last two sections of this chapter.
4.2 SIS Theory Review
This section builds on the theoretical developments given in Section 2.3, which defines the basics of Bayesian and particle filtering. Therefore, the various concepts described in that part of the thesis are assumed to be already known and understood in this section.

Assuming perfect Monte Carlo sampling, let $\{X_k^{(n)}\}_{n=1}^{N}$ be a set of N random samples (with uniform weights 1/N) drawn from the density $p(X_k | Y_{1:k})$. This sample set allows the approximate computation of any statistical quantity of interest based on $p(X_k | Y_{1:k})$ (such as the mean, variance, mode, etc.), including a reconstruction of the density itself as the approximation:

\[ \hat{p}(X_k | Y_{1:k}) = \frac{1}{N} \sum_{n=1}^{N} \delta\big(X_k - X_k^{(n)}\big) , \tag{4.1} \]
where δ(·) is the Dirac delta function. Then, for any function f(·) of the state variable $X_k$, the expectation

\[ E\{f(X_k)\} = \int f(X_k)\, p(X_k | Y_{1:k})\, dX_k \]

can also be approximated on the basis of this sample set as:

\[ \hat{E}\{f(X_k)\} = \int f(X_k)\, \hat{p}(X_k | Y_{1:k})\, dX_k = \frac{1}{N} \sum_{n=1}^{N} f\big(X_k^{(n)}\big) , \tag{4.2} \]

where a straightforward use of the approximation given in Eq. (4.1) has been made. Eq. (4.2) can be used to compute e.g. the minimum mean square error (MMSE) estimate $\hat{X}_k$ for the case $f(X_k) = X_k$. Thus, the set of samples $\{X_k^{(n)}\}_{n=1}^{N}$ can be used as an explicit representation of the PDF $p(X_k | Y_{1:k})$ instead of the density itself (within the approximation limits).
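The sample-based approximation of Eq. (4.2) can be illustrated with a minimal Python sketch. The scalar Gaussian "posterior" used here, its parameters and the function name `mc_expectation` are assumptions chosen purely for illustration; they are not part of the ASL problem.

```python
import random

def mc_expectation(samples, f):
    """Approximate E{f(X)} by the sample average of Eq. (4.2)."""
    return sum(f(x) for x in samples) / len(samples)

random.seed(0)
# Hypothetical scalar posterior: Gaussian with mean 1.5 and standard deviation 0.2.
samples = [random.gauss(1.5, 0.2) for _ in range(20000)]

# f(X) = X yields the MMSE estimate; a second choice of f yields the variance.
mmse = mc_expectation(samples, lambda x: x)
variance = mc_expectation(samples, lambda x: (x - mmse) ** 2)
```

Any other statistic of interest follows in the same way by changing the function passed as `f`.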
In real-life Bayesian filtering problems however, the posterior density $p(X_k | Y_{1:k})$ is usually not available and it is hence impossible to sample directly from it. An alternative solution is the use of importance sampling. This method consists in choosing a so-called importance function q(·), which can be interpreted as a conditional PDF $q(X_k | Y_{1:k})$, and from which particles $X_k^{(n)}$ are easy to sample: $X_k^{(n)} \sim q(X_k | Y_{1:k})$, $n \in \{1, \ldots, N\}$. Then, the (unnormalised) importance weights can be computed according to the following equation [34]:

\[ \tilde{w}_k^{(n)} = \frac{p\big(Y_k | X_k^{(n)}\big)\, p\big(X_k^{(n)}\big)}{q\big(X_k^{(n)} | Y_{1:k}\big)} . \tag{4.3} \]
Provided the current state $X_k$ is independent of future observations (which is usually the case in practice), the importance density can be shown to factorise as follows:

\[ q(X_k | Y_{1:k}) = q(X_{k-1} | Y_{1:k-1})\, q(X_k | X_{k-1}, Y_k) . \tag{4.4} \]
Then, the new state samples can be drawn according to $q(X_k | X_{k-1}, Y_k)$ and an iterative equation can be derived to compute the importance weights recursively over time, as given by Eq. (4.5) in Algorithm 4.1. This algorithm gives a description of the generic SIS particle filtering method using this sequential weight update equation. Here again, an additional resampling step can be included in the iteration after having normalised the importance weights (which then defines the so-called SIR[a] Monte Carlo method, as described in [34]). The sequential importance sampling principle allows a decreased estimate variance by virtue of an improved sample-based representation. To emphasise the fact that the particles are sampled according to a specific PDF (rather than propagated from the previous time step as in the bootstrap implementation), the term importance particles will be used from now on to denote the samples $X_k^{(n)}$ generated by drawing from the importance function q(·).

The problem of degeneracy that commonly occurs in particle filter implementations is due mainly to the variance of the importance weights increasing over time.[b] If no measure is taken to reduce this phenomenon, all but one particle will end up having negligible weights after a few iterations. This results in a poor sample set approximation of the filtering density $p(X_k | Y_{1:k})$, and it also implies a waste of computational resources since a lot of time is spent updating particles whose contribution to the approximation is virtually nil.

Footnote a: Standing for SIS method with resampling.

Footnote b: It can be shown that the variance of the importance weights as given in Eq. (4.5) can only increase stochastically over time for importance functions that factorise as in Eq. (4.4) [34].
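The resampling step mentioned above can be sketched as follows. The text does not state which resampling scheme is meant at this point, so plain multinomial resampling is assumed here; the function name `resample` is hypothetical.

```python
import random

def resample(particles, weights, rng=random):
    """Multinomial resampling (an assumed scheme): draw N particles with
    probability proportional to their weights, then reset the weights to 1/N."""
    n = len(particles)
    new_particles = rng.choices(particles, weights=weights, k=n)
    return new_particles, [1.0 / n] * n

# Fully degenerate weight set: a single particle carries all the weight.
particles = [0.1, 0.5, 0.9]
weights = [0.0, 1.0, 0.0]
new_particles, new_weights = resample(particles, weights, rng=random.Random(0))
```

After resampling, the negligible-weight particles have been discarded and the computational effort is spent on copies of the particles that matter.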
Assumption: The set of particles and weights $\{(X_{k-1}^{(n)}, w_{k-1}^{(n)})\}_{n=1}^{N}$ is a discrete representation of the posterior distribution $p(X_{k-1} | Y_{1:k-1})$ at time k − 1.

Iteration: At time k, do the following for n = 1, . . . , N:

1. Sample $X_k^{(n)} \sim q(X_k | X_{k-1}^{(n)}, Y_k)$.

2. Compute the unnormalised importance weights according to the following recursive equation:

\[ \tilde{w}_k^{(n)} = \tilde{w}_{k-1}^{(n)}\, \frac{p\big(Y_k | X_k^{(n)}\big)\, p\big(X_k^{(n)} | X_{k-1}^{(n)}\big)}{q\big(X_k^{(n)} | X_{k-1}^{(n)}, Y_k\big)} . \tag{4.5} \]

Finally, normalise the importance weights so that they add up to unity:

\[ w_k^{(n)} = \frac{\tilde{w}_k^{(n)}}{\sum_{i=1}^{N} \tilde{w}_k^{(i)}} . \]

Result: The set of particles and weights $\{(X_k^{(n)}, w_k^{(n)})\}_{n=1}^{N}$ is approximately distributed as the posterior density $p(X_k | Y_{1:k})$.
Algorithm 4.1: Generic SIS algorithm.

As mentioned earlier, one way of reducing this effect is to implement a resampling step in the PF algorithm. Another way consists in an appropriate choice of importance function. In terms of minimising the variance of the importance weights, the optimal importance function has been shown to be [34]:

\[ q_{\mathrm{opt}}(X_k | X_{k-1}, Y_k) = p(X_k | X_{k-1}, Y_k) . \]

It can be seen that this choice of optimal importance density takes into account both the previous state $X_{k-1}$ and the current observation $Y_k$, making the SIS algorithm more robust than the bootstrap method.[c] However, this optimal importance density suffers from a number of drawbacks that make it inadequate for practical use, although it has been derived analytically for a small subset of model classes exhibiting some very specific characteristics [6, 34]. While a lot of research has been carried out in order to develop generic sub-optimal importance functions [6, 70, 92, 101, 114], the choice of a good importance function in practice usually results from a case-by-case consideration of the specific tracking problem at hand. In theory, any density (subject to some weak assumptions) could potentially be chosen as importance function, the main purpose of which is to redirect some of the particles so that most regions of the state space with high posterior likelihood will be explored during the PF iteration.

Footnote c: Indeed, the bootstrap algorithm can be derived from the SIS procedure with the transition prior $p(X_k | X_{k-1})$ chosen as importance density, hence confirming the fact that the bootstrap method has no knowledge of the current observations when exploring the state space.

In a few papers describing PF implementations for real-life applications, an alternative and interesting approach is used. The importance function q(·) is implemented in order to take advantage of measurements from auxiliary sensors that complement or refine the observations used in the form of the likelihood $p(Y_k | X_k)$
for the computation of the importance weights. A typical Bayesian approach to this problem of sensor fusion would be to mix the different measurements into a multisensor observation density p(Y k |X k ) in the reweighting step of the algorithm. This is the method implicitly chosen e.g. in [116] for an application of acoustic source
tracking using a sensor array. In this example, the likelihood function is defined as a product of measurement PDFs based on the cross-correlation functions obtained for a number of sensor pairs. Theoretically, this approach is only valid when the statistical relationship between sensors is understood, and in practice it is often assumed that these produce independent measurements.

In [117] and [39], two papers concerned with the problem of audio-visual speaker tracking, the importance sampling principle is used to first generate particles in the state space according to the audio measurements (derived in both papers from the generalised cross-correlation function). In a second step, these particles are given likelihood weights based on either the visual observations (in [117]), or both the visual and audio observations (in [39]). This technique provides an efficient way of fusing data from different sensors. Furthermore, it also takes advantage of the complementary strengths of each measurement process[d] by including these at different levels of the SIS tracking algorithm.

Footnote d: Audio information is usually better for fast tracking of, or recovery from discontinuities in the dynamics model, such as target occlusions, whereas visual data is more adequate for accurate fine-scale object localisation.
Similarly, the Icondensation algorithm (presented in [58] and addressing the problem of real-time hand tracking in video streams) implements the SIS method to draw on information obtained from two different measurement processes. The difference in this work is that both measurements (used for importance sampling and in the reweighting step) are derived from the same raw image data. The importance function is based on the output of a (coarse) skin-colour blob detector, and the likelihood results from contour measurements specifically designed for hand templates, both of which are based on the same frame of video data. Hence, the notion of independent observations is in this case clearly violated, although the inter-dependence of these two measurement processes is not obvious either. Contrary to the method consisting in combining the different observations in the representation of Y k , the SIS technique offers a principled way of including them in a common framework even when the statistical relationship between sensors is not known or hard to determine.
This way of combining information from different measurement processes using importance sampling is the approach chosen here to deal with the problem of acoustic source localisation and tracking. The next section discusses various aspects of the SIS method applied to this specific type of target tracking problem, and the updated SIS algorithm is finally summarised in Section 4.4.
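As a rough illustration of the generic SIS iteration (Algorithm 4.1, Eq. (4.5)), the following Python sketch runs one step for a scalar toy model. The random-walk dynamics, the Gaussian observation model, the particular importance function and all numerical constants are assumptions made for this example only; they do not correspond to the ASL models used in this chapter.

```python
import math
import random

DYN_STD = 0.3   # hypothetical process noise standard deviation
OBS_STD = 0.2   # hypothetical observation noise standard deviation
Q_STD = 0.25    # hypothetical spread of the importance function

def gauss_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def sis_step(particles, weights, y, rng):
    """One iteration of the generic SIS recursion for a scalar toy model."""
    new_particles, raw = [], []
    for x_prev, w_prev in zip(particles, weights):
        # Step 1: sample from q(X_k | X_{k-1}, Y_k); here a Gaussian centred
        # halfway between the previous state and the observation (assumption).
        q_mean = 0.5 * (x_prev + y)
        x = rng.gauss(q_mean, Q_STD)
        # Step 2: recursive weight update of Eq. (4.5).
        likelihood = gauss_pdf(y, x, OBS_STD)        # p(Y_k | X_k)
        transition = gauss_pdf(x, x_prev, DYN_STD)   # p(X_k | X_{k-1})
        proposal = gauss_pdf(x, q_mean, Q_STD)       # q(X_k | X_{k-1}, Y_k)
        new_particles.append(x)
        raw.append(w_prev * likelihood * transition / proposal)
    total = sum(raw)
    return new_particles, [w / total for w in raw]   # normalised weights

rng = random.Random(1)
particles = [rng.gauss(0.0, 1.0) for _ in range(500)]
weights = [1.0 / 500] * 500
particles, weights = sis_step(particles, weights, y=0.4, rng=rng)
estimate = sum(w * x for w, x in zip(weights, particles))  # weighted mean
```

Because the proposal depends on the observation, particles are drawn near the region supported by the measurement, and the ratio in the weight update corrects for the fact that they were not drawn from the transition prior.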
4.3 SIS Particle Filter Design for ASL
There are basically three design choices to be made for a practical implementation of the SIS principle. First, the source dynamics model must be explicitly described in terms of a transition equation of the form $X_k^{(n)} = g(X_{k-1}^{(n)}, u_k)$. This is a relatively unimportant design choice in the SIS algorithm, and a variety of models could potentially be chosen as long as the dynamics of the true source are included in it. In the following developments, the transition model of Eq. (2.11) (Section 2.4) will be used. The second and third choices are more critical and are about how to build the importance function q(·) and the likelihood $p(Y_k | X_k)$. As mentioned previously, both these functions are based on observations delivered by some sensors,
which corresponds to the signals acquired by the microphones in the particular case of ASL. Therefore, any of the traditional ASL methods mentioned in Section 2.2
could potentially be used for either of the likelihood or importance function, resulting in many combinations and as many different variations of the SIS particle filtering method. Considering every possible combination would obviously lead to too many SIS PF variants. Thus, in an attempt to reduce the overall number of tested methods, only a few special cases are chosen, as discussed in the following subsections. Only these specific SIS methods will be ultimately tested in the practical experiments and analysed in the results section.
4.3.1 Choice of Likelihood Function
In Chapter 3, the results of experimental simulations using bootstrap algorithms for ASL tend to show that steered beamforming has improved tracking performance compared to other TDE-based methods. This observation also results from the developments in Appendix A, which presents a theoretical comparison of these two approaches. Since the final SIS principle can be roughly regarded as a bootstrap method with an additional importance sampling step, it is sensible to choose as likelihood function the method delivering the best results in the bootstrap simulations. Hence, the SBF principle is chosen as observation for the computation of the likelihood function $p(Y_k | X_k)$ in the SIS algorithm. Also, a pseudo-type likelihood function will be used in the following, as introduced in Section 3.5.2. Let $s_m(t)$ denote the continuous signal received at sensor m. With $S_m(\omega) = \mathcal{F}\{s_m(t)\}$ the Fourier transform of the signal data, the pseudo-likelihood function can be computed as follows as the output power P(·) of a delay-and-sum beamformer (DSB) steered to the location $\ell = [x \;\, y]^T$:

\[ p(Y_k | X_k) \triangleq P(\ell) = \int W_P(\omega) \left| \sum_{m=1}^{M} H_m(\ell, \omega)\, S_m(\omega) \right|^2 d\omega , \tag{4.6} \]

where $W_P(\omega)$ is a frequency weighting function, and the complex beamformer weights $H_m(\cdot)$ for the mth microphone with location $\ell_m = [x_m \;\, y_m]^T$ are defined as:

\[ H_m(\ell, \omega) = \exp\big( j \omega c^{-1} ( \|\ell - \ell_m\| - d_{\max} ) \big) , \]
with $d_{\max}$ corresponding to the maximum possible distance in the room and c being the propagation speed of acoustic waves. In Eq. (4.6), the location vector ℓ reflects the current state of the variable $X_k$. For instance, the likelihood function $p(Y_k | X_k^{(n)})$ used for the computation of the importance weights in Eq. (4.5) can be computed using Eq. (4.6) with the focus location ℓ corresponding to the position $\ell_k^{(n)}$ of the nth particle.
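A discretised version of the beamformer output power of Eq. (4.6) can be sketched in Python as follows. The integral is replaced by a sum over a finite set of frequencies, the weighting is set to unity by default, and the array geometry, the speed of sound and the idealised anechoic spectra are all assumptions made for illustration (the names `dsb_power`, `mics` and `spectra` are hypothetical).

```python
import cmath
import math

C = 343.0  # propagation speed of sound in m/s (assumed value)

def dsb_power(loc, mics, spectra, omegas, d_max, weight=lambda w: 1.0):
    """Discretised Eq. (4.6): sum over frequencies of W_P |sum_m H_m S_m|^2."""
    power = 0.0
    for k, omega in enumerate(omegas):
        acc = 0j
        for mic, spec in zip(mics, spectra):
            # Steering weight H_m(l, omega) as defined after Eq. (4.6).
            h = cmath.exp(1j * omega * (math.dist(loc, mic) - d_max) / C)
            acc += h * spec[k]
        power += weight(omega) * abs(acc) ** 2
    return power

# Hypothetical 4-microphone array and source position (in metres).
mics = [(0.0, 0.0), (3.0, 0.0), (0.0, 3.0), (3.0, 3.0)]
source = (1.0, 2.0)
omegas = [2.0 * math.pi * f for f in range(100, 401, 50)]
d_max = math.dist((0.0, 0.0), (3.0, 3.0))
# Idealised spectra: unit-amplitude source, pure propagation delay, no reverberation.
spectra = [[cmath.exp(-1j * w * math.dist(source, m) / C) for w in omegas]
           for m in mics]

p_source = dsb_power(source, mics, spectra, omegas, d_max)
p_other = dsb_power((2.0, 0.5), mics, spectra, omegas, d_max)
```

With these idealised signals, the steering weights align all microphone phases exactly when the beamformer is focused on the true source position, so the output power is largest there.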
4.3.2 Choice of Importance Function
As mentioned previously, the distribution of the particles at every time step in a bootstrap algorithm is only determined by the previous state PDF $p(X_{k-1} | Y_{1:k-1})$ and the motion density $p(X_k | X_{k-1})$, without any knowledge of the current measurements $Y_k$. In the SIS principle, the importance function is expected to contain
some sort of knowledge describing which areas of the state space contain most information about the posterior density p(X k |Y 1:k ). Its global purpose is thus to relocate some of the particles taking the current observations Y k into account, so that state space regions with potentially large posterior likelihood are also examined during the PF iteration. Here again, several techniques can be used in the specific case of ASL in order to construct an importance function based on the current observations (i.e. the signals recorded by the array sensors). The methods that will be used in this chapter for the practical tests of the SIS algorithm are described in the following subsections.
4.3.3 Importance Function using Steered Beamforming
Rather than a fine-scale and accurate representation of the particle sampling areas, the importance function is typically meant to give a coarse indication of where the particles are to be sampled in the state space. The output power of a steered beamformer computed for low frequencies precisely possesses this property. The beampattern at high frequencies generally exhibits a narrow main lobe, and it suffers from aliasing which typically generates spurious peaks in the observations. For a smaller range of low frequencies however, the effects of aliasing are reduced and the width of the main lobe in the beampattern becomes more important. This phenomenon is demonstrated in Figure 4.1 where the beampattern of a delay-and-sum beamformer, steered to a location corresponding to the middle of the array field, is simulated for three different operating frequencies. For these plots, the beamformer was defined with the sensor and room configuration given in Section 2.5.1, and only the direct path is taken into account (reverberation effects not included).
Figure 4.1: Beampatterns (in dB) of a delay-and-sum beamformer for three operating frequencies, f = 100Hz, 200Hz and 300Hz respectively (from left to right). [Plot axes: x-axis (m), y-axis (m).]

Based on these remarks, it can be seen that the beamformer’s output computed for low frequencies lends itself particularly well to a use as importance function q(·), which is further demonstrated in Figure 4.2. The plots at the top of this figure are the output functions of the delay-and-sum beamformer obtained over a wide range of frequencies and over a small range of low frequencies, respectively. These plots are the result of using Eq. (4.6) with a single frame of real audio data recorded in a reverberant office room by means of the sensor array described in Section 2.5.1. A speech signal was being emitted by the source during this specific frame from the source location $[x_s \;\, y_s]$ = [0.93m 2.17m]. A star symbol (∗) in each plot of Figure 4.2 is displayed on each function’s surface to indicate the true source position. From here on, the term $P_{\mathrm{low}}(\cdot)$ will be used to denote the output power of a steered
beamformer where the integration of Eq. (4.6) is computed over a restricted range of low frequencies only.

Figure 4.2: Example of importance function derived from SBF measurements. The true source position $[x_s \;\, y_s]$ = [0.93m 2.17m] is indicated with a ‘∗’ on each function’s surface. Top plots: DSB output computed over the frequency range f ∈ [300, 3000Hz] (left-hand side) and f ∈ [100, 400Hz] (right-hand side). Bottom plot: 2D uniform PDF derived from the DSB output at low frequencies, and used as sampling function in the SIS algorithm. [Plot axes: x-axis (m), y-axis (m).]

The effects of aliasing can be clearly seen in Figure 4.2 when the DSB output P(·) is computed over a wide range of frequencies. A multitude of spurious peaks appear in the power measurements,[e] making the explicit detection of the true source peak more challenging. Note that in this example, the peak closest to the true source location is far from being the largest one. At low frequencies, the resulting DSB output $P_{\mathrm{low}}(\cdot)$ provides less accurate information about the source location (broader peak), but this measurement is also more explicit regarding the most likely source position (fewer spurious peaks). This low-frequency power function can hence be used directly as importance function: $q_{\mathrm{sbf}}(X_k | Y_{1:k}) \triangleq P_{\mathrm{low}}(\ell_k)$. Note that this definition is a rather coarse approximation in theory, since $P_{\mathrm{low}}(\cdot)$ does not incorporate any past observations (the same is also valid for other importance functions developed in this work). This is however still a substantial improvement compared to the bootstrap algorithm, which implicitly uses the transition prior $p(X_k | X_{k-1})$ as importance density and hence does not take any observation into account at all. Note also that $q_{\mathrm{sbf}}(\cdot)$ is defined as a pseudo-density, differing from a true PDF in that it is not suitably normalised.[f]

Footnote e: This phenomenon is also partly due to the effects of reverberation in the recording environment.

Footnote f: Normalisation can be a problem in practice when a measurement-based PDF is only to be computed for a finite set of points, which is typically the case with particle filtering. The use of pseudo-densities is hence more appropriate in these cases.

In order to draw the importance samples, a simplified importance function $\tilde{q}_{\mathrm{sbf}}(\cdot)$ can be derived from $P_{\mathrm{low}}(\cdot)$, based on a threshold function $\Theta_{\mathrm{sbf}}(\ell)$ that is
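non-nil only for regions of the state space where $P_{\mathrm{low}}(\ell)$ is above a certain threshold level θ:

\[ \tilde{q}_{\mathrm{sbf}}(X_k | Y_{1:k}) \propto \Theta_{\mathrm{sbf}}(\ell) = \begin{cases} 1 & \text{if } P_{\mathrm{low}}(\ell) > \theta , \\ 0 & \text{otherwise} . \end{cases} \tag{4.7} \]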
Here again, the location vector $\ell \triangleq \ell_k$ corresponds to the current state of the variable $X_k$. After suitable normalisation, the threshold function $\Theta_{\mathrm{sbf}}(\ell)$ can be used as a two-dimensional (2D) uniform distribution for the sampling of the importance particles.[g] The bottom plot of Figure 4.2 shows the resulting sampling distribution $\tilde{q}_{\mathrm{sbf}}(\cdot)$ derived from the low-frequency DSB output $P_{\mathrm{low}}(\cdot)$ depicted in the top right-hand plot of the same figure.
The specific method resulting from the specifications given in this section will be denoted SIS-SBF throughout the rest of this chapter. This notation highlights the fact that this algorithm is based on the SIS principle using steered beamforming as the raw data processing method for the importance density. Unless stated otherwise, the SIS-SBF method uses an importance function built on the basis of the DSB output Plow (·) computed over the frequency band f ∈ [100, 400Hz], and with a threshold factor θ defined as 90% of the maximum value of Plow (·).
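One possible way of drawing importance particles from the thresholded function of Eq. (4.7) is sketched below in Python. The grid-based evaluation, the grid resolution, the room dimensions and the smooth stand-in for the low-frequency DSB output are assumptions made for this example; the thesis text does not prescribe how the uniform sampling is implemented.

```python
import math
import random

def sample_thresholded(p_low, x_range, y_range, n, frac=0.9, grid=50, rng=random):
    """Draw n locations uniformly from the grid cells whose centre value of
    p_low exceeds theta = frac * max(p_low), in the spirit of Eq. (4.7).
    Assumes p_low is positive somewhere on the grid."""
    dx = (x_range[1] - x_range[0]) / grid
    dy = (y_range[1] - y_range[0]) / grid
    centres = [(x_range[0] + (i + 0.5) * dx, y_range[0] + (j + 0.5) * dy)
               for i in range(grid) for j in range(grid)]
    values = [p_low(x, y) for x, y in centres]
    theta = frac * max(values)
    support = [c for c, v in zip(centres, values) if v > theta]
    samples = []
    for _ in range(n):
        cx, cy = rng.choice(support)          # uniform over the retained cells
        samples.append((cx + rng.uniform(-dx / 2, dx / 2),
                        cy + rng.uniform(-dy / 2, dy / 2)))
    return samples, theta

def p_low(x, y):
    """Hypothetical smooth stand-in for the low-frequency DSB output."""
    return math.exp(-((x - 0.93) ** 2 + (y - 2.17) ** 2) / 0.5)

samples, theta = sample_thresholded(p_low, (0.0, 3.0), (0.0, 3.5), n=200,
                                    rng=random.Random(3))
```

The samples obtained in this way only provide particle locations; a pointwise evaluation of the importance function for the weight computation would still rely on the low-frequency DSB output itself.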
4.3.4 TDE-based Importance Function
As already mentioned, importance sampling allows the combination of different measurement modalities originating from the same underlying raw data, even when the inter-dependence between the various measurement processes is not completely known. Hence, it is potentially advantageous to complement the SBF observations Y k involved in the reweighting step of the SIS algorithm (i.e. the likelihood function p(Y k |X k ) in Eq. (4.5)) with a different kind of measurement, e.g. obtained
with a method based on time delay estimates (TDE). Combining these two types of observation constitutes here an attempt to combine the strengths of both methods in order to make the resulting algorithm more robust.

Footnote g: Note that the function $\tilde{q}_{\mathrm{sbf}}(\cdot)$ is only used to draw state samples. If a pointwise evaluation of the importance function is required (e.g. for the computation of importance weights), the function $q_{\mathrm{sbf}}(X_k | Y_{1:k}) = P_{\mathrm{low}}(\ell_k)$ is used instead.

An estimate of the time delay of arrival of a source signal with respect to two acoustic sensors can be obtained using the adaptive eigenvalue decomposition algorithm (AEDA, see [8]) or a method based on the generalised cross-correlation principle (GCC, see [66]). With the cross-correlation method, the GCC function R(·) between two sensor signals $s_i(t)$ and $s_j(t)$ is defined as:

\[ R(\tau) = \int W_R(\omega)\, S_i(\omega)\, S_j^{*}(\omega)\, e^{j\omega\tau}\, d\omega , \tag{4.8} \]

with $S_{(\cdot)}(\omega) = \mathcal{F}\{s_{(\cdot)}(t)\}$ the frequency domain signal data, $W_R(\omega)$ the GCC frequency weighting function, and $(\cdot)^{*}$ denoting the complex conjugation operation. In a manner similar to previous developments, the integration over frequency can here also be reduced to a small range of low frequencies. As demonstrated in Appendix A, the cross-correlations between different sensor pairs in an array constitute the basic building blocks in the output power computation of a delay-and-sum beamformer, thus validating this “reduced frequency band” approach. However, no significant robustness improvement in the time delay estimates has been noticed when using this method in practice. An estimate $\hat{\tau}_{\mathrm{gcc}}$ of the time delay of arrival for the sensor pair (i, j) is then determined by:

\[ \hat{\tau}_{\mathrm{gcc}} = \arg\max_{\tau} R(\tau) . \]
Similarly, a time delay estimate τˆaeda for the sensor pair (i, j) can be computed using the AEDA algorithm (see Section 2.2.1). Due to more realistic assumptions regarding the signal propagation model made by the AEDA algorithm, it is usually expected that the TDE τˆaeda resulting from this method will be more robust towards reverberation and noise than the TDE obtained with a cross-correlation technique. The reader is referred to [8] for more details on the AEDA algorithm.
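The GCC-based time delay estimate can be illustrated with a short Python sketch. It assumes a unit frequency weighting, in which case Eq. (4.8) reduces (up to a constant factor) to the ordinary cross-correlation of the two signals, evaluated here in the discrete time domain over integer lags. The function names and the synthetic white-noise test signals are hypothetical.

```python
import random

def gcc_delay(s_i, s_j, max_lag):
    """Return the integer lag (in samples) that maximises the unweighted
    cross-correlation R(lag) = sum_t s_i[t + lag] * s_j[t]."""
    n = len(s_i)
    best_lag, best_val = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        val = sum(s_i[t + lag] * s_j[t] for t in range(n) if 0 <= t + lag < n)
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag

def delayed(signal, d):
    """Delay a signal by d samples (zero-padded at the start)."""
    return [0.0] * d + signal[:len(signal) - d]

rng = random.Random(7)
base = [rng.gauss(0.0, 1.0) for _ in range(400)]   # synthetic source signal
s_i = delayed(base, 5)   # sensor i receives the signal 5 samples late
s_j = delayed(base, 2)   # sensor j receives the signal 2 samples late
tau_hat = gcc_delay(s_i, s_j, max_lag=10)
```

With this sign convention, the estimate is the delay of sensor i relative to sensor j. A frequency-domain implementation with a non-trivial weighting would follow Eq. (4.8) more closely but needs an FFT routine.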
Once a set of TDEs $\{\hat{\tau}_p\}_{p=1}^{P}$ has been obtained for P sensor pairs (using either of the two methods mentioned above), a TDE-based importance function $q_{\mathrm{tde}}(\cdot)$ can be implemented (by assuming independence between the different TDE realisations) as a product of Gaussian densities:

\[ q_{\mathrm{tde}}(X_k | Y_{1:k}) = \prod_{p=1}^{P} \mathcal{N}\big( \tau_p(\ell);\, \hat{\tau}_p,\, \sigma_p^2 \big) . \tag{4.9} \]

Here, $\tau_p(\ell)$ is the time delay of arrival corresponding to the location ℓ with respect to the pth sensor pair, and the location vector $\ell \triangleq \ell_k$ corresponds to the current state of the variable $X_k$:

\[ X_k = \begin{bmatrix} \ell_k \\ \dot{\ell}_k \end{bmatrix} . \]
The deviation parameter $\sigma_p$ in Eq. (4.9) is a free parameter of the algorithm that typically has to be learned from training data. For the practical ASL example described in Section 2.5, good results were obtained for this type of importance function with $\sigma_p$ set to 15% of the maximum time delay measurable with the corresponding sensor pair. With the pth pair consisting of the microphones i and j, located at the positions $\ell_i$ and $\ell_j$ respectively, this corresponds to:

\[ \sigma_p = 0.15 \cdot \|\ell_i - \ell_j\|\, c^{-1} . \]

Here again, a simplified importance function $\tilde{q}_{\mathrm{tde}}(X_k | Y_{1:k})$ can be obtained for the purpose of drawing the importance particles. In a manner similar to Eq. (4.7), this is done by quantising the function $q_{\mathrm{tde}}(X_k | Y_{1:k})$ according to a specific level θ, also defined here as 90% of the peak value of $q_{\mathrm{tde}}(X_k | Y_{1:k})$.
Despite the fact that the AEDA method is commonly assumed to deliver better results than the GCC method, the latter has also been implemented for completeness of the practical algorithm tests carried out in this research. The resulting SIS algorithms will be denoted SIS-AEDA and SIS-GCC, respectively. Unless otherwise stated, the importance function used in SIS-GCC is computed using the integration in Eq. (4.8) over the frequency range f ∈ [100, 1000Hz].
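The TDE-based importance function of Eq. (4.9), with the deviation parameter set to 15% of the maximum measurable delay, can be evaluated pointwise as in the following Python sketch. The sensor positions, the speed of sound and the sign convention chosen for the time delay of arrival are assumptions made for illustration; the function names are hypothetical.

```python
import math

C = 343.0  # propagation speed of sound in m/s (assumed value)

def tdoa(loc, mic_i, mic_j):
    """Time delay of arrival tau_p(l) for a source at loc and the pair (i, j);
    the sign convention (i minus j) is an assumption."""
    return (math.dist(loc, mic_i) - math.dist(loc, mic_j)) / C

def q_tde(loc, pairs, tde_estimates, rel_sigma=0.15):
    """Eq. (4.9): product over sensor pairs of N(tau_p(l); tau_hat_p, sigma_p^2)."""
    q = 1.0
    for (mic_i, mic_j), tau_hat in zip(pairs, tde_estimates):
        sigma = rel_sigma * math.dist(mic_i, mic_j) / C   # sigma_p definition
        z = (tdoa(loc, mic_i, mic_j) - tau_hat) / sigma
        q *= math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    return q

# Hypothetical sensor pairs and source position (in metres).
pairs = [((0.0, 0.0), (3.0, 0.0)), ((0.0, 0.0), (0.0, 3.0))]
source = (1.0, 2.0)
# Error-free TDEs, as would be obtained in an ideal anechoic case.
estimates = [tdoa(source, mi, mj) for mi, mj in pairs]

q_source = q_tde(source, pairs, estimates)
q_other = q_tde((2.0, 0.5), pairs, estimates)
```

The function peaks along the locations whose delays match the estimates for every pair, which is where the importance particles would be concentrated after thresholding.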
4.4 Revised SIS Algorithm for ASL
In this section, the generic SIS technique defined in Algorithm 4.1 (Section 4.2) is updated to reflect the specific constraints of the ASL problem. First, some theoretical issues are considered, mainly regarding the computation of the importance weights. For a practical development of the SIS principle, some functional updates are also necessary and their implementation is described subsequently. The final SIS algorithm for ASL is then presented in full detail.
4.4.1 Velocity Component of the Importance Particles
On the basis of the practical developments of Sections 4.3.3 and 4.3.4, it can be seen that the importance function q(·) (or its simplified version $\tilde{q}(\cdot)$) really only contains spatial information about the state, i.e. it does not provide any data related to the velocity component of the state vector. As a result, only the upper half of the particle vector

\[ X_k^{(n)} = \begin{bmatrix} \ell_k^{(n)} \\ \dot{\ell}_k^{(n)} \end{bmatrix} \]

is assigned upon sampling from the importance function. Hence, the velocity part of these importance samples must be determined by some other means. According to the discussion of Section 3.3, the velocity component of the particle vector does not necessarily correspond to any physical quantity in practice. This variable is merely introduced as a requirement of the assumed dynamics model, and basically allows the “mobility” of the particles in the state space to be modified. It follows that the exact velocity value assigned to the importance particles is not a critical issue here, and therefore several possibilities could be used to determine this parameter (including setting it to zero). In order to keep the current implementation in line with earlier developments, the velocity component $\dot{\ell}_k^{(n)}$ of the newly sampled importance particles will be assumed to be a zero-mean Gaussian variable with a variance directly determined by the dynamics model (see Section 2.4):

\[ \dot{\ell}_k^{(n)} \sim \mathcal{N}\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix} , \begin{bmatrix} b^2 & 0 \\ 0 & b^2 \end{bmatrix} \right) . \]

4.4.2 Importance Weights Computation
In the basic bootstrap algorithm described in Section 2.3.3, the numerical approximation of the posterior PDF $p(X_k | Y_{1:k})$ is obtained with particles distributed according to the prior density (as a result of the transition prior $p(X_k | X_{k-1})$ being used implicitly as sampling distribution),

\[ X_k^{(n)} \sim p(X_k | Y_{1:k-1}) , \]

and with weights proportional to the likelihood function:

\[ w_k^{(n)} \propto p\big(Y_k | X_k^{(n)}\big) . \]
In relation to this, the importance sampling principle is based on the following fact. If the particles are sampled from an importance function q(·) instead of the prior density:

\[ X_k^{(n)} \sim q(X_k | Y_{1:k}) , \]

then, for the set of particles and weights $\{(X_k^{(n)}, w_k^{(n)})\}_{n=1}^{N}$ to remain a truthful representation of the posterior $p(X_k | Y_{1:k})$, the computation formula for the weights must be updated as follows (see Eq. (4.3)):

\[ w_k^{(n)} \propto p\big(Y_k | X_k^{(n)}\big)\, \frac{p\big(X_k^{(n)} | Y_{1:k-1}\big)}{q\big(X_k^{(n)} | Y_{1:k}\big)} . \tag{4.10} \]
The additional term in the weight update equation compensates for a potentially uneven distribution of the particles that might result from the importance function. In a practical implementation, the computation of the prior density

\[ p(X_k | Y_{1:k-1}) = \int p(X_k | X_{k-1})\, p(X_{k-1} | Y_{1:k-1})\, dX_{k-1} \]

can be approximated as:

\[ \begin{aligned} \hat{p}(X_k | Y_{1:k-1}) &= \int p(X_k | X_{k-1})\, \hat{p}(X_{k-1} | Y_{1:k-1})\, dX_{k-1} \\ &= \sum_{i=1}^{N} w_{k-1}^{(i)}\, p\big(X_k | X_{k-1}^{(i)}\big) , \end{aligned} \tag{4.11} \]

where use has been made of the sample set approximation of the posterior PDF:

\[ \hat{p}(X_{k-1} | Y_{1:k-1}) = \sum_{i=1}^{N} w_{k-1}^{(i)}\, \delta\big(X_{k-1} - X_{k-1}^{(i)}\big) . \]
The transition PDF $p(X_k | X_{k-1})$ in Eq. (4.11) follows directly from the transition equation and the process noise statistics defined by Eq. (2.11):

\[ \begin{aligned} p(X_k | X_{k-1}) &= p_{u_k}(X_k - G X_{k-1}) = \mathcal{N}(X_k;\, G X_{k-1},\, Q) \\ &\approx \underbrace{\mathcal{N}\big(x_k;\, x_{k-1} + a T_u \dot{x}_{k-1},\, b^2 T_u^2\big)}_{p(x_k | x_{k-1}, \dot{x}_{k-1})} \cdot \underbrace{\mathcal{N}\big(y_k;\, y_{k-1} + a T_u \dot{y}_{k-1},\, b^2 T_u^2\big)}_{p(y_k | y_{k-1}, \dot{y}_{k-1})} , \end{aligned} \tag{4.12} \]
where the last line results from the assumption of independence between the target’s x and y motions, and $p_{u_k}(\xi)$ denotes the PDF of the system noise $u_k$ evaluated at ξ. The approximate equality of Eq. (4.12) results from the fact that the right-hand side of this equation should theoretically also include PDFs of the form $\mathcal{N}(\dot{x}_k;\, a\dot{x}_{k-1},\, b^2)$, i.e. $p(\dot{x}_k | \dot{x}_{k-1})$, for both $\dot{x}_k$ and $\dot{y}_k$ variables. However, in the light of the discussion in Section 4.4.1, these densities do not provide any relevant statistical information regarding the state variable, and they are therefore left out of the current analysis. Eqs. (4.12) and (4.11) can be finally substituted into Eq. (4.10) in order to compute the importance weights. As a last remark in this section, it must be noted that the formula used in Eq. (4.10) to compute the importance weights is derived from the non-sequential version given by Eq. (4.3). The term “SIS” (for sequential importance sampling) used here to name the resulting algorithms is hence a slight abuse of notation.
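The substitution of Eqs. (4.11) and (4.12) into Eq. (4.10) can be sketched in Python as follows. The likelihood and importance function are passed in as generic callables, and the parameter values a, b and T_u used in the example are hypothetical placeholders rather than the values of the dynamics model of Section 2.4.

```python
import math

def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def transition_pdf(loc, prev, a, b, t_u):
    """Position part of p(X_k | X_{k-1}) as in Eq. (4.12); prev = (x, y, vx, vy)."""
    x, y = loc
    px, py, vx, vy = prev
    var = (b * t_u) ** 2
    return (normal_pdf(x, px + a * t_u * vx, var)
            * normal_pdf(y, py + a * t_u * vy, var))

def prior_pdf(loc, prev_particles, prev_weights, a, b, t_u):
    """Eq. (4.11): weighted mixture of transition densities."""
    return sum(w * transition_pdf(loc, p, a, b, t_u)
               for p, w in zip(prev_particles, prev_weights))

def importance_weights(locs, likelihood, q, prev_particles, prev_weights, a, b, t_u):
    """Eq. (4.10) for every importance particle, followed by normalisation."""
    raw = [likelihood(l) * prior_pdf(l, prev_particles, prev_weights, a, b, t_u) / q(l)
           for l in locs]
    total = sum(raw)
    return [w / total for w in raw]

# Hypothetical example: previous particle set concentrated near (1, 1).
prev_particles = [(1.0, 1.0, 0.0, 0.0), (1.05, 0.95, 0.0, 0.0)]
prev_weights = [0.5, 0.5]
locs = [(1.0, 1.0), (2.0, 2.0)]          # importance particle positions
weights = importance_weights(locs, lambda l: 1.0, lambda l: 1.0,
                             prev_particles, prev_weights, a=0.9, b=1.0, t_u=0.1)
```

Note that evaluating the mixture of Eq. (4.11) for each of the N importance particles requires N transition density evaluations, so the cost of this weight computation grows quadratically with the number of particles.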
4.4.3 Practical Updates
The SIS method described in Algorithm 4.1 results as a purely theoretical solution of the general Bayesian filtering problem. As a consequence of the non-ideal nature of the measurement processes and due to the ineluctable compromises inherent to any practical implementation, a few modifications of this original SIS method are necessary.

Standard Bootstrap Option

The different types of importance function described in Sections 4.3.3 and 4.3.4 are derived solely from the current observations and hence do not account for the previous condition of the state variable $X_{k-1}$. Also, the different measurement processes are not ideal in practice, causing the importance function to become ambiguous or even unsuitable for sampling purposes with certain observations. This is the case depicted in Figure 4.3: due to reverberation effects and other acoustic disturbances, the simplified importance density resulting from low-frequency DSB power computations is clearly non-optimal for a single-target scenario.

One solution to these problems is to allow for a few particles to be sampled from the prior density $p(X_k | Y_{1:k-1})$ during the PF iteration. This can be simply achieved by including a standard bootstrap option in the basic SIS algorithm, in which a certain proportion of the particles are sampled according to the prior rather than from the importance density q(·). The decision regarding the percentage of particles to generate according to each procedure is discussed in a following section.

Figure 4.3: Example of low-frequency DSB output (f ∈ [100, 400Hz]) and resulting non-ideal sampling distribution. The true source position $[x_s \;\, y_s]$ = [1.29 1.90] (in m) is indicated with a star (∗). [Plot axes: x-axis (m), y-axis (m).]
Better Transition Density for Importance Particles

The importance weights computed using Eq. (4.10) with a transition density given by Eq. (4.12) suffer from the problem described below. In theory, the level of randomness introduced by the motion model generates some non-zero posterior probability everywhere in the state space. It is hence expected that a sample set approximation to the posterior density will have at least a few particles distributed everywhere in the state space. In practice however, the state samples will tend to be fairly localised and concentrated around the most likely target position, due to the finite nature of the particle representation. This results in large areas of the state space containing no samples at all. A brute-force method to avoid this effect would be to increase the number of particles to improve the posterior approximation, but this is generally not an effective solution for a practical PF implementation, for obvious reasons regarding computational costs.

The importance sampling principle is also a means of indirectly repopulating state space regions with low particle density. This is done by redirecting some of the state samples into regions of the state space potentially containing some information about the posterior density. Hence, importance sampling represents a way of improving the accuracy of the sample set approximation for a small number of particles.

The problem here is that generic motion models usually do not allow for such a “spontaneous” appearance of samples across the state space. With the transition PDF defined in Eq. (4.12), any state sample generated through importance sampling in “remote” areas of the state space will be given negligible importance weights in Eq. (4.10). For instance, with the various model parameters defined in Section 2.4 and given the practical simulation parameters (such as sampling frequency, frame size, etc.) defined later in Section 4.5.1, it can be deduced that the importance weight $w_k^{(n)}$ will be practically zero (i.e. $p(\mathbf{X}_k^{(n)}|\mathbf{Y}_{1:k-1}) \leq 10^{-20}$ in Eq. (4.10)) for any particle $\mathbf{X}_k^{(n)}$ that is not located within about 0.15m from any of the state samples $\{\mathbf{X}_{k-1}^{(i)}\}_{i=1}^{N}$ at the previous time step. This specific result was computed with the simplifying assumptions $p(\mathbf{Y}_k|\mathbf{X}_k^{(n)}) = q(\mathbf{X}_k^{(n)}|\mathbf{Y}_{1:k}) \approx 1$ and $\dot{x}_{k-1} = \dot{y}_{k-1} = 0$. Consequently, the contribution of the importance sampling particles to the overall algorithm would be virtually nil most of the time.^h

In an attempt to solve this problem, it would be a mistake to relax the overall constraints implied by the dynamics model, as this would generally allow unwanted source motions that are not expected from the true target. However, the transition density can be updated specifically for the computation of the importance weights as follows (using the x-coordinate case as example, the same result also applies to the y-coordinate):

$$p(x_k|x_{k-1}, \dot{x}_{k-1}) = \mathcal{N}\big(x_k;\, x_{k-1} + aT_u\dot{x}_{k-1},\, b^2T_u^2\big) + \psi. \tag{4.13}$$

The transition probability $\psi$ is a small constant accounting for the fact that importance sampling particles are not governed by the same model dynamics as particles used in a standard bootstrap step. Hence, this allows importance samples to have non-zero weights and as a consequence, a non-vanishing chance of survival. The definition of Eq. (4.13) will be used instead of the PDFs given in Eq. (4.12) for an implementation of the SIS method, and unless otherwise stated, the parameter $\psi$ (for both variables $x_k$ and $y_k$) is set to a small fraction of the peak value of the original transition PDF $p(x_k|x_{k-1}, \dot{x}_{k-1})$:

$$\psi = \frac{1}{bT_u\sqrt{2\pi}} \cdot 10^{-3}.$$

^h If the importance weight $w_{k-1}^{(i)}$ of a specific particle $\mathbf{X}_{k-1}^{(i)}$ is nil, then this particle has no chance of being resampled at the next time step in the bootstrap part of the updated SIS algorithm. It is hence guaranteed to die off straight after being generated. Also, if a new state sample $\mathbf{X}_k^{(j)}$ is generated through importance sampling in the vicinity of $\mathbf{X}_{k-1}^{(i)}$, the contribution of $\mathbf{X}_{k-1}^{(i)}$ to the computation of the new weight $w_k^{(j)}$ is still nonexistent due to the term $w_{k-1}^{(i)}\, p(\mathbf{X}_k^{(j)}|\mathbf{X}_{k-1}^{(i)})$ in the sum of Eq. (4.11) also becoming zero.
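The effect of the constant $\psi$ can be illustrated numerically. The following is a minimal sketch only: the function names are hypothetical, and the values chosen for $a$, $b$ and $T_u$ are placeholders rather than the model parameters of Section 2.4. It evaluates the one-coordinate density of Eq. (4.13) for a sample close to a previous particle and for a sample 1m away, with and without the added constant.

```python
import math

def gaussian_pdf(x, mean, std):
    """Value of the normal density N(x; mean, std**2)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def transition_pdf(x_k, x_prev, v_prev, a, b, T_u, psi=0.0):
    """One-coordinate transition density in the form of Eq. (4.13).

    With psi = 0 this reduces to the plain Gaussian transition density.
    """
    return gaussian_pdf(x_k, x_prev + a * T_u * v_prev, b * T_u) + psi

# Placeholder parameters (assumptions, NOT the values of Section 2.4).
a, b, T_u = 0.7, 0.7, 0.032

# psi is 1e-3 times the peak value of the original transition PDF.
psi = 1e-3 / (b * T_u * math.sqrt(2.0 * math.pi))

near = transition_pdf(0.01, 0.0, 0.0, a, b, T_u)           # 1 cm from a previous particle
far_plain = transition_pdf(1.0, 0.0, 0.0, a, b, T_u)       # 1 m away, no floor
far_floor = transition_pdf(1.0, 0.0, 0.0, a, b, T_u, psi)  # 1 m away, with psi

print(near, far_plain, far_floor)
```

Under these assumed values the plain Gaussian underflows to zero for the remote sample, whereas the version with $\psi$ stays at the level of $\psi$ itself, so a remote importance sample keeps a small but non-zero weight.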
Reinitialisation Option

In a typical tracking scenario, it is natural for the target to disappear from the observations every now and then. This type of phenomenon happens e.g. if the target becomes occluded or, in the context of ASL, if the source stops emitting an acoustic signal for a short period of time. A similar but more problematic situation is when the target generates measurements that violate the dynamics of the source motion model defined in the algorithm, as in the case of a target disappearance and reappearance. This particular situation is bound to happen quite regularly with talker tracking, in fact whenever the talker becomes momentarily silent while still moving, and then resumes speaking from a totally different room location. These natural (and expected) effects, together with the non-ideal character of the practical observations, can potentially lead to the particle filter losing track of the target for short periods of time.

In order to recover successfully from these occasional errors, an additional reinitialisation density $p_r(\mathbf{X}_k)$ is introduced in the generic SIS algorithm, from which some of the state samples are drawn during each PF iteration. This procedure can be seen as a standard bootstrap iteration with a prior density redefined as:

$$\tilde{p}(\mathbf{X}_k|\mathbf{Y}_{1:k-1}) = (1 - \alpha)\, p(\mathbf{X}_k|\mathbf{Y}_{1:k-1}) + \alpha\, p_r(\mathbf{X}_k),$$

with $\alpha \in [0, 1]$ some (constant) prior probability of reinitialisation.^i Consequently, the likelihood weights corresponding to particles sampled from $p_r(\cdot)$ are to be computed according to the likelihood function, $w_k^{(n)} \propto p(\mathbf{Y}_k|\mathbf{X}_k^{(n)})$, as described in Section 2.3.3.

Note that since the importance function $q(\mathbf{X}_k|\mathbf{Y}_{1:k})$ is also based on the current observations $\mathbf{Y}_k$, the generic SIS algorithm already possesses some sort of intrinsic reinitialisation capability. Using importance sampling, the particle set will for instance naturally converge towards a new target appearing at any location in the state space. This process is however rather slow and also necessitates quite a few frames of consistent observations for the particles to eventually converge. Recovery from tracking errors is hence greatly accelerated by making use of the reinitialisation density $p_r(\cdot)$. As mentioned in [58], in the absence of a proper (i.e. practical) initialisation density, the importance function can typically be used instead: $p_r(\cdot) \triangleq q(\cdot)$. The reinitialisation procedure then becomes similar to a normal importance sampling step, with the difference that the importance weights are no longer conditioned by the transition density $p(\mathbf{X}_k|\mathbf{X}_{k-1})$ (as is the case in Eq. (4.10) on the basis of Eq. (4.11)), thus allowing faster convergence.

^i The problem of how to choose the value of the reinitialisation probability $\alpha$ is discussed in the next subsection, as part of the final SIS algorithm development.
4.4.4 Final SIS Algorithm
The updated version of the SIS method is finally given in this section. The resulting particle filter, presented in Algorithm 4.2, is a three-option algorithm, similar to that used in [39, 58]. The task of drawing from a set of particles $\{\mathbf{X}_{k-1}^{(n)}\}_{n=1}^{N}$ according to the corresponding weights $\{w_{k-1}^{(n)}\}_{n=1}^{N}$ (step B.1. in Algorithm 4.2) can be easily implemented using a resampling method based on a cumulative weight function (see e.g. [6, 44, 57]). Alternatively, a variety of resampling methods have also been proposed in the literature (see [7, 65, 70, 113] for an exhaustive list).

In the computation of the importance sampling weights (step C.2.), an upper limit is imposed on the multiplicative correction factor that results from the application of the importance sampling principle. This “compression” operation is explained as follows. The (unnormalised) weight assigned to a particle generated in the standard bootstrap part of the algorithm is equal to the likelihood function $p(\mathbf{Y}_k|\mathbf{X}_k)$. This is also the weight given to a particle sampled from the reinitialisation density, corresponding to the case where the particle is relocated in the state space regardless of its previous state. Using the likelihood function can hence be regarded as a way to give a “standard” weight to the particles. Since the different importance functions $q(\mathbf{X}_k|\mathbf{Y}_{1:k})$ defined in Sections 4.3.3 and 4.3.4 are defined as pseudo-densities (with no particular care to ensure they are suitably normalised), specific values obtained from a pointwise evaluation of $q(\mathbf{X}_k|\mathbf{Y}_{1:k})$ might result in
abnormally large importance weights. It is hence necessary to limit the importance correction factor to make sure that the weights given to the importance samples remain within the “standard” range as well.

Assumption: The set of particles and weights $\{(\mathbf{X}_{k-1}^{(n)}, w_{k-1}^{(n)})\}_{n=1}^{N}$ is a discrete representation of the posterior distribution $p(\mathbf{X}_{k-1}|\mathbf{Y}_{1:k-1})$ at time $k-1$.

Iteration: At time $k$ and for $n = 1, \ldots, N$, choose (according to their respective probabilities) one of the following sampling methods:

A. Reinitialisation with probability $P_r$:
   1. Sample particle $\mathbf{X}_k^{(n)} \sim p_r(\mathbf{X}_k) = \tilde{q}(\mathbf{X}_k|\mathbf{Y}_{1:k})$.
   2. Compute unnormalised likelihood weight: $\tilde{w}_k^{(n)} = p(\mathbf{Y}_k|\mathbf{X}_k^{(n)})$.

B. Standard bootstrap with probability $P_b$:
   1. Sample particle $\mathbf{X}_k^{(n)} \sim p(\mathbf{X}_k|\mathbf{Y}_{1:k-1})$. This can be achieved by drawing a sample $\mathbf{X}_{k-1}^{(i)}$ from the set $\{\mathbf{X}_{k-1}^{(n)}\}_{n=1}^{N}$ with probability $w_{k-1}^{(i)}$, and then propagating it through the transition equation: $\mathbf{X}_k^{(n)} = g(\mathbf{X}_{k-1}^{(i)}, u_k)$.
   2. Compute unnormalised likelihood weight: $\tilde{w}_k^{(n)} = p(\mathbf{Y}_k|\mathbf{X}_k^{(n)})$.

C. Importance sampling with probability $P_i$:
   1. Sample particle $\mathbf{X}_k^{(n)} \sim \tilde{q}(\mathbf{X}_k|\mathbf{Y}_{1:k})$.
   2. Compute unnormalised importance weight, using Eqs. (4.11) and (4.13):
      $$\tilde{w}_k^{(n)} = p(\mathbf{Y}_k|\mathbf{X}_k^{(n)}) \cdot \min\left\{ \frac{p(\mathbf{X}_k^{(n)}|\mathbf{Y}_{1:k-1})}{q(\mathbf{X}_k^{(n)}|\mathbf{Y}_{1:k})},\ 1 \right\}.$$

Finally, normalise the importance weights so that they add up to unity:
   $$w_k^{(n)} = \frac{\tilde{w}_k^{(n)}}{\sum_{i=1}^{N} \tilde{w}_k^{(i)}}.$$

Result: The set of particles and weights $\{(\mathbf{X}_k^{(n)}, w_k^{(n)})\}_{n=1}^{N}$ is approximately distributed as the posterior density $p(\mathbf{X}_k|\mathbf{Y}_{1:k})$.

Algorithm 4.2: Updated SIS algorithm.
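The three-option iteration with its clipped correction factor can be sketched in a few lines. The code below is an illustration under strong simplifying assumptions, not the implementation used in this work: the state is a single scalar position, a Gaussian random walk stands in for the motion model, and all function names, the toy likelihood and the toy importance density are hypothetical. Drawing from the previous particle set uses a cumulative weight function, and the prior term is a weighted sum over the previous particles in the spirit of Eq. (4.11), with the constant $\psi$ added to each transition density.

```python
import bisect
import itertools
import math
import random

def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def sis_iteration(particles, weights, likelihood, q_sample, q_pdf,
                  P_r, P_i, step_std, psi, rng):
    """One iteration of a three-option particle filter for a scalar toy state."""
    cdf = list(itertools.accumulate(weights))  # cumulative weight function

    def prior_pdf(x):
        # Weighted sum over the previous particles, each transition density
        # being a Gaussian plus the small constant psi.
        return sum(w * (normal_pdf(x, xp, step_std) + psi)
                   for xp, w in zip(particles, weights))

    new_particles, new_weights = [], []
    for _ in particles:
        u = rng.random()
        if u < P_r:
            # A. Reinitialisation: relocate regardless of the previous state.
            x = q_sample(rng)
            w = likelihood(x)
        elif u < P_r + P_i:
            # C. Importance sampling: correction factor limited to 1.
            x = q_sample(rng)
            w = likelihood(x) * min(prior_pdf(x) / q_pdf(x), 1.0)
        else:
            # B. Standard bootstrap: resample, then propagate (random walk).
            i = bisect.bisect_right(cdf, rng.random() * cdf[-1])
            i = min(i, len(particles) - 1)
            x = particles[i] + rng.gauss(0.0, step_std)
            w = likelihood(x)
        new_particles.append(x)
        new_weights.append(w)

    total = sum(new_weights)
    return new_particles, [w / total for w in new_weights]

# Toy scenario: the particle set starts around 2.0 m, the source sits at 0.5 m.
rng = random.Random(0)
N = 30
true_x = 0.5
particles = [rng.gauss(2.0, 0.05) for _ in range(N)]
weights = [1.0 / N] * N

def likelihood(x):
    return normal_pdf(x, true_x, 0.1) + 1e-6  # small floor avoids all-zero weights

def q_sample(r):
    return r.gauss(true_x, 0.2)

def q_pdf(x):
    return normal_pdf(x, true_x, 0.2)

for _ in range(20):
    particles, weights = sis_iteration(particles, weights, likelihood, q_sample,
                                       q_pdf, P_r=0.01, P_i=0.25,
                                       step_std=0.02, psi=1e-3, rng=rng)

estimate = sum(w * x for x, w in zip(particles, weights))
print(estimate)
```

In this toy setting the bootstrap particles alone would stay near their initial location, since the random walk cannot cover the gap; the reinitialisation and importance samples are what move the weighted estimate towards the source.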
Finally, some consideration must be given to the respective probabilities of choosing one of the three sampling methods, denoted Pr , Pb and Pi in Algorithm 4.2. In [57], where the SIS principle is interpreted as a mixed-state model application, similar probability parameters are defined as constants, meaning that their respective values are set prior to the start of the algorithm and remain unchanged over time. However, as mentioned previously, the different types of importance function used in this work are derived from an imperfect measurement process, possibly leading to an improper sampling distribution for the importance samples in some cases. Figure 4.3 depicts such an example where particles can be drawn from three separate (and relatively spread) areas of the state space. This is a clearly ambiguous case in the sense that for single-source tracking, it is impossible to determine which one of these state space regions is located around the true target (if any). Thus, rather than drawing a large number of samples according to the importance density, a good percentage of which will be inevitably located at erroneous state space positions, it is preferable to drop this current importance measurement and rely on the bootstrap samples only to track the correct target. To implement this principle, the number of particles generated by each of the three options is here determined by the shape of the simplified importance function q˜(X k |Y 1:k ). Prior to the PF iteration, the suitability of this function for sampling
purposes is assessed by detecting the number $N_r$ of distinct sampling regions where importance particles are to be generated (for instance, $N_r = 3$ in the case of Figure 4.3).^j Following the obvious idea that the larger the value of $N_r$, the less suitable $\tilde{q}(\cdot)$ becomes to draw samples from, the various probabilities can be simply defined as:

$$P_r = \frac{\bar{P}_r}{N_r}, \qquad P_i = \frac{\bar{P}_i}{N_r}, \qquad P_b = 1 - P_r - P_i,$$

where $\bar{P}_r$ and $\bar{P}_i$ are the prior probabilities of reinitialisation and importance sampling, respectively, for the ideal case of the importance function presenting a single peak ($N_r = 1$). In the practical SIS implementation, these parameters have been optimised “by hand” and the best numerical values were found to be $\bar{P}_r = 0.01$ and $\bar{P}_i = 0.25$.

Note that it could also prove advantageous to implement a different way of influencing the number of state samples generated by each of the three sampling methods (or even combine it somehow with the technique described above). For instance, one could imagine that for talker tracking applications, the output of a speech detector would be used to determine whether reinitialisation and importance sampling are to be reduced in favour of pure bootstrapping, depending on whether speech is detected at the current source location estimate. From a more theoretical point of view, the quality of the sample set approximation could be determined at the beginning of each time frame, e.g. by using the effective sample size parameter:^k

$$N_{\mathrm{eff}} = \frac{1}{\sum_{n=1}^{N} \big(w_k^{(n)}\big)^2}.$$

The proportion of bootstrap, reinitialisation and importance particles could then be determined on the basis of the current value of $N_{\mathrm{eff}}$ (below or above a certain level). These techniques could potentially constitute the object of further research in order to determine if they can improve the overall efficiency of the tracker.

^j In this work, the task of determining the number $N_r$ of sampling regions is implemented with a recursive flood-fill algorithm.
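The region count, the resulting sampling probabilities and the effective sample size can be sketched as follows. This is a hedged illustration: the function names are hypothetical, the importance function is assumed to be available as a 2D grid of values, and a "sampling region" is assumed here to be a 4-connected set of grid cells above a threshold (the exact criterion used in this work is not restated in this section). An explicit stack replaces the recursion of the flood-fill so that large grids do not hit Python's recursion limit.

```python
def count_regions(grid, threshold):
    """Count 4-connected regions of cells whose value exceeds `threshold`."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = 0
    for r0 in range(rows):
        for c0 in range(cols):
            if seen[r0][c0] or grid[r0][c0] <= threshold:
                continue
            regions += 1                      # new, unvisited region found
            seen[r0][c0] = True
            stack = [(r0, c0)]
            while stack:                      # flood-fill with an explicit stack
                r, c = stack.pop()
                for rr, cc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                    if (0 <= rr < rows and 0 <= cc < cols
                            and not seen[rr][cc] and grid[rr][cc] > threshold):
                        seen[rr][cc] = True
                        stack.append((rr, cc))
    return regions

def sampling_probabilities(n_regions, P_r_bar=0.01, P_i_bar=0.25):
    """Return (P_r, P_b, P_i) scaled by the number of sampling regions."""
    if n_regions < 1:
        raise ValueError("at least one sampling region is required")
    P_r = P_r_bar / n_regions
    P_i = P_i_bar / n_regions
    return P_r, 1.0 - P_r - P_i, P_i

def effective_sample_size(weights):
    """N_eff = 1 / sum of squared (normalised) weights."""
    return 1.0 / sum(w * w for w in weights)

# Small made-up grid with three separate regions above the threshold.
grid = [
    [0, 0, 0, 0, 0],
    [0, 5, 0, 0, 7],
    [0, 5, 0, 0, 0],
    [0, 0, 0, 3, 0],
]
N_r = count_regions(grid, threshold=1)
P_r, P_b, P_i = sampling_probabilities(N_r)
print(N_r, P_r, P_b, P_i)
print(effective_sample_size([1.0 / 30] * 30), effective_sample_size([1.0] + [0.0] * 29))
```

With uniform weights the effective sample size equals the number of particles, and it drops to one when a single particle carries all the weight, which is why it is a convenient indicator of degeneracy.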
^k This parameter in effect measures (approximately) the inverse of the weight variance. It is generally used in a bootstrap-only PF implementation in order to determine if a resampling step should be carried out at the end of the PF iteration.

4.5 Practical Experiments

The practical simulation setup used for the experiments can be found in Section 2.5. The numerical values used for the practical simulations of the different SIS algorithms have also been given in the previous sections. A few more details remain to be described and these are discussed in the following subsection. The main simulation results are then given in Section 4.5.3.

4.5.1 Experimental Setup
The microphone signals used in the experiments were audio data sampled at 8kHz, either recorded in a typical office room or generated using the image method [3]. Each algorithm was based on the same data processing procedure: the incoming microphone signals were split into frames of 512 samples (processed using a Hamming window), and each frame was subsequently used as the observation for that particular time frame to compute both the importance and likelihood functions. The data processing was furthermore carried out using a 50% overlapping factor. The time variable k used throughout this document hence corresponds to a time increment of half the window size (256 samples, i.e. 32ms at 8kHz), rather than the discrete time index of the sampled audio data.

The main purpose of the experiments carried out in this research is to compare the different techniques (AEDA, GCC or SBF) used in the specific implementation of the SIS method. In this regard, algorithm parameters that are common to all three techniques were set to the same numerical value. The number N of particles was set to 30 for each algorithm. Also, the importance function was computed over a horizontal grid of points across the state space with a uniform 0.1m spacing, regardless of the type of importance observation. Finally, all the free parameters related to the SBF likelihood function were set to identical values for all three SIS algorithms. These specific likelihood parameters were set to the optimised numerical values resulting from the tuning process for the SBF-PL algorithm, as given in Section 3.9.2.

The other algorithm-specific parameters were tuned “by hand” for each method separately by simulating it a number of times with variable parameter values. The resulting parameter setting was chosen to be the value delivering the best tracking results for that specific method. The numerical values of such parameters have already been given in the previous sections of this chapter.
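The framing procedure described above (512-sample frames, Hamming window, 50% overlap) can be sketched as follows. This is only an illustration: the function names are hypothetical, the symmetric form of the Hamming window is an assumption, and the test tone is a stand-in for real microphone data.

```python
import math

def hamming(n):
    """Symmetric Hamming window of length n (assumed form)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def split_into_frames(signal, frame_len=512, overlap=0.5):
    """Split `signal` into windowed frames; returns (frames, hop size)."""
    hop = int(frame_len * (1.0 - overlap))   # 256 samples for 50% overlap
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        segment = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(segment, window)])
    return frames, hop

fs = 8000                                     # sampling frequency in Hz
signal = [math.sin(2.0 * math.pi * 200.0 * t / fs) for t in range(fs)]  # 1 s tone
frames, hop = split_into_frames(signal)
print(len(frames), hop, hop / fs)             # frame count, hop in samples, hop in s
```

Each increment of the time variable k then corresponds to one hop, i.e. `hop / fs` seconds of audio.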
4.5.2 Plots of Statistical Results
The assessment parameters used to characterise the tracking ability of an algorithm are the RMSE, MSTD and FCR factors (respectively root mean square error, mean standard deviation and frame convergence ratio), described in detail in Section 3.7. Since the tested algorithms are based on a stochastic simulation of Monte Carlo samples, it follows that different tracking results will generally be observed for two separate runs of the same algorithm on the same input data. Thus, a complete assessment of these methods can only be deduced by carrying out a number of statistical algorithm runs, and displaying the resulting statistical distribution of the tracking assessment parameters. In order to gain a maximum insight into these results, a boxplot representation has been chosen [103], as depicted in the bottom plot of Figure 4.4.

Figure 4.4: Example histogram of some statistical data with corresponding (horizontal) boxplot. Top panel: histogram (count versus data). Bottom panel: boxplot of the same data.

This type of plot has several graphic elements. The lines delimiting the ends of the box are the 25th and 75th percentiles of the data set (lower and upper quartile of the distribution, respectively). The distance between both ends of the box consequently determines the interquartile range. The line in the middle of the box is the median of the data set. If the median is not centered in the box, it is an indication of skewness. Data points plotted outside of the box can be considered either as part of the tail of the distribution, or as outliers in the data. Outliers may be the result of e.g. a poor measurement or a change in the system that generated the data. The entire set of statistical data points is also plotted together with this box representation, allowing some insight into more complex characteristics of the results, as for instance in the case of a multi-modal data distribution. Note that simply plotting the discrete distribution of the data set can be misleading because several data points could be drawn on top of each other in this representation, which would lead to an inaccurate interpretation of the
results. Representing median and interquartile information hence gives some useful indication regarding the distribution of the main part of the data set.

Rather than the usual averaging technique (geometric mean, arithmetic average, etc.), the median was chosen as measure of central tendency in this work. The average is a simple and popular estimate of location, and if the data samples come from a normal distribution, it is also an optimal parameter. However, outliers, data entry errors, or glitches exist in almost all real data, and the sample average is sensitive to these problems. One bad data value can move the average away from the center of the rest of the data by an arbitrarily large distance. The median is the 50th percentile of the sample set, which will only change slightly if a large perturbation is added to any value. It is hence a measure that is robust to outliers.

As for the measure of dispersion, its purpose is to determine how spread out the data values are. The standard deviation and the variance are popular measures of spread that are again optimal for normally distributed samples. Neither the standard deviation nor the variance is robust to outliers however. A data value that is separate from the body of the data can increase the value of these statistics by a potentially large amount. The interquartile range corresponds to the difference between the 75th and 25th percentile of the data. Since only the middle 50% of the data affects this measure, it is more robust to outliers.

As depicted at the top of Figure 4.4, a histogram of the resulting data would be the ideal tool for investigating statistical results of experimental tests. However, since a histogram is itself a two-dimensional plot, it does not lend itself very well to a 2D representation of results obtained over a whole range of test conditions, and would not be appropriate for an easy comparison between the different methods. Also, boxplot representations still allow for a complete understanding of various distribution characteristics (including multi-modality, skewness and the presence of outliers) while displaying at the same time the most useful distribution parameters like the median and interquartile range. These are the reasons why boxplots are used here instead of histograms.
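The robustness argument can be checked on a small numerical example. The sketch below uses made-up RMSE-like values (not results from this work) and a hypothetical helper `summarise`; it compares the mean and standard deviation with the median and interquartile range when one value is replaced by a large outlier, as would happen for a single diverged run. Quartiles are computed with the default method of Python's `statistics.quantiles`, which is one of several common quartile conventions.

```python
import statistics

def summarise(data):
    """Return mean, standard deviation, median and interquartile range."""
    q1, _, q3 = statistics.quantiles(data, n=4)   # lower quartile, median, upper quartile
    return {
        "mean": statistics.mean(data),
        "std": statistics.stdev(data),
        "median": statistics.median(data),
        "iqr": q3 - q1,
    }

# Made-up values for illustration only.
clean = [0.10, 0.11, 0.12, 0.12, 0.13, 0.13, 0.14, 0.15, 0.16]
contaminated = clean[:-1] + [5.0]   # one run replaced by a large outlier

print(summarise(clean))
print(summarise(contaminated))
```

In this example the median and interquartile range are identical for both data sets, whereas the mean and the standard deviation are dominated by the single outlier.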
4.5.3 Results
Practical simulation results are finally presented in this section. To begin with, two examples are considered that demonstrate the reinitialisation property of the SIS algorithm and its ability to successfully recover from large tracking errors. The SIS performance in these cases is clearly superior compared to the bootstrap results. The efficiency of the SIS method in tracking-only mode (effects of track acquisition excluded) is subsequently assessed and shows that this algorithm is also a viable practical tracker.

Tracking and Reinitialisation Example

A typical example of the tracking results achieved with the SIS algorithm is depicted in Figure 4.5. It contains the plots of the estimated source position versus time resulting from two example PF implementations, one of which is based on the SIS principle (SIS-SBF) and the other not (SBF-PL). The lines above and below the estimated source position represent plus/minus one standard deviation of the particle set for both coordinates x and y. The audio sample used in this example was recorded in a real office room with reverberation time $T_{60} = 0.39$s, with the acoustic source moving at a constant speed along a straight line over a distance of about 1.6m. The signal recorded with one of the array sensors is given as an example in the top plot of Figure 4.5.

Figure 4.5: Tracking result achieved with an SIS-based (right-hand side) and a non-SIS method (left-hand side). Bottom graphs: true source position (dotted line), source location estimate and lines representing ± one standard deviation of the particle set. Top plot: example of signal recorded with one array sensor for this simulation. Panels: SBF-PL and SIS-SBF, each showing x-position (m) and y-position (m) versus time (s).
This practical result also demonstrates the reinitialisation capabilities of the SIS method. At the start of the simulation, the set of particles is purposely initialised in a random room location (corresponding in this example to [x y] = [2.4m 3.4m]) about 2m away from the true source start position. As soon as the source starts emitting an acoustic signal, the SIS method is able to relocate the particles towards the true source position and subsequently tracks the target as it moves across the room. The other (non-SIS) method is unable to detect the source, because the current measurement data is not taken into account as it propagates the particles. For this algorithm, the particle set simply evolves randomly across the state space. Recovery will hence only occur if the true source happens to move to a location in the vicinity of the particle set, in which case tracking will resume.
The situation presented here constitutes a typical example of track acquisition (target detection), for which the SIS method clearly shows its superiority over a pure bootstrap implementation. In a practical tracking system, the latter will typically necessitate additional processing units dealing with target localisation and recovery from track loss, whereas an SIS method already integrates these functionalities at a low level in the algorithm.
Alternating Speaker Conversation

The results depicted in Figure 4.6 were obtained with a scenario where two speakers (located at different room positions) take part in an alternating conversation. The simulation was carried out using the image method to generate signals originating from two locations in the “virtual” room setup defined in Section 2.5, with reverberation time $T_{60} = 0.35$s. The first source, located at position $\ell_1 = [0.7\text{m}\ 1.1\text{m}]$, was defined as a male speaker uttering the sentence “Draw every outer line first, then fill in the interior”. The second source was at location $\ell_2 = [1.5\text{m}\ 2.8\text{m}]$ and emitted the female speech signal “What is this large thing by the ironing board?”. Each sentence was segmented into two parts and “sent” from the corresponding source location in an alternating way. White Gaussian noise was added to the resulting microphone signals with an SNR level of about 20dB.

The plot at the top of Figure 4.6 shows an example of signal resulting for one of the sensors. The vertical dotted lines represent time instants at which a speaker change occurs in the original source signal. Note that the apparent power level difference between the sources is here mainly the result of one speaker being located closer to the considered sensor than the other. The plots at the bottom of Figure 4.6 show the tracking results obtained with the SIS algorithm (as depicted in Figure 4.5, the bootstrap method would clearly fail in such a situation). This demonstrates once again the efficiency of the SIS method, which automatically switches between talkers as soon as a speech signal is detected at a different location of the state space. Note that this situation is not a case of multiple source tracking, since only one position estimate is available at any one time. These results are hence only relevant for cases where the source signals do not overlap substantially over time.

Figure 4.6: SIS tracking results with alternating conversation scenario. Top plot: example of audio signal recorded with one of the array sensors. Vertical dotted lines denote a change of speaker. Bottom plots: tracking results for coordinates x and y (x-position and y-position in m, versus time in s). Dotted lines represent the position of the active source.
Image Method Results

This subsection, as well as the next one (real audio results), deals with an assessment of the particle filters operating in tracking mode only. From now on, it is assumed that at the start of every single simulation, the particles are all initialised at the location corresponding to the true source position in the state space. This makes it possible to bypass the initial phase of track acquisition and to focus on how well the algorithms perform once the source has been detected.

Figure 4.7: Experimental image method results for RMSE parameter. Bootstrap algorithms are grouped on the left-hand side of the figure, SIS methods are on the right-hand side (top three plots). The last plot (bottom right-hand corner) shows the RMSE results obtained with traditional ASL methods. Panels, left: SBF-PL, SBF-GL, GCC-PL, GCC-GL; right: SBF-SIS, AEDA-SIS, GCC-SIS, classical methods (SBF, GCC, AEDA). Axes: RMSE (m) versus $T_{60}$ (s), for $T_{60}$ between 0.00 and 0.79.
Figures 4.7, 4.8 and 4.9 present the results from experimental simulations carried out with a range of “synthetic” audio samples generated using the image method [3]. Despite being a purely theoretical approximation technique, the advantage of the image method is the possibility to choose the reverberation time ultimately resulting in the audio samples. It is hence possible to gain a general feeling about how the tracking algorithms behave with different levels of reverberation. The simulation setup was defined as described in Section 2.5. In these experiments, the reverberation time T60 was varied over the range [0, 0.79s], with T60 = 0s corresponding to the anechoic case. The audio samples used in the simulations were generated as follows. A single example of source trajectory
4.5 Practical Experiments
92
SBF−PL
SBF−SIS 0.4 MSTD (m)
MSTD (m)
0.4 0.3 0.2 0.1 0
0.3 0.2 0.1 0
0.00 0.06 0.13 0.21 0.28 0.35 0.42 0.50 0.57 0.64 0.71 0.79 SBF−GL
AEDA−SIS 0.4 MSTD (m)
MSTD (m)
0.4 0.3 0.2 0.1 0
0.3 0.2 0.1 0
0.00 0.06 0.13 0.21 0.28 0.35 0.42 0.50 0.57 0.64 0.71 0.79 GCC−PL
0.4 MSTD (m)
MSTD (m)
0.00 0.06 0.13 0.21 0.28 0.35 0.42 0.50 0.57 0.64 0.71 0.79 GCC−SIS
0.4 0.3 0.2 0.1 0
0.00 0.06 0.13 0.21 0.28 0.35 0.42 0.50 0.57 0.64 0.71 0.79
0.3 0.2 0.1
0.00 0.06 0.13 0.21 0.28 0.35 0.42 0.50 0.57 0.64 0.71 0.79 GCC−GL
0
0.00 0.06 0.13 0.21 0.28 0.35 0.42 0.50 0.57 0.64 0.71 0.79 T
60
(s)
MSTD (m)
0.4 0.3 0.2 0.1 0
0.00 0.06 0.13 0.21 0.28 0.35 0.42 0.50 0.57 0.64 0.71 0.79 T
60
(s)
Figure 4.8: Experimental image method results for MSTD parameter. Bootstrap algorithms are grouped on the left-hand side of the figure, SIS methods are on the right-hand side.
and source signal was considered. The source path corresponds to a 1.59m straight line located approximately in the middle of the room, with start and end points ℓstart = [0.78 2.29] and ℓend = [2.04 1.33] (in m). The acoustic source signal was the sentence “Draw every outer line first, then fill in the interior ” uttered by a male speaker and looped twice, resulting in a 7.26s audio sample and a source speed of about 0.22m/s. The source trajectory was divided into 120 segments, corresponding to a distance increment of 1.32cm from one frame to the next. Similarly, the source signal was split up into 120 non-overlapping frames of audio data, yielding a frame length of about 60.5 · 10−3s. For each signal frame, the transfer function between the corresponding source position and the current array sensor was computed using the image method. The audio data for that specific sensor was then obtained by convolving the frame of signal data with the resulting transfer function, and subsequently combining it additively with the previous convolution results. The process
4.5 Practical Experiments
93
[Figure 4.9 plot area: boxplots of FCR (%) versus T60 (s), for T60 from 0.00 to 0.79 s, in panels SBF-PL, SBF-GL, GCC-PL, GCC-GL, SBF-SIS, GCC-SIS and AEDA-SIS.]
Figure 4.9: Experimental image method results for FCR parameter. Bootstrap algorithms are grouped on the left-hand side of the figure, SIS methods are on the right-hand side.
was then repeated for each array microphone over the entire set of signal frames. Finally, white noise was added to the resulting microphone signals with an approximate SNR level of 20dB.
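The frame-wise simulation procedure described above can be sketched as follows. This is a minimal illustration, not the code used for the thesis: the callback `rir_for_frame` is a hypothetical stand-in for the image method (it is assumed to return the impulse response between the source position of frame k and the microphone), and the function names are introduced here for illustration only.

```python
import numpy as np

def simulate_microphone(signal, n_frames, rir_for_frame):
    """Frame-wise convolution and overlap-add for a single microphone.

    rir_for_frame(k) is a hypothetical callback returning the impulse
    response for the source position of frame k (assumed to come from
    an image method implementation, which is not shown here).
    """
    signal = np.asarray(signal, dtype=float)
    frame_len = len(signal) // n_frames            # non-overlapping frames
    rirs = [np.asarray(rir_for_frame(k), dtype=float) for k in range(n_frames)]
    tail = max(len(h) for h in rirs) - 1           # longest reverberation tail
    out = np.zeros(n_frames * frame_len + tail)
    for k, h in enumerate(rirs):
        frame = signal[k * frame_len:(k + 1) * frame_len]
        y = np.convolve(frame, h)                  # response to this frame only
        out[k * frame_len:k * frame_len + len(y)] += y   # add to earlier results
    return out

def add_white_noise(x, snr_db, rng):
    """Add white Gaussian noise at the requested average SNR (in dB)."""
    noise_power = np.mean(x ** 2) / 10.0 ** (snr_db / 10.0)
    return x + rng.normal(0.0, np.sqrt(noise_power), size=x.shape)
```

With 120 frames and a different impulse response per frame, the tail of each convolution spills into the following frames, which is what the additive combination with the previous convolution results accounts for. Calling `add_white_noise(x, 20.0, np.random.default_rng(0))` would then correspond to the approximate 20 dB SNR mentioned in the text.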
Method class            Method      Time (s)
SIS PF methods          SIS-SBF     5.9
                        SIS-GCC     6.1
                        SIS-AEDA    10.5
Bootstrap PF methods    SBF-PL      2.9
                        SBF-GL      20.4
                        GCC-PL      1.4
                        GCC-GL      6.1
Traditional methods     SBF         ∼1076
                        GCC         3.4
                        AEDA        5.7

Table 4.1: Computation times (in s) required by ASL methods to process one second of audio data, on the basis of a Matlab implementation.

The results presented in this section were obtained by simulating each PF algorithm 100 times for the considered audio sample (i.e. the algorithms were run 100 times on the same audio data). The statistical distribution of the assessment parameters over these 100 trials is then plotted using a boxplot representation. Figures 4.7, 4.8 and 4.9 present the statistical results for the RMSE, MSTD and FCR parameters respectively, with a separate graph for each of the PF methods. For comparison with the SIS methods, results from the various PF implementations of Chapter 3 (i.e. SBF-PL, SBF-GL, GCC-PL and GCC-GL) have been included in the plots. These bootstrap methods have been statistically simulated in exactly the same way as the SIS methods. For completeness, the tracking results of the traditional SBF, GCC and AEDA methods (described in Section 2.2) are also given where applicable, i.e. in the plot of RMSE parameter values. Due to the deterministic nature of these classical algorithms however, this plot shows the results of a single algorithm run (no statistical averaging is necessary).
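The statistical evaluation over repeated runs can be illustrated with the short sketch below. It assumes that the RMSE of one run is the root mean square distance between the estimated and true source positions over all frames; the exact definitions of the assessment parameters are given earlier in the thesis and may differ in detail. The function names are hypothetical.

```python
import numpy as np

def run_rmse(estimates, truth):
    """RMSE (in m) over all frames of one run; inputs have shape (frames, 2)."""
    err = np.linalg.norm(np.asarray(estimates, dtype=float)
                         - np.asarray(truth, dtype=float), axis=1)
    return float(np.sqrt(np.mean(err ** 2)))

def boxplot_summary(values):
    """Extremes, quartiles and median of one parameter over all runs."""
    q0, q1, q2, q3, q4 = np.percentile(values, [0, 25, 50, 75, 100])
    return {"min": q0, "q1": q1, "median": q2, "q3": q3, "max": q4}
```

Applying `boxplot_summary` to the 100 per-run values of a parameter yields the quantities that a boxplot displays; a deterministic method would contribute a single value instead of a distribution.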
Real Audio Results

The SIS PF algorithms have been tested further with samples of audio data recorded in a typical office room, using a variety of source paths and speech signals. Here again, the experimental setup description can be found in Section 2.5. The frequency-averaged reverberation time of the recording environment was measured as T60 = 0.39 s. Calculations based on the silence periods detected in the recorded audio data yielded an estimated average recording SNR of 9.4 dB for the entire set of audio samples. Figures 4.10, 4.11 and 4.12 present the statistical results obtained for 9 different audio samples (each reflecting a different source signal and trajectory, and hence a different level of tracking difficulty) for the tracking assessment parameters RMSE, MSTD and FCR respectively. The last three audio samples in each figure (samples 7, 8 and 9) correspond to cases where the acoustic source is stationary.
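One plausible way of estimating a recording SNR from silence periods is sketched below. The thesis does not detail the calculation, so this is an assumption: the noise power is measured on samples flagged as silence, and the speech power is taken as the power of the remaining samples minus that noise power. The boolean mask `is_silence` is a hypothetical input, e.g. from a voice activity detector.

```python
import numpy as np

def estimate_snr_db(x, is_silence):
    """Estimate the SNR (in dB) of a recording from its silent segments.

    Assumes speech and noise are uncorrelated, so that the power of the
    active segments is the sum of the speech power and the noise power.
    """
    x = np.asarray(x, dtype=float)
    is_silence = np.asarray(is_silence, dtype=bool)
    noise_power = np.mean(x[is_silence] ** 2)
    active_power = np.mean(x[~is_silence] ** 2)
    speech_power = max(active_power - noise_power, 1e-12)  # guard against <= 0
    return 10.0 * np.log10(speech_power / noise_power)
```

Averaging such estimates over all recorded samples would give a single figure comparable to the average recording SNR quoted above.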
Computation Complexity

The computational load of each method has been roughly assessed by measuring the time it requires to process a certain amount of audio data. Table 4.1 presents the resulting computation times required per second of input data. These
[Figure 4.10 plot area: nine boxplot panels (Sample 1 to Sample 9) of RMSE (m), on a scale from 0 to 2 m, with one box per tested method.]
Figure 4.10: Experimental real audio results for RMSE parameter. A stationary source was used in the recording configuration of samples 7, 8 and 9.
[Figure 4.11 plot area: nine boxplot panels (Sample 1 to Sample 9) of MSTD (m), on a scale from 0 to 0.5 m, with one box per tested method.]
Figure 4.11: Experimental real audio results for MSTD parameter. A stationary source was used in the recording configuration of samples 7, 8 and 9.
[Figure 4.12 plot area: nine boxplot panels (Sample 1 to Sample 9) of FCR (%), on a scale from 0 to 100%, with one box per tested method.]
Figure 4.12: Experimental real audio results for FCR parameter. A stationary source was used in the recording configuration of samples 7, 8 and 9.
values are based on a Matlab implementation of the different algorithms running on a 3 GHz computer. No particular attempt was made to optimise the code for execution speed. Consequently, the processing times reported in this table give only a general indication of the computational complexity, and only allow for a coarse performance comparison between the different methods.
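The kind of measurement reported in Table 4.1 can be reproduced in principle with a simple timing wrapper. The sketch below is written in Python rather than Matlab, and `process` is a hypothetical callable standing in for any of the tested methods.

```python
import time

def seconds_per_audio_second(process, audio, sample_rate):
    """Wall-clock processing time divided by the duration of the audio."""
    duration = len(audio) / float(sample_rate)   # input length in seconds
    start = time.perf_counter()
    process(audio)                               # hypothetical tracker call
    elapsed = time.perf_counter() - start
    return elapsed / duration
```

A returned value above 1 means that the implementation runs slower than real time, which is the case for all entries of Table 4.1.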
4.6 Results Analysis
4.6.1 Discussion of Image Method Results
In the results obtained from the image method simulations (Figures 4.7, 4.8 and 4.9), it can be clearly seen that, for higher T60 values, the bootstrap algorithms tend to produce a distribution of the tracking parameter values that is split into two or more distinct modes. This is for instance the case for the RMSE and FCR distribution results of SBF-PL for T60 > 0.42 s (Figures 4.7 and 4.9 respectively). This typical behaviour of bootstrap algorithms can be explained as follows. For low levels of reverberation, the bootstrap filters successfully manage to track the source in all of the algorithm runs. Moreover, they do so with a more or less constant level of accuracy, resulting in a tight concentration of the statistical assessment parameter values. Increasing the reverberation time induces a certain amount of disturbance, leading to a less constant tracking accuracy for the different bootstrap methods over the statistical runs. The tracking performance decreases for some of the simulation runs (despite the specific method being run on the exact same audio data), and the distribution of the statistical results becomes increasingly spread out. Past a certain level of T60, which varies from algorithm to algorithm, the bootstrap filter will start to lose track of the correct source location occasionally (i.e. for a certain number of statistical runs), because the disturbance becomes too severe. As mentioned previously and demonstrated in Figure 4.5, bootstrap methods do not have any reinitialisation capabilities, and a bootstrap PF will usually never recover from a track loss. As a consequence of the particle set diverging from the true source position in these cases, the values of RMSE and FCR will be anomalously large for that specific algorithm run, creating an “outlier” in the resulting distribution of these two tracking assessment parameters.
As the reverberation level is increased even more, the statistical incidence of such cases will also increase accordingly, and a distinct mode will eventually appear in the
tracking parameter distributions as more and more outliers are generated in this process. As expected, this “split distribution” effect is usually avoided with a particle filter based on the SIS technique. Whenever such an algorithm momentarily loses track of the target, its built-in reinitialisation capability allows it to automatically resume correct tracking shortly after. Consequently, no outlier can be observed in the statistical distribution of the tracking assessment parameters for these methods. The spreading of the values in the empirical SIS distributions is however more pronounced than for the bootstrap methods, as this effect is the result of a slightly different process in the case of SIS algorithms. Whereas the reinitialisation ability of such methods allows them to successfully recover from track losses, it also means that they are more likely to be misled by erroneous observation data, as some of the particles can be incorrectly relocated around a likelihood peak resulting from reverberation effects. Due to the spurious (i.e. temporally inconsistent) nature of reverberation disturbances, the SIS algorithms are usually only misled for a short period of time, after which the particle set is reinitialised and starts tracking the correct source again. It is however these short and occasional deviations from the true source trajectory that ultimately lead to an increased spread of the tracking parameter values in the resulting experimental distributions.
Note that as the reverberation time increases, the bootstrap methods will usually go from a state where tracking is successful on average with a few cases of track loss, to a state where tracking is unsuccessful most of the time with a few cases of accurate tracking. This transition can be distinctly observed for instance in the RMSE results of the GCC-GL method (Figure 4.7) for a T60 value of about 0.64 s. This transition effect (determining the level of reverberation for which the tracking performance of an algorithm breaks down) is bound to happen for each bootstrap method. The image method simulations have simply not been carried out over a wide enough range of T60 values to be able to observe this effect for all the bootstrap algorithms.
The tracking performance of the SIS methods is on average better compared to the GCC-based bootstrap algorithms. However, this result is most likely to be a consequence of using steered beamforming as observations for the reweighting part
of the SIS algorithm (likelihood function). Knowing that SBF measurements are more robust than those obtained with a GCC-based technique, it is not surprising to see that particle filters using SBF as main observation deliver the best tracking results. When comparing the algorithms delivering the best tracking performance in each of the SIS and bootstrap classes, namely SBF-PL and SIS-SBF, the results tend to show that the introduction of the SIS principle into the basic bootstrap technique does not seem to improve the tracking performance. Despite the fact that the SIS method manages to avoid the occurrence of track loss, its average tracking performance (based on the RMSE parameter) is not as good as for SBF-PL, and the extent of the RMSE distribution is usually as wide for SIS-SBF as it is for the worst-case results (outliers) of SBF-PL. It is hence a subjective matter to decide which of these two methods is more suitable for a practical implementation. This specific question is discussed further in the last section of this chapter. It is worth emphasising here once again that this comparison is made on the basis of tracking-only results. The effects resulting from the initial phase of track acquisition are here not included in the analysis. The performance of the bootstrap method would be for instance much worse if the particles were to be initialised uniformly across the state space.
Contrary to what was expected, the specific technique (SBF, GCC or AEDA) chosen to build the importance function does not drastically influence the resulting tracking performance of the SIS algorithm. All three tested SIS methods show very similar tracking accuracy with only minor differences that could possibly even be the sole result of the statistical nature of the simulations. This is somewhat surprising considering that the SBF principle usually provides observations that are more robust to reverberation than TDE methods. Consequently, this tends to show that the choice of importance function is not as critical as the choice of likelihood function in the design of an SIS algorithm, at least for the practical case of ASL.
Some comments are finally given regarding an interesting phenomenon occurring in the image method simulations. Whereas the average MSTD values can be considered somewhat constant for the bootstrap methods over the range of T60, the average standard deviation of the particle set for SIS algorithms distinctly
increases as the reverberation becomes more important. From a close inspection of some simulation runs, the cause of this behaviour can again be traced back to the intrinsic reinitialisation property of the SIS principle. Due to importance sampling, some of the state samples might be relocated away from the bulk of the particles, potentially scattering the particle set across the state space, which then accounts for a larger MSTD value. With increased levels of reverberation, the importance and likelihood functions also become more spurious, leading to an increased probability of some particles being relocated away from the main particle set (spurious importance function) as well as an increased chance of survival for these relocated particles (spurious likelihood). This effect hence becomes more and more pronounced as T60 increases. The MSTD parameter can be considered as an estimate of the confidence level provided by the particle filter regarding its source location estimates. From this point of view, the MSTD values obtained with the SIS algorithms are obviously more realistic than those resulting from bootstrap methods. Whereas the overall tracking performance of all tested algorithms consistently decreases with an increasing T60, only the MSTD values of the SIS methods effectively indicate that the tracking becomes less accurate for increasing levels of reverberation.
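The effect of relocated particles on the spread of the particle set can be illustrated numerically. The sketch below assumes that the spread at one time step is the weighted standard deviation of the particle positions around their weighted mean; the exact MSTD definition is given earlier in the thesis and may differ in detail. All numerical values are synthetic and chosen only for illustration.

```python
import numpy as np

def particle_spread(positions, weights):
    """Weighted standard deviation (in m) of the particles around their mean."""
    positions = np.asarray(positions, dtype=float)   # shape (N, 2)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                  # normalise the weights
    mean = w @ positions                             # weighted mean position
    sq_dist = np.sum((positions - mean) ** 2, axis=1)
    return float(np.sqrt(w @ sq_dist))

rng = np.random.default_rng(1)
cluster = rng.normal([1.5, 1.8], 0.05, size=(50, 2))  # tight particle set
scattered = cluster.copy()
scattered[:5] = rng.uniform(0.0, 3.0, size=(5, 2))    # five relocated particles
uniform = np.full(50, 1.0 / 50)
print(particle_spread(cluster, uniform), particle_spread(scattered, uniform))
```

Relocating only a few particles away from the main cluster is enough to increase the spread markedly, which is consistent with the larger MSTD values observed for the SIS methods.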
4.6.2 Discussion of Real Audio Results
In the simulations involving real audio data recordings, the tracking performance achieved by the tested algorithms is not always consistent from sample to sample (Figures 4.10, 4.11 and 4.12). It is then hard to determine whether the difference in tracking performance is really due to some algorithms being more efficient than others, or to some other influencing factors like the specific source signals and trajectories used, or even the practical recording setup. Also, due to the rather limited number of simulations using real audio data, it is difficult to observe a definitive trend as to which algorithm is more successful in these practical tests. A comparison based solely on the median RMSE values seems to indicate a slightly better performance for SBF-PL over all other methods. It also tends to show that the SIS-GCC method has the worst tracking performance compared to the other SIS algorithms. However, it is clear that these results should be verified with more samples of real audio data and using a more accurate monitoring of the true source location.
In the case of a stationary source, it can be expected that the PF principle would not work well. The spurious reverberation peaks in the measurements would become as consistent temporally as the true source peak, misleading the PF algorithm regarding which peak really belongs to the true source. As mentioned earlier, the statistical simulations were all started with the particle set initialised at the correct source location (explicitly assessing the algorithms’ tracking performance rather than their localisation ability). Since the bootstrap filters cannot generate particles outside the existing set at any time step, they should theoretically deliver better results in the experimental simulations due to the fact that the particle set “tracks” the correct peak from the start. With the SIS methods on the other hand, if e.g. the reverberation process creates a peak that is as large as or larger than the true source peak in the measurements, the particles might be relocated onto that spurious peak and incorrectly “track” it instead of the desired source. Also, this behaviour is independent of whether the particles are initialised at the correct source position or not. For two out of the three samples where the source is stationary (samples 7, 8 and 9), the SIS methods seem to achieve a lower level of performance compared to bootstrap algorithms, with larger RMSE values on average (especially for SIS-GCC). This could indeed be the result of the SIS methods’ reinitialisation property being misled by temporally consistent spurious peaks in the observations. Here again, a larger set of audio samples would be necessary for the simulation results to fully confirm this hypothesis.
4.6.3 Discussion of Computational Load Assessment
The tested implementations of SIS algorithms all involve heavier computational requirements compared to the bootstrap methods that are of interest for practical implementations (mainly SBF-PL). This is of course the result of having to compute an importance function for a grid of points across the state space, which is the main additional load introduced by the SIS principle. This increase in computational complexity is however not excessive (the resulting processing requirements remain of the same order of magnitude) and hardly makes a difference for present-day computers, as shown in the following example. The SBF-PL algorithm has been implemented in C on a 1.7 GHz Linux computer (see [82]) and runs very comfortably with 30 particles using approximately 10 to 15% of the CPU power (including the graphical display of the particles and source position estimate on screen). Also, the number of particles can be further increased to a value of about 450 before the application starts misbehaving. See Appendix C for more details on this real-time PF development.

[Figure 4.13 plot area: two panels showing RMSE (m), on a scale from 0 to 0.4 m, and FCR (%), on a scale from 0 to 100%, versus grid spacing (m) from 0.1 to 0.5 m.]
Figure 4.13: SIS-SBF tracking performance results for various grid spacing values used in the importance function computations.

When considering a practical implementation of e.g. the SIS-SBF algorithm, each grid point used for the importance function computations can basically be regarded as one additional particle for which the output power of the beamformer has to be computed. The major difference for the importance function is that the DSB computations only need to be carried out over a small range of low frequencies, which drastically reduces the computational load involved for these grid points compared to the “real” particles. The SIS implementations tested in this work made use of a uniform grid spacing of 0.1 m as described in Section 4.5.1, which defines a grid of about 1170 points for the importance function computations, given the practical room dimensions of Section 2.5. Depending on how much processing power is saved when computing the low-frequency DSB output (rather than the full-range output power), this algorithm may or may not be too complex computationally for a practical implementation. However, as depicted in Figure 4.13, the spacing of the importance function grid can be increased by a relatively large amount without a substantial degradation of the tracking performance. For instance, doubling the spacing to 0.2 m results in about the same level of performance for SIS-SBF, and the
grid now only contains about 300 points. The computational expense introduced by the SIS principle on top of the standard bootstrap method is consequently not a problem for a practical implementation of this specific algorithm using the computer system mentioned previously. It is expected that the same sort of logic applies to a practical implementation of the SIS-AEDA method, despite the fact that this algorithm requires about twice as much processing power as SIS-SBF in the software simulations (see Table 4.1).
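As a rough check of the grid sizes quoted above, assume a rectangular floor area of dimensions $L_x \times L_y$ covered by a uniform grid of spacing $\Delta$ (these symbols are introduced here for illustration only). The number of grid points then scales with the inverse square of the spacing:

\[
N(\Delta) \approx \frac{L_x}{\Delta} \cdot \frac{L_y}{\Delta}
\quad\Longrightarrow\quad
N(0.2\,\mathrm{m}) \approx \frac{N(0.1\,\mathrm{m})}{4} \approx \frac{1170}{4} \approx 293,
\]

which agrees with the figure of about 300 points. Under the same assumption, the importance function workload would drop by a factor of roughly four each time the grid spacing is doubled.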
4.7 Conclusions
The research presented in this chapter has been partly driven by the work done in [58], which is concerned with the implementation of a real-time hand tracker using video information, and which is also based on the SIS principle (Icondensation algorithm). This application presents a number of similarities with the problem of acoustic source tracking considered in this research. In a way similar to the techniques developed here, coarse information regarding the target position is first extracted from the image data using skin colour detection, and is then used to build the importance function. The likelihood is then computed on the basis of the same underlying visual data using a refined contour tracker for hand templates. The target dynamics for both applications at first seem to be quite similar as well. Hence, considering the claim of [58] that Icondensation is more robust to rapid motion, heavy clutter and hand-coloured distractors than a standard bootstrap implementation, it is somewhat disappointing to see that “upgrading” SBF-PL to include an importance sampling step does not actually result in an improvement of the average tracking performance. After some consideration of that specific question however, it appears that the type of tracking problem defined by ASL is significantly more challenging than visual tracking. First, the process of reverberation introduces a complex type of disturbance in the measurements used in ASL, affecting both the importance function and the likelihood. For the hand tracker case, the presence of a target (hand) or noise source (other skin-coloured object) in the field of view will usually generate one single mode in the importance function. This mode is further likely to be rejected by the likelihood function in case it is generated by a skin-coloured distractor not recognised as a hand template. For ASL however, a sound source (speaker)
or acoustic noise (computer fan, air vent, etc.) in the state space will result in multiple “ghost” peaks in the observations used for the importance function and the likelihood, due to the reverberation effects. These multiple modes constitute an increased level of disturbance for an acoustic source tracker. Also, the practical problem of target occlusion is encountered in both types of application, but again to a lesser extent in the case of visual tracking. In vision, occlusion typically occurs whenever the hand becomes totally hidden by another object, in which case the sensors will momentarily fail to detect the target. The corresponding case in ASL is when the acoustic source becomes momentarily silent (no measurements available from the source). Whereas occlusion is a phenomenon that is expected to be rather occasional in vision, it occurs in the case of ASL with every pause between words in the speech signal, which is relatively often. This again leads to a higher degree of misleading information in ASL. The combination of frequent occlusions and reverberation probably makes matters even worse for an acoustic source tracker.
The major advantage involved in the use of importance sampling in the design of particle filters is unquestionably its reinitialisation property. In particular, the SIS principle makes the resulting algorithm able to automatically recover from track loss, lock onto a new source entering the scene, switch between speakers taking turns, etc., and consequently makes it much better suited than bootstrap methods for most of the typical scenarios encountered in practice. The design of an SIS algorithm, as with any other tracking system, involves a careful compromise between increasing the filter’s freedom to search a wider area of the state space, and increasing the probability of incorrect (i.e. off-track) reinitialisation. In other words, a compromise must be found between localisation performance and tracking performance. The SIS method developed in this research (see Algorithm 4.2) offers an “analogue” way of tuning this compromise, namely by changing the values of Pr and Pi . Bootstrap methods constitute an extreme limit in this tradeoff (with Pr = Pi = 0) where the particle filter has no way to reinitialise. Provided the bootstrap filter is tuned very accurately for a specific tracking scenario, it will hence deliver the best possible results, as disturbances are much less likely to distract it. This is why the bootstrap filter SBF-PL shows a better
tracking performance than any SIS method in the simulation results (see footnote l). On the other hand, tracking using a bootstrap method is absolutely useless for instance if it fails to localise a new acoustic source that has just appeared randomly in the room. As mentioned earlier for this algorithm, additional (and potentially complex) units would have to be developed for a real system to deal with such weaknesses, adding to the overall cost and processing delay of the resulting tracker. Finally, it must be noted that the SIS methods tested in this work were the result of design choices considered to be the most sensible for this kind of tracking application. Whether a better combination of the SBF, GCC, AEDA or other techniques (used as importance or likelihood function) exists is a matter for possible further research.

Footnote l: One should however also note that the SIS tracking performance can potentially be improved considerably by using some simple external information (such as the output of a voice activity detector) to influence its tracking behaviour (i.e. its reinitialisation probability), as mentioned e.g. in Section 4.4.4.
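The tuning compromise discussed above can be pictured with a small sketch of a three-way proposal step in the style of Icondensation. This is not a reproduction of Algorithm 4.2: it assumes that Pr and Pi act as per-particle probabilities of reinitialisation and of importance sampling respectively, and the callbacks `sample_importance` and `propagate` are hypothetical. The weight corrections required by importance sampling are deliberately omitted.

```python
import numpy as np

def propose(particles, p_r, p_i, sample_importance, propagate, rng):
    """Draw one new state per particle using a three-way random branch.

    Returns the new particles and, for each one, the branch used:
    0 = reinitialisation, 1 = importance sampling, 2 = bootstrap prediction.
    Branches 0 and 1 differ only in how the weights are later corrected,
    which is not shown in this sketch.
    """
    particles = np.asarray(particles, dtype=float)
    u = rng.random(len(particles))
    branch = np.where(u < p_r, 0, np.where(u < p_r + p_i, 1, 2))
    new = np.array(propagate(particles, rng), dtype=float)  # dynamics model
    relocated = branch < 2
    # Relocated particles are drawn from the importance function instead.
    new[relocated] = sample_importance(int(relocated.sum()), rng)
    return new, branch
```

With `p_r = p_i = 0`, every particle follows the dynamics model and the step reduces to a bootstrap prediction, which corresponds to the extreme limit mentioned in the text. Larger values give the filter more freedom to search the state space, at the price of more frequent off-track relocations.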
Chapter 5
Lower Bound on Estimation Error

The research presented in this chapter is motivated by the need to determine a lower estimation error bound for ASL-based tracking algorithms. Since an optimal filter is usually impossible to realise in practice, this lower bound is used as a maximum performance limit to which the previously developed algorithms are compared, on the basis of the specific ASL problem definition. As a generalisation of the well-known Cramér-Rao bound theory, the so-called posterior Cramér-Rao bound (PCRB) is directly applicable to this kind of problem, and the present research consequently makes use of this specific parameter. It is found however that the PCRB does not completely capture the sort of performance one is interested in for practical applications of acoustic source localisation and tracking.
5.1 Introduction and Motivation
On the basis of the previous developments and results given in this work, it should be clear that optimal estimators for nonlinear filtering problems are usually impossible to implement in practice. In order to assess the quality of any filtering algorithm in terms of absolute performance (not simply with respect to some other algorithm), a theoretical maximum efficiency limit can sometimes be derived. This type of bound on estimation error gives an indication of the best performance achievable by an optimal estimator, and it is solely determined by the specific system under consideration. This theoretical performance limit can then be used as a reference to which the implemented estimator can be compared. Several types of lower bound for nonlinear systems have been presented in the literature, including the well-known Cramér-Rao bound (CRB) for random
parameter estimation [115]. Theoretical derivations of such bounds can be found e.g. in [1, 11, 110], and the reader is referred to [64, 111] for a more detailed overview of existing lower bound types. In this chapter, the approach described by Tichavský et al. in [111] will be used to compute the so-called posterior Cramér-Rao bound (PCRB). This approach constitutes a novel and simple method of deriving the posterior lower bounds for discrete-time nonlinear filtering problems, and it will be applied here specifically to the problem of interest in this thesis, namely acoustic source localisation (ASL) and tracking.
The Cramér-Rao lower bound has been used in a variety of practical applications as an indication of the maximum performance theoretically achievable with an optimal unbiased estimator. In [22, 24] for instance, the CRB is used to assess the accuracy of source location estimates delivered using a randomly distributed beamforming sensor array. Ristic and Arulampalam, in association with various authors, present extensive developments of the CRB for the problem of bearings-only target motion analysis (with an emphasis on defence applications) and ballistic object tracking in articles such as [95, 96, 99, 100]. The work presented in [52] uses the same principles but applied to multiple target tracking. This type of lower bound analysis has also been used in conjunction with the concept of terrain-aided aircraft navigation [11, 12]. However, the majority of the above-mentioned publications present lower bound computations derived on the basis of a relatively simple modelling of the specific problem under consideration. In particular, the expression describing the system observations usually relies on a straightforward (despite being nonlinear) mathematical function of the state variable with additive white Gaussian noise. This type of assumption regarding the measurement model leads to a considerable simplification in the lower bound derivations. For the problem of acoustic source localisation considered in this thesis, the complex process of reverberation cannot be simply modelled in the same way. As will be shown in the following sections, the observations in the case of ASL typically result from a mixture model comprising a potentially non-white noise process. This creates a serious challenge for a PCRB analysis of the ASL problem, and to the best of the author’s knowledge, the consideration of a practical problem involving a degree of nonlinearity similar to that
assumed in the present thesis has not been the object of significant research as of this day.
Besides the general purpose of determining a theoretical performance limit for ASL and comparing previously developed algorithms to it, the developments presented in this chapter were also motivated by the following speculation. It is known (see e.g. [26]) that sound pressure values recorded at two different locations in a random acoustic field present a certain level of correlation that is a function of the distance between the two measuring points. Likewise, it can be expected that the sound intensity values measured at the same location but at two different times also show some correlation that depends on the displacement of the sound source producing the acoustic field (this reasoning will be elaborated in Section 5.5). A trivial example leading to this idea is that of a stationary sound source producing a random acoustic field in a reverberant enclosure. When measuring the acoustic intensity spatially, e.g. by means of a beamformer scanning the acoustic field, spurious peaks will appear at various locations (differing from the sound source position) due to the reverberation process. Provided the structure of the reverberant environment remains unchanged, different intensity measurements taken at different times will remain very similar, if not identical. However, if the characteristics of the environment vary in between the intensity measurements (e.g. if the source position changes substantially or if large pieces of furniture are moved inside the room), the complex reverberation process will lead to very different measured intensity “images”. In the context of acoustic source tracking, this means that in this second case, a spurious peak is less likely to have the exact same position and amplitude from one measurement to the next. It is hence less likely to be mistaken for the true source position.
In the first case (unchanged acoustic field), the spurious peaks will all remain unchanged and if one of them happens to be larger than the peak generated by the true source, picking the largest peak will lead to a 100% tracking error over time. The goal of this section is hence also to investigate this correlation between the measured sound intensity values in more detail, and to quantify its effects on the theoretically achievable tracking performance. According to the above mentioned reasoning, it is expected that a larger degree of correlation between measurements
(typically when the source is moving slowly) will correspond to a decreased tracking performance.
The developments made in this chapter will proceed as follows. First, a review of the theory on PCRB is given, highlighting the existing concepts and principles of the lower bound computations. The PCRB is then determined for the specific problem of ASL by making use of a simplified observation model. A more complex model, taking the correlation effects into account, is then proposed and the PCRB developments are updated accordingly. Finally, the tracking performance of some of the particle filter algorithms described in previous chapters is assessed with respect to the computed PCRB.
5.2
Review of PCRB Theory
This section reviews the basics of the PCRB theory for discrete-time nonlinear filtering. It mainly follows the concepts of [111], but similar developments can also be found in [11, 97].
5.2.1
PCRB Recursion
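Let $\mathbf{X}$ be a system state vector comprising $d$ parameters that one would like to estimate:
\[ \mathbf{X} = \big[\theta^{(1)} \;\cdots\; \theta^{(d)}\big]^T , \]
and let $\mathbf{Y}$ denote a vector of measurement data (observation). The aim is to derive a lower bound on the mean squared error in estimating parameter $\mathbf{X}$:
\[ \mathrm{MSE}_{\mathbf{X}} = E\Big\{ \big[\hat{\mathbf{X}}(\mathbf{Y}) - \mathbf{X}\big] \big[\hat{\mathbf{X}}(\mathbf{Y}) - \mathbf{X}\big]^T \Big\} , \tag{5.1} \]
where $\hat{\mathbf{X}}(\mathbf{Y})$ represents an estimate of the state variable $\mathbf{X}$ based on the measurement data $\mathbf{Y}$. It can be seen from Eq. (5.1) that the variable $\mathrm{MSE}_{\mathbf{X}}$ is defined as the covariance matrix of the state vector $\mathbf{X}$. The following formula determines the lower bound on the estimation error for parameter $\mathbf{X}$:
\[ \mathrm{MSE}_{\mathbf{X}} \geq \mathbf{J}_{\mathbf{X}}^{-1} \triangleq \mathrm{PCRB}_{\mathbf{X}} , \tag{5.2} \]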
where $\mathbf{J}_{\mathbf{X}}$ is the $d \times d$ Fisher information matrix, the elements of which are defined as [115]:
\[ [\mathbf{J}_{\mathbf{X}}]_{(i,j)} = E\left\{ -\frac{\partial^2 \log p(\mathbf{X}, \mathbf{Y})}{\partial\theta^{(i)}\, \partial\theta^{(j)}} \right\} , \tag{5.3} \]
with $p(\mathbf{X}, \mathbf{Y})$ the joint probability of the state and observation parameters. Here, $[\,\cdot\,]_{(i,j)}$ represents the $(i,j)$-th element of the corresponding matrix. The inequality in Eq. (5.2) means that the difference $\mathrm{MSE}_{\mathbf{X}} - \mathbf{J}_{\mathbf{X}}^{-1}$ is a positive semidefinite matrix. Since $\mathrm{MSE}_{\mathbf{X}}$ actually corresponds to the covariance matrix of $\mathbf{X}$, the PCRB on the estimation error for each parameter $\theta^{(\cdot)}$ in the $\mathbf{X}$ vector is hence simply given by the diagonal elements of the inverse of the Fisher information matrix:
\[ \mathrm{MSE}_{\theta^{(i)}} \geq \big[\mathbf{J}_{\mathbf{X}}^{-1}\big]_{(i,i)} \triangleq \mathrm{PCRB}_{\theta^{(i)}} . \tag{5.4} \]
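Using the conventional notation for the first-order and second-order partial derivative operators:
\[ \nabla_{\mathbf{X}} = \left[ \frac{\partial}{\partial\theta^{(1)}} \;\cdots\; \frac{\partial}{\partial\theta^{(d)}} \right]^T , \qquad \Delta_{\mathbf{Z}}^{\mathbf{X}} = \nabla_{\mathbf{Z}} \nabla_{\mathbf{X}}^T , \]
with arbitrary vector variables $\mathbf{X}$ and $\mathbf{Z}$, Eq. (5.3) can be summarised as:
\[ \mathbf{J}_{\mathbf{X}} = E\big\{ -\Delta_{\mathbf{X}}^{\mathbf{X}} \log p(\mathbf{X}, \mathbf{Y}) \big\} . \tag{5.5} \]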
The above developments are valid provided the derivatives and expectations in Eqs. (5.1) and (5.3) exist, and subject to some additional constraints on the estimation bias [111, 115].
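Now let $k$ represent the discrete time variable, $k = 1, 2, \ldots$, and consider the following filtering problem:
\[ \mathbf{X}_k = g(\mathbf{X}_{k-1}, u_k) , \tag{5.6a} \]
\[ \mathbf{Y}_k = h(\mathbf{X}_k, v_k) , \tag{5.6b} \]
where $g(\cdot)$ and $h(\cdot)$ are the (potentially nonlinear) transition and observation functions respectively, and with $u_k$ and $v_k$ two independent (not necessarily Gaussian) noise processes. With $\mathbf{X}_{0:k}$ denoting the concatenation of the first $k+1$ state vectors, i.e. $\mathbf{X}_{0:k} = \{\mathbf{X}_0, \mathbf{X}_1, \ldots, \mathbf{X}_k\}$, it can be demonstrated that the $(k+1)d \times (k+1)d$ Fisher information matrix $\mathbf{J}_{\mathbf{X}_{0:k}}$ for parameter $\mathbf{X}_{0:k}$ can be decomposed into four submatrices as follows:
\[ \mathbf{J}_{\mathbf{X}_{0:k}} = \begin{bmatrix} [\mathbf{J}_{\mathbf{X}_{0:k}}]_{(1,1)} & [\mathbf{J}_{\mathbf{X}_{0:k}}]_{(1,2)} \\ [\mathbf{J}_{\mathbf{X}_{0:k}}]_{(2,1)} & [\mathbf{J}_{\mathbf{X}_{0:k}}]_{(2,2)} \end{bmatrix} = \begin{bmatrix} E\big\{ -\Delta_{\mathbf{X}_{0:k-1}}^{\mathbf{X}_{0:k-1}} \log p(\mathbf{X}, \mathbf{Y}) \big\} & E\big\{ -\Delta_{\mathbf{X}_{0:k-1}}^{\mathbf{X}_k} \log p(\mathbf{X}, \mathbf{Y}) \big\} \\ E\big\{ -\Delta_{\mathbf{X}_k}^{\mathbf{X}_{0:k-1}} \log p(\mathbf{X}, \mathbf{Y}) \big\} & E\big\{ -\Delta_{\mathbf{X}_k}^{\mathbf{X}_k} \log p(\mathbf{X}, \mathbf{Y}) \big\} \end{bmatrix} . \tag{5.7} \]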
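The size of the matrix given by Eq. (5.7) however increases over time. Also, only the estimation error on parameter $\mathbf{X}_k$ is of interest in practice. Hence, the following type of bound is sought after:
\[ \mathrm{MSE}_{\mathbf{X}_k} = E\Big\{ \big[\hat{\mathbf{X}}_k - \mathbf{X}_k\big] \big[\hat{\mathbf{X}}_k - \mathbf{X}_k\big]^T \Big\} \geq \mathbf{J}_{\mathbf{X}_k}^{-1} , \]
where $\mathbf{J}_{\mathbf{X}_k}$ now represents the $d \times d$ Fisher information submatrix for parameter $\mathbf{X}_k$. For brevity and without loss of generality, the matrix $\mathbf{J}_{\mathbf{X}_k}$ will be denoted by $\mathbf{J}_k$ in the following developments.
The PCRB on parameter $\mathbf{X}_k$ is determined by the lower-right block of the inverted Fisher information matrix $\mathbf{J}_{\mathbf{X}_{0:k}}$, i.e. in mathematical terms:
\[ \mathbf{J}_k = [\mathbf{J}_{\mathbf{X}_{0:k}}]_{(2,2)} - [\mathbf{J}_{\mathbf{X}_{0:k}}]_{(2,1)} \, [\mathbf{J}_{\mathbf{X}_{0:k}}]_{(1,1)}^{-1} \, [\mathbf{J}_{\mathbf{X}_{0:k}}]_{(1,2)} . \tag{5.8} \]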
Eq. (5.8) represents the general formula to compute the PCRB for the parameter of interest. It however involves a matrix inversion operation and is hence computationally demanding. In [111], Tichavský et al. propose an efficient way of computing $\mathbf{J}_k$ recursively that avoids the manipulation of large matrices. This recursion is given by the following equation:
\[ \mathbf{J}_k = \mathbf{D}_{k-1}^{(4)} - \mathbf{D}_{k-1}^{(3)} \big[ \mathbf{J}_{k-1} + \mathbf{D}_{k-1}^{(1)} \big]^{-1} \mathbf{D}_{k-1}^{(2)} , \tag{5.9} \]
with the definitions:
\[ \mathbf{D}_{k-1}^{(1)} = E\big\{ -\Delta_{\mathbf{X}_{k-1}}^{\mathbf{X}_{k-1}} \log p(\mathbf{X}_k | \mathbf{X}_{k-1}) \big\} , \tag{5.10a} \]
\[ \mathbf{D}_{k-1}^{(2)} = E\big\{ -\Delta_{\mathbf{X}_{k-1}}^{\mathbf{X}_k} \log p(\mathbf{X}_k | \mathbf{X}_{k-1}) \big\} , \tag{5.10b} \]
\[ \mathbf{D}_{k-1}^{(3)} = E\big\{ -\Delta_{\mathbf{X}_k}^{\mathbf{X}_{k-1}} \log p(\mathbf{X}_k | \mathbf{X}_{k-1}) \big\} , \tag{5.10c} \]
\[ \mathbf{D}_{k-1}^{(4)} = E\big\{ -\Delta_{\mathbf{X}_k}^{\mathbf{X}_k} \log p(\mathbf{X}_k | \mathbf{X}_{k-1}) \big\} + E\big\{ -\Delta_{\mathbf{X}_k}^{\mathbf{X}_k} \log p(\mathbf{Y}_k | \mathbf{X}_k) \big\} , \tag{5.10d} \]
where $p(\mathbf{X}_k | \mathbf{X}_{k-1})$ and $p(\mathbf{Y}_k | \mathbf{X}_k)$ are the transition and observation PDFs respectively, which follow directly from Eq. (5.6). The reader is referred to [111] for a formal proof of Eqs. (5.9) and (5.10). Assuming that the transition and observation equations $g(\cdot)$ and $h(\cdot)$ are time-invariant, it can be seen that the various matrices defined by Eq. (5.10) also remain constant over time. In such a case, the recursion for $\mathbf{J}_k$ can be shown to converge to a finite matrix $\mathbf{J}_\infty$ as $k \to \infty$ [111]. This steady-state solution determines the PCRB value of interest and it can be obtained (analytically or numerically) by solving Eq. (5.9) with $\mathbf{J}_k = \mathbf{J}_{k-1} \triangleq \mathbf{J}_\infty$.
It can be seen from the developments of this section that the resulting PCRB (i.e. the minimum achievable estimation error) is only influenced by two factors. It first depends on the type and level of noise introduced into the system by the variables $u_k$ and $v_k$ in Eq. (5.6). The error bound is also influenced by the amount of information regarding $\mathbf{X}_k$ provided by the observation variable $\mathbf{Y}_k$, which is ultimately also determined by Eq. (5.6). It is hence not surprising that the lower bound on estimation error only depends on the fundamental properties of the given filtering problem. However, the PCRB results obtained in Sections 5.4 and 5.5 are only as relevant in practice as the assumed theoretical model is close to the practical system it attempts to reproduce. This general caveat hence needs to be kept in mind throughout the following sections where the various system and observation equations are derived.
5.2.2
Generic PCRB Computation Procedure
Based on the concepts put forward so far, a detailed procedure can now be elaborated according to which the PCRB can be computed. It consists of the following main steps:
1. Formulate the filtering problem in terms of the transition and observation equations, as given in Eq. (5.6).
2. Derive the transition and observation PDFs $p(\mathbf{X}_k | \mathbf{X}_{k-1})$ and $p(\mathbf{Y}_k | \mathbf{X}_k)$.
3. Compute the Fisher information submatrix $\mathbf{J}_k$ according to the recursion defined by Eqs. (5.9) and (5.10).
4. As $k \to \infty$ (i.e. for large values of $k$), determine the steady-state solution $\mathbf{J}_\infty$.
5. The PCRB for the variable $\mathbf{X}_k$ is then defined as $\mathrm{MSE}_{\mathbf{X}_k} \geq \mathbf{J}_\infty^{-1}$, in accordance with Eq. (5.2).
The expectations involved in Eq. (5.10) for step 3 can be practically computed either analytically or as a Monte Carlo average over a series of realisations of the different variables inside the expectation operator. This process will be explained in more detail in the following. The main remaining task with respect to the above procedure is to determine the exact transition and observation equations and PDFs, i.e. steps 1 and 2, for the ASL problem definition. This is the subject of the next three sections.
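Steps 3 to 5 of the above procedure can be illustrated with a short numerical sketch. The code below is only an illustration under stated assumptions: the function names `pcrb_recursion` and `pcrb_from_information` are hypothetical, the matrices $\mathbf{D}^{(1)}$ to $\mathbf{D}^{(4)}$ are assumed time-invariant, and the scalar linear-Gaussian example with its numerical values ($a$, $q$, $r$) is not taken from this work. For such a linear-Gaussian problem, the resulting bound should coincide with the steady-state error variance of a Kalman filter, which provides a convenient sanity check.

```python
import numpy as np


def pcrb_recursion(J0, D1, D2, D3, D4, n_steps=1000, tol=1e-12):
    """Iterate Eq. (5.9), J_k = D4 - D3 (J_{k-1} + D1)^{-1} D2, for
    time-invariant D1..D4 and return the last iterate (approximately J_inf)."""
    J = np.asarray(J0, dtype=float)
    for _ in range(n_steps):
        # Solve a linear system instead of forming the inverse explicitly.
        J_next = D4 - D3 @ np.linalg.solve(J + D1, D2)
        if np.max(np.abs(J_next - J)) <= tol * max(1.0, np.max(np.abs(J))):
            return J_next
        J = J_next
    return J


def pcrb_from_information(J):
    """Diagonal of J^{-1}, i.e. the per-parameter bounds of Eq. (5.4)."""
    return np.diag(np.linalg.inv(J))


# Hypothetical scalar example (d = 1): x_k = a x_{k-1} + u_k, y_k = x_k + v_k,
# with Var(u_k) = q and Var(v_k) = r. All numbers are illustrative only.
a, q, r = 0.9, 0.5, 2.0
D1 = np.array([[a * a / q]])
D2 = np.array([[-a / q]])
D3 = D2.T
D4 = np.array([[1.0 / q + 1.0 / r]])

J_inf = pcrb_recursion(np.array([[1.0]]), D1, D2, D3, D4)
pcrb = pcrb_from_information(J_inf)
```

The stopping rule on the change between successive iterates is one possible way of detecting the steady state of step 4; solving Eq. (5.9) directly with $\mathbf{J}_k = \mathbf{J}_{k-1}$ would be an alternative.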
5.3
System Definition
Here, the variables and specific assumptions used for the PCRB computations are defined for the particular context of acoustic source tracking. On the basis of these definitions, the PCRB matrices of Eq. (5.10) involving the system transition PDF are derived. The developments however only cover the state variable and the system equation, whereas the definition of variables and concepts related to the measurement process (observations) is dealt with in detail in Sections 5.4 and 5.5.
5.3.1
Basic Parameters
The type of ASL scenario considered in this chapter is the same as that used in previous chapters in which the position of an acoustic source is to be estimated. For simplicity of the following theoretical developments, and due to the fact that the problem definition is identical in both coordinates $x$ and $y$, the problem will be considered in one single dimension only. Hence, the state variable $\mathbf{X}_k$ that will be assumed throughout the rest of this chapter is defined as:
\[ \mathbf{X}_k = [x_k \;\; \dot{x}_k]^T . \]
This situation can be viewed for instance as a case where the sound source is moving along a straight line in a reverberant room. Only the position of the target along this "virtual" line is required, and the observation variable $\mathbf{Y}_k$ also results from some measurement recorded along this one-dimensional path. Since only the PCRB on the target position is of practical interest (the velocity component of the state vector is added only as a requirement from the considered system model), only a lower bound on the location estimate $x_k$ is to be computed, as defined by Eq. (5.4):
\[ \mathrm{MSE}_{x_k} \geq \big[\mathbf{J}_{\mathbf{X}_k}^{-1}\big]_{(1,1)} \triangleq \mathrm{PCRB}_{x_k} . \]
It must be noted that the exact same developments are valid for any coordinate in case the problem is to be analysed in two or three dimensions. This is due to the previously made assumption of independent and identical target motion in any direction (see Section 2.4). If necessary, a single "total" PCRB value can also be defined, e.g. as follows for 3D problems:
\[ \mathrm{PCRB}_{\mathbf{X}_k} \triangleq \mathrm{PCRB}_{x_k} + \mathrm{PCRB}_{y_k} + \mathrm{PCRB}_{z_k} , \]
which basically corresponds to the trace of the inverted Fisher information matrix $\mathbf{J}_{\mathbf{X}_k}$ (with the unnecessary target velocity information furthermore discarded).
5.3.2
System Equation
The transition equation used in the sequel to model the target dynamics is:
\[ \mathbf{X}_k = g(\mathbf{X}_{k-1}) + u_k = \underbrace{\begin{bmatrix} 1 & T_u \\ 0 & 1 \end{bmatrix}}_{\mathbf{G}} \mathbf{X}_{k-1} + u_k , \tag{5.11} \]
with zero-mean Gaussian noise process $u_k \sim \mathcal{N}\big([0 \;\, 0]^T, \mathbf{Q}\big)$. The system noise covariance matrix $\mathbf{Q}$ is defined as:
\[ \mathbf{Q} = \eta \begin{bmatrix} T_u^2 & 0 \\ 0 & 1 \end{bmatrix} . \tag{5.12} \]
The type of model defined by Eqs. (5.11) and (5.12) is commonly used in the target-tracking literature (see e.g. [52, 95, 96, 99]) and allows the amount of process noise in the target dynamics to be controlled through a single parameter $\eta$. Note the similarity between the model defined in Eq. (5.11) and the transition equation previously given in Section 2.4. Following the same developments as in Section 4.4.2, the transition PDF can be simply computed as:
\[ p(\mathbf{X}_k | \mathbf{X}_{k-1}) = p_{u_k}\big(\mathbf{X}_k - g(\mathbf{X}_{k-1})\big) = \mathcal{N}(\mathbf{X}_k; \mathbf{G}\mathbf{X}_{k-1}, \mathbf{Q}) . \tag{5.13} \]
As mentioned earlier, the notation $p_{u_k}(\xi)$ is used for the PDF of the system noise evaluated at $\xi$, and $\mathcal{N}(\xi; \mu, \Sigma)$ corresponds to the density of a multi-dimensional normally distributed random variable with mean vector $\mu$ and covariance matrix $\Sigma$ evaluated at $\xi$.
5.3.3
System-related PCRB Computations
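All matrices involved in the PCRB recursion of Eq. (5.9) and defined in Eq. (5.10) involve the transition PDF $p(\mathbf{X}_k | \mathbf{X}_{k-1})$. Given the result of Eq. (5.13), these matrices can now be explicitly computed. Mathematically expanding the normal distribution function leads to:
\[ p(\mathbf{X}_k | \mathbf{X}_{k-1}) = \frac{1}{2\pi\sqrt{|\mathbf{Q}|}} \exp\left\{ -\frac{1}{2} (\mathbf{X}_k - \mathbf{G}\mathbf{X}_{k-1})^T \mathbf{Q}^{-1} (\mathbf{X}_k - \mathbf{G}\mathbf{X}_{k-1}) \right\} , \]
where $|\mathbf{Q}| = \det(\mathbf{Q})$. The generic term $\log p(\mathbf{X}_k | \mathbf{X}_{k-1})$ in Eq. (5.10) then becomes:
\[ \log p(\mathbf{X}_k | \mathbf{X}_{k-1}) = -\log\big( 2\pi\sqrt{|\mathbf{Q}|} \big) - \frac{1}{2} (\mathbf{X}_k - \mathbf{G}\mathbf{X}_{k-1})^T \mathbf{Q}^{-1} (\mathbf{X}_k - \mathbf{G}\mathbf{X}_{k-1}) . \tag{5.14} \]
The first term in the right-hand side of Eq. (5.14) is a constant and therefore disappears upon differentiation. Using standard algebraic developments, the negated second-order partials of Eq. (5.14) with respect to variable $\mathbf{X}_k$, which is involved in Eq. (5.10d), can then be shown to be:
\[ -\Delta_{\mathbf{X}_k}^{\mathbf{X}_k} \log p(\mathbf{X}_k | \mathbf{X}_{k-1}) = \frac{1}{2} \begin{bmatrix} \dfrac{\partial^2}{\partial x_k^2} & \dfrac{\partial}{\partial x_k}\dfrac{\partial}{\partial \dot{x}_k} \\[2mm] \dfrac{\partial}{\partial \dot{x}_k}\dfrac{\partial}{\partial x_k} & \dfrac{\partial^2}{\partial \dot{x}_k^2} \end{bmatrix} (\mathbf{X}_k - \mathbf{G}\mathbf{X}_{k-1})^T \mathbf{Q}^{-1} (\mathbf{X}_k - \mathbf{G}\mathbf{X}_{k-1}) = \mathbf{Q}^{-1} . \tag{5.15} \]
Since the matrix $\mathbf{Q}$ is a constant, taking the expectation of Eq. (5.15) results in the same value and the first term of matrix $\mathbf{D}_{k-1}^{(4)}$ is hence given as:
\[ E\big\{ -\Delta_{\mathbf{X}_k}^{\mathbf{X}_k} \log p(\mathbf{X}_k | \mathbf{X}_{k-1}) \big\} = \mathbf{Q}^{-1} . \tag{5.16a} \]
Similar developments lead to the following results:
\[ \mathbf{D}_{k-1}^{(1)} = \mathbf{G}^T \mathbf{Q}^{-1} \mathbf{G} , \tag{5.16b} \]
\[ \mathbf{D}_{k-1}^{(2)} = -\mathbf{G}^T \mathbf{Q}^{-1} , \tag{5.16c} \]
\[ \mathbf{D}_{k-1}^{(3)} = -\mathbf{Q}^{-1} \mathbf{G} . \tag{5.16d} \]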
The results of Eq. (5.16) define most of the variables required for a computation of the Fisher information submatrix Jk given in Eq. (5.9). The last term to be computed, i.e. the second term in Eq. (5.10d), is based on the observation density p(Y k |X k ), which in turn requires the formulation of an observation model.
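As a brief illustration of Eqs. (5.11), (5.12) and (5.16), the system-related matrices can be assembled numerically as sketched below. The values chosen for the update period `T_u` and the noise parameter `eta` are placeholders introduced for this example only and do not correspond to settings used elsewhere in this work.

```python
import numpy as np

# Illustrative values only (assumptions, not taken from the text).
T_u = 0.05   # time step T_u in seconds
eta = 0.1    # process noise parameter eta

G = np.array([[1.0, T_u],
              [0.0, 1.0]])                  # transition matrix G of Eq. (5.11)
Q = eta * np.array([[T_u**2, 0.0],
                    [0.0,    1.0]])         # noise covariance Q of Eq. (5.12)
Q_inv = np.linalg.inv(Q)

D1 = G.T @ Q_inv @ G     # Eq. (5.16b)
D2 = -G.T @ Q_inv        # Eq. (5.16c)
D3 = -Q_inv @ G          # Eq. (5.16d)
D4_system = Q_inv        # Eq. (5.16a): system-related part of D4 only
```

Note that `D3` is the transpose of `D2` and that `D4_system` still lacks the observation-related term of Eq. (5.10d), which depends on the observation model.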
5.4
Simple Observation Model
A simplified observation model is now developed and then used as an example in the PCRB computations. This model is based on some basic principles of statistical room acoustics, but does not include the effects of the temporal correlation between sound intensity measurements. This specific case is treated specifically in Section 5.5. The purpose of the observation equation in the filtering problem of Eq. (5.6) is of course to model the practical measurement process as accurately as possible. If this formula is an exact representation of what happens in the real world, then the overall model can be used to efficiently simulate the corresponding system, thus avoiding the need for “physical” experiments. Within the framework of acoustic source localisation, several processes have an influence on the measurement value ultimately used by the tracking algorithm. These mainly include acoustic reverberation and processing the sensor signals with a specific localisation method, based on either steered beamforming (SBF) or time-delay estimation (TDE). In the following, the principle of steered beamforming will be used as example to derive the PCRB for ASL. In Chapter 3, this method was shown to achieve a
better tracking performance compared to TDE-based approaches. This is also the theoretical conclusion of Appendix A. Consequently, the SBF principle is used for the purpose of analysing how well the preferred PF method performs compared to the PCRB. Also, using this approach avoids having to model the complex two-stage process of TDE-based methods (i.e. first compute a series of time delay estimates, then minimise some least-square criterion), which reduces the complexity of the following derivations. In the sequel, the source signal will be assumed to be bandlimited white Gaussian noise in the range f ∈ [300, 3000Hz].
5.4.1
Model Derivation
First, it must be noted that an analytical observation equation has not been derived so far in this work. The particle filters in Chapter 3 and Chapter 4 made use of a likelihood function derived directly from practical realisations of the localisation functions. Here, a measurement equation is first defined from which the measurement PDF $p(\mathbf{Y}_k | \mathbf{X}_k)$ then follows. With the assumption that a steered beamformer is implemented to measure the acoustic intensity $\mathcal{P}(x)$ at any focus location $x$, the measurement $\mathbf{Y}_k$ derived from maximising the SBF output hence corresponds to a direct observation of a potential source position in the state space. Since the problem currently considered is to be analysed from a 1D point of view only, $\mathbf{Y}_k$ is hence simply a scalar corresponding to a certain position along the $x$-axis. Consequently, the symbol $Y_k$ will be used from now on instead of the bold version $\mathbf{Y}_k$ to reflect this. The measurement process implicitly used with SBF hence becomes:
\[ Y = \arg\max_{x} \{ \mathcal{P}(x) \} , \]
and the aim of this section is to model this principle with an equation of the form Yk = h(X k , vk ). Note that this corresponds to a situation where one single peak in the SBF output function is picked, whereas particle filters presented in previous chapters work with several such peaks as observation input. This is however only due to the fact that particle filtering methods are able to efficiently deal with multi-modal observations (unlike other tracking algorithms). This characteristic is hence only a special property of the tracking algorithms and it is not part of the
general filtering problem definition. If the PCRB computations are to be derived with the assumption of multi-modal observations, the problem of data association must be considered first. This topic is however beyond the scope of this chapter.
Assuming a stationary source signal, the main idea behind the derivation of a simple observation model can be explained as follows. The outcome of the SBF measurement can be basically one of the following two trivial events (hypotheses):
\[ H_1 \triangleq \{\text{Target position correctly detected}\} , \]
\[ H_0 \triangleq \{\text{Target position incorrectly detected (false detection)}\} , \]
with respective probabilities $P_{H_1} = \Pr[H_1]$ and $P_{H_0} = \Pr[H_0] = 1 - P_{H_1}$. In environments with high SNR levels and high direct-to-reverberant ratios (DRR), the SBF output can be expected to provide nearly perfect measurements of the source position, yielding $P_{H_1} \to 1$. As the reverberation level increases in the enclosure, the statistical probability of false detections $P_{H_0}$ will increase accordingly.
In a diffuse sound field comprising many different frequency components, the energy density is assumed uniform throughout the considered enclosure [121]. This consequently leads to the natural assumption that in the case of a false detection, the resulting SBF measurement is a random location uniformly distributed along the state space. The observation equation then easily follows from this simple analysis as:
\[ Y_k = \begin{cases} [1 \;\, 0] \cdot \mathbf{X}_k + v_k & \text{with probability } P_{H_1} , \\ \varepsilon_k & \text{with probability } (1 - P_{H_1}) , \end{cases} \tag{5.17} \]
where $\varepsilon_k$ is a random variable with uniform distribution in the state space, i.e. along the $x$-axis between the state-space limits $x_l$ and $x_u$: $\varepsilon_k \sim \mathcal{U}(x_l, x_u)$. The scalar noise process $v_k$ accounts for slight errors in the source position estimates for hypothesis $H_1$, and is assumed to be a zero-mean Gaussian random variable: $v_k \sim \mathcal{N}(0, \sigma_{\mathrm{obs}}^2)$.
The standard deviation $\sigma_{\mathrm{obs}}$ of this observation noise process can be roughly related to the width of the main lobe in the SBF beampattern. Figure 5.1 shows two cross-sections, along both the $x$ and $y$-axis, of the frequency-averaged beampattern resulting for the delay-and-sum beamformer defined previously in Section 2.5. The frequency averaging was carried out between $f_l = 300$ Hz and $f_u = 3$ kHz. According to the example of Figure 5.1, the average width of the main lobe is measured to be about 0.18 m, and the numerical value of $\sigma_{\mathrm{obs}}$ is hence set to 0.1 m for the following computations.
Figure 5.1: Frequency-averaged beamformer response (in dB) for $f \in [300, 3000]$ Hz: cross-sections in $x$ and $y$ directions.
Given the observation model of Eq. (5.17) and using the vector notation $\mathbf{h}^T = [1 \;\, 0]$, the observation density now trivially follows as a mixture model:
\[ p(Y_k | \mathbf{X}_k) = P_{H_1} \cdot p_{v_k}(Y_k - \mathbf{h}^T \mathbf{X}_k) + P_{H_0} \cdot p_{\varepsilon_k}(Y_k) = P_{H_1} \cdot \mathcal{N}\big(Y_k; \mathbf{h}^T \mathbf{X}_k, \sigma_{\mathrm{obs}}^2\big) + (1 - P_{H_1}) \cdot \mathcal{U}(Y_k; x_l, x_u) . \tag{5.18} \]
The last parameter that remains to be defined more precisely is the probability $P_{H_1}$ of the "correct position estimate" hypothesis (correct detection).
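The observation model of Eqs. (5.17) and (5.18) can be sketched in a few lines of code, which may help in visualising how the two hypotheses combine. The function names and argument conventions below are hypothetical and introduced for this illustration only; the position `x_k` stands for the first component of the state vector.

```python
import math
import random


def sample_observation(x_k, p_h1, sigma_obs, x_l, x_u, rng=random):
    """Draw one measurement Y_k according to Eq. (5.17)."""
    if rng.random() < p_h1:
        # Hypothesis H1: correct detection, perturbed by Gaussian noise v_k.
        return x_k + rng.gauss(0.0, sigma_obs)
    # Hypothesis H0: false detection, uniform over the state space.
    return rng.uniform(x_l, x_u)


def observation_pdf(y_k, x_k, p_h1, sigma_obs, x_l, x_u):
    """Evaluate the mixture density p(Y_k | X_k) of Eq. (5.18)."""
    gauss = (math.exp(-0.5 * ((y_k - x_k) / sigma_obs) ** 2)
             / (sigma_obs * math.sqrt(2.0 * math.pi)))
    unif = 1.0 / (x_u - x_l) if x_l <= y_k <= x_u else 0.0
    return p_h1 * gauss + (1.0 - p_h1) * unif
```

A call such as `sample_observation(1.0, 0.7, 0.1, 0.0, 3.0)` would, under these assumed settings, return a value close to 1.0 m in roughly 70% of the cases and an arbitrary position between 0 and 3 m otherwise.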
5.4.2
Computation of PH1
The probability $P_{H_1}$ of detecting the correct source location can be computed using principles of statistical room acoustics [10, 69] and will typically depend on various structural parameters such as reverberation time and enclosure volume: intuitively, a small level of reverberation should result in a large $P_{H_1}$ value. The following developments are hence subject to the usual assumptions inherent to a random-field analysis [71, 94, 104]. Amongst other conditions of minor importance, this mainly implies that the considered frequency range must be above the so-called Schroeder large room frequency, i.e. $f > 2000\sqrt{T_{60}/V}$, with $V$ corresponding to the room volume (see footnote a).
The SBF output $\mathcal{P}(x)$ is effectively a measurement of the sound intensity (squared sound pressure) across the state space. At a distance sufficiently far away from the source location, the direct path component of the acoustic field drops to a negligible level and only the diffuse field part remains. It follows that the value of $P_{H_0} = (1 - P_{H_1})$ can be seen basically as the probability of at least one location of the state space registering a diffuse-field intensity value larger than the SBF output at the source location $x_s$, yielding:
\[ P_{H_1} = 1 - \Pr\Big[ \max_x \tilde{I}_r(x) > \mathcal{P}(x_s) \Big] = \Pr\Big[ \max_x \tilde{I}_r(x) \leq \mathcal{P}(x_s) \Big] , \tag{5.19} \]
where $\tilde{I}_r(x)$ is the (unnormalised) sound intensity due to the reverberant field only (see footnote b). $\mathcal{P}(x_s)$ corresponds to the output of a beamformer steered to the source location. This value can be computed using the general SBF definition equation (see e.g. Eq. (2.6) in Section 2.2.2) with the signals received at the microphones assumed to be attenuated and delayed replicas of the source signal (narrow-band noise). In a diffuse sound field consisting of $N_t$ distinct and statistically independent tone frequencies, the distribution $p_{I_r}(\cdot)$ of the normalised reverberant intensity values $I_r$ (viewed as a stochastic process) is given by the well-known density function $\gamma(\xi; N_t, N_t^{-1})$ with unit mean value and variance $N_t^{-1}$ [71, 121]:
\[ p_{I_r}(\xi) = \gamma(\xi; N_t, N_t^{-1}) = \frac{N_t^{N_t}\, \xi^{N_t - 1}}{(N_t - 1)!} \exp(-N_t \xi) . \tag{5.20} \]
The normalisation of the sound intensity $\tilde{I}_r$ is with respect to the spatially-averaged value $\bar{I}_r$ of sound intensity across the state space: $I_r = \tilde{I}_r / \bar{I}_r$. The theoretical value of $\bar{I}_r$ can be easily derived as follows. Under the steady-state assumption of a diffuse sound field, the average intensity can be expressed as [10, 69]:
\[ \bar{I}_r = \frac{W_s}{A\pi} , \]
where $W_s$ is the acoustic power of the source and $A$ is the so-called equivalent absorption area of the enclosure. Combining the above equation with Sabine's well-known reverberation time formula (involving room volume $V$):
\[ T_{60} = \frac{0.163\, V}{A} , \]
finally yields:
\[ \bar{I}_r = \frac{T_{60}\, W_s}{\pi\, 0.163\, V} . \tag{5.21} \]
Footnote a: For the worst case scenario of a small and very reverberant room, e.g. $V = 36$ m³ and $T_{60} = 0.8$ s, this specific frequency limit approximately results as $f > 300$ Hz, which is the assumption made in this work.
Footnote b: Note that these developments should split the state space into "near-target" and "far-from-target" regions, and subsequently consider the diffuse-field intensity values $\tilde{I}_r(x)$ only in the far-from-target part. For simplicity and because it does not really influence the PCRB computations, this distinction will however not be considered.
In the case of a diffuse field generated by band-limited noise, which is of interest in this section, the intensity distribution function of Eq. (5.20) remains basically valid. The difference is that $N_t$ now represents the equivalent number of pure tones that would be required to produce a sound intensity variance equal to that obtained with the narrow-band noise excitation. Reference [71] gives a couple of equations to determine $N_t$ in that case, including the approximate formula:
\[ N_t \approx \left\lceil 1 + \frac{B\, T_{60}}{6.9} \right\rceil , \tag{5.22} \]
with the noise bandwidth $B$, and with $\lceil \cdot \rceil$ used to denote the "ceiling" function, i.e. rounding towards $+\infty$. The above equation for $N_t$ provides a good approximation in practice for $B T_{60}$ products larger than about 20.
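The room-acoustic quantities introduced so far lend themselves to a quick numerical check. The sketch below evaluates the Schroeder frequency limit $2000\sqrt{T_{60}/V}$, the average intensity of Eq. (5.21) and the tone count of Eq. (5.22); the function names are hypothetical, and the example reuses the worst-case room of footnote a together with the 300 Hz to 3 kHz band, so that $B = 2700$ Hz is an assumption made for this illustration.

```python
import math


def schroeder_frequency(t60, volume):
    """Large-room (Schroeder) frequency limit in Hz: 2000 * sqrt(T60 / V)."""
    return 2000.0 * math.sqrt(t60 / volume)


def mean_reverberant_intensity(t60, volume, source_power):
    """Spatially averaged diffuse-field intensity, Eq. (5.21)."""
    return t60 * source_power / (math.pi * 0.163 * volume)


def equivalent_tone_count(bandwidth, t60):
    """Equivalent number of pure tones N_t, Eq. (5.22)."""
    return math.ceil(1.0 + bandwidth * t60 / 6.9)


# Worst-case room of footnote a: V = 36 m^3, T60 = 0.8 s.
f_s = schroeder_frequency(0.8, 36.0)              # just below 300 Hz
n_t = equivalent_tone_count(3000.0 - 300.0, 0.8)  # B * T60 = 2160, well above 20
```

With these values the product $B T_{60}$ comfortably exceeds the threshold of about 20 quoted above, so that Eq. (5.22) can be expected to apply.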
Another well-known property of random sound fields is the spatial correlation of sound pressure measurements taken at two different locations $x_1$ and $x_2$ in the state space. With $r$ defined as the distance between the two points, $r = |x_1 - x_2|$, the correlation coefficient $\rho_p(r)$ between the sound pressure values $p_1(t)$ and $p_2(t)$ measured at $x_1$ and $x_2$ respectively, is defined as:
\[ \rho_p(r) = \frac{\displaystyle\int (p_1(t) - \bar{p}_1)(p_2(t) - \bar{p}_2)\, dt}{\left[ \displaystyle\int (p_1(t) - \bar{p}_1)^2\, dt \cdot \int (p_2(t) - \bar{p}_2)^2\, dt \right]^{\frac{1}{2}}} , \tag{5.23} \]
where $\bar{p}_i$ is the time-averaged sound pressure at location $x_i$, $i = 1, 2$. It is shown in [26, 69] that for the case of a diffuse sound field consisting of one single tone frequency $f$, the correlation coefficient $\rho_p(r)$ for sound pressure results in:
\[ \rho_p(r) = \frac{\sin(\kappa r)}{\kappa r} , \]
with the wave number $\kappa = 2\pi f / c$. Sound intensity values are proportional to the squared sound pressure as given by $I(t) = p^2(t) / \varrho_0 c$, where $\varrho_0$ corresponds to the static gas density. Following the same developments as in [26] leads to a correlation coefficient for sound intensity defined as:
\[ \rho_I(r) = \frac{\sin(2\kappa r)}{2\kappa r} . \tag{5.24} \]
Figure 5.2: Spatial correlation coefficients. Top plot: coefficients for sound pressure ($\rho_p$, dashed line) and sound intensity ($\rho_I$, solid line), plotted versus $\kappa r$. Bottom plot: frequency-averaged correlation coefficient for sound intensity $\bar{\rho}_I$, plotted versus $r$ (m).
The result of Eq. (5.24) determines the correlation existing between different measurements of the normalised diffuse-field intensity $I_r$, a distance $r$ apart. The top plot of Figure 5.2 shows the correlation coefficients $\rho_p$ and $\rho_I$ for the sound pressure and sound intensity as a function of the frequency-distance product $\kappa r$. It can be seen that in the case of sound intensity, the correlation between two measurements in a diffuse field drops more quickly as the distance $r$ between the two locations increases. For random fields generated by band-limited noise, an additional averaging of Eq. (5.24) is necessary over the frequency range of interest [69]. The resulting frequency-averaged coefficient $\bar{\rho}_I(r)$ is hence defined as:
\[ \bar{\rho}_I(r) = \frac{1}{r(\kappa_u - \kappa_l)} \int_{\kappa_l r}^{\kappa_u r} \frac{\sin(2\xi)}{2\xi}\, d\xi . \tag{5.25} \]
The integral of Eq. (5.25) can be computed numerically and the bottom plot of Figure 5.2 presents the resulting spatial correlation coefficient $\bar{\rho}_I(r)$ of the sound intensity with frequency averaging between $\kappa_l = 600\pi/c$ and $\kappa_u = 6000\pi/c$ (corresponding to $f \in [300, 3000]$ Hz). The result depicted in this plot shows clearly that the correlation coefficient drops fairly sharply as $r$ increases and becomes practically negligible for distances larger than the correlation distance $r_c \approx 0.1$ m. This means that within the limits of the state space, i.e. for $x \in [x_l, x_u]$, it is sufficient to consider only $N_I$ statistically independent measurements of sound intensity, $\big\{ I_r^{(i)} \big\}_{i=1}^{N_I}$, with:
\[ N_I = \frac{x_u - x_l}{2\, r_c} . \tag{5.26} \]
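The numerical evaluation of Eq. (5.25) mentioned above can be sketched as follows. This is a minimal illustration only: the midpoint integration rule, the speed of sound $c = 343$ m/s and the function names are assumptions made here, as is the rounding of $N_I$ up to the next integer in Eq. (5.26).

```python
import math


def rho_intensity(kappa_r):
    """Single-frequency intensity correlation of Eq. (5.24), versus kappa * r."""
    if kappa_r == 0.0:
        return 1.0
    return math.sin(2.0 * kappa_r) / (2.0 * kappa_r)


def rho_intensity_band(r, f_low=300.0, f_high=3000.0, c=343.0, n=2000):
    """Frequency-averaged coefficient of Eq. (5.25), by midpoint integration."""
    if r == 0.0:
        return 1.0
    k_l = 2.0 * math.pi * f_low / c
    k_u = 2.0 * math.pi * f_high / c
    a, b = k_l * r, k_u * r
    h = (b - a) / n
    integral = h * sum(rho_intensity(a + (i + 0.5) * h) for i in range(n))
    return integral / (r * (k_u - k_l))


def independent_measurements(x_l, x_u, r_c):
    """N_I of Eq. (5.26); rounding up to an integer is an assumption made here."""
    return math.ceil((x_u - x_l) / (2.0 * r_c))
```

Evaluating `rho_intensity_band` at $r = 0.1$ m under these assumptions yields a coefficient well below 0.1, in line with the choice of correlation distance discussed above.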
From this analysis, it follows that Eq. (5.19) now becomes:
\[ P_{H_1} = \Pr\left[ \max_i I_r^{(i)} \leq \frac{\mathcal{P}(x_s)}{\bar{I}_r} \right] , \tag{5.27} \]
where each $I_r^{(i)}$, $i \in \{1, \ldots, N_I\}$, can be considered as a realisation of the random process $I_r$ (diffuse field sound intensity) with PDF $p_{I_r}(\cdot)$ given by Eq. (5.20). The expression on the right-hand side of Eq. (5.27) is hence the cumulative distribution function (CDF) of the maximum of $N_I$ i.i.d. random variables, and is known to be [91]:
\[ P_{H_1} = \left[ F_{I_r}\!\left( \frac{\mathcal{P}(x_s)}{\bar{I}_r} \right) \right]^{N_I} , \tag{5.28} \]
where $F_{I_r}(\xi)$ is the CDF of the random field intensity $I_r$ and is defined as:
\[ F_{I_r}(\xi) = \int_{-\infty}^{\xi} p_{I_r}(z)\, dz . \tag{5.29} \]
Eq. (5.28), together with Eqs. (5.20)–(5.22), (5.26) and (5.29), can be used to compute numerically the parameter $P_{H_1}$. The final equation for the computation of $P_{H_1}$ is derived in Appendix B. In practice, the numerical value of this probability hence follows from a wide range of variables including the state-space limits $x_l$ and $x_u$, the frequency range $[\kappa_l, \kappa_u]$ of the narrow-band noise, the acoustic power $W_s$ of the source, and other structural parameters such as enclosure volume $V$, reverberation time $T_{60}$ and correlation distance $r_c$. Defining some standard numerical values for these different model parameters, and using the formula for $P_{H_1}$ derived in Appendix B (Eq. (B.5)), Figure 5.3 depicts a typical example of the relationship existing between $P_{H_1}$ and $T_{60}$. Note the close match of these theoretical results with the research outcome presented in [21] and reproduced earlier in Figure 2.2. The latter figure shows the experimental percentage of anomalous GCC time delay estimates as a function of $T_{60}$, which can be seen as an empirical measure of the false detection probability $P_{H_0} = 1 - P_{H_1}$ (i.e. the complement of the results depicted in Figure 5.3).
Figure 5.3: Detection probability $P_{H_1}$ plotted versus reverberation time $T_{60}$ (s), for different values of enclosure volume $V$ (curves for $V$ = 50, 70, 90 and 110).
Figure 5.3 gives some insight into the basic dependences between PH1 and other model variables specific to the ASL problem definition. However, in an attempt to reduce the model complexity, the number of free parameters in the analysis will be cut down by only considering the probability PH1 from now on. The numerical values of other model variables then follow either directly from PH1 or as a result of assigning specific values to other parameters of lesser importance (such as room volume or state-space limits). The numerical value of specific variables involved in the coming experimental simulations will be defined where appropriate in the next sections.
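To make the structure of Eq. (5.28) more tangible, the following sketch evaluates the CDF of Eq. (5.29) for the density of Eq. (5.20) and raises it to the power $N_I$. It is not the formula of Appendix B: the ratio $\mathcal{P}(x_s)/\bar{I}_r$ is simply passed in as an argument (`intensity_ratio`, a hypothetical name), and the closed-form CDF used here relies on $N_t$ being an integer.

```python
import math


def gamma_cdf_unit_mean(xi, n_t):
    """CDF of Eq. (5.29) for the density of Eq. (5.20) (integer N_t, unit mean).

    For integer shape the CDF has the closed form
    F(xi) = 1 - exp(-N_t xi) * sum_{n=0}^{N_t-1} (N_t xi)^n / n!,
    where each term is evaluated in the log domain to avoid overflow.
    """
    if xi <= 0.0:
        return 0.0
    z = n_t * xi
    survival = sum(math.exp(n * math.log(z) - z - math.lgamma(n + 1))
                   for n in range(n_t))
    return min(1.0, max(0.0, 1.0 - survival))


def detection_probability(intensity_ratio, n_t, n_i):
    """P_H1 of Eq. (5.28); intensity_ratio stands for P(x_s) / mean I_r."""
    return gamma_cdf_unit_mean(intensity_ratio, n_t) ** n_i
```

As expected from Eq. (5.28), the returned probability grows with the intensity ratio and decreases as the number $N_I$ of independent diffuse-field measurements increases.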
5.4.3
Observation-related PCRB Computations
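Based on the observation PDF defined in Eq. (5.18), the PCRB computations can now be finalised. In Eq. (5.10d), the second term involved in the computation of $\mathbf{D}_{k-1}^{(4)}$ requires the derivation of the term:
\[ \log p(Y_k | \mathbf{X}_k) = \log\Big[ P_{H_1} \cdot \mathcal{N}(Y_k; x_k, \sigma_{\mathrm{obs}}^2) + (1 - P_{H_1}) \cdot \mathcal{U}(Y_k; x_l, x_u) \Big] , \tag{5.30} \]
where the simplification $\mathbf{h}^T \mathbf{X}_k = x_k$ has been made. Also, values of $x_k$ located outside the state space are of no interest for the current analysis. Therefore, all values of $x_k$ are from here on assumed to be within the state-space limits by definition, i.e. $x_k \in [x_l, x_u]$, $\forall k$. This implies that the uniform PDF $\mathcal{U}(\,\cdot\,; x_l, x_u)$ in Eq. (5.30) can be simply replaced by $(x_u - x_l)^{-1}$. With the help of standard calculus derivations, the negative second-order partial derivatives of Eq. (5.30) can be shown to be:
\[ -\Delta_{\mathbf{X}_k}^{\mathbf{X}_k} \log p(Y_k | \mathbf{X}_k) = -\begin{bmatrix} \dfrac{\partial^2}{\partial x_k^2} & \dfrac{\partial}{\partial x_k}\dfrac{\partial}{\partial \dot{x}_k} \\[2mm] \dfrac{\partial}{\partial \dot{x}_k}\dfrac{\partial}{\partial x_k} & \dfrac{\partial^2}{\partial \dot{x}_k^2} \end{bmatrix} \log\left[ \frac{P_{H_1}}{\sigma_{\mathrm{obs}}\sqrt{2\pi}} \exp\left( -\frac{(Y_k - x_k)^2}{2\sigma_{\mathrm{obs}}^2} \right) + \frac{1 - P_{H_1}}{x_u - x_l} \right] = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} \cdot \Lambda(\alpha_k) , \tag{5.31} \]
where the term $\Lambda(\alpha_k)$ is defined as follows:
\[ \Lambda(\alpha_k) = \frac{P_{H_1}}{\sqrt{2\pi}} \cdot \frac{\beta \left( 1 - \dfrac{\alpha_k}{\sigma_{\mathrm{obs}}^2} \right) + \dfrac{P_{H_1}}{\sigma_{\mathrm{obs}}\sqrt{2\pi}} \exp\left( \dfrac{-\alpha_k}{2\sigma_{\mathrm{obs}}^2} \right)}{\sigma_{\mathrm{obs}}^3\, \beta^2 \exp\left( \dfrac{\alpha_k}{2\sigma_{\mathrm{obs}}^2} \right) + \dfrac{2\beta P_{H_1} \sigma_{\mathrm{obs}}^2}{\sqrt{2\pi}} + \dfrac{P_{H_1}^2 \sigma_{\mathrm{obs}}}{2\pi} \exp\left( \dfrac{-\alpha_k}{2\sigma_{\mathrm{obs}}^2} \right)} , \tag{5.32a} \]
with the constant:
\[ \beta = \frac{1 - P_{H_1}}{x_u - x_l} . \tag{5.32b} \]
Here, the squared observation error $\alpha_k$ at time $k$ is defined as: $\alpha_k = (Y_k - x_k)^2$. The second term of matrix $\mathbf{D}_{k-1}^{(4)}$ defined in Eq. (5.10d) is hence given by:
\[ E\big\{ -\Delta_{\mathbf{X}_k}^{\mathbf{X}_k} \log p(Y_k | \mathbf{X}_k) \big\} = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} \cdot E\{\Lambda(\alpha_k)\} , \tag{5.33} \]
which provides the last formula required for the PCRB computations. Together with the results previously given in Eq. (5.16) (which are here reproduced once again for convenience), the final computations of the PCRB recursion matrices can be summarised as follows:
\[ \mathbf{D}_{k-1}^{(1)} = \mathbf{G}^T \mathbf{Q}^{-1} \mathbf{G} , \tag{5.34a} \]
\[ \mathbf{D}_{k-1}^{(2)} = -\mathbf{G}^T \mathbf{Q}^{-1} , \tag{5.34b} \]
\[ \mathbf{D}_{k-1}^{(3)} = -\mathbf{Q}^{-1} \mathbf{G} , \tag{5.34c} \]
\[ \mathbf{D}_{k-1}^{(4)} = \mathbf{Q}^{-1} + \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} \cdot E\{\Lambda(\alpha_k)\} . \tag{5.34d} \]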
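As an illustration of the observation-related term in Eq. (5.34d), the sketch below evaluates $\Lambda(\alpha_k)$ and estimates its expectation by plain sample averaging. Several simplifying assumptions are made here and are not part of the above derivation: the source position `x_k` is held fixed instead of being drawn from many trajectories, the function names are hypothetical, and the averaging uses an algebraically equivalent form of Eq. (5.32a) (obtained by multiplying numerator and denominator by $\exp(-\alpha_k / 2\sigma_{\mathrm{obs}}^2)$) which only involves decaying exponentials and therefore avoids overflow for large $\alpha_k$.

```python
import math
import random


def lambda_printed(alpha, p_h1, sigma, beta):
    """Lambda(alpha_k) written as in Eq. (5.32a)."""
    s2pi = math.sqrt(2.0 * math.pi)
    e_neg = math.exp(-alpha / (2.0 * sigma**2))
    e_pos = math.exp(alpha / (2.0 * sigma**2))   # may overflow for very large alpha
    num = beta * (1.0 - alpha / sigma**2) + p_h1 * e_neg / (sigma * s2pi)
    den = (sigma**3 * beta**2 * e_pos + 2.0 * beta * p_h1 * sigma**2 / s2pi
           + p_h1**2 * sigma * e_neg / (2.0 * math.pi))
    return p_h1 / s2pi * num / den


def lambda_stable(alpha, p_h1, sigma, beta):
    """Equivalent form of Eq. (5.32a) using only decaying exponentials."""
    # g is the weighted Gaussian component of the mixture in Eq. (5.18).
    g = p_h1 * math.exp(-alpha / (2.0 * sigma**2)) / (sigma * math.sqrt(2.0 * math.pi))
    return (g * g / sigma**2 + g * beta * (sigma**2 - alpha) / sigma**4) / (g + beta) ** 2


def expected_lambda_mc(x_k, p_h1, sigma, x_l, x_u, n_samples=100_000, seed=0):
    """Sample-average estimate of E{Lambda(alpha_k)} for a fixed position x_k."""
    rng = random.Random(seed)
    beta = (1.0 - p_h1) / (x_u - x_l)            # Eq. (5.32b)
    total = 0.0
    for _ in range(n_samples):
        # Draw Y_k from the mixture model of Eq. (5.17).
        if rng.random() < p_h1:
            y_k = x_k + rng.gauss(0.0, sigma)
        else:
            y_k = rng.uniform(x_l, x_u)
        total += lambda_stable((y_k - x_k) ** 2, p_h1, sigma, beta)
    return total / n_samples
```

In the limiting case $P_{H_1} = 1$ (no false detections), $\beta$ vanishes and $\Lambda(\alpha_k)$ reduces to $1/\sigma_{\mathrm{obs}}^2$, i.e. the Fisher information of a purely Gaussian observation; this provides a simple check of the expressions. The resulting estimate would then be added to the (1,1) element of $\mathbf{Q}^{-1}$ to form $\mathbf{D}_{k-1}^{(4)}$.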
The expectation in Eq. (5.34d) can be computed using either of two numerical methods. The first one consists in Monte Carlo averaging over a (possibly large) number of realisations of $\Lambda(\alpha_k)$. That is, the sequence $\{\Lambda(\alpha_k)\}_{k=1}^{K}$ is numerically computed according to Eq. (5.32) for a large number of trajectories $\{x_k\}_{k=1}^{K}$ and, for each of these trajectories, a large number of observations $\{Y_k\}_{k=1}^{K}$. The value of $E\{\Lambda(\alpha_k)\}$ then results as the average $\bar{\Lambda}(\alpha_k)$ taken over all these $\Lambda(\alpha_k)$ realisations.

The second method consists in numerically computing the integration involved in the theoretical formula of the expectation operator:
\[
E\{\Lambda(\alpha_k)\} = E\{\Lambda(\mathbf{X}_k, Y_k)\} = \iint \Lambda(\mathbf{X}_k, Y_k) \, p(\mathbf{X}_k, Y_k) \, d\mathbf{X}_k \, dY_k . \tag{5.35}
\]
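The two approaches can be compared on a reduced version of the problem. The sketch below is an illustration under stated assumptions rather than the thesis implementation: the target position is held fixed at a hypothetical value x = 5 m, so that only the expectation over the observation $Y_k$ remains, and $\Lambda$ is written in the compact form $g\,[g + \beta(1 - \alpha/\sigma_{obs}^2)] / [\sigma_{obs}^2 (g + \beta)^2]$, with $g$ the Gaussian part of the observation density, which is an algebraic rearrangement of Eq. (5.32a). All function names and sample sizes are made up for the example.

```python
import numpy as np

def gauss_part(alpha, p_h1, sigma):
    """P_H1 * N(Y; x, sigma^2) written as a function of alpha = (Y - x)^2."""
    return p_h1 * np.exp(-alpha / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

def lam(alpha, p_h1, sigma, beta):
    """Compact rearrangement of Eq. (5.32a)."""
    g = gauss_part(alpha, p_h1, sigma)
    return g * (g + beta * (1 - alpha / sigma**2)) / (sigma**2 * (g + beta) ** 2)

def expected_lam_mc(x, p_h1, sigma, xl, xu, n, rng):
    """Monte Carlo average of Lambda over n simulated observations."""
    beta = (1 - p_h1) / (xu - xl)
    detected = p_h1 > rng.random(n)                   # hypothesis H1 with probability P_H1
    y = np.where(detected,
                 x + sigma * rng.standard_normal(n),  # detection: Gaussian around x
                 rng.uniform(xl, xu, n))              # misdetection: uniform clutter
    return lam((y - x) ** 2, p_h1, sigma, beta).mean()

def expected_lam_grid(x, p_h1, sigma, xl, xu, n_grid=200_001):
    """Riemann-sum approximation of the integral of Lambda * p(Y | x) over [xl, xu]."""
    beta = (1 - p_h1) / (xu - xl)
    y, dy = np.linspace(xl, xu, n_grid, retstep=True)
    alpha = (y - x) ** 2
    pdf = gauss_part(alpha, p_h1, sigma) + beta       # observation density, cf. Eq. (5.18)
    return np.sum(lam(alpha, p_h1, sigma, beta) * pdf) * dy

rng = np.random.default_rng(0)
mc = expected_lam_mc(5.0, 0.75, 0.1, 0.0, 10.0, 200_000, rng)
grid = expected_lam_grid(5.0, 0.75, 0.1, 0.0, 10.0)
print(mc, grid)
```

Both estimates stay below $P_{H_1}/\sigma_{obs}^2$, the value that would be obtained if the uniform clutter term carried no weight in the denominator of the score.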
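The joint probability $p(\mathbf{X}_k, Y_k)$ of parameters $\mathbf{X}_k$ and $Y_k$ can be easily derived from previous results, namely the transition PDF of Eq. (5.13) and the observation density defined in Eq. (5.18):
\[
p(\mathbf{X}_k, Y_k) = p(Y_k | \mathbf{X}_k) \, p(\mathbf{X}_k | \mathbf{X}_{k-1}) = p(Y_k | x_k) \, \mathcal{N}(\mathbf{X}_k; \mathbf{G} \mathbf{X}_{k-1}, \mathbf{Q}) .
\]
Note that for the computation of a lower bound on estimation error, the target trajectory is implicitly assumed to be known, i.e. the variables $\mathbf{X}_k$ in $p(Y_k | \mathbf{X}_k)$ and $\mathbf{X}_{k-1}$ in $p(\mathbf{X}_k | \mathbf{X}_{k-1})$ are considered given. Eq. (5.35) then becomes:
\begin{align}
E\{\Lambda(\alpha_k)\} &= \iint \Lambda(\alpha_k) \, p(Y_k | x_k) \left[ \int \mathcal{N}(\mathbf{X}_k; \mathbf{G} \mathbf{X}_{k-1}, \mathbf{Q}) \, d\dot{x}_k \right] dx_k \, dY_k \notag \\
&= \iint \Lambda(\alpha_k) \, p(Y_k | x_k) \, \mathcal{N}\!\left( x_k; x_{k-1} + T_u \dot{x}_{k-1}, \eta T_u^2 \right) dx_k \, dY_k , \tag{5.36a}
\end{align}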
with the observation density given as (see Eq. (5.18)):
\[
p(Y_k | x_k) = P_{H_1} \cdot \mathcal{N}(Y_k; x_k, \sigma_{obs}^2) + \frac{1 - P_{H_1}}{x_u - x_l} . \tag{5.36b}
\]

5.4.4
Special Case: Noiseless System
It is often the case in practice that the performance of a tracking algorithm is assessed with the assumption of a purely deterministic target trajectory (no process noise) [13, 98–100]. This corresponds to the case of a vanishing system noise variable in Eq. (5.11), uk = 0, which in turn implies η → 0. Since this results in a singular noise covariance matrix Q, the developments of Section 5.3.3 are not applicable
any longer for this special case. Similarly to e.g. [100], an alternative approach is followed to compute the PCRB for deterministic trajectories, as explained below. The system equation for a purely deterministic system simply reads:
\[
\mathbf{X}_k = \mathbf{G} \mathbf{X}_{k-1} . \tag{5.37}
\]
Note that despite the vanishing noise variable in the transition equation, there still exists a degree of uncertainty about the state variable $\mathbf{X}_k$. This uncertainty, which sequentially propagates through time, originates from the initial target position not being exactly known. Instead, it is assumed that some a priori statistical information about $\mathbf{X}_0$ is available in the form of an additional “fictitious” measurement at time $k = 0$ (see footnote c). If the distribution of the initial target state is assumed Gaussian with mean $\bar{\mathbf{X}}_0$ and covariance matrix $\mathbf{P}_0$, $\mathbf{X}_0 \sim \mathcal{N}(\bar{\mathbf{X}}_0, \mathbf{P}_0)$, the PDF of the state variable at time $k$ can then be shown to be:
\[
\mathbf{X}_k \sim \mathcal{N}\!\left( \mathbf{G}^k \bar{\mathbf{X}}_0, \, \mathbf{G}^k \mathbf{P}_0 [\mathbf{G}^k]^T \right) , \tag{5.38}
\]
with:
\[
\mathbf{G}^k = \begin{bmatrix} 1 & k T_u \\ 0 & 1 \end{bmatrix} .
\]
According to Eq. (5.5), the generic definition of the Fisher information matrix is given as:
\[
\mathbf{J}_k = E\left\{ -\Delta_{\mathbf{X}_k}^{\mathbf{X}_k} \log p(\mathbf{X}_k, Y_{1:k}) \right\} . \tag{5.39}
\]

Footnote c: Alternatively, this information can also be viewed as being similar to a random filter initialisation, see [95, 110].
Here, the joint PDF of the state and observation variables is derived as follows:
\begin{align}
p(\mathbf{X}_k, Y_{1:k}) &= p(\mathbf{X}_k) \, p(Y_{1:k} | \mathbf{X}_k) \notag \\
&= p(\mathbf{X}_k) \prod_{i=1}^{k} p(Y_i | \mathbf{X}_i) = p(\mathbf{X}_k) \prod_{i=1}^{k} p(Y_i | x_i) , \tag{5.40}
\end{align}
which assumes statistical independence between the observations $\{Y_i\}_{i=1}^{k}$. In the second line, the change of state variable index from $k$ to $i$ in the PDF $p(Y_i | \mathbf{X}_i)$ is a result of the purely deterministic transition equation, Eq. (5.37): given $\mathbf{X}_k$, any variable in the sequence $\{\mathbf{X}_i\}_{i=1}^{k}$ is also known exactly. The simple relationship of Eq. (5.37) also leads to $\mathbf{X}_k = \mathbf{G}^{k-i} \mathbf{X}_i$ and $\mathbf{X}_i = \mathbf{G}^{i-k} \mathbf{X}_k$, and more specifically for the position variable: $x_i = x_k - (k - i) T_u \dot{x}_k$, $\forall i \in \{1, \ldots, k\}$. Using the notation $T_\Delta = (k - i) T_u$ and the result of Eq. (5.40), the natural logarithm of $p(\mathbf{X}_k, Y_{1:k})$ in Eq. (5.39) hence becomes:
\[
\log p(\mathbf{X}_k, Y_{1:k}) = \log p(\mathbf{X}_k) + \sum_{i=1}^{k} \log p(Y_i | x_k - T_\Delta \dot{x}_k) . \tag{5.41}
\]
Given the state PDF of Eq. (5.38), the negative second-order partials of the first term in Eq. (5.41) follow in a trivial way:
\[
-\Delta_{\mathbf{X}_k}^{\mathbf{X}_k} \log p(\mathbf{X}_k) = \left( \mathbf{G}^k \mathbf{P}_0 [\mathbf{G}^k]^T \right)^{-1} . \tag{5.42}
\]
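The computations for the second term of Eq. (5.41) are more complex, due to the PDF $p(Y_i | x_i)$ now involving both the variables $x_k$ and $\dot{x}_k$. However, following the same developments as in Section 5.4.3 provides a result similar to Eq. (5.31). Since $x_i = x_k - T_\Delta \dot{x}_k$ is linear in the state, with $\partial x_i / \partial x_k = 1$ and $\partial x_i / \partial \dot{x}_k = -T_\Delta$, the chain rule gives:
\begin{align}
-\Delta_{\mathbf{X}_k}^{\mathbf{X}_k} \log p(Y_i | x_k - T_\Delta \dot{x}_k)
&= -\Delta_{\mathbf{X}_k}^{\mathbf{X}_k} \log\!\left[ P_{H_1} \cdot \mathcal{N}(Y_i; x_k - T_\Delta \dot{x}_k, \sigma_{obs}^2) + \frac{1 - P_{H_1}}{x_u - x_l} \right] \notag \\
&= \begin{bmatrix} 1 & -T_\Delta \\ -T_\Delta & T_\Delta^2 \end{bmatrix} \cdot \Lambda(\alpha_k) , \tag{5.43}
\end{align}
where $\Lambda(\alpha_k)$ is defined as in Eq. (5.32), but with the difference that $\alpha_k$ now corresponds to: $\alpha_k = (Y_i - x_k + T_\Delta \dot{x}_k)^2$.

With the results from Eqs. (5.42) and (5.43), Eq. (5.39) finally leads to the following PCRB computation formula for a system without process noise:
\begin{align}
\mathbf{J}_k &= E\left\{ -\Delta_{\mathbf{X}_k}^{\mathbf{X}_k} \log p(\mathbf{X}_k) - \sum_{i=1}^{k} \Delta_{\mathbf{X}_k}^{\mathbf{X}_k} \log p(Y_i | x_k - T_\Delta \dot{x}_k) \right\} \notag \\
&= \left( \mathbf{G}^k \mathbf{P}_0 [\mathbf{G}^k]^T \right)^{-1} + \sum_{i=1}^{k} \begin{bmatrix} 1 & -T_\Delta \\ -T_\Delta & T_\Delta^2 \end{bmatrix} \cdot E\{\Lambda(\alpha_k)\} . \tag{5.44}
\end{align}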
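The structure of Eq. (5.44) can be illustrated with a short Python sketch. It rests on two assumptions that are not stated in the text: the expectation $E\{\Lambda(\alpha_k)\}$ is replaced by a single constant `lam_bar` (a hypothetical value of 60 is used below), and the off-diagonal entries of the per-observation matrix carry the sign of $\partial x_i / \partial \dot{x}_k = -T_\Delta$. The function names are illustrative.

```python
import numpy as np

def noiseless_fim(k, t_u, p0, lam_bar):
    """Fisher information J_k of Eq. (5.44), with E{Lambda} taken as the constant lam_bar."""
    g_k = np.array([[1.0, k * t_u], [0.0, 1.0]])          # G^k of Eq. (5.38)
    j = np.linalg.inv(g_k @ p0 @ g_k.T)                    # prior term, Eq. (5.42)
    for i in range(1, k + 1):
        t_d = (k - i) * t_u                                # T_Delta for observation i
        j = j + lam_bar * np.array([[1.0, -t_d], [-t_d, t_d**2]])
    return j

def sqrt_pcrb_position(j):
    """Square root of the bound on the position error, i.e. sqrt([J^-1]_11)."""
    return np.sqrt(np.linalg.inv(j)[0, 0])

p0 = np.diag([0.01, 0.04])      # initial covariance quoted in Section 5.4.5
t_u = 0.05                      # update interval quoted in Section 5.4.5
lam_bar = 60.0                  # hypothetical stand-in for E{Lambda(alpha_k)}

for k in (1, 10, 50, 150):
    print(k, sqrt_pcrb_position(noiseless_fim(k, t_u, p0, lam_bar)))
```

The same matrix is obtained from the information-form recursion $\mathbf{J}_k = \mathbf{G}^{-T} \mathbf{J}_{k-1} \mathbf{G}^{-1} + \mathrm{diag}(1, 0) \cdot \bar{\Lambda}$ started at $\mathbf{J}_0 = \mathbf{P}_0^{-1}$, which offers a convenient cross-check of the summation.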
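Here again, the computation of the expectation in the term $E\{\Lambda(\alpha_k)\}$ can be carried out using either Monte Carlo averaging or by numerically computing the following integration:
\begin{align}
E\{\Lambda(\alpha_k)\} &= E\{\Lambda(Y_i, x_k, \dot{x}_k)\} \notag \\
&= \iiint \Lambda(Y_i, x_k, \dot{x}_k) \, p(\mathbf{X}_k, Y_i) \, dx_k \, d\dot{x}_k \, dY_i \notag \\
&= \iiint \Lambda(Y_i, x_k, \dot{x}_k) \, p(Y_i | \mathbf{X}_k) \, p(\mathbf{X}_k) \, dx_k \, d\dot{x}_k \, dY_i , \tag{5.45a}
\end{align}
where the integration PDFs are defined as previously, i.e.:
\begin{align}
p(\mathbf{X}_k) &= \mathcal{N}\!\left( \mathbf{X}_k; \mathbf{G}^k \bar{\mathbf{X}}_0, \, \mathbf{G}^k \mathbf{P}_0 [\mathbf{G}^k]^T \right) , \tag{5.45b} \\
p(Y_i | \mathbf{X}_k) &= P_{H_1} \cdot \mathcal{N}(Y_i; x_k - T_\Delta \dot{x}_k, \sigma_{obs}^2) + \frac{1 - P_{H_1}}{x_u - x_l} . \tag{5.45c}
\end{align}

5.4.5
PCRB Results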
Some practical PCRB results are now presented based on the theory developed so far. This is done by considering a numerical example with the following simulation parameters. The target is assumed to be in motion within the state-space limits $x_l = 0$ m and $x_u = 10$ m. The original target position is set to $x_0 = 2$ m with an initial target velocity $\dot{x}_0 = 0.3$ m/s, i.e. $\mathbf{X}_0 = \bar{\mathbf{X}}_0 = [2 \;\; 0.3]^T$. The covariance matrix of the original source distribution is defined with the following numerical values:
\[
\mathbf{P}_0 = \begin{bmatrix} \sigma_{x_0}^2 & 0 \\ 0 & \sigma_{\dot{x}_0}^2 \end{bmatrix} = \begin{bmatrix} 0.01 & 0 \\ 0 & 0.04 \end{bmatrix} .
\]
Following a standard practice, the initial Fisher information matrix at time $k = 0$ in Eqs. (5.9) and (5.44) is also defined by the original state covariance matrix, i.e. $\mathbf{J}_0 = \mathbf{P}_0^{-1}$. As mentioned earlier, the standard deviation $\sigma_{obs}$ of the observation noise process is set to 0.1 m. When using Monte Carlo averaging to compute the expectations of Eqs. (5.34d) and (5.44), a total of 2500 simulation runs were used. For the noiseless system case, this corresponds to 2500 simulations of the observation sequence $\{Y_k\}_{k=1}^{K}$. In the case of a non-vanishing process noise, the average was taken over 50 simulations of $\{\mathbf{X}_k\}_{k=1}^{K}$ with 50 simulations of the observation sequence $\{Y_k\}_{k=1}^{K}$ for each of these trajectories. The total number of simulation steps $K$ was arbitrarily set to 150, and the update interval was finally defined as $T_u = 0.05$ s.

[Figure 5.4: two plots of $\sqrt{\mathrm{PCRB}_{x_k}}$ (m) against time index $k$ (0 to 150). Left-hand legend: $P_{H_1}$ = 0.15, 0.35, 0.55, 0.75, 0.95. Right-hand legend: $\eta$ = 0, $1.2 \cdot 10^{-5}$, $4.8 \cdot 10^{-5}$, $1.9 \cdot 10^{-4}$, $7.6 \cdot 10^{-4}$, $3.0 \cdot 10^{-3}$.]
Figure 5.4: Typical square-root PCRB on estimation error for $x_k$, using a simplified observation model: Monte Carlo averaging (solid lines) and numerical expectation integration (◦). Left-hand plot: variable detection probability $P_{H_1}$ with $\eta = 5.1 \cdot 10^{-4}$. Right-hand plot: variable noise parameter $\eta$ with $P_{H_1} = 0.75$ (dashed line corresponds to noiseless system results).
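For a system with process noise, the matrices of Eq. (5.34) enter a recursion on the Fisher information matrix. The sketch below assumes that Eq. (5.9) has the standard form $\mathbf{J}_k = D^{(4)}_{k-1} - D^{(3)}_{k-1} (\mathbf{J}_{k-1} + D^{(1)}_{k-1})^{-1} D^{(2)}_{k-1}$, since that equation is not reproduced in this section. The covariance `q` used below is a hypothetical stand-in for the matrix $\mathbf{Q}$ of Eq. (5.13), chosen only so that its position variance equals the $\eta T_u^2$ appearing in Eq. (5.36a), and `lam_bar` is again an assumed constant value for $E\{\Lambda(\alpha_k)\}$.

```python
import numpy as np

def pcrb_recursion(j0, g, q, lam_bar, n_steps):
    """Iterate J_k = D4 - D3 (J_{k-1} + D1)^-1 D2 with the matrices of Eq. (5.34).

    Returns the final information matrix and the sequence of sqrt-PCRB values
    for the position component.
    """
    q_inv = np.linalg.inv(q)
    d1 = g.T @ q_inv @ g                                        # Eq. (5.34a)
    d2 = -g.T @ q_inv                                           # Eq. (5.34b)
    d3 = -q_inv @ g                                             # Eq. (5.34c)
    d4 = q_inv + lam_bar * np.array([[1.0, 0.0], [0.0, 0.0]])   # Eq. (5.34d)
    j = j0.copy()
    bounds = []
    for _ in range(n_steps):
        j = d4 - d3 @ np.linalg.inv(j + d1) @ d2
        bounds.append(np.sqrt(np.linalg.inv(j)[0, 0]))
    return j, np.array(bounds)

t_u, eta = 0.05, 5.1e-4                       # values quoted in Section 5.4.5 and Figure 5.4
g = np.array([[1.0, t_u], [0.0, 1.0]])        # constant-velocity transition matrix
q = eta * np.diag([t_u**2, 1.0])              # hypothetical Q (assumption, see lead-in)
j0 = np.linalg.inv(np.diag([0.01, 0.04]))     # J_0 = P_0^-1
lam_bar = 60.0                                # hypothetical E{Lambda(alpha_k)}

j_final, bounds = pcrb_recursion(j0, g, q, lam_bar, 150)
print(bounds[0], bounds[-1])
```

With a constant `lam_bar`, each step of this recursion coincides with the information-form Kalman update $(\mathbf{G} \mathbf{J}_{k-1}^{-1} \mathbf{G}^T + \mathbf{Q})^{-1} + \mathrm{diag}(1, 0) \cdot \bar{\Lambda}$, by the matrix inversion lemma.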
Figure 5.4 presents some typical results obtained using the developments elaborated in this section. The solid curves depicted in this figure were obtained using the Monte Carlo averaging technique. The results obtained by computing the numerical expectation integral in Eqs. (5.36) and (5.45) are also displayed to confirm the results (superimposed circle markers). The left-hand plot depicts $\sqrt{\mathrm{PCRB}_{x_k}}$, i.e. the square root of the lower bound on the estimation error for the target location $x_k$, as a function of time index $k$ and for a variable probability $P_{H_1}$. The different curves in this plot were all obtained with a fixed noise parameter $\eta = 5.1 \cdot 10^{-4}$. The second part of this figure shows the same PCRB results obtained this time with a variable process noise parameter $\eta$. Since large values of noise variance drastically influence the source trajectory, the numerical range of $\eta$ used in this plot is somewhat limited by the size of the considered state space. In any case, the simulation data were discarded for trajectories where the target happened to move beyond the above defined state-space boundaries. The dashed line (label $\eta = 0$) corresponds to results obtained for a system without process noise (see Section 5.4.4). Each curve in this plot was obtained using a detection probability $P_{H_1}$ set to 0.75.

[Figure 5.5: plot of $\sqrt{\mathrm{PCRB}_{x_k}}$ (m) against $P_{H_1}$ (0 to 1). Legend: $\eta$ = $1.0 \cdot 10^{-5}$, $5.9 \cdot 10^{-5}$, $3.4 \cdot 10^{-4}$, $2.0 \cdot 10^{-3}$.]
Figure 5.5: Square-root steady-state PCRB for $x_k$ as a function of the target detection probability $P_{H_1}$ (system noise $\eta$ as parameter).
Due to the relative complexity of the PCRB analysis presented in previous sections, it is generally difficult to compute the limit $\mathbf{J}_\infty$ analytically as $k \to \infty$. However, as clearly depicted in Figure 5.4, the steady-state value of $\sqrt{\mathrm{PCRB}_{x_k}}$ can be easily read (approximately) from the plots. Because the evolution of the theoretical bound during the first few frames is only influenced by the initial value $\mathbf{J}_0$ of the sequence $\{\mathbf{J}_k\}_{k=0}^{K}$, the limit $\mathbf{J}_\infty$ is really the only parameter of interest in these simulations. Figures 5.5 and 5.6 hence present the results obtained for $\sqrt{\mathrm{PCRB}_{x_k}}$, i.e. the square-root steady-state PCRB value for parameter $x_k$, which was derived from plots similar to those depicted in Figure 5.4 by discarding the initial “convergence” phase of the curves. Figure 5.5 shows the steady-state lower bound as a function of the detection probability $P_{H_1}$ for a few values of the free parameter $\eta$. Figure 5.6 shows the same PCRB with the roles of $P_{H_1}$ and $\eta$ interchanged.
[Figure 5.6: plot of $\sqrt{\mathrm{PCRB}_{x_k}}$ (m) against $\eta$ (0 to $3 \cdot 10^{-3}$). Legend: $P_{H_1}$ = 0.25, 0.5, 0.75, 1.]
Figure 5.6: Square-root steady-state PCRB for $x_k$ as a function of the system noise variable $\eta$ (detection probability $P_{H_1}$ as parameter).
5.4.6
Result Analysis
The results depicted in Section 5.4.5 confirm the obvious expectation that higher values of system noise and lower detection probabilities yield a larger theoretical bound on estimation error. Both these conditions explicitly correspond to an increase of the overall disturbance level in the system, meaning that the target position estimates delivered by any tracking algorithm can only become less accurate. As shown in Figure 5.6, the steady-state value of $\sqrt{\mathrm{PCRB}_{x_k}}$ tends towards zero as $\eta \to 0$, which can be trivially explained as follows. In the extreme case of a system without process noise (i.e. $\eta = 0$), the implicit assumption for the PCRB developments is that the source is in constant-velocity motion (see transition model of Eq. (5.37)). With each new observation $Y_k$, the amount of information regarding the target position increases over time, and the estimate of the (constant) source velocity can be made more and more accurate. With an infinite number of observations (i.e. as $k \to \infty$), the estimation error eventually becomes virtually nil.

Finally, the results obtained in this section also serve the more general purpose of validating the theoretical concepts elaborated so far in this chapter. For instance, the results obtained from a computation using Monte Carlo averaging can be seen to be in good agreement with the values resulting from a numerical computation of the expectation integral (as shown in Figure 5.4). The validity of both approaches is hence corroborated, as far as the experimental simulations are concerned.
Also, the plot on the right-hand side of Figure 5.4 clearly demonstrates that the analysis of Section 5.4.4 (system defined with no process noise) indeed constitutes the limit η → 0 in the developments based on a non-zero system noise variable. This was of course expected to be the case.
5.5
Correlated Observations Model
The developments given in this section attempt to extend the previous analysis of Section 5.4 to yield an observation model that is a better representation of what happens in practice. In particular, the effect of the temporal correlation between sound intensity measurements is introduced into this framework. First, the reasons motivating this model update are explained. The following subsections then describe the theoretical concepts of the updated observation equation, and the corresponding results are finally presented and analysed.
5.5.1
Motivation
The spatial correlation between sound intensity values measured in a reverberant field has already been mentioned in this chapter (see Eq. (5.24) in Section 5.4.2). To put this in a different way, this correlation means that sound intensity values measured at two different positions are statistically dependent provided the two measuring locations lie close enough to each other (i.e. for distances smaller than the correlation distance $r_c$). The sound field in an enclosure can be typically analysed using the principles of geometrical room acoustics, i.e. with the concept of sound waves replaced by the concept of sound rays propagating in straight lines [69] (see footnote d). Thus, the intensity measured at a specific point in the enclosure can be viewed as the result of a multitude of sound rays originating from the acoustic source, bouncing off the walls a number of times, and then concentrating again at the receiver position (see footnote e). The correlation existing between intensity values at two slightly different locations can be seen consequently as coming from the “same” rays propagating through space along slightly different trajectories to reach one or the other receiver position. From this point of view, it does not matter whether the difference in the paths followed by the sound rays is introduced through two different positions at the receiver or at the source. By reciprocity, the same sort of correlation should therefore exist when the sound intensity is measured at the same location but with a slightly different source position. Now assuming that a fraction of time has elapsed between the two consecutive intensity measurements (corresponding typically to the time required for the source to move from one location to the other), the velocity of the source can be finally linked to a temporal correlation existing between measurements. The next subsections develop this concept in more mathematical terms and investigate its influence on the previously derived estimation error bound.

Footnote d: Note that as previously mentioned, the following analysis is subject to some conditions on the structural definition of the problem. Mainly, the dimensions of the room and its walls must be large compared to the considered sound wavelength, which will be implicitly assumed in the following developments.

Footnote e: This is exactly the approach used in the image method implementation for simulating sound fields [3], which has been used for experimental simulations in various parts of this work.
5.5.2
Theoretical Developments
The temporal correlation $\rho_I(\tau)$ between sound intensity values directly follows from Eq. (5.24) with the variable $r$ now corresponding to the source (target) displacement: $r = \dot{x}_s \tau$, where $\dot{x}_s$ denotes the source velocity. For sound intensities measured in a single-tone random field at two different times $k_1$ and $k_2$ (but at the same location), $\rho_I(\tau)$ is hence defined by:
\[
\rho_I(\tau) = \frac{\sin(2 \kappa \dot{x}_s \tau)}{2 \kappa \dot{x}_s \tau} ,
\]
with $\tau = |k_2 - k_1|$. In the case of band-limited noise, the frequency-averaged correlation coefficient $\bar{\rho}_I(\tau)$ again simply results from averaging $\rho_I(\tau)$ over the frequency band of interest:
\[
\bar{\rho}_I(\tau) = \frac{1}{\kappa_u - \kappa_l} \int_{\kappa_l}^{\kappa_u} \rho_I(\tau) \, d\kappa . \tag{5.46}
\]
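The two correlation coefficients above are straightforward to evaluate numerically. The following Python sketch approximates the average of Eq. (5.46) by a mean over a uniform grid of wave numbers. The frequency band (300 Hz to 3 kHz), the speed of sound and the function names are hypothetical choices made for the example and are not values taken from the thesis.

```python
import numpy as np

def rho_single_tone(tau, kappa, v_s):
    """rho_I(tau) = sin(2 kappa v_s tau) / (2 kappa v_s tau) for wave number kappa."""
    z = 2.0 * kappa * v_s * np.asarray(tau, dtype=float)
    return np.sinc(z / np.pi)        # np.sinc(x) = sin(pi x) / (pi x), equal to 1 at x = 0

def rho_band(tau, kappa_l, kappa_u, v_s, n_kappa=2001):
    """Approximation of Eq. (5.46): mean of rho_I over a uniform grid of wave numbers."""
    kappa = np.linspace(kappa_l, kappa_u, n_kappa)
    tau = np.atleast_1d(np.asarray(tau, dtype=float))
    return rho_single_tone(tau[:, None], kappa[None, :], v_s).mean(axis=1)

c = 343.0                                         # speed of sound in m/s (assumed)
kappa_l = 2 * np.pi * 300.0 / c                   # hypothetical lower band edge
kappa_u = 2 * np.pi * 3000.0 / c                  # hypothetical upper band edge
tau = np.linspace(0.0, 2.0, 9)                    # time lags in seconds

for v_s in (0.05, 0.1, 0.5):                      # source velocities in m/s
    print(v_s, np.round(rho_band(tau, kappa_l, kappa_u, v_s), 3))
```

Lower source velocities stretch the correlation along the time axis, since the coefficient depends on $\tau$ only through the product $\dot{x}_s \tau$.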
A graphical representation of $\bar{\rho}_I(\tau)$ is hence the same as that shown in Figure 5.2 (bottom plot), except that the abscissa is now scaled by a specific factor determined by $\dot{x}_s$. The temporal analogue to the correlation distance $r_c$ is the correlation time $\tau_c$, which simply follows as $\tau_c = r_c / \dot{x}_s$.

The aim of the following derivations is to introduce the concept of temporal correlation into the observation model previously developed in Eq. (5.17). Hence, the original assumptions given previously in Section 5.4 remain valid for the current analysis. Among others, the observation process is still governed by the detection probability $P_{H_1}$, and the resulting observation $Y_k$ in a misdetected target situation is here also assumed to be uniformly distributed across the state space, i.e. $\varepsilon_k \sim \mathcal{U}(x_l, x_u)$. Consequently, the general observation model and observation PDF in Eqs. (5.17) and (5.18) also remain identical for the current developments, as well as the PCRB computations previously derived on the basis of these equations.

The major difference however appears in the sequential occurrences of the hypotheses $H_1$ and $H_0$. Whereas this process was defined as purely random so far (under the constraint that $H_1$ would occur with probability $P_{H_1}$ on average), it is now to be related to the temporal correlation $\bar{\rho}_I(\tau)$ of the sound intensity values measured across the state space. A practical example of this principle is as follows. Assume that at time $k$, a spurious peak in the SBF output generates an erroneous observation $Y_k \neq x_{s,k}$ (i.e. the observation fails to detect the real target). This practically means that at least one sound intensity measurement $I_r^{(i)}$ in the reverberant part of the sound field registers a larger value than the output of a beamformer aimed at the true source position. In a situation where a large correlation exists between successive measurements, the sound intensity recorded at the spurious peak location will remain large for a few successive frames, making it likely for hypothesis $H_0$ to occur consistently for several frames at a time, rather than in a purely random fashion.
As previously mentioned in Section 5.4.2, it is sufficient to consider only $N_I$ statistically independent sound intensity measurements in the state space, with $N_I$ defined by Eq. (5.26). Also, note that the temporal correlation in the occurrences of $H_0$ (or $H_1$) does not directly correspond to the correlation coefficient $\bar{\rho}_I(\tau)$ of the sound intensity given in Eq. (5.46). Prior to determining whether $H_0$ or $H_1$ is realised at time $k$, a series of sound intensities $\{I_r^{(i)}\}_{i=1}^{N_I}$ must first be considered. The occurrence of $H_0$ or $H_1$ is then determined on the basis of whether the maximum value of $I_r^{(i)}$ is larger or smaller than the peak value generated by the source in the SBF output. These two processes (maximising and thresholding) hence affect the temporal correlation of Eq. (5.46) in a nonlinear way.
The updated observation model attempts to take the above mentioned concepts into account when the series of observations $\{Y_k\}_{k=1}^{K}$ is to be generated. The easiest way to achieve this is to reproduce the practical observation process as follows:

1. At time $k$, generate a sequence of $N_I$ sound intensity values (i.i.d. random variable): $\{I_r^{(i)}(k)\}_{i=1}^{N_I}$. Each of these realisations must be distributed according to the density $\gamma(\,\cdot\,; N_t, N_t^{-1})$, see Eq. (5.20). Also, $I_r^{(i)}(k)$ must be correlated to $I_r^{(i)}(k + \tau)$, $\forall i \in \{1, \ldots, N_I\}$, according to the coefficient $\bar{\rho}_I(\tau)$ of Eq. (5.46).

2. Determine the maximum intensity value $I_r^{(i_{max})}$ for the current time step, where $i_{max} = \arg\max_i \{ I_r^{(i)}(k) \}$.

3. Set $Y_k$ according to one of the following options:

   a) Correct target detection (hypothesis $H_1$): if $I_r^{(i_{max})}$ is smaller than the SBF output at the true source location, i.e. $I_r^{(i_{max})} \leq \mathcal{P}(x_s) / \bar{I}_r$, then
   \[ Y_k = x_k + v_k . \]
   Note that this situation should happen with probability $P_{H_1}$ on average.

   b) Incorrect target detection (hypothesis $H_0$): if $I_r^{(i_{max})} > \mathcal{P}(x_s) / \bar{I}_r$, then
   \[ Y_k = x^{(i_{max})} + v_k , \tag{5.47} \]
   assuming that the intensity values $\{I_r^{(i)}(k)\}_{i=1}^{N_I}$ are recorded at $N_I$ equally spaced locations $\{x^{(i)}\}_{i=1}^{N_I}$ in the state space.

The noise variable $v_k$ is again introduced here to account for slight errors in the position estimates delivered by the SBF principle, i.e. $v_k \sim \mathcal{N}(0, \sigma_{obs}^2)$.
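Steps 2 and 3 of the procedure above can be sketched in a few lines of Python. The code below is a minimal illustration under several assumptions: the intensity values are supplied as a ready-made array (here filled with uncorrelated Gaussian placeholders rather than the correlated values required by step 1), the $N_I$ locations are taken as an equally spaced grid that includes both state-space limits, and the threshold standing in for $\mathcal{P}(x_s) / \bar{I}_r$ is chosen as the empirical quantile of the per-frame maxima that yields the desired detection probability. All names are hypothetical.

```python
import numpy as np

def threshold_for(intensities, p_h1):
    """Threshold such that a fraction p_h1 of the per-frame maxima lies at or below it."""
    return np.quantile(intensities.max(axis=1), p_h1)

def generate_observations(x_true, intensities, threshold, xl, xu, sigma_obs, rng):
    """Steps 2 and 3: x_true has shape (K,), intensities has shape (K, N_I)."""
    n_steps, n_i = intensities.shape
    grid = np.linspace(xl, xu, n_i)                    # locations x^(i), assumed layout
    i_max = np.argmax(intensities, axis=1)             # step 2
    peak = intensities[np.arange(n_steps), i_max]
    detected = threshold >= peak                       # step 3a (H1), otherwise 3b (H0)
    v = sigma_obs * rng.standard_normal(n_steps)       # observation noise v_k
    y = np.where(detected, x_true, grid[i_max]) + v    # Y_k, cf. Eq. (5.47) for H0
    return y, detected

rng = np.random.default_rng(1)
n_steps, n_i = 300, 40                                 # illustrative sizes
x_true = 2.0 + 0.3 * 0.05 * np.arange(n_steps)         # constant-velocity trajectory
# Placeholder intensities with unit mean (no temporal correlation in this sketch).
intensities = 1.0 + 0.2 * rng.standard_normal((n_steps, n_i))

thr = threshold_for(intensities, 0.5)
y, detected = generate_observations(x_true, intensities, thr, 0.0, 10.0, 0.1, rng)
print(detected.mean())
```

Replacing the placeholder array by temporally correlated intensity sequences changes only how long the runs of false detections last, not the logic of steps 2 and 3.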
Note that the above procedure is especially relevant for Monte Carlo averaging, where a large number of observation sequences must be delivered. For this reason, and also in order to reduce the computational burden of experimental simulations, the results presented in Section 5.5.4 do not show the plots obtained from a numerical evaluation of the expectation integral (Eqs. (5.36) and (5.45)). The equivalence of the results delivered by these two approaches has been furthermore demonstrated in Section 5.4.5.

Technically speaking, the detection probability $P_{H_1}$ is here also a result determined by several problem variables involved in the computation of the SBF output value $\mathcal{P}(x_s)$ and the room-averaged reverberant intensity $\bar{I}_r$. As mentioned earlier in Section 5.4.2, the current analysis will be simplified by directly working with $P_{H_1}$ rather than the whole range of parameters that determine its numerical value. Then, the term $\mathcal{P}(x_s) / \bar{I}_r$ for instance results as the threshold required to yield the specific value of $P_{H_1}$, which in turn can be used to determine other factors such as reverberation time, room volume, and so on.

It must also be noted that some minor discrepancies exist between the assumed observation equation, Eq. (5.17), and the practical procedure used to generate the sequences $\{Y_k\}_{k=1}^{K}$. First, the noise variable $v_k$ has been introduced for the case of a false detection as well (hypothesis $H_0$), yielding a modified observation equation of the form:
\[
Y_k = v_k + \begin{cases} [1 \;\; 0] \cdot \mathbf{X}_k & \text{with probability } P_{H_1} , \\ \varepsilon_k & \text{with probability } (1 - P_{H_1}) . \end{cases}
\]
This is necessary to avoid generating “false” observations having a purely stationary position over several time frames. This however also affects the part of the observation PDF corresponding to $H_0$ in Eq. (5.18), since in a false detection situation, the observation variable $Y_k$ now corresponds to the addition of a uniformly distributed random variable $\varepsilon_k$ and a normally distributed variable $v_k$. The overall PDF hence corresponds to the convolution of a Gaussian PDF $\mathcal{N}(0, \sigma_{obs}^2)$ with the uniform PDF $\mathcal{U}(x_l, x_u)$. Provided the variance of the observation noise process is small with respect to the dimensions of the state space, i.e. $\sigma_{obs} \ll |x_u - x_l|$ (which is verified in the practical simulations considered in this work), these convolution effects remain very limited and the observation PDF of Eq. (5.18) can still be seen as a good approximation in practice.

Also, the current observation model defines the variable $\varepsilon_k$ as uniformly distributed in the state space, whereas Eq. (5.47) typically implies that $Y_k$ may only take a finite number of discrete values. Here again, in view of the typically large dimension of the state space used in the practical simulations compared to a correlation distance of about $r_c = 0.1$ m, this effect can be regarded as small enough to be neglected in the current developments.
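The size of the convolution effect discussed above can be made explicit. As a short worked sketch, with $\Phi$ denoting the standard normal cumulative distribution function (a symbol introduced here for convenience and not used elsewhere in the text), the density of $\varepsilon_k + v_k$ under hypothesis $H_0$ reads:

```latex
\begin{align*}
p(Y_k \mid H_0)
  &= \int_{x_l}^{x_u} \frac{1}{x_u - x_l} \, \mathcal{N}(Y_k; \varepsilon, \sigma_{obs}^2) \, d\varepsilon \\
  &= \frac{1}{x_u - x_l} \left[ \Phi\!\left( \frac{Y_k - x_l}{\sigma_{obs}} \right)
     - \Phi\!\left( \frac{Y_k - x_u}{\sigma_{obs}} \right) \right] .
\end{align*}
```

For any $Y_k$ with $x_l + 3\sigma_{obs} \leq Y_k \leq x_u - 3\sigma_{obs}$, the bracketed factor is at least $\Phi(3) - \Phi(-3) \approx 0.997$, so the density differs from $(x_u - x_l)^{-1}$ by less than 0.3%. With $\sigma_{obs} = 0.1$ m and $x_u - x_l = 10$ m, the noticeable deviations are therefore confined to two strips of about 0.3 m at the state-space limits.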
5.5.3
Generating Correlated $I_r^{(i)}(k)$ Values
Step 1 in the previous observation-generating procedure calls for a principled way of simulating the random variable $I_r^{(i)}(k)$ having a specific distribution and temporal correlation coefficient. This subsection presents the derivations leading to such a method.

The PDF of the (normalised) sound intensity values $I_r^{(i)}$ measured in a reverberant field has already been shown to correspond to the density $\gamma(\,\cdot\,; N_t, N_t^{-1})$:
\[
p_{I_r}(\xi) = \frac{N_t^{N_t} \, \xi^{N_t - 1} \exp(-N_t \xi)}{(N_t - 1)!} .
\]
As mentioned in [121], this distribution is well known to be very similar to a normal PDF with unit mean and variance $N_t^{-1}$, for $N_t \gg 1$. Since this latter condition is usually verified in the practical simulations (due to the extent of the considered noise bandwidth), this Gaussian approximation will be used from here on, i.e. $p_{I_r}(\xi) \approx \mathcal{N}(\xi; 1, N_t^{-1})$. As shown below, this assumption will greatly simplify the process of generating $I_r^{(i)}(k)$ (see footnote f).

Footnote f: Another way of justifying this result is to note that the distribution $\gamma(\,\cdot\,; N_t, N_t^{-1})$ results from the addition of a large number of mean-squared sound pressure values (random variables), as derived in [121]. The above assumption then simply results from the central limit theorem.

With $X_k$, $k = 1, 2, \ldots$, denoting the individual realisations of a generic random process $X$ with mean $\mu_X$ and variance $\sigma_X^2$, the correlation coefficient corresponding to the definition of Eq. (5.23) can be developed as follows:
\begin{align}
\rho_X(\tau) &= \frac{E\{ (X_k - \mu_X)(X_{k-\tau} - \mu_X) \}}{E\{ (X_k - \mu_X)^2 \}} \notag \\
&= \frac{1}{\sigma_X^2} \left( E\{ X_k X_{k-\tau} \} - \mu_X^2 \right) \notag \\
&= \frac{1}{\sigma_X^2} \left( R_X(\tau) - \mu_X^2 \right) . \tag{5.48}
\end{align}
Here, $R_X(\tau) = E\{ X_k X_{k-\tau} \}$ corresponds to the usual notation used for the autocorrelation function (ACF) of the variable $X$. Eq. (5.48) defines the relationship between the ACF and the previously defined correlation coefficient for the random variable $X$. Furthermore, the power spectrum $P_X(\omega)$ of the random process $X$ is well known to be the Fourier transform of its ACF, i.e. $P_X(\omega) = \mathcal{F}\{ R_X(\tau) \}$.

Now assume that $X$ is a zero-mean Gaussian process used as input to a linear time-invariant system with known impulse response $h(k)$. The output of the system $Y_k = X_k \circledast h(k)$ (with $\circledast$ denoting convolution) is known to be another zero-mean Gaussian variable with variance $\sigma_Y^2$ defined as:
\[
\sigma_Y^2 = \sigma_X^2 \sum_i h^2(i) .
\]
Also, the following relationship exists between the power spectra of the input and output signals:
\[
P_Y(\omega) = P_X(\omega) \cdot |H(\omega)|^2 , \tag{5.49}
\]
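with $H(\omega) = \mathcal{F}\{ h(k) \}$. Together with the fact that $P_X(\omega) = \sigma_X^2$, Eqs. (5.48) and (5.49) can now be used to generate a Gaussian variable $Y_k$ with desired correlation coefficient $\rho_Y(\tau)$ and variance $\sigma_Y^2$. Trivial derivations lead to:
\[
h(k) = \frac{\sigma_Y}{\sigma_X} \, \mathcal{F}^{-1}\!\left\{ \sqrt{ \mathcal{F}\{ \rho_Y(\tau) \} } \right\} , \tag{5.50a}
\]
and
\[
\sigma_X^2 = \sigma_Y^2 \left( \sum_i h^2(i) \right)^{-1} . \tag{5.50b}
\]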
Finally substituting $\rho_Y(\tau) \triangleq \bar{\rho}_I(\tau)$ and $\sigma_Y^2 \triangleq N_t^{-1}$ yields the desired sound intensity variable as: $I_r^{(i)}(k) = Y_k + 1$.

To sum up the procedure derived in this section, generating a sequence of temporally correlated intensity values $I_r^{(i)}(k)$, $k = 1, 2, \ldots$, simply comes down to filtering a zero-mean white noise sequence $X_k$ as follows:
\[
I_r^{(i)}(k) = X_k \circledast h(k) + 1 , \tag{5.51}
\]
where the filter coefficients $h(k)$ are defined according to Eq. (5.50a) and the variance of the input noise process $\sigma_X^2$ is given by Eq. (5.50b).
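The filtering procedure of Eqs. (5.50) and (5.51) can be sketched with the discrete Fourier transform. The following Python code is one possible implementation and not necessarily the one used for the simulations: it samples the desired correlation coefficient on a symmetric grid of lags, clips any negative values of the resulting spectrum to zero before taking the square root (an approximation that is needed whenever the truncated correlation sequence is not exactly positive semi-definite), and normalises the filter so that $\sum_i h^2(i) = 1$, which by Eq. (5.50b) makes $\sigma_X = \sigma_Y$. The Gaussian-shaped correlation and the value $N_t = 20$ are placeholders standing in for samples of $\bar{\rho}_I(\tau)$ and for the actual problem parameter.

```python
import numpy as np

def design_filter(rho):
    """FIR coefficients h(k) from Eq. (5.50a).

    rho: desired correlation coefficient at lags -M..M (symmetric, odd length,
    centre value 1). Returns a zero-phase filter normalised to sum(h**2) = 1.
    """
    spectrum = np.fft.fft(np.fft.ifftshift(rho)).real    # F{rho}, real for symmetric rho
    spectrum = np.clip(spectrum, 0.0, None)              # remove negative leakage (approximation)
    h = np.fft.fftshift(np.fft.ifft(np.sqrt(spectrum)).real)
    return h / np.sqrt(np.sum(h**2))

def correlated_intensity(h, n_t, n_samples, rng):
    """Eq. (5.51): filtered white Gaussian noise plus unit mean, with variance 1 / n_t."""
    sigma_y = np.sqrt(1.0 / n_t)
    # With sum(h**2) = 1, Eq. (5.50b) gives sigma_X = sigma_Y.
    x = sigma_y * rng.standard_normal(n_samples + len(h) - 1)
    return np.convolve(x, h, mode="valid") + 1.0

lags = np.arange(-64, 65)                                # lag index, M = 64
rho = np.exp(-0.5 * (lags / 5.0) ** 2)                   # placeholder for rho_I_bar(tau)
h = design_filter(rho)

rng = np.random.default_rng(2)
intensity = correlated_intensity(h, n_t=20, n_samples=200_000, rng=rng)
print(intensity.mean(), intensity.var())
```

The autocorrelation of the designed filter, `np.correlate(h, h, mode="full")`, gives the correlation coefficient actually achieved and can be compared with the desired one, in the spirit of the top left-hand plot of Figure 5.7.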
5.5.4
Simulation Results
Figure 5.7 first presents the detail of some intermediate results used for the generation of a correlated random variable. The top left-hand plot shows an example of the desired correlation coefficient $\bar{\rho}_I(\tau)$ together with the correlation resulting from an 800-length FIR filter computed according to the developments of Section 5.5.3. This practical approximation is of course a close enough match for the purpose of experimental simulations. A few coefficients $h(k)$ of the filter are also depicted in the left-lower graph. The right-hand side of Figure 5.7 shows a typical example of input signal (zero-mean Gaussian noise, top plot) and filtered output signal used directly as a temporal realisation of the sound intensity variable $I_r^{(i)}(k)$, $k = 1, 2, \ldots$.

[Figure 5.7: four plots. Top left: desired and practical correlation coefficient against $\tau$ (s). Bottom left: filter coefficients against coefficient index. Top right: input white noise against time (s). Bottom right: filtered output against time (s).]
Figure 5.7: Generating correlated SBF observations: detail of some internal procedure results. Left-hand side: (top plot) desired temporal correlation coefficient $\bar{\rho}_I(\tau)$ (solid line) and practical coefficient (dashed line) resulting from the filter shown in the bottom plot (filter coefficients $h(k)$). Right-hand side: white noise used as input signal to the filter (top graph) and resulting correlated output used as $I_r^{(i)}(k)$ (bottom plot).

Figure 5.8 illustrates the main result of the observation process described in Sections 5.5.2 and 5.5.3. It graphically shows the measurement values $Y_k$ as the source moves along the state space. The ‘◦’ markers correspond to the values of $Y_k$ for a situation where the target is correctly detected (hypothesis $H_1$), whereas ‘×’ markers denote observations resulting from a misdetection. The numerical settings leading to this simulation result were similar to previous examples, with, among others, the state-space limits set to $x_l = 0$ and $x_u = 10$ m, and the observation noise level defined as $\sigma_{obs} = 0.1$ m. All plots in Figure 5.8 were obtained with a fixed detection probability $P_{H_1} = 0.5$ (see footnote g). This figure clearly demonstrates the increasing level of correlation in the measurements as the velocity $\dot{x}_s$ of the source decreases.

Footnote g: Note that the detection probability $P_{H_1}$ is in no way related to the correlation level, which is the sole result of the specific source velocity value.

[Figure 5.8: three plots of observation position (0 to 10 m) against time index (0 to 300).]
Figure 5.8: Realisations of the correlated observation process. Solid lines represent the target trajectory along the state space, circles (◦) denote observations for a correctly detected target, and ‘×’ markers are erroneous observations. The source velocity $\dot{x}_s$ is, from top to bottom, 0.05, 0.1 and 0.5 m/s. In all three cases, the detection probability is $P_{H_1} = 0.5$.

Finally, the square-root steady-state PCRB value $\sqrt{\mathrm{PCRB}_{x_k}}$ for the position estimates was computed by means of Monte Carlo averaging over 400 realisations, and according to the steady-state derivation procedure explained in Section 5.4.5. The target velocity parameter $\dot{x}_s$ (and consequently the correlation level) is of course the main interest of this section, and in order to accurately assess its influence on the PCRB, a constant-velocity source motion should be used. Since the process noise parameter $\eta$ might slowly change the speed of the target over time (according to the system transition defined by Eqs. (5.11) and (5.12)), this would in turn imply setting $\eta = 0$. However, a non-zero system noise is necessary for the PCRB to reach a steady-state value different from zero. A compromise was reached by setting $\eta$ to a small constant, $\eta = 10^{-5}$, and the target velocity defined in the initial state $\mathbf{X}_0 = [x_0 \;\; \dot{x}_s]^T$ is hence only marginally altered over time. Figure 5.9 shows the plots of $\sqrt{\mathrm{PCRB}_{x_k}}$, as a function of $\dot{x}_s$, which result from the analysis of Sections 5.5.2 and 5.5.3. These simulation results were obtained with model parameters defined as previously described in Section 5.4.5, except for the upper state-space limit set to $x_u = 14$ m, and the update interval defined as $T_u = 0.03$ s.

[Figure 5.9: plot of $\sqrt{\mathrm{PCRB}_{x_k}}$ (m) against $\dot{x}_s$ (m/s), logarithmic abscissa from $10^{-1}$ to $10^{0}$. Legend: $P_{H_1}$ = 0.35, 0.55, 0.75, 0.95.]
Figure 5.9: Square-root steady-state PCRB, computed for correlated observations, as a function of the target velocity $\dot{x}_s$ (detection probability $P_{H_1}$ as free parameter).
5.5.5
Result Analysis
The results from Figure 5.9 clearly illustrate the major conclusion that can be drawn for a correlated observation model: the specific target velocity has no influence on the theoretical PCRB for $x_k$. In other words, a performance limit analysis based on the PCRB parameter states that the correlation between measurements should not theoretically affect the performance of a tracking algorithm. This somewhat surprising fact is contrary to the expectations previously mentioned in the introduction section of this chapter.

This result can however be explained quite easily as follows. The definition of the lower error bound used throughout this chapter is based on the idea of an averaged error value, see e.g. Eq. (5.2). Thus, the PCRB in effect only considers the overall average distribution of the observations $Y_k$, meaning that the entire observation set over the Monte Carlo runs is viewed as a whole, rather than considering each single result separately. From this point of view, the only element of importance for the PCRB is the overall distribution of erroneous observations (i.e. corresponding to hypothesis $H_0$; see footnote h), which is purely uniform regardless of the correlation level between measurements. In conclusion, the result of Figure 5.9 hence simply follows from the fact that the PCRB analysis is based on an “average performance” criterion.

Nevertheless, it should be emphasised that on the basis of a single realisation of the target trajectory, a tracking algorithm is obviously very likely to be misled by a long series of coherent false detections, as illustrated in the top plot of Figure 5.8. Thus, for practically relevant situations, the tracking accuracy can be expected to depart from the results depicted in Figure 5.9. From this point of view, there exists a relevant difference between the PCRB (average error criterion) and the type of performance assessment which is of real interest in practice.

Footnote h: The specific distribution of $Y_k$ influences the PCRB recursion through the second term of the matrix $D_{k-1}^{(4)}$ in Eq. (5.10).
5.6 Comparison with PF Performance
The overall performance of a couple of generic particle filters is now assessed using the tracking scenario considered in this chapter. This is done in the same way as previously implemented for the PCRB computations, i.e. the tracking accuracy of the algorithms results as the average error over a series of 400 Monte Carlo simulations using either of the two observation models under scrutiny.
5.6.1 PF Algorithms
The first particle filter to be investigated is the basic bootstrap method. Its working principles have been described in detail in Chapter 3, which also gives the numerical values of some of the standard parameters involved in this type of algorithm. Since the observations for the current simulations are not the results of practical measurements but rather originate from a mathematically defined process, the likelihood function $p(Y_k | \mathbf{X}_k)$ in the PF algorithm must be “artificially” built. In the following, it will be assumed to be Gaussian with standard deviation $\sigma_{\mathrm{lik}}$, to which a small background probability $\psi$ is added:^i

$$p_{Y_k | \mathbf{X}_k}(\xi) = \mathcal{N}\big(\xi;\, Y_k,\, \sigma_{\mathrm{lik}}^2\big) + \psi .$$

The value of $\sigma_{\mathrm{lik}}$ was set more or less arbitrarily to 0.2 m, and $\psi = 0.02$ was defined as a small percentage of the peak value of $\mathcal{N}(Y_k, \sigma_{\mathrm{lik}}^2)$.

^i This corresponds to the Gaussian likelihood approach described earlier in Section 3.5.1.
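As an illustration only, the Gaussian-plus-background likelihood can be sketched in Python as follows. The values of σlik and ψ are those quoted in the text. The helper names (`gaussian_pdf`, `likelihood`, `normalised_weights`), the particle positions and the observation value are assumptions made for this sketch, and the weight normalisation assumes a bootstrap filter that resamples at every step, so that the weights are simply proportional to the likelihood.

```python
import numpy as np

SIGMA_LIK = 0.2  # standard deviation of the Gaussian part (m), value from the text
PSI = 0.02       # background probability, value from the text


def gaussian_pdf(xi, mean, sigma):
    """Density of N(mean, sigma^2) evaluated at xi."""
    return np.exp(-0.5 * ((xi - mean) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))


def likelihood(xi, y_k, sigma=SIGMA_LIK, psi=PSI):
    """Gaussian likelihood centred on the observation y_k, plus a flat floor psi."""
    return gaussian_pdf(xi, y_k, sigma) + psi


def normalised_weights(particles, y_k):
    """Hypothetical helper: weights proportional to the likelihood, summing to one."""
    w = likelihood(np.asarray(particles, dtype=float), y_k)
    return w / w.sum()


particles = np.array([1.0, 1.9, 2.0, 2.1, 9.0])  # illustrative positions (m)
weights = normalised_weights(particles, y_k=2.0)
peak = gaussian_pdf(2.0, 2.0, SIGMA_LIK)          # peak value of the Gaussian part
print(weights, PSI / peak)
```

The background term keeps the weight of a particle far from the observation strictly positive, so that a single erroneous observation cannot wipe out the particle set. With these values, ψ amounts to roughly one percent of the Gaussian peak.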
A PF method based on the sequential importance sampling principle is more difficult to implement with the current tracking scenario. This algorithm indeed calls for an importance function q(X k |Y1:k ) built upon practical measurements that typically provide an additional source of information regarding the current state.
An exact model equation for these “new” measurements would be relatively hard to derive, since they are not necessarily related to the variable Yk resulting from the observation models derived so far. However, in order to obtain a general feel for how an SIS method would behave with the current tracking problem, an “SIS-like” algorithm has been implemented. It is based on the simplifying approximation that the same observations are used to build both the likelihood function p(Yk |X k ) and
the importance function $q(\cdot)$, with the latter furthermore assumed to be Gaussian:

$$q(\xi) = \mathcal{N}\big(\xi;\, Y_k,\, \sigma_{\mathrm{imp}}^2\big) ,$$

with $\sigma_{\mathrm{imp}}$ set here to 0.4 m. The main steps of the SIS principle, together with the values of standard parameters, can be found in Chapter 4. The importance sampling probability $P_i$ was set to 0.2. Because the peak of the likelihood function coincides with the peak of the importance function, the importance particles are automatically given unusually large weights. This in turn implies that the reinitialisation effect in the importance sampling part of the algorithm is also unduly emphasised, and for this specific reason, the reinitialisation probability $P_r$ was set to zero in the following simulations. Once again, it should be emphasised that this implementation does not correspond to a viable SIS algorithm, and it is hence not expected to deliver meaningful tracking results. However, this method is useful to understand how the reinitialisation property generally affects the tracking accuracy of a particle filter.
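The remark about unusually large weights can be checked numerically with a small sketch. It compares the average likelihood of particles drawn from the Gaussian importance function with that of particles spread uniformly over the state space. Only the likelihood factor is examined here; the full SIS weight expression of Chapter 4 is not reproduced. The lower state-space limit of 0 m, the observation value and the sample size are assumptions made for this example.

```python
import numpy as np

SIGMA_LIK, PSI = 0.2, 0.02  # likelihood parameters quoted in the text
SIGMA_IMP = 0.4             # importance function standard deviation (m), from the text
X_LOW, X_UP = 0.0, 14.0     # state-space limits (m); X_LOW = 0 is an assumption


def likelihood(xi, y_k):
    """Gaussian likelihood centred on y_k plus the background probability."""
    gauss = np.exp(-0.5 * ((xi - y_k) / SIGMA_LIK) ** 2) / (SIGMA_LIK * np.sqrt(2.0 * np.pi))
    return gauss + PSI


rng = np.random.default_rng(1)
y_k = 5.0                   # illustrative observation (m)
n = 10_000

# Particles drawn from the importance function q = N(y_k, SIGMA_IMP^2).
importance_particles = rng.normal(y_k, SIGMA_IMP, size=n)
# Particles that ignore the current observation (uniform over the state space).
spread_particles = rng.uniform(X_LOW, X_UP, size=n)

mean_lik_importance = likelihood(importance_particles, y_k).mean()
mean_lik_spread = likelihood(spread_particles, y_k).mean()
print(mean_lik_importance, mean_lik_spread)
```

Because both functions peak at the same observation, the particles drawn from the importance function receive on average a much larger likelihood value than the spread-out particles, which is the effect described above.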
Finally, the following parameters were set identically for both PF methods. The number of particles was chosen to be N = 100, and the steady-state velocity parameter was set to v¯ = 0.5m/s. As done previously when assessing the tracking performance of PF algorithms, the particle set is consistently initialised at the location corresponding to the initial target state x0 in order to avoid undesired track acquisition effects.
[Figure 5.10 plots. Top row: RMSE (m) versus $\dot{x}_s$ (m/s); bottom row: FCR (%) versus $\dot{x}_s$ (m/s). Left column: non-correlated observation model; right column: correlated observation model. One curve each for $P_{H_1}$ = 0.35, 0.55, 0.75 and 0.95.]
Figure 5.10: RMSE and FCR results for the simulated bootstrap algorithm. The PCRB results from Figure 5.9 are reproduced in the RMSE plots (dashed lines in top graphs). Left-hand plots: non-correlated observations. Right-hand plots: correlated observations.^j
5.6.2 Simulation Results
The simulations in this section were carried out using the same parameters as defined in Section 5.5.4, with $\eta = 10^{-5}$, $x_u = 14$ m and $T_u = 0.03$ s. The tracking performance of the PF algorithms was determined for both the simplified and the correlated observation models. This makes it possible to clearly distinguish the influence of the temporal correlation between measurements from the influence of other parameters. Figures 5.10 and 5.11 present the tracking results for the bootstrap and SIS methods respectively. Both figures are organised in the same way. The plots on the left-hand side contain the results obtained using simulations of the simplified model, and correlated model graphs are those on the right-hand side of the figure. Also, the top plots show the tracking results in terms of the RMSE parameter, whereas the bottom plots correspond to FCR results (root mean squared error and frame convergence ratio respectively, see Section 3.7). For each case, the results have been computed with the detection probability $P_{H_1}$ as parameter. The PCRB results of Figure 5.9 are also reproduced in the RMSE plots (dashed lines) to allow for a direct comparison with the PF performance results.

^j RMSE results for $P_{H_1} = 0.35$ were too noisy to be displayed for correlated observations.
[Figure 5.11 plots. Top row: RMSE (m) versus $\dot{x}_s$ (m/s); bottom row: FCR (%) versus $\dot{x}_s$ (m/s). Left column: non-correlated observation model; right column: correlated observation model. One curve each for $P_{H_1}$ = 0.35, 0.55, 0.75 and 0.95.]
Figure 5.11: RMSE and FCR performance results for the SIS-like algorithm. The PCRB results from Figure 5.9 are reproduced in the RMSE plots (dashed lines in top graphs). Left-hand plots: non-correlated observations. Right-hand plots: correlated observations.
5.6.3 Result Analysis
The overall performance results for the bootstrap filter using non-correlated observations (left-hand plots in Figure 5.10) conform with the logical expectations that can be deduced from the way this algorithm operates. This method is based on a fixed-parameter dynamics model, involving notably the steady-state velocity constant v¯, which determines the “mobility” of the particles in the state space. As the velocity of the target increases, it becomes more and more difficult for the particle set to “keep up” with it and the overall performance of the filter decreases, as reflected by the RMSE and FCR parameters. It is interesting to note that this increase of tracking error happens smoothly with varying velocity, and no sharp threshold effect can be detected for a specific value of x˙ s , above which the performance would suddenly break down (at least within the considered range of x˙ s values). Of course, the rate according to which the decrease of performance occurs as x˙ s increases depends on the specific value of v¯. As for the SIS-like method with non-correlated observations (left-hand plots of Figure 5.11), it can be seen that the performance results are slightly inferior compared to the RMSE and FCR values obtained with the bootstrap method. This is in
keeping with the general statement of Section 4.7 that once the target is localised, the tracking performance achieved with a bootstrap method is slightly superior to that of an SIS algorithm. However, this also means that the SIS principle is affected by the value of x˙ s in the same way as the bootstrap principle is. A potential benefit of importance sampling could have been that the reinitialisation property would help in keeping up with faster targets, and hence allow a better tracking accuracy compared to the bootstrap method for large values of x˙ s . On the basis of Figures 5.10 and 5.11, it can be seen however that this is not the case. As mentioned earlier in this work, it takes on average a few frames of consistent observations for newly generated importance particles to “attract” the rest of the particle set to a new location. This principle allows the tracker to successfully recover from a complete track loss after only a short delay. The case of a fast moving target typically constitutes a different scenario, where the source is constantly moving away from existing particles. New importance particles are generated around the true source location at every time step, but the target then disappears too quickly for these samples to play an important role in the tracking results. Consequently, the reinitialisation property does not help in this case, as shown in the SIS results of Figure 5.11.
The most interesting outcome of these PF simulations is obtained with the correlated observation model (right-hand plots). For both algorithms, the overall tracking performance clearly decreases as the target velocity reaches low values, which confirms the expectation that the temporal correlation between sound intensity measurements also influences the tracking accuracy. As mentioned in Section 5.5.5, long series of coherent false detections tend to mislead the tracking algorithms. In the bootstrap case, this leads to an increasing chance of the filter getting stuck on a false observation peak, which might ultimately result in a total track loss. An SIS method will in general recover from such situations but is also more likely to reinitialise off-track due to an increased coherence in the erroneous measurements. A higher level of temporal correlation in the observations hence results in an increased tracking error for both PF implementations. Also as expected, the effects of the correlation slowly disappear as x˙ s increases, implying that the average performance for both PF methods tends to match the results obtained for the non-correlated scenario for large x˙ s .
It should also be mentioned that the relatively large RMSE values delivered by the SIS method for correlated observations directly result from the unusually high reinitialisation rate of this SIS implementation. Since this method is allowed to relocate the particles anywhere in the state space, the resulting tracking error is furthermore scaled up according to the dimension of the considered state space.
With respect to a comparison with the previously computed PCRB, the parameter $\sqrt{\mathrm{PCRB}_{x_k}}$ theoretically sets a lower limit on the RMSE values resulting from the tracking algorithms. Comparing the PCRB and RMSE results in the top plots of Figures 5.10 and 5.11 (for corresponding values of $P_{H_1}$) indeed shows that the inequality $\mathrm{RMSE} > \sqrt{\mathrm{PCRB}_{x_k}}$ is verified. This comparison also illustrates the fact that for most values of $P_{H_1}$, there exists a significant gap between the tracking performance of the PF algorithms and the theoretically achievable minimum estimation error. This highlights the fact that these tracking algorithms are somewhat far from being optimal, that is, when the term “optimal” is defined on the basis of the PCRB criterion. This is of course not surprising in the case of the pseudo-SIS implementation, which is not expected to be a viable representative of an SIS algorithm, as mentioned earlier. The reasons for the bootstrap filter (even for its best performance) not nearly reaching the PCRB are most likely to be linked to the specific implementation of this algorithm (e.g. due to the variance of the particle set) and the variance of the observation noise variable.
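The comparison described above amounts to computing an RMSE over the Monte Carlo runs and checking it against the square-root bound. The following Python sketch shows the mechanics under loud assumptions: it uses the plain textbook RMSE (the exact definition of Section 3.7 may differ), the "tracker output" is synthetic noise added to a constant-velocity trajectory rather than thesis data, and the bound value `sqrt_pcrb` is a hypothetical placeholder that was not read from Figure 5.9.

```python
import numpy as np


def rmse(estimates, truth):
    """Root mean squared error over all runs and frames (plain textbook definition)."""
    err = np.asarray(estimates, dtype=float) - np.asarray(truth, dtype=float)
    return float(np.sqrt(np.mean(err ** 2)))


rng = np.random.default_rng(0)
n_runs, n_frames = 400, 200   # 400 Monte Carlo runs as in the text; n_frames is assumed
T_u, x0, v = 0.03, 1.0, 0.5   # update interval from the text; x0 and v are illustrative
truth = x0 + v * T_u * np.arange(n_frames)

# Synthetic stand-in for tracker output: truth plus 0.1 m Gaussian error (not thesis data).
estimates = truth + rng.normal(0.0, 0.1, size=(n_runs, n_frames))

sqrt_pcrb = 0.03              # hypothetical bound value (m), not read from Figure 5.9
value = rmse(estimates, truth)
print(value, value >= sqrt_pcrb)
```

In an actual experiment, `estimates` would hold the location estimates delivered by the particle filter for each run and frame, and `sqrt_pcrb` the bound computed for the same value of the detection probability.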
5.7 Concluding Remarks
The analysis presented in this chapter originates from a need to theoretically gauge the tracking performance of particle filtering methods with respect to a minimum error bound on the target position estimates. It has also confirmed the expectation that the temporal correlation existing in the sound intensity measurements is detrimental to the tracking accuracy of such algorithms. This effect was however shown not to influence the theoretically achievable minimum error bound, which implies instead that this drawback is only a consequence of the specific tracking algorithm implementation. This in turn suggests that some estimators might exist that do not suffer from this loss of tracking performance for highly correlated observations. However, the fact that the theoretical bound remains unaffected with respect
to the temporal correlation also raises the following concern. Perhaps the type of bound considered in this chapter is not the most appropriate parameter to choose for the purpose of this research. Despite being regularly used in the literature, the PCRB remains a measure based on an averaged error criterion that clearly disregards the effects of the temporal correlation in the observations. Different types of lower bounds have been theoretically derived (see e.g. [64]) and would perhaps lead to different results when computed in the frame of the acoustic source tracking problem. Another question is with regard to the theoretical observation models derived in this chapter. As mentioned in Section 5.1, the problem definition of ASL constitutes a real challenge for a PCRB analysis compared to the models considered usually in the literature. While some effort was put into taking account of as many practical characteristics of the SBF process as possible, some simplifications also had to be assumed in order to bring the complexity of the analysis down to acceptable levels. It is clear that in a practical ASL situation, the overall tracking difficulty will increase at specific state-space locations, due for instance to the presence of highly reflective surfaces or walls in the vicinity of these areas. The inconsistency of speech signals both in time and in frequency is another factor that would certainly have a big influence on the results presented in this chapter. Because these effects are quite hard to quantify and subsequently model, they had to be left out of the current analysis. Thus, it would be of interest to determine how closely the mathematical models derived here match with a typical series of observations obtained in practice. Such a comparison would also provide an answer to the question, still largely unresolved, of how much performance improvement is ultimately achievable for acoustic source localisation methods.
Chapter 6 Conclusion

This chapter reviews the major contributions made by this research to the field of acoustic source tracking in reverberant environments. Various directions for a possible continuation of this research are finally briefly enumerated.
6.1 Overview
Carrying out acoustic source localisation in moderately reverberant environments is not a trivial task. Even a low level of reverberation or background noise is detrimental to the methods traditionally implemented to deal with such a problem. The results obtained in this research demonstrate the many advantages of an approach based on sequential Monte Carlo simulation. The present work is effectively an in-depth study of particle filtering methods developed in the framework of sound source localisation and tracking. The complex effects induced by the reverberation process have been shown to be a significant challenge in the development and analysis of the concepts presented in this work. In Chapters 2 and 3, a generic particle filtering framework was derived for the specific purpose of tracking an acoustic source in reverberant settings. Sequential Monte Carlo methods call for the elaboration of three main concepts: i) a dynamics model describing the target motions, ii) a localisation function derived from practical measurements, and iii) a likelihood function reflecting the observation probability given the current target state.
These different issues have all been addressed in this work. Particular attention was given to the derivation of a likelihood function suitable for sequential estimation, based solely on the audio data received at an array of acoustic sensors. As a result, four particle filters were developed within this framework. Using several performance assessment parameters, these algorithms have been demonstrated to drastically outperform traditional acoustic source localisation methods, for both simulated and real-world reverberant audio data. Particle filtering algorithms based on the steered beamforming principle have also been shown to achieve a superior tracking performance compared to time delay estimation methods. Because particle filters involve a dynamics model describing likely target motions, and because the location estimates they deliver rely on a series of past observation data, these algorithms are able to efficiently deal with increased levels of reverberation and background noise. As a numerical solution of the Bayesian filtering problem, sequential Monte Carlo methods also possess the crucial ability to take advantage of multi-modal observations. Typical examples of localisation functions derived from audio data measurements have shown that this characteristic is of particular interest in the context of acoustic source tracking.
In Chapter 4, the concept of sequential importance sampling was introduced in order to further improve the overall performance of the particle filters. On top of the three algorithmic choices mentioned above, this new approach also requires the definition of an importance function from which particles can be drawn during each iteration of the algorithm. The derivation of a suitable importance function was given particular attention, and the generic importance sampling algorithm also had to be updated on the basis of the specific characteristics of the practical problem of interest. Three different methods based on importance sampling were finally proposed. A series of extensive simulations was subsequently carried out to assess the performance of these new algorithms, demonstrating the strengths of the importance sampling approach. This principle enables the algorithm to draw on measurement information obtained at the current time during the propagation step. Some of the state samples are hence efficiently redirected to regions of the state space with potentially high posterior likelihood. As a result, the important property of reinitialisation is directly integrated at a low level in the resulting tracking algorithms.
Despite yielding a slightly lower tracking performance, the new methods are able to automatically recover from complete track losses, lock on to a new target entering the acoustic scene, and switch between talkers in a “ping-pong”-style conversation. Particle filters using the importance sampling principle are hence better suited for practical applications than bootstrap filters.
Chapter 5 dealt with a theoretical tracking performance assessment of the developed particle filtering methods. The posterior Cramér-Rao bound theory defines a limit on the maximum performance theoretically achievable with an optimal estimator. Commonly used as an assessment parameter in various target tracking applications, this bound was developed within the specific framework of sound source localisation. For this purpose, two mathematical models were derived from statistical room acoustics principles, in order to describe how steered beamforming observations are generated from audio measurements recorded in real reverberant environments. In particular, an attempt was made to model the specific effects of the temporal correlation existing between sound intensity values measured in a diffuse field. Simulation results showed however that the posterior Cramér-Rao bound is not affected by these correlation effects, despite the performance of the tested particle filters being noticeably influenced by them. These results pointed out that in practice, this type of lower error bound might not be completely appropriate to capture the type of performance one is effectively interested in.
As a relatively small but nonetheless important part of this work, a real-time particle filtering algorithm for acoustic source tracking has been implemented using off-the-shelf signal acquisition and processing devices. Set up in the moderately reverberant environment of a typical office room, this implementation demonstrates the ability of particle filtering methods to accurately track an acoustic source in real-life situations, using only the data collected at an array of microphones. It is also a practical demonstration that this type of algorithm can be comfortably implemented within the current limitations of modern desktop computers. Based on these practical results, it is strongly believed that particle filtering methods have an outstanding potential to outperform most of the acoustic source tracking algorithms developed so far in the literature.
6.2 Directions for Future Research
As with any project of this magnitude, a number of interesting topics naturally emerge, either as potential continuations of this work or as new directions for future research. Some of these topics are briefly described below.
6.2.1 Enhancement of Basic Principles
As flagged throughout this work, the present research has focused on the development and analysis of new algorithms for sound source localisation using mainly basic signal processing principles. The emphasis was not put on the derivation of a single method that would yield the best absolute tracking performance when compared to existing algorithms. In recent years, many publications have presented enhanced versions of the various principles used throughout this study. These include improved particle filter versions [6, 33, 70, 83, 92, 113], more accurate beamforming methods [23, 31, 49] and time delay estimation methods [15, 16], and various other ways of computing source location estimates from the particle approximation of the posterior density [6, 43, 65]. There is no doubt that combining these more efficient methods into a global sequential estimation framework would yield a superior tracking performance. One question of specific interest is hence to determine how much improvement can be achieved compared to the performance results presented in this work.
6.2.2 Better Handling of Speech Signals
One major challenge for the practical acoustic source tracking system considered in this research is the time-varying character of speech signals, both in time and frequency. Ultimately, it would be desirable to update the observation model defined in the particle filter developments to include a better statistical description of the typical characteristics of speech. The problem of speech pauses could be addressed for instance with a missing observation data approach, or by defining a new variable accounting for the degree of “occlusion” of the target (see for instance [122]). A more global probabilistic approach of potential interest is the concept of switching observation or motion models to reflect different target states or clutter densities [59, 85, 122]. Another, more straightforward practical solution to this problem is the implementation of a
simple voice activity detection algorithm, the output of which could be efficiently fused into the observation process using the particle filtering methodology.
6.2.3 Multiple Target Tracking
Tracking two or more acoustic sources in reverberant environments constitutes a natural continuation of the work presented in this thesis. Multiple target tracking introduces the additional problem of data association, i.e. how to determine which observation originates from each specific target. Despite the ability of particle filters to deal with multi-modal posterior density functions, practical considerations suggest that using a single set of particles to track multiple sources might not be completely appropriate [53]. For instance, different levels of consistency in the measurements generated by two separate acoustic sources would eventually result in all the particles “migrating” towards one single modality. The concept of multiple targets is a relatively recent topic in the particle filtering literature. Several publications, however, present interesting research results on this issue, see e.g. [45, 53, 67, 74, 90]. In the specific context of acoustic source localisation using audio data only, the presence of two or more sources in a reverberant environment might, however, prove to be a substantial challenge. In other applications, such as object tracking using video or radar measurements, each target does not usually disturb the observations received from other tracked objects. In contrast, any additional source in an acoustic scene adds its own diffuse field component to the existing one. Implicitly, this is equivalent to an increased reverberation level in the environment which, inevitably, will have a detrimental influence on the performance achieved in tracking any of the sources. Thus, it is expected that the suggested multi-target methods can only be successfully implemented for acoustic source localisation in settings involving low reverberation levels.
Appendix A Relationship Between SBF and CCF Approaches to ASL

In this appendix, the close similarity between steered beamforming (SBF) and cross-correlation function (CCF) is investigated, with focus on the specific problem of acoustic source localisation (ASL) and tracking. It is shown that the beamforming principle involves a sum of the cross-correlations computed for every possible sensor pair in an array. The implications of such a relationship are briefly discussed with respect to the ASL problem definition. It is found that SBF methods theoretically provide more accurate localisation information (source location estimates) than CCF-based time delay estimation approaches.
A.1 Mathematical Developments
A simple and straightforward analysis of the beamforming and cross-correlation methods is considered under general problem assumptions: an array of $M$ microphones with known positions is operated in a sound field with known sound wave propagation speed $c$. In the sequel, the variable $\boldsymbol{\ell} = [x \ y \ z]^T$ corresponds to an arbitrary position in the current 3D coordinate system, and the known vectors $\boldsymbol{\ell}_m = [x_m \ y_m \ z_m]^T$, $m \in \{1, \ldots, M\}$, will be used to denote the positions of the $M$ array sensors. As usual, an asterisk superscript $(\cdot)^*$ denotes the complex conjugate operation, and $\|\cdot\|$ is used for the vector 2-norm.

Let the continuous-time signal received at the $m$th array sensor be denoted by $s_m(t)$, $m \in \{1, \ldots, M\}$, where $t$ represents the time variable, $-\infty < t < \infty$. The cross-correlation function $R_{mn}(\tau)$ between two signals $s_m(t)$ and $s_n(t)$ for a given time lag $\tau$ is defined as [38]:

$$R_{mn}(\tau) = \int_{-\infty}^{\infty} s_m(t)\, s_n(t + \tau)\, dt . \tag{A.1}$$
The developments given in this section are derived under the assumption of real and finite-energy signals, i.e. $\int_{-\infty}^{\infty} s_m^2(t)\, dt < \infty$, $\forall m \in \{1, \ldots, M\}$. For other types of signal (e.g. random noise or periodic signals), the definition of the average CCF, i.e.:

$$R_{mn}(\tau) = \lim_{T \to \infty} \frac{1}{T} \int_{-T/2}^{T/2} s_m(t)\, s_n(t + \tau)\, dt ,$$

can be used instead [69, 93], with $T$ denoting the observation interval.

The output $P(\boldsymbol{\ell})$ of a generic delay-and-sum beamformer (DSB) focused onto an arbitrary position $\boldsymbol{\ell}$ can be computed as follows [61]:

$$P(\boldsymbol{\ell}) = \int_{-\infty}^{\infty} \left( \sum_{m=1}^{M} s_m(t - \tau_m) \right)^{2} dt . \tag{A.2}$$
Here, the time lags $\tau_m \triangleq \tau_m(\boldsymbol{\ell})$, $m \in \{1, \ldots, M\}$, are the steering delays to implement for each signal $s_m(t)$ in order to focus the beamformer onto the desired location $\boldsymbol{\ell}$. For any arbitrary value $\xi_m \in \mathbb{R}$, $m \in \{1, \ldots, M\}$, the following decomposition can be easily demonstrated:

$$\left( \sum_{m=1}^{M} \xi_m \right)^{2} = \sum_{m=1}^{M} \xi_m^2 + \sum_{\substack{m,n=1 \\ m \neq n}}^{M} \xi_m\, \xi_n . \tag{A.3}$$
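The second term on the right-hand side denotes a sum over all possible combinations of the indices $m$ and $n$ for which $m \neq n$. By making use of Eq. (A.3), Eq. (A.2) can be rewritten as follows:

$$\begin{aligned}
P(\boldsymbol{\ell}) &= \int_{-\infty}^{\infty} \left[ \sum_{m=1}^{M} s_m^2(t - \tau_m) + \sum_{\substack{m,n=1 \\ m \neq n}}^{M} s_m(t - \tau_m)\, s_n(t - \tau_n) \right] dt \\
&= \sum_{m=1}^{M} \int_{-\infty}^{\infty} s_m^2(t - \tau_m)\, dt + \sum_{\substack{m,n=1 \\ m \neq n}}^{M} \int_{-\infty}^{\infty} s_m(t - \tau_m)\, s_n(t - \tau_n)\, dt .
\end{aligned} \tag{A.4}$$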
Without loss of generality, the substitution $t' = t - \tau_m$ (time shift) can be used in the integrals of Eq. (A.4). With $\tau_{mn}$ denoting the relative lag difference $\tau_m - \tau_n$, Eq. (A.4) then becomes:

$$P(\boldsymbol{\ell}) = \sum_{m=1}^{M} \underbrace{\int_{-\infty}^{\infty} s_m^2(t')\, dt'}_{E_m} + \sum_{\substack{m,n=1 \\ m \neq n}}^{M} \underbrace{\int_{-\infty}^{\infty} s_m(t')\, s_n(t' + \tau_{mn})\, dt'}_{R_{mn}(\tau_{mn})} . \tag{A.5}$$
It is easily seen that the second integral in Eq. (A.5) corresponds to the CCF $R_{mn}(\tau_{mn})$ between signals $s_m(t)$ and $s_n(t)$, as given by Eq. (A.1). The first integral of Eq. (A.5) corresponds to the energy $E_m$ of the $m$th sensor signal which is finite by definition. The final form of the DSB output function hence results in:

$$P(\boldsymbol{\ell}) = \sum_{m=1}^{M} E_m + \sum_{\substack{m,n=1 \\ m \neq n}}^{M} R_{mn}(\tau_{mn}) . \tag{A.6}$$
Due to the inversion property of the CCF [38], i.e. $R_{mn}(\tau) = R_{nm}(-\tau)$, and the fact that $\tau_{mn} = -\tau_{nm}$, it can be noted that the second sum in Eq. (A.6) contains every possible CCF combination twice. Eq. (A.6) can hence be rewritten as:

$$P(\boldsymbol{\ell}) = \sum_{m=1}^{M} E_m + 2 \sum_{m=1}^{M-1} \sum_{n=m+1}^{M} R_{mn}(\tau_{mn}) , \tag{A.7}$$
where each CCF $R_{mn}(\cdot)$, $m \neq n$, is included in the summation only once.

For more practically relevant situations, the same relationship can of course also be derived for finite-duration discrete-time signals $s_m(k)$, $k \in \{1, \ldots, K\}$, with the CCF now defined as:

$$R_{mn}(\tau) = \frac{1}{K} \sum_{k=1}^{K} S_m(k)\, S_n^*(k)\, \exp\!\left( j 2\pi \frac{k}{K} \tau \right) ,$$

and the DSB output:

$$P(\boldsymbol{\ell}) = \frac{1}{K} \sum_{k=1}^{K} \left| \sum_{m=1}^{M} S_m(k)\, \exp\!\left( j 2\pi \frac{k}{K} \tau_m \right) \right|^{2} ,$$

where $\tau$ and $\tau_m$ are expressed in samples, and $S_m(k)$, $k \in \{1, \ldots, K\}$, represents the discrete Fourier transform of the corresponding time signal, $S_m(\cdot) = \mathrm{DFT}\{s_m(\cdot)\}$.
The result of Eq. (A.7) shows that the DSB output function can be decomposed into a sum of CCFs computed for all possible sensor pairs in the considered array. Additionally, the DSB function also involves another term corresponding to the total energy of the sensor signals. This second term is however constant and, most important for ASL, independent of the specific DSB focus location.
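The decomposition can be checked numerically for the discrete-time definitions given above. The sketch below is an illustration under stated assumptions: the sensor signals are random noise rather than delayed replicas of a source, the steering delays are arbitrary integers, and the function names (`ccf`, `dsb_output`, `dsb_from_ccfs`) are hypothetical. The frequency index runs over 0, ..., K−1 as returned by `numpy.fft.fft`, which is equivalent to 1, ..., K for integer lags.

```python
import numpy as np


def ccf(S_m, S_n, tau):
    """Discrete CCF R_mn(tau), computed from the DFTs S_m and S_n."""
    K = S_m.size
    k = np.arange(K)
    return np.sum(S_m * np.conj(S_n) * np.exp(2j * np.pi * k * tau / K)) / K


def dsb_output(S, taus):
    """Direct DSB output: steer each DFT, sum over sensors, then sum |.|^2 over k."""
    M, K = S.shape
    k = np.arange(K)
    steered = S * np.exp(2j * np.pi * np.outer(taus, k) / K)
    return np.sum(np.abs(steered.sum(axis=0)) ** 2) / K


def dsb_from_ccfs(S, taus):
    """DSB output rebuilt from Eq. (A.7): signal energies plus twice the pairwise CCFs."""
    M = S.shape[0]
    energy = sum(ccf(S[m], S[m], 0).real for m in range(M))
    cross = sum(ccf(S[m], S[n], taus[m] - taus[n]).real
                for m in range(M - 1) for n in range(m + 1, M))
    return energy + 2.0 * cross


rng = np.random.default_rng(0)
M, K = 4, 256
signals = rng.standard_normal((M, K))   # synthetic real-valued sensor signals
S = np.fft.fft(signals, axis=1)
taus = np.array([0, 3, -5, 11])         # arbitrary steering delays, in samples

p_direct = dsb_output(S, taus)
p_ccf = dsb_from_ccfs(S, taus)
print(p_direct, p_ccf)
```

Both values agree to within rounding error. Only the cross term depends on the steering delays, so scanning the focus location changes the pairwise CCF sum while the energy term stays fixed.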
A.2 Practical Example
The spatial response of a delay-and-sum beamformer can be obtained by assuming a single source, located at a specific focus position ℓf , and having constant strength over the frequency range of interest. With the microphone signals defined as delayed and attenuated replicas of the source signal, the beamformer output P(ℓ) can be computed using Eq. (A.2) for any location ℓ in the state space. This implicitly
corresponds to a convolution of the beamformer function P(ℓ) with a Dirac delta
function δ(ℓ − ℓf ) (ideal point source). In the context of a two-dimensional (2D) ASL problem definition, it follows that the CCF terms Rmn (τmn ) in Eq. (A.7)
deliver a peak value for any location ℓi in the room whose time delay of arrival (TDOA) with respect to the mth and nth sensors equals that corresponding to the focus location ℓf , i.e.:
\[ \frac{\|\ell_i - \ell_n\| - \|\ell_i - \ell_m\|}{c} = \frac{\|\ell_f - \ell_n\| - \|\ell_f - \ell_m\|}{c} . \]
The geometric set of such locations is a hyperbola (hyperboloid in 3D) having the mth and nth microphones as focal points and passing through the focus position.
To illustrate this result, Figure A.1 shows a CCF example simulated for a 3.8 m × 2.9 m setting. Four microphones, shown as circles in Figure A.1, are arranged as two pairs on two adjacent walls. The simulated sound source is located at the arbitrary focus position [xf yf ] = [1.47 m 1.92 m] (shown as a ‘⊗’ marker in the bottom plot). Figure A.1 depicts the CCF results obtained for the pair of sensors located closest to the upper left-hand corner of the room, arbitrarily defined as the second and third array microphones. The top graph shows the CCF R23 (τ ) computed for all possible lag values τ given the sensor spacing, which corresponds
to an angle of arrival ranging from −π/2 to π/2 (in rad). The bottom plot shows
the same results computed across the entire state space. In other words, the TDOA
[Figure A.1 plots. Top: CCF versus lag τ (s). Bottom: CCF over the room area, x-axis (m) versus y-axis (m).]
Figure A.1: Example of a single CCF term involved in the DSB response computation. Top plot: CCF R23 (τ ) computed for the sensors (denoted with circles) closest to the upper-left corner. The peak appears for the TDOA corresponding to the focus location ℓf . Bottom plot: same CCF computed as a function of the location in the state space. The focus location ℓf is indicated with ‘⊗’.
is determined for each position with respect to the current microphones, and the CCF R23 (τ ) is computed and plotted for that specific value of τ . The CCF results
shown in Figure A.1 have been obtained with the frequency-domain equivalent to Eq. (A.1) for a frequency range f ∈ [300, 3000] Hz. To confirm the result of Eq. (A.7), the steered beamformer response was computed for the same simulation setup. Figure A.2 shows the result obtained for a DSB based on all four microphones, computed using the frequency-domain equivalent of Eq. (A.2) and for the above-mentioned frequency range.
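The mapping from a location to the TDOA of one sensor pair, which underlies the bottom plot of Figure A.1, can be sketched as follows. The microphone coordinates, the speed of sound and the grid resolution are assumptions made for illustration only (the actual sensor positions are not restated in this appendix); only the focus position is taken from the text.

```python
import numpy as np

C = 343.0                              # assumed speed of sound (m/s)
MIC_2 = np.array([0.0, 3.4])           # hypothetical sensor on the left wall
MIC_3 = np.array([0.4, 3.8])           # hypothetical sensor on the top wall
FOCUS = np.array([1.47, 1.92])         # focus position quoted in the text

def tdoa(loc, mic_m, mic_n, c=C):
    """TDOA (s) of location(s) `loc` with respect to sensors m and n."""
    loc = np.asarray(loc, dtype=float)
    d_n = np.linalg.norm(loc - mic_n, axis=-1)
    d_m = np.linalg.norm(loc - mic_m, axis=-1)
    return (d_n - d_m) / c

# Evaluate the TDOA over a grid covering the 2.9 m x 3.8 m floor area.
xs = np.arange(0.0, 2.9 + 0.01, 0.02)
ys = np.arange(0.0, 3.8 + 0.01, 0.02)
grid = np.stack(np.meshgrid(xs, ys, indexing="xy"), axis=-1)   # (Ny, Nx, 2)
tdoa_map = tdoa(grid, MIC_2, MIC_3)

# Grid points whose TDOA matches that of the focus lie on the hyperbola.
tdoa_focus = tdoa(FOCUS, MIC_2, MIC_3)
on_locus = np.abs(tdoa_map - tdoa_focus) < 2e-5

# The lag can never exceed the sensor spacing divided by c.
max_lag = np.linalg.norm(MIC_2 - MIC_3) / C
```

Plotting `on_locus` reveals the hyperbola branch through the focus; looking up the CCF at `tdoa_map` for every grid point would produce a map of the kind shown in the bottom plot.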
[Figure A.2 plot: DSB response (dB) over the room area, x-axis (m) versus y-axis (m).]
Figure A.2: Example of delay-and-sum beamformer response (in dB). The array sensors are shown as circles on the room boundaries and the focus position is indicated by the ‘⊗’ marker.
A.3 Discussion
As expected, the presence of a source in the state space translates into a hyperbola-shaped curve of large values in the 2D CCF plot (Figure A.1). When the source position is unknown, all the points located on this locus are equally likely to correspond to the true source location. Hence, a minimum of two CCFs is required in order to estimate the source position. The simulation plot shown in Figure A.2 clearly validates the main result of Section A.1: the DSB response distinctly appears to be a sum of CCFs. The features of the example CCF depicted in Figure A.1 are easily distinguishable in Figure A.2, together with another five such functions resulting from the remaining sensor pairs. The peak appearing in the beamformer response at the focus position is hence the result of the various amplitudes of each hyperbola adding up at this specific location.
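The count of six CCF terms for the four-sensor example follows from the M(M − 1)/2 distinct sensor pairs. A short enumeration is sketched below; the helper name is hypothetical and sensors are simply numbered from 1.

```python
from itertools import combinations

def sensor_pairs(num_sensors):
    """All unordered sensor pairs (m, n) with m < n, numbered from 1."""
    return list(combinations(range(1, num_sensors + 1), 2))

pairs_four = sensor_pairs(4)    # the four-microphone example of Section A.2
pairs_eight = sensor_pairs(8)   # an eight-microphone array
```

For four sensors this gives the pair (2, 3) of Figure A.1 plus five others; an eight-sensor array already involves 28 CCF terms.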
Previous research in the array processing literature has focused on the implications of adding or removing sensors in the considered array. Based on the preceding analysis however, a more relevant way to consider the problem is to study the implications of adding or removing pairs in the computation of the DSB response. The results obtained in this appendix make it quite clear that in order to maximise the height of the main lobe in the beamformer response, the number of CCF components (i.e. the number of considered sensor pairs) must be maximised. Implicitly, this maximisation is automatically performed with a steered beamforming approach. ASL methods using only a subset of possible CCFs do not necessarily put an optimal emphasis on the focus position when determining an estimate of the source location.
A.4 Implications on ASL Methods
Many ASL applications involving an array of acoustic sensors are traditionally based on the information obtained from one of the following methods: i) direct methods, including the steered beamforming (SBF) principle, or ii) time-delay estimation (TDE) using e.g. cross-correlation functions. The first approach computes the output of a beamformer based on an M sensor array [24, 109, 120]. The acoustic signal originating from the source will contribute to generate a peak of acoustic energy in the beamformer output function, which can be used to deliver information about the current source position. Due to reverberation and other possible noise sources, spurious peaks may appear in the SBF measurements and possibly mislead the tracking algorithm. The second method consists in computing some of the M(M − 1)/2 possible CCFs [16, 42, 116, 118].
Locating the global maximum in these functions subsequently delivers a series of time delay estimates. Following this process, the TDEs can be used to determine the
source location as the position minimising some least-squares criterion [51, 60, 106]. For TDE-based methods as well, measurement disturbances can lead to erroneous TDEs, and consequently result in erratic location estimates.
As shown in this appendix, there exists a close similarity between the above-mentioned SBF and TDE-based localisation methods. In particular, both of them use the cross-correlation function as a basic measurement for the detection of an acoustic event. From the analysis of Section A.1, it results that SBF observations can be expected to provide the most robust source localisation estimates in relation to ASL, since this approach automatically involves all the possible CCF combinations. TDE-based methods are likely to provide less accurate location estimates due to the following two drawbacks:
i) Since they usually only rely on a subset of the possible pairs in the sensor array, TDE approaches might miss out on some important localisation information provided by pairs not included in the computations. In contrast, steered beamforming automatically includes every CCF in the localisation function.
ii) The two-stage approach of TDE methods involves an additional step where the TDEs derived from the localisation function are combined to determine an optimal fit corresponding to the estimated source position. This non-linear operation is a potential source of further errors in the resulting source location estimates. SBF approaches do not require this two-step approach.
The fact that SBF methods are better suited for acoustic source tracking was also practically demonstrated by the superior performance of the particle filters based on this principle in Chapter 3.
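The second stage of a TDE-based method, i.e. turning a set of time delay estimates into a location, can be illustrated with a brute-force least-squares fit. This is only a sketch under stated assumptions: the microphone positions, the speed of sound and the grid search are hypothetical choices, the TDOAs are noise-free, and the closed-form or iterative solvers of [51, 60, 106] are not reproduced here.

```python
import numpy as np
from itertools import combinations

C = 343.0   # assumed speed of sound (m/s)
# Hypothetical sensor positions (m): two pairs on two adjacent walls.
MICS = np.array([[0.0, 3.0], [0.0, 3.4], [0.4, 3.8], [0.8, 3.8]])
PAIRS = list(combinations(range(len(MICS)), 2))

def tdoas(loc, pairs=PAIRS):
    """TDOAs (s) of the listed pairs for one location or a grid of locations."""
    loc = np.asarray(loc, dtype=float)
    d = np.linalg.norm(loc[..., None, :] - MICS, axis=-1)       # (..., M)
    return np.stack([(d[..., n] - d[..., m]) / C for m, n in pairs], axis=-1)

def least_squares_location(measured, pairs=PAIRS, step=0.01):
    """Grid point minimising the squared TDOA mismatch (second TDE stage)."""
    xs = np.arange(0.0, 2.9 + step / 2, step)
    ys = np.arange(0.0, 3.8 + step / 2, step)
    grid = np.stack(np.meshgrid(xs, ys, indexing="xy"), axis=-1)
    cost = np.sum((tdoas(grid, pairs) - measured) ** 2, axis=-1)
    iy, ix = np.unravel_index(np.argmin(cost), cost.shape)
    return np.array([xs[ix], ys[iy]])

true_location = np.array([1.47, 1.92])
measured = tdoas(true_location)          # noise-free "time delay estimates"
estimate = least_squares_location(measured)
```

With error-free delays the fit recovers the source position to within the grid resolution; perturbing `measured` shows how TDE errors propagate through this non-linear step.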
Appendix B Theoretical Derivation of PH1

This appendix contains the derivation of the theoretical formula for computing PH1 , i.e. the probability of correct detection defined and used in Chapter 5. The result presented here has been used to generate Figure 5.3.
First, the output of a steered beamformer (SBF) aimed at the acoustic source must be computed. As defined in Section 2.2.2, the output function of a SBF is:
\[ P(\ell) = \int \left| \frac{1}{M} \sum_{m=1}^{M} S_m(\omega) \exp(j\omega\tau_m) \right|^2 d\omega , \tag{B.1} \]
where Sm (ω) = F {sm (t)} is the Fourier transform of the time signal received at the mth sensor, M is the number of microphones in the array, and
\[ \tau_m = \frac{d_m}{c} = \frac{\|\ell - \ell_m\|}{c} \]
corresponds to the time delay from the focus location ℓ to the mth microphone location ℓm . As usual, c denotes the propagation speed of acoustic waves. The signal received at the sensors is a delayed and attenuated reproduction of the source signal:
\[ S_m(\omega) = \frac{1}{4\pi d_{s,m}} S_s(\omega) \exp(-j\omega\tau_{s,m}) , \quad m \in \{1, \ldots, M\} , \tag{B.2} \]
with ds,m the distance kℓs − ℓm k between the source and the mth microphone,
Ss (ω) the Fourier transform of the source signal ss (t), and τs,m = ds,m /c denoting the signal propagation time from the source to the mth sensor.
For a beamformer aimed at the location ℓs of the acoustic source, the steering delay τm , m ∈ {1, . . . , M}, is set to a value equal to some constant minus the time
delay τs,m of the source signal to each sensor. This results in the exponential terms in Eqs. (B.1) and (B.2) cancelling each other out. Assuming a white noise source
signal with constant strength Ψs over the frequency range ω ∈ [ωl , ωu ] (and zero otherwise), the SBF output then becomes:
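\[
\begin{aligned}
P(\ell_s) &= \int_{\omega_l}^{\omega_u} \left| \frac{1}{M} \sum_{m=1}^{M} \frac{1}{4\pi d_{s,m}} S_s(\omega) \right|^2 d\omega \\
&= \int_{\omega_l}^{\omega_u} \Bigg| \underbrace{\frac{1}{4\pi M} \sum_{m=1}^{M} \frac{1}{d_{s,m}}}_{\alpha} \, \Psi_s \Bigg|^2 d\omega \\
&= \alpha^2 \Psi_s^2 \int_{\omega_l}^{\omega_u} d\omega \\
&= \alpha^2 \Psi_s^2 \, 2\pi B ,
\end{aligned}
\tag{B.3}
\]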
with the signal bandwidth B = fu − fl . With the total source power Ws computed as follows:
\[ W_s = \int_{\omega_l}^{\omega_u} |\Psi_s|^2 \, d\omega = \Psi_s^2 \, 2\pi B , \]
Eq. (B.3) finally results in:
\[ P(\ell_s) = \alpha^2 W_s . \]
As shown e.g. in Eq. (5.27), the sound intensity value used as a threshold for the computation of PH1 is:
\[ I_{\mathrm{thr}} = \frac{P(\ell_s)}{\bar{I}_r} . \]
The spatially-averaged sound intensity I¯r in a diffuse field is defined in Eq. (5.21) as follows:
\[ \bar{I}_r = \frac{T_{60} W_s}{\pi \, 0.163 \, V} , \]
where T60 and V are the reverberation time and volume of the considered enclosure, respectively. The intensity threshold then results in:
\[ I_{\mathrm{thr}} = \frac{\alpha^2 \pi \, 0.163 \, V}{T_{60}} . \tag{B.4} \]
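The chain from Eq. (B.3) to Eq. (B.4) can be traced numerically, which also makes it visible that the source strength Ψs cancels out of the threshold. The sketch below uses hypothetical source-to-sensor distances and an assumed room height of 2.7 m; the reverberation time of 0.39 s and the 300 to 3000 Hz band are borrowed from other parts of this document purely as plausible values, and the function names are illustrative.

```python
import math

def alpha(distances_m):
    """Geometry factor alpha = 1/(4 pi M) * sum_m 1/d_{s,m} from Eq. (B.3)."""
    m = len(distances_m)
    return sum(1.0 / d for d in distances_m) / (4.0 * math.pi * m)

def source_power(psi_s, bandwidth_hz):
    """Total source power W_s = Psi_s^2 * 2 pi B."""
    return psi_s ** 2 * 2.0 * math.pi * bandwidth_hz

def sbf_output_at_source(distances_m, psi_s, bandwidth_hz):
    """P(l_s) = alpha^2 * W_s."""
    return alpha(distances_m) ** 2 * source_power(psi_s, bandwidth_hz)

def diffuse_intensity(t60_s, w_s, volume_m3):
    """Spatially-averaged diffuse intensity T60 * W_s / (pi * 0.163 * V)."""
    return t60_s * w_s / (math.pi * 0.163 * volume_m3)

def intensity_threshold(distances_m, t60_s, volume_m3):
    """Closed form of Eq. (B.4)."""
    return alpha(distances_m) ** 2 * math.pi * 0.163 * volume_m3 / t60_s

distances = [1.8, 2.0, 2.2, 2.5]      # hypothetical source-to-sensor distances (m)
bandwidth = 3000.0 - 300.0            # B = f_u - f_l (Hz)
t60 = 0.39                            # reverberation time (s)
volume = 3.8 * 2.9 * 2.7              # room volume (m^3), height is an assumption

i_thr = intensity_threshold(distances, t60, volume)
ratios = []
for psi in (0.5, 2.0):                # two arbitrary source strengths
    w_s = source_power(psi, bandwidth)
    p_ls = sbf_output_at_source(distances, psi, bandwidth)
    ratios.append(p_ls / diffuse_intensity(t60, w_s, volume))
```

Every entry of `ratios` equals `i_thr`, so the threshold depends only on the array geometry and the room parameters.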
As derived in Eq. (5.28), the correct detection probability PH1 is given as:
\[ P_{H_1} = \left[ F_{I_r}(I_{\mathrm{thr}}) \right]^{N_I} = \left[ \int_{-\infty}^{I_{\mathrm{thr}}} p_{I_r}(\xi) \, d\xi \right]^{N_I} . \]
Together with Eq. (B.4) and the diffuse sound intensity distribution pIr (·) = γ(· ; Nt , Nt^−1 ) defined previously in Eq. (5.20), this finally yields:
\[ P_{H_1} = \left[ \frac{N_t^{N_t}}{(N_t - 1)!} \int_{-\infty}^{\frac{\alpha^2 \pi \, 0.163 \, V}{T_{60}}} \xi^{N_t - 1} \exp(-N_t \xi) \, d\xi \right]^{N_I} . \tag{B.5} \]
The variable NI corresponds to the number of independent sound intensity measurements considered in the state space, derived in Eq. (5.26). The pure tone equivalent number Nt was also defined in Chapter 5 as:
\[ N_t \approx 1 + \frac{B \, T_{60}}{6.9} . \]
Note that for large bandwidths B, the value of Nt can become very large, which may lead to problems for the numerical computation of the term Nt^Nt /(Nt − 1)! in Eq. (B.5). In such cases, a Gaussian approximation of the PDF pIr (·) can be used instead [121]:
\[ p_{I_r}(\xi) \approx \mathcal{N}\!\left( \xi ; 1, N_t^{-1} \right) , \quad \text{for } N_t \gg 1 . \]
Eq. (B.5) is the formula that was used to generate Figure 5.3, with the various free parameters (room volume, signal bandwidth, sensor positions, etc.) set to some typical practical values.
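One possible way of evaluating Eq. (B.5) is sketched below. It is not the code used for Figure 5.3. To sidestep the overflow of the Nt^Nt /(Nt − 1)! term, the prefactor is handled in the log domain with `math.lgamma`, and the integral is evaluated with the standard power series of the lower incomplete gamma function; the Gaussian approximation is included for comparison. The threshold and NI values in the example are arbitrary, and the function names are hypothetical.

```python
import math

def pure_tone_equivalent(bandwidth_hz, t60_s):
    """N_t ~ 1 + B * T60 / 6.9."""
    return 1.0 + bandwidth_hz * t60_s / 6.9

def intensity_cdf_gamma(i_thr, n_t, tol=1e-15, max_terms=100000):
    """CDF at i_thr of a gamma density with shape n_t and scale 1/n_t."""
    if i_thr <= 0.0:
        return 0.0
    x = n_t * i_thr
    term = 1.0 / n_t                       # series term for n = 0
    total = term
    for n in range(1, max_terms):
        term *= x / (n_t + n)
        total += term
        if term < tol * total:
            break
    # log of x^a * exp(-x) / Gamma(a), evaluated without overflow
    log_prefactor = n_t * math.log(x) - x - math.lgamma(n_t)
    return min(1.0, total * math.exp(log_prefactor))

def intensity_cdf_gauss(i_thr, n_t):
    """Gaussian approximation N(1, 1/n_t), intended for n_t >> 1."""
    z = (i_thr - 1.0) * math.sqrt(n_t)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def detection_probability(i_thr, n_t, n_i, cdf=intensity_cdf_gamma):
    """P_H1 = [F(I_thr)]^N_I."""
    return cdf(i_thr, n_t) ** n_i

n_t = pure_tone_equivalent(2700.0, 0.39)   # illustrative band and T60
p_gamma = detection_probability(1.1, n_t, n_i=20)          # arbitrary I_thr, N_I
p_gauss = detection_probability(1.1, n_t, n_i=20, cdf=intensity_cdf_gauss)
```

For Nt of the order of 150 the two CDFs differ by about one percent at most, which is consistent with the use of the Gaussian form for large bandwidths.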
Appendix C Real-Time PF Implementation for Acoustic Source Tracking∗

On the basis of the multi-channel signal processing system described in [80], a real-time particle filtering application for acoustic source localisation (ASL) was implemented. The purpose of this practical realisation was twofold. It allowed a performance assessment of the tracking algorithm while operating in real-life situations. The other incentive was to demonstrate that a PF-based algorithm for acoustic source tracking can be successfully implemented within the computational power limits of a modern personal computer. This appendix presents a brief overview of this implementation and its practical outcomes.
C.1 PF Algorithm
The results of Chapter 3 demonstrate the superior tracking performance of the SBF-PL method compared to other implementations of the particle filtering principle for ASL. This algorithm was consequently chosen as the preferred method for a real-time realisation. A description of SBF-PL is given in Section 3.6.4. In accordance with the optimal parameters given in Table 3.1, the value of q0 was also set to zero for this implementation (due to SBF measurements being always positive), and the nonlinear exponent was defined as r = 3. The audio signals were acquired in parallel from eight microphone channels sampled at 11025 Hz, and subsequently processed in non-overlapping frames of 512 samples (corresponding to a frame length of 46.4·10^−3 s). Internally, the PF processing was however only carried out over the frequency range f ∈ [300, 3000] Hz.

∗ A significant part of this implementation was carried out by Kris Modrak.
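The role of the parameters q0 and r can be illustrated with a generic bootstrap weighting and resampling step. The sketch assumes a pseudo-likelihood of the form max(P, q0)^r, which is only a reading of how these two parameters are described here; the actual SBF-PL definition is the one given in Section 3.6.4. The stand-in SBF response, the particle count and all names are hypothetical.

```python
import numpy as np

def pseudo_likelihood_weights(sbf_values, q0=0.0, r=3.0):
    """Normalised weights from an assumed pseudo-likelihood max(P, q0)**r."""
    shaped = np.maximum(np.asarray(sbf_values, dtype=float), q0) ** r
    return shaped / shaped.sum()        # requires at least one positive value

def resample(particles, weights, rng):
    """Multinomial resampling, as in a plain bootstrap particle filter."""
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx]

rng = np.random.default_rng(1)
source = np.array([1.47, 1.92])
particles = rng.uniform([0.0, 0.0], [2.9, 3.8], size=(50, 2))

# Stand-in for SBF measurements: a positive function peaking at the source.
sbf_values = np.exp(-np.sum((particles - source) ** 2, axis=1))

weights = pseudo_likelihood_weights(sbf_values, q0=0.0, r=3.0)
resampled = resample(particles, weights, rng)
estimate = np.average(particles, axis=0, weights=weights)
```

With r = 3, a particle whose SBF value is twice that of another receives eight times its weight, which sharpens the measurement peak before resampling.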
C.2 Practical Setup
The entire real-time application (including signal acquisition, PF algorithm and graphical result display) was implemented on a single AMD Athlon 1.7GHz computer running under the Debian 3.0 Linux distribution. Technical details for the implementation of the multi-channel signal acquisition system can be found in [80]. The experimental setup (including the sensor locations) is described in Section 2.5 and corresponds to a typical office room. As mentioned earlier, this environment presents a medium level of reverberation with a frequency-averaged (up to 22050Hz) reverberation time T60 = 0.39s. The signals were sampled in hardware at 44.1kHz and subsequently downsampled by a factor of four as part of the signal processing routines to yield the desired sampling frequency. For demonstration purposes, various other features were added to the final ASL tracking program. These include for instance the possibility to record specific audio signals “on the fly”, and also a stereo colour visual tracker (see [81]) implemented to compare the PF tracking results with ground truth data. The application code was developed using the C/C++ programming language, and the complete demonstration software is described in [82].
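The timing figures quoted for this setup follow from simple arithmetic on the sampling parameters, sketched below. The sampling rates, frame length and frequency band are taken from the text; the assumption that each frame is transformed with a DFT of the same length as the frame (used to derive the bin range) is hypothetical.

```python
import math

HARDWARE_RATE_HZ = 44100          # hardware sampling rate
DOWNSAMPLING_FACTOR = 4           # applied in the signal processing routines
FRAME_LENGTH = 512                # samples per non-overlapping frame
F_LOW_HZ, F_HIGH_HZ = 300.0, 3000.0

sample_rate_hz = HARDWARE_RATE_HZ / DOWNSAMPLING_FACTOR
frame_duration_s = FRAME_LENGTH / sample_rate_hz
updates_per_second = 1.0 / frame_duration_s

# Assumption: one DFT of FRAME_LENGTH points per frame.
bin_spacing_hz = sample_rate_hz / FRAME_LENGTH
first_bin = math.ceil(F_LOW_HZ / bin_spacing_hz)
last_bin = math.floor(F_HIGH_HZ / bin_spacing_hz)
num_bins_used = last_bin - first_bin + 1
```

This reproduces the 11025 Hz rate and the 46.4 ms frame length, i.e. roughly 21.5 display updates per second, and shows that only a fraction of the DFT bins falls inside the processed band.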
C.3 Practical Results
Figure C.1 shows a typical example of the graphical result display captured from the real-time PF implementation (in this specific example, the sound signal was white noise). The different markers in this snapshot (indicating the estimated source position and the particles) are updated on screen in real-time with a rate of one update every 46.4·10^−3 s. This practical implementation demonstrates the ability of the SBF-PL method to efficiently track a mobile acoustic source in environments with medium levels of reverberation, which constitutes a highly promising result. It is of course relatively easy to mislead the tracking algorithm, due to the purely bootstrap strategy involved in this method. As expected, relatively long pauses in the speech signal are likely to send the particle filter off-track. A simple voice activity detector would however largely solve this problem. Also, the tracking results become more erratic when two or more people are talking at once, even if the algorithm has been successfully tracking one of these sources prior to the others switching on. This typically results from the fact that additional sources tend to increase the diffuse noise level in the room. Hence, the presence of multiple sources has an effect similar to tracking a single acoustic source with an increased level of reverberation. The implementation of a tracking method based on the SIS principle (see Chapter 4) is expected to improve the overall behaviour of this basic bootstrap implementation in a significant way. What the implementation described in this appendix however shows is that once the source has been detected, a PF-based algorithm is able to successfully track this target despite the reverberation effects. To the best of the author’s knowledge, no other acoustic source tracking method has been successfully tested in practice with such a level of reverberation.

Figure C.1: Snapshot of graphical output from the real-time PF algorithm implementation. The ‘⋄’ marker indicates the source location estimate delivered by the particle filter, ‘◦’ markers denote the particles, and ‘ ’ markers represent the last 50 positions in the source path history.
[Figure C.2 plot: CPU usage (%), from 0 to 100, versus number of particles, from 0 to 450.]
Figure C.2: Percentage of CPU power required by the real-time PF application as a function of the number of particles used in the algorithm.ᵃ
Finally, Figure C.2 gives some insight into the computing power required by the real-time particle filter implemented on the system described previously. This plot represents the percentage of CPU power required by the application as a function of the number of particles used in the PF algorithm. Note that these results correspond to the computation power required by the basic application processes only, i.e. the signal acquisition, particle filter and graphical display routines. No other processes (such as the visual tracking or audio recording features mentioned earlier) were running when measuring the CPU usage. Considering the results obtained in Chapter 3 with fewer than 50 particles, Figure C.2 shows that if necessary, the number of particles can be increased significantly on the current system. This also means that the implementation of a method based on the SIS principle should pose no problems from a computational point of view for modern desktop computers.
C.4 Conclusion
The practical implementation carried out in the frame of this research demonstrates two important characteristics of particle filtering algorithms. First, medium to low computational requirements make them naturally well-suited for real-time applications. Also, an increased efficiency at dealing with reverberation allows such methods to be successfully implemented in real-life situations where conventional ASL algorithms may fail to deliver satisfactory results. Many studies presented in the literature related to sound source tracking indeed only consider off-line algorithm tests using synthetic audio samples. It is believed that the implementation of the current tracking system bridges a certain gap between theory and practice by demonstrating the real-life capabilities of particle filtering methods.

ᵃ Figure courtesy of Kris Modrak.
Appendix D CD Contents

Two data CDs can be found at the end of this document. They contain a collection of the most important files related to the research presented in this thesis. This appendix gives a brief overview of how the data available on these CDs is organised. The CD-ROM labelled CD1 includes two main folders, the contents of which are described in the next two sections. The second disc, labelled CD2, merely contains audio data files too large to fit on one single CD-ROM.
D.1 Thesis Files
The directory called PhD files contains some documents related to this thesis. These files are listed below:
• PhD thesis.pdf and PhD thesis.ps: electronic copies of the present document.
• LehmannEtAl2003.pdf and WardEtAl2003.pdf: electronic copies of two publications resulting from the present research work (described in Section 1.4).
D.2 Data and Other Documentation Files
The folder data doc files contains data and documentation files used for the experiments and practical implementations developed in the frame of this research. The majority of these files were created by Kris Modrak and are documented in various technical reports. This material has been included in the current file bundle in order to make the present work as self-contained as possible, and to make these technical reports easily accessible for the reader. Consequently, this section only gives a brief description of the contents in each subfolder, together with a reference to the corresponding report where more information can be found if necessary.
• Subfolder audio system and software: this subfolder contains files related to the real-time audio signal processing system used to acquire data from a 16-channel sensor array, as described in [80]. An electronic copy of this technical report can also be found in this subfolder. The code of several software routines used in conjunction with this multi-channel system is also included.
• Subfolder recordings: contains a copy of the report describing the recording setup used for the acquisition of real-life audio samples, i.e. Reference [78].
• Subfolder sound card installation: the document corresponding to Reference [79] can be found in this folder, providing technical details about the hardware and software implementation of a multi-channel recording system on a Linux platform.
• Subfolder audio data: includes a series of 16-channel audio samples recorded in a real office room (in the directory recordings). The environmental setup for these recordings is detailed in [78]. Due to the large size of these audio files, only the samples o1wav.gz to o17wav.gz are available in this directory. The rest of the recordings (i.e. samples o18wav.gz to o41wav.gz) can be found on the CD labelled CD2. The audio data subfolder also contains the different files used as source signals (in the signals directory).
• Subfolder pfbssl demo: the contents of this directory are related to the implementation of a real-time sound source localisation and visual tracking system. This includes the technical report [82] and other software programs required by this application.
• Subfolder colour tracker: these files are specific to the implementation of the stereo colour tracker used in [82]. The software included in this directory is described in detail in the technical report [81], a copy of which is also available in the current folder.
Bibliography

[1] Jonathan S. Abel. A bound on mean-square-estimate error. IEEE Transactions on Information Theory, 39(5):1675–1680, September 1993.
[2] Robert Aichner, Wolfgang Herbordt, Herbert Buchner, and Walter Kellermann. Least-squares error beamforming using minimum statistics and multichannel frequency-domain adaptive filtering. In Proceedings of the IEEE International Workshop on Acoustic Echo and Noise Control, pages 223–226, Kyoto, Japan, September 2003.
[3] Jont B. Allen and David A. Berkeley. Image method for efficiently simulating small-room acoustics. Journal of the Acoustical Society of America, 65(4):943–950, April 1979.
[4] Brian D. O. Anderson and John B. Moore. Optimal filtering. Prentice-Hall, Englewood Cliffs, N.J., 1979.
[5] Shoko Araki, Shoji Makino, Robert Aichner, Tsuyoki Nishikawa, and Hiroshi Saruwatari. Subband based blind source separation for convolutive mixtures of speech. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, volume 5, pages 509–512, Hong Kong, China, April 2003.
[6] M. Sanjeev Arulampalam, Simon Maskell, Neil Gordon, and Tim Clapp. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing, 50(2):174–188, February 2002.
[7] Edward R. Beadle and Petar M. Djurić. A fast weighted Bayesian bootstrap filter for nonlinear model state estimation. IEEE Transactions on Aerospace and Electronic Systems, 33(1):338–343, January 1997.
[8] Jacob Benesty. Adaptive eigenvalue decomposition algorithm for passive acoustic source localization. Journal of the Acoustical Society of America, 107(1):384–391, January 2000.
[9] Jacob Benesty and Dennis R. Morgan. Frequency-domain adaptive filtering revisited, generalization to the multi-channel case, and application to acoustic echo cancellation. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 2, pages 789–792, Istanbul, Turkey, June 2000.
[10] Leo Leroy Beranek. Acoustics. McGraw-Hill, New York, 1954.
[11] Niclas Bergman. Posterior Cramér-Rao bounds for sequential estimation. In A. Doucet, N. de Freitas, and N. Gordon, editors, Sequential Monte Carlo methods in practice, pages 321–338. Springer-Verlag, New York, 2000.
[12] Niclas Bergman, Lennart Ljung, and Frederik Gustafsson. Point-mass filter and Cramer-Rao bound for terrain-aided navigation. In Proceedings of the 36th IEEE Conference on Decision and Control, volume 1, pages 565–570, San Diego, CA, USA, December 1997.
[13] A. Bessell, B. Ristic, A. Farina, X. Wang, and M. S. Arulampalam. Error performance bounds for tracking a manoeuvring target. In Proceedings of the Sixth International Conference of Information Fusion, volume 2, pages 903–910, Cairns, QLD, Australia, July 2003.
[14] Michael Brandstein and Darren Ward, editors. Microphone arrays: signal processing techniques and applications. Springer-Verlag, Berlin, 2001.
[15] Michael S. Brandstein. A pitch-based approach to time-delay estimation of reverberant speech. In Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY, USA, October 1997.
[16] Michael S. Brandstein, John E. Adcock, and Harvey F. Silverman. A practical time-delay estimator for localizing speech sources with a microphone array. Computer, Speech and Language, 9(2):153–169, April 1995.
[17] Michael S. Brandstein, John E. Adcock, and Harvey F. Silverman. A closed-form location estimator for use with room environment microphone arrays. IEEE Transactions on Speech and Audio Processing, 5(1):45–50, January 1997.
[18] Herbert Buchner, Robert Aichner, and Walter Kellermann. TRINICON: a versatile framework for multichannel blind signal processing. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, volume 3, pages 889–892, Montreal, Canada, May 2004.
[19] Herbert Buchner and Walter Kellermann. Acoustic echo cancellation for two and more reproduction channels. In Proceedings of the IEEE International Workshop on Acoustic Echo and Noise Control, pages 99–102, Darmstadt, Germany, September 2001.
[20] Herbert Buchner and Walter Kellermann. An acoustic human–machine interface with multi-channel sound reproduction. In Proceedings of the IEEE International Workshop on Multimedia Signal Processing, pages 359–364, Cannes, France, October 2001.
[21] Benoît Champagne, Stéphane Bédard, and Alex Stéphenne. Performance of time-delay estimation in the presence of room reverberation. IEEE Transactions on Speech and Audio Processing, 4(2):148–152, March 1996.
[22] Joe C. Chen, Ralph E. Hudson, and Kung Yao. Maximum-likelihood source localization and unknown sensor location estimation for wideband signals in the near-field. IEEE Transactions on Signal Processing, 50(8):1843–1854, August 2002.
[23] Joe C. Chen, Kung Yao, and Ralph E. Hudson. Acoustic source localization and beamforming: theory and practice. EURASIP Journal on Applied Signal Processing, 2003(4):359–370, March 2003.
[24] Joe C. Chen, Kung Yao, Tai-Lai Tung, Chris W. Reed, and Daching Chen. Source localization and tracking of a wideband source using a randomly distributed beamforming sensor array. International Journal of High Performance Computing Applications, 16(3):52–65, August 2002.
[25] Yunqiang Chen and Yong Rui. Real-time speaker tracking using particle filter sensor fusion. Proceedings of the IEEE, 92(3):485–494, March 2004.
[26] Richard K. Cook, R. V. Waterhouse, R. D. Berendt, Seymour Edelman, and M. C. Thompson. Measurement of correlation coefficients in reverberant sound fields. Journal of the Acoustical Society of America, 27(6):1072–1077, November 1955.
[27] Dan Crisan. Particle filters – a theoretical perspective. In A. Doucet, N. de Freitas, and N. Gordon, editors, Sequential Monte Carlo methods in practice, pages 17–41. Springer-Verlag, New York, 2000.
[28] Dan Crisan and Arnaud Doucet. A survey of convergence results on particle filtering methods for practitioners. IEEE Transactions on Signal Processing, 50(3):736–746, March 2002.
[29] Ross Cutler, Yong Rui, Anoop Gupta, JJ Cadiz, Ivan Tashev, Li-wei He, Alex Colburn, Zhengyou Zhang, Zicheng Liu, and Steve Silverberg. Distributed meetings: a meeting capture and broadcasting system. In Proceedings of ACM Multimedia, Juan-les-Pins, France, December 2002.
[30] Li Deng and Xuedong Huang. Challenges in adopting speech recognition. Communications of the Association for Computing Machinery, 47(1):69–75, January 2004.
[31] Joseph H. DiBiase, Harvey F. Silverman, and Michael S. Brandstein. Robust localization in reverberant rooms. In M. Brandstein and D. Ward, editors, Microphone arrays: signal processing techniques and applications, pages 157–180. Springer-Verlag, Berlin, 2001.
[32] Arnaud Doucet, Nando de Freitas, and Neil Gordon. An introduction to sequential Monte Carlo methods. In A. Doucet, N. de Freitas, and N. Gordon, editors, Sequential Monte Carlo methods in practice, pages 3–14. Springer-Verlag, New York, 2000.
[33] Arnaud Doucet, Nando de Freitas, and Neil Gordon, editors. Sequential Monte Carlo methods in practice. Springer-Verlag, New York, 2000.
[34] Arnaud Doucet, Simon Godsill, and Christophe Andrieu. On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10:197–208, 2000. Full developments available in technical report CUED/F-INFENG/TR 310, Cambridge University Department of Engineering, Cambridge CB2 1PZ, England.
[35] Scott C. Douglas. Blind separation of acoustic signals. In M. Brandstein and D. Ward, editors, Microphone arrays: signal processing techniques and applications, pages 355–380. Springer-Verlag, Berlin, 2001.
[36] Jasha Droppo and Alex Acero. Noise robust speech recognition with a switching linear dynamic model. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, volume 1, pages 953–956, Montreal, Canada, May 2004.
[37] Gary W. Elko. Future directions for microphone arrays. In M. Brandstein and D. Ward, editors, Microphone arrays: signal processing techniques and applications, pages 383–387. Springer-Verlag, Berlin, 2001.
[38] Ronald L. Fante. Signal analysis and estimation: an introduction. Wiley, New York, 1988.
[39] Daniel Gatica-Perez, Guillaume Lathoud, Iain McCowan, Jean-Marc Odobez, and Darren Moore. Audio-visual speaker tracking with importance particle filters. In Proceedings of the International Conference on Image Processing, volume 3, pages 25–28, Barcelona, Spain, September 2003.
[40] Nikolay D. Gaubitch, Patrick A. Naylor, and Darren B. Ward. On the use of linear prediction for dereverberation of speech. In Proceedings of the IEEE International Workshop on Acoustic Echo and Noise Control, pages 99–102, Kyoto, Japan, September 2003.
[41] Bradford W. Gillespie, Henrique S. Malvar, and Dinei A. F. Florêncio. Speech dereverberation via maximum-kurtosis subband adaptive filtering. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 6, pages 3701–3704, Salt Lake City, UT, USA, May 2001.
[42] D. Giuliani, M. Omologo, and P. Svaizer. Talker localization and speech recognition using a microphone array and a cross-powerspectrum phase analysis. In Proceedings of the International Conference on Spoken Language Processing, volume 3, pages 1243–1246, Yokohama, Japan, September 1994.
[43] Simon Godsill, Arnaud Doucet, and Mike West. Maximum a posteriori sequence estimation using Monte Carlo particle filters. Annals of the Institute of Statistical Mathematics, 52(1):82–96, 2001.
[44] N. J. Gordon, D. J. Salmond, and A. F. M. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings on Radar and Signal Processing, 140(2):107–113, April 1993.
[45] Neil J. Gordon. A hybrid bootstrap filter for target tracking in clutter. IEEE Transactions on Aerospace and Electronic Systems, 33(1):353–358, 1997.
[46] Nedelko Grbic, Sven Nordholm, and Antonio Cantoni. Optimal FIR subband beamforming for speech enhancement in multipath environments. IEEE Signal Processing Letters, 10(11):335–338, November 2003.
[47] Julie E. Greenberg and Patrick M. Zurek. Microphone-array hearing aids. In M. Brandstein and D. Ward, editors, Microphone arrays: signal processing techniques and applications, pages 229–253. Springer-Verlag, Berlin, 2001.
[48] S. M. Griebel and M. S. Brandstein. Microphone array source localization using realizable delay vectors. In Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 71–74, New Paltz, NY, USA, October 2001.
[49] Osamu Hoshuyama and Akihiko Sugiyama. Robust adaptive beamforming. In M. Brandstein and D. Ward, editors, Microphone arrays: signal processing techniques and applications. Springer-Verlag, Berlin, 2001.
[50] Yiteng Huang, Jacob Benesty, and Gary W. Elko. Passive acoustic source localization for video camera steering. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 2, pages 909–912, Istanbul, Turkey, June 2000.
[51] Yiteng Huang, Jacob Benesty, Gary W. Elko, and Russell M. Mersereau. Real-time passive source localization: a practical linear-correction least-squares approach. IEEE Transactions on Speech and Audio Processing, 9(8):943–956, November 2001.
[52] Carine Hue, Jean-Pierre Le Cadre, and Patrick Pérez. Performance analysis of two sequential Monte Carlo methods and posterior Cramér-Rao bounds for
multi-target tracking. In Proceedings of the Fifth International Conference on Information Fusion, volume 1, pages 464–473, Annapolis, MD, USA, July 2002. Complete results available in technical report 1457, IRISA, France, April 2002.
[53] Carine Hue, Jean-Pierre Le Cadre, and Patrick Pérez. Sequential Monte Carlo methods for multiple target tracking and data fusion. IEEE Transactions on Signal Processing, 50(2):309–325, February 2002.
[54] John P. Ianniello. Time delay estimation via cross-correlation in the presence of large estimation errors. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-30(6):998–1003, December 1982.
[55] Special issue on Monte Carlo methods for statistical signal processing, volume 50, number 2, of IEEE Transactions on Signal Processing. The Institute of Electrical and Electronics Engineers, Inc., February 2002.
[56] Special issue on sequential state estimation: from Kalman filters to particle filters, volume 92, number 3, of Proceedings of the IEEE. The Institute of Electrical and Electronics Engineers, Inc., March 2004.
[57] Michael Isard and Andrew Blake. Condensation – conditional density propagation for visual tracking. International Journal of Computer Vision, 29(1):5–28, 1998.
[58] Michael Isard and Andrew Blake. Icondensation: unifying low-level and high-level tracking in a stochastic framework. In Proceedings of the 5th European Conference on Computer Vision, volume 1, pages 893–908, Freiburg, Germany, June 1998.
[59] Michael Isard and Andrew Blake. A mixed-state Condensation tracker with automatic model-switching. In Proceedings of the Sixth International Conference on Computer Vision, pages 107–112, Bombay, India, January 1998.
[60] Ea-Ee Jan and James Flanagan. Sound source localization in reverberant environments using an outlier elimination algorithm. In Proceedings of the International Conference on Spoken Language Processing, volume 3, pages 1321–1324, Philadelphia, PA, USA, 1996.
[61] Don H. Johnson and Dan E. Dudgeon. Array signal processing, concepts and techniques. Prentice Hall Signal Processing. P T R Prentice Hall, Englewood Cliffs, New Jersey 07632, 1993.
[62] Walter Kellermann, Herbert Buchner, Wolfgang Herbordt, and Robert Aichner. Multichannel acoustic signal processing for human/machine interfaces – fundamental problems and recent advances. In Proceedings of the 18th International Congress on Acoustics, Kyoto, Japan, April 2004. To appear.
[63] Walter L. Kellermann. Acoustic echo cancellation for beamforming microphone arrays. In M. Brandstein and D. Ward, editors, Microphone arrays: signal processing techniques and applications, pages 281–306. Springer-Verlag, Berlin, 2001.
[64] Thomas H. Kerr. Status of CR-like lower bounds for nonlinear filtering. IEEE Transactions on Aerospace and Electronic Systems, 25(5):590–601, September 1989.
[65] Genshiro Kitagawa. Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. Journal of Computational and Graphical Statistics, 5(4):1–25, December 1996.
[66] Charles H. Knapp and G. Clifford Carter. The generalized correlation method for estimation of time delay. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-24(4):320–327, August 1976.
[67] Esther Bettina Koller-Meier and Frank Ade. Tracking multiple objects using the Condensation algorithm. Journal of Robotics and Autonomous Systems, 34(2–3):93–105, February 2001.
[68] Trausti Kristjansson, Hagai Attias, and John Hershey. Single microphone source separation using high resolution signal reconstruction. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, volume 2, pages 817–820, Montreal, Canada, May 2004.
[69] Heinrich Kuttruff. Room acoustics. Spon Press, London – New York, fourth edition, 2000.
[70] Jun S. Liu and Rong Chen. Sequential Monte Carlo methods for dynamic systems. Journal of the American Statistical Association, 93(443):1032–1044, 1998.
[71] David Lubman. Fluctuations of sound with position in a reverberant room. Journal of the Acoustical Society of America, 44(6):1491–1502, 1968.
[72] Claude Marro, Yannick Mahieux, and K. Uwe Simmer. Analysis of noise reduction and dereverberation techniques based on microphone arrays with postfiltering. IEEE Transactions on Speech and Audio Processing, 6(3):240–259, May 1998.
[73] Rainer Martin. Small microphone arrays with postfilters for noise and acoustic echo reduction. In M. Brandstein and D. Ward, editors, Microphone arrays: signal processing techniques and applications, pages 255–279. Springer-Verlag, Berlin, 2001.
[74] Simon Maskell, Malcolm Rollason, Neil Gordon, and David Salmond. Efficient particle filtering for multiple target tracking with application to tracking in structured images. Image and Vision Computing, 21(10):931–939, September 2003.
[75] M. Matassoni, M. Omologo, A. Santarelli, and P. Svaizer. On the joint use of noise reduction and MLLR adaptation for in-car hands-free speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 289–292, Orlando, FL, USA, May 2002.
[76] Iain McCowan, Samy Bengio, Daniel Gatica-Perez, Guillaume Lathoud, Florent Monay, Darren Moore, Pierre Wellner, and Hervé Bourlard. Modeling human interaction in meetings. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 4, pages 748–751, Hong Kong, China, April 2003.
[77] Iain A. McCowan, Claude Marro, and Laurent Mauuary. Robust speech recognition using near-field superdirective beamforming with post-filtering. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, volume 3, pages 1723–1726, Istanbul, Turkey, June 2000.
[78] Kris Modrak. Details of audio recordings. Technical report, RSISE – Australian National University, September 2002. Available on the CD accompanying this document.
[79] Kris Modrak. The implementation of a Linux audio system. Technical report, RSISE – Australian National University, September 2002. Available on the CD accompanying this document.
[80] Kris Modrak. Multi-channel real time audio signal processing system and software. Technical report, RSISE – Australian National University, September 2002. Available on the CD accompanying this document.
[81] Kris Modrak. Implementation of a stereo colour tracker. Technical report, RSISE – Australian National University, January 2004. Available on the CD accompanying this document.
[82] Kris Modrak. Real time sound source localisation and visual tracking software. Technical report, RSISE – Australian National University, January 2004. Available on the CD accompanying this document.
[83] Peter Müller. Monte Carlo integration in general dynamic models. Contemporary Mathematics, 115:145–163, 1991.
[84] Sven Nordholm, Ingvar Claesson, and Nedelko Grbić. Optimal and adaptive microphone array for speech input in automobiles. In M. Brandstein and D. Ward, editors, Microphone arrays: signal processing techniques and applications, pages 307–329. Springer-Verlag, Berlin, 2001.
[85] Jean-Marc Odobez, Sileye Ba, and Daniel Gatica-Perez. An implicit motion likelihood for tracking with particle filters. In Proceedings of the British Machine Vision Conference, Norwich, UK, September 2003.
[86] M. Omologo and P. Svaizer. Talker localization and speech enhancement in a noisy environment using a microphone array based acquisition system. In Proceedings Eurospeech, pages 605–609, Berlin, Germany, September 1993.
[87] M. Omologo and P. Svaizer. Acoustic event localization using a crosspower-spectrum phase based technique. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 2, pages 273–276, Adelaide, SA, Australia, 1994.
[88] M. Omologo and P. Svaizer. Use of the cross-power-spectrum phase in acoustic event location. IEEE Transactions on Speech and Audio Processing, 5(3):288–292, 1997.
[89] Maurizio Omologo, Marco Matassoni, Piergiorgio Svaizer, and Diego Giuliani. Microphone array based speech recognition with different talker-array positions. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 227–230, April 1997.
[90] Matthew Orton and William Fitzgerald. A Bayesian approach to tracking multiple targets using sensor arrays and particle filters. IEEE Transactions on Signal Processing, 50(2):216–223, February 2002.
[91] Athanasios Papoulis. Probability, random variables, and stochastic processes. McGraw-Hill, New York, 1984.
[92] Michael K. Pitt and Neil Shephard. Filtering via simulation: auxiliary particle filters. Journal of the American Statistical Association, 94(446):590–599, June 1999.
[93] Roland Priemer. Introductory signal processing, volume 6 of Advanced Series in Electrical and Computer Engineering. World Scientific, Singapore – New Jersey, 1991.
[94] Biljana D. Radlović, Robert C. Williamson, and Rodney A. Kennedy. Equalization in an acoustic reverberant environment: robustness results. IEEE Transactions on Speech and Audio Processing, 8(3):311–319, May 2000.
[95] B. Ristic, A. Farina, D. Benvenuti, and M. S. Arulampalam. Performance bounds and comparison of nonlinear filters for tracking a ballistic object on re-entry. IEE Proceedings – Radar, Sonar and Navigation, 150(2):65–70, April 2003.
[96] Branko Ristic and M. Sanjeev Arulampalam. Tracking a manoeuvring target using angle-only measurements: algorithms and performance. Signal Processing, 83(6):1223–1238, June 2003.
[97] Branko Ristic, Sanjeev Arulampalam, and Neil Gordon. Beyond the Kalman filter: particle filters for tracking applications. Artech House, Boston – London, 2004.
[98] Branko Ristic, Sanjeev Arulampalam, and James McCarthy. Target motion analysis using range-only measurements: algorithms, performance and application to Ingara ISAR data. Technical Report DSTO-TR-1095, DSTO Electronics and Surveillance Research Laboratory, Salisbury, SA, Australia 5108, January 2001.
[99] Branko Ristic, Sanjeev Arulampalam, and Christian Musso. On Cramér-Rao bounds for sequential angle-only target motion analysis. In Proceedings of the Third Australian Workshop on Signal Processing Applications, Brisbane, QLD, Australia, December 2000.
[100] Branko Ristic, Sanjeev Arulampalam, and Christian Musso. The influence of communication bandwidth on target tracking with angle only measurements from two platforms. Signal Processing, 81(9):1801–1811, September 2001.
[101] Yong Rui and Yunqiang Chen. Better proposal distributions: object tracking using unscented particle filter. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages 786–793, Kauai Marriott, HI, USA, December 2001.
[102] Yong Rui and Dinei Florencio. Time delay estimation in the presence of correlated noise and reverberation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, volume 2, pages 133–136, Montreal, Canada, May 2004.
[103] Richard L. Scheaffer and James T. McClave. Probability and statistics for engineers. PWS-Kent Publishing, Boston, Massachusetts, 1990.
[104] M. R. Schroeder. Frequency-correlation functions of frequency responses in rooms. Journal of the Acoustical Society of America, 34(12):1819–1823, December 1962.
[105] B. W. Silverman. Density estimation for statistics and data analysis. Chapman and Hall, London – New York, 1986.
[106] Julius O. Smith and Jonathan S. Abel. Closed-form least-squares source location estimation from range-difference measurements. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(12):1661–1669, December 1987.
[107] S. Spors, N. Strobel, and R. Rabenstein. A multi-sensor object localization system. In Proceedings of the 6th International Fall Workshop on Vision, Modeling, and Visualization, pages 19–26, Stuttgart, Germany, November 2001.
[108] N. Strobel and R. Rabenstein. Classification of time delay estimates for robust speaker localization. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, volume 6, pages 3081–3084, Phoenix, AZ, USA, March 1999.
[109] Norbert Strobel, Thomas Meier, and Rudolf Rabenstein. Speaker localization using steered filtered-and-sum beamformers. In Proceedings of the Erlangen Workshop on Vision, Modeling, and Visualization, pages 195–202, Erlangen, Germany, 1999.
[110] James H. Taylor. The Cramér-Rao estimation error lower bound computation for deterministic nonlinear systems. IEEE Transactions on Automatic Control, AC-24(2):343–344, April 1979.
[111] Petr Tichavský, Carlos H. Muravchik, and Arye Nehorai. Posterior Cramér-Rao bounds for discrete-time nonlinear filtering. IEEE Transactions on Signal Processing, 46(5):1386–1396, May 1998.
[112] Dirk Van Compernolle. Future directions in microphone array processing. In M. Brandstein and D. Ward, editors, Microphone arrays: signal processing techniques and applications, pages 389–394. Springer-Verlag, Berlin, 2001.
[113] Rudolph van der Merwe, Arnaud Doucet, Nando de Freitas, and Eric Wan. The unscented particle filter. Technical Report CUED/F-INFENG/TR 380, Cambridge University Engineering Department, Cambridge CB2 1PZ, England, August 2000.
[114] Rudolph van der Merwe, Arnaud Doucet, Nando de Freitas, and Eric Wan. The unscented particle filter. Advances in Neural Information Processing Systems, NIPS13, November 2001.
[115] Harry L. Van Trees. Detection, estimation, and modulation theory (part I). John Wiley & Sons, Inc., New York, 1968.
[116] J. Vermaak and A. Blake. Nonlinear filtering for speaker tracking in noisy and reverberant environments. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 5, pages 3021–3024, Salt Lake City, UT, USA, May 2001.
[117] J. Vermaak, M. Gangnet, A. Blake, and P. Pérez. Sequential Monte Carlo fusion of sound and vision for speaker tracking. In Proceedings of the Eighth IEEE International Conference on Computer Vision, volume 1, pages 741–746, Vancouver, Canada, July 2001.
[118] Darren B. Ward. Nonlinear filtering of the generalized cross-correlation function for source localization. In Proceedings of the IEE Workshop on Nonlinear and Non-Gaussian Signal Processing, Peebles Hydro, UK, July 2002.
[119] Darren B. Ward, Rodney A. Kennedy, and Robert C. Williamson. Constant directivity beamforming. In M. Brandstein and D. Ward, editors, Microphone arrays: signal processing techniques and applications, pages 355–380. Springer-Verlag, Berlin, 2001.
[120] Darren B. Ward and Robert C. Williamson. Particle filter beamforming for acoustic source localization in a reverberant environment. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 2, pages 1777–1780, Orlando, FL, USA, May 2002.
[121] Richard V. Waterhouse. Statistical properties of reverberant sound fields. Journal of the Acoustical Society of America, 43(6):1436–1444, 1968.
[122] Ying Wu, Gang Hua, and Ting Yu. Switching observation models for contour tracking in clutter. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, pages 295–302, Madison, WI, USA, June 2003.
[123] Takeshi Yamada, Satoshi Nakamura, and Kiyohiro Shikano. Distant-talking speech recognition based on a 3-D Viterbi search using a microphone array. IEEE Transactions on Speech and Audio Processing, 10(2):48–56, February 2002.
[124] Ka Fai Cedric Yiu, Nedelko Grbić, Kok-Lay Teo, and Sven Nordholm. A new design method for broadband microphone arrays for speech input in automobiles. IEEE Signal Processing Letters, 9(7):222–224, July 2002.