Lessons in Digital Estimation Theory

PRENTICE-HALL SIGNAL PROCESSING SERIES

Alan V. Oppenheim, Editor

ANDREWS and HUNT  Digital Image Restoration
BRIGHAM  The Fast Fourier Transform
BURDIC  Underwater Acoustic System Analysis
CASTLEMAN  Digital Image Processing
COWAN and GRANT  Adaptive Filters
CROCHIERE and RABINER  Multirate Digital Signal Processing
DUDGEON and MERSEREAU  Multidimensional Digital Signal Processing
HAMMING  Digital Filters, 2nd ed.
HAYKIN, ED.  Array Signal Processing
JAYANT and NOLL  Digital Coding of Waveforms
KINO  Acoustic Waves: Devices, Imaging, and Analog Signal Processing
LEA, ED.  Trends in Speech Recognition
LIM, ED.  Speech Enhancement
MARPLE  Digital Spectral Analysis with Applications
McCLELLAN and RADER  Number Theory in Digital Signal Processing
MENDEL  Lessons in Digital Estimation Theory
OPPENHEIM, ED.  Applications of Digital Signal Processing
OPPENHEIM, WILLSKY, with YOUNG  Signals and Systems
OPPENHEIM and SCHAFER  Digital Signal Processing
RABINER and GOLD  Theory and Applications of Digital Signal Processing
RABINER and SCHAFER  Digital Processing of Speech Signals
ROBINSON and TREITEL  Geophysical Signal Analysis
STEARNS and DAVID  Signal Processing Algorithms
TRIBOLET  Seismic Applications of Homomorphic Signal Processing
WIDROW and STEARNS  Adaptive Signal Processing

Jerry M. Mendel
Department of Electrical Engineering
University of Southern California
Los Angeles, California

Prentice-Hall, Inc., Englewood Cliffs, New Jersey 07632

Library of Congress Cataloging-in-Publication Data

Mendel, Jerry M. (date)
Lessons in digital estimation theory.

Bibliography: p.
Includes index.
1. Estimation theory. I. Title.
QA276.8.M46 1986    511'.4    86-9365
ISBN 0-13-530809-7

To my parents, Eleanor and Alfred Mendel, and my wife, Letty Mendel

Editorial/production supervision: Gretchen K. Chenenko
Cover design: Lundgren Graphics
Manufacturing buyer: Gordon Osbourne

© 1987 by Prentice-Hall, Inc.
A Division of Simon & Schuster
Englewood Cliffs, New Jersey 07632

All rights reserved. No part of this book may be reproduced, in any form or by any means, without permission in writing from the publisher.

Printed in the United States of America

10 9 8 7 6 5 4 3 2 1

ISBN 0-13-530809-7

Prentice-Hall International (UK) Limited, London
Prentice-Hall of Australia Pty. Limited, Sydney
Prentice-Hall Canada Inc., Toronto
Prentice-Hall Hispanoamericana, S.A., Mexico
Prentice-Hall of India Private Limited, New Delhi
Prentice-Hall of Japan, Inc., Tokyo
Prentice-Hall of Southeast Asia Pte. Ltd., Singapore
Editora Prentice-Hall do Brasil, Ltda., Rio de Janeiro

Contents

PREFACE    xiii

LESSON 1    INTRODUCTION, COVERAGE, AND PHILOSOPHY    1
    Introduction 1; Coverage 2; Philosophy 5

LESSON 2    THE LINEAR MODEL    7
    Introduction 7; Examples 7; Notational Preliminaries 14; Problems 15

LESSON 3    LEAST-SQUARES ESTIMATION: BATCH PROCESSING    17
    Introduction 17; Number of Measurements 18; Objective Function and Problem Statement 18; Derivation of Estimator 19; Fixed and Expanding Memory Estimators 23; Scale Changes 23; Problems 25

LESSON 4    LEAST-SQUARES ESTIMATION: RECURSIVE PROCESSING    26
    Introduction 26; Recursive Least-Squares: Information Form 27; Matrix Inversion Lemma 30; Recursive Least-Squares: Covariance Form 30; Which Form to Use 31; Problems 32

LESSON 5    LEAST-SQUARES ESTIMATION: RECURSIVE PROCESSING (continued)    35
    Generalization to Vector Measurements 36; Cross-Sectional Processing 37; Multistage Least-Squares Estimators 38; Problems 42

LESSON 6    SMALL SAMPLE PROPERTIES OF ESTIMATORS    43
    Introduction 43; Unbiasedness 44; Efficiency 46; Problems 52

LESSON 7    LARGE SAMPLE PROPERTIES OF ESTIMATORS    54
    Introduction 54; Asymptotic Distributions 54; Asymptotic Unbiasedness 57; Consistency 57; Asymptotic Efficiency 60; Problems 61

LESSON 8    PROPERTIES OF LEAST-SQUARES ESTIMATORS    63
    Introduction 63; Small Sample Properties of Least-Squares Estimators 63; Large Sample Properties of Least-Squares Estimators 68; Problems 70

LESSON 9    BEST LINEAR UNBIASED ESTIMATION    71
    Introduction 71; Problem Statement and Objective Function 72; Derivation of Estimator 73; Comparison of θ̂_BLU(k) and θ̂_WLS(k) 74; Some Properties of θ̂_BLU(k) 75; Recursive BLUEs 78; Problems 79

LESSON 10    LIKELIHOOD    81
    Introduction 81; Likelihood Defined 81; Likelihood Ratio 84; Results Described by Continuous Distributions 85; Multiple Hypotheses 85; Problems 87

LESSON 11    MAXIMUM-LIKELIHOOD ESTIMATION    88
    Likelihood 88; Maximum-Likelihood Method and Estimates 89; Properties of Maximum-Likelihood Estimates 91; The Linear Model (X(k) deterministic) 92; A Log-Likelihood Function for an Important Dynamical System 94; Problems 97

LESSON 12    ELEMENTS OF MULTIVARIATE GAUSSIAN RANDOM VARIABLES    100
    Introduction 100; Univariate Gaussian Density Function 100; Multivariate Gaussian Density Function 101; Jointly Gaussian Random Vectors 101; The Conditional Density Function 102; Properties of Multivariate Gaussian Random Variables 104; Properties of Conditional Mean 104; Problems 106

LESSON 13    ESTIMATION OF RANDOM PARAMETERS: GENERAL RESULTS    108
    Introduction 108; Mean-Squared Estimation 109; Maximum a Posteriori Estimation 114; Problems 116

LESSON 14    ESTIMATION OF RANDOM PARAMETERS: THE LINEAR AND GAUSSIAN MODEL    118
    Introduction 118; Mean-Squared Estimator 118; Best Linear Unbiased Estimation, Revisited 121; Maximum a Posteriori Estimator 123; Problems 126

LESSON 15    ELEMENTS OF DISCRETE-TIME GAUSS-MARKOV RANDOM PROCESSES    128
    Introduction 128; Definitions and Properties of Discrete-Time Gauss-Markov Random Processes 128; A Basic State-Variable Model 131; Properties of the Basic State-Variable Model 133; Signal-to-Noise Ratio 137; Problems 138

LESSON 16    STATE ESTIMATION: PREDICTION    140
    Introduction 140; Single-Stage Predictor 140; A General State Predictor 142; The Innovations Process 146; Problems 147

LESSON 17    STATE ESTIMATION: FILTERING (THE KALMAN FILTER)    149
    Introduction 149; A Preliminary Result 150; The Kalman Filter 151; Observations About the Kalman Filter 153; Problems 158

LESSON 18    STATE ESTIMATION: FILTERING EXAMPLES    160
    Introduction 160; Examples 160; Problems 169

LESSON 19    STATE ESTIMATION: STEADY-STATE KALMAN FILTER AND ITS RELATIONSHIP TO A DIGITAL WIENER FILTER    170
    Introduction 170; Steady-State Kalman Filter 170; Single-Channel Steady-State Kalman Filter 173; Relationships Between the Steady-State Kalman Filter and a Finite Impulse Response Digital Wiener Filter 176; Comparisons of Kalman and Wiener Filters 181; Problems 182

LESSON 20    STATE ESTIMATION: SMOOTHING    183
    Three Types of Smoothers 183; Approaches for Deriving Smoothers 184; A Summary of Important Formulas 184; Single-Stage Smoother 184; Double-Stage Smoother 187; Single- and Double-Stage Smoothers as General Smoothers 189; Problems 192

LESSON 21    STATE ESTIMATION: SMOOTHING (GENERAL RESULTS)    193
    Introduction 193; Fixed-Interval Smoothers 193; Fixed-Point Smoothing 199; Fixed-Lag Smoothing 201; Problems 202

LESSON 22    STATE ESTIMATION: SMOOTHING APPLICATIONS    204
    Introduction 204; Minimum-Variance Deconvolution (MVD) 204; Steady-State MVD Filter 207; Relationship Between Steady-State MVD Filter and an Infinite Impulse Response Digital Wiener Deconvolution Filter 213; Maximum-Likelihood Deconvolution 215; Recursive Waveshaping 216; Problems 222

LESSON 23    STATE ESTIMATION FOR THE NOT-SO-BASIC STATE-VARIABLE MODEL    223
    Introduction 223; Biases 224; Correlated Noises 225; Colored Noises 227; Perfect Measurements: Reduced-Order Estimators 230; Final Remark 233; Problems 233

LESSON 24    LINEARIZATION AND DISCRETIZATION OF NONLINEAR SYSTEMS    236
    Introduction 236; A Dynamical Model 237; Linear Perturbation Equations 239; Discretization of a Linear Time-Varying State-Variable Model 242; Discretized Perturbation State-Variable Model 245; Problems 246

LESSON 25    ITERATED LEAST SQUARES AND EXTENDED KALMAN FILTERING    248
    Introduction 248; Iterated Least Squares 248; Extended Kalman Filter 249; Application to Parameter Estimation 255; Problems 256

LESSON 26    MAXIMUM-LIKELIHOOD STATE AND PARAMETER ESTIMATION    258
    Introduction 258; A Log-Likelihood Function for the Basic State-Variable Model 259; On Computing θ̂_ML 261; A Steady-State Approximation 264; Problems 269

LESSON 27    KALMAN-BUCY FILTERING    270
    Introduction 270; System Description 271; Notation and Problem Statement 271; The Kalman-Bucy Filter 272; Derivation of KBF Using a Formal Limiting Procedure 273; Derivation of KBF When Structure of the Filter Is Prespecified 275; Steady-State KBF 278; An Important Application for the KBF 280; Problems 281

LESSON A    SUFFICIENT STATISTICS AND STATISTICAL ESTIMATION OF PARAMETERS    282
    Introduction 282; Concept of Sufficient Statistics 282; Exponential Families of Distributions 284; Exponential Families and Maximum-Likelihood Estimation 287; Sufficient Statistics and Uniformly Minimum-Variance Unbiased Estimation 290; Problems 294

APPENDIX A    GLOSSARY OF MAJOR RESULTS

REFERENCES    300

INDEX

Preface
Estimation theory is widely used in many branches of science and engineering. No doubt, one could trace its origin back to ancient times, but Karl Friedrich Gauss is generally acknowledged to be the progenitor of what we now refer to as estimation theory. R. A. Fisher, Norbert Wiener, Rudolph E. Kalman, and scores of others have expanded upon Gauss' legacy, and have given us a rich collection of estimation methods and algorithms from which to choose. This book describes many of the important estimation methods and shows how they are interrelated.

Estimation theory is a product of need and technology. Gauss, for example, needed to predict the motions of planets and comets from telescopic measurements. This "need" led to the method of least squares. Digital computer technology has revolutionized our lives. It created the "need" for recursive estimation algorithms, one of the most important ones being the Kalman filter. Because of the importance of digital technology, this book presents estimation theory from a digital viewpoint. In fact, it is this author's viewpoint that estimation theory is a natural adjunct to classical digital signal processing. It produces time-varying digital filter designs that operate on random data in an optimal manner.

This book has been written as a collection of lessons. It is meant to be an introduction to the general field of estimation theory, and, as such, is not encyclopedic in content or in references. It can be used for self-study or in a one-semester course. At the University of Southern California, we have covered all of its contents in such a course, at the rate of two lessons a week. We have been doing this since 1978.


Approximately one half of the book is devoted to parameter estimation and the other half to state estimation. For many years there has been a tendency to treat state estimation as a stand-alone subject and even to treat parameter estimation as a special case of state estimation. Historically, this is incorrect. In the musical "Fiddler on the Roof," Tevye argues on behalf of tradition . . . "Tradition! . . ." Estimation theory also has its tradition, and it begins with Gauss and parameter estimation. In Lesson 2 we show that state estimation is a special case of parameter estimation, i.e., it is the problem of estimating random parameters when these parameters change from one time to the next. Consequently, the subject of state estimation flows quite naturally from the subject of parameter estimation.

Most of the book's important results are summarized in theorems and corollaries. In order to guide the reader to these results, they have been summarized for easy reference in Appendix A. Problems are included for most lessons, because this book is meant to be used as a textbook. The problems fall into two groups. The first group contains problems that ask the reader to fill in details which have been "left to the reader as an exercise." The second group contains problems that are related to the material in the lesson. They range from theoretical to computational problems.

This book is an outgrowth of a one-semester course on estimation theory, taught at the University of Southern California. Since 1978 it has been taught by four different people, who have encouraged me to convert the lecture notes into a book. I wish to thank Mostafa Shiva, Alan Laub, George Papavassilopoulos, and Rama Chellappa for their encouragement. Special thanks goes to Rama Chellappa, who provided supplementary Lesson A on the subject of sufficient statistics and statistical estimation of parameters. This lesson fits in very nicely just after Lesson 14.
While writing this text, the author had the benefit of comments and suggestions from many of his colleagues and students. I especially wish to acknowledge the help of Guan-Zhong Dai, Chong-Yung Chi, Phil Burns, Youngby Kim, Chung-Chin Lu, and Tom Hebert. Special thanks goes to Georgios Giannakis. The book would not be in its present form without their contributions. Additionally, the author wishes to thank Marcel Dekker, Inc. for permitting him to include material from Mendel, J. M., 1973, Discrete Techniques of Parameter Estimation: the Equation-Error Formulation, in Lessons 1-9, 11, 18, and 24; and Academic Press, Inc. for permitting him to include material from Mendel, J. M., Optimal Seismic Deconvolution: an Estimation-Based Approach, copyright © 1983 by Academic Press, Inc., in Lessons 11-17, 19-22, and 26.

Jerry M. Mendel
Los Angeles, California

Lesson 1

Introduction, Coverage, and Philosophy

INTRODUCTION

This book is all about estimation theory. It is useful, therefore, for us to understand the role of estimation in relation to the more global problem of modeling. Figure 1-1 decomposes modeling into four problems: representation, measurement, estimation, and validation. As Mendel (1973, pp. 2-4) states, "The representation problem deals with how something should be modeled. We shall be interested only in mathematical models. Within this class of models we need to know whether the model should be static or dynamic, linear or nonlinear, deterministic or random, continuous or discretized, fixed or varying, lumped or distributed, . . . , in the time domain or in the frequency domain, . . . , etc.

"In order to verify a model, physical quantities must be measured. We distinguish between two types of physical quantities: signals and parameters. Parameters express a relation between signals. . . .

"Not all signals and parameters are measurable. The measurement problem deals with which physical quantities should be measured and how they should be measured.

"The estimation problem deals with the determination of those physical quantities that cannot be measured from those that can be measured. We shall distinguish between the estimation of signals (i.e., states) and the estimation of parameters. Because a subjective decision must sometimes be made to classify a physical quantity as a signal or a parameter, there is some overlap between signal estimation and parameter estimation. . . .

Figure 1-1  Modeling Problem (reprinted from Mendel, 1973, p. 4, by courtesy of Marcel Dekker, Inc.).

"After a model has been completely specified, through choice of an appropriate mathematical representation, measurement of measurable signals, estimation of nonmeasurable signals, and estimation of its parameters, the model must be checked out. The validation problem deals with demonstrating confidence in the model. Often, statistical tests involving confidence limits are used to validate a model."

In this book we shall be interested in parameter estimation, state estimation, and combined state and parameter estimation. In Lesson 2 we provide six examples, each of which can be categorized either as a parameter or a state estimation problem. Here we just mention that: the problem of identifying the sampled values of a linear and time-invariant system's impulse response from input/output data is one of parameter estimation; the problem of reconstructing a state vector associated with a dynamical system, from noisy measurements, is one of state estimation (state estimates might be needed to implement a linear-quadratic-Gaussian optimal control law, to perform postflight data analysis, or to perform signal processing such as deconvolution).

COVERAGE

This book focuses on a wide range of estimation techniques that can be applied either to linear or to nonlinear models. Both parameter and state estimation techniques are treated. Some parameter estimation techniques are for deterministic parameters, whereas others are for random parameters; state estimation techniques are for random states. Table 1-1 summarizes the book's coverage.

TABLE 1-1  Estimation Techniques

I. LINEAR MODELS
    A. Parameter Estimation
        1. Deterministic parameters
            a. Least-squares (batch and recursive processing)
            b. Best linear unbiased estimation (BLUE)
            c. Maximum-likelihood
        2. Random parameters
            a. Mean-squared
            b. Maximum a posteriori
            c. BLUE
            d. Weighted least squares
    B. State Estimation
        1. Mean-squared prediction
        2. Mean-squared filtering (Kalman filter/Kalman-Bucy filter)
        3. Mean-squared smoothing
II. NONLINEAR MODELS
    A. Parameter Estimation
        Iterated least squares for deterministic parameters
    B. State Estimation
        Extended Kalman filter
    C. Combined State and Parameter Estimation
        1. Extended Kalman filter
        2. Maximum-likelihood

Four lessons (Lessons 3, 4, 5, and 8) are devoted to least-squares estimation because it is a very basic and important technique, and, under certain often-occurring conditions, other techniques reduce to it. Consequently, once we understand the nuances of least squares and have established that a different technique has reduced to least squares, we do not have to restudy the nuances of that technique. In order to study least-squares estimators fully, we must establish their small and large sample properties. What we mean by such properties is the subject of Lessons 6 and 7.

Having spent four lessons on least-squares estimation, we cover best linear unbiased estimation (BLUE) in one lesson, Lesson 9. We are able to do this because BLUE is a special case of least squares.

In order to set the stage for maximum-likelihood estimation, which is covered in Lesson 11, we describe the concept of likelihood and its relationship to probability in Lesson 10.

Lesson 12 provides a transition from our study of estimation of deterministic parameters to our study of estimation of random parameters. It provides much useful information about elements of Gaussian random variables. To some readers, this lesson may be a review of material already known to them.

General results for both mean-squared and maximum a posteriori estimation of random parameters are covered in Lesson 13. These results are specialized to the important case of the linear and Gaussian model in Lesson 14. Best linear unbiased and weighted least-squares estimation are also revisited in Lesson 14. Lesson 14 is quite important, because it gives conditions under which mean-squared, maximum a posteriori, best linear unbiased, and weighted least-squares estimates of random parameters are identical. Lesson A, which is a supplemental one, is on the subject of sufficient statistics and statistical estimation of parameters. It fits in very nicely after Lesson 14.

Lesson 15 provides a transition from our study of parameter estimation to our study of state estimation. It provides much useful information about elements of discrete-time Gauss-Markov random processes, and also establishes the basic state-variable model, and its statistical properties, for which we derive a wide variety of state estimators. To some readers, this lesson may be a review of material already known to them.

Lessons 16 through 22 cover state estimation for the Lesson 15 basic state-variable model. Prediction is treated in Lesson 16. The important innovations process is also covered in that lesson. Filtering is the subject of Lessons 17, 18, and 19. The mean-squared state filter, commonly known as the Kalman filter, is developed in Lesson 17. Five examples which illustrate some interesting numerical and theoretical aspects of Kalman filtering are presented in Lesson 18. Lesson 19 establishes a bridge between mean-squared estimation and mean-squared digital signal processing. It shows how the steady-state Kalman filter is related to a digital Wiener filter. The latter is widely used in digital signal processing.

Smoothing is the subject of Lessons 20, 21, and 22. Fixed-interval, fixed-point, and fixed-lag smoothers are developed in Lessons 20 and 21. Lesson 22 presents some applications which illustrate interesting numerical and theoretical aspects of fixed-interval smoothing. These applications are taken from the field of digital signal processing and include minimum-variance deconvolution, maximum-likelihood deconvolution, and recursive waveshaping.

Lesson 23 shows how to modify results given in Lessons 16, 17, 19, 20, and 21 from the basic state-variable model to a state-variable model that includes the following effects:

1. nonzero mean noise processes and/or known bias function in the measurement equation,
2. correlated noise processes,
3. colored noise processes, and
4. perfect measurements.

Lesson 24 provides a transition from our study of estimation for linear models to estimation for nonlinear models. Because many real-world systems are continuous-time in nature and nonlinear, this lesson explains how to linearize and discretize a nonlinear differential equation model.

Lesson 25 is devoted primarily to the extended Kalman filter (EKF), which is a form of the Kalman filter that has been "extended" to nonlinear dynamical systems of the type described in Lesson 24. The EKF is related to the method of iterated least squares (ILS), the major difference between the two being that the EKF is for dynamical systems whereas ILS is not. This lesson also shows how to apply the EKF to parameter estimation, in which case states and parameters can be estimated simultaneously, and in real time.

The problem of obtaining maximum-likelihood estimates of a collection of parameters that appears in the basic state-variable model is treated in Lesson 26. The solution involves state and parameter estimation, but calculations can only be performed off-line, after data from an experiment has been collected.

The Kalman-Bucy filter, which is the continuous-time counterpart to the Kalman filter, is derived from two different viewpoints in Lesson 27. We include this lesson because the Kalman-Bucy filter is widely used in linear stochastic optimal control theory.

PHILOSOPHY

The digital viewpoint is emphasized throughout this book. Our estimation algorithms are digital in nature; many are recursive. The reasons for the digital viewpoint are:

1. much real data is collected in a digitized manner, so it is in a form ready to be processed by digital estimation algorithms, and
2. the mathematics associated with digital estimation theory are simpler than those associated with continuous estimation theory.

Regarding (2), we mention that very little knowledge about random processes is needed to derive digital estimation algorithms, because digital (i.e., discrete-time) random processes can be treated as vectors of random variables. Much more knowledge about random processes is needed to design continuous-time estimation algorithms.

Suppose our underlying model is continuous-time in nature. We are faced with two choices: develop a continuous-time estimation theory and then implement the resulting estimators on a digital computer (i.e., discretize the continuous-time estimation algorithm), or discretize the model and develop a discrete-time (i.e., digital) estimation theory that leads to estimation algorithms readily implemented on a digital computer. If both approaches lead to algorithms that are implemented digitally, then we advocate the principle of simplicity for their development, and this leads us to adopt the second choice. For estimation, our modeling philosophy is, therefore: discretize the model at the front end of the problem.
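As a tiny illustration of this "discretize first" philosophy (a minimal sketch with made-up constants, not from the book), a scalar continuous-time model ẋ = -a x + b u can be discretized exactly over a sampling period T; after that, every subsequent computation is a purely digital recursion:

```python
import math

def discretize_first_order(a, b, T):
    """Exact zero-order-hold discretization of xdot = -a*x + b*u:
    x(k+1) = phi*x(k) + gamma*u(k), with phi = exp(-a*T)."""
    phi = math.exp(-a * T)
    gamma = (b / a) * (1.0 - phi)
    return phi, gamma

# Hypothetical constants, chosen only for illustration.
phi, gamma = discretize_first_order(a=2.0, b=1.0, T=0.1)

# One digital step now replaces integration of the differential equation.
x = 1.0
x_next = phi * x + gamma * 0.0   # u(k) = 0: pure decay over one sample period
assert abs(x_next - math.exp(-0.2)) < 1e-12
```

Once the model lives in discrete time, the estimator design never has to touch the differential equation again, which is exactly the simplicity being advocated.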

Estimation theory has a long and glorious history (e.g., see Sorenson, 1970); however, it has been greatly influenced by technology, especially the computer. Although much of estimation theory was developed in the mathematics, statistics, and control theory literatures, we shall adopt the following viewpoint toward that theory: estimation theory is the extension of classical digital signal processing to the design of digital filters that process uncertain data in an optimal manner. In fact, estimation algorithms are just filters that transform

input streams of numbers into output streams of numbers.

Most of classical digital filter design (e.g., Oppenheim and Schafer, 1975; Hamming, 1983; Peled and Liu, 1976) is concerned with designs associated with deterministic signals, e.g., low-pass and bandpass filters, and, over the years, specific techniques have been developed for such designs. The resulting filters are usually "fixed" in the sense that their coefficients do not change as a function of time. Estimation theory, on the other hand, leads to filter structures that are time-varying. These filters are designed (i.e., derived) using time-domain performance specifications (e.g., smallest error variance), and, as mentioned above, process random data in an optimal manner. Our philosophy about estimation theory is that it can be viewed as a natural adjunct to digital signal processing theory.

Example 1-1

At one time or another we have all used the sample mean to compute an "average." Suppose we are given a collection of k measured values of quantity x, namely x(1), x(2), . . . , x(k). The sample mean of these measurements, \bar{x}(k), is

\bar{x}(k) = \frac{1}{k} \sum_{j=1}^{k} x(j)    (1-1)

A recursive formula for the sample mean is obtained from (1-1), as follows:

\bar{x}(k+1) = \frac{1}{k+1} \sum_{j=1}^{k+1} x(j)
             = \frac{1}{k+1} \left[ \sum_{j=1}^{k} x(j) + x(k+1) \right]
             = \frac{k}{k+1}\,\bar{x}(k) + \frac{1}{k+1}\,x(k+1)    (1-2)

This recursive version of the sample mean is used for k = 0, 1, . . . by setting \bar{x}(0) = 0. Observe that the sample mean, as expressed in (1-2), is a time-varying recursive digital filter whose input is measurement x(k). In later lessons we show that the sample mean is also an optimal estimation algorithm; thus, although the reader may not have been aware of it, the sample mean, which he or she has been using since early schooldays, is an estimation algorithm. □
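As a quick illustration (a minimal Python sketch, not from the book), the recursion (1-2) can be implemented directly and checked against the batch formula (1-1):

```python
def sample_mean_batch(x):
    """Batch sample mean, equation (1-1): xbar(k) = (1/k) * sum of x(1)..x(k)."""
    return sum(x) / len(x)

def sample_mean_recursive(x):
    """Recursive sample mean, equation (1-2), initialized with xbar(0) = 0."""
    xbar = 0.0
    for k, x_next in enumerate(x):  # x_next plays the role of x(k+1)
        xbar = (k / (k + 1)) * xbar + x_next / (k + 1)
    return xbar

data = [2.0, 4.0, 9.0]  # hypothetical measurements, for illustration only
assert abs(sample_mean_recursive(data) - sample_mean_batch(data)) < 1e-12
assert abs(sample_mean_recursive(data) - 5.0) < 1e-12
```

At step k+1 the filter weights its previous output by k/(k+1) and the new measurement by 1/(k+1), which is exactly the time-varying recursive digital filter described above.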

Lesson 2

The Linear Model

In order to estimate unknown quantities (i.e., parameters or signals) from measurements and other a priori information, we must begin with model representations and express them in such a way that attention is focused on the explicit relationship between the unknown quantities and the measurements. Many familiar models are linear in the unknown quantities (denoted θ), and can be expressed as

Z(k) = X(k)\theta + V(k)    (2-1)

In this model, Z(k), which is N × 1, is called the measurement vector; θ, which is n × 1, is called the parameter vector; X(k), which is N × n, is called the observation matrix; and V(k), which is N × 1, is called the measurement noise vector. Usually, V(k) is random. By convention, the argument "k" of Z(k), X(k), and V(k) denotes the fact that the last measurement used to construct (2-1) is the kth. All other measurements occur "before" the kth.

Strictly speaking, (2-1) represents an "affine" transformation of parameter vector θ rather than a linear transformation. We shall, however, adhere to traditional estimation-theory literature by calling (2-1) a "linear model."

EXAMPLES

Some examples that illustrate the formation of (2-1) are given in this section. What distinguishes these examples from one another are the nature of and interrelationships between θ, X(k), and V(k). The following situations can occur.

A. θ is deterministic
    1. X(k) is deterministic.
    2. X(k) is random.
        a. X(k) and V(k) are statistically independent.
        b. X(k) and V(k) are statistically dependent.
B. θ is random
    1. X(k) is deterministic.
    2. X(k) is random.
        a. X(k) and V(k) are statistically independent.
        b. X(k) and V(k) are statistically dependent.

Example 2-1  Impulse Response Identification

It is well known that the output of a single-input single-output, linear, time-invariant, discrete-time system is given by the following convolution-sum relationship

y(k) = \sum_{i=-\infty}^{\infty} h(i)\,u(k-i)    (2-2)

where k = 1, 2, . . . , N, h(i) is the system's impulse response (IR), u(k) is its input, and y(k) is its output. If u(k) = 0 for k < 0, and the system is causal, so that h(i) = 0 for i ≤ 0, and h(i) = 0 for i > n, then

y(k) = \sum_{i=1}^{n} h(i)\,u(k-i)    (2-3)

Signal y(k) is measured by a sensor which is corrupted by additive measurement noise, v(k); i.e., we only have access to measurement z(k), where

z(k) = y(k) + v(k)    (2-4)

and k = 1, 2, . . . , N. We now collect these N measurements as follows:

\begin{bmatrix} z(1) \\ z(2) \\ \vdots \\ z(N) \end{bmatrix} =
\begin{bmatrix} u(0) & 0 & \cdots & 0 \\ u(1) & u(0) & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ u(N-1) & u(N-2) & \cdots & u(N-n) \end{bmatrix}
\begin{bmatrix} h(1) \\ h(2) \\ \vdots \\ h(n) \end{bmatrix} +
\begin{bmatrix} v(1) \\ v(2) \\ \vdots \\ v(N) \end{bmatrix},
\quad \text{i.e.,} \quad Z(N) = X(N-1)\theta + V(N)    (2-5)

Clearly, (2-5) is in the form of (2-1). In this application the n sampled values of the IR, h(i), play the role of unknown parameters, i.e., θ₁ = h(1), θ₂ = h(2), . . . , θₙ = h(n), and these parameters are deterministic. If input u(k) is deterministic and is known ahead of time (or can be measured) without error, then X(N-1) is deterministic, so that we are in case A.1. Often, however, u(k) is random, so that X(N-1) is random; but u(k) is in no way related to measurement noise v(k), so we are in case A.2.a. □

Example 2-2  Identification of the Coefficients of a Finite-Difference Equation

Suppose a linear, time-invariant, discrete-time system is described by the following nth-order finite-difference equation

y(k) + \alpha_1 y(k-1) + \cdots + \alpha_n y(k-n) = u(k-1)    (2-6)

This model is often referred to as an all-pole or autoregressive (AR) model. It occurs in many branches of engineering and science, including speech modeling, geophysical modeling, etc. Suppose, also, that N perfect measurements of signal y(k) are available. Parameters α₁, α₂, . . . , αₙ are unknown and are to be estimated from the data. To do this, we can rewrite (2-6) as

y(k) = -\alpha_1 y(k-1) - \cdots - \alpha_n y(k-n) + u(k-1)    (2-7)

and collect y(1), y(2), . . . , y(N) as we did in Example 2-1. Doing this, we obtain

\begin{bmatrix} y(N) \\ y(N-1) \\ \vdots \\ y(2) \\ y(1) \end{bmatrix} =
\begin{bmatrix} y(N-1) & y(N-2) & \cdots & y(N-n) \\ y(N-2) & y(N-3) & \cdots & y(N-n-1) \\ \vdots & \vdots & & \vdots \\ y(1) & y(0) & \cdots & 0 \\ y(0) & 0 & \cdots & 0 \end{bmatrix}
\begin{bmatrix} -\alpha_1 \\ -\alpha_2 \\ \vdots \\ -\alpha_n \end{bmatrix} +
\begin{bmatrix} u(N-1) \\ u(N-2) \\ \vdots \\ u(1) \\ u(0) \end{bmatrix},
\quad \text{i.e.,} \quad Z(N) = X(N-1)\theta + V(N-1)    (2-8)

which, again, is in the form of (2-1). In this example θ = col(-α₁, -α₂, . . . , -αₙ), and these parameters are deterministic. If input u(k-1) is deterministic, then the system's output y(k) will also be deterministic, so that both X(N-1) and V(N-1) are deterministic. This is a very special case of case A.1, because usually V is random. If, however, u(k-1) is random, then y(k) will also be random; but the elements of X(N-1) will now depend on those in V(N-1), because y(k) depends upon u(0), u(1), . . . , u(k-1). In this situation we are in case A.2.b. □
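To make the bookkeeping in (2-8) concrete, here is a minimal Python sketch (our own illustration, with made-up coefficients and input, not from the book): it simulates (2-7), stacks the rows of past outputs, and verifies that each measurement equals the corresponding row of X(N-1) times θ plus the input term:

```python
def simulate_ar(alphas, u, N):
    """Simulate y(k) = -a1*y(k-1) - ... - an*y(k-n) + u(k-1), as in (2-7).
    Initial conditions y(j) = 0 for j <= 0; u is the list [u(0), ..., u(N-1)]."""
    n = len(alphas)
    y = {j: 0.0 for j in range(1 - n, 1)}          # y(j) = 0 for j <= 0
    for k in range(1, N + 1):
        y[k] = -sum(alphas[i - 1] * y[k - i] for i in range(1, n + 1)) + u[k - 1]
    return y

# Hypothetical AR(2) coefficients and input, chosen only for illustration.
alphas = [0.5, -0.2]
u = [1.0, 0.5, -1.0, 2.0]                           # u(0), ..., u(3)
N, n = 4, 2

y = simulate_ar(alphas, u, N)
theta = [-a for a in alphas]                        # theta = col(-alpha_1, ..., -alpha_n)
rows = [[y[k - i] for i in range(1, n + 1)] for k in range(1, N + 1)]  # rows of X(N-1)

# Row k of (2-8): y(k) = [y(k-1), ..., y(k-n)] * theta + u(k-1).
for k in range(1, N + 1):
    rhs = sum(rows[k - 1][i] * theta[i] for i in range(n)) + u[k - 1]
    assert abs(y[k] - rhs) < 1e-12
```

Note that the elements of X(N-1) are built from past values of y, which themselves depend on the input sequence; this dependence is exactly why a random input puts us in case A.2.b.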


Example 2-3 Identification of the Initial Condition in an Unforced State Equation Model

Consider the problem of identifying the n × 1 initial condition vector x(0) of the linear, time-invariant, discrete-time system

x(k+1) = Φx(k)   (2-9)

from the N measurements z(1), z(2), ..., z(N), where

z(k) = h'x(k) + v(k)   (2-10)

The solution to (2-9) is

x(k) = Φ^k x(0)   (2-11)

so that

z(k) = h'Φ^k x(0) + v(k)   (2-12)

Collecting the N measurements, as before, we obtain

col(z(1), z(2), ..., z(N)) = col(h'Φ, h'Φ², ..., h'Φ^N) x(0) + col(v(1), v(2), ..., v(N))   (2-13)

Once again, we have been led to (2-1), and we are in Case A.1. □
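A small numerical sketch of this example (the values of Φ, h, x(0), and the noise level are illustrative assumptions): build the stacked rows h'Φ^k of (2-13) and recover x(0) by least squares.

```python
import numpy as np

rng = np.random.default_rng(1)
Phi = np.array([[0.9, 0.2],
                [0.0, 0.8]])               # illustrative 2 x 2 transition matrix
h = np.array([1.0, 1.0])
x0 = np.array([2.0, -1.0])
N = 30

rows, z = [], []
Phik = np.eye(2)
for k in range(1, N + 1):
    Phik = Phik @ Phi                      # Phi^k
    rows.append(h @ Phik)                  # kth row of the stacked matrix, h' Phi^k
    z.append(h @ Phik @ x0 + 0.05 * rng.normal())   # z(k), Eq. (2-12)
X = np.array(rows)

x0_hat, *_ = np.linalg.lstsq(X, np.array(z), rcond=None)
print(x0_hat)                              # close to [2, -1]
```

The recovery only works when the stacked matrix has full column rank n, i.e., when the pair (Φ, h') is observable.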

Example 2-4 State Estimation

State-variable models are widely used in control and communication theory, and in signal processing. Often, we need the entire state vector of a dynamical system in order to implement an optimal control law for it, or to implement a digital signal processor. Usually, we cannot measure the entire state vector, and our measurements are corrupted by noise. In state estimation, our objective is to estimate the entire state vector from a limited collection of noisy measurements. Here we consider the problem of estimating the n × 1 state vector x(k1), at k1 = 1, 2, ..., N, from the scalar measurements z(k), k = 1, 2, ..., N. The model for this example is

x(k+1) = Φx(k) + γu(k)   (2-14)
z(k) = h'x(k) + v(k)   (2-15)

We are keeping this example simple by assuming that the system is time-invariant and has only one input and one output; however, the results obtained in this example are easily generalized to time-varying and multichannel systems. If we try to collect our N measurements as before, we obtain

z(N) = h'x(N) + v(N)
z(N-1) = h'x(N-1) + v(N-1)
  ⋮
z(1) = h'x(1) + v(1)   (2-16)

Observe that a different (unknown) state vector appears in each of the N measurement equations; thus, there does not appear to be a common "θ" for the collected measurements. Appearances can sometimes be deceiving. So far, we have not made use of the state equation. Its solution can be expressed as

x(k) = Φ^{k-j} x(j) + Σ_{i=j+1}^{k} Φ^{k-i} γ u(i-1)   (2-17)

where k ≥ j+1. We now focus our attention on the value of x(k) at k = k1, where 1 ≤ k1 ≤ N. Using (2-17), we can express x(N), x(N-1), ..., x(k1+1) as an explicit function of x(k1), i.e.,

x(k) = Φ^{k-k1} x(k1) + Σ_{i=k1+1}^{k} Φ^{k-i} γ u(i-1)   (2-18)

where k = k1+1, k1+2, ..., N. In order to do the same for x(1), x(2), ..., x(k1-1), we solve (2-17) for x(j) and set k = k1:

x(j) = Φ^{j-k1} x(k1) - Σ_{i=j+1}^{k1} Φ^{j-i} γ u(i-1)   (2-19)

where j = k1-1, k1-2, ..., 2, 1. Using (2-18) and (2-19), we can reexpress (2-16) as

z(k) = h'Φ^{k-k1} x(k1) + h' Σ_{i=k1+1}^{k} Φ^{k-i} γ u(i-1) + v(k),   k = N, N-1, ..., k1+1
z(k1) = h'x(k1) + v(k1)
z(l) = h'Φ^{l-k1} x(k1) - h' Σ_{i=l+1}^{k1} Φ^{l-i} γ u(i-1) + v(l),   l = k1-1, k1-2, ..., 1   (2-20)

These N equations can now be collected together, to give

Z(N) = X(N, k1) x(k1) + M(N, k1) U(N-1) + V(N)   (2-21)

where U(N-1) collects the inputs and the exact structure of matrix M(N, k1) is not important to us at this point; in the form of (2-1), the effective noise vector is V(N, k1) = M(N, k1)U(N-1) + V(N). Observe that the state at k = k1 plays the role of parameter vector θ, and that both X and V are different for different values of k1. If x(0) and the system input u(k) are deterministic, then x(k) is deterministic for all k. In this case θ is deterministic, but V(N, k1) is a superposition of deterministic and random components. On the other hand, if either x(0) or u(k) is random, then θ is a vector of random parameters. This latter situation is the more usual one in state estimation. It corresponds to Case B.1. □

Example 2-5

A Nonlinear Model

Many of the estimation techniques that are described in this book in the context of linear model (2-1) can also be applied to the estimation of unknown signals or parameters in nonlinear models, when such models are suitably linearized. Suppose, for example, that

z(k) = f(θ, k) + v(k)   (2-22)

where k = 1, 2, ..., N, and the structure of nonlinear function f(θ, k) is known explicitly. To see the forest from the trees, in this example we assume θ is a scalar parameter. Let θ* denote a nominal value of θ, δθ = θ - θ*, and δz = z - z*, where

z*(k) = f(θ*, k)   (2-23)

Observe that the nominal measurements z*(k) can be computed once θ* is specified, because f(·, k) is assumed to be known. Using a first-order Taylor series expansion of f(θ, k) about θ = θ*, it is easy to show that

δz(k) ≈ [∂f(θ*, k)/∂θ*] δθ + v(k)   (2-24)

where k = 1, 2, ..., N. It is easy to see how to collect these N equations, to give

δZ(N) = X(N; θ*) δθ + V(N)   (2-25)

in which ∂f(θ*, k)/∂θ* is short for "∂f(θ, k)/∂θ evaluated at θ = θ*." Observe that X depends on θ*. We will discuss different ways for specifying θ* in Lesson 25. □
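The linearization in (2-24) can be exercised numerically. The nonlinearity f(θ, k) = exp(-θk), the true and nominal parameter values, and the noise level below are assumptions made purely for illustration; the perturbation δz(k) is approximately linear in δθ, so δθ can be estimated by ordinary least squares and added back to θ*.

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda th, k: np.exp(-th * k)          # assumed nonlinearity, for illustration only
df = lambda th, k: -k * np.exp(-th * k)    # its derivative with respect to theta

theta_true, theta_star = 0.50, 0.45        # true and nominal values (assumptions)
k = np.arange(1, 21)
z = f(theta_true, k) + 0.001 * rng.normal(size=k.size)

dz = z - f(theta_star, k)                  # delta-z(k) = z(k) - z*(k)
x = df(theta_star, k)                      # column of partials evaluated at theta*
dtheta_hat = (x @ dz) / (x @ x)            # scalar least squares on (2-24)
theta_hat = theta_star + dtheta_hat
print(theta_hat)                           # near 0.50
```

The residual error here comes from the neglected second-order Taylor term; iterating the procedure with the updated nominal value would shrink it further.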

Example 2-6 Deconvolution (Mendel, 1983b)

In Example 2-1 we showed how a convolutional model could be expressed as the linear model Z = Xθ + V. In that example we assumed that both input and output measurements were available, and we wanted to estimate the sampled values of the system's impulse response. Here we begin with the same convolutional model, written as

z(k) = Σ_{j=1}^{k} μ(j) h(k-j) + v(k)   (2-26)

where k = 1, 2, ..., N. Noisy measurements z(1), z(2), ..., z(N) are available to us, and we assume that we know the system's impulse response h(j). What is not known is the input to the system, μ(1), μ(2), ..., μ(N). Deconvolution is the signal processing procedure for removing the effects of h(j) and v(j) from the measurements so that one is left with an estimate of μ(j). In deconvolution we often assume that input μ(k) is white noise, but it is not necessarily Gaussian. This type of deconvolution problem occurs in reflection seismology. We assume further that

μ(k) = r(k) q(k)   (2-27)

where r(k) is white Gaussian noise with variance σr², and q(k) is a random event location sequence of zeros and ones (a Bernoulli sequence). Sequences r(k) and q(k) are assumed to be statistically independent. We now collect the N measurements, but in such a way that μ(1), μ(2), ..., μ(N) are treated as the unknown parameters. Doing this, we obtain the following linear deconvolution model:

col(z(N), z(N-1), ..., z(1)) = X(N-1) col(μ(1), μ(2), ..., μ(N)) + col(v(N), v(N-1), ..., v(1))   (2-28)

where

X(N-1) = [ h(N-1)   h(N-2)   ···   h(1)   h(0)
           h(N-2)   h(N-3)   ···   h(0)   0
             ⋮                              ⋮
           h(0)     0        ···   0      0 ]

We shall often refer to θ as μ. Using (2-27), we can also express θ = μ as

μ = Q_q r   (2-29)

where

r = col(r(1), r(2), ..., r(N))   (2-30)

and

Q_q = diag(q(1), q(2), ..., q(N))   (2-31)

In this case (2-28) can be expressed as

Z(N) = X(N-1) Q_q r + V(N)   (2-32)

When event locations q(1), q(2), ..., q(N) are known, then we can view (2-32) as a linear model for determining r. Regardless of which linear deconvolution model we use as our starting point for determining μ, we see that deconvolution corresponds to Case B.1. Put another way, we have shown that the design of a deconvolution signal processing filter is isomorphic to the problem of estimating random parameters in a linear model. Note, however, that the dimension of θ, which is N × 1, increases as the number of measurements increases. In all other examples θ was n × 1, where n is a fixed integer. We return to this point in Lesson 14, where we discuss convergence of estimates of μ to their true values. In Lesson 14, we shall develop minimum-variance and maximum-likelihood deconvolution filters. Equation (2-28) is the starting point for derivation of the former filter, whereas Equation (2-32) is the starting point for derivation of the latter filter. □
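The deconvolution model can be sketched numerically. The impulse response, event probability, and noise variance below are illustrative assumptions; the code builds the lower-triangular Toeplitz matrix of (2-28), generates a Bernoulli-Gaussian input as in (2-27), and, with the event locations treated as known, estimates r from the reduced model (2-32).

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200
h = np.array([1.0, -0.5, 0.25, -0.125])   # assumed impulse-response samples h(0)..h(3)

# Bernoulli-Gaussian input mu(k) = r(k) q(k), as in (2-27); Pr{q(k)=1} = 0.1 is assumed.
q = (rng.random(N) < 0.1).astype(float)
r = rng.normal(scale=2.0, size=N)
mu = r * q

# Lower-triangular Toeplitz matrix of (2-28): row for z(k) holds h(k-1), ..., h(0), 0, ...
X = np.zeros((N, N))
for k in range(1, N + 1):
    for j in range(1, k + 1):
        if k - j < len(h):
            X[k - 1, j - 1] = h[k - j]
Z = X @ mu + 0.01 * rng.normal(size=N)    # Z = X mu + V, rows ordered z(1), ..., z(N)

# With known event locations, (2-32) reduces the unknowns to r at the events.
idx = np.flatnonzero(q)
r_hat, *_ = np.linalg.lstsq(X[:, idx], Z, rcond=None)
print(np.max(np.abs(r_hat - r[idx])))     # small
```

Note how knowing q(k) collapses an N-dimensional parameter vector into one whose dimension equals the number of events, which is what makes (2-32) attractive as a starting point.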

NOTATIONAL PRELIMINARIES

Equation (2-1) can be interpreted as a data-generating model; it is a mathematical representation that is associated with the data. Parameter vector θ is assumed to be unknown and is to be estimated using Z(k), X(k), and possibly other a priori information. We use θ̂(k) to denote the estimate of constant parameter vector θ. Argument k in θ̂(k) denotes the fact that the estimate is based on measurements up to and including the kth. In our preceding examples, we would use the following notation for θ̂(k):

Example 2-1 [see (2-5)]: θ̂(N) with components ĥ(i|N)
Example 2-2 [see (2-8)]: θ̂(N)
Example 2-3 [see (2-13)]: x̂(0|N)
Example 2-4 [see (2-21)]: x̂(k1|N)
Example 2-5 [see (2-25)]: δθ̂(N)
Example 2-6 [see (2-28)]: μ̂(N) with components μ̂(i|N)

The notation used in Examples 1, 3, 4, and 6 is a bit more complicated than that used in the other examples, because we must indicate the time point at which we are estimating the quantity of interest (e.g., k1 or i) as well as the last data point used to obtain this estimate (e.g., N). We often read x̂(k1|N) as "the estimate of x(k1) conditioned on N" or as "x hat at k1 conditioned on N." In state estimation (or deconvolution), three situations are possible, depending upon the relative relationship of N to k1. When N < k1 we are estimating a future value of x(k1), and we refer to this as a predicted estimate. When N = k1 we are using all past measurements and the most recent measurement to estimate x(k1). The result is referred to as a filtered estimate. Finally, when N > k1 we are estimating an earlier value of x(k1) using past, present, and future measurements. Such an estimate is referred to as a smoothed or interpolated estimate. Prediction and filtering can be done in real time, whereas smoothing can never be done in real time. We will see that the impulse responses of predictors and filters are causal, whereas the impulse response of a smoother is noncausal.

We use θ̃(k) to denote estimation error, i.e.,

θ̃(k) = θ - θ̂(k)   (2-33)

In state estimation, x̃(k1|N) denotes state estimation error, and x̃(k1|N) = x(k1) - x̂(k1|N). In deconvolution, μ̃(i|N) is defined in a similar manner. Very often we use the following estimation model for Z(k):

Ẑ(k) = X(k)θ̂(k)   (2-34)

To obtain (2-34) from (2-1), we assume that V(k) is zero-mean random noise that cannot be measured. In some applications (e.g., Example 2-2), Ẑ(k) represents a predicted value of Z(k). Associated with Ẑ(k) is the error Z̃(k), where

Z̃(k) = Z(k) - Ẑ(k)   (2-35)

satisfies the equation

Z̃(k) = X(k)θ̃(k) + V(k)   (2-36)

In those applications where Ẑ(k) is a predicted value of Z(k), Z̃(k) is known as a prediction error. Other names for Z̃(k) are equation error and measurement residual.

In the rest of this book we develop specific structures for θ̂(k). These structures are referred to as estimators. Estimates are obtained whenever data is processed by an estimator. Estimator structures are associated with specific estimation techniques, and these techniques can be classified according to the natures of θ and X(k), and what a priori information is assumed known about noise vector V(k). See Lesson 1 for an overview of all the different estimation techniques that are covered in this book.

In the rest of this book we develop specific structures for 6(k). These structures are referred to as estimators. Estimates are obtained whenever data is processedby an estimator. Estimator structures are associatedwith specific estimation techniques, and these techniquescan be classified according to the natures of 8 and X(k), and what a priori information is assumedknown about noise vector V(k). See Lesson 1 for an overview of all the different estimation techniquesthat are covered in this book. PROBLEMS -1. Suppose r(k) = @I+ &k + v(k), where z(1) = 0.2, z(2) = 1.4, z(3) = 3.6, z(4) = 7.5, and z (5) = 10.2. What are the explicit structures of %(5) and X(S)? 2-2. According to thermodynamic principles, pressure P and volume V of a given massof gasare related by PVy = C, where y and Care constants. Assume that IV measurementsof P and V are available. Explain how to obtain a linear model for estimation of parameters y and In C. 2-3. (Mendel, 1973, Exercise 1-16(a), pg. 46). Supposewe know that a relationship exists between y and x1, x2, . . . , X, of the form Y =exp(alxl+a2x2+

-

+a,,u,)

Lesson 2

We desire to estima?ea~, ff2,. . . , fznfrom measUremen& of y and x = co1 (x1, xfi). Explain how to do this. (Mendel, 1973, Exercise l-17, pp. 46-47). The efficiency of a jet engine may be viewed as a linear combination of tinctions of inlet pressurep ($ and operating temperature T(i); that is to say,

lesson

3

x2,..*,

where the structures of fI? j& and fj are known a priori and v(f) represents modeling error of known mean and variance. From tests on the engine a table of values of E(l), p (t), and T(t) are given at discrete vaiues of L Explain how Cl, C2, CS,and CGare estimated from these data.

feast-Squares Estimation: Batch Processing

INTRODUCTION

The method of least squares dates back to Karl Gauss around 1795, and is the cornerstone for most estimation theory, both classical and modern. It was invented by Gauss at a time when he was interested in predicting the motion of planets and comets using telescopic measurements. The motions of these bodies can be completely characterized by six parameters. The estimation problem that Gauss considered was one of inferring the values of these parameters from the measurement data. We shall study least-squares estimation from two points of view: the classical batch-processing approach, in which all the measurements are processed together at one time, and the more modern recursive-processing approach, in which measurements are processed only a few (or even one) at a time. The recursive approach has been motivated by today's high-speed digital computers; however, as we shall see, the recursive algorithms are outgrowths of the batch algorithms. In fact, as we enter the era of very large scale integration (VLSI) technology, it may well be that VLSI implementations of the batch algorithms are faster than digital computer implementations of the recursive algorithms.

The starting point for the method of least squares is the linear model

Z(k) = X(k)θ + V(k)   (3-1)

where Z(k) = col(z(k), z(k-1), ..., z(k-N+1)), z(k) = h'(k)θ + v(k), and the estimation model for Z(k) is

Ẑ(k) = X(k)θ̂(k)   (3-2)


We denote the least-squares estimator of θ as θ̂_LS(k) and the weighted least-squares estimator as θ̂_WLS(k). In this lesson and the next two we shall determine explicit structures for these estimators.


NUMBER OF MEASUREMENTS

Suppose that θ contains n parameters and Z(k) contains N measurements. If N < n we have fewer measurements than unknowns, and (3-1) is an underdetermined system of equations that does not lead to unique or very meaningful values for θ1, θ2, ..., θn. If N = n, we have exactly as many measurements as unknowns, and, as long as the n measurements are linearly independent, so that X^{-1}(k) exists, we can solve (3-1) for θ as

θ = X^{-1}(k)Z(k) - X^{-1}(k)V(k)   (3-3)

Because we cannot measure V(k), it is usually neglected in the calculation of (3-3). For small amounts of noise this may not be a bad thing to do, but for even moderate amounts of noise it will be quite bad. Finally, if N > n we have more measurements than unknowns, so that (3-1) is an overdetermined system of equations. The extra measurements can be used to offset the effects of the noise; i.e., they let us "filter" the data. Only this last case is of real interest to us.

OBJECTIVE FUNCTION AND PROBLEM STATEMENT

Our method for obtaining θ̂(k) is based on minimizing the objective function

J[θ̂(k)] = Z̃'(k)W(k)Z̃(k)   (3-4)

where

Z̃(k) = col[z̃(k), z̃(k-1), ..., z̃(k-N+1)]   (3-5)

and weighting matrix W(k) must be symmetric and positive definite, for reasons explained below. No general rules exist for how to choose W(k). The most common choice is a diagonal matrix such as W(k) = diag[μ^{k-N+1}, μ^{k-N+2}, ..., μ^k]. When |μ| < 1, recent measurements are weighted more heavily than past ones. Such a choice for W(k) provides the weighted least-squares estimator with an "aging" or "forgetting" factor. When |μ| > 1, recent measurements are weighted less heavily than past ones. Finally, if μ = 1, so that W(k) = I, then all measurements are weighted by the same amount. When W(k) = I, θ̂(k) = θ̂_LS(k), whereas for all other W(k), θ̂(k) = θ̂_WLS(k). Our objective is to determine the θ̂_WLS(k) that minimizes J[θ̂(k)].

DERIVATION OF ESTIMATOR

To begin, we express (3-4) as an explicit function of θ̂(k), using (3-2):

J[θ̂(k)] = Z̃'(k)W(k)Z̃(k)
        = [Z(k) - Ẑ(k)]'W(k)[Z(k) - Ẑ(k)]
        = [Z(k) - X(k)θ̂(k)]'W(k)[Z(k) - X(k)θ̂(k)]
        = Z'(k)W(k)Z(k) - 2Z'(k)W(k)X(k)θ̂(k) + θ̂'(k)X'(k)W(k)X(k)θ̂(k)   (3-6)

Next, we take the vector derivative of J[θ̂(k)] with respect to θ̂(k); but, before doing this, recall from vector calculus that if m and b are two n × 1 nonzero vectors, and A is an n × n symmetric matrix, then

∂(b'm)/∂m = b   (3-7)

and

∂(m'Am)/∂m = 2Am   (3-8)

Using these formulas, we find that

∂J[θ̂(k)]/∂θ̂(k) = -2[Z'(k)W(k)X(k)]' + 2X'(k)W(k)X(k)θ̂(k)   (3-9)

Setting ∂J[θ̂(k)]/∂θ̂(k) = 0, we obtain the following formula for θ̂_WLS(k):

θ̂_WLS(k) = [X'(k)W(k)X(k)]^{-1} X'(k)W(k)Z(k)   (3-10)

Note, also, that

θ̂_LS(k) = [X'(k)X(k)]^{-1} X'(k)Z(k)   (3-11)
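Formula (3-10) can be exercised on synthetic data. One practical note: rather than inverting X'WX explicitly, the square roots of the weights can be absorbed into the data and the resulting least-squares problem solved by an orthogonal factorization; the two routes agree here. All values below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
N, n = 50, 3
X = rng.normal(size=(N, n))
theta = np.array([1.0, -2.0, 0.5])
Z = X @ theta + 0.1 * rng.normal(size=N)
w = np.linspace(1.0, 2.0, N)               # diagonal of W(k), an illustrative choice

# "Theoretical" route, Eq. (3-10): invert X'WX directly.
W = np.diag(w)
theta_inv = np.linalg.inv(X.T @ W @ X) @ (X.T @ W @ Z)

# Numerically preferred route: scale rows by sqrt(w), then solve via QR-based lstsq.
sw = np.sqrt(w)
theta_qr, *_ = np.linalg.lstsq(sw[:, None] * X, sw * Z, rcond=None)
print(np.max(np.abs(theta_inv - theta_qr)))   # the two routes agree here
```

The equivalence holds because minimizing (Z - Xθ)'W(Z - Xθ) is the same as ordinary least squares on the scaled pair (√W X, √W Z).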

Comments

1. For (3-10) to be valid, X'(k)W(k)X(k) must be invertible. This is always true when W(k) is positive definite, as assumed, and X(k) is of maximum rank.
2. How do we know that θ̂_WLS(k) minimizes J[θ̂(k)]? We compute ∂²J[θ̂(k)]/∂θ̂²(k) and see if it is positive definite [which is the vector calculus analog of the scalar calculus requirement that θ̂ minimizes J(θ̂) if dJ(θ̂)/dθ̂ = 0 and d²J(θ̂)/dθ̂² is positive]. Doing this, we see that

∂²J[θ̂(k)]/∂θ̂²(k) = 2X'(k)W(k)X(k) > 0   (3-12)

because X'(k)W(k)X(k) is invertible.
3. Estimator θ̂_WLS(k) processes the measurements Z(k) linearly; thus, it is referred to as a linear estimator. It processes the data contained in X(k) in a very complicated and nonlinear manner.
4. When (3-9) is set equal to zero, we obtain the following system of normal equations:

[X'(k)W(k)X(k)] θ̂_WLS(k) = X'(k)W(k)Z(k)   (3-13)

This is a system of n linear equations in the n components of θ̂_WLS(k). In practice, one does not compute θ̂_WLS(k) using (3-10), because computing the inverse of X'(k)W(k)X(k) is fraught with numerical difficulties. Instead, the normal equations are solved using stable algorithms from numerical linear algebra that involve orthogonal transformations (see, e.g., Stewart, 1973; Bierman, 1977; and Dongarra et al., 1979). Because it is not the purpose of this book to go into details of numerical linear algebra, we leave it to the reader to pursue this important subject. Based on this discussion, we must view (3-10) as a useful "theoretical" formula and not as a useful computational formula. Remember that this is a book on estimation theory, so for our purposes, theoretical formulas are just fine.
5. Equation (3-13) can also be reexpressed as

X'(k)W(k)Z̃(k) = 0   (3-14)

which can be viewed as an orthogonality condition between Z̃(k) and W(k)X(k). Orthogonality conditions play an important role in estimation theory. We shall see many more examples of such conditions throughout this book.
6. Estimates obtained from (3-10) will be random! This is because Z(k) is random and, in some applications, even X(k) is random. It is therefore instructive to view (3-10) as a complicated transformation of vectors or matrices of random variables into the vector of random variables θ̂_WLS(k). In later lessons, when we examine the properties of θ̂_WLS(k), these will be statistical properties, because of the random nature of θ̂_WLS(k).

Example 3-1 (Mendel, 1973, pp. 86-87)

Suppose we wish to calibrate an instrument by making a series of uncorrelated measurements on a constant quantity. Denoting the constant quantity as θ, our measurement equation becomes

z(k) = θ + v(k)   (3-15)

where k = 1, 2, ..., N. Collecting these N measurements, we have

Z(N) = col(1, 1, ..., 1) θ + V(N)   (3-16)

Clearly, X = col(1, 1, ..., 1); hence,

θ̂_LS(N) = [X'X]^{-1} X'Z(N) = (1/N) Σ_{k=1}^{N} z(k)   (3-17)

which is the sample mean of the N measurements. We see, therefore, that the sample mean is a least-squares estimator. □
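Example 3-1 in code (the constant value and sample size below are synthetic assumptions): with X = col(1, 1, ..., 1), the least-squares solution coincides with the sample mean.

```python
import numpy as np

rng = np.random.default_rng(5)
theta, N = 10.0, 1000                      # illustrative constant and sample size
z = theta + rng.normal(size=N)             # z(k) = theta + v(k), Eq. (3-15)

X = np.ones((N, 1))                        # X = col(1, 1, ..., 1)
theta_ls = np.linalg.lstsq(X, z, rcond=None)[0].item()
print(theta_ls, z.mean())                  # identical: the LSE is the sample mean
```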

Example 3-2 (Mendel, 1973)

Figure 3-1 depicts simplified third-order pitch-plane dynamics for a typical, high-performance, aerodynamically controlled aerospace vehicle. Cross-coupling and body-bending effects are neglected. Normal acceleration control is considered, with feedback on normal acceleration and angle-of-attack rate. Stefani (1967) shows how to choose the system gains K_Ni, K_α̇, and K_Na, in (3-18) through (3-21), as functions of design constants C1 and C2 and the dynamic parameters Mα, Mδ, and Zα. Stefani assumes 1845 Zα/ẋ is relatively small, and chooses C1 = 1400 and C2 = 14,000. The closed-loop response then resembles that of a second-order system with a bandwidth of 2 Hz and a damping ratio of 0.6 that responds to a step command of input acceleration with zero steady-state error.

In general, Mα, Mδ, and Zα are dynamic parameters, and all vary through a large range of values. Also, Mα may be positive (unstable vehicle) or negative (stable vehicle). System response must remain the same for all values of Mα, Mδ, and Zα; thus, it is necessary to estimate these parameters so that K_Ni, K_α̇, and K_Na can be adapted to keep C1 and C2 invariant at their designed values. For present purposes we shall assume that Mα, Mδ, and Zα are frozen at specific values. From Fig. 3-1,

θ̈(t) = Mα α(t) + Mδ δ(t)   (3-22)
Na(t) = Zα α(t)   (3-23)

Figure 3-1 Pitch-plane dynamics and nomenclature: Ni, input normal acceleration along the negative Z axis; K_Ni, gain on Ni; δ, control-surface deflection; Mδ, control-surface effectiveness; θ̈, rigid-body acceleration; α, angle of attack; Mα, aerodynamic moment effectiveness; K_α̇, control gain on α̇; Zα, normal acceleration force coefficient; ẋ, axial velocity; Na, system-achieved normal acceleration along the negative Z axis; K_Na, control gain on Na (reprinted from Mendel, 1973, p. 33, by courtesy of Marcel Dekker, Inc.).

Our attention is directed at the estimation of Mα and Mδ in (3-22). We leave it as an exercise for the reader to explore the estimation of Zα in (3-23). Our approach will be to estimate Mα and Mδ from the equation

θ̈m(k) = Mα α(k) + Mδ δ(k) + vθ̈(k)   (3-24)

where θ̈m(k) denotes the measured value of θ̈(k), which is corrupted by measurement noise vθ̈(k). We shall assume (somewhat unrealistically) that α(k) and δ(k) can both be measured perfectly. The concatenated measurement equation for N measurements is

col(θ̈m(k), θ̈m(k-1), ..., θ̈m(k-N+1)) =

[ α(k)       δ(k)
  α(k-1)     δ(k-1)
    ⋮          ⋮
  α(k-N+1)   δ(k-N+1) ] col(Mα, Mδ) + col(vθ̈(k), vθ̈(k-1), ..., vθ̈(k-N+1))   (3-25)

Hence, the least-squares estimates of Mα and Mδ are

col(M̂α(k), M̂δ(k)) =

[ Σ_{j=0}^{N-1} α²(k-j)        Σ_{j=0}^{N-1} α(k-j)δ(k-j) ]⁻¹ [ Σ_{j=0}^{N-1} α(k-j)θ̈m(k-j) ]
[ Σ_{j=0}^{N-1} α(k-j)δ(k-j)   Σ_{j=0}^{N-1} δ²(k-j)      ]   [ Σ_{j=0}^{N-1} δ(k-j)θ̈m(k-j) ]   (3-26)

□
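The 2 × 2 system (3-26) can be checked numerically. The values of Mα and Mδ and the excitation signals below are illustrative assumptions, not Stefani's design values.

```python
import numpy as np

rng = np.random.default_rng(6)
Ma, Md = -4.0, 7.0                 # illustrative frozen values of M_alpha, M_delta
N = 400
alpha = rng.normal(size=N)         # alpha(k) and delta(k) assumed measured perfectly
delta = rng.normal(size=N)
acc = Ma * alpha + Md * delta + 0.2 * rng.normal(size=N)   # noisy pitch-acceleration data

# The 2 x 2 normal-equation system of (3-26)
A = np.array([[alpha @ alpha, alpha @ delta],
              [alpha @ delta, delta @ delta]])
b = np.array([alpha @ acc, delta @ acc])
Ma_hat, Md_hat = np.linalg.solve(A, b)
print(Ma_hat, Md_hat)              # near -4 and 7
```

The estimates are accurate here because α(k) and δ(k) are persistently exciting and nearly uncorrelated; strongly correlated inputs would make the 2 × 2 matrix ill-conditioned.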

FIXED AND EXPANDING MEMORY ESTIMATORS

Estimator θ̂_WLS(k) uses the measurements z(k-N+1), z(k-N+2), ..., z(k). When N is fixed ahead of time, θ̂_WLS(k) uses a fixed window of measurements, a window of length N, and θ̂_WLS(k) is then referred to as a fixed-memory estimator. A second approach for choosing N is to set it equal to k; then θ̂_WLS(k) uses the measurements z(1), z(2), ..., z(k). In this case θ̂_WLS(k) uses an expanding window of measurements, a window of length k, and θ̂_WLS(k) is then referred to as an expanding-memory estimator.
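For the constant-parameter model of Example 3-1, the two windowing policies differ only in which measurements enter the sample mean; the data below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(7)
z = 5.0 + 0.5 * rng.normal(size=100)   # a constant parameter observed in noise

k, N = 80, 20
fixed_memory = z[k - N:k].mean()       # fixed window of length N ending at k
expanding_memory = z[:k].mean()        # expanding window z(1), ..., z(k)
print(fixed_memory, expanding_memory)  # both near 5.0
```

For a truly constant parameter, the expanding window is statistically more efficient; the fixed window trades accuracy for the ability to track a slowly drifting parameter.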

SCALE CHANGES

Least-squares (LS) estimates may not be invariant under changes of scale. One way to circumvent this difficulty is to use normalized data. For example, assume that observers A and B are observing a process, but observer A reads the measurements in one set of units and B in another. Let M be a symmetric matrix of scale factors relating A to B, and let Z_A(k) and Z_B(k) denote the total measurement vectors of A and B, respectively, which means that Z_B(k) = M Z_A(k). Then

Z_A(k) = X_A(k)θ + V_A(k)   (3-27)

and

Z_B(k) = M Z_A(k) = M X_A(k)θ + M V_A(k)   (3-28)

Let θ̂_A,WLS(k) and θ̂_B,WLS(k) denote the WLSE's associated with observers A and B, respectively; then θ̂_B,WLS(k) = θ̂_A,WLS(k) if

W_B(k) = M^{-1} W_A(k) M^{-1}   (3-29)

It seems a bit peculiar, though, to have different weighting matrices for the two WLSE's. In fact, if we begin with θ̂_A,LS(k), then it is impossible to obtain θ̂_B,LS(k) such that θ̂_B,LS(k) = θ̂_A,LS(k). The reason for this is simple. To obtain θ̂_A,LS(k), we set W_A(k) = I, in which case (3-29) reduces to W_B(k) = (M^{-1})² ≠ I.

Next, let N_A and N_B denote symmetric normalization matrices for Z_A(k) and Z_B(k), respectively. We shall assume that our data is always normalized to the same set of numbers, i.e., that

N_A Z_A(k) = N_B Z_B(k)   (3-30)

Observe that

N_A Z_A(k) = N_A X_A(k)θ + N_A V_A(k)   (3-31)

and

N_B Z_B(k) = N_B M X_A(k)θ + N_B M V_A(k)   (3-32)

From (3-30), (3-31), and (3-32), we see that

N_A = N_B M   (3-33)

Applying the WLSE formula (3-10) to the normalized data, we have

θ̂_A,WLS(k) = (X_A' N_A W_A N_A X_A)^{-1} X_A' N_A W_A N_A Z_A(k)   (3-34)

We now find that

θ̂_B,WLS(k) = (X_A' M N_B W_B N_B M X_A)^{-1} X_A' M N_B W_B N_B M Z_A(k)   (3-35)

Substituting (3-33) into (3-35), we then find

θ̂_B,WLS(k) = (X_A' N_A W_B N_A X_A)^{-1} X_A' N_A W_B N_A Z_A(k)   (3-36)

Comparing (3-36) and (3-34), we conclude that θ̂_B,WLS(k) = θ̂_A,WLS(k) if W_B(k) = W_A(k). This is precisely the result we were looking for. It means that, under proper normalization, θ̂_B,WLS(k) = θ̂_A,WLS(k) and, as a special case, θ̂_B,LS(k) = θ̂_A,LS(k).

PROBLEMS

3-1. Derive the formula for θ̂_WLS(k) by completing the square on the right-hand side of the expression for J[θ̂(k)] in (3-6).
3-2. Here we explore the estimation of Zα in (3-23). Assume that N noisy measurements of Na(k) are available, i.e., Nam(k) = Zα α(k) + v_Na(k). What is the formula for the least-squares estimator of Zα?
3-3. Here we explore the simultaneous estimation of Mα, Mδ, and Zα in (3-22) and (3-23). Assume that N noisy measurements of θ̈(k) and Na(k) are available, i.e., θ̈m(k) = Mα α(k) + Mδ δ(k) + vθ̈(k) and Nam(k) = Zα α(k) + v_Na(k). Determine the least-squares estimator of Mα, Mδ, and Zα. Is this estimator different from M̂α and M̂δ obtained just from θ̈m(k) measurements, and Ẑα obtained just from Nam(k) measurements?
3-4. In a curve-fitting problem we wish to fit a given set of data z(1), z(2), ..., z(N) by the approximating function

ẑ(k) = Σ_{j=1}^{n} θj φj(k)

where φj(k) (j = 1, 2, ..., n) are a set of prespecified basis functions. (a) Obtain a formula for θ̂_LS(N) that is valid for any set of basis functions. (b) The simplest approximating function to a set of data is the straight line. In this case ẑ(k) = θ1 + θ2 k, which is known as the least-squares or regression line. Obtain closed-form formulas for θ̂1,LS(N) and θ̂2,LS(N).
3-5. Suppose z(k) = θ1 + θ2 k, where z(1) = 3 miles per hour and z(2) = 7 miles per hour. Determine θ̂1,LS and θ̂2,LS based on these two measurements. Next, redo these calculations by scaling z(1) and z(2) to the units of feet per second. Are the least-squares estimates obtained from these two calculations the same? Use the results developed in the section entitled "Scale Changes" to explain what has happened here.
3-6. (a) Under what conditions on scaling matrix M is scale invariance preserved for a least-squares estimator? (b) If our original model is nonlinear in the measurements [e.g., z(k) = θz²(k-1) + v(k)], can anything be done to obtain invariant WLSE's under scaling?

Lesson 4

Least-Squares Estimation: Recursive Processing

INTRODUCTION

In Lesson 3 we assumed that Z(k) contained N elements, where N > dim θ = n. Suppose we decide to add more measurements, increasing the total number of them from N to N'. Formula (3-10) in Lesson 3 would not make use of the previously calculated value of θ̂ that is based on N measurements during the calculation of θ̂ that is based on N' measurements. This seems quite wasteful. We intuitively feel that it should be possible to compute the estimate based on N' measurements from the estimate based on N measurements and a modification of this earlier estimate to account for the N' - N new measurements. In this lesson we shall justify our intuition.

In Lesson 3 we also assumed that θ̂ is determined for a fixed value of n. In many system modeling problems one is interested in a preliminary model in which dimension n is a variable. This is becoming increasingly more important as we begin to model large-scale societal, energy, economic, etc., systems in which it may not be clear at the onset what effects are most important. One approach is to recompute θ̂ by means of Formula (3-10) in Lesson 3 for different values of n. This may be very costly, especially for large-scale systems, since the number of flops to compute θ̂ is on the order of n³. A second approach is to obtain θ̂ for n = n1, and to use that estimate in a computationally effective manner to obtain θ̂ for n = n2, where n2 > n1. These estimators are recursive in the dimension of θ. We shall also examine these estimators.

RECURSIVE LEAST-SQUARES: INFORMATION FORM

To begin, we consider the case when one additional measurement z(k+1), made at t_{k+1}, becomes available:

z(k+1) = h'(k+1)θ + v(k+1)   (4-1)

When this equation is combined with our earlier linear model, we obtain a new linear model,

Z(k+1) = X(k+1)θ + V(k+1)   (4-2)

where

X(k+1) = [ h'(k+1)
           X(k)    ]   (4-3)

Z(k+1) = col(z(k+1), Z(k))   (4-4)

and

V(k+1) = col(v(k+1), V(k))   (4-5)

Using (3-10) from Lesson 3 and (4-2), it is clear that

θ̂_WLS(k+1) = [X'(k+1)W(k+1)X(k+1)]^{-1} X'(k+1)W(k+1)Z(k+1)   (4-6)

To proceed further, we must assume that W is diagonal, i.e.,

W(k+1) = diag(w(k+1), W(k))   (4-7)

We shall now show that it is possible to determine θ̂(k+1) from θ̂(k) and z(k+1).

Theorem 4-1 (Information Form of Recursive LSE). A recursive structure for θ̂_WLS(k) is

θ̂_WLS(k+1) = θ̂_WLS(k) + K_W(k+1)[z(k+1) - h'(k+1)θ̂_WLS(k)]   (4-8)

K_W(k+1) = P(k+1)h(k+1)w(k+1)   (4-9)

P^{-1}(k+1) = P^{-1}(k) + h(k+1)w(k+1)h'(k+1)   (4-10)

These equations are initialized by θ̂_WLS(n) and P^{-1}(n), where P(k) is defined below in (4-13), and are used for k = n, n+1, ..., N-1.


Proof. Substitute (4-3), (4-4), and (4-7) into (4-6) (sometimes dropping the dependence upon k and k+1, for notational simplicity) to see that

θ̂_WLS(k+1) = [X'(k)W(k)X(k) + h(k+1)w(k+1)h'(k+1)]^{-1} [h(k+1)w(k+1)z(k+1) + X'(k)W(k)Z(k)]   (4-11)

Express θ̂_WLS(k) as

θ̂_WLS(k) = P(k)X'(k)W(k)Z(k)   (4-12)

where

P(k) = [X'(k)W(k)X(k)]^{-1}   (4-13)

From (4-12) and (4-13) it is straightforward to show that

X'(k)W(k)Z(k) = P^{-1}(k)θ̂_WLS(k)   (4-14)

and

P^{-1}(k+1) = P^{-1}(k) + h(k+1)w(k+1)h'(k+1)   (4-15)

It now follows that

θ̂_WLS(k+1) = P(k+1)[hwz + X'(k)W(k)Z(k)]
            = P(k+1){hwz + P^{-1}(k)θ̂_WLS(k)}
            = P(k+1){hwz + [P^{-1}(k+1) - hwh']θ̂_WLS(k)}
            = θ̂_WLS(k) + P(k+1)hw[z - h'θ̂_WLS(k)]
            = θ̂_WLS(k) + K_W(k+1)[z - h'θ̂_WLS(k)]   (4-16)

which is (4-8) when gain matrix K_W is defined as in (4-9). Based on preceding discussions about dim θ = n and dim Z(k) = N, we know that the first value of N for which (3-10) in Lesson 3 can be used is N = n; thus, (4-8) must be initialized by θ̂_WLS(n), which is computed using (3-10) in Lesson 3. Equation (4-10) is also a recursive equation for P^{-1}(k+1), which is initialized by P^{-1}(n) = X'(n)W(n)X(n). □

Comments

1. Equation (4-8) can also be expressed as

θ̂_WLS(k+1) = [I - K_W(k+1)h'(k+1)]θ̂_WLS(k) + K_W(k+1)z(k+1)   (4-17)

which demonstrates that the recursive least-squares estimator (LSE) is a time-varying digital filter that is excited by random inputs (i.e., the measurements), one whose plant matrix may itself be random, because K_W and h(k+1) may be random. The random natures of K_W and (I - K_W h') make the analysis of this filter exceedingly difficult. If K_W and h are deterministic, then stability of this filter can be studied using Lyapunov stability theory.
2. In (4-8), the term h'(k+1)θ̂_WLS(k) is a prediction of the actual measurement z(k+1). Because θ̂_WLS(k) is based on Z(k), we express this predicted value as ẑ(k+1|k), i.e.,

ẑ(k+1|k) = h'(k+1)θ̂_WLS(k)   (4-18)

so that

θ̂_WLS(k+1) = θ̂_WLS(k) + K_W(k+1)[z(k+1) - ẑ(k+1|k)]   (4-19)

3. Two recursions are present in our recursive LSE. The first is the vector recursion for θ̂_WLS given by (4-8). Clearly, θ̂_WLS(k+1) cannot be computed from this expression until measurement z(k+1) is available. The second is the matrix recursion for P^{-1} given by (4-10). Observe that values for P^{-1} (and, subsequently, K_W) can be precomputed before measurements are made.
4. A digital computer implementation of (4-8)-(4-10) proceeds as follows: P^{-1}(k+1) → P(k+1) → K_W(k+1) → θ̂_WLS(k+1).
5. Equations (4-8)-(4-10) can also be used for k = 0, 1, ..., N-1 using the following values for P^{-1}(0) and θ̂_WLS(0):

P^{-1}(0) = (1/a²)I + h(0)w(0)h'(0)   (4-20)

θ̂_WLS(0) = P(0)[(1/a²)ε̄ + h(0)w(0)z(0)]   (4-21)

In these equations (which are derived in Mendel, 1973, pp. 101-106; see also Problem 4-1), a is a very large number, ε is a very small number, and ε̄ = col(ε, ε, ..., ε). When these initial values are used in (4-8)-(4-10) for k = 0, 1, ..., n-1, then the resulting values obtained for θ̂_WLS(n) and P^{-1}(n) are the very same ones that are obtained from the batch formulas for θ̂_WLS(n) and P^{-1}(n). Often z(0) = 0, or there is no measurement made at k = 0, so that we can set z(0) = 0. In this case we can set w(0) = 0, so that P^{-1}(0) = I/a² and θ̂(0) = ε̄. By choosing ε on the order of 1/a², we see that (4-8)-(4-10) can be initialized by setting θ̂(0) = 0 and P(0) equal to a diagonal matrix of very large numbers.
6. The reason why the results in Theorem 4-1 are referred to as the "information form" of the recursive LSE is deferred until Lesson 11, where connections are made between least-squares and maximum-likelihood estimators (see the section entitled The Linear Model (X(k) deterministic), Lesson 11).
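The recursion and its initialization can be sketched on synthetic data (the choice w(k) = 1 and the value of a² below are assumptions): start from P(0) equal to a diagonal matrix of very large numbers and θ̂(0) = 0, iterate (4-8)-(4-10), and compare with the batch estimate.

```python
import numpy as np

rng = np.random.default_rng(8)
n, N = 2, 60
theta = np.array([1.5, -0.7])
H = rng.normal(size=(N, n))                # rows are h'(k)
z = H @ theta + 0.1 * rng.normal(size=N)

a2 = 1e8                                   # "a very large number": P(0) = a^2 I
P = a2 * np.eye(n)
theta_hat = np.zeros(n)                    # theta-hat(0) = 0
for k in range(N):                         # information form, (4-8)-(4-10), w(k) = 1
    h = H[k]
    P = np.linalg.inv(np.linalg.inv(P) + np.outer(h, h))   # (4-10)
    K = P @ h                                              # (4-9)
    theta_hat = theta_hat + K * (z[k] - h @ theta_hat)     # (4-8)

theta_batch = np.linalg.lstsq(H, z, rcond=None)[0]         # batch (3-10) with W = I
print(np.max(np.abs(theta_hat - theta_batch)))             # essentially zero
```

The inversion of the n × n matrix at every step is exactly the cost that the matrix inversion lemma, discussed next, removes.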

31

Which Form To Use Least-Squares

Estimation:

Recursive Processing

Lesson 4

B = P(k + l), C = h’(k + 1) and D = 1/w(k + 1). Then (4-10) looks like

MATRIX INVERSION LEMMA

(4-22), so, using (4-23) we seethat

Equations (4-10) and (4-9) require the inversion of y1x TZmatrix P. If n is large than this will be a costly computation. Fortunately, an alternative is available, one that is basedon the following matrix inversion lemma. Lemma 4-1.

If the matrices A, B, C, and D satisfy the equation

B⁻¹ = A⁻¹ + C'D⁻¹C   (4-22)

where all matrix inverses are assumed to exist, then

B = A − AC'(CAC' + D)⁻¹CA   (4-23)

Proof. Multiply B by B⁻¹, using (4-23) and (4-22), to show that BB⁻¹ = I. For a constructive proof of this lemma, see Mendel (1973), pp. 96-97. ∎

Observe that if A and B are n × n matrices, C is m × n, and D is m × m, then to compute B from (4-23) requires the inversion of one m × m matrix. On the other hand, to compute B from (4-22) requires the inversion of one m × m matrix and two n × n matrices [A⁻¹ and (B⁻¹)⁻¹]. When m < n it is definitely advantageous to compute B using (4-23) instead of (4-22). Observe, also, that in the special case when m = 1, matrix inversion in (4-23) is replaced by division.

RECURSIVE LEAST-SQUARES: COVARIANCE FORM
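Lemma 4-1 is easy to verify numerically. A minimal sketch (the sizes and values are illustrative, with m < n, the case in which (4-23) is cheaper than (4-22)):

```python
import numpy as np

# Numerical check of Lemma 4-1 (matrix inversion lemma):
# if B^{-1} = A^{-1} + C' D^{-1} C, then B = A - A C'(C A C' + D)^{-1} C A.

rng = np.random.default_rng(1)
n, m = 5, 2
G = rng.standard_normal((n, n))
A = np.eye(n) + 0.1 * (G @ G.T)      # well-conditioned symmetric positive definite A
C = rng.standard_normal((m, n))
D = np.eye(m)

B_direct = np.linalg.inv(np.linalg.inv(A) + C.T @ np.linalg.inv(D) @ C)  # via (4-22)
B_lemma = A - A @ C.T @ np.linalg.inv(C @ A @ C.T + D) @ C @ A           # via (4-23)
print(np.allclose(B_direct, B_lemma))
```

Only the m × m matrix CAC' + D is inverted in the second line, versus two n × n inversions in the first.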

Theorem 4-2 (Covariance Form of Recursive LSE). Another recursive structure for θ̂_WLS(k) is

θ̂_WLS(k + 1) = θ̂_WLS(k) + K_w(k + 1)[z(k + 1) − h'(k + 1)θ̂_WLS(k)]   (4-24)

where

K_w(k + 1) = P(k)h(k + 1)[h'(k + 1)P(k)h(k + 1) + 1/w(k + 1)]⁻¹   (4-25)

and

P(k + 1) = [I − K_w(k + 1)h'(k + 1)]P(k)   (4-26)

These equations are initialized by θ̂_WLS(n) and P(n) and are used for k = n, n + 1, …, N − 1.

Proof. We obtain the results in (4-25) and (4-26) by applying the matrix inversion lemma to (4-10), after which our new formula for P(k + 1) is substituted into (4-9). In order to accomplish the first part of this, let A = P(k), B = P(k + 1), C = h'(k + 1), and D = 1/w(k + 1). Then (4-10) looks like (4-22); so, using (4-23), we see that

P(k + 1) = P(k) − P(k)h(k + 1)[h'(k + 1)P(k)h(k + 1) + w⁻¹(k + 1)]⁻¹h'(k + 1)P(k)   (4-27)

Consequently,

K_w(k + 1) = P(k + 1)h(k + 1)w(k + 1)
           = [P − Ph(h'Ph + w⁻¹)⁻¹h'P]hw
           = Ph[I − (h'Ph + w⁻¹)⁻¹h'Ph]w
           = Ph(h'Ph + w⁻¹)⁻¹(h'Ph + w⁻¹ − h'Ph)w
           = Ph(h'Ph + w⁻¹)⁻¹

which is (4-25). In order to obtain (4-26), express (4-27) as

P(k + 1) = P(k) − K_w(k + 1)h'(k + 1)P(k) = [I − K_w(k + 1)h'(k + 1)]P(k)  ∎

Comments

1. The recursive formula for θ̂_WLS, (4-24), is unchanged from (4-8). Only the matrix recursion for P, leading to the gain matrix K_w, has changed. A digital computer implementation of (4-24)-(4-26) proceeds as follows: P(k) → K_w(k + 1) → θ̂_WLS(k + 1) → P(k + 1). This order of computations differs from the preceding one.
2. When z(k) is a scalar, the covariance form of the recursive LSE requires no matrix inversions, and only one division.
3. Equations (4-24)-(4-26) can also be used for k = 0, 1, …, N − 1, using the values for P⁻¹(0) and θ̂_WLS(0) given in (4-20) and (4-21).
4. The reason why the results in Theorem 4-2 are referred to as the "covariance form" of the recursive LSE is deferred to Lesson 9, where connections are made between least-squares and best linear unbiased minimum-variance estimators (see p. 79).

WHICH FORM TO USE

We have derived two formulations for a recursive least-squares estimator, the information and covariance forms. In on-line applications, where speed of computation is often the most important consideration, the covariance form is preferable to the information form. This is because a smaller matrix needs to be inverted in the covariance form, namely an m × m matrix rather than an n × n matrix (m is often much smaller than n).
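The covariance form (4-24)-(4-26) can be sketched as follows for scalar measurements (Python/NumPy; sizes, seed, and noise level are illustrative assumptions). Note the order of computations, P(k) → K_w(k + 1) → θ̂_WLS(k + 1) → P(k + 1), and that only a division is required per update:

```python
import numpy as np

# Sketch of the covariance form (4-24)-(4-26) with scalar z(k):
# no matrix inversion is needed, only one division per measurement.

rng = np.random.default_rng(2)
n, N = 3, 50
theta = np.array([1.0, -2.0, 0.5])
H = rng.standard_normal((N, n))
z = H @ theta + 0.1 * rng.standard_normal(N)
w = 1.0                                    # scalar weight w(k+1)

theta_hat = np.zeros(n)
P = 1e8 * np.eye(n)                        # P(0): diagonal matrix of very large numbers
for k in range(N):
    h = H[k]
    K = P @ h / (h @ P @ h + 1.0 / w)      # (4-25): a division, not an inversion
    theta_hat = theta_hat + K * (z[k] - h @ theta_hat)   # (4-24)
    P = (np.eye(n) - np.outer(K, h)) @ P   # (4-26)

theta_batch = np.linalg.solve(H.T @ H, H.T @ z)
print(np.allclose(theta_hat, theta_batch, atol=1e-4))
```

The same data processed by the information form would require an n × n inversion at every step.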


The information form is often more useful than the covariance form in analytical studies. For example, it is used to derive the initial conditions for P⁻¹(0) and θ̂_WLS(0), which are given in (4-20) and (4-21) (see Mendel, 1973, pp. 101-106). The information form is also to be preferred over the covariance form during the startup of recursive least squares. We demonstrate why this is so next. We consider the case when

P(0) = a²I_n   (4-28)

where a² is a very, very large number. Using the information form, we find that, for k = 0,

P⁻¹(1) = h(1)w(1)h'(1) + I_n/a²

and, therefore,

K_w(1) = [h(1)w(1)h'(1) + I_n/a²]⁻¹h(1)w(1)

No difficulties are encountered when we compute K_w(1) using the information form. Using the covariance form, we find, first, that

K_w(1) = a²h(1)[h'(1)a²h(1)]⁻¹ = h(1)[h'(1)h(1)]⁻¹

and then that

P(1) = {I − h(1)[h'(1)h(1)]⁻¹h'(1)}a²   (4-29)

however, this matrix is singular. To see this, postmultiply both sides of (4-29) by h(1), to obtain

P(1)h(1) = {h(1) − h(1)[h'(1)h(1)]⁻¹h'(1)h(1)}a² = 0   (4-30)

Neither P(1) nor h(1) equals zero; hence, P(1) must be a singular matrix for P(1)h(1) to equal zero. In fact, once P(1) becomes singular, all other P(j), j ≥ 2, will be singular. In Lesson 9 we shall show that when W⁻¹(k) = E{V(k)V'(k)}, then P(k) is the covariance matrix of the estimation error, θ̃(k). This matrix must be positive definite, and it will be quite difficult to maintain this property if P(k) is singular; hence, it is advisable to initialize the recursive least-squares estimator using the information form. However, it is also advisable to switch to the covariance formulation as soon after initialization as possible, in order to reduce computing time.

PROBLEMS

4-1. In order to derive the formulas for P⁻¹(0) and θ̂_WLS(0), given in (4-20) and (4-21), respectively, one proceeds as follows. Introduce n artificial measurements z_a(−1), z_a(−2), …, z_a(−n), where z_a(−j) ≜ ε, in which ε is a very small number. Then, assume that the model for z_a(−j) is z_a(−j) = θ_j/a (j = 1, 2, …, n), where a is a very large number.
(a) Show that θ̂_WLS(−1) = aε, where the n × 1 vector ε = col(ε, ε, …, ε). Additionally, show that P(−1) = a²I_n.
(b) Show that θ̂_WLS(0) = P(0)[ε/a + h(0)w(0)z(0)] and P(0) = [I_n/a² + h(0)w(0)h'(0)]⁻¹.
(c) Show that when the measurements z_a(−n), …, z_a(−1), z(0), z(1), …, z(l + 1) are used, then

P⁻¹(l + 1) = I_n/a² + Σ_{j=0}^{l+1} h(j)w(j)h'(j)

(d) Show that when the measurements z(0), z(1), …, z(l + 1) are used (i.e., the artificial measurements are not used), then

P(l + 1) = [Σ_{j=0}^{l+1} h(j)w(j)h'(j)]⁻¹

(e) Finally, show that for a ≫ 1 and ε ≪ 1 the results of parts (c) and (d) are in agreement.

For this model, datum {Z(k), X₁(k), X₂(k)} is available. We wish to compute the least-squares estimates of θ₁ and θ₂ for the (n + q)-parameter model using the previously computed θ̂*₁,LS(k).


Theorem 5-1. Given the linear model Z(k) = X₁(k)θ₁ + X₂(k)θ₂ + V(k) in (5-11), where θ₁ is n × 1 and θ₂ is q × 1, the LSEs of θ₁ and θ₂ based on datum {Z(k), X₁(k), X₂(k)} are found from the following equations:

θ̂₁,LS(k) = θ̂*₁,LS(k) − G(k)C(k)X₂'(k)W(k)[Z(k) − X₁(k)θ̂*₁,LS(k)]   (5-12)

and

θ̂₂,LS(k) = C(k)X₂'(k)W(k)[Z(k) − X₁(k)θ̂*₁,LS(k)]   (5-13)

where

G(k) = [X₁'(k)W(k)X₁(k)]⁻¹X₁'(k)W(k)X₂(k)   (5-14)

C(k) = [X₂'(k)W(k)X₂(k) − X₂'(k)W(k)X₁(k)G(k)]⁻¹   (5-15)


The results in this theorem were worked out by Astrom (1968) and emphasize operations which are performed on the vector of residuals for the n-parameter model, namely Z(k) − X₁(k)θ̂*₁,LS(k). Other forms for θ̂₁,LS(k) and θ̂₂,LS(k) appear in Mendel (1975).

Figure 5-2 Two ways to reach θ̂(k + 1), θ̂(k + 2), …: (a) simultaneous recursive processing, performed along the line dim z = m, and (b) cross-sectional recursive processing where, for example, at t_{k+1} the processing is performed along the line TIME = t_{k+1} and stops when that line intersects the line dim z = m.

The remarkable property about cross-sectional processing is that

θ̂_c(k + 1) = θ̂(k + 1)   (5-9)

Proof. The derivation of (5-12) and (5-13) is based primarily on the block-decomposition method for inverting X'X, where X = (X₁ ⋮ X₂). See Mendel (1975) for the details. ∎

where θ̂(k + 1) is obtained from simultaneous recursive processing. A very large computational advantage exists for cross-sectional processing if m_i = 1. In this case the matrix inverse [H(k + 1)P(k)H'(k + 1) + W⁻¹(k + 1)]⁻¹ needed in (4-25) of Lesson 4 is replaced by the division 1/[h_i'(k + 1)P_i(k)h_i(k + 1) + 1/w_i(k + 1)]. See Mendel, 1973, pp. 113-118, for a proof of (5-9).

Similar results to those in Theorem 5-1 can be developed for the removal of parameters from a model, for adding or removing parameters one at a time, and for recursive-in-time versions of all these results (see Mendel, 1975). All of these results are referred to as "multistage LSEs." We conclude this section and lesson with some examples which illustrate problems for which multistage algorithms can be quite useful.
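The partitioned formulas (5-12)-(5-15) can be checked numerically against a direct least-squares fit of the full model. A minimal sketch, assuming W(k) = I (all sizes, seed, and values are illustrative):

```python
import numpy as np

# Numerical check of Theorem 5-1 (multistage LSE) with W(k) = I:
# estimates of theta1 and theta2 from (5-12)-(5-15) should match a direct
# least-squares fit of the full model Z = X1*theta1 + X2*theta2 + V.

rng = np.random.default_rng(3)
N, n, q = 40, 3, 2
X1 = rng.standard_normal((N, n))
X2 = rng.standard_normal((N, q))
theta1, theta2 = np.array([1.0, -1.0, 2.0]), np.array([0.5, -0.5])
Z = X1 @ theta1 + X2 @ theta2 + 0.01 * rng.standard_normal(N)

# Stage 1: LSE for the n-parameter model only, as in (5-10)
t1_star = np.linalg.solve(X1.T @ X1, X1.T @ Z)

# Stage 2: correct it using (5-12)-(5-15), operating on the stage-1 residuals
G = np.linalg.solve(X1.T @ X1, X1.T @ X2)          # (5-14)
C = np.linalg.inv(X2.T @ X2 - X2.T @ X1 @ G)       # (5-15)
r = Z - X1 @ t1_star                               # residuals Z - X1*t1_star
t2 = C @ X2.T @ r                                  # (5-13)
t1 = t1_star - G @ C @ X2.T @ r                    # (5-12)

# Direct LSE of the full (n + q)-parameter model
X = np.hstack([X1, X2])
t_full = np.linalg.solve(X.T @ X, X.T @ Z)
print(np.allclose(np.concatenate([t1, t2]), t_full))
```

The two-stage route reuses the stage-1 estimate, which is the point of the multistage construction.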

MULTISTAGE LEAST-SQUARES ESTIMATORS

Suppose we are given a linear model Z(k) = X₁(k)θ₁ + V(k) with n unknown parameters θ₁, datum {Z(k), X₁(k)}, and the LSE of θ₁, θ̂*₁,LS(k), where

θ̂*₁,LS(k) = [X₁'(k)W(k)X₁(k)]⁻¹X₁'(k)W(k)Z(k)   (5-10)

Example 5-2 Identification of a Sampled Impulse Response: Zoom-In Algorithm

A test signal, u(t), is applied at t = t₀ to a linear, time-invariant, causal, but unknown system whose output, y(t), and input are measured. The unknown impulse response, w(t), is to be identified using sampled values of u(t) and y(t). For such a system,

y(t_k) = ∫_{t₀}^{t_k} w(τ)u(t_k − τ)dτ   (5-16)


One approach to identifying w(t) is to discretize (5-16) and to identify w(t) only at discrete values of time. If we assume that (1) w(t) ≈ 0 for all t ≥ t_n; (2) [t₀, t_n] is divided into n equal intervals, each of width T, so that n = (t_n − t₀)/T; and (3) for τ ∈ [t_{i−1}, t_i], w(τ) ≈ w(t_{i−1}) and u(t − τ) ≈ u(t − t_{i−1}), then we obtain the discretized model in (5-17). Additionally, for the zoom-in interval [t_x, t_{x+1}] introduced below, which is divided into q equal subintervals of width ΔT_x,

∫_{t_x}^{t_{x+1}} w(τ)u(t_k − τ)dτ = Σ_{j=0}^{q−1} ∫_{t_x+jΔT_x}^{t_x+(j+1)ΔT_x} w(τ)u(t_k − τ)dτ
                                  = Σ_{j=0}^{q−1} w(t_x + jΔT_x)u(t_k − t_x − jΔT_x)ΔT_x   (5-19)

Let

θ₁ = col(w₁(t₀), w₁(t₁), …, w₁(t_x), …, w₁(t_{n−1}))   (5-22)

and assume that a least-squares estimate θ̂₁,LS(k), which is based on (5-17), is available. Let

θ₂ = col(w₂(t_x + ΔT_x), w₂(t_x + 2ΔT_x), …, w₂(t_x + (q − 1)ΔT_x))   (5-23)

To obtain θ̂_LS = col(θ̂₁,LS, θ̂₂,LS), proceed as follows: (1) modify θ̂₁,LS by scaling θ̂₁,LS(t_x) to θ̂₁,LS(t_x)ΔT_x/T, and call the modified result θ̂*₁,LS; and (2) apply Theorem 5-1 to obtain

y(t_k) = Σ_{i=1}^{n} w₁(t_{i−1})u(t_k − t_{i−1})   (5-17)

where

w₁(t_{i−1}) = Tw(t_{i−1})   (5-18)

It is straightforward to identify the n unknown parameters w₁(t₀), w₁(t₁), …, w₁(t_{n−1}) via least squares (see Example 2-1 of Lesson 2); however, for n to be known, t_n must be accurately known, and T must be chosen most judiciously. In actual practice t_n is not known that accurately, so that n may have to be varied. Multistage LSEs can be used to handle this situation. Sometimes T can be too coarse for certain regions of time, in which case significant features of w(t), such as a ripple, may be obscured. In this situation, we would like to "zoom in" on those intervals of time and rediscretize y(t) just over those intervals, thereby adding more terms to (5-17). Multistage LSEs can also be used to handle this situation, as we demonstrate next. For illustrative purposes, we present this procedure for the case when the interval of interest equals T, i.e., when t ∈ [t_x, t_{x+1}] with t_{x+1} − t_x = T, and this interval is further divided into q equal intervals, each of width ΔT_x, so that q = (t_{x+1} − t_x)/ΔT_x = T/ΔT_x. Equation (5-20), the rediscretized measurement model, contains n + q − 1 parameters.

TABLE 5-2 Impulse Response Estimates (average estimates of the parameters, and average performance, for models with from one to ten parameters; the table's numerical columns are not legible in this reproduction; values for parameters shown in parentheses are true values, and blank spaces denote "no value"). Source: Reprinted from Mendel, 1975, pg. 782, © 1975 IEEE.
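The identification model (5-17) is linear in the n samples of the impulse response, so a batch LSE applies directly. A minimal sketch in the spirit of Examples 5-2 and 5-3 (the system, noise level, sizes, and seed here are illustrative assumptions, not those of the examples):

```python
import numpy as np

# Sketch of impulse-response identification via (5-17): y(t_k) is modeled as
# sum_{i=1..n} w1(t_{i-1}) u(t_k - t_{i-1}), a model linear in the n samples
# w1 = T*w, and is fit by batch least squares.

rng = np.random.default_rng(4)
n, N = 8, 200
w_true = np.exp(-0.5 * np.arange(n))       # assumed decaying impulse response
u = rng.choice([-1.0, 1.0], size=N)        # random +/-1 test input, as in Example 5-3
y = np.convolve(u, w_true)[:N] + 0.05 * rng.standard_normal(N)

# Regression matrix: row k holds u(t_k), u(t_{k-1}), ..., u(t_{k-n+1})
X = np.zeros((N, n))
for k in range(N):
    for i in range(min(k + 1, n)):
        X[k, i] = u[k - i]

w_hat = np.linalg.solve(X.T @ X, X.T @ y)  # batch LSE of the n samples
print(np.max(np.abs(w_hat - w_true)) < 0.05)
```

Re-fitting with different n, or augmenting the parameter vector over a zoomed-in interval, is exactly where the multistage formulas of Theorem 5-1 save work.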


θ̂_LS. Note that the scaling of θ̂₁,LS(t_x) in step 1 is due to the appearance of w₁(t_x)ΔT_x/T, instead of w₁(t_x), in (5-20). It is straightforward to extend the approach of this example to regions that include more than one sampling interval, T. ∎

Example 5-3 Impulse Response Identification

An n-stage batch LSE was used to estimate impulse response models having from one to ten parameters [i.e., n in (5-17) ranged from 1 to 10] for a second-order system (natural frequency of 10.47 rad/sec, damping ratio of 0.12, and unity gain). The system was forced with an input of randomly chosen ±1's, each of which was equally likely to occur. The system's output was corrupted by zero-mean pseudo-random Gaussian white noise of unity variance. Fifty consecutive samples of the noisy output and noise-free input were then processed by the n-stage batch LSE, and this procedure was repeated ten times, from which the average values for the parameter estimates given in Table 5-2 were obtained. The ten models were tested using ten input sequences, each containing 50 samples of randomly chosen ±1's. The last column in Table 5-2 gives values for the normalized average performance

V̄ = [Σ_{i=1}^{50} (y(t_i) − ŷ(t_i))²] / [Σ_{i=1}^{50} y²(t_i)]

in which ŷ(t_i) denotes an average over the ten runs. Not too surprisingly, we see that average predicted performance improves as n increases. All of the θ̂_i results were obtained in one pass through the n-stage LSE, using approximately 3150 flops. The same results could have been obtained using 10 LSEs (Lesson 3), but that would have required approximately 5220 flops. ∎

PROBLEMS

5-1. Show that, in the vector measurement case, the results given in Lessons 3 and 4 only need to be modified using the transformations listed in Table 5-1.
5-2. Prove that, using cross-sectional processing, θ̂_c(k + 1) = θ̂(k + 1).
5-3. Prove the multistage least-squares estimation Theorem 5-1.
5-4. Extend Example 5-2 to regions of interest equal to mT, where m is a positive integer and T is the original data sampling time.

Lesson 6

Small Sample Properties of Estimators

INTRODUCTION

How do we know whether or not the results obtained from the LSE, or for that matter any estimator, are good? To answer this question, we make use of the fact that all estimators represent transformations of random data. For example, our LSE, [X'(k)W(k)X(k)]⁻¹X'(k)W(k)Z(k), represents a linear transformation on Z(k). Other estimators may represent nonlinear transformations of Z(k). The consequence of this is that θ̂(k) is itself random. Its properties must therefore be studied from a statistical viewpoint. In the estimation literature, it is common to distinguish between small-sample and large-sample properties of estimators. The term "sample" refers to the number of measurements used to obtain θ̂, i.e., the dimension of Z. The phrase "small-sample" means any number of measurements (e.g., 1, 2, 100, 10⁴, or even an infinite number), whereas the phrase "large-sample" means an infinite number of measurements. Large-sample properties are also referred to as asymptotic properties. It should be obvious that if an estimator possesses a small-sample property, it also possesses the associated large-sample property; but the converse is not always true.

Why bother studying large-sample properties of estimators if these properties are included in their small-sample properties? Put another way, why not just study small-sample properties of estimators? For many estimators it is relatively easy to study their large-sample properties and virtually impossible to learn about their small-sample properties. An analogous situation occurs in stability theory, where most effort is directed at infinite-time stability behavior rather than at finite-time behavior.


Although "large-sample" means an infinite number of measurements, estimators begin to enjoy their large-sample properties for much fewer than an infinite number of measurements. How few usually depends on the dimension of θ, n. A thorough study of θ̂ would mean determining its probability density function p(θ̂). Usually, it is too difficult to obtain p(θ̂) for most estimators (unless θ̂ is multivariate Gaussian); thus, it is customary to emphasize the first- and second-order statistics of θ̂ (or its associated error θ̃ = θ − θ̂), namely the mean and covariance. We shall examine the following small- and large-sample properties of estimators: unbiasedness and efficiency (small-sample), and asymptotic unbiasedness, consistency, and asymptotic efficiency (large-sample). Small-sample properties are the subject of this lesson, whereas large-sample properties are studied in Lesson 7.

UNBIASEDNESS

Many estimators are linear transformations of the measurements, i.e.,

θ̂(k) = F(k)Z(k)   (6-6)

In least squares, we obtained this linear structure for θ̂(k) by solving an optimization problem. Sometimes, we begin by assuming that (6-6) is the desired structure for θ̂(k). We now address the question "when is F(k)Z(k) an unbiased estimator of deterministic θ?"

Theorem 6-1. When Z(k) = X(k)θ + V(k), E{V(k)} = 0, and X(k) is deterministic, then θ̂(k) = F(k)Z(k) [where F(k) is deterministic] is an unbiased estimator of θ if and only if

F(k)X(k) = I for all k   (6-7)

Note that this is the first place where we have had to assume any a priori knowledge about noise V(k).

Proof.

a. (Necessity). From the model for Z(k) and the assumed structure for θ̂(k), we see that

θ̂(k) = F(k)X(k)θ + F(k)V(k)   (6-8)

If θ̂(k) is an unbiased estimator of θ, and F(k) and X(k) are deterministic and E{V(k)} = 0, then

E{θ̂(k)} = θ = F(k)X(k)θ   (6-9)

so that

[I − F(k)X(k)]θ = 0

Obviously, for θ ≠ 0, (6-7) is the solution to this equation.

b. (Sufficiency). From (6-8) and the nonrandomness of F(k) and X(k), we have

E{θ̂(k)} = F(k)X(k)θ   (6-10)

Assuming the truth of (6-7), it must be that E{θ̂(k)} = θ, which, of course, means that θ̂(k) is an unbiased estimator of θ. ∎

In terms of estimation error, θ̃(k), unbiasedness means that E{θ̃(k)} = 0 for all k.

Example 6-1. In the instrument calibration example of Lesson 3, we determined the following LSE of θ: θ̂_LS(N) = (1/N) Σ_{i=1}^{N} z(i), where z(i) = θ + v(i). Suppose E{v(i)} = 0 for i = 1, 2, …, N; then E{θ̂_LS(N)} = θ + (1/N) Σ_{i=1}^{N} E{v(i)} = θ, which means that θ̂_LS(N) is an unbiased estimator of θ. ∎


Example 6-2. Matrix F(k) for the WLSE of θ is [X'(k)W(k)X(k)]⁻¹X'(k)W(k). Observe that this F(k) matrix satisfies (6-7); thus, when X(k) is deterministic the WLSE of θ is unbiased. Unfortunately, in many interesting applications X(k) is random, and we cannot apply Theorem 6-1 to study the unbiasedness of the WLSE. We return to this issue in Lesson 8. ∎
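Theorem 6-1 is easy to illustrate by simulation: with a deterministic X(k) and F(k) = [X'(k)X(k)]⁻¹X'(k), condition (6-7) holds, and averaging θ̂(k) over many independent noise realizations recovers θ. A sketch (all sizes, the seed, and the noise level are illustrative assumptions):

```python
import numpy as np

# Monte Carlo illustration of Theorem 6-1: F(k)X(k) = I implies unbiasedness
# when X(k) is deterministic and E{V(k)} = 0.

rng = np.random.default_rng(5)
N, n, runs = 20, 2, 20000
X = rng.standard_normal((N, n))            # fixed (deterministic) design matrix
theta = np.array([3.0, -1.0])
F = np.linalg.solve(X.T @ X, X.T)          # F X = I, so (6-7) is satisfied

# runs independent zero-mean noise realizations; one estimate per row
est = (X @ theta + rng.standard_normal((runs, N))) @ F.T
print(np.allclose(est.mean(axis=0), theta, atol=0.05))
```

Each individual estimate is random; only the ensemble mean matches θ, which is precisely what unbiasedness asserts.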


Suppose that we begin by assuming a linear recursive structure for θ̂, namely

θ̂(k + 1) = A(k + 1)θ̂(k) + b(k + 1)z(k + 1)   (6-11)

We then have the following counterpart to Theorem 6-1.

Theorem 6-2. When z(k + 1) = h'(k + 1)θ + v(k + 1), E{v(k + 1)} = 0, and h(k + 1) is deterministic, then θ̂(k + 1) given by (6-11) is an unbiased estimator of θ if

A(k + 1) = I − b(k + 1)h'(k + 1)   (6-12)

where A(k + 1) and b(k + 1) are deterministic. ∎

We leave the proof of this result to the reader. Unbiasedness means that our recursive estimator does not have two independent design matrices (degrees of freedom), A(k + 1) and b(k + 1). Unbiasedness constrains A(k + 1) to be a function of b(k + 1). When (6-12) is substituted into (6-11), we obtain the following important structure for an unbiased linear recursive estimator of θ:

θ̂(k + 1) = θ̂(k) + b(k + 1)[z(k + 1) − h'(k + 1)θ̂(k)]   (6-13)

Our recursive WLSE of θ has this structure; thus, as long as h(k + 1) is deterministic, it produces unbiased estimates of θ. Many other estimators that we shall study will also have this structure.

EFFICIENCY

Did you hear the story about the conventioning statisticians who all drowned in a lake that was on the average 6 in. deep? The point of this rhetorical question is that unbiasedness by itself is not terribly meaningful. We must also study the dispersion about the mean, namely the variance. If the statisticians had known that the variance about the 6-in. average depth was large, they might not have drowned! Ideally, we would like our estimator to be unbiased and to have the smallest possible error variance. We consider the case of a scalar parameter first.

Definition 6-2. An unbiased estimator, θ̂(k), of θ is said to be more efficient than any other unbiased estimator, θ̆(k), of θ, if

var(θ̂(k)) ≤ var(θ̆(k)) for all k   (6-14)  ∎

Very often, it is of interest to know if θ̂(k) satisfies (6-14) for all other unbiased estimators θ̆(k). This can be verified by comparing the variance of θ̂(k) with the smallest error variance that can ever be attained by any unbiased estimator. The following theorem provides a lower bound for E{θ̃²(k)} when θ is a scalar deterministic parameter. Theorem 6-4 generalizes these results to the case of a vector of deterministic parameters.

Theorem 6-3 (Cramer-Rao Inequality). Let z denote a set of data [i.e., z = col(z₁, z₂, …, z_k); z is also short for Z(k)]. If θ̂(k) is an unbiased estimator of deterministic θ, then

E{θ̃²(k)} ≥ 1 / E{[∂ ln p(z)/∂θ]²} for all k   (6-15)

Two other ways for expressing (6-15) are

E{θ̃²(k)} ≥ 1 / ∫_{−∞}^{∞} [∂ ln p(z)/∂θ]² p(z)dz for all k   (6-16)

and

E{θ̃²(k)} ≥ −1 / E{∂² ln p(z)/∂θ²} for all k   (6-17)

where dz is short for dz₁ dz₂ ⋯ dz_k. ∎

Inequalities (6-15), (6-16), and (6-17) are named after Cramer and Rao, who discovered them. They are functions of k because z is. Before proving this theorem, it is instructive to illustrate its use by means of an example.

Example 6-3. We are given M statistically independent observations of a random variable z that is known to have a Cauchy distribution, i.e.,

p(z_i) = 1 / {π[1 + (z_i − θ)²]}   (6-18)

Parameter θ is unknown and will be estimated using z₁, z₂, …, z_M. We shall determine the lower bound for the error variance of any unbiased estimator of θ using (6-15). Observe that we are able to do this without having to specify an estimator structure for θ̂. Without further explanation, we calculate:

p(z) = Π_{i=1}^{M} p(z_i) = Π_{i=1}^{M} 1 / {π[1 + (z_i − θ)²]}   (6-19)

ln p(z) = −M ln π − Σ_{i=1}^{M} ln[1 + (z_i − θ)²]   (6-20)


and

∂ ln p(z)/∂θ = Σ_{i=1}^{M} 2(z_i − θ)/[1 + (z_i − θ)²]   (6-21)

so that

E{[∂ ln p(z)/∂θ]²} = TA + TB   (6-22)

Next, we must evaluate the right-hand side of (6-22). This is tedious to do, but can be accomplished as follows. Consider TA first, i.e.,

TA = E{Σ_{i=1}^{M} Σ_{j≠i} 2(z_i − θ)/[1 + (z_i − θ)²] · 2(z_j − θ)/[1 + (z_j − θ)²]}   (6-24)

where we have made use of the statistical independence of the measurements. Observe that

E{2(z_j − θ)/[1 + (z_j − θ)²]} = (2/π) ∫_{−∞}^{∞} y dy/(1 + y²)² = 0   (6-25)

where y = z_j − θ. The integral is zero because the integrand is an odd function of y. Consequently,

TA = 0   (6-26)

Next, consider TB, i.e.,

TB = E{Σ_{i=1}^{M} 4(z_i − θ)²/[1 + (z_i − θ)²]²}   (6-27)

which can also be written as

TB = 4 Σ_{i=1}^{M} TC   (6-28)

where

TC = E{(z_i − θ)²/[1 + (z_i − θ)²]²}   (6-29)

or

TC = (1/π) ∫_{−∞}^{∞} y² dy/(1 + y²)³   (6-30)

Integrating (6-30) by parts twice, we find that

TC = (1/8π) ∫_{−∞}^{∞} dy/(1 + y²) = 1/8   (6-31)

because the integral in (6-31) equals π (it is π times the area under the Cauchy probability density function). Substituting (6-31) into (6-28), we determine that

TB = M/2   (6-32)

thus, when (6-32) is substituted into (6-22), and that result is substituted into (6-15), we find that

E{θ̃²(k)} ≥ 2/M for all M   (6-33)

Observe that the Cramer-Rao bound depends on the number of measurements used to estimate θ. For large numbers of measurements, this bound approaches zero. ∎

Proof of Theorem 6-3. Because θ̂(k) is an unbiased estimator of θ,

E{θ̃(k)} = ∫_{−∞}^{∞} [θ̂(k) − θ] p(z)dz = 0   (6-34)

Differentiating (6-34) with respect to θ, we find that

∫_{−∞}^{∞} [θ̂(k) − θ][∂p(z)/∂θ] dz − ∫_{−∞}^{∞} p(z)dz = 0

which can be rewritten as

∫_{−∞}^{∞} [θ̂(k) − θ][∂p(z)/∂θ] dz = 1   (6-35)

As an aside, we note that

∂ ln p(z)/∂θ = [1/p(z)] ∂p(z)/∂θ   (6-36)

so that

∂p(z)/∂θ = p(z) ∂ ln p(z)/∂θ   (6-37)

Substitute (6-37) into (6-35) to obtain

∫_{−∞}^{∞} [θ̂(k) − θ][∂ ln p(z)/∂θ] p(z)dz = 1   (6-38)

Recall the Schwarz inequality

[∫ a(z)b(z)dz]² ≤ ∫ a²(z)dz · ∫ b²(z)dz   (6-39)

where equality is achieved when b(z) = ca(z), in which c is an arbitrary constant. Next, square both sides of (6-38) and apply (6-39) to the new right-hand side [with a(z) = [θ̂(k) − θ]√p(z) and b(z) = √p(z) ∂ ln p(z)/∂θ], to see that

1 ≤ ∫_{−∞}^{∞} [θ̂(k) − θ]² p(z)dz · ∫_{−∞}^{∞} [∂ ln p(z)/∂θ]² p(z)dz

or

1 ≤ E{θ̃²(k)} E{[∂ ln p(z)/∂θ]²}   (6-40)


Finally, to obtain (6-15), solve (6-40) for E{θ̃²(k)}. In order to obtain (6-16) from (6-15), observe that

E{[∂ ln p(z)/∂θ]²} = ∫_{−∞}^{∞} [∂ ln p(z)/∂θ]² p(z)dz   (6-41)

To obtain (6-41) we have also used (6-36). In order to obtain (6-17), we begin with the identity

∫_{−∞}^{∞} p(z)dz = 1

and differentiate it twice with respect to θ, using (6-37) after each differentiation, to show that

E{∂² ln p(z)/∂θ²} = −E{[∂ ln p(z)/∂θ]²}   (6-42)

Substitute (6-42) into (6-15) to obtain (6-17). ∎

It is sometimes easier to compute the Cramer-Rao bound using one form [i.e., (6-15) or (6-16) or (6-17)] than another. The logarithmic forms are usually used when p(z) is exponential (e.g., Gaussian).

Corollary 6-1. If the lower bound is achieved in Theorem 6-3, then

θ̃(k) = (1/c) ∂ ln p(z)/∂θ   (6-43)

where c is an arbitrary constant.

Proof. In deriving the Cramer-Rao bound we used the Schwarz inequality (6-39), for which equality is achieved when b(z) = ca(z). In our case a(z) = [θ̂(k) − θ]√p(z) and b(z) = √p(z) ∂ ln p(z)/∂θ. Setting b(z) = ca(z), we obtain (6-43). ∎

Equation (6-43) links the structure of an estimator to the property of efficiency, because the left-hand side of (6-43) depends explicitly on θ̂(k). We turn next to the general case of a vector of parameters.

Definition 6-3. An unbiased estimator, θ̂(k), of vector θ is said to be more efficient than any other unbiased estimator, θ̆(k), of θ, if

E{[θ − θ̂(k)][θ − θ̂(k)]'} ≤ E{[θ − θ̆(k)][θ − θ̆(k)]'}   (6-44)  ∎

For a vector of parameters, we see that a more efficient estimator has the smallest error covariance among all unbiased estimators of θ, "smallest" in the sense that E{[θ − θ̂(k)][θ − θ̂(k)]'} − E{[θ − θ̆(k)][θ − θ̆(k)]'} is negative semi-definite. The generalization of the Cramer-Rao inequality to a vector of parameters is given next.

Theorem 6-4 (Cramer-Rao inequality for a vector of parameters). Let z denote a set of data as in Theorem 6-3, and θ̂(k) be any unbiased estimator of deterministic θ based on z. Then

E{θ̃(k)θ̃'(k)} ≥ J⁻¹ for all k   (6-45)

where J is the "Fisher information matrix,"

J = E{[∂ ln p(z)/∂θ][∂ ln p(z)/∂θ]'}   (6-46)

which can also be expressed as

J = −E{∂² ln p(z)/∂θ∂θ'}   (6-47)

Equality holds in (6-45) if and only if

[∂ ln p(z)/∂θ]' = c(θ)θ̃(k)   (6-48)  ∎

A complete proof of this result is given by Sorenson (1980, pp. 94-96). Although the proof is similar to our proof of Theorem 6-3, it is a bit more intricate because of the vector nature of θ. Inequality (6-45) demonstrates that any unbiased estimator can have a covariance no smaller than J⁻¹. Unfortunately, J⁻¹ is not a greatest lower bound for the error covariance. Other bounds exist which are tighter than (6-45) [e.g., the Bhattacharyya bound (Van Trees, 1968)], but they are even more difficult to compute than J⁻¹.

Corollary 6-2. Let z denote a set of data as in Theorem 6-3, and θ̂_i(k) be any unbiased estimator of deterministic θ_i based on z. Then

E{θ̃_i²(k)} ≥ (J⁻¹)_ii,  i = 1, 2, …, n and all k   (6-49)

where (J⁻¹)_ii is the i-ith element in matrix J⁻¹.

Proof. Inequality (6-45) means that E{θ̃(k)θ̃'(k)} − J⁻¹ is a positive semi-definite matrix, i.e.,

a'[E{θ̃(k)θ̃'(k)} − J⁻¹]a ≥ 0   (6-50)

where a is an arbitrary nonzero vector. Choosing a = e_i (the ith unit vector), we obtain (6-49). ∎


Results similar to those in Theorem 6-4 and Corollary 6-2 are also available for a vector of random parameters (e.g., Sorenson, 1980, pp. 99-100). Let p(z, θ) denote the joint probability density function of z and θ. The Cramer-Rao inequality for random parameters is obtained from Theorems 6-3 and 6-4 by replacing p(z) with p(z, θ). Of course, the expectation is now with respect to both z and θ.
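The bound of Example 6-3 can be checked numerically: the per-measurement Fisher information for the Cauchy density (6-18) is 1/2, so M measurements give M/2, and the bound in (6-33) is 2/M. A sketch (the grid limits are an illustrative truncation of the infinite integral):

```python
import numpy as np

# Numerical check of Example 6-3: E{[d ln p / d theta]^2} = 1/2 per Cauchy
# measurement, so the Cramer-Rao bound (6-33) is E{theta_tilde^2} >= 2/M.

theta = 0.0
y = np.linspace(-200.0, 200.0, 2_000_001)            # grid for one measurement
dy = y[1] - y[0]
p = 1.0 / (np.pi * (1.0 + (y - theta) ** 2))         # Cauchy density (6-18)
score = 2.0 * (y - theta) / (1.0 + (y - theta) ** 2) # d ln p / d theta for M = 1
info = np.sum(score ** 2 * p) * dy                   # Fisher information per sample
print(round(info, 3))                                # -> 0.5
```

The sample mean does not attain this bound for Cauchy data; in fact the Cauchy sample mean does not even concentrate, which is one reason this example is instructive.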

PROBLEMS

6-1. Prove Theorem 6-2, which provides an unbiasedness constraint for the two design matrices that appear in a linear recursive estimator.
6-2. Prove Theorem 6-4, which provides the Cramer-Rao bound for a vector of parameters.
6-3. Random variable x ~ N(x; μ, σ²), and we are given a random sample {x₁, x₂, …, x_N}. Consider the given estimator for μ, which involves a constant a ≥ 0. For what value(s) of a is μ̂(N) an unbiased estimator of μ?
6-4. Suppose z₁, z₂, …, z_N are random samples from a Gaussian distribution with unknown mean, μ, and variance, σ². Reasonable estimators of μ and σ² are the sample mean, z̄ = (1/N) Σ_{i=1}^{N} z_i, and the sample variance, s² = (1/N) Σ_{i=1}^{N} (z_i − z̄)². Is s² an unbiased estimator of σ²? [Hint: Show that E(s²) = (N − 1)σ²/N.]
6-5. (Mendel, 1973, first part of Exercise 2-9, pg. 137). Show that if θ̂ is an unbiased estimate of θ, then aθ̂ + b is an unbiased estimate of aθ + b.
6-6. Suppose that N independent observations {x₁, x₂, …, x_N} are made of a random variable x that is Gaussian. In this problem only μ is unknown. Derive the Cramer-Rao lower bound of E{μ̃²(N)} for an unbiased estimator of μ.
6-7. Repeat Problem 6-6, but in this case assume that only σ² is unknown, i.e., derive the Cramer-Rao lower bound of E{[σ̃²(N)]²} for an unbiased estimator of σ².
6-8. Repeat Problem 6-6, but in this case assume both μ and σ² are unknown; compute J⁻¹ when θ = col(μ, σ²).


6-9. Suppose θ̂(k) is a biased estimator of deterministic θ, with bias B(θ). Show that

E{θ̃²(k)} ≥ [1 + dB(θ)/dθ]² / E{[∂ ln p(z)/∂θ]²} for all k

Lesson 7

Large Sample Properties of Estimators

INTRODUCTION

To begin, we reiterate the fact that "if an estimator possesses a small-sample property, it also possesses the associated large-sample property; but the converse is not always true." In this lesson we shall examine the following large-sample properties of estimators: asymptotic unbiasedness, consistency, and asymptotic efficiency. The first and third properties are natural extensions of the small-sample properties of unbiasedness and efficiency in the limiting situation of an infinite number of measurements. The second property is about convergence of θ̂(k) to θ. Before embarking on a discussion of these three large-sample properties, we digress a bit to introduce the concept of asymptotic distribution and its associated asymptotic mean and variance (or covariance, in the vector situation). Doing this will help us better understand these large-sample properties.

ASYMPTOTIC DISTRIBUTIONS

According to Kmenta (1971, pg. 163), ". . . if the distribution of an estimator tends to become more and more similar in form to some specific distribution as the sample size increases, then such a specific distribution is called the asymptotic distribution of the estimator in question. . . . What is meant by the asymptotic distribution is not the ultimate form of the distribution, which may be degenerate, but the form that the distribution tends to put on in the last part of its journey to the final collapse (if this occurs)." Consider the situation depicted in Figure 7-1, where p_i(θ̂) denotes the probability density function associated with estimator θ̂ of the scalar parameter θ, based on i measurements. As the number of measurements increases, p_i(θ̂) changes its shape (although, in this example, each one of the density functions is Gaussian). The density function eventually centers itself about the true parameter value θ, and the variance associated with p_i(θ̂) tends to get smaller as i increases. Ultimately, the variance will become so small that in all probability θ̂ = θ. The asymptotic distribution refers to p_i(θ̂) as it evolves from i = 1, 2, …, etc., especially for large values of i.

The preceding example illustrates one of the three possible cases that can occur for an asymptotic distribution, namely the case when an estimator has a distribution of the same form regardless of the sample size, and this form is known (e.g., Gaussian). Some estimators have a distribution that, although not necessarily always of the same form, is also known for every sample size. For example, p₅(θ̂) may be uniform, p₂₀(θ̂) may be Rayleigh, and p₂₀₀(θ̂) may be Gaussian. Finally, for some estimators the distribution is not necessarily known for every sample size, but is known only for k → ∞.

Asymptotic distributions, like other distributions, are characterized by their moments. We are especially interested in their first two moments, namely, the asymptotic mean and variance.

Definition 7-1. The asymptotic mean is equal to the asymptotic expectation, namely lim_{k→∞} E{θ̂(k)}. ∎

Figure 7-1 Probability density function for estimate of scalar θ, as a function of number of measurements; e.g., p₂₀ is the p.d.f. for 20 measurements.


As noted in Goldberger (1944, pg. 116), if E{θ̂(k)} = m for all k, then

ASYMPTOTIC

UNBIASEDNESS

p?m E{t!(k)j = linn m = RL Alternatively, supposethat E{@k)} = m + k-‘cI + k -‘c2 +

l

l

Definition 7-3.

9

V-1)

Estimator 6(k) is an asymptotically unbiased estimator

of deterministic 8, if ,!@=E&k))

where the c’s are finite constants;then, p9z E{i(k)] = 1$X{m + k-‘q + ke2c2 + . . l} = m

P-9

Thus if E{@c)} is expressible as a power series in k”, k-l, k-‘, . . - , the asymptotic mean of 6(k) is the leading term of this power series; ask + zcthe terms of “higher order of smallness” in k vanish. The usymptutic variurxe, wh@ is short fur “variance of the asymptotic distribution” is not equal to lirnzvar [@)]. It is defined as

Definition 7-Z.

asymptotic var [8(k)] = Lk lee E{k[&k)

- limm@(k)}]j

II

(7-3)

Kmenta (1971, pg. 164) states“The asymptotic variance . . . is not equal to j@= var(@. The reason is that in the case of estimators whose variance decreaseswith an increasein k, the variancewill approach zero as k + m. This will happen when the distribution collapseson a point. But, as we explained, the asymptotic distribution is not the same as the collapsed (degenerate) distribution, and its variance is WCzero.” Goldberger (1944, pg. 116) notes that if E{[@k) - lim E{&k)}]‘} = u /k for aZI values of k, then asymptotic var [6(k)] = v /k. A&Fnatively, suppose that E{[&k)

- j@= E{i(k)}12j = k--Iv + k-2cI + k-‘cj -+ 0. l

= 9

(7-6)

or of random 9, if dinn E&k))

= E(9)

0

P-7)

Figure 7-l depicts an example in which the asymptotic mean of 6 has convergedto 8 (note that pzoois centered about mm = 0). Note that for the calculation on the left-haqd side of either (7-6) or (7-7) to be performed, the asymptotic distribution of B(k) must exist, because E{ii(k)) = J= - - .I= l!qk)p(Cqk))&(k) -70 --m

U-8)

Example 7-1

RecalI our linear model S.(k) = Z(k)0 + T(k) in which E(Sr(k)} = 0. Let us assume that each component of Y(k) is uncorrelated and has the same variance d. In Lesson 8, we determine an unbiased estimator for CT:.Here, on the other hand, we just assume that

where 2(k) = Z(k) - X(k)&,(k).

We leave it to the reader to show that (seeLesson 8)

E{z (k)}= (q)

CT?

Observe that I$?(k) is not an unbiased estimator of d; but, it is an asymptotically unbiased estimator, because ,II- [(k - n)/k] & = CJ-?. q

where the the c’s are finite constants;then, asymptotic var [t?(k)] = i Lirn *a (V + ke1c2 + ke2c3 + -0 a)

CONSISTENCY

We now direct our attention to the issueof stochasticconvergence.The reader should review the different modes of stochastic convergence, especially con-

vergence in probability and mean-squaredconvergence (see, for example, Thus if the variance of each 6(k) is expressibleas a power series in k-l, k -2, the asymptotic variance of 6(k) is the leading term of this power series;as k goesto infinity the terms of “higher order of smallness”in k vanish. Observe that if (7-5) is true then the asymptotic variance of 6(k) decreases as k+m.. . which correspondsto the situation depicted in Figure 7-l. Extensions of our definitions of asymptotic mean and variance to sequencesof random vectors [e.g.! 6(k), k = lq 2? . . .] are straightforward and can be found in Goldberger (1964, pg. 117). .

l

*

Papoulis, 1965).

,

Theprobability limit of 6(k) is the point 0* on which the Definition 7-4. distribution of our estimator collapses. We abbreviate “probability limit of 6(k)” by plim 6(k). Mathematically speaking, plim 6(k) = 8* -i& where E is a small positive number.

Pr [Ii(k) El

- 8*1 1 e]-,O

(7-11)
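The probability in (7-11) can be estimated by simulation. The sketch below is an added illustration (the Uniform(0, 1) population, ε = 0.05, and the sample sizes are arbitrary choices); it shows the probability of falling outside the ε-band shrinking toward zero, i.e., plim x̄ = 0.5:

```python
import random

random.seed(7)
theta_star = 0.5   # plim of the sample mean for Uniform(0, 1) data
eps = 0.05

def prob_outside(k, trials=4000):
    """Monte Carlo estimate of Pr[|theta_hat(k) - theta*| >= eps]."""
    hits = 0
    for _ in range(trials):
        xbar = sum(random.random() for _ in range(k)) / k
        if abs(xbar - theta_star) >= eps:
            hits += 1
    return hits / trials

probs = [prob_outside(k) for k in (10, 100, 1000)]
# probs decreases toward zero as k grows: the distribution collapses on 0.5
```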

Definition 7-5. θ̂(k) is a consistent estimator of θ if

    plim θ̂(k) = θ    (7-12)

□

Note that "consistency" means the same thing as "convergence in probability." For an estimator to be consistent, its probability limit θ* must equal the true value θ. Note, also, that a consistent estimator need not be unbiased or asymptotically unbiased.

Why is convergence in probability so popular and widely used in the estimation field? One reason is that plim (·) can be treated as an operator. For example, suppose X_k and Y_k are two random sequences for which plim X_k = X and plim Y_k = Y; then (see Tucker, 1962, for simple proofs of these and other facts)

    plim X_k Y_k = (plim X_k)(plim Y_k) = XY    (7-13)

and

    plim (X_k / Y_k) = (plim X_k)/(plim Y_k) = X/Y    (7-14)

Additionally, suppose A_k and B_k are two commensurate matrix sequences for which plim A_k = A and plim B_k = B [note that plim A_k, for example, means (plim a_ij(k))_ij]; then

    plim A_k B_k = (plim A_k)(plim B_k) = AB    (7-15)
    plim A_k⁻¹ = (plim A_k)⁻¹ = A⁻¹    (7-16)
    plim A_k⁻¹ B_k = A⁻¹B    (7-17)

The treatment of plim (·) as an operator often makes the study of consistency quite easy. We shall demonstrate the truth of this in Lesson 8, when we examine the consistency of the least-squares estimator. A second reason for the importance of consistency is the property that "consistency carries over"; i.e., any continuous function of a consistent estimator is itself a consistent estimator [see Tucker, 1967, for a proof of this property, which relies heavily on the preceding treatment of plim (·) as an operator].

Example 7-2
Suppose θ̂ is a consistent estimator of θ. Then 1/θ̂ is a consistent estimator of 1/θ, (θ̂)² is a consistent estimator of θ², and ln θ̂ is a consistent estimator of ln θ. These facts are all due to the consistency carry-over property. □

The reader may be scratching his or her head at this point, wondering about the emphasis placed on these illustrative examples. Isn't, for example, using θ̂ to estimate θ² by (θ̂)² the "natural" thing to do? The answer is "Yes, but only if we know ahead of time that θ̂ is a consistent estimator of θ." If you
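The carry-over property of Example 7-2 can be watched in action. The sketch below is an added illustration (the Gaussian population with μ = 3 and the sample sizes are arbitrary choices); it shows the average error of (x̄)² as an estimate of μ² shrinking as k grows:

```python
import random

random.seed(3)
mu = 3.0  # true mean; arbitrary choice

def sq_mean_estimate(k):
    """(xbar)^2: a continuous function of the consistent estimator xbar."""
    xbar = sum(random.gauss(mu, 1.0) for _ in range(k)) / k
    return xbar ** 2

def mean_abs_error(k, trials=200):
    return sum(abs(sq_mean_estimate(k) - mu ** 2) for _ in range(trials)) / trials

err_10, err_1000 = mean_abs_error(10), mean_abs_error(1000)
# err_1000 is much smaller than err_10: (xbar)^2 is consistent for mu^2 = 9
```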


do not know this to be true, then there is no guarantee that θ̂² = (θ̂)². In Lesson 11 we show that maximum-likelihood estimators are consistent; thus, θ̂²_ML = (θ̂_ML)². We mention this property about maximum-likelihood estimators here because one must know whether or not an estimator is consistent before applying the consistency carry-over property. Not all estimators are consistent! Finally, this carry-over property for consistency does not necessarily apply to other properties. For example, if θ̂(k) is an unbiased estimator of θ, then Aθ̂(k) + b will be an unbiased estimator of Aθ + b; but θ̂²(k) will not be an unbiased estimator of θ².

How do you determine whether or not an estimator is consistent? Often the direct approach, which makes heavy use of plim (·) operator algebra, is possible. Sometimes an indirect approach is used, one that examines whether both the bias in θ̂(k) and the variance of θ̂(k) approach zero as k → ∞. In order to understand the validity of this indirect approach, we digress to discuss mean-squared convergence and its relationship to convergence in probability.

Definition 7-6. Estimator θ̂(k) converges to θ in a mean-squared sense if

    lim_{k→∞} E{[θ̂(k) − θ]²} = 0    (7-18)

□

Theorem 7-1. If θ̂(k) converges to θ in mean square, then it converges to θ in probability.

Proof (Papoulis, 1965, pg. 151). Recall the inequality of Bienaymé:

    Pr [|x − a| ≥ ε] ≤ E{|x − a|²}/ε²    (7-19)

Let a = 0 and x = θ̂(k) − θ in (7-19), and take the limit as k → ∞ on both sides of (7-19), to see that

    lim_{k→∞} Pr [|θ̂(k) − θ| ≥ ε] ≤ lim_{k→∞} E{[θ̂(k) − θ]²}/ε²    (7-20)

Using the fact that θ̂(k) converges to θ in mean square, we see that

    lim_{k→∞} Pr [|θ̂(k) − θ| ≥ ε] → 0    (7-21)

thus, θ̂(k) converges to θ in probability. □

Recall, from probability theory, that although mean-squared convergence implies convergence in probability, the converse is not true.

Example 7-3 (Kmenta, 1971, pg. 166)
Let θ̂(k) be an estimator of θ, and let the probability distribution of θ̂(k) be the two-point distribution

    Pr [θ̂(k) = θ] = 1 − 1/k  and  Pr [θ̂(k) = k] = 1/k

In this example θ̂(k) can only assume two different values, θ and k. Obviously, θ̂(k) is consistent, because as k → ∞ the probability that θ̂(k) equals θ approaches unity; i.e., plim θ̂(k) = θ. Observe, also, that E{θ̂(k)} = 1 + θ(1 − 1/k), which means that θ̂(k) is biased. Now let us investigate the mean-squared error between θ̂ and θ; i.e.,

    E{[θ̂(k) − θ]²} = (θ − θ)²(1 − 1/k) + (k − θ)²(1/k) = (k − θ)²/k

In this pathological example the mean-squared error is diverging to infinity; but θ̂(k) converges to θ in probability. □
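The pathology in Example 7-3 can be reproduced by simulation. The sketch below is an added illustration (θ = 2 and the Monte Carlo sizes are arbitrary choices); it shows the hit probability Pr[θ̂(k) = θ] approaching one while the empirical mean-squared error, which behaves like (k − θ)²/k, keeps growing:

```python
import random

random.seed(11)
theta = 2.0  # true parameter; arbitrary choice

def draw_estimate(k):
    """Example 7-3: theta_hat(k) = theta w.p. 1 - 1/k, and = k w.p. 1/k."""
    return theta if random.random() < 1.0 - 1.0 / k else float(k)

def hit_prob_and_mse(k, trials):
    draws = [draw_estimate(k) for _ in range(trials)]
    hit = sum(d == theta for d in draws) / trials
    mse = sum((d - theta) ** 2 for d in draws) / trials
    return hit, mse

hit_100, mse_100 = hit_prob_and_mse(100, 20_000)
hit_10k, mse_10k = hit_prob_and_mse(10_000, 200_000)
# hit probability rises toward 1 (consistency) while the MSE keeps growing
```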

Theorem 7-2. Let θ̂(k) denote an estimator of θ. If bias θ̂(k) and variance θ̂(k) both approach zero as k → ∞, then the mean-squared error between θ̂(k) and θ approaches zero, and, therefore, θ̂(k) is a consistent estimator of θ.

Proof. From elementary probability theory, we know that

    E{[θ̂(k) − θ]²} = [bias θ̂(k)]² + variance θ̂(k)    (7-22)

If, as assumed, bias θ̂(k) and variance θ̂(k) both approach zero as k → ∞, then

    lim_{k→∞} E{[θ̂(k) − θ]²} = 0    (7-23)

which means that θ̂(k) converges to θ in mean square. Thus, by Theorem 7-1, θ̂(k) also converges to θ in probability. □

The importance of Theorem 7-2 is that it provides a constructive way to test for consistency.

ASYMPTOTIC EFFICIENCY

Definition 7-7. θ̂(k) is an asymptotically efficient estimator of scalar parameter θ if:
1. θ̂(k) has an asymptotic distribution with finite mean and variance,
2. θ̂(k) is consistent, and
3. the variance of the asymptotic distribution equals the Cramér-Rao lower bound (see Theorem 6-4). □

For the case of a vector of parameters, conditions 1 and 2 in this definition are unchanged; however, condition 3 is changed to read "if the covariance matrix of the asymptotic distribution equals J⁻¹ (see Theorem 6-4)."

PROBLEMS

7-1. Random variable X ~ N(x; μ, σ²), and we are given a random sample (x₁, x₂, . . . , x_N). Consider the following estimator, μ̂(N), for μ, in which a ≥ 0.
(a) For what value(s) of a is μ̂(N) an asymptotically unbiased estimator of μ?
(b) Prove that μ̂(N) is a consistent estimator of μ for all a ≥ 0.
(c) Compare the results obtained in (a) with those obtained in Problem 6-3.

7-2. Suppose z₁, z₂, . . . , z_N are random samples from a Gaussian distribution with unknown mean μ and variance σ². Reasonable estimators of μ and σ² are the sample mean, z̄, and the sample variance,

    s² = (1/N) Σ_{i=1}^{N} (z_i − z̄)²

(a) Is s² an asymptotically unbiased estimator of σ²? [Hint: Show that E{s²} = (N − 1)σ²/N.]
(b) Compare the result in (a) with that from Problem 6-4.
(c) One can show that the variance of s² is

    var(s²) = (μ₄ − σ⁴)/N − 2(μ₄ − 2σ⁴)/N² + (μ₄ − 3σ⁴)/N³

Explain whether or not s² is a consistent estimator of σ².

7-3. Random variable X ~ N(x; μ, σ²). Consider the following estimator of the population mean, obtained from a random sample of N observations of X:

    μ̂(N) = x̄ + a/N

where a is a finite constant and x̄ is the sample mean.
(a) What are the asymptotic mean and variance of μ̂(N)?
(b) Is μ̂(N) a consistent estimator of μ?
(c) Is μ̂(N) asymptotically efficient?

7-4. Let X be a Gaussian variable with mean μ and variance σ². Consider the problem of estimating μ from a random sample of observations x₁, x₂, . . . , x_N. Three estimators, μ̂₁, μ̂₂, and μ̂₃, are proposed. You are to study unbiasedness, efficiency, asymptotic unbiasedness, consistency, and asymptotic efficiency for these estimators. Show the analysis that allows you to complete the following table; entries in the table are yes or no.

    Property               μ̂₁    μ̂₂    μ̂₃
    Small sample:
      Unbiasedness
      Efficiency
    Large sample:
      Unbiasedness
      Consistency
      Efficiency

7-5. If plim X_k = X and plim Y_k = Y, where X and Y are constants, then plim (X_k + Y_k) = X + Y and plim cX_k = cX, where c is a constant. Prove that plim X_k Y_k = XY. {Hint: X_k Y_k = ¼[(X_k + Y_k)² − (X_k − Y_k)²].}

Lesson 8
Properties of Least-Squares Estimators

In this lesson we study some small- and large-sample properties of least-squares estimators. Recall that in least squares we estimate the n × 1 parameter vector θ of the linear model Z(k) = H(k)θ + V(k). We will see that most of the results in this lesson require H(k) to be deterministic, or H(k) and V(k) to be statistically independent. In some applications one or the other of these requirements is met; however, there are many important applications where neither is met.

SMALL SAMPLE PROPERTIES OF LEAST-SQUARES ESTIMATORS

In this section (parts of which are taken from Mendel, 1973, pp. 75-86) we examine the bias and variance of weighted least-squares and least-squares estimators. To begin, we recall Example 6-2, in which we showed that, when H(k) is deterministic, the WLSE of θ is unbiased. We also showed [after the statement of Theorem 6-2 and Equation (6-13)] that our recursive WLSE of θ has the requisite structure of an unbiased estimator, but that unbiasedness of the recursive WLSE of θ also requires h(k + 1) to be deterministic. When H(k) is random, we have the following important result:

Theorem 8-1. The WLSE of θ,

    θ̂_WLS(k) = [H'(k)W(k)H(k)]⁻¹ H'(k)W(k)Z(k)    (8-1)

is unbiased if E{V(k)} = 0 and V(k) and H(k) are statistically independent.

Note that this is the first place where, in connection with least squares, we have had to assume any a priori knowledge about noise V(k).

Proof. From (8-1) and Z(k) = H(k)θ + V(k), we find that

    θ̂_WLS(k) = (H'WH)⁻¹H'W(Hθ + V) = θ + (H'WH)⁻¹H'WV  for all k    (8-2)

where, for notational simplification, we have omitted the functional dependences of H, W, and V on k. Taking the expectation on both sides of (8-2), it follows that

    E{θ̂_WLS(k)} = θ + E{(H'WH)⁻¹H'W}E{V}  for all k    (8-3)

In deriving (8-3) we have used the fact that H(k) and V(k) are statistically independent [recall that, if two random variables a and b are statistically independent, p(a, b) = p(a)p(b); thus, E{ab} = E{a}E{b} and E{g(a)h(b)} = E{g(a)}E{h(b)}]. The second term in (8-3) is zero, because E{V} = 0; therefore, E{θ̂_WLS(k)} = θ for all k. □

This theorem only states sufficient conditions for unbiasedness of θ̂_WLS(k), which means that, if we do not satisfy these conditions, we cannot conclude anything about whether θ̂_WLS(k) is unbiased or biased. In order to obtain necessary conditions for unbiasedness, assume that E{θ̂_WLS(k)} = θ and take the expectation on both sides of (8-2). Doing this, we see that

    E{(H'WH)⁻¹H'WV} = 0    (8-5)

Letting M = (H'WH)⁻¹H'W, and letting mᵢ' denote the ith row of matrix M, (8-5) can be expressed as the following collection of orthogonality conditions:

    E{mᵢ'V} = 0  for i = 1, 2, . . . , n    (8-6)

Orthogonality [recall that two random variables a and b are orthogonal if E{ab} = 0] is a weaker condition than statistical independence, but it is often more difficult to verify ahead of time than independence, especially since mᵢ' is a very nonlinear transformation of the random elements of H.

Example 8-1
Recall the impulse response identification Example 2-1, in which θ = col [h(1), h(2), . . . , h(n)], where h(i) is the value of the sampled impulse response at time i. System input u(k) may be deterministic or random. If {u(k), k = 0, 1, . . . , N − 1} is deterministic, then H(N − 1) [see Equation (2-5)] is deterministic, so that θ̂_WLS(N) is an unbiased estimator of θ. Often, one uses a random input sequence for {u(k), k = 0, 1, . . . , N − 1}, such as from a random number generator. This random sequence is in no way related to the measurement noise process, which means that H(N − 1) and V(N) are statistically independent, and again θ̂_WLS(N) will be an unbiased estimate of the impulse response coefficients. We conclude, therefore, that WLSEs of impulse response coefficients are unbiased. □

Example 8-2
As a further illustration of an application of Theorem 8-1, let us take a look at the weighted least-squares estimates of the n a-coefficients in the Example 2-2 AR model. We shall now demonstrate that H(N − 1) and V(N − 1), which are defined in Equation (2-8), are dependent, which means, of course, that we cannot apply Theorem 8-1 to study the unbiasedness of the WLSEs of the a-coefficients. We represent the explicit dependences of H and V on their elements in the following manner:

    H = H[y(N − 1), y(N − 2), . . . , y(0)]    (8-7)

and

    V = V[u(N − 1), u(N − 2), . . . , u(0)]    (8-8)

Direct iteration of difference equation (7) in Lesson 2 for k = 1, 2, . . . , N − 1 reveals that y(1) depends on u(0), y(2) depends on u(1) and u(0), and, finally, that y(N − 1) depends on u(N − 2), . . . , u(0); thus,

    H[y(N − 1), y(N − 2), . . . , y(0)] = H[u(N − 2), u(N − 3), . . . , u(0), y(0)]    (8-9)

Comparing (8-8) and (8-9), we see that H and V depend on similar values of random input u; hence, they are statistically dependent. □
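Example 8-1's claim, that a random input independent of the measurement noise still yields unbiased impulse response estimates, can be checked by Monte Carlo. The sketch below is an added illustration (the two-tap response, noise level, record length, and trial count are all arbitrary choices); it solves the 2 × 2 normal equations by Cramer's rule to stay self-contained:

```python
import random

random.seed(5)
h = (1.0, 0.5)   # true impulse response taps h(1), h(2); arbitrary choice
N, trials = 200, 500

def ls_estimate():
    # random input sequence, independent of the measurement noise (Example 8-1)
    u = [random.gauss(0.0, 1.0) for _ in range(N + 2)]
    rows = [(u[k + 1], u[k]) for k in range(N)]   # rows of the regressor matrix H
    z = [h[0] * a + h[1] * b + random.gauss(0.0, 0.5) for a, b in rows]
    # solve the normal equations H'H theta = H'z by Cramer's rule
    s11 = sum(a * a for a, _ in rows)
    s22 = sum(b * b for _, b in rows)
    s12 = sum(a * b for a, b in rows)
    r1 = sum(row[0] * zi for row, zi in zip(rows, z))
    r2 = sum(row[1] * zi for row, zi in zip(rows, z))
    det = s11 * s22 - s12 * s12
    return ((s22 * r1 - s12 * r2) / det, (s11 * r2 - s12 * r1) / det)

estimates = [ls_estimate() for _ in range(trials)]
avg = (sum(e[0] for e in estimates) / trials, sum(e[1] for e in estimates) / trials)
# avg is close to (1.0, 0.5): the LSE of impulse response taps is unbiased
```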

Example 8-3
We are interested in estimating the parameter a in the following first-order system:

    y(k + 1) = −a y(k) + u(k)    (8-10)

where u(k) is a zero-mean white noise sequence. One approach to doing this is to collect y(k + 1), y(k), . . . , y(1) as follows:

    col [y(k + 1), y(k), . . . , y(1)] = −col [y(k), y(k − 1), . . . , y(0)] a + col [u(k), u(k − 1), . . . , u(0)]    (8-11)

i.e., Z(k + 1) = H(k)a + V(k), and to obtain â_LS. In order to study the bias of â_LS, we use (8-2), in which W(k) is set equal to I, and H(k) and V(k) are defined in (8-11). We also set θ̂_WLS(k) = â_LS(k + 1); the argument of â_LS is k + 1 instead of k because the argument of Z in (8-11) is k + 1. Doing this, we find that

    â_LS(k + 1) = a − [Σ_{i=0}^{k} y(i)u(i)] / [Σ_{i=0}^{k} y²(i)]    (8-12)

thus,

    E{â_LS(k + 1)} = a − E{[Σ_{i=0}^{k} y(i)u(i)] / [Σ_{i=0}^{k} y²(i)]}    (8-13)

Note, from (8-10), that y(j) depends at most on u(j − 1); therefore, u(k) is statistically independent of y(0), y(1), . . . , y(k), and the i = k term in (8-13) satisfies

    E{u(k)y(k) / Σ_{i=0}^{k} y²(i)} = E{u(k)} E{y(k) / Σ_{i=0}^{k} y²(i)}    (8-14)
                                   = 0    (8-15)

because E{u(k)} = 0. Unfortunately, all of the remaining terms in (8-13), i.e., those for i = 0, 1, . . . , k − 1, will not be equal to zero; consequently,

    E{â_LS(k + 1)} = a + 0 + k nonzero terms

Unless we are very lucky, so that the k nonzero terms sum identically to zero, E{â_LS(k + 1)} ≠ a, which means, of course, that â_LS is biased. The results in this example generalize to higher-order difference equations, so that we can conclude that least-squares estimates of coefficients in an AR model are biased. □

In the method of instrumental variables, H(k) is replaced by H*(k), where H*(k) is chosen so that it is statistically independent of V(k). There can be, and in general there will be, many choices of H*(k) that may qualify as instrumental variables. It is often difficult to check that H*(k) is statistically independent of V(k).

Next we proceed to compute the covariance matrix of θ̃_WLS(k), where

    θ̃_WLS(k) = θ − θ̂_WLS(k)    (8-16)

Theorem 8-2. If E{V(k)} = 0, V(k) and H(k) are statistically independent, and

    E{V(k)V'(k)} = R(k)    (8-17)

then

    cov [θ̃_WLS(k)] = E_H{(H'WH)⁻¹H'WRWH(H'WH)⁻¹}    (8-18)

Proof. Because E{V(k)} = 0 and V(k) and H(k) are statistically independent, E{θ̃_WLS(k)} = 0, so that

    cov [θ̃_WLS(k)] = E{θ̃_WLS(k)θ̃'_WLS(k)}    (8-19)

Using (8-2) in (8-16), we see that

    θ̃_WLS(k) = −(H'WH)⁻¹H'WV    (8-20)

hence,

    cov [θ̃_WLS(k)] = E_{H,V}{(H'WH)⁻¹H'WVV'WH(H'WH)⁻¹}    (8-21)

where we have made use of the fact that W is a symmetric matrix and the transpose and inverse symbols may be permuted. From probability theory (e.g., Papoulis, 1965), recall that

    E_{H,V}{g(H, V)} = E_H{E_{V|H}{g(H, V) | H}}    (8-22)

Applying (8-22) to (8-21), and using the independence of H and V so that E_{V|H}{VV'} = E{VV'} = R, we obtain (8-18). □

As it stands, Theorem 8-2 is not too useful, because it is virtually impossible to compute the expectation in (8-18), due to the highly nonlinear dependence of (H'WH)⁻¹H'WRWH(H'WH)⁻¹ on H. The following special case of Theorem 8-2 is important in practical applications in which H(k) is deterministic and R(k) = σ²_v I.

Corollary 8-1. Given the conditions in Theorem 8-2, and that H(k) is deterministic and the components of V(k) are independent and identically distributed with zero mean and constant variance σ²_v, then

    cov [θ̃_LS(k)] = σ²_v [H'(k)H(k)]⁻¹    (8-23)

Proof. When H is deterministic, cov [θ̃_WLS(k)] is obtained from (8-18) by deleting the expectation on its right-hand side. To obtain cov [θ̃_LS(k)] when cov [V(k)] = σ²_v I, set W = I and R(k) = σ²_v I in (8-18). The result is (8-23). □

Usually, when we use a least-squares estimation algorithm, we do not know the numerical value of σ²_v. If σ²_v is known ahead of time, it can be used directly in the estimate of θ; we show how to do this in Lesson 9. Where do we obtain σ²_v in order to compute (8-23)? We can estimate it!

Theorem 8-3. An unbiased estimator of σ²_v is

    σ̂²_v(k) = Ẑ'(k)Ẑ(k)/(k − n)    (8-24)

where

    Ẑ(k) = Z(k) − H(k)θ̂_LS(k)    (8-25)

Proof. We shall proceed by computing E{Ẑ'(k)Ẑ(k)}. First, we compute an expression for Ẑ(k). Substituting both the linear model for Z(k) and the least-squares formula for θ̂_LS(k) into (8-25), we find that

    Ẑ(k) = [I_k − H(H'H)⁻¹H']V    (8-26)


where I_k is the k × k identity matrix. Let

    M = I_k − H(H'H)⁻¹H'    (8-27)

Matrix M is idempotent, i.e., M' = M and M² = M; therefore,

    Ẑ'(k)Ẑ(k) = V'M'MV = V'MV    (8-28)

Recall the following well-known facts about the trace of a matrix:
1. E{tr A} = tr E{A}
2. tr cA = c tr A, where c is a scalar
3. tr (A + B) = tr A + tr B
4. tr I_N = N
5. tr AB = tr BA

Using these facts, we now continue the development of (8-28), as follows:

    E{Ẑ'Ẑ} = E{tr (V'MV)} = E{tr (MVV')} = tr [M E{VV'}] = tr (M σ²_v I) = σ²_v tr M
            = σ²_v tr [I_k − H(H'H)⁻¹H'] = σ²_v [k − tr H(H'H)⁻¹H']
            = σ²_v [k − tr (H'H)(H'H)⁻¹] = σ²_v (k − tr I_n) = σ²_v (k − n)    (8-29)

Solving this equation for σ²_v, we find that

    σ²_v = E{Ẑ'(k)Ẑ(k)}/(k − n)    (8-30)

Although this is an exact result for σ²_v, it is not one that can be evaluated, because we cannot compute E{Ẑ'Ẑ}. Using the structure of (8-30) as a starting point, we estimate σ²_v by the simple formula

    σ̂²_v(k) = Ẑ'(k)Ẑ(k)/(k − n)    (8-31)

because the latter quantity can be computed from Z(k) and θ̂_LS(k), as in (8-25). To show that σ̂²_v(k) is an unbiased estimator of σ²_v, we observe that

    E{σ̂²_v(k)} = E{Ẑ'(k)Ẑ(k)}/(k − n) = σ²_v (k − n)/(k − n) = σ²_v    (8-32)

where we have used (8-29) for E{Ẑ'Ẑ}. □

LARGE SAMPLE PROPERTIES OF LEAST-SQUARES ESTIMATORS

Many large sample properties of LSEs are determined by establishing that the LSE is equivalent to another estimator for which it is known that the large sample property holds true. In Lesson 11, for example, we will provide conditions under which the LSE of θ, θ̂_LS(k), is the same as the maximum-likelihood estimator of θ, θ̂_ML(k). Because θ̂_ML(k) is consistent, asymptotically efficient, and asymptotically Gaussian, θ̂_LS(k) inherits all these properties.

Theorem 8-4. If

    plim [H'(k)H(k)/k] = Σ_H    (8-33)

Σ_H⁻¹ exists, and

    plim [H'(k)V(k)/k] = 0    (8-34)

then

    plim θ̂_LS(k) = θ    (8-35)

Note that the probability limit of a matrix equals a matrix each of whose elements is the probability limit of the respective matrix element. Assumption (8-33) postulates the existence of a probability limit for the second-order moments of the variables in H(k), as given by Σ_H. Assumption (8-34) postulates a zero probability limit for the correlation between H(k) and V(k); H'(k)V(k) can be thought of as a "filtered" version of noise vector V(k), and for (8-34) to be true "filter H'(k)" must be stable. If, for example, H(k) is deterministic and σ²_v(k) < ∞, then (8-34) will be true.

Proof. Beginning with (8-2), but for θ̂_LS(k) instead of θ̂_WLS(k), we see that

    θ̂_LS(k) = θ + (H'H)⁻¹H'V    (8-36)

Operating on both sides of this equation with plim, and using properties (7-15), (7-16), and (7-17), we find that

    plim θ̂_LS(k) = θ + plim [(H'H/k)⁻¹(H'V/k)]
                 = θ + plim (H'H/k)⁻¹ plim (H'V/k)
                 = θ + Σ_H⁻¹ · 0
                 = θ

which demonstrates that, under the given conditions, θ̂_LS(k) is a consistent estimator of θ. □

In some important applications Eq. (8-34) does not apply, e.g., Example 8-2. Theorem 8-4 then does not apply, and the study of consistency is often quite complicated in these cases.
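Example 8-3's finite-sample bias, and the fact that for a stable AR model it nevertheless fades as the record grows (the consistency asserted in Problem 8-3, even though Theorem 8-4 cannot be invoked directly), can both be seen in a small simulation. This sketch is an added illustration; a = 0.9, the unit-variance Gaussian input, the record lengths, and the trial count are arbitrary choices:

```python
import random

random.seed(8)
a = 0.9  # true parameter in y(k+1) = -a*y(k) + u(k); arbitrary choice

def a_ls(k):
    """Least-squares estimate (8-12) from a simulated record y(0), ..., y(k+1)."""
    y = [0.0]
    for _ in range(k + 1):
        y.append(-a * y[-1] + random.gauss(0.0, 1.0))
    num = sum(y[i] * y[i + 1] for i in range(k + 1))
    den = sum(y[i] * y[i] for i in range(k + 1))
    return -num / den

def avg_bias(k, trials=400):
    return sum(a_ls(k) - a for _ in range(trials)) / trials

bias_short, bias_long = avg_bias(20), avg_bias(2000)
# |bias_short| is clearly nonzero (Example 8-3), while |bias_long| is far smaller
```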

Theorem 8-5. If (8-33) and (8-34) are true, Σ_H⁻¹ exists, and

    plim [V'(k)V(k)/k] = σ²_v    (8-37)

then

    plim σ̂²_v(k) = σ²_v    (8-38)

where σ̂²_v(k) is given by (8-24).

Proof. From (8-26), we find that

    Ẑ'Ẑ = V'V − V'H(H'H)⁻¹H'V    (8-39)

Consequently,

    plim σ̂²_v(k) = plim Ẑ'Ẑ/(k − n)
                 = plim V'V/(k − n) − plim V'H(H'H)⁻¹H'V/(k − n)
                 = σ²_v − plim V'H/(k − n) · plim [H'H/(k − n)]⁻¹ · plim H'V/(k − n)

By (8-34) the first and last probability limits in the final term are zero, and by (8-33) the middle one is finite; hence, plim σ̂²_v(k) = σ²_v − 0 = σ²_v. □

PROBLEMS

8-1. Suppose that θ̂_LS is an unbiased estimator of θ. Is θ̂²_LS an unbiased estimator of θ²? (Hint: Use the least-squares batch algorithm to study this question.)

8-2. For θ̂_WLS(k) to be an unbiased estimator of θ, we required E{V(k)} = 0. This problem considers the case when E{V(k)} ≠ 0.
(a) Assume that E{V(k)} = V₀, where V₀ is known to us. How is the concatenated measurement equation Z(k) = H(k)θ + V(k) modified in this case, so that we can use the results derived in this lesson to obtain θ̂_WLS(k) or θ̂_LS(k)?
(b) Assume that E{V(k)} = m_V 1 (1 a vector of ones), where m_V is constant but unknown. How is the concatenated measurement equation Z(k) = H(k)θ + V(k) modified in this case, so that we can obtain least-squares estimates of both θ and m_V?

8-3. Consider the stable autoregressive model y(k) = θ₁y(k − 1) + · · · + θₙy(k − n) + e(k), in which the e(k) are identically distributed random variables with mean zero and finite variance σ². Prove that the least-squares estimates of θ₁, . . . , θₙ are consistent (see, also, Ljung, 1976).

8-4. In this lesson we have assumed that the H(k) variables have been measured without error. Here we examine the situation when H_m(k) = H(k) + N(k), in which H(k) denotes a matrix of true values and N(k) a matrix of measurement errors. The basic linear model is now

    Z(k) = H(k)θ + V(k) = H_m(k)θ + [V(k) − N(k)θ]

Prove that θ̂_LS(k) is not a consistent estimator of θ.

Lesson 9
Best Linear Unbiased Estimation

INTRODUCTION

Least-squares estimation, as described in Lessons 3, 4, and 5, is for the linear model

    Z(k) = H(k)θ + V(k)    (9-1)

where θ is a deterministic but unknown vector of parameters, H(k) can be deterministic or random, and we do not know anything about V(k) ahead of time. By minimizing Ẑ'(k)W(k)Ẑ(k), where Ẑ(k) = Z(k) − H(k)θ̂_WLS(k), we determined that θ̂_WLS(k) is a linear transformation of Z(k), i.e., θ̂_WLS(k) = F_WLS(k)Z(k). After establishing the structure of θ̂_WLS(k), we studied its small- and large-sample properties. Unfortunately, θ̂_WLS(k) is not always unbiased or efficient; these properties were not built into θ̂_WLS(k) during its design.

In this lesson we develop our second estimator. It will be both unbiased and efficient, by design. In addition, we want the estimator to be a linear function of the measurements Z(k). This estimator is called a best linear unbiased estimator (BLUE) or an unbiased minimum-variance estimator (UMVE). To keep notation relatively simple, we will use θ̂_BLU(k) to denote the BLUE of θ.

As in least squares, we begin with the linear model in (9-1), where θ is deterministic. Now, however, H(k) must be deterministic, and V(k) is assumed to be zero mean with positive definite known covariance matrix R(k). An example of such a covariance matrix occurs for white noise. In the case of scalar measurements z(k), this means that scalar noise v(k) is white, i.e.,

    E{v(i)v(j)} = σ²_v(i) δ_ij    (9-2)

where δ_ij is the Kronecker δ (i.e., δ_ij = 0 for i ≠ j and δ_ij = 1 for i = j); thus,

    R(k) = E{V(k)V'(k)} = diag [σ²_v(k), σ²_v(k − 1), . . . , σ²_v(k − N + 1)]    (9-3)

In the case of vector measurements z(k), this means that vector noise v(k) is white, i.e.,

    E{v(k)v'(j)} = R_v(k)δ_kj    (9-4)

thus,

    R(k) = diag [R_v(k), R_v(k − 1), . . . , R_v(k − N + 1)]    (9-5)

PROBLEM STATEMENT AND OBJECTIVE FUNCTION

We begin by assuming the following linear structure for θ̂_BLU(k):

    θ̂_BLU(k) = F(k)Z(k)    (9-6)

where, for notational simplicity, we have omitted subscripting F(k) as F_BLU(k). We shall design F(k) such that:
a. θ̂_BLU(k) is an unbiased estimator of θ, and
b. the error variance for each one of the n parameters is minimized.
In this way, θ̂_BLU(k) will be unbiased and efficient, by design. Recall, from Theorem 6-1, that unbiasedness constrains design matrix F(k) such that, for all k,

    F(k)H(k) = I    (9-7)

Our objective now is to choose the elements of F(k), subject to the constraint of (9-7), in such a way that the error variance for each one of the n parameters is minimized. In solving for F_BLU(k), it will be convenient to partition matrix F(k) in terms of its rows, fᵢ'(k) (i = 1, 2, . . . , n):

    F'(k) = [f₁(k) | f₂(k) | · · · | fₙ(k)]

Equation (9-7) can now be expressed in terms of the vector components of F(k). For our purposes, it is easier to work with the transpose of (9-7), H'(k)F'(k) = I, which can be expressed as

    H'(k)[f₁(k) | f₂(k) | · · · | fₙ(k)] = [e₁ | e₂ | · · · | eₙ]    (9-9)

where eᵢ is the ith unit vector,

    eᵢ = col (0, 0, . . . , 0, 1, 0, . . . , 0)    (9-10)

in which the nonzero element occurs in the ith position. Equating respective columns on both sides of (9-9), we find that

    H'(k)fᵢ(k) = eᵢ,  i = 1, 2, . . . , n    (9-11)

Our single unbiasedness constraint on matrix F(k) is now a set of n constraints on the rows of F(k). Next, we express E{[θᵢ − θ̂ᵢ,BLU(k)]²} in terms of fᵢ (i = 1, 2, . . . , n). We shall make use of (9-11), (9-1), and the following equivalent representation of θ̂ᵢ,BLU(k):

    θ̂ᵢ,BLU(k) = fᵢ'(k)Z(k),  i = 1, 2, . . . , n    (9-12)

Proceeding, we find that

    E{[θᵢ − θ̂ᵢ,BLU(k)]²} = E{(θᵢ − fᵢ'Z)²}
        = E{θᵢ² − 2θᵢ(Hθ + V)'fᵢ + [(Hθ + V)'fᵢ]²}
        = E{θᵢ² − 2θᵢθ'H'fᵢ − 2θᵢV'fᵢ + [θ'H'fᵢ + V'fᵢ]²}
        = E{θᵢ² − 2θᵢθ'eᵢ − 2θᵢV'fᵢ + [θ'eᵢ + V'fᵢ]²}
        = E{fᵢ'VV'fᵢ} = fᵢ'Rfᵢ    (9-13)

Observe that the error variance for the ith parameter depends only on the ith row of design matrix F(k). We therefore establish the following objective function:

    Jᵢ(fᵢ, λᵢ) = E{[θᵢ − θ̂ᵢ,BLU(k)]²} + λᵢ'(H'fᵢ − eᵢ) = fᵢ'Rfᵢ + λᵢ'(H'fᵢ − eᵢ)    (9-14)

where λᵢ is the ith vector of Lagrange multipliers, associated with the ith unbiasedness constraint. Our objective now is to minimize Jᵢ with respect to fᵢ and λᵢ (i = 1, 2, . . . , n).

DERIVATION OF ESTIMATOR

A necessary condition for minimizing Jᵢ(fᵢ, λᵢ) is ∂Jᵢ(fᵢ, λᵢ)/∂fᵢ = 0 (i = 1, 2, . . . , n); hence,

    2Rfᵢ + Hλᵢ = 0    (9-15)

from which we determine fᵢ as

    fᵢ = −(1/2)R⁻¹Hλᵢ    (9-16)

For (9-16) to be valid, R⁻¹ must exist. Any noise V(k) whose covariance matrix R is positive definite qualifies. Of course, if V(k) is white, then R is diagonal (or block diagonal) and R⁻¹ exists; this may also be true if V(k) is not white.

A second necessary condition for minimizing Jᵢ(fᵢ, λᵢ) is ∂Jᵢ(fᵢ, λᵢ)/∂λᵢ = 0 (i = 1, 2, . . . , n), which gives us the unbiasedness constraints

    H'fᵢ = eᵢ,  i = 1, 2, . . . , n    (9-17)

To determine λᵢ, substitute (9-16) into (9-17). Doing this, we find that

    λᵢ = −2(H'R⁻¹H)⁻¹eᵢ    (9-18)

whereupon

    fᵢ = R⁻¹H(H'R⁻¹H)⁻¹eᵢ    (9-19)

(i = 1, 2, . . . , n). Matrix F(k) is reconstructed from the fᵢ(k) as follows:

    F'(k) = [f₁ | f₂ | · · · | fₙ] = R⁻¹H(H'R⁻¹H)⁻¹[e₁ | e₂ | · · · | eₙ] = R⁻¹H(H'R⁻¹H)⁻¹    (9-20)

Hence,

    F_BLU(k) = [H'(k)R⁻¹(k)H(k)]⁻¹H'(k)R⁻¹(k)    (9-21)

which means that

    θ̂_BLU(k) = [H'(k)R⁻¹(k)H(k)]⁻¹H'(k)R⁻¹(k)Z(k)    (9-22)

COMPARISON OF θ̂_BLU(k) AND θ̂_WLS(k)

We are struck by the close similarity between θ̂_BLU(k) and θ̂_WLS(k).

Theorem 9-1. The BLUE of θ is the special case of the WLSE of θ when

    W(k) = R⁻¹(k)    (9-23)

If W(k) is diagonal, then (9-23) requires V(k) to be white.

Proof. Compare the formulas for θ̂_BLU(k) in (9-22) and θ̂_WLS(k) in (10) of Lesson 3. If W(k) is a diagonal matrix, then R(k) is a diagonal matrix only if V(k) is white. □

Matrix R⁻¹(k) weights the contributions of precise measurements heavily and deemphasizes the contributions of imprecise measurements. The best linear unbiased estimation design technique has led to a weighting matrix that is quite sensible. See Problem 9-2.

Corollary 9-1. All results obtained in Lessons 3, 4, and 5 for θ̂_WLS(k) can be applied to θ̂_BLU(k) by setting W(k) = R⁻¹(k). □

We leave it to the reader to explore the full implications of this important corollary, by reexamining the wide range of topics that were discussed in Lessons 3, 4, and 5.

Theorem 9-2 (Gauss-Markov Theorem). If R(k) = σ²_v I, then θ̂_BLU(k) = θ̂_LS(k).

Proof. Using (9-22) and the fact that R(k) = σ²_v I, we find that

    θ̂_BLU(k) = [H'(σ²_v I)⁻¹H]⁻¹H'(σ²_v I)⁻¹Z(k) = (H'H)⁻¹H'Z(k) = θ̂_LS(k)    □

Why is this a very important result? We have connected two seemingly different estimators, one of which, θ̂_BLU(k), has the properties of unbiasedness and minimum variance by design; hence, in this case θ̂_LS(k) inherits these properties. Remember, though, that the derivation of θ̂_BLU(k) required H(k) to be deterministic; thus, Theorem 9-2 is "conditioned" on H(k) being deterministic.

SOME PROPERTIES OF θ̂_BLU(k)

To begin, we direct our attention at the covariance matrix of the parameter estimation error θ̃_BLU(k).

Theorem 9-3. If V(k) is zero mean, then

    cov [θ̃_BLU(k)] = [H'(k)R⁻¹(k)H(k)]⁻¹    (9-24)

Proof. We apply Corollary 9-1 to cov [θ̃_WLS(k)] [given in (8-18) of Lesson 8] for the case when H(k) is deterministic, to see that

    cov [θ̃_BLU(k)] = cov [θ̃_WLS(k)] |_{W(k) = R⁻¹(k)}
                   = (H'R⁻¹H)⁻¹H'R⁻¹RR⁻¹H(H'R⁻¹H)⁻¹
                   = (H'R⁻¹H)⁻¹    □

Observe the great simplification of the expression for cov [θ̃_WLS(k)] when W(k) = R⁻¹(k). Note, also, that the error variance of θ̂ᵢ,BLU(k) is given by the ith diagonal element of cov [θ̃_BLU(k)].

Corollary 9-2. When W(k) = R⁻¹(k), the matrix P(k), which appears in the recursive WLSE of θ, equals cov [θ̃_BLU(k)]; i.e.,

    P(k) = cov [θ̃_BLU(k)]    (9-25)

Best Lrtear Unbiased

Estimation

Lesson 9

Proof. Recall Equation (13) of Lesson4, that

P(k) = [%?(k)W(k)X(k)J-’

(9-26)

When W(k) = %‘(A$, then P(k) = [X’(k)CR-l(k)X(k)]-l hence, P(k) = cov [6&k)]

(9-27j

becauseof (9-24) . 0

Some Properties

of k&k!

77

Substituting (9-32) and (9-33) into (9-29), and making repeated use of the unbiasednessconstraints X’F’ = %e’Fd= FOX0 = I, we find that 2 = F&RF; - IGRF’ = F,9tF; - F%F’ + 2(X%-‘ste)-’

- (%%-%)-’

- (Ye’%-1x)-* = F,%F; - F9tF’ + 2(%e’9t-%)-1(%e’9t-1%F’)

(9-34)

- (XW’%e)-‘(W!R-‘%F;)

Soon we will examine a recursive BLUE. Matrix P(k) WU have to be calculated, just as it has to be calculatedfor the recursive WLSE. Every time P(k) is calculated in our recursiveBLUE, we obtain a quantitative measureof how well we are estimating 9+Just look at the diagonal elements of P(k), k = 1,2,. . . . The same statementcannot be made for the meaning of P(k) in the recursive WLSE. In the r?cursiveWLSE, P(k) has no special meaning. Next, we examine cov [9&k)] in more detail.
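Theorem 9-3 is easy to confirm numerically; the following minimal sketch (values hypothetical) checks that the error covariance F_BLU ℛ F_BLU' collapses to (ℋ'ℛ⁻¹ℋ)⁻¹:

```python
import numpy as np

# Numerical check of Theorem 9-3 for an arbitrary small example (values
# are hypothetical): for F_BLU = (H'R⁻¹H)⁻¹H'R⁻¹, the error covariance
# F_BLU R F_BLU' collapses to (H'R⁻¹H)⁻¹, Equation (9-24).
H = np.array([[1.0, 0.5],
              [1.0, 1.5],
              [1.0, 2.5]])
R = np.diag([0.2, 1.0, 3.0])          # positive definite noise covariance
Rinv = np.linalg.inv(R)

P = np.linalg.inv(H.T @ Rinv @ H)     # claimed covariance, (9-24)
F = P @ H.T @ Rinv                    # BLUE gain matrix, (9-21)
cov_err = F @ R @ F.T                 # covariance of F V(k)

print(np.allclose(cov_err, P))        # True
```

The algebra behind the check is exactly the proof above: FℛF' = P(ℋ'ℛ⁻¹ℋ)P = P.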

Theorem 9-4. θ̂_BLU(k) is a most efficient estimator of θ within the class of all unbiased estimators that are linearly related to the measurements Z(k).

Proof (Mendel, 1973, pp. 155-156). According to Definition 6-3, we must show that

Σ = cov[θ̃_a(k)] − cov[θ̃_BLU(k)]    (9-28)

is positive semidefinite. In (9-28), θ̃_a(k) is the error associated with an arbitrary linear unbiased estimator of θ, θ̂_a(k) = F_a(k)Z(k). For convenience, we write Σ as

Σ = F_aℛF_a' − FℛF'    (9-29)

In order to compute Σ we use the fact that, because θ̂_a(k) is unbiased,

F_a(k)ℋ(k) = I    (9-31)

thus,

cov[θ̃_a(k)] = F_aℛF_a'    (9-32)

Because θ̂_BLU(k) = F(k)Z(k) and F(k)ℋ(k) = I,

cov[θ̃_BLU(k)] = FℛF'    (9-33)

Substituting (9-32) and (9-33) into (9-29), making repeated use of the unbiasedness constraints ℋ'F_a' = ℋ'F' = I and F_aℋ = Fℋ = I, and using the structure of F(k) given in (9-21) [which implies FℛF_a' = F_aℛF' = FℛF' = (ℋ'ℛ⁻¹ℋ)⁻¹], we find that

Σ = F_aℛF_a' − FℛF' + 2(ℋ'ℛ⁻¹ℋ)⁻¹ − (ℋ'ℛ⁻¹ℋ)⁻¹ − (ℋ'ℛ⁻¹ℋ)⁻¹    (9-34)

so that Σ can also be written as

Σ = F_aℛF_a' − FℛF' + 2FℛF' − FℛF_a' − F_aℛF' = (F_a − F)ℛ(F_a − F)'    (9-35)

In order to investigate the definiteness of Σ, consider the definiteness of

a'Σa = [(F_a − F)'a]'ℛ[(F_a − F)'a]    (9-36)

where a is an arbitrary nonzero vector. Matrix F (i.e., F_BLU) is unique; therefore (F_a − F)'a is a nonzero vector, unless F_a − F and a are orthogonal, which is a possibility that cannot be excluded. Because matrix ℛ is positive definite, a'Σa ≥ 0, which means that Σ is positive semidefinite. □

These results serve as further confirmation that designing F(k) as we have done, by minimizing only the diagonal elements of cov[θ̃_BLU(k)], is sound.

Corollary 9-3. If ℛ(k) = σ_v²I, then θ̂_LS(k) is a most efficient estimator of θ.

The proof of this result is a direct consequence of Theorems 9-2 and 9-4. □

At the end of Lesson 3 we noted that θ̂_WLS(k) may not be invariant under changes of scale. We demonstrate next that θ̂_BLU(k) is invariant to such changes.

Theorem 9-5. θ̂_BLU(k) is invariant under changes of scale.

Proof (Mendel, 1973, pp. 156-157). Assume that observers A and B are observing a process, but observer A reads the measurements in one set of units and B in another. Let M be a symmetric, invertible matrix of scale factors relating A to B (e.g., 5,280 ft/mile, 454 g/lb, etc.), and let Z_A(k) and Z_B(k) denote the total measurement vectors of A and B, respectively. Then

Z_B(k) = ℋ_B(k)θ + V_B(k) = MZ_A(k) = Mℋ_A(k)θ + MV_A(k)    (9-37)

which means that

ℋ_B(k) = Mℋ_A(k)    (9-38)

and

V_B(k) = MV_A(k)    (9-39)

and that

ℛ_B(k) = Mℛ_A(k)M' = Mℛ_A(k)M    (9-40)

Let θ̂_A,BLU(k) and θ̂_B,BLU(k) denote the BLUEs associated with observers A and B, respectively. Substituting (9-38)-(9-40) into (9-22), we find that every factor of M introduced by the change of scale is cancelled by a factor of M⁻¹, so that θ̂_B,BLU(k) = θ̂_A,BLU(k). □
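Theorem 9-5 can be illustrated numerically. The following minimal sketch (all matrices hypothetical) rescales a measurement set by a diagonal matrix of unit-conversion factors, transforms ℛ according to (9-40), and recomputes the BLUE:

```python
import numpy as np

# Numerical illustration of Theorem 9-5 (values hypothetical): rescaling
# the measurements by an invertible diagonal M leaves the BLUE unchanged.
def blue(H, R, Z):
    Rinv = np.linalg.inv(R)
    return np.linalg.solve(H.T @ Rinv @ H, H.T @ Rinv @ Z)

H_A = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
R_A = np.diag([0.5, 1.0, 2.0])
Z_A = np.array([0.9, -1.1, -3.2])

M = np.diag([5280.0, 454.0, 100.0])   # change-of-units scale factors
H_B, Z_B = M @ H_A, M @ Z_A
R_B = M @ R_A @ M.T                   # Equation (9-40)

print(np.allclose(blue(H_A, R_A, Z_A), blue(H_B, R_B, Z_B)))  # True
```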

RECURSIVE BLUES

Because of Corollary 9-1, we obtain recursive formulas for the BLUE of θ by setting 1/w(k + 1) = r(k + 1) in the recursive formulas for the WLSEs of θ, which are given in Lesson 4. In the case of a vector of measurements, we set (see Table 5-1) W⁻¹(k + 1) = R(k + 1).

Theorem 9-6 (Information Form of Recursive BLUE). A recursive structure for θ̂_BLU(k) is:

θ̂_BLU(k + 1) = θ̂_BLU(k) + K_B(k + 1)[z(k + 1) − h'(k + 1)θ̂_BLU(k)]    (9-42)

K_B(k + 1) = P(k + 1)h(k + 1)r⁻¹(k + 1)    (9-43)

P⁻¹(k + 1) = P⁻¹(k) + h(k + 1)r⁻¹(k + 1)h'(k + 1)    (9-44)

These equations are initialized by θ̂_BLU(n) and P⁻¹(n) [where P(k) is cov[θ̃_BLU(k)], given in (9-25)] and are used for k = n, n + 1, ..., N − 1. These equations can also be used for k = 0, 1, ..., N − 1 as long as θ̂_BLU(0) and P⁻¹(0) are chosen using Equations (4-21) and (4-20), respectively, in which w(0) is replaced by r⁻¹(0). □

Theorem 9-7 (Covariance Form of Recursive BLUE). Another recursive structure for θ̂_BLU(k) is (9-42), in which

K_B(k + 1) = P(k)h(k + 1)[h'(k + 1)P(k)h(k + 1) + r(k + 1)]⁻¹    (9-45)

P(k + 1) = [I − K_B(k + 1)h'(k + 1)]P(k)    (9-46)

These equations are initialized by θ̂_BLU(n) and P(n), and are used for k = n, n + 1, ..., N − 1. They can also be used for k = 0, 1, ..., N − 1 as long as θ̂_BLU(0) and P(0) are chosen using Equations (4-21) and (4-20), respectively, in which w(0) is replaced by r⁻¹(0). □

Recall that, in best linear unbiased estimation, P(k) = cov[θ̃_BLU(k)]. Observe, in Theorem 9-7, that we compute P(k) recursively, and not P⁻¹(k). This is why the results in Theorem 9-7 (and, subsequently, Theorem 4-2) are referred to as the covariance form of recursive BLUE.
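The covariance form is easy to sketch in code. The example below (data and model hypothetical, scalar measurements) initializes from the first n measurements in batch form and then verifies that the recursion of (9-42), (9-45), and (9-46) reproduces the batch BLUE of (9-22):

```python
import numpy as np

# A minimal sketch of the covariance form of the recursive BLUE,
# Equations (9-42), (9-45), (9-46); data and model are hypothetical.
rng = np.random.default_rng(0)
n, N = 2, 40
theta_true = np.array([1.0, -0.5])
H = np.column_stack([np.ones(N), np.arange(N, dtype=float)])
r = np.full(N, 0.25)                              # scalar noise variances
z = H @ theta_true + rng.normal(0, 0.5, N)

# Batch initialization from the first n measurements
Hn, zn = H[:n], z[:n]
P = np.linalg.inv(Hn.T @ np.diag(1 / r[:n]) @ Hn)
theta = P @ Hn.T @ np.diag(1 / r[:n]) @ zn

for k in range(n, N):
    h = H[k]
    K = P @ h / (h @ P @ h + r[k])                # (9-45)
    theta = theta + K * (z[k] - h @ theta)        # (9-42)
    P = P - np.outer(K, h) @ P                    # (9-46)

# The recursion reproduces the batch BLUE of (9-22)
Rinv = np.diag(1 / r)
theta_batch = np.linalg.solve(H.T @ Rinv @ H, H.T @ Rinv @ z)
print(np.allclose(theta, theta_batch))            # True
```

The agreement is algebraic, not statistical: the recursion and the batch formula compute the same quantity regardless of the particular noise realization.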

PROBLEMS

9-1. (Mendel, 1973, Exercise 3-2, pg. 175). Assume ℋ(k) is random, and that θ̂_BLU(k) = F(k)Z(k). (a) Show that unbiasedness of the estimate is attained when E{F(k)ℋ(k)} = I. (b) At what point in the derivation of θ̂_BLU(k) do the computations break down because ℋ(k) is random?
9-2. Here we examine the situation when V(k) is colored noise and how to use a model to compute θ̂_BLU(k). Now our linear model is

z(k + 1) = ℋ(k + 1)θ + v(k + 1)

where v(k) is colored noise modeled as

v(k + 1) = A_c v(k) + ξ(k)

We assume that deterministic matrix A_c is known and that ξ(k) is zero-mean white noise with covariance ℛ_ξ(k). Working with the measurement difference z*(k + 1) = z(k + 1) − A_c z(k), write down the formula for θ̂_BLU(k) in batch form. Be sure to define all concatenated quantities.
9-3. (Sorenson, 1980, Exercise 3-15, pg. 130). Suppose θ̂₁ and θ̂₂ are unbiased estimators of θ, with var(θ̂₁) = σ₁² and var(θ̂₂) = σ₂². Let θ̂₃ = αθ̂₁ + (1 − α)θ̂₂. (a) Prove that θ̂₃ is unbiased. (b) Assume that θ̂₁ and θ̂₂ are statistically independent, and find the mean-squared error of θ̂₃. (c) What choice of α minimizes the mean-squared error?
9-4. (Mendel, 1973, Exercise 3-12, pp. 176-177). A series of measurements z(k) are made, where z(k) = Hθ + v(k), H is an m × n constant matrix, E{v(k)} = 0, and cov[v(k)] = R is a constant matrix. (a) Using the two formulations of the recursive BLUE, show that (Ho, 1963, pp. 152-154): (i) P(k + 1)H' = P(k)H'[HP(k)H' + R]⁻¹R, and (ii) HP(k) = R[HP(k − 1)H' + R]⁻¹HP(k − 1). (b) Next, show that (i) P(k)H' = P(k − 2)H'[2HP(k − 2)H' + R]⁻¹R; (ii) P(k)H' = P(k − 3)H'[3HP(k − 3)H' + R]⁻¹R; and (iii) P(k)H' = P(0)H'[kHP(0)H' + R]⁻¹R. (c) Finally, show that the asymptotic form (k → ∞) for the BLUE of θ (Ho, 1963, pp. 152-154), with its 1/(k + 1) weighting function, represents a form of multidimensional stochastic approximation.

Lesson 10
Likelihood

INTRODUCTION

This lesson provides background material for the method of maximum likelihood. It explains the relationship of likelihood to probability, and when the terms likelihood and likelihood ratio can be used interchangeably. The major reference for this lesson is Edwards (1972), a most delightful book.

LIKELIHOOD DEFINED

To begin, we define what is meant by an hypothesis, H, and results (of an experiment), R. Suppose scalar parameter θ can assume only two values, 0 or 1; then we say that there are two hypotheses associated with θ, namely H₀ and H₁, where, for H₀, θ = 0, and for H₁, θ = 1. This is the situation of a binary hypothesis. Suppose next that scalar parameter θ can assume ten values, a, b, c, d, e, f, g, h, i, j; then we say there are ten hypotheses associated with θ, namely H₁, H₂, H₃, ..., H₁₀, where, for H₁, θ = a, for H₂, θ = b, ..., and for H₁₀, θ = j. Parameter θ may also assume values from an interval, i.e., a ≤ θ ≤ b. In this case, we have an infinite, uncountable number of hypotheses about θ, each one associated with a real number in the interval [a, b]. Finally, we may have a vector of parameters, each one of whose elements has a collection of hypotheses associated with it. For example, suppose that each one of the n elements of θ is either 0 or 1. Vector θ is then characterized by 2ⁿ hypotheses.

Results, R, are the outputs of an experiment. In our work on parameter estimation for the linear model Z(k) = ℋ(k)θ + V(k), the "results" are the data in Z(k) and ℋ(k). We let P(R|H) denote the probability of obtaining results R given hypothesis H according to some probability model, e.g., p[z(k)|θ]. In probability, P(R|H) is always viewed as a function of R for fixed values of H. Usually the explicit dependence of P on H is not shown. In order to understand the differences between probability and likelihood, it is important to show the explicit dependence of P on H.

For fixed H we can apply the axioms of probability (e.g., see Papoulis, 1965). If, for example, results R₁ and R₂ are mutually exclusive, then P(R₁ or R₂|H) = P(R₁|H) + P(R₂|H).

Example 10-1
Random number generators are often used to generate a sequence of random numbers z(1), z(2), ... that can then be used as the input sequence to a dynamical system, or as an additive measurement noise sequence. To run a random number generator, you must choose a probability model. The Gaussian model is often used; however, it is characterized by two parameters, mean μ and variance σ². In order to obtain a stream of Gaussian random numbers from the random number generator, you must fix μ and σ². Let μ_T and σ_T² denote (true) values chosen for μ and σ². The Gaussian probability density function for the generator is p[z(k)|μ_T, σ_T²], and the numbers we obtain at its output are, of course, quite dependent on the hypothesis H_T = (μ_T, σ_T²). □

Definition 10-1 (Edwards, 1972, pg. 9). Likelihood, L(H|R), of the hypothesis H given the results R and a specific probability model is proportional to P(R|H), the constant of proportionality being arbitrary; i.e.,

L(H|R) = cP(R|H)    (10-1)

For likelihood, R is fixed (i.e., given ahead of time) and H is variable. Likelihoods cannot be compared using different data sets (i.e., different results, say, R₁ and R₂) unless the data sets are statistically independent. There are no axioms of likelihood.

Example 10-2
Suppose we are given a sequence of Gaussian random numbers, using the random number generator that was described in Example 10-1, but we do not know μ_T and σ_T². Is it possible to infer (i.e., estimate) the values of μ and σ² that most likely generated the given sequence? The method of maximum likelihood, which we study in Lesson 11, will show us how to do this. The starting point for the estimation of μ and σ² will be p[z(k)|μ, σ²], where now z(k) is fixed and μ and σ² are treated as variables. □

Example 10-3 (Edwards, 1972, pg. 10)
To further illustrate the difference between P(R|H) and L(H|R), we consider the following binomial model, which we assume describes the occurrence of boys and girls in a family of two children:

P(R|p) = [(m + f)!/(m! f!)] pᵐ(1 − p)ᶠ    (10-2)

where p denotes the probability of a male child, m equals the number of male children, f equals the number of female children, and, in this example,

m + f = 2    (10-3)

Our objective is to determine p; but, to do this, we need some results. Knocking on neighbors' doors and conducting a simple survey, we establish two data sets:

R₁ = {1 boy and 1 girl} → m = 1 and f = 1    (10-4)

and

R₂ = {2 boys} → m = 2 and f = 0    (10-5)

In order to keep the determination of p simple for this meager collection of data, we shall only consider two values for p, i.e., two hypotheses,

H₁: p = 1/4 and H₂: p = 1/2    (10-6)

To begin, we create a table of probabilities, in which the entries are P(Rᵢ|Hⱼ fixed), computed using (10-2). For H₁ (i.e., p = 1/4), P(R₁|1/4) = 3/8 and P(R₂|1/4) = 1/16; for H₂ (i.e., p = 1/2), P(R₁|1/2) = 1/2 and P(R₂|1/2) = 1/4. These results are collected together in Table 10-1.

TABLE 10-1  P(Rᵢ|Hⱼ fixed)

          R₁       R₂
  H₁      3/8      1/16
  H₂      1/2      1/4

Next, we create a table of likelihoods, using (10-1). In this table (Table 10-2) the entries are L(Hᵢ|Rⱼ fixed). Constants c₁ and c₂ are arbitrary, and c₁, for example, appears in each one of the table entries in the R₁ column.

TABLE 10-2  L(Hᵢ|Rⱼ fixed)

          R₁        R₂
  H₁      3/8 c₁    1/16 c₂
  H₂      1/2 c₁    1/4 c₂

What can we conclude from the table of likelihoods? First, for data R₁, the likelihood of H₁ is 3/4 the likelihood of H₂. The number 3/4 was obtained by taking the ratio of likelihoods L(H₁|R₁) and L(H₂|R₁). Second, on data R₂, the likelihood of H₁ is 1/4 the likelihood of H₂ [note that 1/4 = (1/16 c₂)/(1/4 c₂)]. Finally, we conclude that, even from our two meager results, the value p = 1/2 appears to be more likely than the value p = 1/4, which, of course, agrees with our intuition. □
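The probability and likelihood tables of Example 10-3 can be reproduced in a few lines; a sketch:

```python
from math import comb

# Reproducing Example 10-3 numerically: binomial probabilities P(R|p)
# for m boys out of two children, and likelihood ratios on each data set.
def prob(m, p, n=2):
    return comb(n, m) * p**m * (1 - p)**(n - m)

hypotheses = {"H1": 0.25, "H2": 0.5}
results = {"R1": 1, "R2": 2}          # number of boys in each data set

table = {(h, r): prob(m, p) for h, p in hypotheses.items()
                            for r, m in results.items()}
print(table[("H1", "R1")], table[("H1", "R2")])   # 0.375 0.0625
print(table[("H2", "R1")], table[("H2", "R2")])   # 0.5 0.25

# Likelihood ratios L(H1, H2|R): the arbitrary constant c cancels.
print(table[("H1", "R1")] / table[("H2", "R1")])  # 0.75
print(table[("H1", "R2")] / table[("H2", "R2")])  # 0.25
```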

LIKELIHOOD RATIO

In the preceding example we were able to draw conclusions about the likelihood of one hypothesis versus a second hypothesis by comparing ratios of likelihoods, defined, of course, on the same set of data. Forming the ratio of likelihoods, we obtain the likelihood ratio.

Definition 10-2 (Edwards, 1972, pg. 10). The likelihood ratio of two hypotheses on the same data is the ratio of their likelihoods on that data. Let L(H₁, H₂|R) denote the likelihood ratio; then,

L(H₁, H₂|R) = L(H₁|R)/L(H₂|R)    (10-7)

Observe that likelihood ratio statements do not depend on the arbitrary constant c which appears in the definition of likelihood, because c cancels out of the ratio cP(R|H₁)/cP(R|H₂).

Theorem 10-1 (Edwards, 1972, pg. 11). Likelihood ratios of two hypotheses on statistically independent sets of data may be multiplied together to form the likelihood ratio of the combined data.

Proof. Let L(H₁, H₂|R₁ & R₂) denote the likelihood ratio of H₁ and H₂ on the combined data R₁ & R₂; i.e.,

L(H₁, H₂|R₁ & R₂) = P(R₁ & R₂|H₁)/P(R₁ & R₂|H₂)    (10-8)

Because R₁ & R₂ are statistically independent data, P(R₁ & R₂|H) = P(R₁|H)P(R₂|H); hence,

L(H₁, H₂|R₁ & R₂) = [P(R₁|H₁)P(R₂|H₁)]/[P(R₁|H₂)P(R₂|H₂)] = L(H₁, H₂|R₁)L(H₁, H₂|R₂)    (10-9)

□

Example 10-4
We can now state the conclusions that are given at the end of Example 10-3 more formally. From Table 10-2, we see that L(1/4, 1/2|R₁) = 3/4 and L(1/4, 1/2|R₂) = 1/4. Additionally, because R₁ and R₂ are data from independent experiments, L(1/4, 1/2|R₁ & R₂) = 3/4 × 1/4 = 3/16. This reinforces our intuition that p = 1/2 is much more likely than p = 1/4. □

RESULTS DESCRIBED BY CONTINUOUS DISTRIBUTIONS

Suppose the results R have a continuous distribution; then we know, from probability theory, that the probability of obtaining a result that lies in the interval (R, R + dR) is P(R|H)dR, as dR → 0. P(R|H) is then a probability density. In this case, L(H|R) = cP(R|H)dR; but cdR can be defined as a new arbitrary constant, c₁, so that L(H|R) = c₁P(R|H). In likelihood ratio statements c₁ = cdR disappears entirely; thus, likelihood and likelihood ratio are unaffected by the nature of the distribution of R. Recall that a transformation of variables greatly affects probability, because of the dR which appears in the probability formula. Likelihood and likelihood ratio, on the other hand, are unaffected by transformations of variables, because of the absorption of dR into c₁.

MULTIPLE HYPOTHESES

Thus far, all of our attention has been directed at the caseof two hypotheses. In order to apply likelihood and likelihood ratio concepts to parameter estimation problems, where parameterstake on more than two values, we must extend our preceding results to the caseof multiple hypotheses. As stated by Edwards (1972, pg. ll), “Instead of forming all the pairwise likelihood ratios it is simpler to present the same information in terms of the likelihood ratios for the several hypotheses versusone of their number, which may be chosen quite arbitrarily for this purpose.” Our extensions rely very heavily on the results for the two hypotheses case,becauseof the convenient introduction of an arbitrary comparison hypothesis, H * . Let Hi denote the ith hypothesis; then,

hence, L(H,,HC/R)

=

L(Hi ]R) L w *lw

(10-10)

Observe, also, that LwH*IR)

Example 10-4 We can now state the conclusions that are given at the end of Example 10-3 more formally. From Table 10-2, we see that L (l/4, l/2iRI) = 3/4 and L (l/4,1/21&) = l/4. Additionally, because RI and R2are data from independent experiments, L (l/4, 1121

L(H,,H*lR)

uH,IR) =I---=

L(HjIR)

L(H,

HjR)





(1041)

which means that we can compute the likelihood-ratio between any two hypotheses H, and Hi if we can compute the likelihood-ratio function L(Hk,H*IR).

Figure 10-1(a) depicts a likelihood function L(Hₖ|R). Any value of Hₖ can be chosen as the comparison hypothesis. We choose H* as the hypothesis associated with the maximum value of L(Hₖ|R), L*, so that the maximum value of L(Hₖ, H*|R) will be normalized to unity. The likelihood-ratio function L(Hₖ, H*|R), depicted in Figure 10-1(b), was obtained from Figure 10-1(a) and (10-10). In order to compute L(Hᵢ, Hⱼ|R), we determine, from Figure 10-1(b), that L(Hᵢ, H*|R) = a and L(Hⱼ, H*|R) = b, so that L(Hᵢ, Hⱼ|R) = a/b.

[Figure 10-1. Multiple hypotheses case: (a) likelihood L(Hₖ|R) versus Hₖ, and (b) likelihood ratio L(Hₖ, H*|R) versus Hₖ. Comparison hypothesis H* has been chosen to be the hypothesis associated with the maximum value of L(Hₖ|R), L*.]

Is it really necessary to know H* in order to carry out the normalization of L(Hₖ, H*|R) depicted in Figure 10-1(b)? No, because H* can be eliminated by a clever "conceptual" choice of constant c. We can choose c such that

L(H*|R) = L* = 1    (10-12)

According to (10-1), this means that

c = 1/P(R|H*)    (10-13)

If c is "chosen" in this manner, then

L(Hₖ, H*|R) = L(Hₖ|R)/L(H*|R) = L(Hₖ|R)    (10-14)

which means that, in the case of multiple hypotheses, likelihood and likelihood ratio can be used interchangeably. This helps to explain why authors use different names for the function that is the starting point in the method of maximum likelihood, including likelihood function and likelihood-ratio function.
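The normalization of (10-12)-(10-14) is easy to sketch numerically. The example below assumes a hypothetical Gaussian probability model for a single fixed result and a small grid of mean hypotheses:

```python
import numpy as np

# Sketch of the multiple-hypotheses normalization (10-12)-(10-14), using a
# hypothetical Gaussian model: likelihoods of several mean hypotheses for
# one observation, normalized by the largest so that max L(Hk, H*|R) = 1.
z = 0.7                                    # fixed result R
means = np.array([-1.0, 0.0, 0.5, 1.0])    # hypotheses Hk about the mean
lik = np.exp(-0.5 * (z - means) ** 2)      # c chosen to drop (2π)^(-1/2)

ratios = lik / lik.max()                   # L(Hk, H*|R), Equation (10-14)
print(ratios.max())                        # 1.0 at the comparison hypothesis

# Any pairwise ratio follows from (10-11):
i, j = 0, 3
print(np.isclose(ratios[i] / ratios[j], lik[i] / lik[j]))  # True
```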

PROBLEMS

10-1. A test is said to be a likelihood-ratio test if there is a number c such that the test leads to (dᵢ refers to the ith decision):

  d₁ if L(H₁, H₂|R) > c
  d₂ if L(H₁, H₂|R) < c
  d₁ or d₂ if L(H₁, H₂|R) = c

Consider a sequence of n tosses of a coin of which m are heads, i.e., P(R|p) = pᵐ(1 − p)ⁿ⁻ᵐ. Let H₁: p = p₁ and H₂: p = p₂, where p₁ > p₂ and p represents the probability of a head. Show that L(H₁, H₂|R) increases as m increases. Then show that L(H₁, H₂|R) increases when p̂ = m/n increases, and that the likelihood-ratio test which consists of accepting H₁ if L(H₁, H₂|R) is larger than some constant is equivalent to accepting H₁ if p̂ is larger than some related constant.
10-2. Consider n independent observations x₁, x₂, ..., xₙ which are normally distributed with mean μ and known standard deviation σ. Our two hypotheses are H₁: μ = μ₁ and H₂: μ = μ₂. Using the sample mean as an estimator of μ, show that the likelihood-ratio test (defined in Problem 10-1) consists of accepting H₁ if the sample mean exceeds some constant.
10-3. Suppose that the data consist of a single observation z with Cauchy density given by

p(z|θ) = 1/{π[1 + (z − θ)²]}

Test (see Problem 10-1) H₁: θ = θ₁ = 1 versus H₂: θ = θ₂ = −1 when c = 1/2; i.e., show that for c = 1/2, we accept H₁ when z > −0.35 or when z < −5.65.

Lesson 11
Maximum-Likelihood Estimation

LIKELIHOOD*

Let us consider a vector of unknown parameters θ that describes a collection of N independent identically distributed observations z(k), k = 1, 2, ..., N. We collect these measurements into an N × 1 vector Z(N) (Z, for short),

Z = col(z(1), z(2), ..., z(N))    (11-1)

The likelihood of θ, given the observations Z, is defined to be proportional to the value of the probability density function of the observations given the parameters:

l(θ|Z) ∝ p(Z|θ)    (11-2)

where l is the likelihood function and p is the conditional joint probability density function. Because the z(i) are independent and identically distributed,

p(Z|θ) = p(z(1)|θ)p(z(2)|θ) ··· p(z(N)|θ)    (11-3)

In many applications p(Z|θ) is exponential (e.g., Gaussian). It is easier then to work with the natural logarithm of l(θ|Z) than with l(θ|Z). Let

L(θ|Z) = ln l(θ|Z)    (11-4)

Quantity L is sometimes referred to as the log-likelihood function, the support function (Kmenta, 1971), the likelihood function [Mehra (1971), Schweppe (1965), and Stepner and Mehra (1973), for example], or the conditional likelihood function (Nahi, 1969). We shall use these terms interchangeably.

* The material in this section is taken from Mendel (1983b, pp. 94-95).

MAXIMUM-LIKELIHOOD METHOD AND ESTIMATES*

The maximum-likelihood method is based on the relatively simple idea that different populations generate different samples, and that any given sample is more likely to have come from some populations than from others. The maximum-likelihood estimate (MLE) θ̂_ML is the value of θ that maximizes l or L for a particular set of measurements Z. The logarithm of l is a monotonic transformation of l (i.e., whenever l is decreasing or increasing, ln l is also decreasing or increasing); therefore, the point corresponding to the maximum of l is also the point corresponding to the maximum of ln l = L. Obtaining an MLE involves specifying the likelihood function and finding those values of the parameters that give this function its maximum value. It is required that, if L is differentiable, the partial derivative of L (or l) with respect to each of the unknown parameters θ₁, θ₂, ..., θₙ equal zero:

∂L(θ|Z)/∂θᵢ = 0    for all i = 1, 2, ..., n    (11-5)

* The material in this section is taken from Mendel (1983b, pp. 95-98).

To be sure that the solution of (11-5) gives, in fact, a maximum value of L(θ|Z), certain second-order conditions must be fulfilled. Consider a Taylor series expansion of L(θ|Z) about θ̂_ML, i.e.,

L(θ|Z) = L(θ̂_ML|Z) + Σᵢ [∂L(θ̂_ML|Z)/∂θᵢ](θᵢ − θ̂ᵢ,ML)
         + (1/2) Σᵢ Σⱼ [∂²L(θ̂_ML|Z)/∂θᵢ∂θⱼ](θᵢ − θ̂ᵢ,ML)(θⱼ − θ̂ⱼ,ML) + ···    (11-6)

where ∂L(θ̂_ML|Z)/∂θᵢ, for example, is short for [∂L(θ|Z)/∂θᵢ]|θ = θ̂_ML. Because θ̂_ML is the MLE of θ, the second term on the right-hand side of (11-6) is zero [by virtue of (11-5)]; hence,

L(θ|Z) = L(θ̂_ML|Z) + (1/2) Σᵢ Σⱼ [∂²L(θ̂_ML|Z)/∂θᵢ∂θⱼ](θᵢ − θ̂ᵢ,ML)(θⱼ − θ̂ⱼ,ML) + ···    (11-7)

Recognizing that the second term in (11-7) is a quadratic form, we can write it in the more compact notation (1/2)(θ − θ̂_ML)'J₀(θ̂_ML|Z)(θ − θ̂_ML), where J₀(θ̂_ML|Z), the observed Fisher information matrix [see Equation (6-46)], is

J₀(θ̂_ML|Z) = [∂²L(θ|Z)/∂θᵢ∂θⱼ]|θ = θ̂_ML,    i, j = 1, 2, ..., n    (11-8)

Now let us examine sufficient conditions for the likelihood function to be maximum. We assume that, close to θ̂_ML, L(θ|Z) is approximately quadratic, in which case

L(θ|Z) = L(θ̂_ML|Z) + (1/2)(θ − θ̂_ML)'J₀(θ̂_ML|Z)(θ − θ̂_ML)    (11-10)

From vector calculus, it is well known that a sufficient condition for a function of n variables to be maximum is that the matrix of second partial derivatives of that function, evaluated at the extremum, must be negative definite. For L(θ|Z) in (11-10), this means that a sufficient condition for L(θ|Z) to be maximized is

J₀(θ̂_ML|Z) < 0    (11-11)

Example 11-1
This is a continuation of Example 10-2. We observe a random sample {z(1), z(2), ..., z(N)} at the output of a Gaussian random number generator and wish to find the maximum-likelihood estimators of μ and σ². The Gaussian density function p(z|μ, σ²) is

p(z|μ, σ²) = (2πσ²)^(−1/2) exp[−(z − μ)²/2σ²]    (11-12)

Its natural logarithm is

ln p(z|μ, σ²) = −(1/2) ln(2πσ²) − (z − μ)²/2σ²    (11-13)

The likelihood function is

l(μ, σ²|Z) = p(z(1)|μ, σ²)p(z(2)|μ, σ²) ··· p(z(N)|μ, σ²)    (11-14)

and its logarithm is

L(μ, σ²|Z) = Σᵢ₌₁ᴺ ln p(z(i)|μ, σ²)    (11-15)

Substituting for ln p(z(i)|μ, σ²) gives

L(μ, σ²|Z) = −(N/2) ln(2πσ²) − (1/2σ²) Σᵢ₌₁ᴺ [z(i) − μ]²    (11-16)

There are two unknown parameters in L, μ and σ². Differentiating L with respect to each of them gives

∂L/∂μ = (1/σ²) Σᵢ₌₁ᴺ [z(i) − μ]    (11-17a)

∂L/∂(σ²) = −N/2σ² + (1/2σ⁴) Σᵢ₌₁ᴺ [z(i) − μ]²    (11-17b)

Equating these partial derivatives to zero, we obtain

(1/σ̂²_ML) Σᵢ₌₁ᴺ [z(i) − μ̂_ML] = 0    (11-18a)

−N/2σ̂²_ML + (1/2σ̂⁴_ML) Σᵢ₌₁ᴺ [z(i) − μ̂_ML]² = 0    (11-18b)

For σ̂²_ML different from zero, (11-18a) reduces to

Σᵢ₌₁ᴺ [z(i) − μ̂_ML] = 0

giving

μ̂_ML = (1/N) Σᵢ₌₁ᴺ z(i) = z̄    (11-19)

Thus the MLE of the mean of a Gaussian population is equal to the sample mean z̄. Once again we see that the sample mean is an optimal estimator. Observe that (11-18a) and (11-18b) can be solved for σ̂²_ML, the MLE of σ². Multiplying Eq. (11-18b) by 2σ̂⁴_ML leads to

−Nσ̂²_ML + Σᵢ₌₁ᴺ [z(i) − μ̂_ML]² = 0

Substituting z̄ for μ̂_ML and solving for σ̂²_ML gives

σ̂²_ML = (1/N) Σᵢ₌₁ᴺ [z(i) − z̄]²    (11-20)

Thus, the MLE of the variance of a Gaussian population is simply equal to the sample variance. □
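The sample-mean and sample-variance formulas (11-19) and (11-20) can be checked numerically; a minimal sketch, with hypothetical simulated data:

```python
import numpy as np

# Numerical illustration of Example 11-1 (data hypothetical): the Gaussian
# log-likelihood (11-16) is maximized at the sample mean (11-19) and the
# sample variance (11-20).
rng = np.random.default_rng(1)
z = rng.normal(2.0, 1.5, size=500)

mu_ml = z.mean()                         # (11-19)
var_ml = ((z - mu_ml) ** 2).mean()       # (11-20); note the 1/N, not 1/(N-1)

def L(mu, var):                          # log-likelihood (11-16)
    N = len(z)
    return -0.5 * N * np.log(2 * np.pi * var) - ((z - mu) ** 2).sum() / (2 * var)

# Perturbing either parameter away from the MLE can only lower L:
print(L(mu_ml, var_ml) >= L(mu_ml + 0.1, var_ml))   # True
print(L(mu_ml, var_ml) >= L(mu_ml, var_ml * 1.2))   # True
```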

PROPERTIES OF MAXIMUM-LIKELIHOOD ESTIMATES

The importance of maximum-likelihood estimation is that it produces estimates that have very desirable properties.

Theorem 11-1. Maximum-likelihood estimates are: (1) consistent; (2) asymptotically Gaussian, with mean θ and covariance matrix (1/N)J⁻¹, in which J is the Fisher information matrix (Lesson 6); and (3) asymptotically efficient.

Proof. For proofs of consistency, asymptotic normality, and asymptotic efficiency, see Sorenson (1980, pp. 187-190, 190-192, and 192-193, respectively). These proofs, though somewhat heuristic, convey the ideas needed to prove the three parts of this theorem. More detailed analyses can be found in Cramer (1946) and Zacks (1971). See, also, Problems 11-13, 11-14, and 11-15. □

Theorem 11-2 (Invariance Property of MLEs). Let g(θ) be a vector function mapping θ into an interval in r-dimensional Euclidean space. Let θ̂_ML be a MLE of θ; then g(θ̂_ML) is a MLE of g(θ); i.e.,

ĝ_ML(θ) = g(θ̂_ML)    (11-21)

Proof (see Zacks, 1971). Note that Zacks points out that, in many books, this theorem is cited only for the case of one-to-one mappings, g(θ). His proof does not require g(θ) to be one-to-one. Note, also, that the proof of this theorem is related to the "consistency carry-over" property of a consistent estimator, which was discussed in Lesson 7. □

Example 11-2
We wish to obtain a MLE of the variance σ_v² in our linear model z(k) = h'(k)θ + v(k). One approach is to let θ₁ = σ_v², establish the log-likelihood function for θ₁, and maximize it to determine θ̂₁,ML. Usually, mathematical programming (i.e., search techniques) must be used to determine θ̂₁,ML. Here is where a difficulty can occur, because θ₁ (a variance) is known to be positive; thus, θ̂₁,ML must be constrained to be positive. Unfortunately, constrained mathematical programming techniques are more difficult than unconstrained ones. A second approach is to let θ₂ = σ_v, establish the log-likelihood function for θ₂ (it will be the same as the one for θ₁, except that θ₁ will be replaced by θ₂²), and maximize it to determine θ̂₂,ML. Because θ₂ is a standard deviation, which can be positive or negative, unconstrained mathematical programming can be used to determine θ̂₂,ML. Finally, we use the Invariance Property of MLEs to compute θ̂₁,ML as

θ̂₁,ML = (θ̂₂,ML)²    (11-22)

□

THE LINEAR MODEL (ℋ(k) deterministic)

We return now to the linear model

Z(k) = ℋ(k)θ + V(k)    (11-23)

in which θ is an n × 1 vector of deterministic parameters, ℋ(k) is deterministic, and V(k) is zero-mean white noise, with covariance matrix ℛ(k). This is precisely the same model that was used to derive the BLUE of θ, θ̂_BLU(k). Our objectives in this section are twofold: (1) to derive the MLE of θ, θ̂_ML(k); and (2) to relate θ̂_ML(k) to θ̂_BLU(k) and θ̂_WLS(k). In order to derive the MLE of θ, we need to determine a formula for p(Z(k)|θ). To proceed, we assume that V(k) is Gaussian, with multivariate density function p(V(k)), where

p(V(k)) = [1/((2π)^(N/2)|ℛ(k)|^(1/2))] exp[−(1/2)V'(k)ℛ⁻¹(k)V(k)]    (11-24)

Recall (e.g., Papoulis, 1965) that linear transformations on, and linear combinations of, Gaussian random vectors are themselves Gaussian random vectors. For this reason, it is clear that when V(k) is Gaussian, Z(k) is as well. The multivariate Gaussian density function of Z(k), derived from p(V(k)), is

p(Z(k)|θ) = [1/((2π)^(N/2)|ℛ(k)|^(1/2))] exp{−(1/2)[Z(k) − ℋ(k)θ]'ℛ⁻¹(k)[Z(k) − ℋ(k)θ]}    (11-25)

Theorem 11-3. When p(Z|θ) is multivariate Gaussian and ℋ(k) is deterministic, then the principle of maximum likelihood leads to the BLUE of θ; i.e.,

θ̂_ML(k) = θ̂_BLU(k)    (11-26)

Proof. We must maximize p(Z|θ) with respect to θ. This can be accomplished by minimizing the argument of the exponential in (11-25); hence, θ̂_ML(k) is the solution of

∂J[θ̂_ML]/∂θ̂_ML = 0    (11-27)

where J[θ̂_ML] = Ṽ'(k)ℛ⁻¹(k)Ṽ(k) and Ṽ(k) = Z(k) − ℋ(k)θ̂_ML(k). Comparing this version of (11-27) with Equation (3-4) and the subsequent derivation of the WLSE of θ, we conclude that

θ̂_ML(k) = θ̂_WLS(k)|_(W(k) = ℛ⁻¹(k))    (11-28)

but we also know, from Lesson 9, that

θ̂_BLU(k) = θ̂_WLS(k)|_(W(k) = ℛ⁻¹(k))    (11-29)

From (11-28) and (11-29), we conclude that

θ̂_ML(k) = θ̂_BLU(k)    (11-30)

□

We now suggest a reason why Theorem 9-6 (and, subsequently, Theorem 4-1) is referred to as the "information form" of recursive BLUE. From Theorems 11-1 and 11-3 we know that, when ℋ(k) is deterministic, θ̂_BLU is asymptotically Gaussian with mean θ and covariance matrix (1/N)J⁻¹. This means that P(k) = cov[θ̃_BLU(k)] is proportional to J⁻¹. Observe, in Theorem 9-6, that we compute P⁻¹ recursively, and not P. Because P⁻¹ is proportional to the Fisher information matrix J, the results in Theorem 9-6 (and 4-1) are therefore referred to as the "information form" of recursive BLUE. A second, more pragmatic reason is due to the fact that the inverse of any covariance matrix is known as an information matrix (e.g., Anderson and Moore, 1979, pg. 138). Consequently, any algorithm that is in terms of information matrices is known as an "information form" algorithm.

(11-31) &L(k) = hJ(~) = L(k) These estimators are: unbiased, most efficient (within the class of linear estimators),consistent,and Gaussian. Proof. To obtain (H-31), combine the resultsin Theorems 11-3 and 9-2. The estimators are:

1. unbiased, becausei,,(k)

is unbiased;

2. most efficient, because&,(k) is most efficient; 3. consistent, because&IC) is consistent; and 4. Gaussian, because they depend linearly upon Z(k) which is Gaus-

A Log-Likelihood

Dynamical

System

additive white Gaussian noise. This system is described by the following state-equation model: x(k + 1) = @x(k) + ‘Pu(k)

z(k + 1) = Hx(k + 1) + v(k + 1)

(11-32)

k =O,l,...

,N - 1 (11-33)

In this model, u(k) is known ahead of time, x(0) is deterministic, E{v(k)} = 0, v(k) is Gaussian, and E{v(k)v’( j)} = R6kj.

Observe that, when Z(k) is Gaussian, this corollary permits us to make statements about small-sample properties of MLEs. Usually, we cannot make such statements.

A LOG-LIKELIHOOD FUNCTION FOR AN IMPORTANT DYNAMICAL SYSTEM

In practice there are two major problems in obtaining MLEs of parameters in models of dynamical systems:

1. obtaining an expression for L(θ|Z), and
2. maximizing L(θ|Z) with respect to θ.

In this section we direct our attention to the first problem for a linear, time-invariant dynamical system that is excited by a known forcing function, has deterministic initial conditions, and has measurements that are corrupted by noise.

To begin, we must establish the parameters that constitute θ. In theory, θ could contain all of the elements in Φ, Ψ, H, and R. In practice, however, these matrices are never completely unknown. State equations are either derived from physical principles (e.g., Newton's laws) or associated with a canonical model (e.g., controllability canonical form); hence, we usually know that certain elements in Φ, Ψ, and H are identically zero or are known constants. Even though all of the elements in Φ, Ψ, H, and R will not be unknown in an application, there still may be more unknowns present than can possibly be identified. How many parameters can be identified, and which parameters can be identified by maximum-likelihood estimation (or, for that matter, by any type of estimation), is the subject of system identifiability (Stepner and Mehra, 1973). We shall assume that θ is identifiable. Identifiability is akin to "existence." When we assume θ is identifiable, we assume that it is possible to identify θ by ML methods. This means that all of our statements are predicated by the statement: If θ is identifiable, then . . . . We let

θ = col (elements of Φ, Ψ, H, and R)     (11-34)

Example 11-3

The "controllable canonical form" state-variable representation for the discrete-time autoregressive moving average (ARMA) model

H(z) = (βₙzⁿ⁻¹ + ⋯ + β₂z + β₁) / (zⁿ + a₁zⁿ⁻¹ + ⋯ + aₙ)     (11-35)

which implies the ARMA difference equation (z denotes the unit advance operator)

y(k + n) + a₁y(k + n − 1) + ⋯ + aₙy(k) = βₙu(k + n − 1) + ⋯ + β₁u(k)     (11-36)

is

x₁(k + 1) = x₂(k)
x₂(k + 1) = x₃(k)
     ⋮
xₙ(k + 1) = −aₙx₁(k) − aₙ₋₁x₂(k) − ⋯ − a₁xₙ(k) + u(k)     (11-37)

i.e., x(k + 1) = Φx(k) + Ψu(k), where Φ has ones on its superdiagonal, last row (−aₙ, −aₙ₋₁, . . . , −a₁), and zeros elsewhere, and Ψ = col (0, . . . , 0, 1).

Maximum-Likelihood Estimation     Lesson 11

and

y(k) = (β₁ β₂ ⋯ βₙ) x(k)     (11-38)

For this model there are no unknown parameters in matrix Ψ, and Φ and H each contain exactly n (unknown) parameters. Matrix Φ contains the n a-parameters, which are associated with the poles of H(z), whereas matrix H contains the n β-parameters, which are associated with the zeros of H(z). In general, an nth-order, single-input single-output system is completely characterized by 2n parameters. □
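The construction in Example 11-3 is easy to mechanize. The sketch below (our own helper names; the book gives no code) builds Φ, Ψ, and H from the a- and β-parameters of (11-36) and simulates the state-variable model; the impulse response it produces agrees with the ARMA difference equation.

```python
# Sketch of Example 11-3 (our own helper names, not the book's): build the
# controllable-canonical-form matrices for an nth-order ARMA model
# y(k+n) + a1*y(k+n-1) + ... + an*y(k) = bn*u(k+n-1) + ... + b1*u(k).

def controllable_canonical(a, b):
    """a = [a1, ..., an]; b = [beta_1, ..., beta_n]."""
    n = len(a)
    # Phi: ones on the superdiagonal, last row = [-an, ..., -a1]
    Phi = [[1.0 if j == i + 1 else 0.0 for j in range(n)] for i in range(n)]
    Phi[n - 1] = [-a[n - 1 - j] for j in range(n)]
    psi = [0.0] * (n - 1) + [1.0]            # input enters only the last state
    H = list(b)                              # y(k) = b1*x1(k) + ... + bn*xn(k)
    return Phi, psi, H

def step(Phi, psi, x, u):
    """One step of x(k+1) = Phi x(k) + psi u(k)."""
    n = len(x)
    return [sum(Phi[i][j] * x[j] for j in range(n)) + psi[i] * u for i in range(n)]

# Second-order example: y(k+2) + 0.5*y(k+1) + 0.1*y(k) = 0.3*u(k+1) + 1.0*u(k)
Phi, psi, H = controllable_canonical([0.5, 0.1], [1.0, 0.3])
x = [0.0, 0.0]
y = []
u_seq = [1.0] + [0.0] * 5                    # impulse input
for u in u_seq:
    y.append(sum(H[i] * x[i] for i in range(2)))
    x = step(Phi, psi, x, u)
```

Running the direct difference equation on the same impulse gives y(1) = 0.3, y(2) = 0.85, y(3) = −0.455, matching the state-variable simulation.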

Our objective now is to determine L(θ|Z) for the system in (11-32) and (11-33). To begin, we must determine p(Z|θ) = p(z(1), z(2), . . . , z(N)|θ). This is easy to do, because of the whiteness of noise v(k); i.e.,

p(z(1), z(2), . . . , z(N)|θ) = p(z(1)|θ) p(z(2)|θ) ⋯ p(z(N)|θ)     (11-39)

From the Gaussian nature of v(k) and the linear measurement model in (11-33), we know that

p(Z|θ) = ∏ᵢ₌₁ᴺ (2π)^{−m/2} |R|^{−1/2} exp{−½ [z(i) − Hx(i)]′R⁻¹[z(i) − Hx(i)]}     (11-40)

thus,

L(θ|Z) = −½ Σᵢ₌₁ᴺ [z(i) − Hx(i)]′R⁻¹[z(i) − Hx(i)] − (N/2) ln |R| − (Nm/2) ln 2π     (11-41)

The log-likelihood function L(θ|Z) is a function of θ. To indicate which quantities on the right-hand side of (11-41) may depend on θ, we subscript all such quantities with θ. Additionally, because the term in ln 2π does not depend on θ, we neglect it in subsequent discussions. Our final log-likelihood function is

L(θ|Z) = −½ Σᵢ₌₁ᴺ [z(i) − H_θ x_θ(i)]′ R_θ⁻¹ [z(i) − H_θ x_θ(i)] − (N/2) ln |R_θ|     (11-42)

Observe that θ occurs explicitly and implicitly in L(θ|Z). Matrices H_θ and R_θ contain the explicit dependence of L(θ|Z) on θ, whereas state vector x_θ(i) contains the implicit dependence of L(θ|Z) on θ. In order to numerically calculate the right-hand side of (11-42), we must solve state equation (11-32). This can be done when the unknown parameters which appear in Φ and Ψ are given specific values; for, then, (11-32) becomes

x_θ(k + 1) = Φ_θ x_θ(k) + Ψ_θ u(k),     x_θ(0) known     (11-43)

In essence, then, state equation (11-43) is a constraint that is associated with the computation of the log-likelihood function.

How do we determine θ̂_ML for L(θ|Z) given in (11-42) [subject to its constraint in (11-43)]? No simple closed-form solution is possible, because θ enters into L(θ|Z) in a complicated nonlinear manner. The only way presently known for obtaining θ̂_ML is by means of mathematical programming, which is beyond the scope of this course (e.g., see Mendel, 1983, pp. 141-142).

Comment. This completes our studies of methods for estimating unknown deterministic parameters. Prior to studying methods for estimating unknown random parameters, we pause to review an important body of material about multivariate Gaussian random variables.
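Numerically, (11-42) and (11-43) amount to: run the state equation forward for a candidate parameter vector, then accumulate the weighted residuals. A minimal scalar sketch (our own code and variable names, not the book's; the convention that z(k) is measured after the state update is our assumption):

```python
import math
import random

# Sketch of evaluating the log-likelihood (11-42) subject to the state-equation
# constraint (11-43), for a scalar system:
#   x(k+1) = phi*x(k) + psi*u(k),   z(k) = h*x(k) + v(k),   v(k) ~ N(0, r).

def log_likelihood(theta, z, u, x0):
    phi, psi, h, r = theta
    L, x = 0.0, x0
    for k in range(len(z)):
        x = phi * x + psi * u[k]          # run constraint (11-43) forward
        e = z[k] - h * x                  # residual z(i) - h * x_theta(i)
        L -= 0.5 * e * e / r              # quadratic term of (11-42)
    L -= 0.5 * len(z) * math.log(r)       # the -(N/2) ln|R| term
    return L

# simulate data from known "true" parameters, then compare likelihood values
random.seed(1)
true_theta = (0.8, 1.0, 1.0, 0.25)
u = [1.0] * 50
x, z = 0.0, []
for k in range(50):
    x = true_theta[0] * x + true_theta[1] * u[k]
    z.append(true_theta[2] * x + random.gauss(0.0, 0.5))

L_true = log_likelihood(true_theta, z, u, 0.0)
L_off = log_likelihood((0.2, 1.0, 1.0, 0.25), z, u, 0.0)
# L_true should exceed L_off: the mismatched phi badly mispredicts the state trajectory
```

Maximizing this function over θ is exactly the mathematical-programming problem described above; the sketch only shows why each likelihood evaluation requires one full solution of the state equation.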

PROBLEMS

11-1. If â is the MLE of a, what is the MLE of a¹⁰? Explain your answer.

11-2. (Sorenson, 1980, Theorem 5.1, pg. 185). If an estimator exists such that equality is satisfied in the Cramer-Rao inequality, prove that it can be determined as the solution of the likelihood equation.

11-3. Consider a sequence of independently distributed random variables x₁, x₂, . . . , x_N, having the probability density function θxᵢe^{−θxᵢ}, where θ > 0.
(a) Derive θ̂_ML(N).
(b) You want to study whether or not θ̂_ML(N) is an unbiased estimator of θ. Explain (without working out all the details) how you would do this.

11-4. Consider a random variable z which can take on the values z = 0, 1, 2, . . . . This variable is Poisson distributed, i.e., its probability function P(z) is P(z) = μᶻe^{−μ}/z!. Let z(1), z(2), . . . , z(N) denote N independent observations of z. Find μ̂_ML(N).

11-5. If p(z) = θe^{−θz} for z > 0, and p(z) = 0 otherwise, find θ̂_ML given a sample of N independent observations.

11-6. Find the maximum-likelihood estimator of θ from a sample of N independent observations that are uniformly distributed over the interval (0, θ).

11-7. Derive the maximum-likelihood estimate of the signal power, defined as θ, in the signal z(k) = s(k), k = 0, 1, . . . , N, where s(k) is a scalar, stationary, zero-mean Gaussian random sequence with autocorrelation E{s(i)s(j)} = θρ(j − i).

11-8. Suppose x is a binary variable that assumes a value of 1 with probability a and a value of 0 with probability (1 − a). The probability distribution of x can be described by P(x) = (1 − a)^{1−x} aˣ. Suppose we draw a random sample of N values {x₁, x₂, . . . , x_N}. Find the MLE of a.

11-9. (Mendel, 1973, Exercise 3-15, pg. 177). Consider the linear model Z(k) = H(k)θ + V(k), for which V(k) is Gaussian with zero mean and R(k) = σ²I.
(a) Show that the maximum-likelihood estimator of σ², denoted σ̂²_ML, is

σ̂²_ML = (Z − Hθ̂_ML)′(Z − Hθ̂_ML) / N

where θ̂_ML is the maximum-likelihood estimator of θ.
(b) Show that σ̂²_ML is biased, but that it is asymptotically unbiased.

11-10. We are given N independent samples z(1), z(2), . . . , z(N) of the identically distributed two-dimensional random vector z(i) = col [z₁(i), z₂(i)], with the Gaussian density function

p(z₁(i), z₂(i)|ρ) = [1 / (2π√(1 − ρ²))] exp{−[z₁²(i) − 2ρz₁(i)z₂(i) + z₂²(i)] / [2(1 − ρ²)]}

where ρ is the correlation coefficient between z₁ and z₂.
(a) Determine ρ̂_ML (Hint: You will obtain a cubic equation for ρ̂_ML and must show that ρ̂_ML = r, where r equals the sample correlation coefficient).
(b) Determine the Cramer-Rao bound for ρ̂_ML.

11-11. It is well known, from linear system theory, that the choice of state variables used to describe a dynamical system is not unique. Consider the nth-order difference equation

y(k + n) + a₁y(k + n − 1) + ⋯ + aₙy(k) = b₁m(k + n − 1) + b₂m(k + n − 2) + ⋯ + bₙm(k)

A state-variable model for this system is

x(k + 1) = Φx(k) + β m(k)

and

y(k) = (1 0 ⋯ 0) x(k)

where Φ is a companion matrix formed from the aᵢ-parameters and β = col (β₁, β₂, . . . , βₙ). Suppose maximum-likelihood estimates have been determined for the aᵢ- and βᵢ-parameters. How does one compute the maximum-likelihood estimates of the bᵢ-parameters?

11-12. Prove that, for a random sample of measurements, a maximum-likelihood estimator is a consistent estimator. [Hints: (1) Show that E{∂ ln p(Z|θ)/∂θ | θ} = 0; (2) Expand ∂ ln p(Z|θ̂_ML)/∂θ in a Taylor series about θ, and show that ∂ ln p(Z|θ)/∂θ = −(θ̂_ML − θ)′ ∂² ln p(Z|θ*)/∂θ², where θ* = λθ + (1 − λ)θ̂_ML and 0 ≤ λ ≤ 1; (3) Show that ∂ ln p(Z|θ)/∂θ = Σᵢ₌₁ᴺ ∂ ln p(z(i)|θ)/∂θ and ∂² ln p(Z|θ)/∂θ² = Σᵢ₌₁ᴺ ∂² ln p(z(i)|θ)/∂θ²; (4) Using the strong law of large numbers to assert that, with probability one, sample averages converge to ensemble averages, and assuming that E{∂² ln p(z(i)|θ)/∂θ²} is negative definite, show that θ̂_ML → θ with probability one; thus, θ̂_ML is a consistent estimator of θ. The steps in this proof have been taken from Sorenson, 1980, pp. 187-191.]

11-13. Prove that, for a random sample of measurements, a maximum-likelihood estimator is asymptotically Gaussian with mean value θ and covariance matrix (NJ)⁻¹, where J is the Fisher information matrix for a single measurement z(i). [Hints: (1) Expand ∂ ln p(Z|θ̂_ML)/∂θ in a Taylor series about θ and neglect second- and higher-order terms; (2) Show that . . . ; (3) Let s(θ) = ∂ ln p(Z|θ)/∂θ and show that s(θ) = Σᵢ₌₁ᴺ sᵢ(θ), where sᵢ(θ) = ∂ ln p(z(i)|θ)/∂θ; (4) Let s̄ denote the sample mean of the sᵢ(θ), and show that the distribution of s̄ asymptotically converges to a Gaussian distribution having mean zero and covariance J/N; (5) Using the strong law of large numbers, we know that (1/N) Σᵢ₌₁ᴺ ∂² ln p(z(i)|θ)/∂θ² → E{∂² ln p(z(i)|θ)/∂θ² | θ}; consequently, show that (θ̂_ML − θ)′J is asymptotically Gaussian with zero mean and covariance J/N; (6) Complete the proof of the theorem. The steps in this proof have been taken from Sorenson, 1980, pg. 192.]

11-14. Prove that, for a random sample of measurements, a maximum-likelihood estimator is asymptotically efficient. [Hints: (1) Let S(θ) = −E{∂² ln p(Z|θ)/∂θ² | θ} and show that S(θ) = NJ, where J is the Fisher information matrix for a single measurement z(i); (2) Use the result stated in Problem 11-13 to complete the proof. The steps in this proof have been taken from Sorenson, 1980, pg. 193.]

Lesson 12

Elements of Multivariate Gaussian Random Variables

INTRODUCTION

Gaussian random variables are important and widely used for at least two reasons. First, they often provide a model that is a reasonable approximation to observed random behavior. Second, if the random phenomenon that we observe at the macroscopic level is the superposition of an arbitrarily large number of independent random phenomena, which occur at the microscopic level, the macroscopic description is justifiably Gaussian. Most (if not all) of the material in this lesson should be a review for a reader who has had a course in probability theory. We collect a wide range of facts about multivariate Gaussian random variables here, in one place, because they are often needed in the remaining lessons.

UNIVARIATE GAUSSIAN DENSITY FUNCTION

A random variable y is said to be distributed as the univariate Gaussian distribution with mean m_y and variance σ_y² [i.e., y ~ N(y; m_y, σ_y²)] if the density function of y is given as

p(y) = (2πσ_y²)^{−1/2} exp{−(y − m_y)² / (2σ_y²)}     (12-1)

MULTIVARIATE GAUSSIAN DENSITY FUNCTION

Let y₁, y₂, . . . , y_m be random variables, and y = col (y₁, y₂, . . . , y_m). The function

p(y₁, . . . , y_m) = p(y) = (2π)^{−m/2} |P_y|^{−1/2} exp{−½ (y − m_y)′P_y⁻¹(y − m_y)}     (12-2)

is said to be a multivariate (m-variate) Gaussian density function [i.e., y ~ N(y; m_y, P_y)]. In (12-2),

m_y = E{y}     (12-3)

and

P_y = E{(y − m_y)(y − m_y)′}     (12-4)

Note that, although we refer to p(y) as a density function, it is actually a joint density function between the random variables y₁, y₂, . . . , y_m. If P_y is not positive definite, it is more convenient to define the multivariate Gaussian distribution by its characteristic function. We will not need to do this.

JOINTLY GAUSSIAN RANDOM VECTORS

Let x and y individually be n- and m-dimensional Gaussian random vectors, i.e., x ~ N(x; m_x, P_x) and y ~ N(y; m_y, P_y). Let P_xy and P_yx denote the cross-covariance matrices between x and y, i.e.,

P_xy = E{(x − m_x)(y − m_y)′}     (12-5)

and

P_yx = E{(y − m_y)(x − m_x)′}     (12-6)

We are interested in the joint density between x and y, p(x, y). Vectors x and y are jointly Gaussian if

p(x, y) = (2π)^{−(n+m)/2} |P_z|^{−1/2} exp{−½ (z − m_z)′P_z⁻¹(z − m_z)}     (12-7)

where

z = col (x, y)     (12-8)

m_z = col (m_x, m_y)     (12-9)

and

P_z = [ P_x   P_xy
        P_yx  P_y ]     (12-10)

(For notational simplicity throughout this lesson, we do not condition density functions on their parameters; this conditioning is understood. Density p(y) is the familiar bell-shaped curve, centered at y = m_y.)


Note that if x and y are jointly Gaussian, then they are marginally (i.e., individually) Gaussian. The converse is true if x and y are independent, but it is not necessarily true if they are not independent (Papoulis, 1965, pg. 184).

In order to evaluate p(x, y) in (12-7), we need |P_z| and P_z⁻¹. It is straightforward to compute |P_z| once the given values for P_x, P_y, P_xy, and P_yx are substituted into (12-10). It is often useful to be able to express the components of P_z⁻¹ directly in terms of the components of P_z. It is a straightforward exercise in algebra (just form P_z P_z⁻¹ = I and equate elements on both sides) to show that

P_z⁻¹ = [ A   B
          B′  C ]     (12-11)

where

A = (P_x − P_xy P_y⁻¹ P_yx)⁻¹ = P_x⁻¹ + P_x⁻¹ P_xy C P_yx P_x⁻¹     (12-12)

B = −A P_xy P_y⁻¹ = −P_x⁻¹ P_xy C     (12-13)

and

C = (P_y − P_yx P_x⁻¹ P_xy)⁻¹ = P_y⁻¹ + P_y⁻¹ P_yx A P_xy P_y⁻¹     (12-14)

THE CONDITIONAL DENSITY FUNCTION

One of the most important density functions we will be interested in is the conditional density function p(x|y). Recall, from probability theory (e.g., Papoulis, 1965), that

p(x|y) = p(x, y) / p(y)     (12-15)

Theorem 12-1. Let x and y be n- and m-dimensional random vectors that are jointly Gaussian. Then

p(x|y) = (2π)^{−n/2} |Σ|^{−1/2} exp{−½ (x − m)′Σ⁻¹(x − m)}     (12-16)

where

m = E{x|y} = m_x + P_xy P_y⁻¹ (y − m_y)     (12-17)

and

Σ = A⁻¹ = P_x − P_xy P_y⁻¹ P_yx     (12-18)

This means that p(x|y) is also multivariate Gaussian with (conditional) mean m and covariance Σ.

Proof. From (12-15), (12-7), and (12-2), we find that

p(x|y) = (2π)^{−n/2} (|P_y| / |P_z|)^{1/2} exp{ −½ (z − m_z)′ [ A   B
                                                               B′  C − P_y⁻¹ ] (z − m_z) }     (12-19)

Taking a closer look at the quadratic exponent, which we denote E(x, y), we find that

E(x, y) = (x − m_x)′A(x − m_x) + 2(x − m_x)′B(y − m_y) + (y − m_y)′(C − P_y⁻¹)(y − m_y)
        = (x − m_x)′A(x − m_x) − 2(x − m_x)′A P_xy P_y⁻¹ (y − m_y)
          + (y − m_y)′P_y⁻¹ P_yx A P_xy P_y⁻¹ (y − m_y)     (12-20)

In obtaining (12-20) we have used (12-13) and (12-14). We now recognize that (12-20) looks like a quadratic expression in (x − m_x) and P_xy P_y⁻¹ (y − m_y), and express it in factored form, as

E(x, y) = [(x − m_x) − P_xy P_y⁻¹ (y − m_y)]′ A [(x − m_x) − P_xy P_y⁻¹ (y − m_y)]     (12-21)

Defining m and Σ as in (12-17) and (12-18), we see that

E(x, y) = (x − m)′Σ⁻¹(x − m)     (12-22)

If we can show that |P_z|/|P_y| = |Σ|, then we will have shown that p(x|y) is given by (12-16), which will mean that p(x|y) is multivariate Gaussian with mean m and covariance Σ. It is not at all obvious that |Σ| = |P_z|/|P_y|. We shall reexpress matrix P_z so that |P_z| can be determined by appealing to the following theorems (Graybill, 1961, pg. 6):

i. If V and G are n × n matrices, then |VG| = |V||G|.
ii. If M is a square matrix partitioned as M = [ M₁₁  M₁₂; M₂₁  M₂₂ ], where M₁₁ and M₂₂ are square matrices, and if M₁₂ = 0 or M₂₁ = 0, then |M| = |M₁₁||M₂₂|.

We now show that two matrices, L and N, can be found so that

P_z = [ L  P_xy ] [ I  0 ]
      [ 0  P_y  ] [ N  I ]     (12-23)

Multiplying the two matrices on the right-hand side of (12-23), and equating the 1-1 and 2-1 components from both sides of the resulting equation, we find that

P_x = L + P_xy N     (12-24)

and

P_yx = P_y N     (12-25)

so that

N = P_y⁻¹ P_yx     (12-26)

from which it follows that

L = P_x − P_xy P_y⁻¹ P_yx     (12-27)

From (12-23), we see that

|P_z| = |L||P_y|     (12-28)

or

|L| = |P_z| / |P_y|     (12-29)

Comparing the equations for L and Σ, we find they are the same; thus, we have proven that |Σ| = |P_z|/|P_y|, which completes the proof of this theorem. □

PROPERTIES OF MULTIVARIATE GAUSSIAN RANDOM VARIABLES

From the preceding formulas for p(y), p(x, y), and p(x|y), we see that multivariate Gaussian probability density functions are completely characterized by their first two moments, i.e., their mean vector and covariance matrix. All other moments can be expressed in terms of these first two moments (see, e.g., Papoulis, 1965). From probability theory, we also recall the following two important facts about Gaussian random variables:

1. Statistically independent Gaussian random variables are uncorrelated, and vice-versa; thus, P_xy and P_yx are both zero matrices.
2. Linear (or affine) transformations on, and linear (or affine) combinations of, Gaussian random variables are themselves Gaussian random variables; thus, if x and y are jointly Gaussian, then z = Ax + By + c is also Gaussian. We refer to this property as the linearity property.

PROPERTIES OF CONDITIONAL MEAN

We learned, in Theorem 12-1, that

E{x|y} = m_x + P_xy P_y⁻¹ (y − m_y)     (12-30)

Because E{x|y} depends on y, which is random, it is also random.

Theorem 12-2. When x and y are jointly Gaussian, E{x|y} is multivariate Gaussian, and is an affine combination of the elements of y.

Proof. That E{x|y} is Gaussian follows from the linearity property applied to (12-30). An affine transformation of y has the structure Ty + f. E{x|y} has this structure; thus, it is an affine transformation. Note that if m_x = 0 and m_y = 0, then E{x|y} is a linear transformation of y. □

Theorem 12-3. Let x, y, and z be n × 1, m × 1, and r × 1 jointly Gaussian random vectors. If y and z are statistically independent, then

E{x|y, z} = E{x|y} + E{x|z} − m_x     (12-31)

Proof. Let ζ = col (y, z); then

E{x|ζ} = m_x + P_xζ P_ζ⁻¹ (ζ − m_ζ)     (12-32)

where

P_xζ = (P_xy | P_xz)     (12-33)

and

P_ζ = [ P_y  0
        0    P_z ]     (12-34)

(the off-diagonal elements in P_ζ are zero if y and z are statistically independent; this is also true if y and z are uncorrelated, because y and z are jointly Gaussian); thus,

E{x|ζ} = m_x + P_xy P_y⁻¹ (y − m_y) + P_xz P_z⁻¹ (z − m_z) = E{x|y} + E{x|z} − m_x

We leave it to the reader to verify the second equality. □

In our developments of recursive estimators, y and z (which will be associated with a partitioning of measurement vector Z) are not necessarily independent. The following important generalization of Theorem 12-3 will be needed.

Theorem 12-4. Let x, y, and z be n × 1, m × 1, and r × 1 jointly Gaussian random vectors. If y and z are not necessarily statistically independent, then

E{x|y, z} = E{x|y, z̃}     (12-35)

where

z̃ = z − E{z|y}     (12-36)

so that

E{x|y, z} = E{x|y} + E{x|z̃} − m_x     (12-37)

Proof (Mendel, 1983b, pg. 53). The proof proceeds in two stages: (a) assume (12-35) is true and demonstrate the truth of (12-37), and (b) demonstrate the truth of (12-35).

a. If we can show that y and z̃ are statistically independent, then (12-37) follows from Theorem 12-3. For Gaussian random vectors, however, uncorrelatedness implies independence. To begin, we assert that y and z̃ are jointly Gaussian, because z̃ = z − E{z|y} = z − m_z − P_zy P_y⁻¹ (y − m_y) depends affinely on y and z, which are jointly Gaussian. Next, we show that z̃ is zero mean. This follows from the calculation

m_z̃ = E{z − E{z|y}} = E{z} − E{E{z|y}}     (12-38)

where the outer expectation in the second term on the right-hand side of (12-38) is with respect to y. From probability theory (Papoulis, 1965, pg. 208), E{z} can be expressed as

E{z} = E{E{z|y}}     (12-39)

From (12-38) and (12-39), we see that m_z̃ = 0. Finally, we show that y and z̃ are uncorrelated. This follows from the calculation

E{(y − m_y)(z̃ − m_z̃)′} = E{(y − m_y)z̃′} = E{yz̃′} = E{yz′} − E{yE{z′|y}} = E{yz′} − E{yz′} = 0     (12-40)

b. A detailed proof is given by Meditch (1969, pp. 101-102). The idea is to (1) compute E{x|y, z} in expanded form, (2) compute E{x|y, z̃} in expanded form, using z̃ given in (12-36), and (3) compare the results from (1) and (2) to prove the truth of (12-35). □

Equation (12-35) is very important. It states that, when z and y are dependent, conditioning on z can always be replaced by conditioning on another Gaussian random vector z̃, where z̃ and y are statistically independent. The results in Theorems 12-3 and 12-4 depend on all random vectors being jointly Gaussian. Very similar results, which are distribution free, are described in Problem 13-4; however, these results are restricted in yet another way.

PROBLEMS

12-1. Fill in all of the details required to prove part b of Theorem 12-4.

12-2. Let x, y, and z be jointly distributed random vectors; c and h fixed constants; and g(·) a scalar-valued function. Assume E{x}, E{z}, and E{g(y)x} exist. Prove the following useful properties of conditional expectation:
(a) E{x|y} = E{x} if x and y are independent
(b) E{g(y)x|y} = g(y)E{x|y}
(c) E{c|y} = c
(d) E{g(y)|y} = g(y)
(e) E{cx + hz|y} = cE{x|y} + hE{z|y}
(f) E{x} = E{E{x|y}}, where the outer expectation is with respect to y
(g) E{g(y)x} = E{g(y)E{x|y}}, where the outer expectation is with respect to y

12-3. Prove that the cross-covariance matrix of two uncorrelated random vectors is zero.
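The conditional-mean formula (12-17) can be checked by simulation. The sketch below (our own construction and variable names) draws jointly Gaussian scalar pairs (x, y) with prescribed covariances, then averages x over samples whose y lands in a narrow bin around a chosen value; the bin average should reproduce m_x + (P_xy/P_y)(y − m_y).

```python
import random

# Monte-Carlo sketch of (12-17) for scalar x and y (our own illustration):
# E{x|y} = mx + (Pxy/Py)(y - my).
random.seed(5)
mx, my, Px, Py, Pxy = 1.0, -1.0, 1.0, 4.0, 1.2
n = 400000
y0, halfwidth = 0.0, 0.05            # condition on y falling near y0
picked = []
for _ in range(n):
    u = random.gauss(0.0, 1.0)
    v = random.gauss(0.0, 1.0)
    # joint draw with var(y) = Py, var(x) = Px, cov(x, y) = Pxy
    y = my + (Py ** 0.5) * u
    x = mx + (Pxy / Py ** 0.5) * u + ((Px - Pxy * Pxy / Py) ** 0.5) * v
    if abs(y - y0) < halfwidth:
        picked.append(x)
empirical = sum(picked) / len(picked)
formula = mx + (Pxy / Py) * (y0 - my)    # = 1.3 for these numbers
```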

Lesson 13

Estimation of Random Parameters: General Results

INTRODUCTION

In Lesson 2 we showed that state estimation and deconvolution can be viewed as problems in which we are interested in estimating a vector of random parameters. For us, state estimation and deconvolution serve as the primary motivation for studying methods for estimating random parameters; however, the statistics literature is filled with other applications for these methods.

We now view θ as an n × 1 vector of random unknown parameters. The information available to us is the set of measurements z(1), z(2), . . . , z(k), which are assumed to depend upon θ. In this lesson, we do not begin by assuming a specific structural dependency between z(i) and θ. This is quite different from what we did in WLSE and BLUE. Those methods were studied for the linear model z(i) = H(i)θ + v(i), and closed-form solutions for θ̂_WLS(k) and θ̂_BLU(k) could not have been obtained had we not begun by assuming the linear model. We shall study the estimation of random θ for the linear model in Lesson 14.

In this lesson we examine two methods for estimating a vector of random parameters. The first method is based upon minimizing the mean-squared error between θ and θ̂(k). The resulting estimator is called a mean-squared estimator, and is denoted θ̂_MS(k). The second method is based upon maximizing an unconditional likelihood function, one that requires knowledge not only of p(Z|θ) but also of p(θ). The resulting estimator is called a maximum a posteriori estimator, and is denoted θ̂_MAP(k).

MEAN-SQUARED ESTIMATION

Many different measures of parameter estimation error can be minimized in order to obtain an estimate of θ [see Jazwinski (1970), Meditch (1969), Van Trees (1968), and Sorenson (1980), for example]; but, by far, the most widely studied measure is the mean-squared error.

Objective Function and Problem Statement

Given measurements z(1), . . . , z(k), we shall determine an estimator of θ, namely

θ̂_MS(k) = φ[z(i), i = 1, 2, . . . , k]     (13-1)

such that the mean-squared error

J[θ̂_MS(k)] = E{θ̃′_MS(k) θ̃_MS(k)}     (13-2)

is minimized. In (13-2), θ̃_MS(k) = θ − θ̂_MS(k). The right-hand side of (13-1) means that we have some arbitrary and as yet unknown function of all the measurements. The n components of θ̂_MS(k) may each depend differently on the measurements. The function φ[z(i), i = 1, 2, . . . , k] may be nonlinear or linear. Its exact structure will be determined by minimizing J[θ̂_MS(k)]. If perchance θ̂_MS(k) is a linear estimator, then

θ̂_MS(k) = Σᵢ₌₁ᵏ A(i) z(i)     (13-3)

We now show that the notion of conditional expectation is central to the calculation of θ̂_MS(k). As usual, we let

Z(k) = col [z(k), z(k − 1), . . . , z(1)]     (13-4)

The underlying random quantities in our estimation problem are θ and Z(k). We assume that their joint density function p[θ, Z(k)] exists, so that

J[θ̂_MS(k)] = ∫ ⋯ ∫ θ̃′_MS(k) θ̃_MS(k) p[θ, Z(k)] dθ dZ(k)     (13-5)

where dθ = dθ₁ dθ₂ ⋯ dθₙ, dZ(k) = dz₁(1) ⋯ dz₁(k) dz₂(1) ⋯ dz₂(k) ⋯ dz_m(1) ⋯ dz_m(k), and there are n + km integrals. Using the fact that

p[θ, Z(k)] = p[θ|Z(k)] p[Z(k)]     (13-6)

we rewrite (13-5) as

J[θ̂_MS(k)] = ∫ { ∫ ⋯ ∫ θ̃′_MS(k) θ̃_MS(k) p[θ|Z(k)] dθ } p[Z(k)] dZ(k)     (13-7)

From this equation we see that minimizing the conditional expectation E{θ̃′_MS(k) θ̃_MS(k) | Z(k)} with respect to θ̂_MS(k) is equivalent to our original objective of minimizing the total expectation E{θ̃′_MS(k) θ̃_MS(k)}. Note that the outer integrals on the right-hand side of (13-7) remove the dependency of the integrand on the data Z(k). In summary, we have the following mean-squared estimation problem: Given the measurements z(1), z(2), . . . , z(k), determine an estimator of θ, namely, θ̂_MS(k) = φ[z(i), i = 1, 2, . . . , k], such that the conditional mean-squared error

J₁[θ̂_MS(k)] = E{θ̃′_MS(k) θ̃_MS(k) | z(1), . . . , z(k)}     (13-8)

is minimized.

Derivation of Estimator

The solution to the mean-squared estimation problem is given in Theorem 13-1, which is known as the Fundamental Theorem of Estimation Theory.

Theorem 13-1. The estimator that minimizes the mean-squared error is

θ̂_MS(k) = E{θ|Z(k)}     (13-9)

Proof (Mendel, 1983b). In this proof we omit all functional dependences on k, for notational simplicity. Our approach is to substitute θ̃_MS(k) = θ − θ̂_MS(k) into (13-8) and to complete the square, as follows:

J₁[θ̂_MS(k)] = E{(θ − θ̂_MS)′(θ − θ̂_MS) | Z}
            = E{θ′θ − θ′θ̂_MS − θ̂′_MS θ + θ̂′_MS θ̂_MS | Z}
            = E{θ′θ|Z} − E{θ′|Z} θ̂_MS − θ̂′_MS E{θ|Z} + θ̂′_MS θ̂_MS
            = E{θ′θ|Z} + [θ̂_MS − E{θ|Z}]′[θ̂_MS − E{θ|Z}] − E{θ′|Z} E{θ|Z}     (13-10)

To obtain the third line we used the fact that θ̂_MS, by definition, is a function of Z; hence, E{θ̂_MS|Z} = θ̂_MS. The first and last terms in (13-10) do not depend on θ̂_MS; hence, the smallest value of J₁[θ̂_MS(k)] is obviously attained by setting the bracketed terms equal to zero. This means that θ̂_MS must be chosen as in (13-9). □

Let J₁*[θ̂_MS(k)] denote the minimum value of J₁[θ̂_MS(k)]. We see, from (13-10) and (13-9), that

J₁*[θ̂_MS(k)] = E{θ′θ|Z} − θ̂′_MS(k) θ̂_MS(k)     (13-11)

As it stands, (13-9) is not terribly useful for computing θ̂_MS(k). In general, we must first compute p[θ|Z(k)] and then perform the requisite number of integrations of θ p[θ|Z(k)] to obtain θ̂_MS(k). In the special but important case when θ and Z(k) are jointly Gaussian, we have a very important and practical corollary to Theorem 13-1.

Corollary 13-1. When θ and Z(k) are jointly Gaussian, the estimator that minimizes the mean-squared error is

θ̂_MS(k) = m_θ + P_θZ(k) P_Z⁻¹(k) [Z(k) − m_Z(k)]     (13-12)

Proof. When θ and Z(k) are jointly Gaussian, E{θ|Z(k)} can be evaluated using (12-17) of Lesson 12. Doing this, we obtain (13-12). □

Corollary 13-1 gives us an explicit structure for θ̂_MS(k). We see that θ̂_MS(k) is an affine transformation of Z(k). If m_θ = 0 and m_Z(k) = 0, then θ̂_MS(k) is a linear transformation of Z(k). In order to compute θ̂_MS(k) using (13-12), we must know m_θ and m_Z(k), and we must first compute P_θZ(k) and P_Z(k). We perform these computations in Lesson 14 for the linear model, Z(k) = H(k)θ + V(k).

Corollary 13-2. Suppose θ and Z(k) are not necessarily jointly Gaussian, and that we know m_θ, m_Z(k), P_Z(k), and P_θZ(k). In this case, the estimator that is constrained to be an affine transformation of Z(k), and that minimizes the mean-squared error, is also given by (13-12).

Proof. This corollary can be proved in a number of ways. A direct proof begins by assuming that θ̂_MS(k) = A(k)Z(k) + b(k) and choosing A(k) and b(k) so that θ̂_MS(k) is an unbiased estimator of θ and E{θ̃′_MS(k)θ̃_MS(k)} = trace E{θ̃_MS(k)θ̃′_MS(k)} is minimized. We leave the details of this direct proof to the reader. A less direct proof is based upon the following Gedanken experiment. Using the known first and second moments of θ and Z(k), we can conceptualize unique Gaussian random vectors that have these same first and second moments. For these statistically equivalent (through second-order moments) Gaussian vectors, we know, from Corollary 13-1, that the mean-squared estimator is given by the affine transformation of Z(k) in (13-12). □

Corollaries 13-1 and 13-2, as well as Theorem 13-1, provide us with the answer to the following important question: When is the linear (affine) mean-squared estimator the same as the mean-squared estimator? The answer is: when θ and Z(k) are jointly Gaussian. If θ and Z(k) are not jointly Gaussian, then θ̂_MS(k) = E{θ|Z(k)}, which, in general, is a nonlinear function of the measurements Z(k), i.e., it is a nonlinear estimator.

Corollary 13-3 (Orthogonality Principle). Suppose f[Z(k)] is any function of the data Z(k). Then the error in the mean-squared estimator is orthogonal to f[Z(k)] in the sense that

E{[θ − θ̂_MS(k)] f′[Z(k)]} = 0     (13-13)

Proof (Mendel, 1983b, pp. 46-47). We use the following result from probability theory (Papoulis, 1965; see also Problem 12-2(g)). Let α and β be jointly distributed random vectors and g(β) be a scalar-valued function; then

E{αg(β)} = E{E{α|β} g(β)}     (13-14)

where the outer expectation on the right-hand side is with respect to β. We proceed as follows (again omitting the argument k):

E{[θ − θ̂_MS] f′(Z)} = E{θ f′(Z)} − E{θ̂_MS f′(Z)} = E{E{θ|Z} f′(Z)} − E{θ̂_MS f′(Z)} = 0

where we have used the facts that θ̂_MS is no longer random when Z is specified and that E{θ|Z} = θ̂_MS. □

A frequently encountered special case of (13-13) occurs when f[Z(k)] = θ̂_MS(k); then Corollary 13-3 can be written as

E{θ̃_MS(k) θ̂′_MS(k)} = 0     (13-15)

Properties of Mean-Squared Estimates When θ and Z(k) are Gaussian

In this section we present a collection of important and useful properties associated with θ̂_MS(k) for the case when θ and Z(k) are jointly Gaussian. In this case θ̂_MS(k) is given by (13-12).

Property 1 (Unbiasedness). The mean-squared estimator, θ̂_MS(k) in (13-12), is unbiased.

Proof. Taking the expected value of (13-12), we see that E{θ̂_MS(k)} = m_θ; thus, θ̂_MS(k) is an unbiased estimator of θ. □

Property 2 (Minimum Variance). Dispersion about the mean value of θ̂_i,MS(k) is measured by the error variance σᵢ²(k), where i = 1, 2, . . . , n. An estimator that has the smallest error variance is a minimum-variance estimator (an MVE). The mean-squared estimator in (13-12) is an MVE.

Proof. From Property 1 and the definition of error variance, we see that

σᵢ²(k) = E{θ̃²_i,MS(k)},     i = 1, 2, . . . , n     (13-16)

Our mean-squared estimator was obtained by minimizing J[θ̂_MS(k)] in (13-2), which can now be expressed as

J[θ̂_MS(k)] = Σᵢ₌₁ⁿ σᵢ²(k)     (13-17)

Because variances are always positive, the minimum value of J[θ̂_MS(k)] must be achieved when each of the n variances is minimized; hence, our mean-squared estimator is equivalent to an MVE. □

Property 3 (Linearity). θ̂_MS(k) in (13-12) is a "linear" (i.e., affine) estimator.

Proof. This is obvious from the form of (13-12). □

Linearity of θ̂_MS(k) permits us to infer the following very important property about both θ̂_MS(k) and θ̃_MS(k).

Property 4 (Gaussian). Both θ̂_MS(k) and θ̃_MS(k) are multivariate Gaussian.

Proof. We use the linearity property of jointly Gaussian random vectors stated in Lesson 12. Estimator θ̂_MS(k) in (13-12) is an affine transformation of Gaussian random vector Z(k); hence, θ̂_MS(k) is multivariate Gaussian. Estimation error θ̃_MS(k) = θ − θ̂_MS(k) is an affine transformation of jointly Gaussian vectors θ and Z(k); hence, θ̃_MS(k) is also Gaussian. □

Estimate θ̂_MS(k) in (13-12) is itself random, because measurements Z(k) are random. To characterize it completely in a statistical sense, we must specify its probability density function. Generally, this is very difficult to do, and often requires that the probability density function of θ̂_MS(k) be approximated using many moments (in theory, an infinite number are required). In the Gaussian case, we have just learned that the structure of the probability density function for θ̂_MS(k) [and θ̃_MS(k)] is known. Additionally, we know that a Gaussian density function is completely specified by exactly two moments, its mean and covariance; thus, tremendous simplifications occur when θ and Z(k) are jointly Gaussian.

Property 5 (Uniqueness). Mean-squared estimator θ̂_MS(k), in (13-12), is unique.

The proof of this property is not central to our developments; hence, it is omitted.

Generalizations

Many of the results presented in this section are applicable to objective functions other than the mean-squared objective function in (13-2). See Meditch (1969) for discussions on a wide number of objective functions that lead to E{θ|Z(k)} as the optimal estimator of θ.
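The orthogonality principle (13-13) is easy to observe numerically. A minimal sketch (our own scalar toy model, z = θ + v with all means zero): the mean-squared error θ − θ̂_MS has sample correlation near zero with both a linear and a nonlinear function of the data.

```python
import random

# Numerical sketch of the orthogonality principle (13-13) for a scalar toy model
# z = theta + v (our own choice): the error is orthogonal to functions of the data.
random.seed(13)
gain = 0.5                        # P_theta/(P_theta + R) with P_theta = R = 1
n = 400000
s1 = s2 = 0.0
for _ in range(n):
    theta = random.gauss(0.0, 1.0)
    z = theta + random.gauss(0.0, 1.0)
    err = theta - gain * z        # theta - theta_MS (all means are zero here)
    s1 += err * z                 # f(z) = z
    s2 += err * z ** 3            # f(z) = z^3, a nonlinear function of the data
s1 /= n
s2 /= n
# both sample averages should be near zero
```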

Estimation

MAXIMUM

of Random Parameters:

General Results

Maximum

a Posteriori

Estimation

Lesson 13

j.&, which is unknown to us in this example, is transferred from the first random number generator to the second one before we can obtain the given random sample

A POSTERIOR1 ESTIMATION

Recall Bat.es’s rule (Papoulis, 1965,pg. 39): .

Using the facts that (13-20)

exp { -$ [z(i) - @j’/d}

p (z (i)lp) = (2m+‘”

in which density functionp@j%(k)) 1sk nown as the a posteriori (or posterior) conditional density function, andp (0) is the prior probability density function for 8. Observe that p(f$!E(k)) is related to likelihood function Z(ele(k)}, because Z{C$!Z@)} ap (%(k)le). Additionally, becausep (%(k)) does not depend on 8, l

p(j~) = (2m~)+*

(13-21)

exp {+w}

(13-22)

(13-19) In maximum a posteriori (MAP) estimation, values of 8 are found that maximize p (e/%(k))in (13-19); such estimates are known as MAP estimates, and will be denoted as &&j. If 01, 62,. . . ) & are uniformly distributed, then p (f$E(k)) ap (Zf(k)lO), and the MAP estimator of 8 equals the ML estimator of 8. Generally, MAP estimatesare quite different from ML estimates. For example, the invariance property of MLE’s usually doesnot carry over to MAP estimates.One reason for this can be seenfrom (13-19).Suppose,for example, that + = g(8) and we MAPby first computing 6MAP.Becausep (0) dependson the want to determine C$ Jacobian matrix of g-l(+), 4 MAP# @MAP).Kashyap and Rao (1976, pg. 137) note, “the two estimatesare usually asymptotically identical to one another since in the large samplecasethe knowledge of the observationsswampsthat of theprior distribution.” For additional discussionson the asymptotic properties of MAP estimators, seeZacks (1971). Quantity p (elZ(k)) in (13-19) is sometimescalled an unconditional ZikeZihoodfunction, becausethe random nature of 8 has been accounted for by p (0). Density p (%(k)le>is then called a conditional likelihood function (Nahi, 1969). Obtaining a MAP estimate involves specifying both p (%(#3) and p (0) and finding those valuesof 8 that maximize p (Ol%(k)),or In p @l%(k)).Generally speaking, mathematical programming must be used to compute &&&k). When s(k) is related to 8 by our linear model, %(k) = X(li)B + V(k), then it may be possible to obtain &A#) in closed form. We examinethis situation in some detail in Lesson 14. Example 13-l This is a continuation of Example 11-l. We observe a random sample {I (l), t (2), . . . , z(N)} at the output of a Gaussianrandom number generator, i.e., z(i) - N(z (i>; p, CT). Now, however, p is a random variable with prior distribution N(p; 0, 0:). Both a: and oc are assumedknown, and we wish to determine the MAP estimator of p. 
We can view this random number generator as a cascade of two random number generators. The first is characterized by N(μ; 0, σ_μ²) and provides at its output a single realization of μ, say μ_R. The second is characterized by N(z(i); μ_R, σ_z²). Observe that

p(μ|Z(N)) ∝ p(Z(N)|μ) p(μ) ∝ exp{−(1/2)[Σ_{i=1}^{N} [z(i) − μ]²/σ_z² + μ²/σ_μ²]}   (13-23)

Taking the logarithm of (13-23) and neglecting the terms which do not depend upon μ, we obtain

L_MAP(μ|Z(N)) = −(1/2){Σ_{i=1}^{N} [z(i) − μ]²/σ_z² + μ²/σ_μ²}   (13-24)

Setting ∂L_MAP/∂μ = 0, and solving for μ̂_MAP(N), we find that

μ̂_MAP(N) = [Σ_{i=1}^{N} z(i)] / (N + σ_z²/σ_μ²)   (13-25)

Next, we compare μ̂_MAP(N) and μ̂_ML(N), where [see (11-19) from Lesson 11]

μ̂_ML(N) = (1/N) Σ_{i=1}^{N} z(i)   (13-26)

In general, μ̂_MAP(N) ≠ μ̂_ML(N). If, however, no a priori information about μ is available, then we let σ_μ² → ∞, in which case μ̂_MAP(N) = μ̂_ML(N). Observe, also, that as N → ∞, μ̂_MAP(N) → μ̂_ML(N), which implies (Sorenson, 1980) that the influence of the prior knowledge about μ [i.e., μ ~ N(μ; 0, σ_μ²)] diminishes as the number of measurements increases. □
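As a quick numerical check of (13-25) and (13-26), the following sketch (not from the text; the variable names and the numbers σ_z² = 4, σ_μ² = 9 are my own illustrative choices) draws data from the two-stage generator of Example 13-1 and confirms that the MAP and ML estimates draw together as N grows:

```python
import random

random.seed(1)
sigma_z2, sigma_mu2 = 4.0, 9.0                 # known variances (illustrative values)
mu_real = random.gauss(0.0, sigma_mu2 ** 0.5)  # first generator: one realization of mu
z = [random.gauss(mu_real, sigma_z2 ** 0.5) for _ in range(1000)]  # second generator

def mu_ml(zs):
    # (13-26): the sample mean
    return sum(zs) / len(zs)

def mu_map(zs):
    # (13-25): sum of z(i) divided by N + sigma_z^2 / sigma_mu^2
    return sum(zs) / (len(zs) + sigma_z2 / sigma_mu2)

# the gap between the two estimates shrinks as N increases
gap_small_N = abs(mu_map(z[:10]) - mu_ml(z[:10]))
gap_large_N = abs(mu_map(z) - mu_ml(z))
```

The gap decays like 1/N, which is the "observations swamp the prior" behavior noted by Kashyap and Rao.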

Theorem 13-2. If Z(k) and θ are jointly Gaussian, then θ̂_MAP(k) = θ̂_MS(k).

Proof. If Z(k) and θ are jointly Gaussian, then (see Theorem 12-1)

p(θ|Z(k)) = [(2π)^n |P(k)|]^{−1/2} exp{−(1/2)[θ − m(k)]′P⁻¹(k)[θ − m(k)]}   (13-27)

where

m(k) = E{θ|Z(k)}   (13-28)

θ̂_MAP(k) is found by maximizing p(θ|Z(k)), or, equivalently, by minimizing the argument of the exponential in (13-27). The minimum value of [θ − m(k)]′P⁻¹(k)[θ − m(k)] is zero, and this occurs when θ̂_MAP(k) = m(k) = E{θ|Z(k)} = θ̂_MS(k). □

The result in Theorem 13-2 is true regardless of the nature of the model relating θ to Z(k). Of course, in order to use it, we must first establish that Z(k) and θ are jointly Gaussian. Except for the linear model, which we examine in Lesson 14, this is very difficult to do.

PROBLEMS

13-1. Prove that θ̂_MS(k), given in (13-12), is unique.
13-2. Prove Corollary 13-2 by means of a direct proof.
13-3. Let θ and Z(N) be zero-mean n × 1 and N × 1 random vectors, respectively, with known second-order statistics P_θ, P_Z, P_θZ, and P_Zθ. View Z(N) as a vector of measurements. It is desired to determine a linear estimator of θ, θ̂(N) = K_L(N)Z(N), where K_L(N) is an n × N matrix that is chosen to minimize the mean-squared error E{[θ − θ̂(N)]′[θ − θ̂(N)]}.
(a) Show that the gain matrix which minimizes the mean-squared error is K_L(N) = E{θZ′(N)}[E{Z(N)Z′(N)}]⁻¹ = P_θZ P_Z⁻¹.
(b) Show that the covariance matrix, P(N), of the estimation error, θ̃(N) = θ − θ̂(N), is P(N) = P_θ − P_θZ P_Z⁻¹ P_Zθ.
(c) Relate the results obtained in this problem to those in Corollary 13-2.
13-4. For random vectors θ and Z(k), the linear projection, θ*(k), of θ on a Hilbert space spanned by Z(k) is defined as θ*(k) = a + BZ(k), where E{θ*(k)} = E{θ} and E{[θ − θ*(k)]Z′(k)} = 0. We denote the linear projection θ*(k) as Ê{θ|Z(k)}.
(a) Prove that the linear (i.e., affine), unbiased mean-squared estimator of θ, θ̂(k), is the linear projection of θ on Z(k).
(b) Prove that the linear projection, θ*(k), of θ on the Hilbert space spanned by Z(k) is uniquely equal to the linear (i.e., affine), unbiased mean-squared estimator of θ, θ̂(k).
(c) For random vectors x, y, z, where y and z are uncorrelated, prove that Ê{x|y,z} = Ê{x|y} + Ê{x|z} − m_x.
(d) For random vectors x, y, z, where y and z are correlated, prove that Ê{x|y,z} = Ê{x|y,z̃} = Ê{x|y} + Ê{x|z̃} − m_x, where z̃ = z − Ê{z|y}.
Parts (c) and (d) show that the results given in Theorems 12-3 and 12-4 are "distribution free" within the class of linear (i.e., affine), unbiased, mean-squared estimators.
13-5. Consider the linear model z(k) = 2θ + n(k), where the probability density function of n(k) is given (…, and 0 otherwise). A random sample of N measurements is available. Explain how to find the ML and MAP estimators of θ, and be sure to list all of the assumptions needed to obtain a solution (they are not all given).

Lesson 14
Mean-Squared Estimation of Random Parameters: The Linear and Gaussian Model

INTRODUCTION

In this lesson we begin with the linear model

Z(k) = H(k)θ + V(k)   (14-1)

where θ is an n × 1 vector of random unknown parameters, H(k) is deterministic, and V(k) is white Gaussian noise with known covariance matrix R(k). We also assume that θ is multivariate Gaussian with known mean, m_θ, and covariance, P_θ, i.e.,

θ ~ N(θ; m_θ, P_θ)   (14-2)

and that θ and V(k) are Gaussian and mutually uncorrelated. Our main objectives are to compute θ̂_MS(k) and θ̂_MAP(k) for this linear Gaussian model, and to see how these estimators are related. We shall also illustrate many of our results for the deconvolution and state estimation examples which were described in Lesson 2.

MEAN-SQUARED ESTIMATOR

Because θ and V(k) are Gaussian and mutually uncorrelated, they are jointly Gaussian. Consequently, Z(k) and θ are also jointly Gaussian; thus (Corollary 13-1),

θ̂_MS(k) = m_θ + P_θZ(k)P_Z⁻¹(k)[Z(k) − m_Z(k)]   (14-3)

It is straightforward, using (14-1) and (14-2), to show that

m_Z(k) = H(k)m_θ   (14-4)

P_Z(k) = H(k)P_θH′(k) + R(k)   (14-5)

and

P_θZ(k) = P_θH′(k)   (14-6)

consequently,

θ̂_MS(k) = m_θ + P_θH′(k)[H(k)P_θH′(k) + R(k)]⁻¹[Z(k) − H(k)m_θ]   (14-7)

Observe that θ̂_MS(k) depends on all the given information, namely, Z(k), H(k), m_θ and P_θ.

Next, we compute the error-covariance matrix, P_MS(k), that is associated with θ̂_MS(k) in (14-7). Because θ̂_MS(k) is unbiased, we know that E{θ̃_MS(k)} = 0 for all k; thus,

P_MS(k) = E{θ̃_MS(k)θ̃′_MS(k)}   (14-8)

From (14-7), and the fact that θ̃_MS(k) = θ − θ̂_MS(k), we see that

θ̃_MS(k) = (θ − m_θ) − P_θH′(HP_θH′ + R)⁻¹(Z − Hm_θ)   (14-9)

From (14-9), (14-8) and (14-1) it is a straightforward exercise to show that

P_MS(k) = P_θ − P_θH′(k)[H(k)P_θH′(k) + R(k)]⁻¹H(k)P_θ   (14-10)

Applying Matrix Inversion Lemma 4-1 to (14-10), we obtain the following alternate formula for P_MS(k):

P_MS(k) = [P_θ⁻¹ + H′(k)R⁻¹(k)H(k)]⁻¹   (14-11)

Next, we express θ̂_MS(k) as an explicit function of P_MS(k). To do this, we note that

P_θH′(HP_θH′ + R)⁻¹ = P_θH′(HP_θH′ + R)⁻¹(HP_θH′ + R − HP_θH′)R⁻¹
                    = P_θH′[I − (HP_θH′ + R)⁻¹HP_θH′]R⁻¹
                    = [P_θ − P_θH′(HP_θH′ + R)⁻¹HP_θ]H′R⁻¹
                    = P_MS(k)H′R⁻¹   (14-12)

hence,

θ̂_MS(k) = m_θ + P_MS(k)H′(k)R⁻¹(k)[Z(k) − H(k)m_θ]   (14-13)
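Formulas (14-7) and (14-13) must produce identical estimates. The short sketch below (my own toy dimensions and numbers, not from the text) verifies this numerically, computing P_MS from the information form (14-11):

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 2, 5
H = rng.standard_normal((N, n))          # deterministic observation matrix H(k)
R = 0.5 * np.eye(N)                      # measurement-noise covariance R(k)
m_theta = np.array([1.0, -2.0])          # prior mean m_theta
P_theta = np.diag([4.0, 9.0])            # prior covariance P_theta

theta = rng.multivariate_normal(m_theta, P_theta)
Z = H @ theta + rng.multivariate_normal(np.zeros(N), R)

# (14-7): theta_MS = m + P H'(H P H' + R)^{-1}(Z - H m)
S = H @ P_theta @ H.T + R
est1 = m_theta + P_theta @ H.T @ np.linalg.solve(S, Z - H @ m_theta)

# (14-11) then (14-13): the same estimate via P_MS
P_ms = np.linalg.inv(np.linalg.inv(P_theta) + H.T @ np.linalg.inv(R) @ H)
est2 = m_theta + P_ms @ H.T @ np.linalg.inv(R) @ (Z - H @ m_theta)
```

The (14-7) form inverts an N × N matrix, whereas (14-13) inverts n × n matrices — the practical difference the text returns to in Example 14-1.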


Theorem 14-1. If P_θ⁻¹ = 0, then

θ̂_MS(k) = θ̂_BLU(k)   (14-14)

Proof. Set P_θ⁻¹ = 0 in (14-11), to see that

P_MS(k) = [H′(k)R⁻¹(k)H(k)]⁻¹   (14-15)

and, therefore,

θ̂_MS(k) = [H′(k)R⁻¹(k)H(k)]⁻¹H′(k)R⁻¹(k)Z(k)   (14-16)

Compare (14-16) and (9-22), to conclude that θ̂_MS(k) = θ̂_BLU(k). □

One of the most startling aspects of Theorem 14-1 is that it shows us that BLUE estimation applies to random parameters as well as to deterministic parameters. We return to a reexamination of BLUE below.

What does the condition P_θ⁻¹ = 0, given in Theorem 14-1, mean? Suppose, for example, that the elements of θ are uncorrelated; then P_θ is a diagonal matrix with diagonal elements σ_i². When all of these variances are very large, then P_θ⁻¹ ≈ 0. A large variance for θ_i means we have no idea where θ_i is located about its mean value.

Example 14-1 (Minimum-Variance Deconvolution)

In Example 2-6 we showed that, for the application of deconvolution, our linear model is

Z(N) = H(N − 1)μ + V(N)   (14-17)

We shall assume that μ and V(N) are jointly Gaussian, and that m_μ = 0 and m_V = 0; hence, m_Z = 0. Additionally, we assume that cov[V(N)] = ρI. From (14-7), we determine the following formula for μ̂_MS(N):

μ̂_MS(N) = P_μH′(N − 1)[H(N − 1)P_μH′(N − 1) + ρI]⁻¹Z(N)   (14-18)

Recall, from Example 2-6, that when μ(k) is described by the product model μ(k) = q(k)r(k), then

μ = Q_q r   (14-19)

where

Q_q = diag[q(1), q(2), ..., q(N)]   (14-20)

and

r = col(r(1), r(2), ..., r(N))   (14-21)

In the product model, r(k) is white Gaussian noise with variance σ_r², and q(k) is a Bernoulli sequence. Obviously, if we know Q_q then μ is Gaussian, in which case

P_μ = σ_r² Q_q   (14-22)

where we have used the fact that Q_q² = Q_q, because q(k) = 0 or 1. When Q_q is known, (14-18) becomes

μ̂_MS(N) = σ_r² Q_qH′(N − 1)[σ_r² H(N − 1)Q_qH′(N − 1) + ρI]⁻¹Z(N)   (14-23)

Although μ̂_MS is a mean-squared estimator, so that it enjoys all of the properties of such an estimator (e.g., unbiasedness, minimum variance, etc.), μ̂_MS is not a consistent estimator of μ. Consistency is a large-sample property of an estimator; however, as N increases, the dimension of μ increases, because μ is N × 1. Consequently, we cannot prove consistency of μ̂_MS (recall that, in all other problems, θ is n × 1, where n is data independent; in those problems we can study consistency of θ̂). Equations (14-18) and (14-23) are not very practical for actually computing μ̂_MS, because both require the inversion of an N × N matrix, and N can become quite large (it equals the number of measurements). We return to a more practical way for computing μ̂_MS in Lesson 22. □
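A minimal numerical sketch of (14-23) follows (the impulse response, event probability, and all numbers here are my own illustrative choices, not from the text). Note that, because P_μ = σ_r²Q_q has zero rows wherever q(k) = 0, the estimate μ̂_MS(N) is exactly zero at those times:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 40
h = np.array([1.0, 0.6, 0.3])                 # toy impulse response (illustrative)
H = np.zeros((N, N))                          # lower-triangular convolution matrix H(N-1)
for i in range(N):
    for j in range(max(0, i - len(h) + 1), i + 1):
        H[i, j] = h[i - j]

sigma_r2, rho = 2.0, 0.1
q = (rng.random(N) < 0.2).astype(float)       # Bernoulli event sequence q(k)
r = rng.normal(0.0, sigma_r2 ** 0.5, N)       # white Gaussian amplitudes r(k)
mu = q * r                                    # product model (14-19)
Z = H @ mu + rng.normal(0.0, rho ** 0.5, N)   # measurements, per (14-17)

# (14-22): P_mu = sigma_r^2 Q_q ; (14-23): the MS deconvolution estimate
P_mu = sigma_r2 * np.diag(q)
mu_ms = P_mu @ H.T @ np.linalg.solve(H @ P_mu @ H.T + rho * np.eye(N), Z)
```

Even at this modest N the N × N solve dominates the cost, which is why the text defers a practical algorithm to Lesson 22.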

BEST LINEAR UNBIASED ESTIMATION, REVISITED

In Lesson 9 we derived the BLUE of θ for the linear model (14-1), under the following assumptions about this model:

1. θ is a deterministic but unknown vector of parameters,
2. H(k) is deterministic, and
3. V(k) is zero-mean noise with covariance matrix R(k).

We assumed that θ̂_BLU(k) = F(k)Z(k) and chose F_BLU(k) so that θ̂_BLU(k) is an unbiased estimator of θ, and the error variance for each one of the n elements of θ is minimized. The reader should return to the derivation of θ̂_BLU(k) to see that the assumption "θ is deterministic" is never needed, either in the derivation of the unbiasedness constraint [see the proof of Theorem 6-1, in which Equation (6-9) becomes [I − F(k)H(k)]E{θ} = 0 if θ is random], or in the derivation of J_i(f_i, λ_i) in Equation (9-14) (due to some remarkable cancellations); thus, θ̂_BLU(k), given in (9-22), is applicable to random as well as deterministic parameters in our linear model (14-1).
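The limiting behavior of Theorem 14-1 is easy to see numerically. In the sketch below (toy numbers of my own), a very diffuse prior drives (14-7) toward the BLUE of (9-22), while a very tight prior pins the estimate to m_θ:

```python
import numpy as np

rng = np.random.default_rng(7)
H = rng.standard_normal((6, 2))               # observation matrix H(k)
R = 0.3 * np.eye(6)                           # noise covariance R(k)
Z = rng.standard_normal(6)                    # a measurement vector
m_theta = np.array([5.0, -5.0])               # prior mean

def theta_ms(P_theta):
    # (14-7)
    S = H @ P_theta @ H.T + R
    return m_theta + P_theta @ H.T @ np.linalg.solve(S, Z - H @ m_theta)

# (9-22): the BLUE, which uses no prior statistics at all
blue = np.linalg.solve(H.T @ np.linalg.inv(R) @ H, H.T @ np.linalg.inv(R) @ Z)

diffuse = theta_ms(1e4 * np.eye(2))   # huge prior variances: P_theta^{-1} ~ 0
tight = theta_ms(1e-6 * np.eye(2))    # tiny prior variances: estimate hugs m_theta
```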

Lesson 15
Elements of Discrete-Time Gauss-Markov Random Processes

INTRODUCTION

Lessons 13 and 14 have demonstrated the importance of Gaussian random variables in estimation theory. In this lesson we extend some of the basic concepts that were introduced in Lesson 12, for Gaussian random variables, to indexed random variables, namely, random processes. These extensions are needed in order to develop state estimators.

DEFINITIONS AND PROPERTIES OF DISCRETE-TIME GAUSS-MARKOV RANDOM PROCESSES

Recall that a random process is a collection of random variables in which the notion of time plays a role.

Definition 15-1 (Meditch, 1969, pg. 106). A vector random process is a family of random vectors {s(t), t ∈ T} indexed by a parameter t all of whose values lie in some appropriate index set T. When T = {k: k = 0, 1, ...} we have a discrete-time random process. □

Definition 15-2 (Meditch, 1969, pg. 117). A vector random process {s(t), t ∈ T} is defined to be multivariate Gaussian if, for any l time points t₁, t₂, ..., t_l in T, where l is an integer, the set of l random n-vectors s(t₁), s(t₂), ..., s(t_l) is jointly Gaussian distributed. □

Definition 15-3 (Meditch, 1969, pg. 118). A vector random process {s(t), t ∈ T} is a Markov process if, for any m time points t₁ < t₂ < ... < t_m in T, where m is any integer, it is true that

P[s(t_m) ≤ S(t_m)|s(t_{m−1}), ..., s(t₁)] = P[s(t_m) ≤ S(t_m)|s(t_{m−1})]   (15-4)

For continuous random variables, this means that

p[s(t_m)|s(t_{m−1}), ..., s(t₁)] = p[s(t_m)|s(t_{m−1})]   (15-5)

Note that, in (15-4), s(t_m) ≤ S(t_m) means s_i(t_m) ≤ S_i(t_m) for i = 1, 2, ..., n. If we view time point t_m as the present time and time points t_{m−1}, ..., t₁ as the past, then a Markov process is one whose probability law (i.e., probability density function) depends only on the immediate past value, at t_{m−1}. This is often referred to as the Markov property for a vector random process. Because the probability law depends only on the immediate past value, we often refer to such a process as a first-order Markov process (if it depended on the two immediately preceding values it would be a second-order Markov process).

Theorem 15-1. Let {s(t), t ∈ T} be a first-order Markov process, and t₁ < t₂ < ... < t_m be any time points in T, where m is an integer. Then

p[s(t_m), s(t_{m−1}), ..., s(t₁)] = p[s(t₁)] ∏_{i=2}^{m} p[s(t_i)|s(t_{i−1})]   (15-7)

Proof. From probability theory (e.g., Papoulis, 1965) and the Markov property of s(t), we know that

p[s(t_m), s(t_{m−1}), ..., s(t₁)] = p[s(t_m)|s(t_{m−1}), ..., s(t₁)] p[s(t_{m−1}), ..., s(t₁)]
                                 = p[s(t_m)|s(t_{m−1})] p[s(t_{m−1}), ..., s(t₁)]   (15-8)

In a similar manner, we find that

p[s(t_{m−1}), ..., s(t₁)] = p[s(t_{m−1})|s(t_{m−2})] p[s(t_{m−2}), ..., s(t₁)]
p[s(t_{m−2}), ..., s(t₁)] = p[s(t_{m−2})|s(t_{m−3})] p[s(t_{m−3}), ..., s(t₁)]
⋮   (15-9)

Equation (15-7) is obtained by successively substituting each one of the equations in (15-9) into (15-8). □

Theorem 15-1 demonstrates that a first-order Markov process is completely characterized by two probability density functions, namely, the transition probability density function, p[s(t_i)|s(t_{i−1})], and the initial (prior) probability density function, p[s(t₁)]. Note that generally the transition probability density functions can all be different, in which case they should be subscripted [e.g., p_m[s(t_m)|s(t_{m−1})] and p_{m−1}[s(t_{m−1})|s(t_{m−2})]].

Theorem 15-2. For a first-order Markov process,

E{s(t_m)|s(t_{m−1}), ..., s(t₁)} = E{s(t_m)|s(t_{m−1})}   (15-10)

We leave the proof of this useful result as an exercise. A vector random process that is both Gaussian and a first-order Markov process will be referred to in the sequel as a Gauss-Markov process.

Definition 15-4. A vector random process {s(t), t ∈ T} is said to be a Gaussian white process if, for any m time points t₁, t₂, ..., t_m in T, where m is any integer, the m random vectors s(t₁), s(t₂), ..., s(t_m) are uncorrelated Gaussian random vectors. □

White noise is zero mean, or else it cannot have a flat spectrum. For white noise,

E{s(t_i)s′(t_j)} = 0   for all i ≠ j   (15-11)

Additionally, for Gaussian white noise,

p[s(t_m), s(t_{m−1}), ..., s(t₁)] = p[s(t_m)] p[s(t_{m−1})] ⋯ p[s(t₁)]   (15-12)

[because, for jointly Gaussian random vectors, uncorrelatedness implies statistical independence (see Lesson 12)], where p[s(t_i)] is a multivariate Gaussian probability density function.

Theorem 15-3. A vector Gaussian white process s(t) can be viewed as a first-order Gauss-Markov process for which

p[s(t)|s(τ)] = p[s(t)]   (15-13)

for all t, τ ∈ T and t ≠ τ.

Proof. For a Gaussian white process, we know, from (15-12), that

p[s(t), s(τ)] = p[s(t)] p[s(τ)]   (15-14)

but we also know that

p[s(t), s(τ)] = p[s(t)|s(τ)] p[s(τ)]   (15-15)

Equating (15-14) and (15-15), we obtain (15-13). □

Theorem 15-3 means that past and future values of s(t) in no way help determine present values of s(t). For Gaussian white processes, the transition probability density function equals the marginal density function, p[s(t)], which is multivariate Gaussian. Additionally, E{s(t_m)|s(t_{m−1}), ..., s(t₁)} = E{s(t_m)} (see Problem 15-1).

A BASIC STATE-VARIABLE MODEL

In succeeding lessons we shall develop a variety of state estimators for the following basic linear, (possibly) time-varying, discrete-time dynamical system (our basic state-variable model), which is characterized by n × 1 state vector x(k) and m × 1 measurement vector z(k):

x(k + 1) = Φ(k + 1, k)x(k) + Γ(k + 1, k)w(k) + Ψ(k + 1, k)u(k)   (15-17)

z(k + 1) = H(k + 1)x(k + 1) + v(k + 1)   (15-18)

where k = 0, 1, .... In this model w(k) and v(k) are p × 1 and m × 1 mutually uncorrelated (possibly nonstationary) jointly Gaussian white noise sequences; i.e.,

E{w(i)w′(j)} = Q(i)δ_{ij}   (15-19)

and

E{v(i)v′(j)} = R(i)δ_{ij}   (15-20)

and

E{w(i)v′(j)} = S = 0   for all i and j   (15-21)

Covariance matrix Q(i) is positive semidefinite and R(i) is positive definite [so that R⁻¹(i) exists]. Additionally, u(k) is an l × 1 vector of known system inputs, and initial state vector x(0) is multivariate Gaussian, with mean m_x(0) and covariance P_x(0), i.e.,

x(0) ~ N(x(0); m_x(0), P_x(0))   (15-22)

and x(0) is not correlated with w(k) and v(k). The dimensions of matrices Φ, Γ, Ψ, H, Q and R are n × n, n × p, n × l, m × n, p × p, and m × m, respectively.

Disturbance w(k) is often used to model the following types of uncertainty:

1. disturbance forces acting on the system (e.g., wind that buffets an airplane);
2. errors in modeling the system (e.g., neglected effects); and
3. errors, due to actuators, in the translation of the known input, u(k), into physical signals.

Vector v(k) is often used to model the following types of uncertainty:

1. errors in measurements made by sensing instruments;
2. unavoidable disturbances that act directly on the sensors; and
3. errors in the realization of feedback compensators using physical components [this is valid only when the measurement equation contains a direct throughput of the input u(k), i.e., when z(k + 1) = H(k + 1)x(k + 1) + G(k + 1)u(k + 1) + v(k + 1); we shall examine this situation in Lesson 23].

Of course, not all dynamical systems are described by this basic model. In general, w(k) and v(k) may be correlated, some measurements may be made so accurately that, for all practical purposes, they are "perfect" (i.e., there is no measurement noise associated with them), and either w(k) or v(k), or both, may be colored noise processes. We shall consider the modification of our basic state-variable model for each of these important situations in Lesson 23.

PROPERTIES OF THE BASIC STATE-VARIABLE MODEL

In this section we state and prove a number of important statistical properties for our basic state-variable model.

Theorem 15-4. When x(0) and w(k) are jointly Gaussian, then {x(k), k = 0, 1, ...} is a Gauss-Markov sequence. Note that if x(0) and w(k) are individually Gaussian and statistically independent (or uncorrelated), then they will be jointly Gaussian (Papoulis, 1965).

Proof
a. Gaussian property [assuming u(k) nonrandom]. Because u(k) is nonrandom, it has no effect on determining whether x(k) is Gaussian; hence, for this part of the proof we assume u(k) = 0. The solution to (15-17) is

x(k) = Φ(k, 0)x(0) + Σ_{i=1}^{k} Φ(k, i)Γ(i, i − 1)w(i − 1)   (15-23)

where

Φ(k, i) = Φ(k, k − 1)Φ(k − 1, k − 2) ⋯ Φ(i + 1, i)   (15-24)

Observe that x(k) is a linear transformation of the jointly Gaussian random vectors x(0), w(0), w(1), ..., w(k − 1); hence, x(k) is Gaussian.
b. Markov property. This property does not require x(k) or w(k) to be Gaussian. Because x satisfies state equation (15-17), x(k) depends only on its immediate past value; hence, x(k) is Markov. □

We have been able to show that our dynamical system is Markov because we specified a model for it. Without such a specification, it would be quite difficult (or impossible) to test for the Markov nature of a random process. By stacking up x(1), x(2), ... into a supervector it is easily seen that this supervector is just a linear transformation of the jointly Gaussian quantities x(0), w(0), w(1), ...; hence, x(1), x(2), ... are themselves jointly Gaussian.

A Gauss-Markov sequence can be completely characterized in two ways:

1. specify the marginal density of the initial state vector, p[x(0)], and the transition density p[x(k + 1)|x(k)]; or
2. specify the mean and covariance of the state vector sequence. The second characterization is a complete one because Gaussian random vectors are completely characterized by their means and covariances (Lesson 12).

We shall find the second characterization more useful than the first.

The Gaussian density function for state vector x(k) is

p[x(k)] = [(2π)^n |P_x(k)|]^{−1/2} exp{−(1/2)[x(k) − m_x(k)]′P_x⁻¹(k)[x(k) − m_x(k)]}   (15-25)

where

m_x(k) = E{x(k)}   (15-26)

and

P_x(k) = E{[x(k) − m_x(k)][x(k) − m_x(k)]′}   (15-27)

We now demonstrate that m_x(k) and P_x(k) can be computed by means of recursive equations.

Theorem 15-5. For our basic state-variable model,
a. m_x(k) can be computed from the vector recursive equation

m_x(k + 1) = Φ(k + 1, k)m_x(k) + Ψ(k + 1, k)u(k)   (15-28)

where k = 0, 1, ..., and m_x(0) initializes (15-28);
b. P_x(k) can be computed from the matrix recursive equation

P_x(k + 1) = Φ(k + 1, k)P_x(k)Φ′(k + 1, k) + Γ(k + 1, k)Q(k)Γ′(k + 1, k)   (15-29)

where k = 0, 1, ..., and P_x(0) initializes (15-29); and
c. E{[x(i) − m_x(i)][x(j) − m_x(j)]′} ≜ P_x(i, j) can be computed from

P_x(i, j) = Φ(i, j)P_x(j)   when i > j
P_x(i, j) = P_x(i)Φ′(j, i)   when i < j   (15-30)

Proof
a. Take expectations on both sides of (15-17), using the facts that E{w(k)} = 0 and u(k) is deterministic; the result is (15-28).
b. Using (15-17) and (15-28),

P_x(k + 1) = E{[x(k + 1) − m_x(k + 1)][x(k + 1) − m_x(k + 1)]′}
           = E{[Φ(x(k) − m_x(k)) + Γw(k)][Φ(x(k) − m_x(k)) + Γw(k)]′}
           = ΦP_x(k)Φ′ + ΓQ(k)Γ′ + ΦE{[x(k) − m_x(k)]w′(k)}Γ′ + ΓE{w(k)[x(k) − m_x(k)]′}Φ′   (15-31)

Because m_x(k) is not random and w(k) is zero mean, E{m_x(k)w′(k)} = m_x(k)E{w′(k)} = 0, and E{w(k)m_x′(k)} = 0. State vector x(k) depends at most on random input w(k − 1) [see (15-17)]; hence,

E{x(k)w′(k)} = E{x(k)}E{w′(k)} = 0   (15-32)

and E{w(k)x′(k)} = 0 as well. The last two terms in (15-31) are therefore equal to zero, and the equation reduces to (15-29).
c. We leave the proof of (15-30) as an exercise. Observe that once we know covariance matrix P_x(k), it is an easy matter to determine any cross-covariance matrix between states x(k) and x(j) (j ≠ k). The Markov nature of our basic state-variable model is responsible for this. □

Observe that mean vector m_x(k) satisfies a deterministic vector state equation, (15-28), that covariance matrix P_x(k) satisfies a deterministic matrix state equation, (15-29), and that (15-28) and (15-29) are easily programmed for digital computation. Next we direct our attention to the statistics of measurement vector z(k).

Theorem 15-6. For our basic state-variable model, when x(0), w(k) and v(k) are jointly Gaussian, then {z(k), k = 1, 2, ...} is Gaussian, and

m_z(k + 1) = H(k + 1)m_x(k + 1)   (15-33)

and

P_z(k + 1) = H(k + 1)P_x(k + 1)H′(k + 1) + R(k + 1)   (15-34)

We leave the proof as an exercise for the reader. Note that if x(0), w(k) and v(k) are statistically independent and Gaussian, they will be jointly Gaussian.

Example 15-1
Consider the simple single-input single-output first-order system

x(k + 1) = (1/2)x(k) + w(k)   (15-35)

z(k + 1) = x(k + 1) + v(k + 1)   (15-36)

where w(k) and v(k) are wide-sense stationary white noise processes, for which q = 20 and r = 5. Additionally, m_x(0) = 4 and p_x(0) = 10. The mean of x(k) is computed from the following homogeneous equation:

m_x(k + 1) = (1/2)m_x(k),   m_x(0) = 4   (15-37)

and the variance of x(k) is computed from Equation (15-29), which in this case simplifies to

p_x(k + 1) = (1/4)p_x(k) + 20,   p_x(0) = 10   (15-38)
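Recursions (15-37) and (15-38) are trivial to iterate; the sketch below (my own code, not the book's) reproduces the steady-state variance p̄_x = 20/(1 − 1/4) = 80/3 ≈ 26.67, whose square root ≈ 5.16 is the steady-state standard deviation discussed below:

```python
m_x, p_x = 4.0, 10.0                 # m_x(0) = 4, p_x(0) = 10
for k in range(60):
    m_x = 0.5 * m_x                  # (15-37): mean recursion
    p_x = 0.25 * p_x + 20.0          # (15-38): variance recursion

p_bar_x = 20.0 / (1.0 - 0.25)        # fixed point of (15-38): 80/3
p_bar_z = p_bar_x + 5.0              # steady-state variance of z(k), via the measurement noise r = 5
```

The iteration converges geometrically (contraction factor 1/4), so 60 steps are far more than enough.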

Additionally, the mean and variance of z(k) are computed from

m_z(k + 1) = m_x(k + 1)   (15-39)

and

p_z(k + 1) = p_x(k + 1) + 5   (15-40)

Figure 15-1 depicts m_x(k) and p_x^{1/2}(k). Observe that m_x(k) decays to zero very rapidly and that p_x^{1/2}(k) approaches a steady-state value, p̄_x^{1/2} = 5.163. This steady-state value can be computed from equation (15-38) by setting p_x(k) = p_x(k + 1) = p̄_x. The existence of p̄_x is guaranteed by our first-order system being stable. Although m_x(k) → 0, there is a lot of uncertainty about x(k), as evidenced by the large value of p̄_x. There will be an even larger uncertainty about z(k), because p̄_z → 31.66. These large values for p̄_x and p̄_z are due to the large values of q and r. In many practical applications, both q and r will be much less than unity, in which case p̄_x and p̄_z will be quite small. □

Figure 15-1  Mean (dashed) and standard deviation (bars) for first-order system (15-35) and (15-36).

If our basic state-variable model is time-invariant and stationary, and if Φ is associated with an asymptotically stable system (i.e., one whose poles all lie within the unit circle), then (Anderson and Moore, 1979) matrix P_x(k) reaches a limiting (steady-state) solution P̄_x, i.e.,

lim_{k→∞} P_x(k) = P̄_x   (15-41)

Matrix P̄_x is the solution of the following steady-state version of (15-29):

P̄_x = ΦP̄_xΦ′ + ΓQΓ′   (15-42)

This equation is called a discrete-time Lyapunov equation. See Laub (1979) for an excellent numerical method that can be used to solve (15-42) for P̄_x.

SIGNAL-TO-NOISE RATIO

In this section we simplify our basic state-variable model (15-17) and (15-18) to a time-invariant, stationary, single-input single-output model:

x(k + 1) = Φx(k) + γw(k) + ψu(k)   (15-43)

z(k + 1) = h′x(k + 1) + v(k + 1)   (15-44)

Measurement z(k) is of the classical form of signal plus noise, where "signal" s(k) = h′x(k). The signal-to-noise ratio is an often-used measure of quality of measurement z(k). Here we define that ratio, denoted SNR(k), as

SNR(k) = σ_s²(k)/σ_v²   (15-45)

From preceding analyses, we see that

SNR(k) = h′P_x(k)h / r   (15-46)

Because P_x(k) is in general a function of time, SNR(k) is also a function of time. If, however, Φ is associated with an asymptotically stable system, then (15-41) is true. In this case we can use P̄_x in (15-46) to provide us with a single number, SNR, for the signal-to-noise ratio, i.e.,

SNR = h′P̄_xh / r   (15-47)

Finally, we demonstrate that SNR(k) (or SNR) can be computed without knowing q and r explicitly; all that is needed is the ratio q/r. Multiplying and dividing the right-hand side of (15-46) by q, we find that

SNR(k) = [h′ (P_x(k)/q) h](q/r)   (15-48)


Scaled covariance matrix P_x(k)/q is computed from the following version of (15-29):

P_x(k + 1)/q = Φ[P_x(k)/q]Φ′ + γγ′   (15-49)

One of the most useful ways of using (15-48) is to compute q/r for a given signal-to-noise ratio SNR, i.e.,

q/r = SNR / [h′(P̄_x/q)h]   (15-50)

In Lesson 18 we show that q/r can be viewed as an estimator tuning parameter; hence, signal-to-noise ratio, SNR, can also be treated as such a parameter.

Example 15-2 (Mendel, 1981)
Consider the first-order system

x(k + 1) = φx(k) + γw(k)   (15-51)

z(k + 1) = hx(k + 1) + v(k + 1)   (15-52)

In this case, it is easy to solve (15-49), to show that

p̄_x/q = γ²/(1 − φ²)   (15-53)

hence,

SNR = [h²γ²/(1 − φ²)](q/r)   (15-54)

Observe that, if h²γ² = 1 − φ², then SNR = q/r. The condition h²γ² = 1 − φ² is satisfied if, for example, γ = 1, φ = 1/√2 and h = 1/√2. □
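A numerical sketch of Example 15-2 (code and the value of q/r are mine): iterate the scaled recursion (15-49) to steady state and confirm that, with γ = 1 and φ = h = 1/√2, the SNR of (15-48) equals q/r:

```python
import math

phi, gamma, h = 1.0 / math.sqrt(2.0), 1.0, 1.0 / math.sqrt(2.0)
q_over_r = 3.0                        # only the ratio q/r is needed (my choice)

p_over_q = 0.0                        # p_x(0)/q; iterate (15-49) to steady state
for _ in range(200):
    p_over_q = phi * p_over_q * phi + gamma * gamma

snr = h * p_over_q * h * q_over_r     # (15-48)
```

Here p̄_x/q converges to γ²/(1 − φ²) = 2, in agreement with (15-53).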

PROBLEMS

15-1. Prove Theorem 15-2, and then show that for Gaussian white noise E{s(t_m)|s(t_{m−1}), ..., s(t₁)} = E{s(t_m)}.
15-2. Derive the formula for the cross-covariance of x(k), P_x(i, j), given in (15-30).
15-3. Derive the first- and second-order statistics of measurement vector z(k) that are summarized in Theorem 15-6.
15-4. Reconsider the basic state-variable model when x(0) is correlated with w(0), and w(k) and v(k) are correlated [E{w(k)v′(k)} = S(k)].
(a) Show that the covariance equation for z(k) remains unchanged.
(b) Show that the covariance equation for x(k) is changed, but only at k = 1.
(c) Compute E{z(k + 1)z′(k)}.

15-5. In this problem, assume that u(k) and v(k) are individually Gaussian and uncorrelated. Impulse response h depends on parameter a, where a is a Gaussian random variable that is statistically independent of u(k) and v(k).
(a) Evaluate E{z(k)}.
(b) Explain whether or not z(k) is Gaussian.

Lesson 16
State Estimation: Prediction

INTRODUCTION

We have mentioned, a number of times in this book, that in state estimation three situations are possible, depending upon the relative relationship of the total number of available measurements, N, and the time point, k, at which we estimate state vector x(k), namely: prediction (N < k), filtering (N = k), and smoothing (N > k). In this lesson we develop algorithms for mean-squared predicted estimates, x̂_MS(k|j), of state x(k). In order to simplify our notation, we shall abbreviate x̂_MS(k|j) as x̂(k|j). (Just in case you have forgotten what the notation x̂(k|j) stands for, see Lesson 2.) Note that, in prediction, k > j.

SINGLE-STAGE PREDICTOR

The most important predictor of x(k) for our future work on filtering and smoothing is the single-stage predictor x̂(k|k − 1). From the Fundamental Theorem of Estimation Theory (Theorem 13-1), we know that

x̂(k|k − 1) = E{x(k)|Z(k − 1)}   (16-1)

where

Z(k − 1) = col(z(1), z(2), ..., z(k − 1))   (16-2)

It is very easy to derive a formula for x̂(k|k − 1) by operating on both sides of the state equation

x(k) = Φ(k, k − 1)x(k − 1) + Γ(k, k − 1)w(k − 1) + Ψ(k, k − 1)u(k − 1)   (16-3)

with the linear expectation operator E{·|Z(k − 1)}. Doing this, we find that

x̂(k|k − 1) = Φ(k, k − 1)x̂(k − 1|k − 1) + Ψ(k, k − 1)u(k − 1)   (16-4)

where k = 1, 2, .... To obtain (16-4) we have used the facts that E{w(k − 1)} = 0 and u(k − 1) is deterministic. Observe, from (16-4), that the single-stage predicted estimate, x̂(k|k − 1), depends on the filtered estimate, x̂(k − 1|k − 1), of the preceding state vector x(k − 1). At this point, (16-4) is an interesting theoretical result; but there is nothing much we can do with it, because we do not as yet know how to compute filtered state estimates. In Lesson 17 we shall begin our study of filtered state estimates, and shall learn that such estimates of x(k) depend on predicted estimates of x(k), just as predicted estimates of x(k) depend on filtered estimates of x(k − 1); thus, filtered and predicted state estimates are very tightly coupled together.

Let P(k|k − 1) denote the error-covariance matrix that is associated with x̂(k|k − 1), i.e.,

P(k|k − 1) = E{[x̃(k|k − 1) − m_x̃(k|k − 1)][x̃(k|k − 1) − m_x̃(k|k − 1)]′}   (16-5)

where

x̃(k|k − 1) = x(k) − x̂(k|k − 1)   (16-6)

Additionally, let P(k − 1|k − 1) denote the error-covariance matrix that is associated with x̂(k − 1|k − 1), i.e.,

P(k − 1|k − 1) = E{[x̃(k − 1|k − 1) − m_x̃(k − 1|k − 1)][x̃(k − 1|k − 1) − m_x̃(k − 1|k − 1)]′}   (16-7)

For our basic state-variable model (see Property 1 of Lesson 13), m_x̃(k|k − 1) = 0 and m_x̃(k − 1|k − 1) = 0, so that

P(k|k − 1) = E{x̃(k|k − 1)x̃′(k|k − 1)}   (16-8)

and

P(k − 1|k − 1) = E{x̃(k − 1|k − 1)x̃′(k − 1|k − 1)}   (16-9)

Combining (16-3) and (16-4), we see that

x̃(k|k − 1) = Φ(k, k − 1)x̃(k − 1|k − 1) + Γ(k, k − 1)w(k − 1)   (16-10)

A straightforward calculation leads to the following formula for P(k|k − 1):

P(k|k − 1) = Φ(k, k − 1)P(k − 1|k − 1)Φ′(k, k − 1) + Γ(k, k − 1)Q(k − 1)Γ′(k, k − 1)   (16-11)

where k = 1, 2, .... Observe, from (16-4) and (16-11), that x̂(0|0) and P(0|0) initialize the single-stage predictor and its error covariance. Additionally,

x̂(0|0) = E{x(0)|no measurements} = m_x(0)   (16-12)

and

P(0|0) = E{x̃(0|0)x̃′(0|0)} = E{[x(0) − m_x(0)][x(0) − m_x(0)]′} = P_x(0)   (16-13)

Finally, recall (Property 4 of Lesson 13) that both x̂(k|k − 1) and x̃(k|k − 1) are Gaussian.
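The single-stage recursions (16-4) and (16-11) are easily coded. The sketch below (my own time-invariant toy system, not from the text) performs one predictor step from the initial conditions (16-12) and (16-13):

```python
import numpy as np

# time-invariant instance of (16-4) and (16-11); numbers are illustrative
Phi = np.array([[1.0, 1.0], [0.0, 0.9]])
Gamma = np.array([[0.0], [1.0]])
Psi = np.array([[0.0], [1.0]])
Q = np.array([[0.1]])

def single_stage(x_filt, P_filt, u):
    # (16-4): x(k|k-1) = Phi x(k-1|k-1) + Psi u(k-1)
    x_pred = Phi @ x_filt + Psi @ np.array([u])
    # (16-11): P(k|k-1) = Phi P(k-1|k-1) Phi' + Gamma Q Gamma'
    P_pred = Phi @ P_filt @ Phi.T + Gamma @ Q @ Gamma.T
    return x_pred, P_pred

# (16-12), (16-13): initialize with m_x(0) and P_x(0)
x0, P0 = np.array([0.0, 1.0]), np.eye(2)
x1, P1 = single_stage(x0, P0, u=0.5)
```

P1 stays symmetric, as an error covariance must.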

A GENERAL STATE PREDICTOR

In this section we generalize the results of the preceding section so as to obtain predicted values of x(k) that look further into the future than just one step. We shall determine x̂(k|j), where k > j, under the assumption that filtered state estimate x̂(j|j) and its error-covariance matrix E{x̃(j|j)x̃′(j|j)} = P(j|j) are known for some j = 0, 1, ....

Theorem 16-1.
a. If input u(k) is deterministic, or does not depend on any measurements, then the mean-squared predicted estimator of x(k), x̂(k|j), is given by the expression

x̂(k|j) = Φ(k, j)x̂(j|j) + Σ_{i=j+1}^{k} Φ(k, i)Ψ(i, i − 1)u(i − 1),   k > j   (16-14)

b. The vector random process {x̃(k|j), k = j + 1, j + 2, ...} is: (i) zero mean, (ii) Gaussian, and (iii) first-order Markov, and (iv) its covariance matrix is governed by

P(k|j) = Φ(k, k − 1)P(k − 1|j)Φ′(k, k − 1) + Γ(k, k − 1)Q(k − 1)Γ′(k, k − 1)   (16-15)

Before proving this theorem, let us observe that the prediction formula (16-14) is intuitively what one would expect. Why is this so? Suppose we have processed all of the measurements z(1), z(2), ..., z(j) to obtain x̂(j|j) and are asked to predict the value of x(k), where k > j. No additional measurements can be used during prediction. All that we can therefore use is our dynamical state equation. When that equation is used for purposes of prediction we neglect the random disturbance term, because the disturbances are not measurable. We can only use measured quantities to assist our prediction efforts. The simplified state equation is

x(k + 1) = Φ(k + 1, k)x(k) + Ψ(k + 1, k)u(k)   (16-16)

a solution of which is

x(k) = Φ(k, j)x(j) + Σ_{i=j+1}^{k} Φ(k, i)Ψ(i, i − 1)u(i − 1)   (16-17)

Substituting x̂(j|j) for x(j), we obtain the predictor in (16-14). In our proof of Theorem 16-1 we establish (16-14) in a more rigorous manner.

Proof
a. The solution to state equation (16-3), for x(k), can be expressed in terms of x(j), where j < k, as

x(k) = Φ(k, j)x(j) + Σ_{i=j+1}^{k} Φ(k, i)[Γ(i, i − 1)w(i − 1) + Ψ(i, i − 1)u(i − 1)]   (16-18)

We apply the Fundamental Theorem of Estimation Theory to (16-18) by taking the conditional expectation with respect to Z(j) on both sides of it. Doing this, we find that

x̂(k|j) = Φ(k, j)x̂(j|j) + Σ_{i=j+1}^{k} Φ(k, i)[Γ(i, i − 1)E{w(i − 1)|Z(j)} + Ψ(i, i − 1)E{u(i − 1)|Z(j)}]   (16-19)

Note that Z(j) depends at most on x(j) which, in turn, depends at most on w(j − 1); consequently,

E{w(i − 1)|Z(j)} = E{w(i − 1)|w(0), w(1), ..., w(j − 1)}   (16-20)

where i = j + 1, j + 2, ..., k. Because of this range of values on argument i, w(i − 1) is never included in the conditioning set of values w(0), w(1), ..., w(j − 1); hence,

E{w(i − 1)|Z(j)} = E{w(i − 1)} = 0   (16-21)

for all i = j + 1, j + 2, ..., k. Note, also, that

E{u(i − 1)|Z(j)} = E{u(i − 1)} = u(i − 1)   (16-22)

because we have assumed that u(i − 1) does not depend on any of the measurements. Substituting (16-21) and (16-22) into (16-19), we obtain the prediction formula (16-14).
b-i and b-ii have already been proved in Properties 1 and 4 of Lesson 13.
b-iii. Starting with x̃(k|j) = x(k) − x̂(k|j), and substituting (16-18) and (16-14) into this relation, we find that

x̃(k|j) = Φ(k, j)x̃(j|j) + Σ_{i=j+1}^{k} Φ(k, i)Γ(i, i − 1)w(i − 1)   (16-23)

This equation looks quite similar to the solution of state equation (16-3) when u(k) = 0 (for all k), e.g., see (16-18). In fact, x̃(k|j) also satisfies the state equation

x̃(k|j) = Φ(k, k − 1)x̃(k − 1|j) + Γ(k, k − 1)w(k − 1)   (16-24)

Because x̃(k|j) depends only upon its previous value, x̃(k − 1|j), it is first-order Markov.
b-iv. We derived a recursive covariance equation for x(k) in Theorem 15-5; that equation is (15-29). Because x̃(k|j) satisfies the same type of state equation as x(k), its covariance P(k|j) is also given by (15-29). We have rewritten that equation as (16-15). □

Observe that by setting j = k − 1 in (16-14) we obtain our previously derived single-stage predictor x̂(k|k − 1). Theorem 16-1 is quite limited, because presently the only values of x̂(j|j) and P(j|j) that we know are those at j = 0. For j = 0, (16-14) becomes

x̂(k|0) = Φ(k, 0)m_x(0) + Σ_{i=1}^{k} Φ(k, i)Ψ(i, i − 1)u(i − 1)   (16-25)

The reader might feel that this predictor of x(k) becomespoorer and poorer as k gets farther and farther away from zero. The following example demonstrates that this is not necessarilytrue. Example 16-l

Let us examine prediction performance, as measured by p (k/0), for the first-order system x(k + 1) = Lx(k) + w(k) (14-26) VT where q = 25 and p (0) is variable. Quantity p (k IO),which in the caseof a scalar state vector is a variance, k easily computed from the recursive equation p(kiO) = $7 (k - l/O) + 25 for k = 1,2.*...

(16-27)

Two casesare summarized in Figure 16-l- JA%enp (0) = 6 we have

Figure 16-1

Predictionerror variance p (k/O).

relatively small uncertainty about ;(O/O),and as we expected, our predictions of x (k) for k 2 1 do become worse, becausep (k10) > 6 for all k 2 1. After a whilep(kj0) reaches a limiting value equal to 50. When this occurs we are estimating i(k/O) by a number 1 k 1 k that is very close to zero, becausei(k IO) = ;(OjO),and approacheszero i lh i ( y’z 1 for large values of k. When p(O) = 100 we have large uncertainty about ;(OlO), and, perhaps to our surprise, our predictions of x(k) for k L 1 improve in performance, becausep(k[O) < 100 for all k 2 1. In this case the predictor discounts the large initial uncertainty; however, as in the former case,p (k IO)again reachesthe limiting value of 50. For suitably large values of k, the predictor is completely insensitive to p (0). It reachesa steady-statelevel of performance equal to 50, which can be predetermined by setting p (k IO)and p (k - 110)equal top, in (N-27), and solving the resulting equation forjT. El
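The variance recursion of Example 16-1 is easy to check numerically. A minimal sketch (plain Python; the constants a² = 1/2 and q = 25 are those of the example):

```python
def prediction_variance(p0, steps=30):
    """Iterate p(k|0) = (1/2) p(k-1|0) + 25, i.e., equation (16-27),
    for the system of Example 16-1 (a = 1/sqrt(2), q = 25)."""
    history = [p0]
    for _ in range(steps):
        history.append(0.5 * history[-1] + 25.0)
    return history

small = prediction_variance(6.0)    # small initial uncertainty: variance rises toward 50
large = prediction_variance(100.0)  # large initial uncertainty: variance falls toward 50
```

Both runs settle at p̄ = 50, the solution of p̄ = p̄/2 + 25, regardless of p(0).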

Prediction is possible only because we have a known dynamical model, namely, our state-variable model. Without a model, prediction is dubious at best (e.g., try predicting tomorrow's price of a stock listed on any stock exchange using today's closing price).

THE INNOVATIONS PROCESS

Suppose we have just computed the single-stage predicted estimate of x(k + 1), x̂(k + 1|k). Then the single-stage predicted estimate of z(k + 1), ẑ(k + 1|k), is

ẑ(k + 1|k) = E{z(k + 1)|Z(k)} = E{[H(k + 1)x(k + 1) + v(k + 1)]|Z(k)}

so that

ẑ(k + 1|k) = H(k + 1)x̂(k + 1|k)    (16-28)

The error between z(k + 1) and ẑ(k + 1|k) is z̃(k + 1|k), i.e.,

z̃(k + 1|k) = z(k + 1) − ẑ(k + 1|k)    (16-29)

Signal z̃(k + 1|k) is often referred to either as the innovations process, prediction-error process, or measurement-residual process. We shall refer to it as the innovations process, because this is most commonly done in the estimation theory literature (e.g., Kailath, 1968). The innovations process plays a very important role in mean-squared filtering and smoothing. We summarize important facts about it in the following:

Theorem 16-2 (Innovations)

a. The following representations of the innovations process z̃(k + 1|k) are equivalent:

z̃(k + 1|k) = z(k + 1) − ẑ(k + 1|k)    (16-30)

z̃(k + 1|k) = z(k + 1) − H(k + 1)x̂(k + 1|k)    (16-31)

z̃(k + 1|k) = H(k + 1)x̃(k + 1|k) + v(k + 1)    (16-32)

b. The innovations is a zero-mean Gaussian white noise sequence, with

E{z̃(k + 1|k)z̃′(k + 1|k)} = P_z̃(k + 1|k) = H(k + 1)P(k + 1|k)H′(k + 1) + R(k + 1)    (16-33)

The inverse of P_z̃(k + 1|k) is needed below; hence, we shall assume that H(k + 1)P(k + 1|k)H′(k + 1) + R(k + 1) is nonsingular. This is usually true and will always be true if, as in our basic state-variable model, R(k + 1) is positive definite.

Proof (Mendel, 1983b)

a. Substitute (16-28) into (16-29) in order to obtain (16-31). Next, substitute the measurement equation z(k + 1) = H(k + 1)x(k + 1) + v(k + 1) into (16-31), and use the fact that x̃(k + 1|k) = x(k + 1) − x̂(k + 1|k), to obtain (16-32).

b. Because x̃(k + 1|k) and v(k + 1) are both zero mean, E{z̃(k + 1|k)} = 0. The innovations is Gaussian because z(k + 1) and x̂(k + 1|k) are Gaussian, and, therefore, z̃(k + 1|k) is a linear transformation of Gaussian random vectors. To prove that z̃(k + 1|k) is white noise we must show that

E{z̃(i + 1|i)z̃′(j + 1|j)} = P_z̃(i + 1|i)δ_ij    (16-34)

We shall consider the cases i > j and i = j, leaving the case i < j as an exercise for the reader. When i > j,

E{z̃(i + 1|i)z̃′(j + 1|j)} = E{[H(i + 1)x̃(i + 1|i) + v(i + 1)][H(j + 1)x̃(j + 1|j) + v(j + 1)]′}
    = E{H(i + 1)x̃(i + 1|i)[H(j + 1)x̃(j + 1|j) + v(j + 1)]′}

because E{v(i + 1)v′(j + 1)} = 0 and E{v(i + 1)x̃′(j + 1|j)} = 0. The latter is true because, for i > j, x̃(j + 1|j) does not depend on measurement z(i + 1); hence, for i > j, v(i + 1) and x̃(j + 1|j) are independent, so that E{v(i + 1)x̃′(j + 1|j)} = E{v(i + 1)}E{x̃′(j + 1|j)} = 0. We continue, as follows:

E{z̃(i + 1|i)z̃′(j + 1|j)} = H(i + 1)E{x̃(i + 1|i)[z(j + 1) − H(j + 1)x̂(j + 1|j)]′} = 0

by repeated application of the orthogonality principle (Corollary 13-3). When i = j,

P_z̃(i + 1|i) = E{[H(i + 1)x̃(i + 1|i) + v(i + 1)][H(i + 1)x̃(i + 1|i) + v(i + 1)]′}
    = H(i + 1)P(i + 1|i)H′(i + 1) + R(i + 1)

because, once again, E{v(i + 1)x̃′(i + 1|i)} = 0, and P(i + 1|i) = E{x̃(i + 1|i)x̃′(i + 1|i)}. □
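The whiteness claim of Theorem 16-2 can also be checked empirically by running a scalar Kalman filter and examining the sample statistics of its innovations sequence. The sketch below uses only the standard library; the model constants (a, h, q, r) are illustrative assumptions, not values from the text.

```python
import random

def simulate_innovations(n=20000, a=0.9, h=1.0, q=0.25, r=1.0, seed=7):
    """Run a scalar Kalman filter and collect the innovations
    z~(k+1|k) = z(k+1) - h*xhat(k+1|k); by Theorem 16-2 these should be
    zero mean and white, with variance h^2 P(k+1|k) + r."""
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0)
    xhat, p = 0.0, 1.0          # xhat(0|0) = m_x(0), P(0|0) = P_x(0)
    innovations = []
    for _ in range(n):
        # true system
        x = a * x + rng.gauss(0.0, q ** 0.5)
        z = h * x + rng.gauss(0.0, r ** 0.5)
        # prediction
        xpred = a * xhat
        ppred = a * a * p + q
        # innovation, per (16-31)
        innovations.append(z - h * xpred)
        # correction, per (17-11), (17-12), (17-14)
        k = ppred * h / (h * h * ppred + r)
        xhat = xpred + k * (z - h * xpred)
        p = (1.0 - k * h) * ppred
    return innovations

e = simulate_innovations()
mean = sum(e) / len(e)
var = sum(v * v for v in e) / len(e)
lag1 = sum(e[i] * e[i + 1] for i in range(len(e) - 1)) / (len(e) - 1)
```

The sample mean and lag-one correlation of the innovations both come out negligibly small, as the theorem predicts.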

PROBLEMS

16-1. Develop the counterpart to Theorem 16-1 for the case when input u(k) is random and independent of Z(j). What happens if u(k) is random and dependent upon Z(j)?

16-2. For the innovations process z̃(k + 1|k), prove that E{z̃(i + 1|i)z̃′(j + 1|j)} = 0 when i < j.

16-3. In the proof of part (b) of Theorem 16-2 we make repeated use of the orthogonality principle, stated in Corollary 13-3. In the latter corollary f[Z(k)] appears to be a function of all of the measurements used in x̂(k|k). In the expression E{x̃(i + 1|i)z′(j + 1)}, i > j, z′(j + 1) certainly is not a function of all the measurements used in x̂(i + 1|i). What is f[·], when we apply the orthogonality principle to E{x̃(i + 1|i)z′(j + 1)}, i > j, to conclude that this expectation is zero?

16-4. Refer to Problem 15-5. Assume that u(k) can be measured [e.g., u(k) might be the output of a random number generator], and that u(k) = g[z(1), z(2), . . . , z(k − 1)]. What is x̂(k|k − 1)?

16-5. Consider the following autoregressive model,

z(k + n) = a₁z(k + n − 1) − a₂z(k + n − 2) − ⋯ − a_n z(k) + w(k + n)

in which w(k) is white noise. Measurements z(k), z(k + 1), . . . , z(k + n − 1) are available.
(a) Compute ẑ(k + n|k + n − 1).
(b) Explain why the result in (a) is the overall mean-squared prediction of z(k + n) even if w(k + n) is non-Gaussian.

Lesson 17

State Estimation: Filtering (the Kalman Filter)

INTRODUCTION

In this lesson we shall develop the Kalman filter, which is a recursive mean-squared error filter for computing x̂(k + 1|k + 1), k = 0, 1, 2, . . . . As its name implies, this filter was developed by Kalman [circa 1959 (Kalman, 1960)]. From the Fundamental Theorem of Estimation Theory, Theorem 13-1, we know that

x̂(k + 1|k + 1) = E{x(k + 1)|Z(k + 1)}    (17-1)

Our approach to developing the Kalman filter is to partition Z(k + 1) into two sets of measurements, Z(k) and z(k + 1), and to then expand the conditional expectation in terms of data sets Z(k) and z(k + 1), i.e.,

x̂(k + 1|k + 1) = E{x(k + 1)|Z(k), z(k + 1)}    (17-2)

What complicates this expansion is the fact that Z(k) and z(k + 1) are statistically dependent. Measurement vector Z(k) depends on state vectors x(1), x(2), . . . , x(k), because z(j) = H(j)x(j) + v(j) (j = 1, 2, . . . , k). Measurement vector z(k + 1) also depends on state vector x(k), because z(k + 1) = H(k + 1)x(k + 1) + v(k + 1) and x(k + 1) = Φ(k + 1, k)x(k) + Γ(k + 1, k)w(k) + Ψ(k + 1, k)u(k). Hence Z(k) and z(k + 1) both depend on x(k) and are, therefore, dependent.


Recall that x(k + 1), Z(k), and z(k + 1) are jointly Gaussian random vectors; hence, we can use Theorem 12-4 to express (17-2) as

x̂(k + 1|k + 1) = E{x(k + 1)|Z(k), z̃}    (17-3)

where

z̃ = z(k + 1) − E{z(k + 1)|Z(k)}    (17-4)

We immediately recognize z̃ as the innovations process z̃(k + 1|k) [see (16-29)]; thus, we rewrite (17-3) as

x̂(k + 1|k + 1) = E{x(k + 1)|Z(k), z̃(k + 1|k)}    (17-5)

Applying (12-37) to (17-5), we find that

x̂(k + 1|k + 1) = E{x(k + 1)|Z(k)} + E{x(k + 1)|z̃(k + 1|k)} − m_x(k + 1)    (17-6)

We recognize the first term on the right-hand side of (17-6) as the single-stage predicted estimator of x(k + 1), x̂(k + 1|k); hence,

x̂(k + 1|k + 1) = x̂(k + 1|k) + E{x(k + 1)|z̃(k + 1|k)} − m_x(k + 1)    (17-7)

This equation is the starting point for our derivation of the Kalman filter. Before proceeding further, we observe, upon comparison of (17-2) and (17-5), that our original conditioning on z(k + 1) has been replaced by conditioning on the innovations process z̃(k + 1|k). One can show that z̃(k + 1|k) is computable from z(k + 1), and that z(k + 1) is computable from z̃(k + 1|k); hence, it is said that z(k + 1) and z̃(k + 1|k) are causally invertible (Anderson and Moore, 1979). We explain this statement more carefully at the end of this lesson.

A PRELIMINARY RESULT

In our derivation of the Kalman filter, we shall determine that

x̂(k + 1|k + 1) = x̂(k + 1|k) + K(k + 1)z̃(k + 1|k)    (17-8)

where K(k + 1) is an n × m (Kalman) gain matrix. We will calculate the optimal gain matrix in the next section. Here let us view (17-8) as the structure of an arbitrary recursive linear filter, which is written in so-called predictor-corrector format; i.e., the filtered estimate of x(k + 1) is obtained by a predictor step, x̂(k + 1|k), and a corrector step, K(k + 1)z̃(k + 1|k). The predictor step uses information from the state equation, because x̂(k + 1|k) = Φ(k + 1, k)x̂(k|k) + Ψ(k + 1, k)u(k). The corrector step uses the new measurement available at t_{k+1}. The correction is proportional to the difference between that measurement and its best predicted value, ẑ(k + 1|k). The following result provides us with the means for evaluating x̃(k + 1|k + 1) in terms of its error-covariance matrix P(k + 1|k + 1).

Preliminary Result. Filtering error-covariance matrix P(k + 1|k + 1) for the arbitrary linear recursive filter (17-8) is computed from the following equation:

P(k + 1|k + 1) = [I − K(k + 1)H(k + 1)]P(k + 1|k)[I − K(k + 1)H(k + 1)]′ + K(k + 1)R(k + 1)K′(k + 1)    (17-9)

Proof. Substitute (16-32) into (17-8) and then subtract the resulting equation from x(k + 1) in order to obtain

x̃(k + 1|k + 1) = [I − K(k + 1)H(k + 1)]x̃(k + 1|k) − K(k + 1)v(k + 1)    (17-10)

Substitute this equation into P(k + 1|k + 1) = E{x̃(k + 1|k + 1)x̃′(k + 1|k + 1)} to obtain equation (17-9). As in the proof of Theorem 16-2, we have used the fact that x̃(k + 1|k) and v(k + 1) are independent to show that E{x̃(k + 1|k)v′(k + 1)} = 0. □

The state prediction-error covariance matrix P(k + 1|k) is given by equation (16-11). Observe that (17-9) and (16-11) can be computed recursively, once gain matrix K(k + 1) is specified, as follows: P(0|0) → P(1|0) → P(1|1) → P(2|1) → P(2|2) → ⋯, etc. It is important to reiterate the fact that (17-9) is true for any gain matrix, including the optimal gain matrix given next in Theorem 17-1.

THE KALMAN FILTER

Theorem 17-1

a. The mean-squared filtered estimator of x(k + 1), x̂(k + 1|k + 1), written in predictor-corrector format, is

x̂(k + 1|k + 1) = x̂(k + 1|k) + K(k + 1)z̃(k + 1|k)    (17-11)

for k = 0, 1, . . . , where x̂(0|0) = m_x(0), and z̃(k + 1|k) is the innovations process [z̃(k + 1|k) = z(k + 1) − H(k + 1)x̂(k + 1|k)].

b. K(k + 1) is an n × m matrix (commonly referred to as the Kalman gain matrix or weighting matrix), which is specified by the set of relations

K(k + 1) = P(k + 1|k)H′(k + 1)[H(k + 1)P(k + 1|k)H′(k + 1) + R(k + 1)]⁻¹    (17-12)

and

P(k + 1|k) = Φ(k + 1, k)P(k|k)Φ′(k + 1, k) + Γ(k + 1, k)Q(k)Γ′(k + 1, k)    (17-13)

and

P(k + 1|k + 1) = [I − K(k + 1)H(k + 1)]P(k + 1|k)    (17-14)

for k = 0, 1, . . . , where I is the n × n identity matrix, and P(0|0) = P_x(0).

c. The stochastic process {x̃(k + 1|k + 1), k = 0, 1, . . . }, which is defined by the filtering error relation

x̃(k + 1|k + 1) = x(k + 1) − x̂(k + 1|k + 1)    (17-15)

k = 0, 1, . . . , is a zero-mean Gauss-Markov sequence whose covariance matrix is given by (17-14).

Proof (Mendel, 1983b, pp. 56-57)

a. We begin with the formula for x̂(k + 1|k + 1) in (17-7). Recall that x(k + 1) and z(k + 1) are jointly Gaussian. Because z(k + 1) and z̃(k + 1|k) are causally invertible, x(k + 1) and z̃(k + 1|k) are also jointly Gaussian. Additionally, E{z̃(k + 1|k)} = 0; hence,

E{x(k + 1)|z̃(k + 1|k)} = m_x(k + 1) + P_xz̃(k + 1, k + 1|k)P_z̃⁻¹(k + 1|k)z̃(k + 1|k)    (17-16)

We define gain matrix K(k + 1) as

K(k + 1) = P_xz̃(k + 1, k + 1|k)P_z̃⁻¹(k + 1|k)    (17-17)

Substituting (17-16) and (17-17) into (17-7), we obtain the Kalman filter equation (17-11). Because x̂(k + 1|k) = Φ(k + 1, k)x̂(k|k) + Ψ(k + 1, k)u(k), equation (17-11) must be initialized by x̂(0|0), which we have shown must equal m_x(0) [see Equation (16-12)].

b. In order to evaluate K(k + 1) we must evaluate P_xz̃ and P_z̃⁻¹. Matrix P_z̃ has been computed in (16-33). By definition of cross-covariance,

P_xz̃ = E{[x(k + 1) − m_x(k + 1)]z̃′(k + 1|k)} = E{x(k + 1)z̃′(k + 1|k)}    (17-18)

because z̃(k + 1|k) is zero mean. Substituting (16-32) into this expression, we find that

P_xz̃ = E{x(k + 1)x̃′(k + 1|k)}H′(k + 1)    (17-19)

because E{x(k + 1)v′(k + 1)} = 0. Finally, expressing x(k + 1) as x̂(k + 1|k) + x̃(k + 1|k) and applying the orthogonality principle (13-15), we find that

P_xz̃ = P(k + 1|k)H′(k + 1)    (17-20)

Combining equations (17-20) and (16-33) into (17-17), we obtain equation (17-12) for the Kalman gain matrix. State prediction-error covariance matrix P(k + 1|k) was derived in Lesson 16. State filtering-error covariance matrix P(k + 1|k + 1) is obtained by substituting (17-12) for K(k + 1) into (17-9), as follows (arguments suppressed for brevity):

P(k + 1|k + 1) = (I − KH)P(I − KH)′ + KRK′
    = (I − KH)P − PH′K′ + KHPH′K′ + KRK′
    = (I − KH)P − PH′K′ + K(HPH′ + R)K′
    = (I − KH)P − PH′K′ + PH′K′
    = (I − KH)P    (17-21)

c. The proof that x̃(k + 1|k + 1) is zero mean, Gaussian, and Markov is so similar to the proof of part b of Theorem 16-1 that we omit its details [see Meditch (1969, pp. 181-182)]. □
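The predictor-corrector equations of Theorem 17-1 translate into code almost line for line. Below is a minimal NumPy sketch; the two-state model at the bottom is an illustrative assumption used only to exercise the function, not a system from the text.

```python
import numpy as np

def kalman_step(xhat, P, z, u, Phi, Psi, Gamma, H, Q, R):
    """One predictor-corrector cycle of Theorem 17-1:
    predictor  xhat(k+1|k) = Phi xhat(k|k) + Psi u(k),
               P(k+1|k)    = Phi P Phi' + Gamma Q Gamma'          (17-13)
    corrector  K(k+1)      = P(k+1|k) H' [H P(k+1|k) H' + R]^-1   (17-12)
               xhat(k+1|k+1) = xhat(k+1|k) + K ztilde             (17-11)
               P(k+1|k+1)  = (I - K H) P(k+1|k)                   (17-14)"""
    xpred = Phi @ xhat + Psi @ u
    Ppred = Phi @ P @ Phi.T + Gamma @ Q @ Gamma.T
    S = H @ Ppred @ H.T + R              # innovations covariance (16-33)
    K = Ppred @ H.T @ np.linalg.inv(S)
    ztilde = z - H @ xpred               # innovations (16-31)
    xnew = xpred + K @ ztilde
    Pnew = (np.eye(len(xhat)) - K @ H) @ Ppred
    return xnew, Pnew, K

# Illustrative two-state constant-velocity model (numbers are assumptions).
Phi = np.array([[1.0, 1.0], [0.0, 1.0]])
Psi = np.zeros((2, 1))
Gamma = np.eye(2)
H = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2)
R = np.array([[1.0]])

xhat, P = np.zeros(2), 10.0 * np.eye(2)  # xhat(0|0) = m_x(0), P(0|0) = P_x(0)
for z in [1.1, 2.0, 2.9, 4.2]:
    xhat, P, K = kalman_step(xhat, P, np.array([z]), np.zeros(1),
                             Phi, Psi, Gamma, H, Q, R)
```

After a few corrector steps the diagonal of P has dropped well below its initial value, reflecting the information gained from the measurements.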

OBSERVATIONS ABOUT THE KALMAN FILTER

1. Figure 17-1 depicts the interconnection of our basic dynamical system [equations (15-17) and (15-18)] and Kalman filter system. The feedback nature of the Kalman filter is quite evident. Observe, also, that the Kalman filter contains within its structure a model of the plant. The feedback nature of the Kalman filter manifests itself in two different ways, namely in the calculation of x̂(k + 1|k + 1) and also in the calculation of the matrix of gains, K(k + 1), both of which we shall explore below.

2. The predictor-corrector form of the Kalman filter is illuminating from an information-usage viewpoint. Observe that the predictor equations, which compute x̂(k + 1|k) and P(k + 1|k), use information only from the state equation, whereas the corrector equations, which compute K(k + 1), x̂(k + 1|k + 1) and P(k + 1|k + 1), use information only from the measurement equation.

3. Once the gain matrix is computed, then (17-11) represents a time-varying recursive digital filter. This is seen more clearly when equations (16-4) and (16-31) are substituted into (17-11). The resulting equation can be rewritten as

x̂(k + 1|k + 1) = [I − K(k + 1)H(k + 1)]Φ(k + 1, k)x̂(k|k) + K(k + 1)z(k + 1) + [I − K(k + 1)H(k + 1)]Ψ(k + 1, k)u(k)    (17-22)

for k = 0, 1, . . . . This is a state equation for state vector x̂, whose time-varying plant matrix is [I − K(k + 1)H(k + 1)]Φ(k + 1, k). Equation (17-22) is time-varying even if our dynamical system in equations (15-17) and (15-18) is time-invariant and stationary, because gain matrix K(k + 1) is still time-varying in that case. It is possible, however, for K(k + 1) to reach a limiting value (i.e., a steady-state value, K̄), in which case (17-22) reduces to a recursive constant-coefficient filter. We will have more to say about this important steady-state case in Lesson 19.

Equation (17-22) is in a recursive filter form, in that it relates the filtered estimate of x(k + 1), x̂(k + 1|k + 1), to the filtered estimate of x(k), x̂(k|k). Using substitutions similar to those used in the derivation of (17-22), one can also obtain the following recursive predictor form of the Kalman filter (details left as an exercise):

x̂(k + 1|k) = Φ(k + 1, k)[I − K(k)H(k)]x̂(k|k − 1) + Φ(k + 1, k)K(k)z(k) + Ψ(k + 1, k)u(k)    (17-23)

Observe that in (17-23) the predicted estimate of x(k + 1), x̂(k + 1|k), is related to the predicted estimate of x(k), x̂(k|k − 1). Interestingly enough, the recursive predictor (17-23), and not the recursive filter (17-22), plays an important role in mean-squared smoothing, as we shall see in Lesson 21. The structures of (17-22) and (17-23) are summarized in Figure 17-2. This figure supports the claim made in Lesson 1 that our recursive estimators are nothing more than time-varying digital filters that operate on random (and also deterministic) inputs.

4. Embedded within the recursive Kalman filter equations is another set of recursive equations: (17-12), (17-13) and (17-14). Because P(0|0) initializes these calculations, these equations must be ordered as follows: P(k|k) → P(k + 1|k) → K(k + 1) → P(k + 1|k + 1) → etc. By combining these three equations it is possible to get a matrix recursive equation for P(k + 1|k) as a function of P(k|k − 1), or a similar equation for P(k + 1|k + 1) as a function of P(k|k). These equations are nonlinear and are known as matrix Riccati equations. For example, the matrix Riccati equation for P(k + 1|k) is

P(k + 1|k) = Φ(k + 1, k){P(k|k − 1) − P(k|k − 1)H′(k)[H(k)P(k|k − 1)H′(k) + R(k)]⁻¹H(k)P(k|k − 1)}Φ′(k + 1, k) + Γ(k + 1, k)Q(k)Γ′(k + 1, k)
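The claim in observation 4 — that (17-12)–(17-14) collapse into a single matrix Riccati recursion for P(k + 1|k) — can be verified directly. A sketch (NumPy; the model matrices are arbitrary illustrative assumptions):

```python
import numpy as np

Phi = np.array([[0.9, 0.1], [0.0, 0.8]])
Gamma = np.eye(2)
H = np.array([[1.0, 0.0]])
Q = 0.1 * np.eye(2)
R = np.array([[0.5]])

def sequential(Ppred):
    """Ordering P(k|k-1) -> K(k) -> P(k|k) -> P(k+1|k), via (17-12)-(17-14)."""
    K = Ppred @ H.T @ np.linalg.inv(H @ Ppred @ H.T + R)
    Pfilt = (np.eye(2) - K @ H) @ Ppred
    return Phi @ Pfilt @ Phi.T + Gamma @ Q @ Gamma.T

def riccati(Ppred):
    """The same step written as a single matrix Riccati recursion."""
    G = Ppred @ H.T @ np.linalg.inv(H @ Ppred @ H.T + R) @ H @ Ppred
    return Phi @ (Ppred - G) @ Phi.T + Gamma @ Q @ Gamma.T

P0 = 10.0 * np.eye(2)
```

Both routes produce the identical next covariance, which is exactly what "combining the three equations" means.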

S_θ^{K_ij(k+1)} denotes the sensitivity of the ijth element of gain matrix K(k + 1) with respect to parameter θ. All other sensitivities, such as S_θ^{P(k|k)}, are defined similarly. Here we present some numerical sensitivity results for the simple first-order system

x(k + 1) = a x(k) + b w(k)    (18-3)

z(k) = h x(k) + v(k)    (18-4)

where a_N = 0.7, b_N = 1.0, h_N = 0.5, q_N = 0.2, and r_N = 0.1. Figure 18-3 depicts S_θ^{K(k+1)}, S_θ^{P(k|k)}, and S_θ^{P(k+1|k)} for parameter variations of ±5%, ±10%, ±20%, and ±50% about the nominal values a_N, b_N, h_N, q_N, and r_N. Observe that the sensitivity functions vary with time and that they all reach steady-state values. Table 18-1 summarizes the steady-state sensitivity coefficients S_θ^{K(k+1)}, S_θ^{P(k|k)}, and S_θ^{P(k+1|k)}.

Some conclusions that can be drawn from these numerical results are: (1) K(k + 1), P(k|k) and P(k + 1|k) are most sensitive to changes in parameter b, and (2) S_θ^{K(k+1)} = S_θ^{P(k+1|k+1)} for θ = a, b, and q. This last observation could have been foreseen, because of our alternate equation for K(k + 1):

K(k + 1) = P(k + 1|k + 1)H′(k + 1)R⁻¹(k + 1)    (18-5)

This expression shows that if h and r are fixed then K(k + 1) varies exactly the same way as P(k + 1|k + 1).
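Steady-state sensitivities like those of Table 18-1 can be approximated by perturbing one parameter at a time and re-running the filter recursions to convergence. A finite-difference sketch (plain Python, nominal values from the text; the normalized definition (ΔK/K)/(Δθ/θ) is my reading of the sensitivity coefficient and may differ from the book's exact definition (18-2)):

```python
def steady_state_gain(a, b, h, q, r, iters=500):
    """Iterate the scalar versions of (17-12)-(17-14) to steady state and
    return the limiting Kalman gain for system (18-3)-(18-4)."""
    p = 1.0                      # p(0|0); the limit does not depend on it
    for _ in range(iters):
        ppred = a * a * p + b * b * q
        k = ppred * h / (h * h * ppred + r)
        p = (1.0 - k * h) * ppred
    return k

nominal = dict(a=0.7, b=1.0, h=0.5, q=0.2, r=0.1)
k0 = steady_state_gain(**nominal)

def sensitivity(name, rel=0.05):
    """Normalized steady-state sensitivity of the gain (% change in K per
    % change in the parameter), via a one-sided finite difference."""
    perturbed = dict(nominal)
    perturbed[name] *= 1.0 + rel
    return (steady_state_gain(**perturbed) - k0) / (k0 * rel)

s_b = sensitivity('b')   # positive: more process-noise gain raises K
s_r = sensitivity('r')   # negative: more measurement noise lowers K
```

The signs agree with Table 18-1: increasing b raises the steady-state gain, while increasing r lowers it.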

Figure 18-3  Sensitivity plots. (Five panels; horizontal axis k = 0 to 10; one curve per parameter variation of ±5%, ±10%, ±20%, and ±50% about the nominal values.)

TABLE 18-1  Steady-State Sensitivity Coefficients

(a) S_θ^{K(k+1)}

                         Percentage change in θ
θ      +50     +20     +10      +5      −5     −10     −20     −50
a     .838    .937    .972    .989   1.025   1.042   1.077   1.158
b     .506    .451    .430    .419    .396    .384    .361    .294
h    −.108   −.053   −.026   −.010    .026    .047    .096    .317
q     .417    .465    .483    .493    .515    .526    .551    .648
r    −.393   −.452   −.476   −.490   −.518   −.534   −.570   −.717

(b) S_θ^{P(k|k)}

                         Percentage change in θ
θ      +50     +20     +10      +5      −5     −10     −20     −50
a     .838    .937    .972    .989   1.025   1.042   1.077   1.158
b     .506    .451    .430    .419    .396    .384    .361    .294
h    −.739   −.877   −.932   −.962  −1.025  −1.059  −1.130  −1.367
q     .417    .465    .483    .493    .515    .526    .551    .648
r     .410    .457    .476    .486    .507    .519    .544    .642

(c) S_θ^{P(k+1|k)}

                         Percentage change in θ
θ      +50     +20     +10      +5      −5     −10     −20     −50
a    1.047    .820    .754    .723    .664    .637    .585    .453
b    2.022   1.836   1.775   1.745   1.684   1.653   1.591   1.402
h    −.213   −.253   −.268   −.277   −.295   −.305   −.325   −.393
q     .832    .846    .851    .854    .860    .864    .871    .899
r     .118    .131    .137    .140    .146    .149    .157    .185

Here is how to use the results in Table 18-1. From Equation (18-2), for example, we see that

ΔK_ij(k + 1)/K_ij(k + 1) ≈ S_θ^{K_ij(k+1)} Δθ/θ    (18-6)

or

% change in K_ij(k + 1) = (% change in θ) × S_θ^{K_ij(k+1)}    (18-7)

From Table 18-1 and this formula we see, for example, that a 20% change in b produces a (20)(.451) = 9.02% change in K, whereas a 20% change in h only produces a (20)(−.053) = −1.06% change in K, etc. □

Example 18-3

In the single-channel case, when w(k), z(k) and v(k) are scalars, then K(k + 1), P(k + 1|k) and P(k + 1|k + 1) do not depend on q and r separately. Instead, as we demonstrate next, they depend only on the ratio q/r. In this case, H = h′, Γ = γ, Q = q and R = r, and Equations (17-12), (17-13) and (17-14) can be expressed as

K(k + 1) = [P(k + 1|k)/r]h{h′[P(k + 1|k)/r]h + 1}⁻¹    (18-8)

P(k + 1|k)/r = Φ[P(k|k)/r]Φ′ + γ(q/r)γ′    (18-9)

P(k + 1|k + 1)/r = [I − K(k + 1)h′][P(k + 1|k)/r]    (18-10)

Observe that, given ratio q/r, we can compute P(k + 1|k)/r, K(k + 1), and P(k + 1|k + 1)/r. We refer to Equations (18-8), (18-9), and (18-10) as the scaled Kalman filter equations. Ratio q/r can be viewed as a filter tuning parameter. Recall that, in Lesson 15, we showed q/r is related to signal-to-noise ratio; thus, using (15-50) for example, we can also view signal-to-noise ratio as a (single-channel) Kalman filter tuning parameter.

Suppose, for example, that data quality is quite poor, so that signal-to-noise ratio, as measured by SNR, is very small. Then q/r will also be very small, because q/r = SNR/[h′(P̄_x/q)h]. In this case the Kalman filter rejects the low-quality data, i.e., the Kalman gain matrix approaches the zero matrix, because

P(k + 1|k)/r ≈ Φ[P(k|k)/r]Φ′    (18-11)

becomes vanishingly small. The Kalman filter is therefore quite sensitive to signal-to-noise ratio, as indeed are most digital filters. Its dependence on signal-to-noise ratio is complicated and nonlinear. Although signal-to-noise ratio (or q/r) enters quite simply into the equation for P(k + 1|k)/r, it is transformed in a nonlinear manner via (18-8) and (18-10). □

Example 18-4

A recursive unbiased minimum-variance estimator (BLUE) of a random parameter vector θ can be obtained from the Kalman filter equations in Theorem 17-1 by setting x(k) = θ, Φ(k + 1, k) = I, Γ(k + 1, k) = 0, Ψ(k + 1, k) = 0, and Q(k) = 0. Under these conditions we see that w(k) = 0 for all k, and

x(k + 1) = x(k)

which means, of course, that x(k) is a vector of constants, namely θ. The Kalman filter equations reduce to

x̂(k + 1|k + 1) = x̂(k|k) + K(k + 1)[z(k + 1) − H(k + 1)x̂(k|k)]    (18-12)

K(k + 1) = P(k|k)H′(k + 1)[H(k + 1)P(k|k)H′(k + 1) + R(k + 1)]⁻¹    (18-13)

P(k + 1|k + 1) = [I − K(k + 1)H(k + 1)]P(k|k)    (18-14)

Note that it is no longer necessary to distinguish between filtered and predicted quantities, because θ̂(k + 1|k) = θ̂(k|k) and P(k + 1|k) = P(k|k); hence, the notation θ̂(k|k) can be simplified to θ̂(k), for example. Equations (18-12), (18-13), and (18-14) were obtained earlier in Lesson 9 (see Theorem 9-7) for the case of scalar measurements. □

Example 18-5

This example illustrates the divergence phenomenon, which often occurs when either process noise or measurement noise or both are small. We shall see that the Kalman filter locks onto wrong values for the state, but believes them to be the true values; i.e., it "learns" the wrong state too well.
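Example 18-3's claim — that the single-channel gain depends on q and r only through their ratio — is easy to confirm numerically. A sketch (plain Python; a and h are arbitrary illustrative values, and p(0|0) is scaled with r so the comparison is fair):

```python
def scalar_gain_sequence(q, r, a=0.7, h=0.5, n=20):
    """Scalar Kalman gain sequence from (17-12)-(17-14); p(0|0)/r is held
    fixed across runs, so only the ratio q/r should matter."""
    p = 2.0 * r
    gains = []
    for _ in range(n):
        ppred = a * a * p + q
        k = ppred * h / (h * h * ppred + r)
        gains.append(k)
        p = (1.0 - k * h) * ppred
    return gains

g1 = scalar_gain_sequence(q=0.2, r=0.1)
g2 = scalar_gain_sequence(q=2.0, r=1.0)    # same ratio q/r = 2, different scales
g3 = scalar_gain_sequence(q=0.02, r=1.0)   # much smaller q/r: gain shrinks
```

Runs with equal q/r produce identical gain sequences, while a tiny q/r drives the gain toward zero, which is the data-rejection behavior described in the example.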

Our example is adapted from Jazwinski (1970, pp. 302-303). We begin with the following simple first-order system

x(k + 1) = x(k) + b    (18-15)

z(k + 1) = x(k + 1) + v(k + 1)    (18-16)

where b is a very small bias, so small that, when we design our Kalman filter, we choose to neglect it. Our Kalman filter is based on the following model

x_m(k + 1) = x_m(k)    (18-17)

z(k + 1) = x_m(k + 1) + v(k + 1)    (18-18)

Using this model we estimate x̂_m(k|k) and compute p_m(k|k), where it is straightforward to show that

x̂_m(k + 1|k + 1) = x̂_m(k|k) + K(k + 1)[z(k + 1) − x̂_m(k|k)],  where  K(k + 1) = p_m(0|0)/[(k + 1)p_m(0|0) + r]    (18-19)

Observe that, as k → ∞, K(k + 1) → 0, so that x̂_m(k + 1|k + 1) → x̂_m(k|k). The Kalman filter is rejecting the new measurements because it believes (18-17) to be the true model for x(k); but, of course, it is not the true model. The Kalman filter computes the error variance, p_m(k|k), between x̂_m(k|k) and x_m(k). The true error variance is associated with x̃(k|k), where

x̃(k|k) = x(k) − x̂_m(k|k)    (18-20)

We leave it to the reader to show that x̃(k|k) can be expressed as a sum of three terms (18-21). As k → ∞, x̃(k|k) → ∞, because the third term on the right-hand side of (18-21) diverges to infinity. This term contains the bias b that was neglected in the model used by the Kalman filter. Note also that x̃_m(k|k) = x_m(k) − x̂_m(k|k) → 0 as k → ∞; thus, the Kalman filter has locked onto the wrong state and is unaware that the true error variance is diverging.

A number of different remedies have been proposed for controlling divergence effects, including:

1. adding fictitious process noise,
2. finite-memory filtering, and
3. fading-memory filtering.

Fictitious process noise, which appears in the state equation, can be used to account for neglected modeling effects that enter into the state equation (e.g., truncation of second- and higher-order effects when a nonlinear state equation is linearized, as described in Lesson 24). This process noise introduces Q into the Kalman filter equations; observe, in our first-order example, that Q does not appear in the equations for x̂_m(k + 1|k + 1) or x̃(k|k), because state equation (18-17) contains no process noise.

Divergence is a large-sample property of the Kalman filter. Finite-memory and fading-memory filtering control divergence by not letting the Kalman filter get into its "large sample" regime. Finite-memory filtering (Jazwinski, 1970) uses a finite window of measurements (of fixed length W) to estimate x(k). As we move from t = k₁ to t = k₁ + 1, we must account for two effects, namely, the new measurement at t = k₁ + 1 and a discarded measurement at t = k₁ − W. Fading-memory filtering, due to Sorenson and Sacks (1971), exponentially ages the measurements, weighting the recent measurement most heavily and past measurements much less heavily. It is analogous to weighted least squares, as described in Lesson 3. Fading-memory filtering seems to be the most successful and popular way to control divergence effects. □

PROBLEMS

18-1. Derive the equations for x̂_m(k + 1|k + 1) and x̃(k|k), in (18-19) and (18-21), respectively.

18-2. In Lesson 5 we described cross-sectional processing for weighted least-squares estimates. Cross-sectional (also known as sequential) processing can be performed in Kalman filtering. Suppose z(k + 1) = col(z₁(k + 1), z₂(k + 1), . . . , z_q(k + 1)), where z_i(k + 1) = H_i x(k + 1) + v_i(k + 1), the v_i(k + 1) are mutually uncorrelated for i = 1, 2, . . . , q, z_i(k + 1) is m_i × 1, and m₁ + m₂ + ⋯ + m_q = m. Let x̂_i(k + 1|k + 1) be a "corrected" estimate of x(k + 1) that is associated with processing z_i(k + 1).
(a) Using the Fundamental Theorem of Estimation Theory, prove that a cross-sectional structure for the corrector equation of the Kalman filter is:

x̂₁(k + 1|k + 1) = x̂(k + 1|k) + E{x(k + 1)|z̃₁(k + 1|k)}
x̂₂(k + 1|k + 1) = x̂₁(k + 1|k + 1) + E{x(k + 1)|z̃₂(k + 1|k)}
⋮
x̂_q(k + 1|k + 1) = x̂_{q−1}(k + 1|k + 1) + E{x(k + 1)|z̃_q(k + 1|k)} = x̂(k + 1|k + 1)

(b) Provide equations for computing E{x(k + 1)|z̃_i(k + 1|k)}.

18-3. (Project) Choose a second-order system and perform a thorough sensitivity study of its associated Kalman filter. Do this for various nominal values and for both small and large variations of the system's parameters. You will need a computer for this project. Present the results both graphically and tabularly, as in Example 18-2. Draw as many conclusions as possible.

Lesson 19

State Estimation: Steady-State Kalman Filter and Its Relationship to a Digital Wiener Filter

INTRODUCTION

In this lesson we study the steady-state Kalman filter from different points of view, and we then show how it is related to a digital Wiener filter.

STEADY-STATE KALMAN FILTER

For time-invariant and stationary systems, if lim_{k→∞} P(k + 1|k) = P̄ exists, then lim_{k→∞} K(k) = K̄, and the Kalman filter (17-11) becomes a constant-coefficient filter. In that case:

a. The limiting covariance P̄ satisfies

P̄ = Φ[P̄ − P̄H′(HP̄H′ + R)⁻¹HP̄]Φ′ + ΓQΓ′    (19-1)

Equation (19-1) is often referred to either as a steady-state or algebraic Riccati equation.

b. The eigenvalues of the steady-state Kalman filter, λ_i[Φ − K̄HΦ], all lie within the unit circle, so that the filter is asymptotically stable; i.e.,

|λ_i[Φ − K̄HΦ]| < 1    (19-2)

If our dynamical model in Equations (15-17) and (15-18) is time-invariant and stationary, but is not necessarily asymptotically stable, then points (a) and (b) still hold as long as the system is completely stabilizable and detectable. □

A proof of this theorem is beyond the scope of this textbook. It can be found in Anderson and Moore, pp. 78-82 (1979). For definitions of the system-theoretic terms stabilizable and detectable, the reader should consult a textbook on linear systems, such as Kailath (1980) or Chen (1970). By completely detectable and completely stabilizable, we mean that (Φ, H) is completely detectable and (Φ, ΓQ₁) is completely stabilizable, where Q = Q₁Q₁′. Additionally, any asymptotically stable model is always completely stabilizable and detectable. Probably the most interesting case of a system that is not asymptotically stable, for which we want to design a steady-state Kalman filter, is one that has a pole on the unit circle.

Example 19-1

In this example (which is similar to Example 5.4 in Meditch, 1969, pp. 189-190) we consider the scalar system
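The stability property (19-2) can be checked numerically: iterate the Riccati recursion to P̄, form K̄, and inspect the eigenvalues of Φ − K̄HΦ. A NumPy sketch (the model, which deliberately gives Φ an eigenvalue on the unit circle, is an illustrative assumption):

```python
import numpy as np

# Open-loop system is not asymptotically stable (eigenvalue of Phi at 1.0),
# but the pair is detectable/stabilizable, so a stable steady-state filter exists.
Phi = np.array([[1.0, 1.0], [0.0, 0.9]])
Gamma = np.eye(2)
H = np.array([[1.0, 0.0]])
Q = 0.1 * np.eye(2)
R = np.array([[1.0]])

P = np.eye(2)
for _ in range(2000):
    # predictor Riccati recursion built from (17-12)-(17-14)
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
    P = Phi @ (P - K @ H @ P) @ Phi.T + Gamma @ Q @ Gamma.T

Kbar = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)   # steady-state gain
closed_loop = Phi - Kbar @ H @ Phi                # the matrix in (19-2)
eigmag = np.abs(np.linalg.eigvals(closed_loop))
```

Even though Φ has a pole on the unit circle, every eigenvalue of the steady-state filter matrix lies strictly inside it.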

(21-38)

j-l

(21-39) A(i) =k and j = k + 1, k + 2,. . . . Additionally, one can show that the fixed-point smoothing error-covariance matrix, P(kl j), is computed from B(j)

=

II

j

Wlj) = Wlj - 1)+ BbW(A.0- Wlj - WW)

9

ii(klk - 1) = ii(kliV) - P(klk - l)r(kIN)

= r2(klj

where

because r(kIN) is simply a linear combination of all the observations z(l), z(N). From (21-24)we find that

z(2),

199

Smoothing

(21-40)

where j = k + 1, k + 2,. . . . Equation (21-38) is impractical from a computational viewpoint, becauseof the many multiplications of n x n matrices required first to form the A(i) matrices and then to form the B(j) matrices. Additionally, the inverse of matrix P(i + lli) is needed in order to compute matrix A(i). The following results present a “fast” algorithm for computing r2(kl j). It is fast in the sense that no multiplications of yt x y1matrices are needed to implement it. Theorem 21-3.

A most useful mean-squared fixed-point smoothed estimator of x(k), x̂(k|k+l), where l = 1, 2, ..., is given by the expression

x̂(k|k+l) = x̂(k|k+l-1) + N_s(k|k+l)[z(k+l) - H(k+l)x̂(k+l|k+l-1)]   (21-41)

where

N_s(k|k+l) = Σ_s(k,l)H′(k+l)[H(k+l)P(k+l|k+l-1)H′(k+l) + R(k+l)]⁻¹   (21-42)

and

Σ_s(k,l) = Σ_s(k,l-1)[I - K(k+l-1)H(k+l-1)]′Φ′(k+l, k+l-1)   (21-43)

Equations (21-41) and (21-43) are initialized by x̂(k|k) and Σ_s(k,1) = P(k|k)Φ′(k+1, k), respectively. Additionally, the fixed-point smoothing error-covariance matrix is computed from

P(k|k+l) = P(k|k+l-1) - N_s(k|k+l)[H(k+l)P(k+l|k+l-1)H′(k+l) + R(k+l)]N_s′(k|k+l)   (21-44)

which is initialized by P(k|k). ∎

State Estimation: Smoothing (General Results) — Lesson 21

We leave the proof of this useful theorem, which is similar to a result given by Fraser (1967), as an exercise for the reader.

Example 21-2
Here we consider the problem of fixed-point smoothing to obtain a refined estimate of the initial condition for the system described in Example 21-1. Recall that p(0|0) = 50 and that by fixed-interval smoothing we had obtained the result p(0|4) = 16.31, which is a significant reduction in the uncertainty associated with the initial condition. Using Equation (21-40) or (21-44) we compute p(0|1), p(0|2), and p(0|3) to be 16.69, 16.32, and 16.31, respectively. Observe that a major reduction in the smoothing error variance occurs as soon as the first measurement is incorporated, and that the improvement in accuracy thereafter is relatively modest. This seems to be a general trait of fixed-point smoothing. ∎
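The recursions of Theorem 21-3 are easy to program. The sketch below applies them to an illustrative scalar system (the values of phi, q, r and the horizon are assumptions, not Example 21-1's data), and cross-checks the resulting error variance against the product form (21-39)-(21-40).

```python
import numpy as np

# Fast fixed-point smoother (Theorem 21-3) for x(k+1) = phi x(k) + w(k),
# z(k) = x(k) + v(k); illustrative parameters only.
rng = np.random.default_rng(3)
phi, q, r = 0.9, 0.25, 1.0
N, k0 = 40, 5                          # smooth the state at time k0
x, zs = 0.0, []
for _ in range(N + 1):
    zs.append(x + rng.normal(scale=np.sqrt(r)))
    x = phi * x + rng.normal(scale=np.sqrt(q))

xp, Pp = 0.0, 10.0                     # x-hat(0|-1), P(0|-1)
Pps, Pfs = [], []                      # store variances for the cross-check
x_fp = P_fp = P_filt0 = Sig = None
for k in range(N + 1):
    K = Pp / (Pp + r)                  # Kalman gain
    xf = xp + K * (zs[k] - xp)
    Pf = (1 - K) * Pp
    Pps.append(Pp); Pfs.append(Pf)
    if k == k0:                        # initialize the fixed-point recursions
        x_fp, P_fp, P_filt0 = xf, Pf, Pf
        Sig = Pf * phi                 # Sigma_s(k,1) = P(k|k) Phi'
    elif k > k0:
        Ns = Sig / (Pp + r)            # eq. (21-42) with H = 1
        x_fp += Ns * (zs[k] - xp)      # eq. (21-41)
        P_fp -= Ns * (Pp + r) * Ns     # eq. (21-44)
        Sig *= (1 - K) * phi           # eq. (21-43)
    xp, Pp = phi * xf, phi * phi * Pf + q

# cross-check against the product form: B(j) = prod of A(i)
P_chk, B = Pfs[k0], 1.0
for j in range(k0 + 1, N + 1):
    B *= Pfs[j - 1] * phi / Pps[j]     # A(i) = P(i|i) phi / P(i+1|i), eq. (21-39)
    P_chk += B * (Pfs[j] - Pps[j]) * B # eq. (21-40)
print(P_fp, P_chk)
```

Both recursions give the same smoothing variance, but the fast form never multiplies n × n matrices.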

Another way to derive fixed-point smoothing formulas is by the following state augmentation procedure (Anderson and Moore, 1979). We assume that, for k ≥ j,

x₀(k) ≜ x(j)   (21-45)

The state equation for state vector x₀(k) is

x₀(k+1) = x₀(k)   (21-46)

It is initialized at k = j by (21-45). Augmenting (21-46) to our basic state-variable model in (15-17) and (15-18), we obtain the following augmented basic state-variable model:

x_a(k+1) = ( Φ(k+1,k)  0 ) x_a(k) + ( Ψ(k+1,k) ) u(k) + ( Γ(k+1,k) ) w(k)
           ( 0         I )          ( 0        )        ( 0        )        (21-47)

z(k+1) = (H(k+1)  0) x_a(k+1) + v(k+1)   (21-48)

where x_a(k) = col(x(k), x₀(k)). The following two-step procedure can be used to obtain an algorithm for x̂(j|k):

1. Write down the Kalman filter equations for the augmented basic state-variable model. Anderson and Moore (1979) give these equations for the recursive predictor; i.e., they find

E{col(x(k+1), x₀(k+1))|Z(k)} = col(x̂(k+1|k), x̂₀(k+1|k)) = col(x̂(k+1|k), x̂(j|k))   (21-49)

where k ≥ j. The last equality in (21-49) makes use of (21-46) and (21-45). Observe that x̂(j|k), the fixed-point smoother of x(j), has been found as the second component of the recursive predictor for the augmented model.

2. The Kalman filter (or recursive predictor) equations are partitioned in order to obtain the explicit structure of the algorithm for x̂(j|k).

We leave the details of this two-step procedure as an exercise for the reader.

FIXED-LAG SMOOTHING

The earliest attempts to obtain a fixed-lag smoother x̂(k|k+L) led to an algorithm (e.g., Meditch, 1969) which was later shown to be unstable (Kelly and Anderson, 1971). The following state augmentation procedure leads to a stable fixed-lag smoother for x̂(k−L|k). We introduce L+1 state vectors, as follows: x₁(k+1) = x(k), x₂(k+1) = x(k−1), x₃(k+1) = x(k−2), ..., x_{L+1}(k+1) = x(k−L) [i.e., xᵢ(k+1) = x(k+1−i), i = 1, 2, ..., L+1]. The state equations for these L+1 state vectors are

x₁(k+1) = x(k)
x₂(k+1) = x₁(k)
x₃(k+1) = x₂(k)
⋮
x_{L+1}(k+1) = x_L(k)   (21-50)

Augmenting (21-50) to our basic state-variable model in (15-17) and (15-18), we obtain yet another augmented basic state-variable model, in which the augmented state equation is driven by Ψ(k+1,k)u(k) and Γ(k+1,k)w(k), and

z(k+1) = (H(k+1)  0 ⋯ 0) col(x(k+1), x₁(k+1), ..., x_{L+1}(k+1)) + v(k+1)   (21-51)

The following two-step procedure can be used to obtain an algorithm for x̂(k−L|k):

1. Write down the Kalman filter equations for the augmented basic state-variable model. Anderson and Moore (1979) give these equations for the recursive predictor; i.e., they find

E{col(x(k+1), x₁(k+1), ..., x_{L+1}(k+1))|Z(k)} = col(x̂(k+1|k), x̂₁(k+1|k), ..., x̂_{L+1}(k+1|k)) = col(x̂(k+1|k), x̂(k|k), x̂(k−1|k), ..., x̂(k−L|k))   (21-52)

The last equality in (21-52) makes use of the fact that xᵢ(k+1) = x(k+1−i), i = 1, 2, ..., L+1.

2. The Kalman filter (or recursive predictor) equations are partitioned in order to obtain the explicit structure of the algorithm for x̂(k−L|k).

The detailed derivation of the algorithm for x̂(k−L|k) is left as an exercise for the reader (it can be found in Anderson and Moore, 1979, pp. 177-181). Some aspects of this fixed-lag smoother are:

1. It is numerically stable, because its stability is determined by the stability of the recursive predictor (i.e., no new feedback loops are introduced into the predictor as a result of the augmentation procedure);
2. In order to compute x̂(k−L|k), we must also compute the L−1 fixed-lag estimates x̂(k−1|k), x̂(k−2|k), ..., x̂(k−L+1|k); this may be costly to do from a computational point of view; and
3. Computation can be reduced by careful coding of the partitioned recursive predictor equations.

PROBLEMS

21-1. Derive the formula for x̂(k|N) in (21-5) using mathematical induction. Then

derive x̂(k|N) in (21-6).

21-2. Prove that {x̃(k|N), k = N, N−1, ..., 0} is a zero-mean, second-order Gauss-Markov process.

21-3. Derive the formula for the fixed-point smoothing error-covariance matrix, P(k|j), given in (21-40).

21-4. Prove Theorem 21-3, which gives formulas for a most useful mean-squared fixed-point smoother of x(k), x̂(k|k+l), l = 1, 2, ....

21-5. Using the two-step procedure described at the end of the section entitled Fixed-Point Smoothing, derive the resulting fixed-point smoother equations.

21-6. Using the two-step procedure described at the end of the section entitled Fixed-Lag Smoothing, derive the resulting fixed-lag smoother equations. Show, by means of a block diagram, that this smoother is stable.

21-7. (Meditch, 1969, Exercise 6.13, pg. 245). Consider the scalar system x(k+1) = 2⁻ᵏx(k) + w(k), z(k+1) = x(k+1), k = 0, 1, ..., where x(0) has mean zero and variance σ₀², and w(k), k = 0, 1, ..., is a zero-mean Gaussian white sequence which is independent of x(0) and has a variance equal to 4.
(a) Assuming that optimal fixed-point smoothing is to be employed to determine x̂(0|j), j = 1, 2, ..., what is the equation for the appropriate smoothing filter?
(b) What is the limiting value of p(0|j) as j → ∞?
(c) How does this value compare with p(0|0)?

Lesson 22
State Estimation: Smoothing Applications

INTRODUCTION

In this lesson we present some applications that illustrate interesting numerical and theoretical aspects of fixed-interval smoothing. These applications are taken from the field of digital signal processing.

MINIMUM-VARIANCE DECONVOLUTION (MVD)

Here, as in Examples 2-6 and 14-1, we begin with the convolutional model

z(k) = Σ_{i=1}^{k} μ(i)h(k − i) + v(k),  k = 1, 2, ..., N   (22-1)

Recall that deconvolution is the signal-processing procedure for removing the effects of h(j) and v(j) from the measurements so that one is left with an estimate of μ(j). Here we shall obtain a useful algorithm for a mean-squared fixed-interval estimator of μ(j). To begin, we must convert (22-1) into an equivalent state-variable model.

Theorem 22-1 (Mendel, 1983a, pp. 13-14). The single-channel state-variable model

x(k+1) = Φx(k) + γμ(k)   (22-2)

z(k) = h′x(k) + v(k)   (22-3)

is equivalent to the convolutional-sum model in (22-1) when x(0) = 0, μ(0) = 0, h(0) = 0, and

h(l) = h′Φ^{l−1}γ,  l = 1, 2, ...   (22-4)

Proof. Iterate (22-2) and substitute the results into (22-3). Compare the resulting equation with (22-1) to see that, under the conditions x(0) = 0, μ(0) = 0 and h(0) = 0, they are the same. ∎

The condition x(0) = 0 merely initializes our state-variable model. The condition μ(0) = 0 means there is no input at time zero. The coefficients in (22-4) represent sampled values of the impulse response. If we are given impulse response data {h(1), h(2), ..., h(L)} then we can determine matrices Φ, γ, and h, as well as system order n, by applying an approximate realization procedure, such as Kung's (1978), to {h(1), h(2), ..., h(L)}. Additionally, if h(0) ≠ 0 it is simple to modify Theorem 22-1.

In Example 14-1 we obtained a rather unwieldy formula for μ̂_MS(N). Note that, in terms of our conditioning notation, the elements of μ̂_MS(N) are μ̂_MS(k|N), k = 1, 2, ..., N. We now obtain a very useful algorithm for μ̂_MS(k|N). For notational convenience, we shorten μ̂_MS to μ̂.

Theorem 22-2 (Mendel, 1983a, pp. 68-70)

a. A two-pass fixed-interval smoother for μ̂(k|N) is

μ̂(k|N) = q(k)γ′r(k+1|N)   (22-5)

where k = N−1, N−2, ..., 1.

b. The smoothing error variance, σ²_μ̃(k|N), is

σ²_μ̃(k|N) = q(k) − q(k)γ′S(k+1|N)γq(k)   (22-6)

where k = N−1, N−2, ..., 1. In these formulas r(k|N) and S(k|N) are computed using (21-25) and (21-27), respectively, and E{μ²(k)} = q(k) [here q(k) denotes the variance of μ(k), and should not be confused with the event sequence q, which appears in the product model for μ(k)].

Proof
a. To begin, we apply the fundamental theorem of estimation theory, Theorem 13-1, to (22-2). We operate on both sides of that equation with E{·|Z(N)}, to show that

γμ̂(k|N) = x̂(k+1|N) − Φx̂(k|N)   (22-7)

By performing appropriate manipulations on this equation we can derive (22-5) as follows. Substitute x̂(k|N) and x̂(k+1|N) from Equation (21-24) into Equation (22-7), to see that

γμ̂(k|N) = x̂(k+1|k) + P(k+1|k)r(k+1|N) − Φ[x̂(k|k−1) + P(k|k−1)r(k|N)]   (22-8)

Applying (17-11) and (16-4) to the state-variable model in (22-2) and (22-3), it is straightforward to show that

x̂(k+1|k) = Φx̂(k|k−1) + ΦK(k)z̃(k|k−1)   (22-9)

hence, (22-8) reduces to

γμ̂(k|N) = ΦK(k)z̃(k|k−1) + P(k+1|k)r(k+1|N) − ΦP(k|k−1)r(k|N)   (22-10)

Next, substitute (21-25) into (22-10), to show that

γμ̂(k|N) = ΦK(k)z̃(k|k−1) + P(k+1|k)r(k+1|N) − ΦP(k|k−1)Φ_p′(k+1, k)r(k+1|N) − ΦP(k|k−1)h[h′P(k|k−1)h + r]⁻¹z̃(k|k−1)   (22-11)

Because K(k) = P(k|k−1)h[h′P(k|k−1)h + r]⁻¹, the first and last terms of (22-11) cancel, so that

γμ̂(k|N) = P(k+1|k)r(k+1|N) − ΦP(k|k−1)Φ_p′(k+1, k)r(k+1|N)   (22-12)

Combine Equations (17-13) and (17-14) to see that

P(k+1|k) = ΦP(k|k−1)Φ_p′(k+1, k) + γq(k)γ′   (22-13)

Finally, substitute (22-13) into Equation (22-12) to observe that

γμ̂(k|N) = γq(k)γ′r(k+1|N)   (22-14)

which has the unique solution given by

μ̂(k|N) = q(k)γ′r(k+1|N)

which is Equation (22-5).

b. To derive Equation (22-6) we use (22-5) and the definition of the estimation error μ̃(k|N),

μ̃(k|N) = μ(k) − μ̂(k|N)   (22-15)

to form

μ(k) = μ̃(k|N) + q(k)γ′r(k+1|N)   (22-16)

Example 22-1
In this example we compute μ̂(k|N), first for a broadband channel IR, h₁(k), and then for a narrower-band channel IR, h₂(k). The transfer functions of these channel models are

H₁(z) = (−0.76286z³ + 1.5884z² − 0.82356z + 0.000222419) / (z⁴ − 2.2633z³ + 1.77734z² − 0.49803z + 0.045546)   (22-19)

When P(k+1|k) → 1 and R(ω) → 1, once again R(k) → δ(k). Broadband IRs often satisfy this condition. In general, however, μ̂(k|N) is a smeared-out version of μ; the nature of the smearing is quite dependent on the bandwidth of h(k) and SNR.

Example 22-2
This example is a continuation of Example 22-1. Figure 22-5 depicts R(k) for both the broadband and narrower-band IRs, h₁(k) and h₂(k), respectively. As predicted by (22-31), R₁(k) is much spikier than R₂(k), which explains why the MVD results for the broadband IR are quite sharp, whereas the MVD results for the narrower-band IR are smeared out. Note, also, the difference in peak amplitudes for R₁(k) and R₂(k). This explains why μ̂(k|N) underestimates the true values of μ(k) by such large amounts in the narrower-band case (see Figs. 22-4a and b). ∎
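A minimal implementation of the two-pass smoother of Theorem 22-2 can be sketched as follows. The backward recursion for r(k|N) is written here in the Bryson–Frazier form consistent with the derivation above, and the first-order channel and all numerical values are illustrative assumptions, not the book's examples.

```python
import numpy as np

# MVD via mu_hat(k|N) = q gamma' r(k+1|N), eq. (22-5), for
# x(k+1) = phi x(k) + gam mu(k), z(k) = h x(k) + v(k).
rng = np.random.default_rng(7)
phi, gam, h, q, rv = 0.8, 1.0, 1.0, 1.0, 0.1
N = 200
mu = rng.normal(scale=np.sqrt(q), size=N + 1)   # white input, variance q
x = 0.0
z = np.zeros(N + 1)
for k in range(N + 1):
    z[k] = h * x + rng.normal(scale=np.sqrt(rv))
    x = phi * x + gam * mu[k]

# forward Kalman filter: store innovations, predicted variances, gains
xp, Pp = 0.0, 1.0
innov, Pps, Ks = np.zeros(N + 1), np.zeros(N + 1), np.zeros(N + 1)
for k in range(N + 1):
    Pps[k] = Pp
    innov[k] = z[k] - h * xp
    K = Pp * h / (h * Pp * h + rv)
    Ks[k] = K
    xf = xp + K * innov[k]
    Pf = (1 - K * h) * Pp
    xp = phi * xf
    Pp = phi * Pf * phi + gam * q * gam   # mu plays the role of process noise

# backward pass: r(k|N) = Phi_p' r(k+1|N) + h(h P h + rv)^-1 ztilde(k|k-1)
r = 0.0
mu_hat = np.zeros(N + 1)
for k in range(N, -1, -1):
    r = phi * (1 - Ks[k] * h) * r + h * innov[k] / (h * Pps[k] * h + rv)
    if k >= 1:
        mu_hat[k - 1] = q * gam * r       # mu_hat(k-1|N) = q gamma' r(k|N)

print(np.corrcoef(mu_hat[:N], mu[:N])[0, 1])
```

The smoothed input tracks the true input well when the channel is broadband and the SNR is high, in line with the discussion of Example 22-2.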

RELATIONSHIP BETWEEN STEADY-STATE MVD FILTER AND AN INFINITE IMPULSE RESPONSE DIGITAL WIENER DECONVOLUTION FILTER

Theorem 22-3. The signal component of the steady-state MVD output, μ̂_S(k|N), can be expressed as

μ̂_S(k|N) = R(k)*μ(k)   (22-29)

where R(k) is an auto-correlation function. Part (b) of this theorem implies that μ̂_S(k|N) is a zero-phase waveshaped version of μ(k) (see Problem 22-2).

We have seen that an MVD filter is a cascade of a causal Kalman innovations filter and an anticausal μ-filter; hence, it is a noncausal filter. Its impulse response extends from k = −∞ to k = +∞, and the IR of the steady-state MVD filter is given in the time domain by h_MV(k) in (22-23), or in the frequency domain by H_MV(ω) in (22-26). There is a more direct way to design an IIR minimum mean-squared error deconvolution filter, i.e., an IIR digital Wiener deconvolution filter, as we describe next.

We return to the situation depicted in Figure 19-4, but now we assume that: filter F(z) is an IIR filter, with coefficients {f(j), j = 0, ±1, ±2, ...};

d(k) = μ(k)   (22-32)

where μ(k) is a white noise sequence; μ(k), v(k), and n(k) are stationary; and μ(k) and v(k) are uncorrelated. In this case, (19-39) becomes

Σ_{i=−∞}^{∞} f(i)φ_z(i − j) = φ_{μz}(j),  j = 0, ±1, ±2, ...   (22-33)

Using (22-1), the whiteness of μ(k), and the assumptions that μ(k) and v(k) are uncorrelated and stationary, it is straightforward to show that

φ_{μz}(j) = qh(−j)   (22-34)

Substituting (22-34) into (22-33), we have

Σ_{i=−∞}^{∞} f(i)φ_z(i − j) = qh(−j),  j = 0, ±1, ±2, ...   (22-35)

Taking the discrete-time Fourier transform of (22-35), we see that

F(ω)Φ_z(ω) = qH*(ω)   (22-36)

but, from (22-1), we also know that

Φ_z(ω) = q|H(ω)|² + r   (22-37)

Substituting (22-37) into (22-36), we determine F(ω) as

F(ω) = qH*(ω) / [q|H(ω)|² + r]   (22-38)

This IIR digital Wiener deconvolution filter (i.e., two-sided least-squares inverse filter) was, to the best of our knowledge, first derived by Berkhout (1977).

Theorem 22-4 (Chi and Mendel, 1984). The steady-state MVD filter, whose IR is given by h_MV(k), is exactly the same as Berkhout's IIR digital Wiener deconvolution filter. ∎

The steady-state MVD filter is a recursive implementation of Berkhout's infinite-length filter. Of course, the MVD filter is also applicable to time-varying and nonstationary systems, whereas his filter is not.
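Formula (22-38) is easy to evaluate on a frequency grid. The sketch below does so for a hypothetical first-order channel (the channel and the values of q and r are assumptions) and checks the zero-phase property discussed in connection with Theorem 22-3: F(ω)H(ω) is real and nonnegative.

```python
import numpy as np

# Berkhout's IIR Wiener deconvolution filter, F(w) = q H*(w)/(q|H(w)|^2 + r),
# evaluated for an illustrative channel H(z) = 1/(z - 0.5).
q, r = 1.0, 0.1
w = np.linspace(-np.pi, np.pi, 512, endpoint=False)
zz = np.exp(1j * w)
H = 1.0 / (zz - 0.5)                    # hypothetical channel frequency response
F = q * np.conj(H) / (q * np.abs(H)**2 + r)

# zero-phase check: the overall response F(w)H(w) is real, in [0, 1)
FH = F * H
print(np.max(np.abs(FH.imag)), FH.real.min(), FH.real.max())
```

The overall response q|H|²/(q|H|² + r) approaches 1 exactly where the channel is strong relative to the noise, which is the frequency-domain picture of the smearing discussed in Example 22-2.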

Figure 22-5 R(k) for (a) broadband channel (SNR = 100) and (b) narrower-band channel (SNR = 100) (Chi and Mendel, 1984, © 1984, IEEE).

MAXIMUM-LIKELIHOOD DECONVOLUTION

In Example 14-2 we began with the deconvolution linear model Z(N) = X(N−1)μ + v(N), used the product model for μ (i.e., μ = Q_q r), and showed that a separation principle exists for the determination of r̂_MAP and q̂_MAP. We showed that first one must determine q̂_MAP, after which r̂_MAP can be computed using (14-57). We repeat (14-57) here for convenience,

r̂_MAP(q̂) = Q_q̂ X′(N−1)[X(N−1)Q_q̂X′(N−1) + rI]⁻¹Z(N)   (22-39)

where q̂ is short for q̂_MAP. In terms of our conditioning notation used in state estimation, the elements of r̂_MAP(N; q̂) are r̂_MAP(k|N; q̂), k = 1, 2, ..., N. Equation (22-39) is terribly unwieldy because of the N × N matrix,

X(N−1)Q_q̂X′(N−1) + rI, that must be inverted. The following theorem provides a more practical way to compute r̂_MAP(k|N; q̂).

Theorem 22-5 (Mendel, 1983b). Unconditional maximum-likelihood (i.e., MAP) estimates of r can be obtained by applying MVD formulas to the state-variable model

x(k+1) = Φx(k) + γq̂_MAP(k)r(k)   (22-40)

z(k) = h′x(k) + v(k)   (22-41)

where q̂_MAP(k) is a MAP estimate of q(k).

Proof. Example 14-2 showed that a MAP estimate of q can be obtained prior to finding a MAP estimate of r. By using the product model for μ(k), and q̂_MAP, our state-variable model in (22-2) and (22-3) can be expressed as in (22-40) and (22-41). Applying (14-41) to this system, we see that

r̂_MAP(k|N) = r̂_MS(k|N)   (22-42)

but, by comparing (22-40) and (22-2), and (22-41) and (22-3), we see that r̂_MS(k|N) can be found from the MVD algorithm in Theorem 22-2, in which we replace μ(k) by r(k) and set q(k) equal to the variance of q̂_MAP(k)r(k). ∎

RECURSIVE WAVESHAPING

In Lesson 19 we described the design of an FIR waveshaping filter (e.g., see Figure 19-4). In this section we shall develop a recursive waveshaping filter in the framework of state-variable models and mean-squared estimation theory. Other approaches to the design of recursive waveshaping filters have been given by Shanks (1967) and Aguilara et al. (1970).

We direct our attention to the situation depicted in Figure 22-6. To begin, we must obtain the following state-variable models for h(k) and d(k) [i.e., we must use an approximate realization procedure (Kung, 1978) or any other viable technique to map {h(i), i = 0, 1, ...} into {Φ₁, γ₁, h₁}, and {d(i), i = 0, 1, ...} into {Φ₂, γ₂, h₂}]:

x₁(k+1) = Φ₁x₁(k) + γ₁δ(k)   (22-43)

h(k) = h₁′x₁(k)   (22-44)

x₂(k+1) = Φ₂x₂(k) + γ₂δ(k)   (22-45)

d(k) = h₂′x₂(k)   (22-46)

State vectors x₁ and x₂ are n₁ × 1 and n₂ × 1, respectively. Signal δ(k) is the unit spike. In the stochastic situation depicted in Figure 22-6, where h(k) is excited by the white sequence w(k) and noise v(k) corrupts s(k), the best we can possibly hope to achieve by waveshaping is to make z(k) = w(k)*h(k) + v(k) look like w(k)*d(k) (Figure 22-7). This is because both h(k) and d(k) must be excited by the same random input, w(k), for the waveshaping problem in this situation to be well posed. The state-variable model for this situation is

S₁:  x₁(k+1) = Φ₁x₁(k) + γ₁w(k)   (22-47)
     z(k) = h₁′x₁(k) + v(k)   (22-48)

Figure 22-6 Waveshaping problem studied in this section. Information about z(k), not necessarily z itself, is used to drive the (time-varying) recursive waveshaping filter (Mendel, 1983a, © 1983, IEEE).

Figure 22-7 State-variable formulation of recursive waveshaping problem (Mendel, 1983a, © 1983, IEEE).

S₂:  x₂(k+1) = Φ₂x₂(k) + γ₂w(k)   (22-49)
     d₁(k) = h₂′x₂(k)   (22-50)

Observe that both S₁ and S₂ are excited by the same input, w(k). Additionally, w(k) and v(k) are zero-mean mutually uncorrelated white noise sequences, for which

E{w²(k)} = q   (22-51)

E{v²(k)} = r   (22-52)

We now proceed to formulate the recursive waveshaping filter design problem in the context of mean-squared estimation theory. Our filter design problem is: given the measurements z(1), z(2), ..., z(j), determine an estimator d̂₁(k|j) such that the mean-squared error

J[d̃₁(k|j)] = E{[d̃₁(k|j)]²}   (22-53)

is minimized. The solution to this problem is given next.

Theorem 22-6 (Structure of Minimum-Variance Waveshaper, Dai and Mendel, 1986). The minimum-variance waveshaping filter consists of two components: 1. stochastic inversion, and 2. waveshaping.

Proof. According to the Fundamental Theorem of Estimation Theory, Theorem 13-1, the unbiased, minimum-variance estimator of d₁(k) based on the measurements {z(1), z(2), ..., z(j)} is

d̂₁(k|j) = E{d₁(k)|Z(j)}   (22-54)

where Z(j) = col(z(1), z(2), ..., z(j)). Observe, from Figure 22-7, that

d₁(k) = w(k)*d(k)   (22-55)

hence,

d̂₁(k|j) = E{w(k)|Z(j)}*d(k) = ŵ(k|j)*d(k)   (22-56)

Equation (22-56) tells us that there are two steps to obtain d̂₁(k|j): 1. first obtain ŵ(k|j), and 2. then convolve the desired signal with ŵ(k|j). Step 1 removes the effects of the original wavelet and noise from the measurements. It is the problem of stochastic inversion and can be performed by means of minimum-variance deconvolution. ∎

In this book we have only discussed fixed-interval MVD, from which we obtain ŵ(k|N). Fixed-point algorithms are also available (e.g., Mendel, 1983a).

Theorem 22-7 (Recursive waveshaping, Mendel, 1983a, pg. 600). Let ŵ(k|N) denote the fixed-interval estimate of w(k), which can be obtained from S₁ via minimum-variance deconvolution (Theorem 22-2). Then d̂₁(k|N) is obtained from the waveshaping filter

d̂₁(k|N) = h₂′x̂₂(k|N)   (22-57)

x̂₂(k+1|N) = Φ₂x̂₂(k|N) + γ₂ŵ(k|N)   (22-58)

where k = 0, 1, ..., N−1, and x̂₂(0|N) = 0. ∎

We leave the proof of this theorem as an exercise for the reader. Some observations about Theorems 22-6 and 22-7 are in order. First, MVD is analogous to solving a stochastic inverse problem; hence, an MVD filter can be thought of as an optimal inverse filter. Second, if w(k) were deterministic and v(k) = 0, then our intuition tells us that the recursive waveshaping filter should consist of the following two distinct components: an inverse filter, to remove the effects of H(z), followed by a waveshaping filter whose transfer function is D(z). The transfer function of the recursive waveshaping filter would then be [1/H(z)]D(z). Finally, the results in these theorems support our intuition even in the stochastic case, for d̂₁(k|N), for example, is also obtained in two steps. As shown in Figure 22-8, first ŵ(k|N) is obtained via MVD; then this signal is reshaped to give d̂₁(k|N). Note, also, that in order to compute d̂₁(k|N) it is not really necessary to use the state-variable model for d̂₁(k|N). Observe, from (22-56), that

d̂₁(k|N) = d(k)*ŵ(k|N)   (22-59)

Example 22-3
In this example we describe a simulation study for the Bernoulli-Gaussian input sequence [i.e., w(k)] depicted in Figure 22-9. When this sequence is convolved with the fourth-order IR depicted in Figure 22-1a and noise is added to the result, we obtain measurements z(k), k = 1, 2, ..., 1000, depicted in Figure 22-10. In Figure 22-11 we see ŵ(k|N). Observe that the large spikes in w(k) have been estimated quite well. When ŵ(k|N) is convolved with a first-order decaying exponential [d₂(t) = e⁻³⁰⁰ᵗ], we obtain the shaped signal depicted in Figure 22-12. Some "smoothing" of the data occurs when ŵ(k|N) is convolved with the exponential d₂(k).
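The waveshaping recursion of Theorem 22-7 can be sketched in a few lines. Here the smoothed input ŵ(k|N) is taken as a given sequence (in practice it would come from MVD, Theorem 22-2), and the first-order desired-wavelet model {Φ₂, γ₂, h₂} is an illustrative assumption. The sketch also verifies the equivalence with the convolution form (22-59).

```python
import numpy as np

# d1_hat(k|N) = h2' x2_hat(k|N), x2_hat(k+1|N) = Phi2 x2_hat + gamma2 w_hat,
# eqs. (22-57)-(22-58), with x2_hat(0|N) = 0.
rng = np.random.default_rng(11)
N = 100
w_hat = rng.normal(size=N)              # stand-in for the MVD output w_hat(k|N)

phi2, gam2, h2 = 0.7, 1.0, 1.0          # hypothetical {Phi2, gamma2, h2}
x2 = 0.0
d1_hat = np.zeros(N)
for k in range(N):
    d1_hat[k] = h2 * x2                 # eq. (22-57)
    x2 = phi2 * x2 + gam2 * w_hat[k]    # eq. (22-58)

# consistency check against (22-59): d1_hat = d * w_hat, where the model's
# impulse response is d(l) = h2 phi2^(l-1) gam2 for l >= 1, d(0) = 0
d = np.array([0.0] + [h2 * phi2**(l - 1) * gam2 for l in range(1, N)])
d1_conv = np.convolve(w_hat, d)[:N]
print(np.max(np.abs(d1_hat - d1_conv)))
```

Note that d(0) = 0 here, exactly the condition h(0) = 0 under which Theorem 22-1's realization is equivalent to a convolutional-sum model.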

More smoothing is achieved when ŵ(k|N) is convolved with a zero-phase waveform. ∎

Finally, note that, because of Theorem 22-3, we know that perfect waveshaping is not possible. For example, the signal component of d̂₁(k|N), d̂₁ₛ(k|N), is given by the expression

d̂₁ₛ(k|N) = d(k)*R(k)*w(k)   (22-60)

How much the auto-correlation function R(k) will distort d̂₁ₛ(k|N) from d(k)*w(k) depends, of course, on bandwidth and signal-to-noise-ratio considerations.

Figure 22-8 Details of recursive linear waveshaping filter (Mendel, 1983a, © 1983, IEEE).
Figure 22-9 Bernoulli-Gaussian input sequence (Mendel, 1983a, © 1983, IEEE).
Figure 22-10 Noise-corrupted signal z(k); signal-to-noise ratio chosen equal to ten (Mendel, 1983a, © 1983, IEEE).
Figure 22-11 ŵ(k|N) (Mendel, 1983a, © 1983, IEEE).
Figure 22-12 d̂₁(k|N) when d₂(t) = e⁻³⁰⁰ᵗ (Mendel, 1983a, © 1983, IEEE).

PROBLEMS

22-1. Rederive the MVD algorithm for μ̂(k|N), which is given in (22-5), from the Fundamental Theorem of Estimation Theory, i.e., μ̂(k|N) = E{μ(k)|Z(N)}.

22-2. Prove Theorem 22-3. Explain why part (b) of the theorem means that μ̂_S(k|N) is a zero-phase waveshaped version of μ(k).

22-3. This problem is a memory refresher. You probably have either seen or carried out the calculations asked for in a course on random processes. (a) Derive Equation (22-34); (b) Derive Equation (22-37).

22-4. Prove the recursive waveshaping Theorem 22-7.

Lesson 23
State Estimation for the Not-So-Basic State-Variable Model

INTRODUCTION

In deriving all of our state estimators we assumed that our dynamical system could be modeled as in Lesson 15, i.e., as our basic state-variable model. The results so obtained are applicable only for systems that satisfy all the conditions of that model: the noise processes w(k) and v(k) are both zero mean, white, and mutually uncorrelated; no known bias functions appear in the state or measurement equations; and no measurements are noise-free (i.e., perfect). The following cases frequently occur in practice:

1. either nonzero-mean noise processes or known bias functions or both in the state or measurement equations,
2. correlated noise processes,
3. colored noise processes, and
4. some perfect measurements.

In this lesson we show how to modify some of our earlier results in order to treat these important special cases. In order to see the forest from the trees, we consider each of these four cases separately. In practice, some or all of them may occur together.

BIASES

Here we assume that our basic state-variable model, given in (15-17) and (15-18), has been modified to

x(k+1) = Φ(k+1, k)x(k) + Γ(k+1, k)w₁(k) + Ψ(k+1, k)u(k)   (23-1)

z(k+1) = H(k+1)x(k+1) + G(k+1)u(k+1) + v₁(k+1)   (23-2)

where w₁(k) and v₁(k) are nonzero-mean, individually and mutually uncorrelated Gaussian noise sequences, i.e.,

E{w₁(k)} = m_{w₁}(k) ≠ 0,  m_{w₁}(k) known   (23-3)

E{v₁(k)} = m_{v₁}(k) ≠ 0,  m_{v₁}(k) known   (23-4)

E{[w₁(i) − m_{w₁}(i)][w₁(j) − m_{w₁}(j)]′} = Q(i)δ_{ij}, E{[v₁(i) − m_{v₁}(i)][v₁(j) − m_{v₁}(j)]′} = R(i)δ_{ij}, and E{[w₁(i) − m_{w₁}(i)][v₁(j) − m_{v₁}(j)]′} = 0.

This case is handled by reducing (23-1) and (23-2) to our previous basic state-variable model, using the following simple transformations. Let

w(k) ≜ w₁(k) − m_{w₁}(k)   (23-5)

v(k) ≜ v₁(k) − m_{v₁}(k)   (23-6)

Observe that both w(k) and v(k) are zero-mean white noise processes, with covariances Q(k) and R(k), respectively. Adding and subtracting Γ(k+1, k)m_{w₁}(k) in state equation (23-1) and m_{v₁}(k+1) in measurement equation (23-2), these equations can be expressed as

x(k+1) = Φ(k+1, k)x(k) + Γ(k+1, k)w(k) + u₁(k)   (23-7)

z₁(k+1) = H(k+1)x(k+1) + v(k+1)   (23-8)

where

u₁(k) = Ψ(k+1, k)u(k) + Γ(k+1, k)m_{w₁}(k)   (23-9)

z₁(k+1) = z(k+1) − G(k+1)u(k+1) − m_{v₁}(k+1)   (23-10)

Clearly, (23-7) and (23-8) is once again a basic state-variable model, one in which u₁(k) plays the role of Ψ(k+1, k)u(k) and z₁(k+1) plays the role of z(k+1).

Theorem 23-1. When biases are present in a state-variable model, then that model can always be reduced to a basic state-variable model [e.g., (23-7) to (23-10)]. All of our previous state estimators can be applied to this basic state-variable model by replacing z(k) by z₁(k) and Ψ(k+1, k)u(k) by u₁(k). ∎

CORRELATED NOISES

Here we assume that our basic state-variable model is given by (15-17) and (15-18), except that now w(k) and v(k) are correlated, i.e.,

E{w(k)v′(k)} = S(k) ≠ 0   (23-11)

There are many approaches for treating correlated process and measurement noises, some leading to a recursive predictor, some to a recursive filter, and others to a filter in predictor-corrector form, as in the following.

Theorem 23-2. When w(k) and v(k) are correlated, then a predictor-corrector form of the Kalman filter is

x̂(k+1|k) = Φ(k+1, k)x̂(k|k) + Ψ(k+1, k)u(k) + Γ(k+1, k)S(k)[H(k)P(k|k−1)H′(k) + R(k)]⁻¹z̃(k|k−1)   (23-12)

and

x̂(k+1|k+1) = x̂(k+1|k) + K(k+1)z̃(k+1|k)   (23-13)

where Kalman gain matrix K(k+1) is given by (17-12), filtering-error covariance matrix P(k+1|k+1) is given by (17-14), and prediction-error covariance matrix P(k+1|k) is given by

P(k+1|k) = Φ₁(k+1, k)P(k|k)Φ₁′(k+1, k) + Q₁(k)   (23-14)

in which

Φ₁(k+1, k) = Φ(k+1, k) − Γ(k+1, k)S(k)R⁻¹(k)H(k)   (23-15)

and

Q₁(k) = Γ(k+1, k)Q(k)Γ′(k+1, k) − Γ(k+1, k)S(k)R⁻¹(k)S′(k)Γ′(k+1, k)   (23-16)

Observe that, if S(k) = 0, then (23-12) reduces to the more familiar predictor equation (16-4), and (23-14) reduces to the more familiar (17-13).

Proof. The derivation of correction equation (23-13) is exactly the same when w(k) and v(k) are correlated as it was when w(k) and v(k) were assumed uncorrelated. See the proof of part (a) of Theorem 17-1 for the details. In order to derive predictor equation (23-12), we begin with the Fundamental Theorem of Estimation Theory, i.e.,

x̂(k+1|k) = E{x(k+1)|Z(k)}   (23-17)

Substitute state equation (15-17) into (23-17), to show that

x̂(k+1|k) = Φ(k+1, k)x̂(k|k) + Ψ(k+1, k)u(k) + Γ(k+1, k)E{w(k)|Z(k)}   (23-18)
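The modified quantities of Theorem 23-2 are easy to evaluate numerically. The sketch below computes one predictor step via (23-12) and (23-14)-(23-16) for assumed scalar values (the control term Ψu is omitted; all numbers are illustrative).

```python
import numpy as np

# One predictor step of Theorem 23-2 with correlated w(k), v(k).
Phi, Gam, H = 0.9, 1.0, 1.0
Q, R, S = 1.0, 0.5, 0.3                 # E{w v'} = S != 0

Pf = 0.8                                # P(k|k), assumed given
Phi1 = Phi - Gam * S / R * H            # eq. (23-15)
Q1 = Gam * Q * Gam - Gam * S / R * S * Gam   # eq. (23-16)
Pp_next = Phi1 * Pf * Phi1 + Q1         # eq. (23-14)

# predictor (23-12), with assumed filtered estimate, innovation, and P(k|k-1)
xf, z_tilde, Pp = 0.2, 0.5, 1.0
xp_next = Phi * xf + Gam * S * z_tilde / (H * Pp * H + R)
print(Phi1, Q1, Pp_next, xp_next)
```

Note that Q₁ ≤ ΓQΓ′: exploiting the correlation with the measurement noise effectively reduces the process-noise covariance seen by the predictor.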

Next, we develop an expression for E{w(k)|Z(k)}. Let Z(k) = col(Z(k−1), z(k)); then,

E{w(k)|Z(k)} = E{w(k)|Z(k−1), z(k)}
             = E{w(k)|Z(k−1), z̃(k|k−1)}
             = E{w(k)|Z(k−1)} + E{w(k)|z̃(k|k−1)} − E{w(k)}
             = E{w(k)|z̃(k|k−1)}   (23-19)

In deriving (23-19) we used the facts that w(k) is zero mean, and w(k) and Z(k−1) are statistically independent. Because w(k) and z̃(k|k−1) are jointly Gaussian,

E{w(k)|z̃(k|k−1)} = P_{wz̃}(k, k|k−1)P_{z̃z̃}⁻¹(k|k−1)z̃(k|k−1)   (23-20)

where P_{z̃z̃} is given by (16-33), and

P_{wz̃}(k, k|k−1) = E{w(k)z̃′(k|k−1)} = E{w(k)[H(k)x̃(k|k−1) + v(k)]′} = S(k)   (23-21)

In deriving (23-21) we used the facts that x̃(k|k−1) and w(k) are statistically independent, and w(k) is zero mean. Substituting (23-21) and (16-33) into (23-20), we find that

E{w(k)|z̃(k|k−1)} = S(k)[H(k)P(k|k−1)H′(k) + R(k)]⁻¹z̃(k|k−1)   (23-22)

Substituting (23-22) into (23-19), and the resulting equation into (23-18), completes our derivation of the recursive predictor equation (23-12). We leave the derivation of (23-14) as an exercise. It is straightforward but algebraically tedious. ∎

Recall that the recursive predictor plays the predominant role in smoothing; hence, we present

Corollary 23-1. When w(k) and v(k) are correlated, then a recursive predictor for x(k+1) is

x̂(k+1|k) = Φ(k+1, k)x̂(k|k−1) + Ψ(k+1, k)u(k) + L(k)z̃(k|k−1)   (23-23)

where

L(k) = [Φ(k+1, k)P(k|k−1)H′(k) + Γ(k+1, k)S(k)][H(k)P(k|k−1)H′(k) + R(k)]⁻¹   (23-24)

and

P(k+1|k) = [Φ(k+1, k) − L(k)H(k)]P(k|k−1)[Φ(k+1, k) − L(k)H(k)]′ + Γ(k+1, k)Q(k)Γ′(k+1, k) − Γ(k+1, k)S(k)L′(k) − L(k)S′(k)Γ′(k+1, k) + L(k)R(k)L′(k)   (23-25)

Proof. These results follow directly from Theorem 23-2; or, they can be derived in an independent manner, as explained in Problem 23-2. ∎

Corollary 23-2. When w(k) and v(k) are correlated, then a recursive filter for x(k+1) is

x̂(k+1|k+1) = Φ₁(k+1, k)x̂(k|k) + Ψ(k+1, k)u(k) + D(k)z(k) + K(k+1)z̃(k+1|k)   (23-26)

where

D(k) = Γ(k+1, k)S(k)R⁻¹(k)   (23-27)

and all other quantities have been defined above.

Proof. Again, these results follow directly from Theorem 23-2; however, they can also be derived, in a much more elegant and independent manner, as described in Problem 23-3. ∎

COLORED NOISES

Quite often, some or all of the elements of either v(k) or w(k) or both are colored (i.e., have finite bandwidth). The following three-step procedure is used in these cases:

1. model each colored noise by a low-order difference equation that is excited by white Gaussian noise;
2. augment the states associated with the step 1 colored-noise models to the original state-variable model; and
3. apply the recursive filter or predictor to the augmented system.

We try to model colored noise processes by low-order Markov processes, i.e., low-order difference equations. Usually, first- or second-order models are quite adequate. Consider the following first-order model for a colored noise process w(k):

w(k+1) = aw(k) + n(k)   (23-28)

In this model n(k) is white noise with variance σₙ²; thus, this model contains two parameters, a and σₙ², which must be determined from a priori knowledge about w(k). We may know the amplitude spectrum of w(k), correlation information about w(k), the steady-state variance of w(k), etc. Two independent pieces of information are needed in order to uniquely identify a and σₙ².

Example 23-1
We are given the facts that scalar noise w(k) is stationary with the properties E{w(k)} = 0 and E{w(i)w(j)} = e⁻²|i−j|. A first-order Markov model for w(k) can easily be obtained as

ξ(k+1) = e⁻²ξ(k) + √(1 − e⁻⁴) n(k)   (23-29)

w(k) = ξ(k)   (23-30)

where E{ξ(0)} = 0, E{ξ²(0)} = 1, E{n(k)} = 0 and E{n(k)n(j)} = δ_{kj}. ∎
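Example 23-1 can be verified by simulation: generating (23-29)-(23-30) and estimating the stationary variance and lag-one correlation should reproduce 1 and e⁻², respectively. The sample size below is an arbitrary choice.

```python
import numpy as np

# Simulate xi(k+1) = e^-2 xi(k) + sqrt(1 - e^-4) n(k), w(k) = xi(k),
# and check E{w^2} = 1 and E{w(k)w(k+1)} = e^-2.
rng = np.random.default_rng(23)
a = np.exp(-2.0)
b = np.sqrt(1.0 - np.exp(-4.0))
M = 200_000
xi = rng.normal()                       # E{xi(0)} = 0, E{xi^2(0)} = 1
w = np.empty(M)
for k in range(M):
    w[k] = xi
    xi = a * xi + b * rng.normal()      # unit-variance white n(k)

var_hat = np.mean(w * w)                # should be near 1
lag1 = np.mean(w[:-1] * w[1:])          # should be near e^-2 (about 0.135)
print(var_hat, lag1)
```

The choice b = √(1 − a²) is exactly what keeps the steady-state variance at 1 for a first-order Markov model, illustrating the "two independent pieces of information" needed to pin down a and σₙ².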

Example 23-2 Here we illustrate the state augmentation procedure for the first-order system

x(k + 1) = a₁x(k) + w(k)   (23-31)

z(k + 1) = hx(k + 1) + v(k + 1)   (23-32)

where w(k) is a first-order Markov process, i.e.,

w(k + 1) = αw(k) + n(k)   (23-33)

and v(k) and n(k) are white noise processes. We "augment" (23-33) to (23-31), as follows. Let

x(k) = col(x(k), w(k))   (23-34)

then (23-31) and (23-33) can be combined, to give

x(k + 1) = ( a₁  1 ) x(k) + ( 0 ) n(k)   (23-35)
           ( 0   α )        ( 1 )

Equation (23-35) is our augmented state equation. Observe that it is once again excited by a white noise process, just as our basic state equation (15-17) is. In order to complete the description of the augmented state-variable model, we must express measurement z(k + 1) in terms of the augmented state vector, x(k + 1), i.e.,

z(k + 1) = (h  0) x(k + 1) + v(k + 1)   (23-36)

Equations (23-35) and (23-36) constitute the augmented state-variable model. We observe that, when the original process noise is colored and the measurement noise is white, the state augmentation procedure leads us once again to a basic (augmented) state-variable model, one that is of higher dimension than the original model because of the modeled colored process noise. Hence, in this case we can apply all of our state estimation algorithms to the augmented state-variable model.  □

Example 23-3
Here we consider the situation where the process noise is white but the measurement noise is colored, again for a first-order system:

x(k + 1) = a₁x(k) + w(k)   (23-37)

z(k + 1) = hx(k + 1) + v(k + 1)   (23-38)

As in the preceding example, we model v(k) by the following first-order Markov process:

v(k + 1) = αv(k) + n(k)   (23-39)

where n(k) is white noise. Augmenting (23-39) to (23-37) and reexpressing (23-38) in terms of the augmented state vector x(k), where

x(k) = col(x(k), v(k))   (23-40)

we obtain the following augmented state-variable model:

x(k + 1) = ( a₁  0 ) x(k) + ( w(k) )   (23-41)
           ( 0   α )        ( n(k) )

z(k + 1) = (h  1) x(k + 1)   (23-42)

Observe that a vector process noise now excites the augmented state equation and that there is no measurement noise in the measurement equation. This second observation can lead to serious numerical problems in our state estimators, because it means that we must set R = 0 in those estimators, and, when we do this, covariance matrices become and remain singular.  □
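The two augmentations can be written down concretely. A minimal numpy sketch (the numerical values of a₁, α, and h are ours, purely for illustration):

```python
import numpy as np

# Illustrative constants (not from the text): plant pole a1, colored-noise
# model pole alpha, measurement gain h.
a1, alpha, h = 0.9, 0.5, 1.0

# Example 23-2: colored process noise w(k), white measurement noise v(k).
Phi_aug = np.array([[a1, 1.0],
                    [0.0, alpha]])    # augmented state col(x(k), w(k)), eq. (23-35)
Gamma_aug = np.array([[0.0],
                      [1.0]])         # driven only by the white noise n(k)
H_aug = np.array([[h, 0.0]])          # z(k+1) = h x(k+1) + v(k+1), eq. (23-36)

# Example 23-3: white process noise, colored measurement noise v(k).
Phi_aug2 = np.array([[a1, 0.0],
                     [0.0, alpha]])   # augmented state col(x(k), v(k)), eq. (23-41)
H_aug2 = np.array([[h, 1.0]])         # z(k+1) = (h 1) x(k+1): no measurement noise, eq. (23-42)
```

In the first case a standard Kalman filter applies directly to (Phi_aug, Gamma_aug, H_aug); in the second case the vanishing measurement noise (R = 0) is exactly what triggers the singularity discussed next.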

Let us examine what happens to P(k + 1|k + 1) when covariance matrix R is set equal to zero. From (17-14) and (17-12) (in which we set R = 0), we find that

P(k + 1|k + 1) = P(k + 1|k) − P(k + 1|k)H′(k + 1)[H(k + 1)P(k + 1|k)H′(k + 1)]⁻¹H(k + 1)P(k + 1|k)   (23-43)

Multiplying both sides of (23-43) on the right by H′(k + 1), we find that

P(k + 1|k + 1)H′(k + 1) = 0   (23-44)


Because H′(k + 1) is a nonzero matrix, (23-44) implies that P(k + 1|k + 1) must be a singular matrix. We leave it to the reader to show that once P(k + 1|k + 1) becomes singular it remains singular for all other values of k.
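A small numerical experiment (ours, not the text's) shows (23-43)–(23-44) in action: after a single update with R = 0, the filtered covariance annihilates H′ and is singular, and it stays that way. A sketch assuming numpy, with illustrative system matrices:

```python
import numpy as np

# Illustrative 3-state system with one perfect (noise-free) measurement.
n = 3
Phi = 0.9 * np.eye(n)
Q = 0.1 * np.eye(n)
H = np.array([[1.0, 0.0, 0.0]])
P = np.eye(n)                        # P(k|k)

for _ in range(5):
    Pp = Phi @ P @ Phi.T + Q                  # prediction covariance
    S = H @ Pp @ H.T                          # innovation covariance with R = 0
    K = Pp @ H.T @ np.linalg.inv(S)           # Kalman gain
    P = Pp - K @ H @ Pp                       # filtered covariance, eq. (23-43)

resid = np.linalg.norm(P @ H.T)               # eq. (23-44): P(k+1|k+1) H' = 0
det_P = np.linalg.det(P)                      # singular => determinant ~ 0
```

Both quantities are zero to machine precision, which is exactly the numerical hazard the text warns about.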

PERFECT MEASUREMENTS: REDUCED-ORDER ESTIMATORS

We have just seen that when R = 0 (or, in fact, even if some, but not all, measurements are perfect) numerical problems can occur in the Kalman filter. One way to circumvent these problems is ad hoc, and that is to use small values for the elements of covariance matrix R, even though measurements are thought to be perfect. Doing this has a stabilizing effect on the numerics of the Kalman filter. A second way to circumvent these problems is to recognize that a set of "perfect" measurements reduces the number of states that have to be estimated. Suppose, for example, that there are l perfect measurements and that state vector x(k) is n × 1. Then we conjecture that we ought to be able to estimate x(k) by a Kalman filter whose dimension is no greater than n − l. Such an estimator will be referred to as a reduced-order estimator. The payoff for using a reduced-order estimator is fewer computations and less storage.

In order to illustrate an approach to designing a reduced-order estimator, we limit our discussion in this section to the following time-invariant and stationary basic state-variable model in which u(k) ≡ 0 and all measurements are perfect:

x(k + 1) = Φx(k) + Γw(k)   (23-45)

y(k + 1) = Hx(k + 1)   (23-46)

In this model y is l × 1. What makes the design of a reduced-order estimator challenging is the fact that the l perfect measurements are linearly related to the n states, i.e., H is rectangular. To begin, we introduce a reduced-order state vector, p(k), whose dimension is (n − l) × 1; p(k) is assumed to be a linear transformation of x(k), i.e.,

p(k) ≜ Cx(k)   (23-47)

Augmenting (23-47) to (23-46), we obtain

col(y(k), p(k)) = col(H, C) x(k)   (23-48)

Design matrix C is chosen so that col(H, C) is invertible. Of course, many different choices of C are possible; thus, this first step of our reduced-order estimator design procedure is nonunique. Let

(L₁ ⋮ L₂) = [col(H, C)]⁻¹   (23-49)

where L₁ is n × l and L₂ is n × (n − l); thus,

x(k) = L₁y(k) + L₂p(k)   (23-50)

In order to obtain a filtered estimate of x(k), we operate on both sides of (23-50) with E{·|𝒴(k)}, where

𝒴(k) = col(y(1), y(2), ..., y(k))   (23-51)

Doing this, we find that

x̂(k|k) = L₁y(k) + L₂p̂(k|k)   (23-52)

which is a reduced-order estimator for x(k). Of course, in order to evaluate x̂(k|k) we must develop a reduced-order Kalman filter to estimate p(k). Knowing p̂(k|k) and y(k), it is then a simple matter to compute x̂(k|k), using (23-52).

In order to obtain p̂(k|k), using our previously derived Kalman filter algorithm, we first must establish a state-variable model for p(k). A state equation for p is easily obtained, as follows:

p(k + 1) = Cx(k + 1) = C[Φx(k) + Γw(k)]
         = CΦ[L₁y(k) + L₂p(k)] + CΓw(k)
         = CΦL₂p(k) + CΦL₁y(k) + CΓw(k)   (23-53)

Observe that this state equation is driven by white noise w(k) and the known forcing function y(k).

A measurement equation is obtained from (23-46), as

y(k + 1) = Hx(k + 1) = H[Φx(k) + Γw(k)]
         = HΦ[L₁y(k) + L₂p(k)] + HΓw(k)
         = HΦL₂p(k) + HΦL₁y(k) + HΓw(k)   (23-54)

At time k + 1 we know y(k); hence, we can reexpress (23-54) as

y₁(k + 1) = HΦL₂p(k) + HΓw(k)   (23-55)

where

y₁(k + 1) ≜ y(k + 1) − HΦL₁y(k)   (23-56)

Before proceeding any farther, we make some important observations about our state-variable model in (23-53) and (23-55). First, the new measurement y₁(k + 1) represents a weighted difference between measurements y(k + 1) and y(k). The technique for obtaining our reduced-order state-variable model is, therefore, sometimes referred to as a measurement-differencing technique (e.g., Bryson and Johansen, 1965). Because we have already used y(k) to reduce the dimension of x(k) from n to n − l, we cannot again use y(k) alone as the measurements in our reduced-order state-variable model. As we have just seen, we must use both y(k) and y(k + 1).
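The construction (23-47)–(23-50) is easy to verify numerically. A minimal numpy sketch (ours, not the text's), using one of the many admissible choices of C — here, a basis for the orthogonal complement of the rows of H:

```python
import numpy as np

# Illustrative dimensions: n states, l perfect measurements.
rng = np.random.default_rng(2)
n, l = 4, 2
H = rng.standard_normal((l, n))    # rectangular, full row rank (w.p. 1)

# One choice of design matrix C that makes col(H, C) invertible.
_, _, Vt = np.linalg.svd(H)
C = Vt[l:, :]                      # (n - l) x n, rows orthogonal to rows of H

T = np.vstack([H, C])              # col(H, C), eq. (23-48)
L = np.linalg.inv(T)
L1, L2 = L[:, :l], L[:, l:]        # partition of the inverse, eq. (23-49)

# Check eq. (23-50): x = L1*y + L2*p with y = H x and p = C x.
x = rng.standard_normal(n)
x_rebuilt = L1 @ (H @ x) + L2 @ (C @ x)
err = np.linalg.norm(x - x_rebuilt)
```

Any C with col(H, C) invertible works; this nonuniqueness is exactly the design freedom the text mentions.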


Second, measurement equation (23-55) appears to be a combination of signal and noise. Unless HΓ = 0, the term HΓw(k) will act as the measurement noise in our reduced-order state-variable model. Its covariance matrix is HΓQΓ′H′. Unfortunately, it is possible for HΓ to equal the zero matrix. From linear system theory, we know that HΓ is the matrix of first Markov parameters for our original system in (23-45) and (23-46), and HΓ may equal zero. If this occurs, then we must repeat all of the above until we obtain a reduced-order state vector whose measurement equation is excited by white noise. We see, therefore, that, depending upon system dynamics, it is possible to obtain a reduced-order estimator of x(k) that uses a reduced-order Kalman filter of dimension less than n − l.

Third, the noises which appear in state equation (23-53) and measurement equation (23-55) are the same, namely w(k); hence, the reduced-order state-variable model involves the correlated noise case that we described earlier in this lesson, in the section entitled Correlated Noises.

Finally, and most important, measurement equation (23-55) is nonstandard, in that it expresses y₁ at k + 1 in terms of p at k rather than p at k + 1. Recall that the measurement equation in our basic state-variable model is z(k + 1) = Hx(k + 1) + v(k + 1). We cannot immediately apply our Kalman filter equations to (23-53) and (23-55) until we express (23-55) in the standard way. To proceed, we let

ζ(k) ≜ y₁(k + 1)   (23-57)

so that

ζ(k) = HΦL₂p(k) + HΓw(k)   (23-58)

Measurement equation (23-58) is now in the standard form; however, because ζ(k) equals a future value of y₁, namely y₁(k + 1), we must be very careful in applying our estimator formulas to our reduced-order model (23-53) and (23-58). In order to see this more clearly, we define the following two data sets:

𝒴₁(k + 1) = {y₁(1), y₁(2), ..., y₁(k + 1)}   (23-59)

and

𝒵(k) = {ζ(1), ζ(2), ..., ζ(k)}   (23-60)

Obviously,

𝒵(k) = {y₁(2), y₁(3), ..., y₁(k + 1)}   (23-61)

Letting

p̂_ζ(k + 1|k) ≜ E{p(k + 1)|𝒵(k)}   (23-62)

and

p̂_y₁(k + 1|k + 1) ≜ E{p(k + 1)|𝒴₁(k + 1)}   (23-63)

we see that

p̂_y₁(k + 1|k + 1) = p̂_ζ(k + 1|k)   (23-64)

Equation (23-64) tells us that, to obtain a recursive filter for our reduced-order model, one that is in terms of data set 𝒴₁(k + 1), we must first obtain a recursive predictor for that model, which is in terms of data set 𝒵(k). Then, wherever ζ(k) appears in the recursive predictor, it can be replaced by y₁(k + 1). Using Corollary 23-1, applied to the reduced-order model in (23-53) and (23-58), we find that

p̂_ζ(k + 1|k) = CΦL₂p̂_ζ(k|k − 1) + CΦL₁y(k) + L(k)[ζ(k) − HΦL₂p̂_ζ(k|k − 1)]   (23-65)

thus,

p̂_y₁(k + 1|k + 1) = CΦL₂p̂_y₁(k|k) + CΦL₁y(k) + L(k)[y₁(k + 1) − HΦL₂p̂_y₁(k|k)]   (23-66)

Equation (23-66) is our reduced-order Kalman filter. It provides filtered estimates of p(k + 1) and is only of dimension (n − l) × 1. Of course, when L(k) and P_p(k + 1|k + 1) are computed using (23-13) and (23-14), respectively, we must make the following substitutions: Φ(k + 1,k) → CΦL₂, H(k) → HΦL₂, Γ(k + 1,k) → CΓ, Q(k) → Q, S(k) → QΓ′H′, and R(k) → HΓQΓ′H′.

FINAL REMARK

In order to see the forest for the trees, we have considered each of our special cases separately. In actual practice, some or all of them may occur simultaneously. The exercises at the end of this lesson will permit the reader to gain experience with such cases.

PROBLEMS

23-1. Derive the prediction-error covariance equation (23-14).
23-2. Derive the recursive predictor, given in (23-23), by expressing x̂(k + 1|k) as E{x(k + 1)|𝒵(k)} = E{x(k + 1)|𝒵(k − 1), z̃(k|k − 1)}.
23-3. Here we derive the recursive filter, given in (23-26), by first adding a convenient form of zero to state equation (15-17), in order to decorrelate the process noise in this modified basic state-variable model from the measurement noise v(k). Add D(k)[z(k) − H(k)x(k) − v(k)] to (15-17). The process noise, w₁(k), in the modified basic state-variable model is equal to Γ(k + 1,k)w(k) − D(k)v(k). Choose "decorrelation" matrix D(k) so that E{w₁(k)v′(k)} = 0. Then complete the derivation of (23-26). Observe that (23-14) can be obtained by inspection, via this derivation.


23-4. In solving Problem 23-3, one arrives at the following predictor equation:

x̂(k + 1|k) = Φ(k + 1,k)x̂(k|k) + Ψ(k + 1,k)u(k) + D(k)z(k)

Beginning with this predictor equation and the corrector equation (17-13), derive the recursive predictor given in (23-23).
23-5. Show that once P(k + 1|k + 1) becomes singular it remains singular for all other values of k.
23-6. Assume that R = 0, HΓ = 0, and HΦΓ ≠ 0. Obtain the reduced-order estimator and its associated reduced-order Kalman filter for this situation. Contrast this situation with the case given in the text, for which HΓ ≠ 0.
23-7. Develop a reduced-order estimator and its associated reduced-order Kalman filter for the case when l measurements are perfect and m − l measurements are noisy.
23-8. Consider the first-order system x(k + 1) = ½x(k) + w₁(k) and z(k + 1) = x(k + 1) + v(k + 1), where E{w₁(k)} = 3, E{v(k)} = 0, w₁(k) and v(k) are both white and Gaussian, E{w₁²(k)} = 10, E{v²(k)} = 2, and w₁(k) and v(k) are correlated, i.e., E{w₁(k)v(k)} = 1.
(a) Obtain the steady-state recursive Kalman filter for this system.
(b) What is the steady-state filter error variance, and how does it compare with the steady-state predictor error variance?
23-9. Consider the first-order system x(k + 1) = ½x(k) + w(k) and z(k + 1) = x(k + 1) + v(k + 1), where w(k) is a first-order Markov process and v(k) is Gaussian white noise with E{v(k)} = 4 and r = 1.
(a) Let the model for w(k) be w(k + 1) = αw(k) + u(k), where u(k) is a zero-mean white Gaussian noise sequence for which E{u²(k)} = σᵤ². Additionally, E{w(k)} = 0. What value must α have if E{w²(k)} = W for all k?
(b) Suppose W = 2 and σᵤ² = 1. What are the Kalman filter equations for estimation of x(k) and w(k)?
23-10. Consider the first-order system x(k + 1) = −½x(k) + w(k) and z(k + 1) = x(k + 1) + v(k + 1), where w(k) is white and Gaussian [w(k) ~ N(w(k); 0, 1)] and v(k) is also a noise process. The model for v(k) is summarized in Figure P23-10.
(a) Verify that a correct state-variable model for v(k) is

x₁(k + 1) = −½x₁(k) + (3/2)n(k)
v(k) = x₁(k) + n(k)

(b) Show that v(k) is also a white process.
(c) Noise n(k) is white and Gaussian [n(k) ~ N(n(k); 0, 1/4)]. What are the Kalman filter equations for finding x̂(k + 1|k + 1)?

Figure P23-10


23-11. Obtain the equations from which we can find x̂₁(k + 1|k + 1), x̂₂(k + 1|k + 1), and v̂(k + 1|k + 1) for the following system:

x₁(k + 1) = −x₁(k) + x₂(k)
x₂(k + 1) = x₂(k) + w(k)
z(k + 1) = x₁(k + 1) + v(k + 1)

where v(k) is a colored noise process, i.e.,

v(k + 1) = −½v(k) + n(k)

Assume that w(k) and n(k) are white processes and are mutually uncorrelated, and that σw²(k) = 4 and σn²(k) = 2. Include a block diagram of the interconnected system and reduced-order KF.
23-12. Consider the system x(k + 1) = Φx(k) + γμ(k) and z(k + 1) = h′x(k + 1) + v(k + 1), where μ(k) is a colored noise sequence and v(k) is zero-mean white noise. What are the formulas for computing μ̂(k|k + 1)? Filter μ̂(k|k + 1) is a deconvolution filter.
23-13. Consider the scalar moving average (MA) time-series model

z(k) = r(k) + r(k − 1)

where r(k) is a unit-variance, white Gaussian sequence. Show that the optimal one-step predictor for this model is [assume P(0|0) = 1]

ẑ(k + 1|k) = [k/(k + 1)][z(k) − ẑ(k|k − 1)]

(Hint: Express the MA model in state-space form.)
23-14. Consider the basic state-variable model for the stationary time-invariant case. Assume also that w(k) and v(k) are correlated, i.e., E{w(k)v′(k)} = S.
(a) Show, from first principles, that the single-stage smoother of x(k), i.e., x̂(k|k + 1), is given by

x̂(k|k + 1) = x̂(k|k) + M(k|k + 1)z̃(k + 1|k)

where M(k|k + 1) is an appropriate smoother gain matrix.
(b) Derive a closed-form solution for M(k|k + 1) as a function of the correlation matrix S and other quantities of the basic state-variable model.

Lesson 24

Linearization and Discretization of Nonlinear Systems

The purpose of this lesson is to explain how to linearize and discretize a nonlinear differential equation model. We do this so that we will be able to apply our digital estimators to the resulting discrete-time system.

INTRODUCTION

Many real-world systems are continuous-time in nature, and quite a few are also nonlinear. For example, the state equations associated with the motion of a satellite of mass m about a spherical planet of mass M, in a planet-centered coordinate system, are nonlinear, because the planet's force field obeys an inverse square law. Figure 24-1 depicts a situation where the measurement equation is nonlinear. The measurement is angle φ, and is expressed in a rectangular coordinate system, i.e., φ = tan⁻¹[y/(x − a)]. Sometimes the state equation may be nonlinear and the measurement equation linear, or vice versa, or they may both be nonlinear. Occasionally, the coordinate system in which one chooses to work causes one of these situations. For example, equations of motion in a polar coordinate system are nonlinear, whereas the measurement equations are linear. In a polar coordinate system, where φ is a state variable, the measurement equation for the situation depicted in Figure 24-1 is z = φ, which is linear. In a rectangular coordinate system, on the other hand, equations of motion are linear, but the measurement equations are nonlinear. Finally, we may begin with a linear system that contains some unknown parameters. When these parameters are modeled as first-order Markov processes, and these models are augmented to the original system, the augmented model is nonlinear, because the parameters that appeared in the original "linear" model are treated as states. We shall describe this situation in much more detail in Lesson 25.

Figure 24-1  Coordinate system for an angular measurement between two objects A and B.

A DYNAMICAL MODEL

The starting point for this lesson is the nonlinear state-variable model

ẋ(t) = f[x(t),u(t),t] + G(t)w(t)   (24-1)

z(t) = h[x(t),u(t),t] + v(t)   (24-2)

We shall assume that measurements are only available at specific values of time, namely at t = tᵢ, i = 1, 2, ...; thus, our measurement equation will be treated as a discrete-time equation, whereas our state equation will be treated as a continuous-time equation. State vector x(t) is n × 1; u(t) is an l × 1 vector of known inputs; measurement vector z(t) is m × 1; ẋ(t) is short for dx(t)/dt; nonlinear functions f and h may depend both implicitly and explicitly on t, and we assume that both f and h are continuous and continuously differentiable with respect to all the elements of x and u; w(t) is a continuous-time white noise process, i.e., E{w(t)} = 0 and

E{w(t)w′(τ)} = Q(t)δ(t − τ)   (24-3)

v(tᵢ) is a discrete-time white noise process, i.e., E{v(tᵢ)} = 0 for t = tᵢ, i = 1, 2, ..., and

E{v(tᵢ)v′(tⱼ)} = R(tᵢ)δᵢⱼ   (24-4)

and w(t) and v(tᵢ) are mutually uncorrelated at all t = tᵢ, i.e.,

E{w(t)v′(tᵢ)} = 0,  t = tᵢ,  i = 1, 2, ...   (24-5)

Example 24-1
Here we expand upon the previously mentioned satellite-planet example. Our example is taken from Meditch (1969, pp. 60–61), who states: "Assuming that the planet's force field obeys an inverse square law, and that the only other forces present are the satellite's two thrust forces u_r(t) and u_θ(t) (see Figure 24-2), and that the satellite's initial position and velocity vectors lie in the plane, we know from elementary particle mechanics that the satellite's motion is confined to the plane and is governed by the two equations

r̈ = rθ̇² − γ/r² + (1/m)u_r(t)   (24-6)

θ̈ = −(2ṙθ̇)/r + (1/(mr))u_θ(t)   (24-7)

where γ = GM and G is the universal gravitational constant. Defining x₁ = r, x₂ = ṙ, x₃ = θ, x₄ = θ̇, u₁ = u_r, and u₂ = u_θ, we have

ẋ(t) = col(x₂, x₁x₄² − γ/x₁² + (1/m)u₁, x₄, −2x₂x₄/x₁ + (1/(mx₁))u₂)   (24-8)

which is of the form in (24-1). . . . Assuming . . . that the measurement made on the satellite during its motion is simply its distance from the surface of the planet, we have the scalar measurement equation

z(t) = r(t) − r₀ + v(t) = x₁(t) − r₀ + v(t)   (24-9)

where r₀ is the planet's radius."

Figure 24-2  Schematic for satellite-planet system (Copyright 1969, McGraw-Hill).

Comparing (24-8) and (24-1), and (24-9) and (24-2), we conclude that

f[x(t),u(t),t] = col(x₂, x₁x₄² − γ/x₁² + (1/m)u₁, x₄, −2x₂x₄/x₁ + (1/(mx₁))u₂)   (24-10)

and

h[x(t),u(t),t] = x₁ − r₀   (24-11)

Observe that, in this example, only the state equation is nonlinear.  □

LINEAR PERTURBATION EQUATIONS

In this section we shall linearize our nonlinear dynamical model in (24-1) and (24-2) about nominal values of x(t) and u(t), x*(t) and u*(t), respectively. If we are given a nominal input, u*(t), then x*(t) satisfies the following nonlinear differential equation:

ẋ*(t) = f[x*(t),u*(t),t]   (24-12)

and associated with x*(t) and u*(t) is the following nominal measurement, z*(t), where

z*(t) = h[x*(t),u*(t),t],  t = tᵢ,  i = 1, 2, ...   (24-13)

Throughout this lesson, we shall assume that x*(t) exists. We discuss two methods for choosing x*(t) in Lesson 25. Obviously, one is just to solve (24-12) for x*(t). Note that x*(t) must provide a good approximation to the actual behavior of the system. The approximation is considered good if the difference between the nominal and actual solutions can be described by a system of linear differential equations, called linear perturbation equations. We derive these equations next. Let

δx(t) = x(t) − x*(t)   (24-14)

δu(t) = u(t) − u*(t)   (24-15)

then

(d/dt)δx(t) = δẋ(t) = ẋ(t) − ẋ*(t) = f[x(t),u(t),t] + G(t)w(t) − f[x*(t),u*(t),t]   (24-16)
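The nominal-trajectory machinery above can be exercised on Example 24-1. The sketch below (assuming numpy; the constants m, γ, r₀ and the orbit radius are illustrative placeholders, not values from the text) codes f and h from (24-10)–(24-11) and checks that a circular orbit, with θ̇² = γ/r³, is a valid nominal x*(t) for u* = 0:

```python
import numpy as np

# Illustrative constants (placeholders, not from the text).
m, gamma, r0 = 1.0, 1.0, 1.0

def f(x, u):
    """State function (24-10); x = [r, rdot, theta, thetadot], u = [ur, utheta]."""
    x1, x2, x3, x4 = x
    u1, u2 = u
    return np.array([
        x2,
        x1 * x4**2 - gamma / x1**2 + u1 / m,
        x4,
        -2.0 * x2 * x4 / x1 + u2 / (m * x1),
    ])

def h(x, u):
    """Measurement function (24-11): distance above the planet's surface."""
    return x[0] - r0

# Circular-orbit nominal: rdot* = 0 and thetadot*^2 = gamma / r^3 make the
# radial and angular accelerations vanish, so x*(t) solves (24-12) with u* = 0.
r = 2.0
x_star = np.array([r, 0.0, 0.0, np.sqrt(gamma / r**3)])
u_star = np.zeros(2)
xdot = f(x_star, u_star)   # components 1, 2, 4 should vanish; component 3 is thetadot
```

This x*(t) is precisely the kind of precomputable nominal about which the perturbation equations of this section are written.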

Fact 1. When f[x(t),u(t),t] is expanded in a Taylor series about x*(t) and u*(t), we obtain

f[x(t),u(t),t] = f[x*(t),u*(t),t] + Fx[x*(t),u*(t),t]δx(t) + Fu[x*(t),u*(t),t]δu(t) + higher-order terms   (24-17)

where Fx and Fu are n × n and n × l Jacobian matrices, i.e.,

Fx[x*(t),u*(t),t] = ( ∂f₁/∂x₁*  ⋯  ∂f₁/∂xₙ* )
                    (    ⋮            ⋮     )   (24-18)
                    ( ∂fₙ/∂x₁*  ⋯  ∂fₙ/∂xₙ* )

and

Fu[x*(t),u*(t),t] = ( ∂f₁/∂u₁*  ⋯  ∂f₁/∂u_l* )
                    (    ⋮            ⋮      )   (24-19)
                    ( ∂fₙ/∂u₁*  ⋯  ∂fₙ/∂u_l* )

In these expressions ∂fᵢ/∂xⱼ* and ∂fᵢ/∂uⱼ* are short for

∂fᵢ/∂xⱼ* = ∂fᵢ[x(t),u(t),t]/∂xⱼ(t) evaluated at x(t) = x*(t), u(t) = u*(t)   (24-20)

and

∂fᵢ/∂uⱼ* = ∂fᵢ[x(t),u(t),t]/∂uⱼ(t) evaluated at x(t) = x*(t), u(t) = u*(t)   (24-21)

Proof. The Taylor series expansion of the ith component of f[x(t),u(t),t] is

fᵢ[x(t),u(t),t] = fᵢ[x*(t),u*(t),t] + Σⱼ₌₁ⁿ (∂fᵢ/∂xⱼ*)[xⱼ(t) − xⱼ*(t)] + Σⱼ₌₁ˡ (∂fᵢ/∂uⱼ*)[uⱼ(t) − uⱼ*(t)] + higher-order terms   (24-22)

where i = 1, 2, ..., n. Collecting these n equations together in vector-matrix format, we obtain (24-17), in which Fx and Fu are defined in (24-18) and (24-19), respectively.  □

Substituting (24-17) into (24-16) and neglecting the higher-order terms, we obtain the following perturbation state equation:

δẋ(t) = Fx[x*(t),u*(t),t]δx(t) + Fu[x*(t),u*(t),t]δu(t) + G(t)w(t)   (24-23)

Observe that, even if our original nonlinear differential equation is not an explicit function of time (i.e., f[x(t),u(t),t] = f[x(t),u(t)]), our perturbation state equation is always time-varying, because Jacobian matrices Fx and Fu vary with time, because x* and u* vary with time.

Next, let

δz(t) = z(t) − z*(t)   (24-24)

Fact 2. When h[x(t),u(t),t] is expanded in a Taylor series about x*(t) and u*(t), we obtain

h[x(t),u(t),t] = h[x*(t),u*(t),t] + Hx[x*(t),u*(t),t]δx(t) + Hu[x*(t),u*(t),t]δu(t) + higher-order terms   (24-25)

where Hx and Hu are m × n and m × l Jacobian matrices, i.e.,

Hx[x*(t),u*(t),t] = ( ∂h₁/∂x₁*  ⋯  ∂h₁/∂xₙ* )
                    (    ⋮            ⋮     )   (24-26)
                    ( ∂hₘ/∂x₁*  ⋯  ∂hₘ/∂xₙ* )

and

Hu[x*(t),u*(t),t] = ( ∂h₁/∂u₁*  ⋯  ∂h₁/∂u_l* )
                    (    ⋮            ⋮      )   (24-27)
                    ( ∂hₘ/∂u₁*  ⋯  ∂hₘ/∂u_l* )

In these expressions ∂hᵢ/∂xⱼ* and ∂hᵢ/∂uⱼ* are short for

∂hᵢ/∂xⱼ* = ∂hᵢ[x(t),u(t),t]/∂xⱼ(t) evaluated at x(t) = x*(t), u(t) = u*(t)   (24-28)

and

∂hᵢ/∂uⱼ* = ∂hᵢ[x(t),u(t),t]/∂uⱼ(t) evaluated at x(t) = x*(t), u(t) = u*(t)   (24-29)

We leave the derivation of this fact to the reader, because it is analogous to the derivation of the Taylor series expansion of f[x(t),u(t),t]. Substituting (24-25) into (24-24) and neglecting the higher-order terms, we obtain the following perturbation measurement equation:

δz(t) = Hx[x*(t),u*(t),t]δx(t) + Hu[x*(t),u*(t),t]δu(t) + v(t),  t = tᵢ,  i = 1, 2, ...   (24-30)

Equations (24-23) and (24-30) constitute our linear perturbation equations, or our linear perturbation state-variable model.

Example 24-2
Returning to our satellite-planet Example 24-1, we find that

Fx[x*(t),u*(t),t] = ( 0                              1           0   0         )
                    ( x₄*² + 2γ/x₁*³                 0           0   2x₁*x₄*   )
                    ( 0                              0           0   1         )
                    ( 2x₂*x₄*/x₁*² − u₂*/(mx₁*²)     −2x₄*/x₁*   0   −2x₂*/x₁* )

and

Fu[x*(t),u*(t),t] = ( 0     0        )
                    ( 1/m   0        )
                    ( 0     0        )
                    ( 0     1/(mx₁*) )

□

DISCRETIZATION OF A LINEAR TIME-VARYING STATE-VARIABLE MODEL

We now consider the linear time-varying state-variable model

ẋ(t) = F(t)x(t) + C(t)u(t) + G(t)w(t)   (24-31)

z(tᵢ) = H(tᵢ)x(tᵢ) + v(tᵢ),  i = 1, 2, ...   (24-32)

where w(t) is a continuous-time white noise process, v(tᵢ) is a discrete-time white noise process, and w(t) and v(tᵢ) are mutually uncorrelated at all t = tᵢ, i = 1, 2, ...; i.e., E{w(t)} = 0 for all t, E{v(tᵢ)} = 0 for all tᵢ, and

E{w(t)w′(τ)} = Q(t)δ(t − τ)   (24-33)

E{v(tᵢ)v′(tⱼ)} = R(tᵢ)δᵢⱼ   (24-34)

E{w(t)v′(tᵢ)} = 0,  t = tᵢ,  i = 1, 2, ...   (24-35)

Our approach to discretizing state equation (24-31) begins with the solution of that equation.

Theorem 24-1. The solution to state equation (24-31) can be expressed as

x(t) = Φ(t,t₀)x(t₀) + ∫ from t₀ to t of Φ(t,τ)[C(τ)u(τ) + G(τ)w(τ)]dτ   (24-36)

where state transition matrix Φ(t,τ) is the solution to the following homogeneous differential equation:

Φ̇(t,τ) = F(t)Φ(t,τ)   (24-37)

Φ(t,t) = I   (24-38)

Sampling (24-36) at t = t_{k+1}, with t₀ = t_k and u(t) held piecewise constant over [t_k, t_{k+1}], we obtain

x(k + 1) = Φ(k + 1,k)x(k) + Ψ(k + 1,k)u(k) + w_d(k)   (24-39)

where

Φ(k + 1,k) = Φ(t_{k+1}, t_k)   (24-40)

Ψ(k + 1,k) = ∫ from t_k to t_{k+1} of Φ(t_{k+1},τ)C(τ)dτ   (24-41)

and w_d(k) is a discrete-time white Gaussian sequence that is statistically equivalent to

∫ from t_k to t_{k+1} of Φ(t_{k+1},τ)G(τ)w(τ)dτ   (24-42)

The mean and covariance matrices of w_d(k) are

E{w_d(k)} = ∫ from t_k to t_{k+1} of Φ(t_{k+1},τ)G(τ)E{w(τ)}dτ = 0

and

Q_d(k) = ∫ from t_k to t_{k+1} of Φ(t_{k+1},τ)G(τ)Q(τ)G′(τ)Φ′(t_{k+1},τ)dτ   (24-43)

respectively.
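Theorem 24-1 is straightforward to verify numerically for a constant F: integrating the homogeneous equation (24-37) from t_k to t_{k+1} must reproduce e^{FT}. A sketch (ours, not the text's), assuming numpy, with an illustrative 2 × 2 system:

```python
import numpy as np

# Illustrative constant F (not from the text).
F = np.array([[0.0, 1.0],
              [-2.0, -3.0]])
T, steps = 0.1, 100

def expm_series(A, terms=30):
    """Matrix exponential via its power series (adequate for small ||A||)."""
    out, term = np.eye(len(A)), np.eye(len(A))
    for i in range(1, terms):
        term = term @ A / i
        out = out + term
    return out

# Integrate Phidot = F Phi, Phi(t_k, t_k) = I, with classical RK4.
Phi = np.eye(2)                     # initial condition (24-38)
dt = T / steps
for _ in range(steps):
    k1 = F @ Phi
    k2 = F @ (Phi + 0.5 * dt * k1)
    k3 = F @ (Phi + 0.5 * dt * k2)
    k4 = F @ (Phi + dt * k3)
    Phi = Phi + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

err = np.linalg.norm(Phi - expm_series(F * T))
```

For time-varying F(t) the series shortcut is unavailable and numerical integration of (24-37), as sketched here, is the general route the text describes for computing Φ(k + 1,k), Ψ(k + 1,k), and Q_d(k).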

Observe, from the right-hand sides of Equations (24-40), (24-41), and (24-43), that these quantities can be computed from knowledge about F(t), C(t), G(t), and Q(t). In general, we must compute Φ(k + 1,k), Ψ(k + 1,k), and Q_d(k) using numerical integration, and these matrices change from one time interval to the next, because F(t), C(t), G(t), and Q(t) usually change from one time interval to the next.

Because our measurements have been assumed to be available only at sampled values of t, namely at t = tᵢ, i = 1, 2, ..., we can express (24-32) as

z(k + 1) = H(k + 1)x(k + 1) + v(k + 1)   (24-44)

Equations (24-39) and (24-44) constitute our discretized state-variable model.

Example 24-3
Great simplifications of the calculations in (24-40), (24-41), and (24-43) occur if F(t), C(t), G(t), and Q(t) are approximately constant during the time interval [t_k, t_{k+1}], i.e., if

F(t) ≈ F_k,  C(t) ≈ C_k,  G(t) ≈ G_k,  and  Q(t) ≈ Q_k  for t ∈ [t_k, t_{k+1}]   (24-45)

where we have assumed that t_{k+1} − t_k = T. In this case,

Φ(k + 1,k) = e^{F_kT}   (24-46)

The matrix exponential is given by the infinite series

e^{F_kT} = I + F_kT + F_k²T²/2! + ⋯   (24-47)

and, for sufficiently small values of T,

e^{F_kT} ≈ I + F_kT   (24-48)

We use this approximation for e^{F_kT} in deriving simpler expressions for Ψ(k + 1,k) and Q_d(k). Comparable results can be obtained for higher-order truncations of e^{F_kT}.

Substituting (24-46) into (24-41), we find that

Ψ(k + 1,k) = ∫ from t_k to t_{k+1} of Φ(t_{k+1},τ)C_k dτ = ∫ from t_k to t_{k+1} of e^{F_k(t_{k+1}−τ)}C_k dτ ≈ ∫ from t_k to t_{k+1} of [I + F_k(t_{k+1} − τ)]C_k dτ   (24-49)

so that

Ψ(k + 1,k) ≈ C_kT + ½F_kC_kT²   (24-50)

where we have used the truncated expansion (24-48) of e^{F_kT}. Proceeding in a similar manner for Q_d(k), it is straightforward to show that

Q_d(k) ≈ G_kQ_kG_k′T   (24-51)

Note that (24-47), (24-49), (24-50), and (24-51), while much simpler than their original expressions, can change in value from one time interval to another, because of their dependence upon k.  □

DISCRETIZED PERTURBATION STATE-VARIABLE MODEL

Applying the results of the preceding section to the perturbation state-variable model in (24-23) and (24-30), we obtain the following discretized perturbation state-variable model:

δx(k + 1) = Φ(k + 1,k;*)δx(k) + Ψ(k + 1,k;*)δu(k) + w_d(k)   (24-52)

δz(k + 1) = Hx(k + 1;*)δx(k + 1) + Hu(k + 1;*)δu(k + 1) + v(k + 1)   (24-53)

The notation Φ(k + 1,k;*), for example, denotes the fact that this matrix depends on x*(t) and u*(t). More specifically,

Φ(k + 1,k;*) = Φ(t_{k+1}, t_k;*)   (24-54)

where

Φ̇(t,τ;*) = Fx[x*(t),u*(t),t]Φ(t,τ;*),  Φ(t,t;*) = I   (24-55)

Additionally,

Ψ(k + 1,k;*) = ∫ from t_k to t_{k+1} of Φ(t_{k+1},τ;*)Fu[x*(τ),u*(τ),τ]dτ   (24-56)

and

Q_d(k;*) = ∫ from t_k to t_{k+1} of Φ(t_{k+1},τ;*)G(τ)Q(τ)G′(τ)Φ′(t_{k+1},τ;*)dτ   (24-57)
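The small-T formulas of Example 24-3 are one-liners in code. A sketch (ours, not the text's), assuming numpy, with illustrative constant matrices:

```python
import numpy as np

# Illustrative constant matrices over [t_k, t_{k+1}] (not from the text).
F = np.array([[0.0, 1.0],
              [-1.0, -0.5]])
C = np.array([[0.0],
              [1.0]])
G = np.eye(2)
Q = 0.2 * np.eye(2)
T = 0.01

Phi_approx = np.eye(2) + F * T                 # eq. (24-48)
Psi_approx = C * T + 0.5 * (F @ C) * T**2      # eq. (24-50)
Qd_approx = G @ Q @ G.T * T                    # eq. (24-51)

# Compare Phi_approx against a well-converged matrix-exponential series;
# the discrepancy is O(T^2), hence small for T = 0.01.
out, term = np.eye(2), np.eye(2)
for i in range(1, 30):
    term = term @ (F * T) / i
    out = out + term
phi_err = np.linalg.norm(out - Phi_approx)
```

As the text notes, even these simplified quantities must be recomputed whenever F_k, C_k, G_k, or Q_k changes from one interval to the next.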

PROBLEMS

24-1. Derive the Taylor series expansion of h[x(t), u(t), t] given in (24-25).
24-2. Derive the formula for Q_d(k) given in (24-51).
24-3. Derive formulas for Ψ(k + 1,k) and Q_d(k) that include first- and second-order effects of T, using the first three terms in the expansion of e^{F_kT}.
24-4. Let a zero-mean stationary Gaussian random process v(t) have the autocorrelation function φ_v(τ) given by

φ_v(τ) = e^(−|τ|) + e^(−2|τ|)

(a) Show that this colored-noise process can be generated by passing unit-intensity white noise μ(t) through the linear system whose transfer function is

√6 (s + √2) / [(s + 1)(s + 2)]

(b) Obtain a discrete-time state-variable model for this colored noise process (assume T = 1 msec).
24-5. This problem presents a model for estimation of the altitude, velocity, and constant ballistic coefficient of a vertically falling body (Athans et al., 1968). The measurements are made at discrete instants of time by a radar that measures range in the presence of discrete-time white Gaussian noise. The state equations for the falling body are

ẋ₁ = −x₂
ẋ₂ = −e^(−γx₁)x₂²x₃
ẋ₃ = 0

where γ = 5 × 10⁻⁵, x₁(t) is altitude, x₂(t) is downward velocity, and x₃ is a constant ballistic parameter. Measured range is given by z(k) = √(M² + [x₁(k) − Z]²) + v(k), where M and Z are known constants that locate the radar.

Lesson 25

Iterated Least Squares and Extended Kalman Filtering

This lesson is primarily devoted to the extended Kalman filter (EKF), which is a form of the Kalman filter "extended" to nonlinear dynamical systems of the type described in Lesson 24. We shall show that the EKF is related to the method of iterated least squares (ILS), the major difference being that the EKF is for dynamical systems whereas ILS is not.

ITERATED LEAST SQUARES

We shall illustrate the method of ILS for the nonlinear model described in Example 2-5 of Lesson 2, i.e., for the model

z(k) = f(θ,k) + v(k)   (25-1)

where k = 1, 2, ..., N. Iterated least squares is basically a four-step procedure:

1. Linearize f(θ,k) about a nominal value of θ, θ*. Doing this, we obtain the perturbation measurement equation

δz(k) = F_θ(k;θ*)δθ + v(k),  k = 1, 2, ..., N   (25-2)

where

δz(k) = z(k) − z*(k) = z(k) − f(θ*,k)   (25-3)

δθ = θ − θ*   (25-4)

and F_θ(k;θ*) denotes the derivative of f(θ,k) with respect to θ, evaluated at θ = θ*   (25-5)

2. Concatenate (25-2) and compute δθ̂_LS(N) [or δθ̂_WLS(N)] using our Lesson 3 formulas.
3. Solve the equation

δθ̂_LS(N) = θ̂_ILS(N) − θ*   (25-6)

for θ̂_ILS(N), i.e.,

θ̂_ILS(N) = θ* + δθ̂_LS(N)   (25-7)

4. Replace θ* with θ̂_ILS(N) and return to Step 1.

Iterate through these steps until convergence occurs. Let θ̂ⁱ_ILS(N) and θ̂ⁱ⁺¹_ILS(N) denote estimates of θ obtained at the ith and (i + 1)st iterations, respectively. Convergence of the ILS method occurs when

|θ̂ⁱ⁺¹_ILS(N) − θ̂ⁱ_ILS(N)| < ε   (25-8)

where ε is a prespecified small positive number.

We observe, from this four-step procedure, that ILS uses the estimate obtained from the linearized model to generate the nominal value of θ about which the nonlinear model is relinearized. Additionally, in each complete cycle of this procedure, we use both the nonlinear and linearized models. The nonlinear model is used to compute z*(k), and subsequently δz(k), using (25-3).

The notions of relinearizing about a filter output and using both the nonlinear and linearized models are also at the very heart of the EKF.

EXTENDED KALMAN FILTER

The nonlinear dynamical system of interest to us is the one described in Lesson 24. For convenience to the reader, we summarize aspects of that system next. The nonlinear state-variable model is

ẋ(t) = f[x(t),u(t),t] + G(t)w(t)   (25-9)

z(t) = h[x(t),u(t),t] + v(t),  t = tᵢ,  i = 1, 2, ...   (25-10)

Given a nominal input, u*(t), and assuming that a nominal trajectory, x*(t), exists, x*(t) and its associated nominal measurement satisfy the following nominal system model:

ẋ*(t) = f[x*(t),u*(t),t]   (25-11)

z*(t) = h[x*(t),u*(t),t],  t = tᵢ,  i = 1, 2, ...   (25-12)
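The four-step ILS procedure can be sketched numerically. The scalar model z(k) = e^(−θk) + v(k) below is our own illustration (not the text's Example 2-5), assuming numpy; for a scalar θ, step 2 reduces to a one-parameter least-squares fit:

```python
import numpy as np

# Illustrative nonlinear measurement model: z(k) = exp(-theta*k) + v(k).
rng = np.random.default_rng(3)
theta_true, N = 0.5, 50
k = np.arange(1, N + 1)
z = np.exp(-theta_true * k) + 0.01 * rng.standard_normal(N)

theta = 0.2                               # initial nominal theta*
for _ in range(50):                       # iterate steps 1-4
    z_star = np.exp(-theta * k)           # nonlinear model at theta*
    dz = z - z_star                       # perturbation measurements, (25-3)
    F = -k * np.exp(-theta * k)           # d f / d theta at theta*, (25-5)
    dtheta = (F @ dz) / (F @ F)           # least squares on the linearized model
    theta = theta + dtheta                # step 3, (25-7)
    if abs(dtheta) < 1e-10:               # convergence test, (25-8)
        break

est_err = abs(theta - theta_true)
```

Each pass uses the nonlinear model (to form z* and δz) and the linearized model (to compute δθ̂), exactly the interplay the text highlights as the heart of the EKF.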


Letting δx(t) = x(t) − x*(t), δu(t) = u(t) − u*(t), and δz(t) = z(t) − z*(t), we also have the following discretized perturbation state-variable model that is associated with a linearized version of the original nonlinear state-variable model:

δx(k + 1) = Φ(k + 1, k; *)δx(k) + Ψ(k + 1, k; *)δu(k) + w_d(k)  (25-13)

δz(k + 1) = H_x(k + 1; *)δx(k + 1) + H_u(k + 1; *)δu(k + 1) + v(k + 1)  (25-14)

In deriving (25-13) and (25-14), we made the important assumption that higher-order terms in the Taylor series expansions of f[x(t), u(t), t] and h[x(t), u(t), t] could be neglected. Of course, this is only correct as long as x(t) is "close" to x*(t) and u(t) is "close" to u*(t). If u(t) is an input derived from a feedback control law, so that u(t) = u[x(t), t], then u(t) can differ from u*(t), because x(t) will differ from x*(t). On the other hand, if u(t) does not depend on x(t), then usually u(t) is the same as u*(t), in which case δu(t) = 0. We see, therefore, that x*(t) is the critical quantity in the calculation of our discretized perturbation state-variable model.

Suppose x*(t) is given a priori; then we can compute predicted, filtered, or smoothed estimates of δx(k) by applying all of our previously derived estimators to the discretized perturbation state-variable model in (25-13) and (25-14). We can precompute x*(t) by solving the nominal differential equation (25-11). The Kalman filter associated with using a precomputed x*(t) is known as a relinearized KF.

A relinearized KF usually gives poor results, because it relies on an open-loop strategy for choosing x*(t). When x*(t) is precomputed, there is no way of forcing x*(t) to remain close to x(t), and this must be done, or else the perturbation state-variable model is invalid. Divergence of the relinearized KF often occurs; hence, we do not recommend the relinearized KF. The relinearized KF is based only on the discretized perturbation state-variable model. It does not use the nonlinear nature of the original system in an active manner.

EXTENDED KALMAN FILTER

The extended Kalman filter relinearizes the nonlinear system about each new estimate as it becomes available; i.e., at k = 0, the system is linearized about x̂(0|0). Once z(1) is processed by the EKF, so that x̂(1|1) is obtained, the system is linearized about x̂(1|1). By "linearize about x̂(1|1)," we mean that x̂(1|1) is used to calculate all the quantities needed to make the transition from x̂(1|1) to x̂(2|1), and subsequently x̂(2|2). This phrase will become clear below. The purpose of relinearizing about the filter's output is to use a better reference trajectory for x*(t). Doing this, δx = x − x̂ will be held as small as possible, so that our linearization assumptions are less likely to be violated than in the case of the relinearized KF.

The EKF is developed below in predictor-corrector format (Jazwinski, 1970). Its prediction equation is obtained by integrating the nominal differential equation for x*(t) from t_k to t_{k+1}. In order to do this, we need to know how to choose x*(t) for the entire interval of time t ∈ [t_k, t_{k+1}]. Thus far, we have only mentioned how x*(t) is chosen at t_k, i.e., as x̂(k|k).

Theorem 25-1. As a consequence of relinearizing about x̂(k|k) (k = 0, 1, 2, ...),

δx̂(t|t_k) = 0 for all t ∈ [t_k, t_{k+1}]  (25-15)

This means that

x*(t) = x̂(t|t_k) for all t ∈ [t_k, t_{k+1}]  (25-16)

Before proving this important result, we observe that it provides us with a choice of x*(t) over the entire interval of time t ∈ [t_k, t_{k+1}]; and, it states that at the left-hand side of this time interval x*(t_k) = x̂(k|k), whereas at the right-hand side of this time interval x*(t_{k+1}) = x̂(k + 1|k). The transition from x̂(k + 1|k) to x̂(k + 1|k + 1) will be made using the EKF's correction equation.

Proof. Let t_1 be an arbitrary value of t lying in the interval between t_k and t_{k+1} (see Figure 25-1). For the purposes of this derivation, we can assume that δu(k) = 0 [i.e., perturbation input δu(k) takes on no new values in the interval from t_k to t_{k+1}; recall the piecewise-constant assumption made about u(t) in the derivation of (24-37)]; i.e.,

δx(k + 1) = Φ(k + 1, k; *)δx(k) + w_d(k)

Because we relinearize about x̂(k|k), x*(t_k) = x̂(k|k), so that δx̂(t_k|t_k) = x̂(k|k) − x*(t_k) = 0; predicting δx forward from t_k with δu(k) = 0 then gives δx̂(t_1|t_k) = Φ(t_1, t_k; *)δx̂(t_k|t_k) = 0, which holds for every t_1 ∈ [t_k, t_{k+1}]. This establishes (25-15) and, consequently, (25-16). □

Using Theorem 25-1, x*(t) satisfies the nominal differential equation ẋ*(t) = f[x*(t), u*(t), t] with x*(t_k) = x̂(k|k); integrating this equation from t_k to t_{k+1} gives

x̂(k + 1|k) = x̂(k|k) + ∫ from t_k to t_{k+1} of f[x̂(t|t_k), u*(t), t] dt  (25-22)

which is the EKF predictor equation. Observe that the nonlinear nature of the system's state equation is used to determine x̂(k + 1|k). The integral in (25-22) is evaluated by means of numerical integration formulas that are initialized by f[x̂(t_k|t_k), u*(t_k), t_k].
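The prediction integral just described can be evaluated by any standard numerical integration formula. The following Python sketch (the function name, the fixed-step fourth-order Runge-Kutta rule, and the zero-order hold on u*(t) are illustrative assumptions, not the text's prescription) propagates x̂(k|k) through the nominal nonlinear dynamics to obtain x̂(k + 1|k):

```python
import numpy as np

def ekf_predict_state(x_filt, u_star, f, t_k, t_k1, steps=10):
    """Propagate x_hat(k|k) to x_hat(k+1|k) by integrating the nominal
    equation dx*/dt = f(x*, u*, t) with fixed-step RK4, holding the
    nominal input u* constant over [t_k, t_k1] (zero-order hold)."""
    h = (t_k1 - t_k) / steps
    x = np.asarray(x_filt, dtype=float)
    t = t_k
    for _ in range(steps):
        k1 = f(x, u_star, t)
        k2 = f(x + 0.5 * h * k1, u_star, t + 0.5 * h)
        k3 = f(x + 0.5 * h * k2, u_star, t + 0.5 * h)
        k4 = f(x + h * k3, u_star, t + h)
        x = x + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
        t += h
    return x  # x_hat(k+1|k)
```

Note that the very first stage evaluated, k1 = f(x̂(k|k), u*(t_k), t_k), is exactly the initialization mentioned above.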

The corrector equation for x̂(k + 1|k + 1) is obtained from the Kalman filter associated with the discretized perturbation state-variable model in (25-13) and (25-14), and is

δx̂(k + 1|k + 1) = δx̂(k + 1|k) + K(k + 1; *)[δz(k + 1) − H_x(k + 1; *)δx̂(k + 1|k) − H_u(k + 1; *)δu(k + 1)]  (25-23)

As a consequence of relinearizing about x̂(k|k), we know that

δx̂(k + 1|k) = 0  (25-24)

δx̂(k + 1|k + 1) = x̂(k + 1|k + 1) − x*(k + 1) = x̂(k + 1|k + 1) − x̂(k + 1|k)  (25-25)

and

δz(k + 1) = z(k + 1) − z*(k + 1) = z(k + 1) − h[x*(k + 1), u*(k + 1), k + 1] = z(k + 1) − h[x̂(k + 1|k), u*(k + 1), k + 1]  (25-26)

Substituting (25-24)–(25-26) into (25-23), we obtain the EKF corrector equation

x̂(k + 1|k + 1) = x̂(k + 1|k) + K(k + 1; *){z(k + 1) − h[x̂(k + 1|k), u*(k + 1), k + 1] − H_u(k + 1; *)δu(k + 1)}  (25-27)

In these equations,

K(k + 1; *) = P(k + 1|k; *)H_x'(k + 1; *)[H_x(k + 1; *)P(k + 1|k; *)H_x'(k + 1; *) + R(k + 1)]^{-1}  (25-28)

P(k + 1|k; *) = Φ(k + 1, k; *)P(k|k; *)Φ'(k + 1, k; *) + Q_d(k; *)  (25-29)

P(k + 1|k + 1; *) = [I − K(k + 1; *)H_x(k + 1; *)]P(k + 1|k; *)  (25-30)

Remember that in these three equations * denotes the use of x̂(k + 1|k). The EKF is very widely used, especially in the aerospace industry; however, it does not provide an optimal estimate of x(k). The optimal estimate of x(k) is still E{x(k)|Z(k)}, regardless of the linear or nonlinear nature of the system's model. The EKF is a first-order approximation of E{x(k)|Z(k)} that sometimes works quite well, but it cannot be guaranteed always to work well. No convergence results are known for the EKF; hence, the EKF must be viewed as an ad hoc filter. Alternatives to the EKF, which are based on nonlinear filtering, are quite complicated and are rarely used.

The EKF is designed to work well as long as δx(k) is "small." The iterated EKF (Jazwinski, 1970), depicted in Figure 25-2, is designed to keep δx(k) as small as possible. The iterated EKF differs from the EKF in that it iterates the correction equation L times, until ||x̂_L(k + 1|k + 1) − x̂_{L−1}(k + 1|k + 1)|| ≤ ε. Corrector #1 computes K(k + 1; *), P(k + 1|k; *), and P(k + 1|k + 1; *) using x* = x̂(k + 1|k); corrector #2 computes these quantities using x* = x̂_1(k + 1|k + 1); corrector #3 computes these quantities using x* = x̂_2(k + 1|k + 1); etc. Often, just adding one additional corrector (i.e., L = 2) leads to substantially better results for x̂(k + 1|k + 1) than are obtained using the EKF.
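For the discrete-time nonlinear model x(k + 1) = f[x(k), k] + w(k), z(k + 1) = h[x(k + 1), k + 1] + v(k + 1) (the form treated in Problem 25-4), one EKF predictor-corrector cycle can be sketched as follows. This is an illustrative implementation, not the book's code; the Jacobian arguments F_jac and H_jac play the roles of Φ(k + 1, k; *) and H_x(k + 1; *):

```python
import numpy as np

def ekf_step(x_filt, P, z_next, f, F_jac, h, H_jac, Q, R):
    """One EKF cycle for x(k+1) = f(x) + w, z(k+1) = h(x(k+1)) + v."""
    # Predictor: push the estimate through the nonlinear dynamics.
    x_pred = f(x_filt)
    Phi = F_jac(x_filt)                  # relinearize about x_hat(k|k)
    P_pred = Phi @ P @ Phi.T + Q
    # Corrector: gain, innovations, update; relinearize about x_hat(k+1|k).
    Hx = H_jac(x_pred)
    S = Hx @ P_pred @ Hx.T + R           # innovations covariance
    K = P_pred @ Hx.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (z_next - h(x_pred))
    P_new = (np.eye(len(x_new)) - K @ Hx) @ P_pred
    return x_new, P_new
```

The iterated EKF described above would simply repeat the corrector lines L times, re-evaluating Hx, S, and K at each refined estimate until successive iterates differ by less than ε.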

APPLICATION TO PARAMETER ESTIMATION

One of the earliest applications of the extended Kalman filter was to parameter estimation (Kopp and Orford, 1963). Consider the continuous-time linear system

Figure 25-2 Iterated EKF. All of the calculations provide us with a refined estimate of x(k + 1), x̂_L(k + 1|k + 1), starting with x̂(k|k).

ẋ(t) = Ax(t) + w(t),  t ≥ t_0  (25-31a)

z(t_i) = Hx(t_i) + v(t_i),  i = 1, 2, ...  (25-31b)

Matrices A and H contain some unknown parameters, and our objective is to estimate these parameters from the measurements z(t_i) as they become available. To begin, we assume differential equation models for the unknown parameters; i.e., either

ȧ_l(t) = 0,  l = 1, 2, ..., l*  (25-32a)

ḣ_j(t) = 0,  j = 1, 2, ..., j*  (25-32b)

or

ȧ_l(t) = c_l a_l(t) + n_l(t),  l = 1, 2, ..., l*  (25-33a)

ḣ_j(t) = d_j h_j(t) + q_j(t),  j = 1, 2, ..., j*  (25-33b)

In the latter models n_l(t) and q_j(t) are white noise processes, and one often chooses c_l = 0 and d_j = 0. The noises n_l(t) and q_j(t) introduce uncertainty about the "constancy" of the a_l and h_j parameters. Next, we augment the parameter differential equations to (25-31a) and (25-31b). The resulting system is nonlinear, because it contains products of states [e.g., a_l(t)x_i(t)]. The augmented system can be expressed as in (25-9) and (25-10), which means we have reduced the problem of parameter estimation in a linear system to state estimation in a nonlinear system. Finally, we apply the EKF to the augmented state-variable model to obtain â_l(k|k) and ĥ_j(k|k).

Ljung (1979) has studied the convergence properties of the EKF applied to parameter estimation and has shown that parameter estimates do not converge to their true values. He shows that another term must be added to the EKF corrector equation in order to guarantee convergence. For details, see his paper.
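As an illustration of this augmentation (our own sketch, with assumed noise variances; compare Problem 25-1), suppose the scalar parameter a in x(k + 1) = ax(k) + w(k), z(k + 1) = x(k + 1) + v(k + 1) is unknown. Appending a as a second state with a(k + 1) = a(k) gives a nonlinear augmented model, because the dynamics contain the product a·x, and the EKF supplies â(k|k):

```python
import numpy as np

def ekf_parameter_estimation(z, q=1.0, r=0.5):
    """Estimate a in x(k+1) = a*x(k) + w(k), z(k+1) = x(k+1) + v(k+1)
    by augmenting the state to s = [x, a]' and running an EKF.
    q and r are the process- and measurement-noise variances."""
    s = np.array([0.0, 0.0])            # initial [x_hat, a_hat]
    P = np.diag([1.0, 1.0])
    Q = np.diag([q, 0.0])               # the constant a gets no process noise
    H = np.array([[1.0, 0.0]])
    for zk in z:
        x, a = s
        s_pred = np.array([a * x, a])   # f(s): contains the product a*x
        Phi = np.array([[a, x],         # Jacobian of f at s_hat(k|k)
                        [0.0, 1.0]])
        P_pred = Phi @ P @ Phi.T + Q
        S = H @ P_pred @ H.T + r
        K = (P_pred @ H.T) / S
        s = s_pred + (K * (zk - s_pred[0])).ravel()
        P = (np.eye(2) - K @ H) @ P_pred
    return s, P
```

In simple simulations the estimate â typically settles near the true value, although, as Ljung's analysis warns, such convergence is not guaranteed in general.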

Example 25-1

Consider the satellite and planet of Example 24-1, in which the satellite's motion is governed by the two equations

r̈(t) = r(t)θ̇²(t) − γ/r²(t) + (1/m)u_r(t)  (25-34)


and

θ̈(t) = −(2/r(t))ṙ(t)θ̇(t) + (1/(m r(t)))u_θ(t)  (25-35)

We shall assume that m and γ are unknown constants, and shall model them as

ṁ(t) = 0  (25-36)

and

γ̇(t) = 0  (25-37)

We note, finally, that the modeling and augmentation approach to parameter estimation, described above, is not restricted to continuous-time linear systems. Additional situations are described in the exercises.

PROBLEMS

25-1. In the first-order system x(k + 1) = ax(k) + w(k) and z(k + 1) = x(k + 1) + v(k + 1), k = 1, 2, ..., N, a is an unknown parameter that is to be estimated. Sequences w(k) and v(k) are, as usual, mutually uncorrelated and white, and w(k) ~ N(w(k); 0, 1) and v(k) ~ N(v(k); 0, 1/2). Explain, using equations and a flowchart, how parameter a can be estimated using an EKF.

25-2. Repeat the preceding problem, where all conditions are the same except that now w(k) and v(k) are correlated, and E{w(k)v(k)} = 1/4.

25-3. The system of differential equations describing the motion of an aerospace vehicle about its pitch axis can be written as (Kopp and Orford, 1963)

ẋ_1(t) = x_2(t)
ẋ_2(t) = a_1(t)x_1(t) + a_2(t)x_2(t) + a_3(t)u(t)

where x_1 = θ̇(t), which is the actual pitch rate. Sampled measurements are made of the pitch rate, i.e.,

z(t_i) = x_1(t_i) + v(t_i),  i = 1, 2, ..., N

Noise v(t_i) is white and Gaussian, and x̂(t_0) is given. The control signal u(t) is the sum of a desired control signal u*(t) and additive noise, i.e., u(t) = u*(t) + δu(t). The additive noise δu(t) is a normally distributed random variable modulated by a function of the desired control signal, i.e.,

δu(t) = S[u*(t)]w_0(t)

where w_0(t) is zero-mean white noise with known intensity. Parameters a_i(t), i = 1, 2, 3, may be unknown and are modeled as

ȧ_i(t) = α_i(t)[a_i(t) − ā_i(t)] + w_i(t)

In this model the parameters ā_i(t) are assumed given, as are the a priori values of a_i(t) and α_i(t), and the w_i(t) are zero-mean white noises with intensities σ²_{w_i}.
(a) What are the EKF formulas for estimation of x_1, x_2, a_1, and a_2, assuming that a_3 is known?
(b) Repeat (a), but now assume that a_3 is unknown.

25-4. Suppose we begin with the nonlinear discrete-time system

x(k + 1) = f[x(k), k] + w(k)
z(k) = h[x(k), k] + v(k),  k = 1, 2, ...

Develop the EKF for this system [Hint: expand f[x(k), k] and h[x(k), k] in Taylor series about x̂(k|k) and x̂(k|k − 1), respectively].

25-5. Refer to Problem 24-7. Obtain the EKF for: (a) the equation for the unsteady operation of a synchronous motor, in which C and p are unknown; (b) Duffing's equation, in which C, α, and β are unknown; (c) Van der Pol's equation, in which ε is unknown; and (d) Hill's equation, in which a and b are unknown.

Lesson 26

Maximum-Likelihood State and Parameter Estimation

INTRODUCTION

In Lesson 11 we studied the problem of obtaining maximum-likelihood estimates of a collection of parameters, θ = col (elements of Φ, Ψ, H, and R), that appear in the state-variable model

x(k + 1) = Φx(k) + Ψu(k)  (26-1)

z(k + 1) = Hx(k + 1) + v(k + 1),  k = 0, 1, ..., N − 1  (26-2)

We determined the log-likelihood function to be

L(θ|Z) = −(1/2) Σ from i=1 to N of [z(i) − H_θ x_θ(i)]' R_θ^{−1} [z(i) − H_θ x_θ(i)] − (N/2) ln |R_θ|,  x_θ(0) known  (26-3)

where quantities that are subscripted θ denote a dependence on θ. Finally, we pointed out that the state equation (26-1), written as

x_θ(k + 1) = Φ_θ x_θ(k) + Ψ_θ u(k)  (26-4)

acts as a constraint that is associated with the computation of the log-likelihood function. Parameter vector θ must be determined by maximizing L(θ|Z) subject to the constraint (26-4). This can only be done using mathematical programming techniques (i.e., an optimization algorithm such as steepest descent or Marquardt-Levenberg).

In this lesson we study the problem of obtaining maximum-likelihood estimates of a collection of parameters, also denoted θ, that appear in our basic state-variable model,

x(k + 1) = Φx(k) + Γw(k) + Ψu(k)  (26-5)

z(k + 1) = Hx(k + 1) + v(k + 1),  k = 0, 1, ..., N − 1  (26-6)

Now, however,

θ = col (elements of Φ, Γ, Ψ, H, Q, and R)  (26-7)

and we assume that θ is d × 1. As in Lesson 11, we shall assume that θ is identifiable. Before we can determine θ̂_ML, we must establish the log-likelihood function for our basic state-variable model.

A LOG-LIKELIHOOD FUNCTION FOR THE BASIC STATE-VARIABLE MODEL

As always, we must compute p(Z|θ) = p(z(1), z(2), ..., z(N)|θ). This is difficult to do for the basic state-variable model, because the measurements are all correlated, due to the presence of either the process noise w(k), or random initial conditions, or both. This represents the major difference between our basic state-variable model, (26-5) and (26-6), and the state-variable model studied earlier, in (26-1) and (26-2). Fortunately, the measurements and innovations are causally invertible, and the innovations are all uncorrelated, so it is still relatively easy to determine the log-likelihood function for the basic state-variable model.

Theorem 26-1. The log-likelihood function for our basic state-variable model in (26-5) and (26-6) is

L(θ|Z) = −(1/2) Σ from j=1 to N of [z̃'_θ(j|j − 1) N_θ^{−1}(j|j − 1) z̃_θ(j|j − 1) + ln |N_θ(j|j − 1)|]  (26-9)

where z̃_θ(j|j − 1) is the innovations process, and N_θ(j|j − 1) is the covariance of that process [in Lesson 16 we used the symbol P_z̃(j|j − 1) for this covariance],

N_θ(j|j − 1) = H_θ P_θ(j|j − 1) H'_θ + R_θ  (26-10)

This theorem is also applicable to either time-varying or nonstationary systems or both. Within the structure of these more complicated systems there must still be a collection of unknown but constant parameters. It is these parameters that are estimated by maximizing L(θ|Z).

Proof (Mendel, 1983b, pp. 101–103). We must first obtain the joint density function p(Z|θ) = p(z(1), ..., z(N)|θ). In Lesson 17 we saw that the innovations process z̃(i|i − 1) and measurement z(i) are causally invertible; thus, the density function

p(z̃(1|0), z̃(2|1), ..., z̃(N|N − 1)|θ)

contains the same data information as p(z(1), ..., z(N)|θ) does. Consequently, L(θ|Z) can be replaced by L(θ|Z̃), where

Z̃ = col (z̃(1|0), ..., z̃(N|N − 1))  (26-11)

and

L(θ|Z̃) = ln p(z̃(1|0), ..., z̃(N|N − 1)|θ)  (26-12)

Now, however, we use the fact that the innovations process is Gaussian white noise to express p(Z̃|θ) as

p(Z̃|θ) = Π from j=1 to N of p_j(z̃(j|j − 1)|θ)  (26-13)

For our basic state-variable model, the innovations are Gaussian distributed, which means that p_j(z̃(j|j − 1)|θ) = p(z̃(j|j − 1)|θ) for j = 1, ..., N; hence,

L(θ|Z̃) = ln Π from j=1 to N of p(z̃(j|j − 1)|θ)  (26-14)

From part (b) of Theorem 16-2 in Lesson 16 we know that

p(z̃(j|j − 1)|θ) = (2π)^{−m/2} |N_θ(j|j − 1)|^{−1/2} exp[−(1/2) z̃'_θ(j|j − 1) N_θ^{−1}(j|j − 1) z̃_θ(j|j − 1)]  (26-15)

Substitute (26-15) into (26-14) to show that

L(θ|Z̃) = −(1/2) Σ from j=1 to N of [z̃'(j|j − 1) N^{−1}(j|j − 1) z̃(j|j − 1) + ln |N(j|j − 1)|]  (26-16)

where, by convention, we have neglected the constant term −ln(2π)^{Nm/2}, because it does not depend on θ. Because p(Z̃|θ) and p(Z|θ) contain the same information about the data, L(θ|Z̃) and L(θ|Z) must also contain the same information about the data; hence, we can use L(θ|Z) to denote the right-hand side of (26-16), as in (26-9). To indicate which quantities on the right-hand side of (26-9) may depend on θ, we have subscripted all such quantities with θ. □

The innovations process z̃_θ(j|j − 1) can be generated by a Kalman filter; hence, the Kalman filter acts as a constraint that is associated with the computation of the log-likelihood function for the basic state-variable model.

In the present situation, where the true values of θ, θ_T, are not known but are being estimated, the estimate of x(i) obtained from a Kalman filter will be suboptimal, due to wrong values of θ being used by that filter. In fact, we must use θ̂_ML in the implementation of the Kalman filter, because θ̂_ML will be the best information available about θ_T. If θ̂_ML → θ_T as N → ∞, then x̂_θ̂ML(j|j − 1) → x̂_θT(j|j − 1) as N → ∞, and the suboptimal Kalman filter will approach the optimal Kalman filter. This result is about the best that one can hope for in maximum-likelihood estimation of parameters in our basic state-variable model. Note also that, although we began with a parameter estimation problem, we wound up with a simultaneous state and parameter estimation problem. This is due to the uncertainties present in our state equation, which necessitated state estimation using a Kalman filter.

ON COMPUTING θ̂_ML

How do we determine θ̂_ML for L(θ|Z) given in (26-9) (subject to the constraint of the Kalman filter)? No simple closed-form solution is possible, because θ enters into L(θ|Z) in a complicated nonlinear manner. The only way presently known to obtain θ̂_ML is by means of mathematical programming. The most effective optimization methods to determine θ̂_ML require the computation of the gradient of L(θ|Z) as well as the Hessian matrix, or a pseudo-Hessian matrix, of L(θ|Z). The Marquardt-Levenberg algorithm (also known as the Levenberg-Marquardt algorithm [Bard, 1970; Marquardt, 1963]), for example, has the form

θ̂^{i+1}_ML = θ̂^i_ML − (H_i + D_i)^{−1} g_i,  i = 0, 1, ...  (26-17)

where g_i denotes the gradient

g_i = ∂L(θ|Z)/∂θ evaluated at θ = θ̂^i_ML  (26-18)

H_i denotes a pseudo-Hessian

H_i = ∂²L(θ|Z)/∂θ² evaluated at θ = θ̂^i_ML  (26-19)

and D_i is a diagonal matrix chosen to force H_i + D_i to be positive definite, so that (H_i + D_i)^{−1} will always be computable. We do not propose to discuss details of the Marquardt-Levenberg algorithm. The interested reader should consult the preceding references for general discussions, and Mendel (1983b) or Gupta and Mehra (1974) for discussions directly related to the application of this algorithm to the present problem, maximization of L(θ|Z).
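To make this loop concrete, here is an illustrative sketch for a scalar system of the kind in the problems for this lesson, x(k + 1) = ax(k) + w(k), z(k + 1) = x(k + 1) + v(k + 1): a Kalman filter evaluates the log-likelihood of (26-9) from the innovations, and a damped Newton step in the spirit of (26-17) updates the parameter. The finite-difference derivatives, damping constant, and clipping to the stable region are assumptions of this sketch, standing in for the sensitivity equations discussed next:

```python
import numpy as np

def log_likelihood(a, z, q=1.0, r=0.5):
    """L(theta|Z) of (26-9) for the scalar model: run a Kalman filter
    and accumulate the innovations terms."""
    x, P, L = 0.0, 1.0, 0.0
    for zk in z:
        x_pred = a * x
        P_pred = a * a * P + q
        N = P_pred + r                 # innovations covariance
        e = zk - x_pred                # innovations
        L -= 0.5 * (e * e / N + np.log(N))
        K = P_pred / N
        x = x_pred + K * e
        P = (1.0 - K) * P_pred
    return L

def ml_estimate(z, a0=0.0, n_iter=20, d=1e-5, damp=1.0):
    """Damped Newton iteration in the spirit of (26-17) with a
    finite-difference gradient and (pseudo-)Hessian."""
    a = a0
    for _ in range(n_iter):
        g = (log_likelihood(a + d, z) - log_likelihood(a - d, z)) / (2 * d)
        H = (log_likelihood(a + d, z) - 2 * log_likelihood(a, z)
             + log_likelihood(a - d, z)) / (d * d)
        # H is negative near a maximum; damping keeps the step well defined,
        # and clipping keeps the iterate in the stable region |a| < 1.
        a = float(np.clip(a - g / (H - damp), -0.999, 0.999))
    return a
```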

We direct our attention to the calculations of g_i and H_i. The gradient of L(θ|Z) will require the calculations of ∂z̃_θ(j|j − 1)/∂θ_i and ∂N_θ(j|j − 1)/∂θ_i. The innovations depend upon x̂_θ(j|j − 1); hence, in order to compute ∂z̃(j|j − 1)/∂θ, we must compute ∂x̂_θ(j|j − 1)/∂θ. A Kalman filter must be used to compute x̂_θ(j|j − 1); but this filter requires the following sequence of calculations: P_θ(k|k) → P_θ(k + 1|k) → K_θ(k + 1) → x̂_θ(k + 1|k) → x̂_θ(k + 1|k + 1). Taking the partial derivative of the prediction equation with respect to θ_i, we find that

∂x̂_θ(k + 1|k)/∂θ_i = Φ_θ ∂x̂_θ(k|k)/∂θ_i + (∂Φ_θ/∂θ_i) x̂_θ(k|k) + (∂Ψ_θ/∂θ_i) u(k),  i = 1, 2, ..., d  (26-20)

We see that to compute ∂x̂_θ(k + 1|k)/∂θ_i, we must also compute ∂x̂_θ(k|k)/∂θ_i. Taking the partial derivative of the correction equation with respect to θ_i, we find that

∂x̂_θ(k + 1|k + 1)/∂θ_i = ∂x̂_θ(k + 1|k)/∂θ_i + [∂K_θ(k + 1)/∂θ_i][z(k + 1) − H_θ x̂_θ(k + 1|k)] − K_θ(k + 1)[(∂H_θ/∂θ_i) x̂_θ(k + 1|k) + H_θ ∂x̂_θ(k + 1|k)/∂θ_i],  i = 1, 2, ..., d  (26-21)

Observe that to compute ∂x̂_θ(k + 1|k + 1)/∂θ_i, we must also compute ∂K_θ(k + 1)/∂θ_i. We leave it to the reader to show that the calculation of ∂K_θ(k + 1)/∂θ_i requires the calculation of ∂P_θ(k + 1|k)/∂θ_i, which in turn requires the calculation of ∂P_θ(k + 1|k + 1)/∂θ_i. The system of equations for

∂x̂_θ(k + 1|k)/∂θ_i
∂x̂_θ(k + 1|k + 1)/∂θ_i
∂K_θ(k + 1)/∂θ_i
∂P_θ(k + 1|k)/∂θ_i
∂P_θ(k + 1|k + 1)/∂θ_i

is called a Kalman filter sensitivity system. It is a linear system of equations, just as the Kalman filter is, which is not only driven by measurements z(k + 1) [e.g., see (26-21)] but is also driven by the Kalman filter [e.g., see (26-20) and (26-21)]. We need a total of d such sensitivity systems, one for each of the d unknown parameters in θ. Each system of sensitivity equations requires about as much computation as a Kalman filter. Observe, however, that Kalman filter quantities are used by the sensitivity equations; hence, the Kalman filter must be run together with the d sets of sensitivity equations. This procedure for recursively calculating the gradient ∂L(θ|Z)/∂θ therefore requires about as much computation as d + 1 Kalman filters. The sensitivity systems are totally uncoupled and lend themselves quite naturally to parallel processing (see Figure 26-1).

The Hessian matrix of L(θ|Z) is quite complicated, involving not only first derivatives of z̃_θ(j|j − 1) and N_θ(j|j − 1), but also their second derivatives. The pseudo-Hessian matrix of L(θ|Z) ignores all the second-derivative terms; hence, it is relatively easy to compute, because all the first-derivative terms have already been computed in order to calculate the gradient of L(θ|Z). Justification for neglecting the second-derivative terms is given by Gupta and Mehra (1974), who show that, as θ̂_ML approaches θ_T, the expected value of the dropped terms goes to zero.

The estimation literature is filled with many applications of maximum-likelihood state and parameter estimation. For example, Mendel (1983b) applies it to seismic data processing, Mehra and Tyler (1973) apply it to aircraft parameter identification, and McLaughlin (1980) applies it to groundwater flow.

Figure 26-1 Calculations needed to compute gradient vector g_i. Note that θ_j denotes the jth component of θ (Mendel, 1983b, © 1983, Academic Press, Inc.).
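For the scalar system x(k + 1) = ax(k) + w(k), z(k + 1) = x(k + 1) + v(k + 1), the sensitivity system for θ = a can be written out explicitly and run alongside the Kalman filter, exactly as the figure suggests; each derivative line below is the a-derivative of the corresponding filter equation. This is our own illustrative sketch, with assumed noise variances:

```python
import numpy as np

def kf_with_sensitivity(a, z, q=1.0, r=0.5):
    """Kalman filter for x(k+1) = a*x(k) + w, z(k+1) = x(k+1) + v, run
    together with its sensitivity system with respect to a; returns the
    final state estimate and its derivative d x_hat / d a."""
    x, P = 0.0, 1.0
    dx, dP = 0.0, 0.0              # sensitivities: dx_hat/da, dP/da
    for zk in z:
        # predictor and its derivative
        x_pred = a * x
        dx_pred = x + a * dx
        P_pred = a * a * P + q
        dP_pred = 2.0 * a * P + a * a * dP
        # gain K = P_pred / N, with N = P_pred + r, and their derivatives
        N = P_pred + r
        K = P_pred / N
        dK = dP_pred * r / (N * N)
        # corrector and its derivative
        e = zk - x_pred
        x = x_pred + K * e
        dx = dx_pred + dK * e - K * dx_pred
        P = (1.0 - K) * P_pred
        dP = (1.0 - K) * dP_pred - dK * P_pred
    return x, dx
```

Because the recursion differentiates the filter map exactly, its output agrees with a finite-difference check to numerical precision, and it costs about one extra Kalman-filter's worth of computation per parameter, as noted above.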

A STEADY-STATE APPROXIMATION

Suppose our basic state-variable model is time-invariant and stationary, so that P̄ = lim as j → ∞ of P(j|j − 1) exists. Let

N̄ = HP̄H' + R  (26-22)

and

L̄(θ|Z) = −(1/2) Σ from j=1 to N of z̃'(j|j − 1) N̄^{−1} z̃(j|j − 1) − (N/2) ln |N̄|  (26-23)

Log-likelihood function L̄(θ|Z) is a steady-state approximation of L(θ|Z). The steady-state Kalman filter used to compute z̃(j|j − 1) and N̄ is

x̂(k + 1|k) = Φx̂(k|k) + Ψu(k)  (26-24)

x̂(k + 1|k + 1) = x̂(k + 1|k) + K̄[z(k + 1) − Hx̂(k + 1|k)]  (26-25)

in which K̄ is the steady-state Kalman gain matrix. Recall that

θ = col (elements of Φ, Γ, Ψ, H, Q, and R)  (26-26)

We ignore Γ (initially) for reasons that are explained below Equation (26-32). We now make the following transformations of variables,

K̄ = K̄(θ) and N̄ = N̄(θ)  (26-27)

and let

φ = col (elements of Φ, Ψ, H, K̄, and N̄)  (26-28)

where φ is p × 1. Viewing L̄ as a function of φ, i.e.,

L̄(φ|Z) = −(1/2) Σ from j=1 to N of z̃'(j|j − 1) N̄^{−1} z̃(j|j − 1) − (N/2) ln |N̄|  (26-29)

instead of finding θ̂_ML that maximizes L̄(θ|Z), subject to the constraints of a full-blown Kalman filter, we now propose to find φ̂_ML that maximizes L̄(φ|Z), subject to the constraints of the following filter:

x̂_φ(k + 1|k) = Φ_φ x̂_φ(k|k) + Ψ_φ u(k)  (26-30)

x̂_φ(k + 1|k + 1) = x̂_φ(k + 1|k) + K̄_φ z̃_φ(k + 1|k)  (26-31)

z̃_φ(k + 1|k) = z(k + 1) − H_φ x̂_φ(k + 1|k)  (26-32)

Once we have computed φ̂_ML, we can compute θ̂_ML by inverting the transformations in (26-27). Of course, when we do this, we are also using the invariance property of maximum-likelihood estimates. Observe that L̄(φ|Z) in (26-29) and the filter in (26-30)–(26-32) do not depend on Γ; hence, we have not included Γ in any definition of φ. We explain how to reconstruct Γ from φ̂_ML following Equation (26-44). Because maximum-likelihood estimates are asymptotically efficient (Lesson 11), once we have determined φ̂_ML, the filter in (26-30) and (26-31) will be the steady-state Kalman filter.

The major advantage of this steady-state approximation is that the filter sensitivity equations are greatly simplified. When K̄ and N̄ are treated as matrices of unknown parameters, we do not need the predicted and corrected error-covariance matrices to "compute" K̄ and N̄. The sensitivity equations for (26-32), (26-30), and (26-31) are

∂z̃_φ(k + 1|k)/∂φ_i = −(∂H_φ/∂φ_i) x̂_φ(k + 1|k) − H_φ ∂x̂_φ(k + 1|k)/∂φ_i  (26-33)

∂x̂_φ(k + 1|k)/∂φ_i = Φ_φ ∂x̂_φ(k|k)/∂φ_i + (∂Φ_φ/∂φ_i) x̂_φ(k|k) + (∂Ψ_φ/∂φ_i) u(k)  (26-34)

∂x̂_φ(k + 1|k + 1)/∂φ_i = ∂x̂_φ(k + 1|k)/∂φ_i + K̄_φ ∂z̃_φ(k + 1|k)/∂φ_i + (∂K̄_φ/∂φ_i) z̃_φ(k + 1|k)  (26-35)

where i = 1, 2, ..., p. Note that ∂K̄_φ/∂φ_i is zero for all φ_i not in K̄ and is a matrix filled with zeros and a single unity value for φ_i in K̄. There are more elements in φ than in θ, because K̄ and N̄ have more unknown elements in them than do Q and R; i.e., p > d.

Additionally, N̄ does not appear in the filter equations; it only appears in L̄(φ|Z). It is, therefore, possible to obtain a closed-form solution for N̄_ML.

Theorem 26-2. A closed-form solution for matrix N̄_ML is

N̄_ML = (1/N) Σ from j=1 to N of z̃(j|j − 1) z̃'(j|j − 1)  (26-36)

Proof. To determine N̄_ML we must set ∂L̄(φ|Z)/∂N̄ = 0 and solve the resulting equation for N̄_ML. This is most easily accomplished by applying gradient matrix formulas to (26-29) that are given in Schweppe (1974). Doing this, we obtain

−(1/2) Σ from j=1 to N of [N̄^{−1} − N̄^{−1} z̃(j|j − 1) z̃'(j|j − 1) N̄^{−1}] = 0  (26-37)

whose solution is N̄_ML in (26-36). □

Observe that N̄_ML is the sample steady-state covariance matrix of z̃; i.e., N̄_ML → lim as j → ∞ of cov[z̃(j|j − 1)] as N → ∞.

Suppose we are also interested in determining Q̂_ML and R̂_ML. How do we obtain these quantities from φ̂_ML? As in Lesson 19, we let K̄, P̄, and P̄_1 denote the steady-state values of K(k + 1), P(k + 1|k), and P(k|k), respectively, where

K̄ = P̄H'(HP̄H' + R)^{−1} = P̄H'N̄^{−1}  (26-38)

P̄ = ΦP̄_1Φ' + ΓQΓ'  (26-39)

and

P̄_1 = (I − K̄H)P̄  (26-40)

Additionally, we know that

N̄ = HP̄H' + R  (26-41)

By the invariance property of maximum-likelihood estimates, we know that

N̄_ML = Ĥ_ML P̄_ML Ĥ'_ML + R̂_ML  (26-42)

and

K̄_ML = P̄_ML Ĥ'_ML N̄_ML^{−1}  (26-43)

Solving (26-43) for P̄_ML Ĥ'_ML and substituting the resulting expression into (26-42), we obtain the following solution for R̂_ML:

R̂_ML = (I − Ĥ_ML K̄_ML) N̄_ML  (26-44)

No closed-form solution exists for Q̂_ML. Substituting Φ̂_ML, Ĥ_ML, K̄_ML, and N̄_ML into (26-38)–(26-40), and combining (26-39) and (26-40), we obtain the following two equations:

P̄ = Φ̂_ML(I − K̄_ML Ĥ_ML)P̄ Φ̂'_ML + (ΓQΓ')_ML  (26-45)

and

P̄ Ĥ'_ML = K̄_ML N̄_ML  (26-46)

These equations must be solved simultaneously for P̄ and (ΓQΓ')_ML using iterative numerical techniques. For details, see Mehra (1970a). Note, finally, that the best for which we can hope by this approach is not Γ̂_ML and Q̂_ML, but only (ΓQΓ')_ML. This is due to the fact that, when Γ and Q are both unknown, there will be an ambiguity in their determination; i.e., the term Γw(k), which appears in our basic state-variable model [for which E{w(k)w'(k)} = Q], cannot be distinguished from the term w_1(k), for which

E{w_1(k)w_1'(k)} = Q_1 = ΓQΓ'  (26-47)

This observation is also applicable to the original problem formulation, wherein we obtained θ̂_ML directly; i.e., when both Γ and Q are unknown, we should really choose

θ = col (elements of Φ, Ψ, H, ΓQΓ', and R)  (26-48)

In summary, when our basic state-variable model is time-invariant and stationary, we can first obtain φ̂_ML by maximizing L̄(φ|Z) given in (26-29), subject to the constraints of the simple filter in (26-30), (26-31), and (26-32). A mathematical programming method must be used to obtain those elements of φ̂_ML associated with Φ̂_ML, Ψ̂_ML, Ĥ_ML, and K̄_ML. The closed-form solution, given in (26-36), is used for N̄_ML. Finally, if we want to reconstruct R̂_ML and (ΓQΓ')_ML, we use (26-44) for the former and must solve (26-45) and (26-46) for the latter.

Example 26-1 (Mehra, 1971). The following fourth-order system, which represents the short-period dynamics and the first bending mode of a missile, was simulated:

x(k + 1) = [ 0 1 0 0 ; 0 0 1 0 ; 0 0 0 1 ; −α_1 −α_2 −α_3 −α_4 ] x(k) + [ 0 ; 0 ; 0 ; 1 ] w(k)  (26-49)

z(k + 1) = x_1(k + 1) + v(k + 1)  (26-50)

For this model, it was assumed that x(0) = 0, q = 1.0, r = 0.25, α_1 = −0.656, α_2 = 0.784, α_3 = −0.18, and α_4 = 1.0. Using measurements generated from the simulation, maximum-likelihood estimates were obtained for φ, where

φ = col (α_1, α_2, α_3, α_4, K̄_1, K̄_2, K̄_3, K̄_4, N̄)  (26-51)

In (26-51), N̄ is a scalar because, in this example, z(k) is a scalar. Additionally, it was assumed that x(0) was known exactly. According to Mehra (1971, pg. 30), "The starting values for the maximum likelihood scheme were obtained using a correlation technique given in Mehra (1970b). The results of successive iterations are shown in Table 26-1. The variances of the estimates obtained from the matrix of second partial derivatives (i.e., the Hessian matrix of L̄) are also given. For comparison purposes, results obtained by using 1000 data points and 100 data points are given." □
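Theorem 26-2's closed form (26-36) is simply the sample covariance of the innovations sequence, which is trivial to compute once the filter of (26-30)–(26-32) has been run. A minimal numerical sketch (the white test sequence below is an assumption of the example, not Mehra's data):

```python
import numpy as np

def n_bar_ml(innovations):
    """Closed-form N_bar_ML of (26-36): the sample steady-state covariance
    (1/N) * sum over j of z_tilde(j|j-1) z_tilde'(j|j-1)."""
    Z = np.atleast_2d(np.asarray(innovations, dtype=float))
    if Z.shape[0] > Z.shape[1]:
        Z = Z.T                  # arrange as m x N (rows = components)
    return (Z @ Z.T) / Z.shape[1]
```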

TABLE 26-1 Parameter Estimates for Missile Example [iterations 1–14 of the maximum-likelihood scheme using 1000 data points, iteration 30 using 100 data points, true parameter values, and estimates of the standard deviations; starting values from the correlation technique of Mehra (1970b)]. Source: Mehra (1971, pg. 30), © 1971, AIAA.

PROBLEMS

26-1. Obtain the sensitivity equations for ∂K_θ(k + 1)/∂θ_i, ∂P_θ(k + 1|k)/∂θ_i, and ∂P_θ(k + 1|k + 1)/∂θ_i. Explain why the sensitivity system for ∂x̂_θ(k + 1|k)/∂θ_i and ∂x̂_θ(k + 1|k + 1)/∂θ_i is linear.

26-2. Compute formulas for g_i and H_i. Then simplify H_i to a pseudo-Hessian.

26-3. In the first-order system x(k + 1) = ax(k) + w(k) and z(k + 1) = x(k + 1) + v(k + 1), k = 1, 2, ..., N, a is an unknown parameter that is to be estimated. Sequences w(k) and v(k) are, as usual, mutually uncorrelated and white, and w(k) ~ N(w(k); 0, 1) and v(k) ~ N(v(k); 0, 1/2). Explain, using equations and a flowchart, how parameter a can be estimated using a MLE.

26-4. Repeat the preceding problem, where all conditions are the same except that now w(k) and v(k) are correlated, and E{w(k)v(k)} = 1/4.

26-5. We are interested in estimating the parameters a and r in the following first-order system:

x(k + 1) + ax(k) = w(k)
z(k) = x(k) + v(k),  k = 1, 2, ..., N

Signals w(k) and v(k) are mutually uncorrelated, white, and Gaussian, and E{w²(k)} = 1 and E{v²(k)} = r.
(a) Let θ = col (a, r). What is the equation for the log-likelihood function?
(b) Prepare a macro flowchart that depicts the sequence of calculations required to maximize L(θ|Z). Assume an optimization algorithm is used that requires gradient information about L(θ|Z).
(c) Write out the Kalman filter sensitivity equations for parameters a and r.

26-6. Develop the sensitivity equations for the case considered in Lesson 11, i.e., for the case where the only uncertainty present in the state-variable model is measurement noise. Begin with L(θ|Z) in (11-42).

26-7. Refer to Problem 24-7. Explain, using equations and a flowchart, how to obtain MLE's of the unknown parameters for: (a) the equation for the unsteady operation of a synchronous motor, in which C and p are unknown; (b) Duffing's equation, in which C, α, and β are unknown; (c) Van der Pol's equation, in which ε is unknown; and (d) Hill's equation, in which a and b are unknown.

Notation

Lesson

27

Kalmam-Bucy

and Problem

Statement

271

SYSTEM DESCRIPTION

Filtering

Our continuous-time system is described by the following state-variable model, 2(t) = F(t)x(t) + G(t)w(t)

(27-l)

z(t) = H(t)x(t) + v(t)

(27-2)

where x(t) is y1X 1, w(t) is p X 1, z(t) is m X 1, and v(t) is m X 1. For simplicity, we have omitted a known forcing function term in state equation (27-l). Matrices F(t), G(t), and H(t) have dimensions which conform to the dimensions of the vector quantities in this state-variable model. Disturbance w(t) and measurement noise v(t) are zero-mean white noise processes,which are assumedto be uncorrelated, i.e., E{w(t)} = 0, E{v(t)} = 0, E{w(t)w’(r)} = Q(t)s(t - T)

(27-3)

E{v(t)v’(T)} = R(t)S(t - 7)

(27-4)

E{w(t)v’(r)} = 0

(27-5)

INTRODUCTION

The Kalman-Bucy filter is the continuous-time counterpart to the Kalman filter. It is a continuous-time minimum-variance filter that provides state estimatesfor continuous-timedynamical systemsthat are describedby linear, (possibly) time-varying, and (possibly) nonstationary ordinary differential equations. The Kalman-Bucy filter (KBF) can be derived in a number of different ways, including the following three: 1. Use a formal limiting procedure to obtain the KBF from the KF (e.g., Meditch, 1969). 2. Begin by assumingthe optimal estimator is a linear transformation of all measurements. Use a calculus of variations argument or the orthogonality principle to obtain the Wiener-Hopf integral equation. Embedded within this equationis the filter kernal. Take the derivative of the Wiener-Hopf equation to obtain a differential equation which is the KBF (Meditch, 1969). 3. Begin by assuminga linear differential equation structure for the KBF, one that containsan unknown time-varying gain matrix that weights the difference betweenthe measurementmade at time t and the estimate of that measurement. Choose the gain matrix that minimizes the. meansquared error (Athans and Tse, 1967). We shall briefly describethe first and third approaches,but first we must define our continuous-time model and formally state the problem we wish to solve. 270

Equations (27-3), (27-4), and (27-5) apply for t ≥ t₀. Additionally, R(t) is continuous and positive definite, whereas Q(t) is continuous and positive semidefinite. Finally, we assume that the initial state vector x(t₀) may be random, and, if it is, it is uncorrelated with both w(t) and v(t). The statistics of a random x(t₀) are

E{x(t₀)} = m_x(t₀)   (27-6)

and

cov [x(t₀)] = P_x(t₀)   (27-7)

Measurements z(t) are assumed to be made for t₀ ≤ t ≤ T. If x(t₀), w(t), and v(t) are jointly Gaussian for all t ∈ [t₀, T], then the KBF will be the optimal estimator of state vector x(t). We will not make any distributional assumptions about x(t₀), w(t), and v(t) in this lesson, being content to establish the linear optimal estimator of x(t).

NOTATION AND PROBLEM STATEMENT

Our notation for a continuous-time estimate of x(t) and its associated estimation error parallels our notation for the comparable discrete-time quantities, i.e., x̂(t|t) denotes the optimal estimate of x(t) that uses all the measurements z(τ), t₀ ≤ τ ≤ t, and

x̃(t|t) = x(t) − x̂(t|t)   (27-8)

Kalman-Bucy Filtering — Lesson 27

The mean-squared state estimation error is

J[x̃(t|t)] = E{x̃'(t|t)x̃(t|t)}   (27-9)

We shall determine K(t) that minimizes J[x̃(t|t)], subject to the constraints of our state-variable model and data set.

g(tlf) = j-’ @(t,~)K(~)z(~)d7 = jr A(~J)z(T)& ID

IO

where the filter kernel A(t,r) is A(~,T) = @(!,r)K(r)

THE KALMAN-BUCY FILTER

The solution to the problem stated in the preceding section is the Kalman-Bucy filter, the structure of which is summarized in the following:

Theorem 27-1. The KBF is described by the vector differential equation

dx̂(t|t)/dt = F(t)x̂(t|t) + K(t)[z(t) − H(t)x̂(t|t)]   (27-10)

where t ≥ t₀, x̂(t₀|t₀) = m_x(t₀),

K(t) = P(t|t)H'(t)R⁻¹(t)   (27-11)

and the matrix Riccati differential equation (27-12), given below.

The second approach to deriving the KBF, mentioned in the introduction to this lesson, begins by assuming that x̂(t|t) can be expressed as in (27-17), where A(t, τ) is unknown. The mean-squared estimation error is minimized to obtain the following Wiener-Hopf integral equation:

E{x(t)z'(σ)} − ∫_{t₀}^{t} A(t, τ)E{z(τ)z'(σ)} dτ = 0   (27-19)

where t₀ ≤ σ ≤ t. When this equation is converted into a differential equation, one obtains the KBF described in Theorem 27-1. For the details of this derivation of Theorem 27-1, see Meditch (1969), Chapter 8.

DERIVATION OF KBF USING A FORMAL LIMITING PROCEDURE

dP(t|t)/dt = F(t)P(t|t) + P(t|t)F'(t) − P(t|t)H'(t)R⁻¹(t)H(t)P(t|t) + G(t)Q(t)G'(t)   (27-12)

Equation (27-12), which is a matrix Riccati differential equation, is initialized by P(t₀|t₀) = P_x(t₀). □

Matrix K(t) is the Kalman-Bucy gain matrix, and P(t|t) is the state-estimation-error covariance matrix, i.e.,

P(t|t) = E{x̃(t|t)x̃'(t|t)}   (27-13)
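The coupled gain and covariance equations (27-11) and (27-12) are easy to propagate numerically for a scalar system. The following sketch is illustrative only — the scalar model values f, g, h, q, r and the simple Euler integrator are our own choices, not part of the lesson:

```python
def kbf_covariance(f, g, h, q, r, p0, t_final, dt=1e-4):
    """Euler-integrate the scalar Riccati equation (27-12),
    dP/dt = 2*f*P - (P*h)**2/r + g*g*q, and return P(t_final)
    together with the Kalman-Bucy gain (27-11), K = P*h/r."""
    p = p0
    for _ in range(int(t_final / dt)):
        p += (2.0 * f * p - (p * h) ** 2 / r + g * g * q) * dt
    return p, p * h / r

# For f = 0 and g = h = 1, the steady-state solution of (27-12) is
# P = sqrt(q*r); with q = 4 and r = 1, both P and K approach 2.
p, k = kbf_covariance(f=0.0, g=1.0, h=1.0, q=4.0, r=1.0, p0=0.0, t_final=10.0)
print(p, k)
```

Note that P(t|t), and hence K(t), can be precomputed off line: neither depends on the measurements z(t).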

Kalman filter equation (17-11), expressed as

x̂(k+1|k+1) = Φ(k+1, k)x̂(k|k) + K(k+1)[z(k+1) − H(k+1)Φ(k+1, k)x̂(k|k)]   (27-20)

can also be written as

x̂(t+Δt|t+Δt) = Φ(t+Δt, t)x̂(t|t) + K(t+Δt)[z(t+Δt) − H(t+Δt)Φ(t+Δt, t)x̂(t|t)]   (27-21)

where we have let t_k = t and t_{k+1} = t + Δt.

Equation (27-10) can be rewritten as

dx̂(t|t)/dt = [F(t) − K(t)H(t)]x̂(t|t) + K(t)z(t)   (27-14)

which makes it very clear that the KBF is a time-varying filter that processes the measurements linearly to produce x̂(t|t). The solution to (27-14) is given in (27-17).

In Example 24-3 we showed that

Φ(t + Δt, t) = I + F(t)Δt + O(Δt²)   (27-22)

and

Q_d(t) = G(t)Q(t)G'(t)Δt + O(Δt²)   (27-23)

Observe that Q_d(t) can also be written as

Q_d(t) = [G(t)Δt] [Q(t)/Δt] [G(t)Δt]' + O(Δt²)   (27-24)

and, if we express w_d(k) as Γ(k+1, k)w(k), so that Q_d(k) = Γ(k+1, k)Q(k)Γ'(k+1, k), then

Γ(t + Δt, t) = G(t)Δt + O(Δt²)   (27-25)

Here the state transition matrix Φ(t, τ) of the KBF in (27-14) is the solution to the matrix differential equation

dΦ(t, τ)/dt = [F(t) − K(t)H(t)]Φ(t, τ),  Φ(t, t) = I   (27-16)


Then, we substitute Q(t)/Δt for Q(k = t) in (17-13), to obtain

Q(k = t) → Q(t)/Δt   (27-26)

Equation (27-26) means that we replace Q(k = t) in the KF by Q(t)/Δt. Note that we have encountered a bit of a notational problem here, because we have used w(k) [and its associated covariance, Q(k)] to denote the disturbance in our discrete-time model, and w(t) [and its associated intensity, Q(t)] to denote the disturbance in our continuous-time model. Without going into technical details, we shall also replace R(t + Δt) in the KF by R(t + Δt)/Δt, i.e.,

R(k + 1 = t + Δt) → R(t + Δt)/Δt   (27-27)

See Meditch (1969, pp. 139-142) for an explanation.

Substituting (27-22) into (27-21), and omitting all higher-order terms in Δt, we find that

x̂(t+Δt|t+Δt) = [I + F(t)Δt]x̂(t|t) + K(t+Δt){z(t+Δt) − H(t+Δt)[I + F(t)Δt]x̂(t|t)}   (27-28)

from which it follows that

lim_{Δt→0} [x̂(t+Δt|t+Δt) − x̂(t|t)]/Δt = dx̂(t|t)/dt
  = F(t)x̂(t|t) + lim_{Δt→0} K(t+Δt){z(t+Δt) − H(t+Δt)[I + F(t)Δt]x̂(t|t)}/Δt   (27-29)

Under suitable regularity conditions, which we shall assume are satisfied here, we can replace the limit of a product of functions by the product of the limits, i.e.,

lim_{Δt→0} K(t+Δt){z(t+Δt) − H(t+Δt)[I + F(t)Δt]x̂(t|t)}/Δt
  = lim_{Δt→0} [K(t+Δt)/Δt] · lim_{Δt→0} {z(t+Δt) − H(t+Δt)[I + F(t)Δt]x̂(t|t)}   (27-30)

The second limit on the right-hand side of (27-30) is easy to evaluate, i.e.,

lim_{Δt→0} {z(t+Δt) − H(t+Δt)[I + F(t)Δt]x̂(t|t)} = z(t) − H(t)x̂(t|t)   (27-31)

In order to evaluate the first limit on the right-hand side of (27-30), we first substitute R(t+Δt)/Δt for R(k+1 = t+Δt) in (17-12), to obtain

K(t+Δt) = P(t+Δt|t)H'(t+Δt)[H(t+Δt)P(t+Δt|t)H'(t+Δt)Δt + R(t+Δt)]⁻¹Δt   (27-32)

Substituting Q(t)/Δt for Q(k = t), together with the expansions of Φ(t+Δt, t) and Γ(t+Δt, t), into (17-13), we also find that

P(t+Δt|t) = Φ(t+Δt, t)P(t|t)Φ'(t+Δt, t) + Γ(t+Δt, t)[Q(t)/Δt]Γ'(t+Δt, t)   (27-33)

We leave it to the reader to show that

lim_{Δt→0} P(t+Δt|t) = P(t|t)   (27-34)

hence,

lim_{Δt→0} K(t+Δt)/Δt = P(t|t)H'(t)R⁻¹(t) ≜ K(t)   (27-35)

Combining (27-29), (27-30), (27-31), and (27-35), we obtain the KBF in (27-10) and the KB gain matrix in (27-11).

In order to derive the matrix differential equation for P(t|t), we begin with (17-14), substitute (27-33) along with the expansions of Φ(t+Δt, t) and Γ(t+Δt, t) into that equation, and use the fact that K(t+Δt) has no zero-order terms in Δt, to show that

P(t+Δt|t+Δt) = P(t+Δt|t) − K(t+Δt)H(t+Δt)P(t+Δt|t)
  = P(t|t) + [F(t)P(t|t) + P(t|t)F'(t) + G(t)Q(t)G'(t)]Δt − K(t+Δt)H(t+Δt)P(t+Δt|t)   (27-36)

Consequently,

lim_{Δt→0} [P(t+Δt|t+Δt) − P(t|t)]/Δt = dP(t|t)/dt
  = F(t)P(t|t) + P(t|t)F'(t) + G(t)Q(t)G'(t) − lim_{Δt→0} K(t+Δt)H(t+Δt)P(t+Δt|t)/Δt   (27-37)

or finally, using (27-35),

dP(t|t)/dt = F(t)P(t|t) + P(t|t)F'(t) + G(t)Q(t)G'(t) − P(t|t)H'(t)R⁻¹(t)H(t)P(t|t)   (27-38)

This completes the derivation of the KBF using a formal limiting procedure. It is also possible to obtain continuous-time smoothers by means of this procedure (e.g., see Meditch, 1969, Chapter 7).

DERIVATION OF KBF WHEN STRUCTURE OF THE FILTER IS PRESPECIFIED
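The limit (27-35) is easy to check numerically. In the sketch below (a scalar system; all numbers are illustrative choices, not from the text), the discrete gain (17-12), with R(t+Δt)/Δt substituted for the measurement covariance as in (27-27), is divided by Δt and approaches P(t|t)H'(t)R⁻¹(t) as Δt → 0:

```python
# Numerical check of the limit (27-35) for a scalar system: with
# R_d = r/dt (see (27-27)) and Q_d = q*dt (see (27-26)), the discrete
# Kalman gain divided by dt approaches p*h/r.  Illustrative numbers.
p, h, q, r = 3.0, 1.0, 0.5, 2.0

ratios = []
for dt in (1e-1, 1e-2, 1e-3, 1e-4):
    p_pred = p + q * dt                          # predicted variance (f = 0, g = 1)
    k = p_pred * h / (h * p_pred * h + r / dt)   # discrete gain (17-12) with R_d = r/dt
    ratios.append(k / dt)

print(ratios)      # tends to p*h/r = 1.5 as dt shrinks
```

The successive ratios approach 1.5 = p·h/r, in agreement with (27-35).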

In this derivation of the KBF, we begin by assuming that the filter has the following structure:

dx̂(t|t)/dt = F(t)x̂(t|t) + K(t)[z(t) − H(t)x̂(t|t)]   (27-39)


Our objective is to find the matrix function K(t), t₀ ≤ t ≤ T, that minimizes the following mean-squared error:

J[K(t)] = E{e'(T)e(T)}   (27-40)

where

e(T) = x(T) − x̂(T|T)   (27-41)

This optimization problem is a fixed-time, free-end-point (i.e., T is fixed, but e(T) is not fixed) problem in the calculus of variations (e.g., Kwakernaak and Sivan, 1972; Athans and Falb, 1965; and Bryson and Ho, 1969). It is straightforward to show that E{e(T)} = 0, so that E{[e(T) − E{e(T)}][e(T) − E{e(T)}]'} = E{e(T)e'(T)}. Letting

P(t|t) = E{e(t)e'(t)}   (27-42)

we know that J[K(t)] can be reexpressed as

J[K(t)] = tr P(T|T)   (27-43)

We leave it to the reader to derive the following state equation for e(t), and its associated covariance equation:

de(t)/dt = [F(t) − K(t)H(t)]e(t) + G(t)w(t) − K(t)v(t)   (27-44)

dP(t|t)/dt = [F(t) − K(t)H(t)]P(t|t) + P(t|t)[F(t) − K(t)H(t)]' + G(t)Q(t)G'(t) + K(t)R(t)K'(t)   (27-45)

where e(t₀) = 0 and P(t₀|t₀) = P_x(t₀).

Our optimization problem for determining K(t) is: given the matrix differential equation (27-45), satisfied by the error-covariance matrix P(t|t), a terminal time T, and the cost functional J[K(t)] in (27-43), determine the matrix K(t), t₀ ≤ t ≤ T, that minimizes J[K(t)].

The elements p_ij(t|t) of P(t|t) may be viewed as the state variables of a dynamical system, and the elements k_ij(t) of K(t) may be viewed as the control variables in an optimal control problem. The cost functional is then a terminal-time penalty function on the state variables p_ij(t|t). Euler-Lagrange equations associated with a free-end-point problem can be used to determine the optimal gain matrix K(t). To do this we define a set of costate variables σ_ij(t) that correspond to the p_ij(t|t), i, j = 1, 2, ..., n. Let Σ(t) be an n × n costate matrix that is associated with P(t|t), i.e., Σ(t) = (σ_ij(t)). Next, we introduce the Hamiltonian function ℋ(K, P, Σ), where for notational convenience we have omitted the dependence of K, P, and Σ on t:

ℋ(K, P, Σ) = tr {[dP(t|t)/dt]Σ'(t)}   (27-46)

Substituting (27-45) into (27-46), we see that

ℋ(K, P, Σ) = tr [F(t)P(t|t)Σ'(t)] − tr [K(t)H(t)P(t|t)Σ'(t)] + tr [P(t|t)F'(t)Σ'(t)] − tr [P(t|t)H'(t)K'(t)Σ'(t)] + tr [G(t)Q(t)G'(t)Σ'(t)] + tr [K(t)R(t)K'(t)Σ'(t)]   (27-47)

The Euler-Lagrange equations for our optimization problem are:

∂ℋ(K, P, Σ)/∂K |* = 0   (27-48)

dΣ*(t)/dt = −∂ℋ(K, P, Σ)/∂P |*   (27-49)

dP*(t)/dt = ∂ℋ(K, P, Σ)/∂Σ |*   (27-50)

and

Σ*(T) = ∂[tr P(T|T)]/∂P |*   (27-51)

In these equations, starred quantities denote optimal quantities, and |* denotes the replacement of K, P, and Σ by K*, P*, and Σ* after the appropriate derivative has been calculated. Note, also, that the derivatives of ℋ(K, P, Σ) are derivatives of a scalar quantity with respect to a matrix (e.g., K, P, or Σ). The calculus of gradient matrices (e.g., Schweppe, 1974; or Athans and Schweppe, 1965) can be used to evaluate these derivatives. The results are:

−Σ*P*H' − Σ*'P*H' + Σ*K*R + Σ*'K*R = 0   (27-52)

dΣ*/dt = −Σ*(F − K*H) − (F − K*H)'Σ*   (27-53)

dP*/dt = (F − K*H)P* + P*(F − K*H)' + GQG' + K*RK*'   (27-54)

and

Σ*(T) = I   (27-55)

Our immediate objective is to obtain an expression for K*(t).

Fact. Matrix Σ*(t) is symmetric and positive definite. □

We leave the proof of this fact as an exercise for the reader. Using this fact, and the fact that covariance matrix P*(t) is symmetric, we are able to express (27-52) as

2Σ*(K*R − P*H') = 0   (27-56)
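The covariance equation (27-45) holds for any gain K(t), optimal or not, which suggests a quick numerical sanity check: for a scalar time-invariant system, integrating (27-45) with several constant gains should show that a gain of the form P̄H'/R yields the smallest variance. The sketch below is illustrative only — the model values and the Euler integrator are our own choices:

```python
import math

def variance_at_T(k_gain, f=-1.0, g=1.0, h=1.0, q=1.0, r=1.0,
                  p0=1.0, T=5.0, dt=1e-3):
    """Euler-integrate the scalar version of (27-45) with a constant gain k:
    dP/dt = 2*(f - k*h)*P + g*g*q + k*k*r.  Returns P(T)."""
    p = p0
    for _ in range(int(T / dt)):
        p += (2.0 * (f - k_gain * h) * p + g * g * q + k_gain ** 2 * r) * dt
    return p

# Steady-state optimal gain: k* = Pbar*h/r, with Pbar solving the scalar ARE
# 2*f*Pbar - Pbar**2*h**2/r + g*g*q = 0; here Pbar = sqrt(2) - 1.
k_star = math.sqrt(2.0) - 1.0
p_opt = variance_at_T(k_star)
others = [variance_at_T(k) for k in (0.0, 0.2, 0.8, 1.5)]
print(p_opt, min(others))    # p_opt is the smallest variance
```

Every suboptimal constant gain produces a larger terminal variance than k*, as the derivation predicts.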


Because Σ* > 0, (Σ*)⁻¹ exists, so that (27-56) has for its only solution

K*(t) = P*(t|t)H'(t)R⁻¹(t)   (27-57)

which is the Kalman-Bucy gain matrix stated in Theorem 27-1. In order to obtain the covariance equation associated with K*(t), substitute (27-57) into (27-54); the result is (27-12). This completes the derivation of the KBF when the structure of the filter is prespecified.

STEADY-STATE KBF

If our continuous-time system is time-invariant and stationary, then, when certain system-theoretic conditions are satisfied (see, e.g., Kwakernaak and Sivan, 1972), P(t|t) reaches a steady-state value, denoted P̄. In this case, K(t) → K̄, where

K̄ = P̄H'R⁻¹   (27-58)

P̄ is the solution of the algebraic Riccati equation

FP̄ + P̄F' − P̄H'R⁻¹HP̄ + GQG' = 0   (27-59)

and the steady-state KBF is asymptotically stable, i.e., the eigenvalues of F − K̄H all lie in the left half of the complex s-plane.

Example 27-1
Here we examine the steady-state KBF for the simplest second-order system, the double integrator,

ẍ(t) = w(t)   (27-60)

and

z(t) = x(t) + v(t)   (27-61)

in which w(t) and v(t) are mutually uncorrelated white noise processes, with intensities q and r, respectively. With x₁(t) = x(t) and x₂(t) = ẋ(t), this system is expressed in state-variable format as

dx₁(t)/dt = x₂(t),  dx₂(t)/dt = w(t)   (27-62)

and

z(t) = (1 0) col (x₁(t), x₂(t)) + v(t)   (27-63)

Writing the algebraic Riccati equation (27-59) for this system, with P̄ = (p̄ij), gives (27-64), which leads to the following three algebraic equations:

2p̄12 − p̄11²/r = 0   (27-65a)

p̄22 − p̄11p̄12/r = 0   (27-65b)

q − p̄12²/r = 0   (27-65c)

It is straightforward to show that the unique solution of these nonlinear algebraic equations, for which P̄ > 0, is

p̄12 = (qr)^{1/2}   (27-66a)

p̄11 = √2 q^{1/4} r^{3/4}   (27-66b)

p̄22 = √2 q^{3/4} r^{1/4}   (27-66c)

The steady-state KB gain matrix is computed from (27-58) as

K̄ = col (√2 (q/r)^{1/4}, (q/r)^{1/2})   (27-67)

Observe that, just as in the discrete-time case, the single-channel KBF depends only on the ratio q/r. Although we only needed p̄11 and p̄12 to compute K̄, p̄22 is an important quantity, because

p̄22 = lim_{t→∞} E{[ẋ(t) − x̂₂(t|t)]²}   (27-68)

Additionally,

p̄11 = lim_{t→∞} E{[x(t) − x̂₁(t|t)]²}   (27-69)

Using (27-66b) and (27-66c), we find that

p̄22/p̄11 = (q/r)^{1/2}   (27-70)

If q/r > 1 (i.e., SNR possibly greater than unity), we will always have larger errors in estimation of ẋ(t) than in estimation of x(t). This is not too surprising, because our measurement depends only on x(t), and both w(t) and v(t) affect the calculation of x̂₂(t|t).

The steady-state KBF is characterized by the eigenvalues of matrix

F − K̄H   (27-71)

These eigenvalues are solutions of the equation

s² + √2 (q/r)^{1/4} s + (q/r)^{1/2} = 0   (27-72)

When this equation is expressed in the normalized form s² + 2ζωₙs + ωₙ² = 0, we find that

ωₙ = (q/r)^{1/4}   (27-73)

and

ζ = 0.707   (27-74)

thus, the steady-state KBF for the simple double-integrator system is damped at 0.707. The filter's poles lie on the ±45° lines depicted in Figure 27-1. They can be moved along these lines by adjusting the ratio q/r; hence, once again, we may view q/r as a filter tuning parameter. □
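The closed-form results (27-65)-(27-74) are easy to verify numerically. The following sketch (the values of q and r are illustrative) checks that P̄ from (27-66) satisfies the algebraic Riccati equation (27-59), then recovers ωₙ = (q/r)^{1/4} and ζ = 0.707 from the eigenvalues of F − K̄H:

```python
import numpy as np

# Numerical check of Example 27-1 (q and r values are illustrative).
q, r = 16.0, 1.0

p11 = np.sqrt(2.0) * q ** 0.25 * r ** 0.75         # (27-66b)
p12 = np.sqrt(q * r)                               # (27-66a)
p22 = np.sqrt(2.0) * q ** 0.75 * r ** 0.25         # (27-66c)
P_bar = np.array([[p11, p12], [p12, p22]])

F = np.array([[0.0, 1.0], [0.0, 0.0]])
G = np.array([[0.0], [1.0]])
H = np.array([[1.0, 0.0]])

# P_bar satisfies the algebraic Riccati equation (27-59):
ARE = F @ P_bar + P_bar @ F.T - P_bar @ H.T @ H @ P_bar / r + G @ G.T * q
assert np.allclose(ARE, 0.0)

K_bar = P_bar @ H.T / r                            # steady-state gain (27-58)
eigs = np.linalg.eigvals(F - K_bar @ H)
wn = abs(eigs[0])                                  # natural frequency (27-73)
zeta = -eigs[0].real / wn                          # damping ratio (27-74)
print(wn, zeta)                                    # wn = (q/r)**0.25 = 2, zeta ~ 0.707
```

With q/r = 16, the poles sit at −√2 ± j√2, i.e., at distance 2 from the origin along the 45° lines, exactly as Figure 27-1 indicates.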

Figure 27-1. Eigenvalues of the steady-state KBF lie along ±45 degree lines. Increasing q/r moves them farther away from the origin, whereas decreasing q/r moves them closer to the origin.

AN IMPORTANT APPLICATION FOR THE KBF

Consider the system

dx(t)/dt = F(t)x(t) + B(t)u(t) + G(t)w(t),  x(t₀) = x₀   (27-75)

for t ≥ t₀, where x₀ is a random initial condition vector with mean m_x(t₀) and covariance matrix P_x(t₀). Measurements are given by

z(t) = H(t)x(t) + v(t)   (27-76)

for t ≥ t₀. The joint random process col [w(t), v(t)] is a white noise process with known intensity (27-77). The controlled variable is a linear function of the state, d(t) = D(t)x(t).

The stochastic linear optimal output feedback regulator problem is the problem of finding the functional

u(t) = f[z(τ), t₀ ≤ τ ≤ t]   (27-78)

for t₀ ≤ t ≤ t₁ such that the objective function

J[u] = E{ ½ x'(t₁)W₁x(t₁) + ½ ∫_{t₀}^{t₁} [d'(τ)W₂d(τ) + u'(τ)W₃u(τ)] dτ }   (27-79)

is minimized. Here W₁, W₂, and W₃ are symmetric weighting matrices, and W₁ ≥ 0, W₂ > 0, and W₃ > 0 for t₀ ≤ t ≤ t₁. In the control theory literature, this problem is also known as the linear-quadratic-Gaussian regulator problem (i.e., the LQG problem; see Athans, 1971, for example). We state the structure of the solution to this problem, without proof, next.

The optimal control, u*(t), which minimizes J[u] in (27-79), is

u*(t) = −F_c(t)x̂(t|t)   (27-80)

where F_c(t) is an optimal gain matrix, computed as

F_c(t) = W₃⁻¹B'(t)P_c(t)   (27-81)

where P_c(t) is the solution of the control Riccati equation

−dP_c(t)/dt = F'(t)P_c(t) + P_c(t)F(t) − P_c(t)B(t)W₃⁻¹B'(t)P_c(t) + D'(t)W₂D(t),  P_c(t₁) given   (27-82)

and x̂(t|t) is the output of a KBF, properly modified to account for the control term in the state equation, i.e.,

dx̂(t|t)/dt = F(t)x̂(t|t) + B(t)u*(t) + K(t)[z(t) − H(t)x̂(t|t)]   (27-83)

We see that the KBF plays an essential role in the solution of the LQG problem.

PROBLEMS

27-1. Explain the replacement of covariance matrix R(k+1 = t+Δt) by R(t+Δt)/Δt in (27-27).
27-2. Show that lim_{Δt→0} P(t+Δt|t) = P(t|t).
27-3. Derive the state equation for error e(t), given in (27-44), and its associated covariance equation (27-45).
27-4. Prove that matrix Σ*(t) is symmetric and positive definite.

Lesson A

Sufficient Statistics and Statistical Estimation of Parameters

INTRODUCTION

In this lesson,* we discuss the usefulness of the notion of sufficient statistics in statistical estimation of parameters. Specifically, we discuss the role played by sufficient statistics and exponential families in maximum-likelihood and uniformly minimum-variance unbiased (UMVU) parameter estimation.

* This lesson was written by Dr. Rama Chellappa, Department of Electrical Engineering-Systems, University of Southern California, Los Angeles, CA 90089.

CONCEPT OF SUFFICIENT STATISTICS

The notion of a sufficient statistic can be explained intuitively (Ferguson, 1967), as follows. We observe Z(N) (Z for short), where Z = col [z(1), z(2), ..., z(N)], in which z(1), ..., z(N) are independent and identically distributed random vectors, each having a density function p(z(i)|θ), where θ is unknown. Often the information in Z can be represented equivalently in a statistic, T(Z), whose dimension is independent of N, such that T(Z) contains all of the information about θ that is originally in Z. Such a statistic is known as a sufficient statistic.

Example A-1
Consider a sampled sequence of N manufactured cars. For each car we record whether it is defective or not. The observed sample can be represented as Z = col [z(1), ..., z(N)], where z(i) = 0 if the ith car is not defective and z(i) = 1 if the ith car is defective. The total number of observed defective cars is

T(Z) = Σ_{i=1}^{N} z(i)

This is a statistic that maps many different values of z(1), ..., z(N) into the same value of T(Z). It is intuitively clear that, if one is interested in estimating the proportion θ of defective cars, nothing is lost by simply recording and using T(Z) in place of z(1), ..., z(N). The particular sequence of ones and zeros is irrelevant. Thus, as far as estimating the proportion of defective cars is concerned, T(Z) contains all the information contained in Z. □

An advantage associated with the concept of a sufficient statistic is dimensionality reduction. In Example A-1, the dimensionality reduction is from N to 1.

Definition A-1. A statistic T(Z) is sufficient for vector parameter θ if and only if the distribution of Z, conditioned on T(Z) = t, does not involve θ. □

Example A-2
This example illustrates the application of Definition A-1 to identify a sufficient statistic for the model in Example A-1. Let θ be the probability that a car is defective. Then z(1), z(2), ..., z(N) is a record of N Bernoulli trials with probability θ; thus,

p(Z|θ) = θ^t (1 − θ)^{N−t},  0 < θ < 1

where t = Σ_{i=1}^{N} z(i), and each z(i) is 1 or 0. The conditional distribution of z(1), ..., z(N), given Σ_{i=1}^{N} z(i) = t, is

P[Z|T = t] = P[Z, T = t]/P[T = t] = θ^t(1 − θ)^{N−t} / [C(N, t) θ^t(1 − θ)^{N−t}] = 1/C(N, t)

where C(N, t) = N!/[t!(N − t)!], which is independent of θ; hence, T(Z) = Σ_{i=1}^{N} z(i) is sufficient. Any one-to-one function of T(Z) is also sufficient. □

This example illustrates that deriving a sufficient statistic using Definition A-1 can be quite difficult. An equivalent definition of sufficiency, which is easy to apply, is given in the following:

Theorem A-1 (Factorization Theorem). A necessary and sufficient condition for T(Z) to be sufficient for θ is that there exists a factorization

p(Z|θ) = g(T(Z), θ) h(Z)   (A-1)

where the first factor in (A-1) may depend on θ, but depends on Z only through T(Z), whereas the second factor is independent of θ. □

The proof of this theorem is given in Ferguson (1967) for the continuous case and Duda and Hart (1973) for the discrete case.
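The conditional-distribution calculation of Example A-2 can also be verified by brute-force enumeration. The sketch below is our own illustrative check, not part of the lesson: it computes the exact conditional distribution of a Bernoulli sequence given T = t for several values of θ, and confirms that it is always the uniform distribution 1/C(N, t):

```python
from itertools import product
from math import comb
from fractions import Fraction

def conditional_given_t(N, t, theta):
    """Exact distribution of the sequence z(1..N) given T = t, Bernoulli(theta)."""
    theta = Fraction(theta)
    joint = {z: theta ** sum(z) * (1 - theta) ** (N - sum(z))
             for z in product((0, 1), repeat=N)}
    p_t = sum(p for z, p in joint.items() if sum(z) == t)
    return {z: p / p_t for z, p in joint.items() if sum(z) == t}

# The conditional probabilities equal 1/C(N, t) no matter what theta is,
# which is exactly what makes T(Z) = sum z(i) sufficient.
N, t = 4, 2
for theta in (Fraction(1, 4), Fraction(1, 2), Fraction(9, 10)):
    dist = conditional_given_t(N, t, theta)
    assert all(p == Fraction(1, comb(N, t)) for p in dist.values())
print("conditional law is uniform: 1 /", comb(N, t))
```

Exact rational arithmetic (`fractions.Fraction`) is used so the independence from θ is verified exactly rather than to floating-point tolerance.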

Example A-3 (Continuation of Example A-2)
In Example A-2, the probability distribution of samples z(1), ..., z(N) is

p(Z|θ) = θ^t (1 − θ)^{N−t}   (A-2)

where the total number of defective cars is t = Σ_{i=1}^{N} z(i), and each z(i) is either 0 or 1. Equation (A-2) can be written equivalently as

p(Z|θ) = exp [t ln (θ/(1 − θ)) + N ln (1 − θ)]   (A-3)

Comparing (A-3) with (A-1), we conclude that h(Z) = 1 and that the first factor depends on Z only through t; hence, T(Z) = t. Using the Factorization Theorem, it was easy to determine T(Z). □

Example A-4
Let Z = col [z(1), ..., z(N)] be a random sample drawn from a univariate Gaussian distribution, with unknown mean μ and known variance σ² > 0. Then

p(Z|μ) = (2πσ²)^{−N/2} exp [−(1/2σ²) Σ_{i=1}^{N} (z(i) − μ)²]
       = exp [(μ/σ²) Σ_{i=1}^{N} z(i) − Nμ²/2σ²] · (2πσ²)^{−N/2} exp [−(1/2σ²) Σ_{i=1}^{N} z²(i)]

Based on the Factorization Theorem, we identify Σ_{i=1}^{N} z(i) as a sufficient statistic for μ. □

Because the concept of sufficient statistics involves reduction of data, it is worthwhile to know how far such a reduction can be carried for a given problem. The set of smallest dimension of statistics that is still sufficient for the parameters is called a minimal sufficient statistic. See Barankin (1959) and Katz (1959) for techniques useful in identifying a minimal sufficient statistic.

EXPONENTIAL FAMILIES OF DISTRIBUTIONS

It is of interest to study families of distributions p(z|θ) for which, irrespective of the sample size N, there exists a sufficient statistic of fixed dimension. The exponential families of distributions have this property. For example, the family of normal distributions N(μ, σ²), with σ² known and μ unknown, is an exponential family which, as we have seen in Example A-4, has a one-dimensional sufficient statistic for μ that is equal to Σ_{i=1}^{N} z(i). As Bickel and Doksum (1977) state:

Definition A-2 (Bickel and Doksum, 1977). If there exist real-valued functions a(θ) and b(θ) on parameter space Θ, and real-valued functions T(z) [z is short for z(i)] and h(z), such that the density function p(z|θ) can be written as

p(z|θ) = exp [a(θ)T(z) + b(θ) + h(z)]   (A-4)

then p(z|θ), θ ∈ Θ, is said to be a one-parameter exponential family of distributions. □

The Gaussian, Binomial, Beta, Rayleigh, and Gamma distributions are examples of such one-parameter exponential families. In a one-parameter exponential family, T(z) is sufficient for θ. The family of distributions obtained by sampling from one-parameter exponential families is also a one-parameter exponential family. For example, suppose that z(1), ..., z(N) are independent and identically distributed with common density p(z|θ); then,

p(Z|θ) = exp [a(θ) Σ_{i=1}^{N} T(z(i)) + Nb(θ) + Σ_{i=1}^{N} h(z(i))]

with sufficient statistic

T(Z) = Σ_{i=1}^{N} T(z(i))

Example A-5
Let z(1), ..., z(N) be a random sample from a multivariate Gaussian distribution with unknown d × 1 mean vector μ and known covariance matrix P_z. Then [z is short for z(i)]

p(z|μ) = exp [a'(μ)T(z) + b(μ) + h(z)]   (A-5)

where

a'(μ) = μ'P_z⁻¹,  T(z) = z,  b(μ) = −½ μ'P_z⁻¹μ − ½ ln det P_z,  h(z) = −½ z'P_z⁻¹z − (d/2) ln 2π

The sufficient statistic T(Z) for this situation is

T(Z) = Σ_{i=1}^{N} z(i)

Additionally,

h(Z) = exp [−½ Σ_{i=1}^{N} z'(i)P_z⁻¹z(i) − (Nd/2) ln 2π]

Based on the Factorization Theorem, we identify Σ_{i=1}^{N} z(i) as a sufficient statistic for μ. □

EXPONENTIAL FAMILIES AND MAXIMUM-LIKELIHOOD ESTIMATION

Let us consider a vector of unknown parameters θ that describes a collection of N independent and identically distributed observations Z = col [z(1), ..., z(N)]. The maximum-likelihood estimate (MLE) of θ is obtained by maximizing the likelihood of θ given the observations Z. Likelihood is defined in Lesson 11 to be proportional to the value of the probability density of the observations, given the parameters, i.e.,

l(θ|Z) ∝ p(Z|θ)   (A-6)

and L(θ|Z) = ln l(θ|Z). As discussed in Lesson 11, a sufficient condition for l(θ|Z) to be maximized is that the Hessian matrix of L(θ|Z) be negative definite at the stationary point,

∂²L(θ|Z)/∂θ∂θ' < 0   (A-7)

Maximum-likelihood estimates of θ are obtained by solving the system of n equations

∂L(θ|Z)/∂θᵢ = 0,  i = 1, 2, ..., n   (A-8)

for θ̂_ML and checking whether the solution to (A-8) satisfies (A-7). When this technique is applied to members of exponential families, θ̂_ML can be obtained by solving a set of algebraic equations. The following theorem, paraphrased from Bickel and Doksum (1977), formalizes this technique for vector observations.

Theorem A-2 (Bickel and Doksum, 1977). Let p(z|θ) = exp [a'(θ)T(z) + b(θ) + h(z)], and let 𝒜 denote the interior of the range of a(θ). If the equation

E_θ{T(z)} = T(z)   (A-9)

has a solution θ̂(z) for which a[θ̂(z)] ∈ 𝒜, then θ̂(z) is the unique MLE of θ. □

The proof of this theorem can be found in Bickel and Doksum (1977).

The notion of a one-parameter exponential family of distributions, as stated in Bickel and Doksum (1977), can easily be extended to m parameters and vector observations in a straightforward manner.

Definition A-3. If there exist real matrices A₁(θ), ..., A_m(θ), a real function b of θ, where θ ∈ Θ, real matrices T₁(z), ..., T_m(z), and a real function h(z), such that the density function p(z|θ) can be written as

p(z|θ) = exp [Σ_{j=1}^{m} tr (A_j(θ)T_j(z)) + b(θ) + h(z)]

then p(z|θ), θ ∈ Θ, is said to be an m-parameter exponential family of distributions. □

Example A-6
The family of d-variate normal distributions N(μ, P_μ), where both μ and P_μ are unknown, is an example of a 2-parameter exponential family in which θ contains μ and the elements of P_μ. In this case

A₁(θ) = a₁(θ) = P_μ⁻¹μ,  T₁(z) = z'

A₂(θ) = −½ P_μ⁻¹,  T₂(z) = zz'

and h(z) = 0. □

As is true for a one-parameter exponential family of distributions, if z(1), ..., z(N) are drawn randomly from an m-parameter exponential family, then p[z(1), ..., z(N)|θ] forms an m-parameter exponential family with sufficient statistics T₁(Z), ..., T_m(Z), where T_j(Z) = Σ_{i=1}^{N} T_j[z(i)].

Example A-7 (Continuation of Example A-5)
In this case T(Z) = Σ_{i=1}^{N} z(i), and E_μ{T(Z)} = Nμ;

hence, (A-9) becomes

Nμ̂ = Σ_{i=1}^{N} z(i)

whose solution, μ̂, is

μ̂ = (1/N) Σ_{i=1}^{N} z(i)

which is the well-known MLE of μ. □

Theorem A-2 can be extended to the m-parameter exponential family case by using Definition A-3. We illustrate the applicability of this extension using the example given below.

Example A-8 (see, also, Example A-6)
Let Z = col [z(1), ..., z(N)] be randomly drawn from p(z|θ) = N(μ, P_μ), where both μ and P_μ are unknown, so that θ contains μ and the elements of P_μ. Vector μ is d × 1 and matrix P_μ is d × d, symmetric and positive definite. We express p(Z|θ) as

p(Z|θ) = (2π)^{−Nd/2} (det P_μ)^{−N/2} exp [−½ Σ_{i=1}^{N} (z(i) − μ)'P_μ⁻¹(z(i) − μ)]

Using Theorem A-1 or Definition A-3, it can be seen that Σ_{i=1}^{N} z(i) and Σ_{i=1}^{N} z(i)z'(i) are sufficient for (μ, P_μ). Letting T₁(Z) = Σ_{i=1}^{N} z(i) and T₂(Z) = Σ_{i=1}^{N} z(i)z'(i), we find that

E_θ{T₁(Z)} = Nμ

and

E_θ{T₂(Z)} = N(P_μ + μμ')

Applying (A-9) to both T₁(Z) and T₂(Z), we obtain

Nμ̂ = Σ_{i=1}^{N} z(i)  and  N(P̂_μ + μ̂μ̂') = Σ_{i=1}^{N} z(i)z'(i)

whose solutions, μ̂ and P̂_μ, are

μ̂ = (1/N) Σ_{i=1}^{N} z(i)  and  P̂_μ = (1/N) Σ_{i=1}^{N} [z(i) − μ̂][z(i) − μ̂]'

which are the MLE's of μ and P_μ. □

Example A-9 (Linear Model)
Consider the linear model

Z(k) = H(k)θ + V(k)

in which θ is an n × 1 vector of deterministic parameters, H(k) is deterministic, and V(k) is a zero-mean white Gaussian noise sequence with known covariance matrix R(k). From (11-25) in Lesson 11, we can express p(Z(k)|θ) as an exponential family in which

a'(θ) = θ',  T(Z(k)) = H'(k)R⁻¹(k)Z(k),  b(θ) = −(N/2) ln 2π − ½ ln det R(k) − ½ θ'H'(k)R⁻¹(k)H(k)θ

Observe that

E_θ{H'(k)R⁻¹(k)Z(k)} = H'(k)R⁻¹(k)H(k)θ

hence, applying (A-9), we obtain

H'(k)R⁻¹(k)H(k)θ̂ = H'(k)R⁻¹(k)Z(k)

whose solution, θ̂(k), is

θ̂(k) = [H'(k)R⁻¹(k)H(k)]⁻¹H'(k)R⁻¹(k)Z(k)

which is the well-known expression for the MLE of θ (see Theorem 11-3). The case when R(k) = σ²I, where σ² is unknown, can be handled in a manner very similar to that in Example A-8. □
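The MLEs of Example A-8 depend on the data only through the two sufficient statistics T₁(Z) = Σ z(i) and T₂(Z) = Σ z(i)z'(i). The following sketch (the true parameter values, sample size, and random seed are illustrative choices) computes μ̂ and P̂_μ from those statistics alone:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5000, 2
mu_true = np.array([1.0, -2.0])
P_true = np.array([[2.0, 0.5], [0.5, 1.0]])
Z = rng.multivariate_normal(mu_true, P_true, size=N)

# Only the two sufficient statistics of Example A-8 are needed:
T1 = Z.sum(axis=0)          # sum of z(i)
T2 = Z.T @ Z                # sum of z(i) z'(i)

mu_hat = T1 / N
P_hat = T2 / N - np.outer(mu_hat, mu_hat)   # equals (1/N) sum (z - mu_hat)(z - mu_hat)'
print(mu_hat)
print(P_hat)
```

With a large sample, μ̂ and P̂_μ are close to the true μ and P_μ, and the raw data can be discarded once T₁ and T₂ have been accumulated — the dimensionality-reduction point made at the start of the lesson.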

SUFFICIENT STATISTICS AND UNIFORMLY MINIMUM-VARIANCE UNBIASED ESTIMATION

In this section we discuss how sufficient statistics can be used to obtain uniformly minimum-variance unbiased (UMVU) estimates. Recall, from Lesson 6, that an estimate θ̂ of parameter θ is said to be unbiased if

E{θ̂} = θ   (A-10)

Among such unbiased estimates, we can often find one estimate, denoted θ̂*, which improves all other estimates in the sense that

var (θ̂*) ≤ var (θ̂)   (A-11)

When (A-11) is true for all (admissible) values of θ, θ̂* is known as the UMVU estimate of θ. The UMVU estimator is obtained by choosing the estimator that has the minimum variance among the class of unbiased estimators. If the estimator is constrained further to be a linear function of the observations, then it becomes the BLUE, which was discussed in Lesson 9.

Suppose we have an estimate, θ̂(Z), of parameter θ that is based on observations Z = col [z(1), ..., z(N)]. Assume further that p(Z|θ) has a finite-dimensional sufficient statistic, T(Z), for θ. Using T(Z), we can construct an estimate θ̂*(Z) which is at least as good as, or even better than, θ̂, by the celebrated Rao-Blackwell Theorem (Bickel and Doksum, 1977). We do this by computing the conditional expectation of θ̂(Z), i.e.,

θ̂*(Z) = E{θ̂(Z)|T(Z)}   (A-12)

Estimate θ̂*(Z) is "better than θ̂" in the sense that E{[θ̂*(Z) − θ]²} ≤ E{[θ̂(Z) − θ]²}. Because T(Z) is sufficient, the conditional expectation E{θ̂(Z)|T(Z)} will not depend on θ; hence, θ̂*(Z) is a function of Z only. Application of this conditioning technique can only improve an estimate such as θ̂(Z); it does not guarantee that θ̂*(Z) will be the UMVU estimate. To obtain the UMVU estimate using this conditioning technique, we need the additional concept of completeness.

Definition A-4 [Lehmann (1959; 1980); Bickel and Doksum (1977)]. A sufficient statistic T(Z) is said to be complete if the only real-valued function g, defined on the range of T(Z), which satisfies E_θ{g(T)} = 0 for all θ, is the function g(T) = 0. □

Completeness is a property of the family of distributions of T(Z) generated as θ varies over its range. The concept of a complete sufficient statistic, as stated by Lehmann (1983), can be viewed as an extension of the notion of sufficient statistics in reducing the amount of useful information required for the estimation of θ. Although a sufficient statistic achieves data reduction, it may contain some additional information not required for the estimation of θ. For instance, it may be that E_θ{g(T(Z))} is a constant independent of θ for some nonconstant function g. If so, we would like to have E_θ{g(T(Z))} = c (a constant independent of θ) imply that g(T(Z)) = c. By subtracting c from E_θ{g(T(Z))}, one arrives at Definition A-4.

Proving completeness using Definition A-4 can be cumbersome. In the special case when p(z(k)|θ) is a one-parameter exponential family, i.e., when

p(z(k)|θ) = exp [a(θ)T(z(k)) + b(θ) + h(z(k))]   (A-13)

the completeness of T(z(k)) can be verified by checking whether the range of a(θ) contains an open interval (Lehmann, 1959).

Example A-10
Let Z = col [z(1), ..., z(N)] be a random sample drawn from a univariate Gaussian distribution whose mean μ is unknown, and whose variance σ² > 0 is known. From Example A-5, we know that the distribution of Z forms a one-parameter exponential family, with T(Z) = Σ_{i=1}^{N} z(i) and a(μ) = μ/σ². Because a(μ) ranges over an open interval as μ varies from −∞ to +∞, T(Z) = Σ_{i=1}^{N} z(i) is complete and sufficient.

The same conclusion can be obtained using Definition A-4, as follows. We must show that the Gaussian family of probability distributions (with μ unknown and σ² fixed) is complete. Note that the sufficient statistic T(Z) = Σ_{i=1}^{N} z(i) (see Example A-5) is Gaussian with mean Nμ and variance Nσ². Suppose g is a function such that E_μ{g(T)} = 0 for all −∞ < μ < ∞; then,

∫_{−∞}^{∞} (2π)^{−1/2} g(σ√N v + Nμ) exp (−v²/2) dv = 0   (A-14)

must hold for every μ, and this implies g(·) = 0 for all values of the argument of g. □

Other interesting examples that prove completeness for families of distributions are found in Lehmann (1959). Once a complete and sufficient statistic T(Z) is known for a given parameter estimation problem, the Lehmann-Scheffé Theorem, given next, can be used to obtain a unique UMVU estimate. This theorem is paraphrased from Bickel and Doksum (1977).

Theorem A-3 [Lehmann-Scheffé Theorem (e.g., Bickel and Doksum, 1977)]. If a complete and sufficient statistic, T(Z), exists for θ, and θ̂ is an unbiased estimator of θ, then θ̂*(Z) = E{θ̂|T(Z)} is an UMVU estimator of θ. If var θ̂*(Z) < ∞ for all θ, then θ̂*(Z) is the unique UMVU estimate of θ. □
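The Rao-Blackwell step in (A-12) can be illustrated with a small Monte Carlo experiment (the parameter values and seed below are our own choices): start from the unbiased but wasteful estimator μ̂ = z(1); for a Gaussian sample, conditioning on the complete sufficient statistic Σ z(i) turns it into the sample mean, whose variance is smaller by a factor of N:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, N, trials = 3.0, 1.0, 10, 20000
Z = rng.normal(mu, sigma, size=(trials, N))

crude = Z[:, 0]                # unbiased but wasteful: uses z(1) only
# For a Gaussian sample, E{z(1) | sum z(i)} = (1/N) sum z(i), so the
# Rao-Blackwellized estimator is simply the sample mean.
rb = Z.mean(axis=1)

print(crude.var(), rb.var())   # variance drops by about a factor of N
```

Both estimators are unbiased, but the conditioned estimator has roughly one-tenth the variance here (N = 10), consistent with the Rao-Blackwell Theorem.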


A proof of this theorem can be found in Bickel and Doksum (1977). The theorem can be applied in two ways to determine an UMVU estimator [Bickel and Doksum (1977) and Lehmann (1959)].

Method 1. Find a statistic of the form h(T(Z)) such that E{h(T(Z))} = θ, where T(Z) is a complete and sufficient statistic for θ. Then h(T(Z)) is an UMVU estimator of θ. This follows from the fact that E{h(T(Z))|T(Z)} = h(T(Z)).

Method 2. Find an unbiased estimator, θ̂, of θ; then E{θ̂(Z)|T(Z)} is an UMVU estimator of θ for a complete and sufficient statistic T(Z).

Example A-11 (Continuation of Example A-10)
We know that T(Z) = Σ_{i=1}^{N} z(i) is a complete and sufficient statistic for μ. Furthermore, (1/N) Σ_{i=1}^{N} z(i) is an unbiased estimator of μ; hence, we obtain the well-known result from Method 1, that the sample mean, (1/N) Σ_{i=1}^{N} z(i), is an UMVU estimate of μ. Because this estimator is linear, it is also the BLUE of μ. □

Example A-12 (Linear Model)
As in Example A-9, consider the linear model

Z(k) = H(k)θ + V(k)    (A-16)

where θ is a deterministic but unknown n × 1 vector of parameters, H(k) is deterministic, and E{V(k)} = 0. Additionally, assume that V(k) is Gaussian with known covariance matrix R(k). Then the statistic T(Z(k)) = H'(k)R⁻¹(k)Z(k) is sufficient (see Example A-9). That it is also complete can be seen by using Theorem A-4. To obtain UMVU estimate θ̂, we need to identify a function h[T(Z(k))] such that E{h[T(Z(k))]} = θ. The structure of h[T(Z(k))] is obtained by observing that

E{T(Z(k))} = E{H'(k)R⁻¹(k)Z(k)} = H'(k)R⁻¹(k)H(k)θ

hence,

h[T(Z(k))] = [H'(k)R⁻¹(k)H(k)]⁻¹ T(Z(k))

Consequently, the UMVU estimator of θ is

θ̂(k) = [H'(k)R⁻¹(k)H(k)]⁻¹ H'(k)R⁻¹(k)Z(k)

which agrees with Equation (9-26). □

We now generalize the preceding discussion to the case of an m-parameter exponential family and scalar observations. This theorem is paraphrased from Bickel and Doksum (1977).

Theorem A-4 [Bickel and Doksum (1977) and Lehmann (1959)]. Let p(z|θ) be an m-parameter exponential family given by

p(z|θ) = exp[ Σ_{i=1}^{m} a_i(θ) T_i(z) + b(θ) + h(z) ]

where a₁, ..., a_m and b are real-valued functions of θ, and T₁, ..., T_m and h are real-valued functions of z. Suppose that the range of a = col[a₁(θ), ..., a_m(θ)] contains an open m-rectangle [if (x₁, y₁), ..., (x_m, y_m) are m open intervals, the set {s = col(s₁, ..., s_m): x_i < s_i < y_i, 1 ≤ i ≤ m} is called an open m-rectangle]; then T(z) = col[T₁(z), ..., T_m(z)] is complete as well as sufficient. □

Example A-13 (This example is taken from Bickel and Doksum, 1977, pp. 123-124)
As in Example A-4, let Z = col[z(1), ..., z(N)] be a sample from a N(μ, σ²) population where both μ and σ² are unknown. As a special case of Example A-6, we observe that the distribution of Z forms a two-parameter exponential family where θ = col(μ, σ²). Because col[a₁(θ), a₂(θ)] = col(μ/σ², −1/2σ²) ranges over the lower half-plane as θ ranges over col[(−∞, ∞), (0, ∞)], the conditions of Theorem A-4 are satisfied. As a result, T(Z) = col[Σ_{i=1}^{N} z(i), Σ_{i=1}^{N} z²(i)] is complete and sufficient. □

Theorem A-3 also generalizes in a straightforward manner to:

Theorem A-5. If a complete and sufficient statistic T(Z) = col[T₁(Z), ..., T_m(Z)] exists for θ, and θ̂ is an unbiased estimator of θ, then θ̂*(Z) = E{θ̂|T(Z)} is an UMVU estimator of θ. If the elements of the covariance matrix of θ̂*(Z) are < ∞ for all θ, then θ̂*(Z) is the unique UMVU estimate of θ. □

The proof of this theorem is a straightforward extension of the proof of Theorem A-3, which can be found in Bickel and Doksum (1977).

Example A-14 (Continuation of Example A-13)
In Example A-13 we saw that col[T₁(Z), T₂(Z)] = col[Σ_{i=1}^{N} z(i), Σ_{i=1}^{N} z²(i)] is sufficient and complete for both μ and σ². Furthermore, since the sample mean, z̄ = (1/N) Σ_{i=1}^{N} z(i), and the sample variance, σ̂² = [1/(N − 1)] Σ_{i=1}^{N} [z(i) − z̄]², are unbiased estimators of μ and σ², respectively, we use the extension of Method 1 to the vector parameter case to conclude that z̄ and σ̂² are UMVU estimators of μ and σ². □

It is not always possible to identify a function h(T(Z)) that is an unbiased estimator of θ. Examples that use the conditioning Method 2 to obtain UMVU estimators are found, for example, in Bickel and Doksum (1977) and Lehmann (1980).
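The UMVU estimator of Example A-12, θ̂(k) = [H'(k)R⁻¹(k)H(k)]⁻¹H'(k)R⁻¹(k)Z(k), can be checked by simulation. In this sketch the matrix H, the covariance R, and the true θ are invented illustration values (not from the text), and NumPy is assumed:

```python
# Monte Carlo check of the UMVU estimator for the linear Gaussian model
# Z = H*theta + V, V ~ N(0, R):  theta_hat = (H' R^-1 H)^-1 H' R^-1 Z.
# H, R, and theta are made-up illustration values.
import numpy as np

rng = np.random.default_rng(1)
theta = np.array([1.0, -0.5])                   # true parameter vector (assumed)
H = rng.normal(size=(20, 2))                    # deterministic observation matrix
R = np.diag(rng.uniform(0.5, 2.0, size=20))     # known noise covariance (diagonal here)
Rinv = np.linalg.inv(R)
G = np.linalg.inv(H.T @ Rinv @ H) @ H.T @ Rinv  # estimator gain matrix

trials = 20_000
V = rng.multivariate_normal(np.zeros(20), R, size=trials)
ests = (H @ theta + V) @ G.T                    # one estimate per simulated Z

print(ests.mean(axis=0))   # sample average of the estimates, near theta
```

Averaging the estimates over many simulated data records recovers θ (unbiasedness), and the sample covariance of the estimates matches (H'R⁻¹H)⁻¹, the minimum variance achievable by an unbiased estimator for this model.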

PROBLEMS

A-1. Suppose z(1), ..., z(N) are independent random variables, each uniform on [0, θ], where θ > 0 is unknown. Find a sufficient statistic for θ.

A-2. Suppose we have two independent observations from the Cauchy distribution,

p(z) = (1/π) · 1/[1 + (z − θ)²],    −∞ < z < ∞

A-6. Show that the family of Bernoulli distributions, with unknown probability of success p (0 ≤ p ≤ 1), is complete.

A-7. Show that the family of uniform distributions on (0, θ), where θ > 0 is unknown, is complete.

A-8. Let z(1), ..., z(N) be independent and identically distributed samples, where p(z(i)|θ) is a Bernoulli distribution with unknown probability of success p (0 ≤ p ≤ 1). Find a complete sufficient statistic, T; the UMVU estimate q̂(T) of p; and the variance of q̂(T).

A-9. [Taken from Bickel and Doksum (1977)]. Let z(1), z(2), ..., z(N) be an independent and identically distributed sample from N(μ, 1). Find the UMVU estimator of P_μ[z(1) ≤ 0].

A-10. [Taken from Bickel and Doksum (1977)]. Suppose that T₁ and T₂ are two UMVU estimates of θ with finite variances. Show that T₁ = T₂.

A-11. In Example A-12 prove that T(Z(k)) is complete.

Glossary of Major Results

Equations (3-10) and (3-11): Batch formulas for θ̂_LS(k) and θ̂_WLS(k).
Theorem 4-1: Information form of recursive LSE.
Lemma 4-1: Matrix inversion lemma.
Theorem 4-2: Covariance form of recursive LSE.
Theorem 5-1: Multistage LSE.
Theorem 6-1: Necessary and sufficient conditions for a linear batch estimator to be unbiased.
Theorem 6-2: Sufficient condition for a linear recursive estimator to be unbiased.
Theorem 6-3: Cramer-Rao inequality for a scalar parameter.
Corollary 6-1: Achieving the Cramer-Rao lower bound.
Theorem 6-4: Cramer-Rao inequality for a vector of parameters.
Corollary 6-2: Inequality for error-variance of the ith parameter.
Theorem 7-1: Mean-squared convergence implies convergence in probability.
Theorem 7-2: Conditions under which θ̂(k) is a consistent estimator of θ.
Theorem 8-1: Sufficient conditions for θ̂_LS(k) to be an unbiased estimator of θ.
Theorem 8-2: A formula for cov[θ̂_LS(k)].


Corollary 8-1: A formula for cov[θ̂_LS(k)] under special conditions on the measurement noise.
Theorem 8-3: An unbiased estimator of σ².
Theorem 8-4: Sufficient conditions for θ̂_WLS(k) to be a consistent estimator of θ.
Theorem 8-5: Sufficient conditions for σ̂²(k) to be a consistent estimator of σ².
Equation (9-22): Batch formula for θ̂_BLU(k).
Theorem 9-1: The relationship between θ̂_BLU(k) and θ̂_WLS(k).
Corollary 9-1: When all the results obtained in Lessons 3, 4, and 5 for θ̂_WLS(k) can be applied to θ̂_BLU(k).
Theorem 9-2: When θ̂_BLU(k) equals θ̂_WLS(k) (Gauss-Markov Theorem).
Theorem 9-3: A formula for cov[θ̂_BLU(k)].
Corollary 9-2: The equivalence between P(k) and cov[θ̂_BLU(k)].
Theorem 9-4: Most efficient estimator property of θ̂_BLU(k).
Corollary 9-3: When θ̂_BLU(k) is a most efficient estimator of θ.
Theorem 9-5: Invariance of θ̂_BLU(k) to scale changes.
Theorem 9-6: Information form of recursive BLUE.
Theorem 9-7: Covariance form of recursive BLUE.
Definition 10-1: Likelihood defined.
Theorem 10-1: Likelihood ratio of combined data from statistically independent sets of data.
Theorem 11-1: Large-sample properties of maximum-likelihood estimates.
Theorem 11-2: Invariance property of MLE's.
Theorem 11-3: Condition under which θ̂_ML(k) = θ̂_BLU(k).
Corollary 11-1: Conditions under which θ̂_ML(k) = θ̂_BLU(k) = θ̂_LS(k), and resulting estimator properties.
Theorem 12-1: A formula for E{x|y} when x and y are jointly Gaussian.
Theorem 12-2: Properties of E{x|y} when x and y are jointly Gaussian.
Theorem 12-3: Expansion formula for E{x|y, z} when x, y, and z are jointly Gaussian, and y and z are statistically independent.
Theorem 12-4: Expansion formula for E{x|y, z} when x, y, and z are jointly Gaussian and y and z are not necessarily statistically independent.


Theorem 13-1: A formula for θ̂_MS(k) (The Fundamental Theorem of Estimation Theory).
Corollary 13-1: A formula for θ̂_MS(k) when θ and Z(k) are jointly Gaussian.
Corollary 13-2: A linear mean-squared estimator of θ in the non-Gaussian case.
Corollary 13-3: Orthogonality principle.
Theorem 13-2: When θ̂_MAP(k) = θ̂_MS(k).
Theorem 14-1: Conditions under which θ̂_MAP(k) = θ̂_MS(k).
Theorem 14-2: Condition under which θ̂_MAP(k) = θ̂_ML(k).
Theorem 14-3: Condition under which θ̂_MAP(k) = θ̂_BLU(k).
Theorem 15-1: Expansion of a joint probability density function for a first-order Markov process.
Theorem 15-2: Calculation of conditional expectation for a first-order Markov process.
Theorem 15-3: Interpretation of Gaussian white noise as a special first-order Markov process.
Equations (15-17) & (15-18): The basic state-variable model.
Theorem 15-4: Conditions under which x(k) is a Gauss-Markov sequence.
Theorem 15-5: Recursive equations for computing m_x(k) and P_x(k).
Theorem 15-6: Formulas for computing m_z(k) and P_z(k).
Equations (16-4) & (16-11): Single-stage predictor formulas for x̂(k|k−1) and P(k|k−1).
Theorem 16-1: Formula for and properties of the general state predictor, x̂(k|j), k > j.
Theorem 16-2: Representations and properties of the innovations process.
Theorem 17-1: Kalman filter formulas and properties of resulting estimates and estimation error.
Theorem 19-1: Steady-state Kalman filter.
Theorem 19-2: Equivalence of steady-state Kalman filter and infinite-length digital Wiener filter.
Theorem 20-1: Single-stage smoother formula for x̂(k|k+1).
Corollary 20-1: Relationship between single-stage smoothing gain matrix and Kalman gain matrix.
Corollary 20-2: Another way to express x̂(k|k+1).


Theorem 20-2: Double-stage smoother formula for x̂(k|k+2).
Corollary 20-3: Relationship between double-stage smoothing gain matrix and Kalman gain matrix.
Corollary 20-4: Two other ways to express x̂(k|k+2).
Theorem 21-1: Formulas for a useful fixed-interval smoother of x(k), x̂(k|N), and its error-covariance matrix, P(k|N).
Theorem 21-2: Formulas for a most useful two-pass fixed-interval smoother of x(k) and its associated error-covariance matrix.
Theorem 21-3: Formulas for a most useful fixed-point smoothed estimator of x(k), x̂(k|k+l), where l = 1, 2, ..., and its associated error-covariance matrix, P(k|k+l).
Theorem 22-1: Conditions under which a single-channel state-variable model is equivalent to a convolutional sum model.
Theorem 22-2: Recursive minimum-variance deconvolution formulas.
Theorem 22-3: Steady-state MVD filter, and its zero-phase nature.
Theorem 22-4: Equivalence between steady-state MVD filter and Berkhout's infinite impulse response digital Wiener deconvolution filter.
Theorem 22-5: Maximum-likelihood deconvolution results.
Theorem 22-6: Structure of minimum-variance waveshaper.
Theorem 22-7: Recursive fixed-interval waveshaping results.
Theorem 23-1: How to handle biases that may be present in a state-variable model.
Theorem 23-2: Predictor-corrector Kalman filter for the correlated noise case.
Corollary 23-1: Recursive predictor formulas for the correlated noise case.
Corollary 23-2: Recursive filter formulas for the correlated noise case.
Equations (24-1) & (24-2): Nonlinear state-variable model.
Equations (24-23) & (24-30): Perturbation state-variable model.
Theorem 24-1: Solution to a time-varying continuous-time state equation.
Equations (24-39) & (24-44): Discretized state-variable model.
Theorem 25-1: A consequence of relinearizing about x̂(k|k).
Equations (25-22) & (25-27): Extended Kalman filter prediction and correction equations.
Theorem 26-1: Formula for the log-likelihood function of the basic state-variable model.
Theorem 26-2: Closed-form formula for the maximum-likelihood estimate of the steady-state value of the innovations covariance matrix.
Theorem 27-1: Kalman-Bucy filter equations.
Definition A-1: Sufficient statistic defined.
Theorem A-1: Factorization theorem.
Theorem A-2: A method for computing the unique maximum-likelihood estimator of θ that is associated with exponential families of distributions.
Theorem A-3: Lehmann-Scheffé Theorem; provides a uniformly minimum-variance unbiased estimator of θ.
Theorem A-4: Method for determining whether or not T(z) is complete as well as sufficient when p(z|θ) is an m-parameter exponential family.
Theorem A-5: Provides a uniformly minimum-variance unbiased estimator of vector θ.
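Several entries above (the single-stage predictor of Equations (16-4) and (16-11), the Kalman filter of Theorem 17-1, the steady-state filter of Theorem 19-1) share one predictor-corrector structure. As a hedged illustration only, here is a minimal scalar sketch; the model x(k+1) = φx(k) + w(k), z(k) = hx(k) + v(k) and the parameter values are invented for this example and are not taken from the text:

```python
# Predictor-corrector Kalman filter sketch for a scalar state-variable model.
# phi, h, q, r, and the simulated data are illustrative values only.
import numpy as np

rng = np.random.default_rng(2)
phi, h, q, r = 0.95, 1.0, 0.1, 1.0

# simulate the model
n = 2000
x = np.empty(n); x[0] = 0.0
for k in range(n - 1):
    x[k + 1] = phi * x[k] + rng.normal(0.0, np.sqrt(q))
z = h * x + rng.normal(0.0, np.sqrt(r), size=n)

xhat, P = 0.0, 1.0               # x(0|0) and P(0|0)
err = np.empty(n)
for k in range(n):
    x_pred = phi * xhat                       # single-stage prediction x(k|k-1)
    P_pred = phi * P * phi + q                # P(k|k-1)
    K = P_pred * h / (h * P_pred * h + r)     # Kalman gain
    xhat = x_pred + K * (z[k] - h * x_pred)   # correction: x(k|k)
    P = (1.0 - K * h) * P_pred                # P(k|k)
    err[k] = x[k] - xhat

print(err.var(), P)   # empirical filtered-error variance tracks the computed P(k|k)
```

Because the model is time invariant, P(k|k) converges quickly to a steady-state value, and the empirical error variance agrees with it, which is the content of the steady-state Kalman filter results cataloged above.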

References

AGUILERA, R., J. A. DEBREMAECKER, and S. HERNANDEZ. 1970. "Design of recursive filters." Geophysics, Vol. 35, pp. 247-253.
ANDERSON, B. D. O., and J. B. MOORE. 1979. Optimal Filtering. Englewood Cliffs, NJ: Prentice-Hall.
AOKI, M. 1967. Optimization of Stochastic Systems: Topics in Discrete-Time Systems. NY: Academic Press.
ÅSTRÖM, K. J. 1968. "Lectures on the identification problem: the least squares method." Rept. No. 6806, Lund Institute of Technology, Division of Automatic Control.
ATHANS, M. 1971. "The role and use of the stochastic linear-quadratic-Gaussian problem in control system design." IEEE Trans. on Automatic Control, Vol. AC-16, pp. 529-552.
ATHANS, M., and P. L. FALB. 1965. Optimal Control: An Introduction to the Theory and Its Applications. NY: McGraw-Hill.
ATHANS, M., and F. SCHWEPPE. 1965. "Gradient matrices and matrix calculations." MIT Lincoln Labs., Lexington, MA, Tech. Note 1965-53.
ATHANS, M., and E. TSE. 1967. "A direct derivation of the optimal linear filter using the maximum principle." IEEE Trans. on Automatic Control, Vol. AC-12, pp. 690-698.
ATHANS, M., R. P. WISHNER, and A. BERTOLINI. 1968. "Suboptimal state estimation for continuous-time nonlinear systems from discrete noisy measurements." IEEE Trans. on Automatic Control, Vol. AC-13, pp. 504-514.
BARANKIN, E. W., and M. KATZ, JR. 1959. "Sufficient Statistics of Minimal Dimension." Sankhya, Vol. 21, pp. 217-246.
BARANKIN, E. W. 1961. "Application to Exponential Families of the Solution to the Minimal Dimensionality Problem for Sufficient Statistics." Bull. Inst. Internat. Stat., Vol. 38, pp. 141-150.


BARD, Y. 1970. "Comparison of gradient methods for the solution of nonlinear parameter estimation problems." SIAM J. Numerical Analysis, Vol. 7, pp. 157-186.
BERKHOUT, A. J. 1977. "Least-squares inverse filtering and wavelet deconvolution." Geophysics, Vol. 42, pp. 1369-1383.
BICKEL, P. J., and K. A. DOKSUM. 1977. Mathematical Statistics: Basic Ideas and Selected Topics. San Francisco: Holden-Day, Inc.
BIERMAN, G. J. 1973a. "A comparison of discrete linear filtering algorithms." IEEE Trans. on Aerospace and Electronic Systems, Vol. AES-9, pp. 28-37.
BIERMAN, G. J. 1973b. "Fixed-interval smoothing with discrete measurements." Int. J. Control, Vol. 18, pp. 65-75.
BIERMAN, G. J. 1977. Factorization Methods for Discrete Sequential Estimation. NY: Academic Press.
BRYSON, A. E., JR., and M. FRAZIER. 1963. "Smoothing for linear and nonlinear dynamic systems." TDR 63-119, pp. 353-364, Aero. Sys. Div., Wright-Patterson Air Force Base, Ohio.
BRYSON, A. E., JR., and D. E. JOHANSEN. 1965. "Linear filtering for time-varying systems using measurements containing colored noise." IEEE Trans. on Automatic Control, Vol. AC-10, pp. 4-10.
BRYSON, A. E., JR., and Y. C. HO. 1969. Applied Optimal Control. Waltham, MA: Blaisdell.
CHEN, C. T. 1970. Introduction to Linear System Theory. NY: Holt.
CHI, C. Y. 1983. "Single-channel and multichannel deconvolution." Ph.D. dissertation, Univ. of Southern California, Los Angeles, CA.
CHI, C. Y., and J. M. MENDEL. 1984. "Performance of minimum-variance deconvolution filter." IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. ASSP-32, pp. 1145-1153.
CRAMÉR, H. 1946. Mathematical Methods of Statistics. Princeton, NJ: Princeton Univ. Press.
DAI, G-Z., and J. M. MENDEL. 1986. "General Problems of Minimum-Variance Recursive Waveshaping." IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. ASSP-34.
DONGARRA, J. J., J. R. BUNCH, C. B. MOLER, and G. W. STEWART. 1979. LINPACK User's Guide. Philadelphia: SIAM.
DUDA, R. O., and P. E. HART. 1973. Pattern Classification and Scene Analysis. NY: Wiley-Interscience.
EDWARDS, A. W. F. 1972. Likelihood. London: Cambridge Univ. Press.
FAURRE, P. L. 1976. "Stochastic Realization Algorithms," in System Identification: Advances and Case Studies (eds., R. K. Mehra and D. G. Lainiotis), pp. 1-25. NY: Academic Press.
FERGUSON, T. S. 1967. Mathematical Statistics: A Decision Theoretic Approach. NY: Academic Press.
FRASER, D. 1967. "Discussion of optimal fixed-point continuous linear smoothing (by J. S. Meditch)." Proc. 1967 Joint Automatic Control Conf., p. 249, Univ. of PA, Philadelphia.


GOLDBERGER, A. S. 1964. Econometric Theory. NY: John Wiley.
GRAYBILL, F. A. 1961. An Introduction to Linear Statistical Models, Vol. 1. NY: McGraw-Hill.
GUPTA, N. K., and R. K. MEHRA. 1974. "Computational aspects of maximum likelihood estimation and reduction of sensitivity function calculations." IEEE Trans. on Automatic Control, Vol. AC-19, pp. 774-783.
GURA, I. A., and A. B. BIERMAN. 1971. "On computational efficiency of linear filtering algorithms." Automatica, Vol. 7, pp. 299-314.
HAMMING, R. W. 1983. Digital Filters, 2nd Edition. Englewood Cliffs, NJ: Prentice-Hall.
HO, Y. C. 1963. "On the stochastic approximation and optimal filtering." J. of Math. Anal. and Appl., Vol. 6, pp. 152-154.
JAZWINSKI, A. H. 1970. Stochastic Processes and Filtering Theory. NY: Academic Press.
KAILATH, T. 1968. "An innovations approach to least-squares estimation, Part 1: Linear filtering in additive white noise." IEEE Trans. on Automatic Control, Vol. AC-13, pp. 646-655.
KAILATH, T. 1980. Linear Systems. Englewood Cliffs, NJ: Prentice-Hall.
KALMAN, R. E. 1960. "A new approach to linear filtering and prediction problems." Trans. ASME, J. Basic Eng., Series D, Vol. 82, pp. 35-45.
KALMAN, R. E., and R. BUCY. 1961. "New results in linear filtering and prediction theory." Trans. ASME, J. Basic Eng., Series D, Vol. 83, pp. 95-108.
KASHYAP, R. L., and A. R. RAO. 1976. Dynamic Stochastic Models from Empirical Data. NY: Academic Press.
KELLY, C. N., and B. D. O. ANDERSON. 1971. "On the stability of fixed-lag smoothing algorithms." J. Franklin Inst., Vol. 291, pp. 271-281.
KMENTA, J. 1971. Elements of Econometrics. NY: Macmillan.
KOPP, R. E., and R. J. ORFORD. 1963. "Linear regression applied to system identification for adaptive control systems." AIAA J., Vol. 1, pp. 2300.
KUNG, S. Y. 1978. "A new identification and model reduction algorithm via singular value decomposition." Paper presented at the 12th Annual Asilomar Conference on Circuits, Systems, and Computers, Pacific Grove, CA.
KWAKERNAAK, H., and R. SIVAN. 1972. Linear Optimal Control Systems. NY: Wiley-Interscience.
LAUB, A. J. 1979. "A Schur method for solving algebraic Riccati equations." IEEE Trans. on Automatic Control, Vol. AC-24, pp. 913-921.
LEHMANN, E. L. 1959. Testing Statistical Hypotheses. NY: John Wiley.
LEHMANN, E. L. 1980. Theory of Point Estimation. NY: John Wiley.
LJUNG, L. 1976. "Consistency of the Least-Squares Identification Method." IEEE Trans. on Automatic Control, Vol. AC-21, pp. 779-781.
LJUNG, L. 1979. "Asymptotic behavior of the extended Kalman filter as a parameter estimator for linear systems." IEEE Trans. on Automatic Control, Vol. AC-24, pp. 36-50.


MARQUARDT, D. W. 1963. "An algorithm for least-squares estimation of nonlinear parameters." J. Soc. Indust. Appl. Math., Vol. 11, pp. 431-441.
MCLOUGHLIN, D. B. 1980. "Distributed systems-notes." Proc. 1980 Pre-JACC Tutorial Workshop on Maximum-Likelihood Identification, San Francisco, CA.
MEDITCH, J. S. 1969. Stochastic Optimal Linear Estimation and Control. NY: McGraw-Hill.
MEHRA, R. K. 1970a. "An algorithm to solve matrix equations PHT = G and P = Φ…"

Index

Matrix Riccati equation
Maximum a posteriori estimation: comparison with best linear unbiased estimator, 123-124; comparison with mean-squared estimator, 114-116; Gaussian linear model, 123-124; general case, 114-116
Maximum-likelihood deconvolution (see Deconvolution)
Maximum-likelihood estimation, 89-97
Maximum-likelihood estimators: comparison with best linear unbiased estimator, 93-94; comparison with least-squares estimator, 94; for exponential families, 287-290; the linear model, 92-94; obtaining them, 89-91; properties, 91-92
Maximum-likelihood method, 89
Maximum-likelihood state and parameter estimation: computing the estimates, 261; log-likelihood function for the basic state-variable model, 259-261; steady-state approximation, 264-268
Mean-squared convergence (see Convergence in mean-square)
Mean-squared estimation, 109-113
Mean-squared estimators: comparison with best linear unbiased estimator, 122-123; comparison with maximum a posteriori estimator, 114-116; derivation, 110; Gaussian case, 111-113; for linear and Gaussian model, 118-120; properties of, 112-113
Measurement differencing technique, 231
Minimum-variance deconvolution (see Deconvolution)
MLD (see Deconvolution)
Modeling: estimation problem, 1; measurement problem, 1; representation problem, 1; validation problem, 2
Multistage least-squares (see Least-squares estimator)
Multivariate Gaussian random variables (see Gaussian random variables)
MVD (see Deconvolution)
Nonlinear dynamical systems: discretized perturbation state-variable model, 245; linear perturbation equations, 239-242; model, 237-239
Nonlinear measurement model, 12
Not-so-basic state-variable model: biases, 224; colored noises, 227-230; correlated noises, 225-227; perfect measurements, 230-233
Notation, 14-15
Orthogonality principle, 111-112
Parameter estimation (see Extended Kalman filter)
Perfect measurements (see Not-so-basic state-variable model)
Perturbation equations (see Nonlinear dynamical systems)
Philosophy: estimation theory, 6; modeling, 5-6
Prediction: general, 142-145; recursive predictor, 155; single-stage, 140-142; steady-state predictor, 173
Predictor-corrector form of Kalman filter, 151
Properties of best linear unbiased estimators (see Best linear unbiased estimator)
Properties of estimators (see Small sample properties of estimators; Large sample properties of estimators)
Properties of least-squares estimator (see Least-squares estimator)
Properties of maximum-likelihood estimators (see Maximum-likelihood estimators)
Properties of mean-squared estimators (see Mean-squared estimators)
Random processes (see Gauss-Markov random processes)
Random variables (see Gaussian random variables)
Recursive calculation of state covariance matrix, 134
Recursive calculation of state mean vector, 134
Recursive processing (see Least-squares estimator; Best linear unbiased estimator)
Recursive waveshaping, 216-222
Reduced-order Kalman filter, 231
Reduced-order state estimator, 231
Riccati equation (see Matrix Riccati equation; Algebraic Riccati equation)
Sample mean, as a recursive digital filter
Scale changes: best linear unbiased estimator, 77-78; least squares, 23-24
Sensitivity of Kalman filter, 164-166
Signal-to-noise ratio, 137-138, 167
Single-channel steady-state Kalman filter, 174-176
Small sample properties of estimators (see also Least-squares estimator): efficiency; unbiasedness, 44-46
Smoothing: applications; double-stage, 187-189; fixed-interval, 190, 193-199; fixed-lag, 191, 201-202; fixed-point, 190, 199-201; single-stage, 184-187; three types, 183-184
Stabilized form for computing P(k+1|k+1), 156
Standard form for computing P(k+1|k+1)
State estimation (see Prediction; Filtering; Smoothing)
State estimation example, 10-12
State-variable model (see Basic state-variable model; Not-so-basic state-variable model)
Steady-state approximation (see Maximum-likelihood state and parameter estimation)
Steady-state filter system, 173
Steady-state Kalman filter, 170-172
Steady-state MVD filter: defined, 207; properties of, 212-213; relationship to IIR Wiener deconvolution filter, 213-215
Steady-state predictor system, 173
Stochastic linear optimal output feedback regulator problem, 281
Sufficient statistics: defined, 282-284; for exponential families of distributions, 285-286, 287-290; and uniformly minimum-variance unbiased estimation, 290-294
Unbiasedness (see Small sample properties of estimators)
Uniformly minimum-variance unbiased estimation (see Sufficient statistics)
Variance estimator, 67-68
Waveshaping (see Recursive waveshaping)
Weighted least-squares estimator (see Least-squares estimator; Best linear unbiased estimator)
White noise, discrete, 130-131
Wiener filter: derivation, 178-179; relation to Kalman filter, 180-181
Wiener-Hopf equations, 179
Zoom-in algorithm
