Estimation Theory - LA - EPFL

Estimation Theory Alireza Karimi Laboratoire d’Automatique, MEC2 397, email: alireza.karimi@epfl.ch

Spring 2013


Course Objective Extract information from noisy signals Parameter Estimation Problem : Given a set of measured data {x[0], x[1], . . . , x[N − 1]} which depends on an unknown parameter vector θ, determine an estimator θˆ = g (x[0], x[1], . . . , x[N − 1]) where g is some function.

Applications : Image processing, communications, biomedicine, system identification, state estimation in control, etc.

Some Examples

Range estimation : We transmit a pulse that is reflected by the aircraft. An echo is received after τ seconds. The range θ is estimated from the equation θ = τc/2, where c is the speed of light.

System identification : The input of a plant is excited with u and the output signal y is measured. If y[k] = G(q^{-1}, θ)u[k] + n[k], where n[k] is the measurement noise, the model parameters θ are estimated using u[k] and y[k] for k = 1, . . . , N.

DC level in noise : Consider a set of data {x[0], x[1], . . . , x[N − 1]} that can be modeled as x[n] = A + w[n], where w[n] is some zero-mean noise process. A can be estimated using the measured data set.

Outline

Classical estimation (θ deterministic) :
Minimum Variance Unbiased Estimator (MVU)
Cramer-Rao Lower Bound (CRLB)
Best Linear Unbiased Estimator (BLUE)
Maximum Likelihood Estimator (MLE)
Least Squares Estimator (LSE)

Bayesian estimation (θ stochastic) :
Minimum Mean Square Error Estimator (MMSE)
Maximum A Posteriori Estimator (MAP)
Linear MMSE Estimator
Kalman Filter


References Main reference :

Fundamentals of Statistical Signal Processing Estimation Theory by Steven M. KAY, Prentice-Hall, 1993 (available in Library de La Fontaine, RLC). We cover Chapters 1 to 14, skipping Chapter 5 and Chapter 9.

Other references :
Lessons in Estimation Theory for Signal Processing, Communications and Control, by Jerry M. Mendel, Prentice-Hall, 1995.
Probability, Random Processes and Estimation Theory for Engineers, by Henry Stark and John W. Woods, Prentice-Hall, 1986.

Review of Probability and Random Variables


Random Variables Random Variable : A rule X (·) that assigns to every element of a sample space Ω a real value is called a RV. So X is not really a variable that varies randomly but a function whose domain is Ω and whose range is some subset of the real line. Example : Consider the experiment of flipping a coin twice. The sample space (the possible outcomes) is : Ω = {HH, HT , TH, TT } We can define a random variable X such that X (HH) = 1, X (HT ) = 1.1, X (TH) = 1.6, X (TT ) = 1.8 Random variable X assigns to each event (e.g. E = {HT , TH} ⊂ Ω) a subset of the real line (in this case B = {1.1, 1.6}). (Probability and Random Variables)

Estimation Theory

Spring 2013

7 / 152

Probability Distribution Function For any element ζ in Ω, the event {ζ|X (ζ) ≤ x} is an important event. The probability of this event Pr [{ζ|X (ζ) ≤ x}] = PX (x) is called the probability distribution function of X . Example : For the random variable defined earlier, we have : PX (1.5) = Pr [{ζ|X (ζ) ≤ 1.5}] = Pr [{HH, HT }] = 0.5 PX (x) can be computed for all x ∈ R. It is clear that 0 ≤ PX (x) ≤ 1. Remark : For the same experiment (throwing a coin twice) we could define another random variable that would lead to a different PX (x). In most of engineering problems the sample space is a subset of the real line so X (ζ) = ζ and PX (x) is a continuous function of x. (Probability and Random Variables)

Estimation Theory

Spring 2013

8 / 152

Probability Density Function (PDF)

The Probability Density Function, if it exists, is given by

    p_X(x) = dP_X(x)/dx

When we deal with a single random variable the subscripts are removed : p(x) = dP(x)/dx.

Properties :
(i)   ∫_{−∞}^{∞} p(x) dx = P(∞) − P(−∞) = 1
(ii)  Pr[{ζ | X(ζ) ≤ x}] = Pr[X ≤ x] = P(x) = ∫_{−∞}^{x} p(α) dα
(iii) Pr[x_1 < X ≤ x_2] = ∫_{x_1}^{x_2} p(x) dx

Gaussian Probability Density Function

A random variable is distributed according to a Gaussian or normal distribution if the PDF is given by :

    p(x) = (1/√(2πσ^2)) exp( −(x − µ)^2 / (2σ^2) )

The PDF has two parameters : µ, the mean, and σ^2, the variance. We write X ∼ N(µ, σ^2) when the random variable X has a normal (Gaussian) distribution with mean µ and standard deviation σ. Small σ means small variability (uncertainty) and large σ means large variability.

Remark : The Gaussian distribution is important because, according to the Central Limit Theorem, the sum of N independent RVs has a PDF that converges to a Gaussian distribution as N goes to infinity.
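A minimal numerical sketch of the Central Limit Theorem remark above (added for illustration, not part of the original slides; it assumes numpy and uses made-up values) :

```python
import numpy as np

# Empirical check of the CLT remark: the normalized mean of N independent
# uniform RVs is close to N(0, 1) for large N (illustrative sketch).
rng = np.random.default_rng(0)
N, n_runs = 100, 50_000
mu, sigma = 0.5, np.sqrt(1.0 / 12.0)          # mean and std of U[0, 1]

x = rng.uniform(0.0, 1.0, size=(n_runs, N))
z = np.sqrt(N) * (x.mean(axis=1) - mu) / sigma

print("mean of z:", z.mean())                 # close to 0
print("var  of z:", z.var())                  # close to 1
```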

Some other common PDFs

Uniform (b > a) :       p(x) = 1/(b − a) for a ≤ x ≤ b, and 0 otherwise.
Exponential (µ > 0) :   p(x) = (1/µ) exp(−x/µ) for x ≥ 0, and 0 otherwise.
Rayleigh (σ > 0) :      p(x) = (x/σ^2) exp(−x^2/(2σ^2)) for x ≥ 0, and 0 otherwise.
Chi-square χ^2 (ν degrees of freedom) :  p(x) = 1/(2^{ν/2} Γ(ν/2)) x^{ν/2 − 1} exp(−x/2) for x ≥ 0, and 0 otherwise,

where Γ(z) = ∫_0^∞ u^{z−1} e^{−u} du.

Unbiased Estimators

Remark : When several unbiased estimators of the same parameter from independent sets of data are available, i.e., θˆ_1, θˆ_2, . . . , θˆ_n, a better estimator can be obtained by averaging :

    θˆ = (1/n) Σ_{i=1}^{n} θˆ_i ,    E(θˆ) = θ

Assuming that the estimators have the same variance, we have :

    var(θˆ) = (1/n^2) Σ_{i=1}^{n} var(θˆ_i) = (1/n^2) n var(θˆ_i) = var(θˆ_i)/n

By increasing n, the variance will decrease (if n → ∞, θˆ → θ). This is not the case for biased estimators, no matter how many estimators are averaged.

Minimum Variance Criterion

The most logical criterion for estimation is the Mean Square Error (MSE) :

    mse(θˆ) = E[(θˆ − θ)^2]

Unfortunately this criterion leads to unrealizable estimators (the estimator depends on θ) :

    mse(θˆ) = E{[θˆ − E(θˆ) + E(θˆ) − θ]^2} = E{[θˆ − E(θˆ) + b(θ)]^2}

where b(θ) = E(θˆ) − θ is defined as the bias of the estimator. Therefore :

    mse(θˆ) = E{[θˆ − E(θˆ)]^2} + 2b(θ)E[θˆ − E(θˆ)] + b^2(θ) = var(θˆ) + b^2(θ)

Instead of minimizing the MSE we can minimize the variance of the unbiased estimators : the Minimum Variance Unbiased (MVU) Estimator.

Minimum Variance Unbiased Estimator

Existence of the MVU Estimator : In general the MVU estimator does not always exist. There may be no unbiased estimator, or none of the unbiased estimators has a uniformly minimum variance.

Finding the MVU Estimator : There is no known procedure which always leads to the MVU estimator. Three existing approaches are :
1. Determine the Cramer-Rao lower bound (CRLB) and check to see if some estimator satisfies it.
2. Apply the Rao-Blackwell-Lehmann-Scheffe theorem (we will skip it).
3. Restrict attention to linear unbiased estimators.

Cramer-Rao Lower Bound


Cramer-Rao Lower Bound

The CRLB is a lower bound on the variance of any unbiased estimator :

    var(θˆ) ≥ CRLB(θ)

Note that the CRLB is a function of θ. It tells us the best performance that can be achieved (useful in feasibility studies and for comparison with other estimators). It may lead us to compute the MVU estimator.

Cramer-Rao Lower Bound

Theorem (scalar case) : Assume that the PDF p(x; θ) satisfies the regularity condition

    E[ ∂ ln p(x; θ)/∂θ ] = 0    for all θ

Then the variance of any unbiased estimator θˆ satisfies

    var(θˆ) ≥ [ −E( ∂^2 ln p(x; θ)/∂θ^2 ) ]^{−1}

An unbiased estimator that attains the CRLB can be found iff

    ∂ ln p(x; θ)/∂θ = I(θ)(g(x) − θ)

for some functions g(x) and I(θ). The estimator is θˆ = g(x) and the minimum variance is 1/I(θ).

Cramer-Rao Lower Bound

Example : Consider x[0] = A + w[0] with w[0] ∼ N(0, σ^2).

    p(x[0]; A) = (1/√(2πσ^2)) exp( −(x[0] − A)^2 / (2σ^2) )

    ln p(x[0]; A) = −ln √(2πσ^2) − (x[0] − A)^2 / (2σ^2)

Then

    ∂ ln p(x[0]; A)/∂A = (x[0] − A)/σ^2    and    −∂^2 ln p(x[0]; A)/∂A^2 = 1/σ^2

According to the theorem :

    var(Aˆ) ≥ σ^2 ,    I(A) = 1/σ^2    and    Aˆ = g(x[0]) = x[0]

Cramer-Rao Lower Bound

Example : Consider multiple observations of a DC level in WGN :

    x[n] = A + w[n] ,   n = 0, 1, . . . , N − 1 ,   with w[n] ∼ N(0, σ^2)

    p(x; A) = (1/(2πσ^2)^{N/2}) exp( −(1/(2σ^2)) Σ_{n=0}^{N−1} (x[n] − A)^2 )

Then

    ∂ ln p(x; A)/∂A = ∂/∂A [ −ln (2πσ^2)^{N/2} − (1/(2σ^2)) Σ_{n=0}^{N−1} (x[n] − A)^2 ]
                    = (1/σ^2) Σ_{n=0}^{N−1} (x[n] − A) = (N/σ^2) ( (1/N) Σ_{n=0}^{N−1} x[n] − A )

According to the theorem :

    var(Aˆ) ≥ σ^2/N ,    I(A) = N/σ^2    and    Aˆ = g(x) = (1/N) Σ_{n=0}^{N−1} x[n]
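A minimal numerical sketch of this bound (added for illustration, not part of the original slides; it assumes numpy and uses made-up values for A, σ and N) :

```python
import numpy as np

# Monte Carlo check that the sample mean of a DC level in WGN
# attains the CRLB var(A_hat) >= sigma^2 / N (illustrative sketch).
rng = np.random.default_rng(0)
A, sigma, N, n_runs = 1.0, 2.0, 50, 100_000

x = A + sigma * rng.standard_normal((n_runs, N))  # n_runs data records
A_hat = x.mean(axis=1)                            # sample-mean estimator per record

print("empirical variance:", A_hat.var())
print("CRLB sigma^2/N    :", sigma**2 / N)
```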

Transformation of Parameters

If it is desired to estimate α = g(θ), then the CRLB is :

    var(αˆ) ≥ (∂g/∂θ)^2 [ −E( ∂^2 ln p(x; θ)/∂θ^2 ) ]^{−1}

Example : Compute the CRLB for estimation of the power (A^2) of a DC level in noise :

    var(Aˆ^2) ≥ (2A)^2 σ^2/N = 4A^2 σ^2/N

Definition (Efficient estimator) : An unbiased estimator that attains the CRLB is said to be efficient.

Example : Knowing that x¯ = (1/N) Σ_{n=0}^{N−1} x[n] is an efficient estimator for A, is x¯^2 an efficient estimator for A^2 ?

Transformation of Parameters

Solution : Knowing that x¯ ∼ N(A, σ^2/N), we have :

    E(x¯^2) = E^2(x¯) + var(x¯) = A^2 + σ^2/N ≠ A^2

So the estimator Aˆ^2 = x¯^2 is not even unbiased. Let's look at the variance of this estimator :

    var(x¯^2) = E(x¯^4) − E^2(x¯^2)

but from the moments of Gaussian RVs (slide 15) :

    E(x¯^4) = A^4 + 6A^2 σ^2/N + 3(σ^2/N)^2

Therefore :

    var(x¯^2) = A^4 + 6A^2 σ^2/N + 3σ^4/N^2 − (A^2 + σ^2/N)^2 = 4A^2 σ^2/N + 2σ^4/N^2

Transformation of Parameters

Remarks : The estimator Aˆ^2 = x¯^2 is biased and not efficient. As N → ∞ the bias goes to zero and the variance of the estimator approaches the CRLB. Estimators of this type are called asymptotically efficient.

General Remarks :
If g(θ) = aθ + b is an affine function of θ and θˆ is efficient, then g(θˆ) is an efficient estimator of g(θ). First, it is unbiased : E(aθˆ + b) = aθ + b = g(θ). Moreover :

    var(g(θˆ)) ≥ (∂g/∂θ)^2 var(θˆ) = a^2 var(θˆ)

but var(g(θˆ)) = var(aθˆ + b) = a^2 var(θˆ), so the CRLB is achieved.

If g(θ) is a nonlinear function of θ and θˆ is an efficient estimator, then g(θˆ) is an asymptotically efficient estimator.

Cramer-Rao Lower Bound

Theorem (Vector Parameter) : Assume that the PDF p(x; θ) satisfies the regularity condition

    E[ ∂ ln p(x; θ)/∂θ ] = 0    for all θ

Then the covariance matrix of any unbiased estimator θˆ satisfies

    C_θˆ − I^{−1}(θ) ≥ 0

where ≥ 0 means that the matrix is positive semidefinite. I(θ) is called the Fisher information matrix and is given by :

    [I(θ)]_{ij} = −E[ ∂^2 ln p(x; θ)/∂θ_i ∂θ_j ]

An unbiased estimator that attains the CRLB can be found iff :

    ∂ ln p(x; θ)/∂θ = I(θ)(g(x) − θ)

CRLB Extension to Vector Parameter

Example : Consider a DC level in WGN with both A and σ^2 unknown. Compute the CRLB for estimation of θ = [A  σ^2]^T.

    ln p(x; θ) = −(N/2) ln 2π − (N/2) ln σ^2 − (1/(2σ^2)) Σ_{n=0}^{N−1} (x[n] − A)^2

The Fisher information matrix is :

    I(θ) = −E [ ∂^2 ln p/∂A^2       ∂^2 ln p/∂A∂σ^2
                ∂^2 ln p/∂A∂σ^2    ∂^2 ln p/∂(σ^2)^2 ]
         =   [ N/σ^2    0
               0        N/(2σ^4) ]

The matrix is diagonal (just for this example) and can be easily inverted to yield :

    var(Aˆ) ≥ σ^2/N    and    var(σˆ^2) ≥ 2σ^4/N

Is there any unbiased estimator that achieves these bounds ?

Transformation of Parameters

If it is desired to estimate α = g(θ), and the CRLB for the covariance of θˆ is I^{−1}(θ), then :

    C_αˆ ≥ (∂g/∂θ) I^{−1}(θ) (∂g/∂θ)^T

Example : Consider a DC level in WGN with A and σ^2 unknown. Compute the CRLB for estimation of the signal-to-noise ratio α = A^2/σ^2.

We have θ = [A  σ^2]^T and α = g(θ) = θ_1^2/θ_2, so the Jacobian is :

    ∂g(θ)/∂θ = [ ∂g/∂θ_1   ∂g/∂θ_2 ] = [ 2A/σ^2   −A^2/σ^4 ]

So the CRLB is :

    var(αˆ) ≥ [ 2A/σ^2   −A^2/σ^4 ] [ σ^2/N  0 ; 0  2σ^4/N ] [ 2A/σ^2 ; −A^2/σ^4 ] = (4α + 2α^2)/N

Linear Models with WGN

If the N data samples observed can be modeled as

    x = Hθ + w

where
    x : N × 1 observation vector
    H : N × p observation matrix (known, rank p)
    θ : p × 1 vector of parameters to be estimated
    w : N × 1 noise vector with PDF N(0, σ^2 I)

compute the CRLB and the MVU estimator that achieves this bound.

Step 1 : Compute ln p(x; θ).
Step 2 : Compute I(θ) = −E[ ∂^2 ln p(x; θ)/∂θ^2 ] and the covariance matrix of θˆ : C_θˆ = I^{−1}(θ).
Step 3 : Find the MVU estimator g(x) by factoring

    ∂ ln p(x; θ)/∂θ = I(θ)[g(x) − θ]

Linear Models with WGN

Step 1 : ln p(x; θ) = −ln (2πσ^2)^{N/2} − (1/(2σ^2)) (x − Hθ)^T (x − Hθ).

Step 2 :

    ∂ ln p(x; θ)/∂θ = −(1/(2σ^2)) ∂/∂θ [ x^T x − 2x^T Hθ + θ^T H^T Hθ ] = (1/σ^2) [ H^T x − H^T Hθ ]

    Then I(θ) = −E[ ∂^2 ln p(x; θ)/∂θ^2 ] = (1/σ^2) H^T H

Step 3 : Find the MVU estimator g(x) by factoring

    ∂ ln p(x; θ)/∂θ = I(θ)[g(x) − θ] = (H^T H/σ^2) [ (H^T H)^{−1} H^T x − θ ]

Therefore :

    θˆ = g(x) = (H^T H)^{−1} H^T x    with    C_θˆ = I^{−1}(θ) = σ^2 (H^T H)^{−1}

Linear Models with WGN

For a linear model with WGN, x = Hθ + w, the MVU estimator is :

    θˆ = (H^T H)^{−1} H^T x

This estimator is efficient and attains the CRLB. That the estimator is unbiased can be seen easily from :

    E(θˆ) = (H^T H)^{−1} H^T E(Hθ + w) = θ

The statistical performance of θˆ is completely specified : θˆ is a linear transformation of the Gaussian vector x and hence has a Gaussian distribution :

    θˆ ∼ N(θ, σ^2 (H^T H)^{−1})
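A minimal sketch of this estimator in code (added for illustration, not part of the original slides; it assumes numpy, and the straight-line model and values are made up) :

```python
import numpy as np

# MVU estimator for the linear model x = H*theta + w with WGN
# (illustrative sketch; the straight-line model is an assumed example).
rng = np.random.default_rng(1)
N, sigma = 100, 0.5
theta_true = np.array([1.0, -2.0])             # intercept and slope

n = np.arange(N)
H = np.column_stack([np.ones(N), n])           # N x 2 observation matrix
x = H @ theta_true + sigma * rng.standard_normal(N)

theta_hat = np.linalg.solve(H.T @ H, H.T @ x)  # (H^T H)^{-1} H^T x
C_theta = sigma**2 * np.linalg.inv(H.T @ H)    # covariance of the estimate

print("theta_hat:", theta_hat)
print("std devs :", np.sqrt(np.diag(C_theta)))
```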

Example (Curve Fitting)

Consider fitting the data x[n] by a p-th order polynomial function of n :

    x[n] = θ_0 + θ_1 n + θ_2 n^2 + · · · + θ_p n^p + w[n]

We have N data samples, so :

    x = [x[0], x[1], . . . , x[N − 1]]^T
    w = [w[0], w[1], . . . , w[N − 1]]^T
    θ = [θ_0, θ_1, . . . , θ_p]^T

and x = Hθ + w, where H is the N × (p + 1) matrix :

    H = [ 1    0      0        · · ·  0
          1    1      1        · · ·  1
          1    2      4        · · ·  2^p
          ...
          1    N−1    (N−1)^2  · · ·  (N−1)^p ]

Hence the MVU estimator is θˆ = (H^T H)^{−1} H^T x.

Example (Fourier Analysis)

Consider the Fourier analysis of the data x[n] :

    x[n] = Σ_{k=1}^{M} a_k cos(2πkn/N) + Σ_{k=1}^{M} b_k sin(2πkn/N) + w[n]

so we have θ = [a_1, a_2, . . . , a_M, b_1, b_2, . . . , b_M]^T and x = Hθ + w, where :

    H = [h_{a1}, h_{a2}, . . . , h_{aM}, h_{b1}, h_{b2}, . . . , h_{bM}]

with the n-th elements of the columns given by

    [h_{ak}]_n = cos(2πkn/N)    and    [h_{bk}]_n = sin(2πkn/N) ,    n = 0, 1, . . . , N − 1

Hence the MVU estimate of the Fourier coefficients is θˆ = (H^T H)^{−1} H^T x.

Example (Fourier Analysis)

After simplification (noting that (H^T H)^{−1} = (2/N) I), we have :

    θˆ = (2/N) [ (h_{a1})^T x, . . . , (h_{aM})^T x, (h_{b1})^T x, . . . , (h_{bM})^T x ]^T

which is the same as the standard solution :

    aˆ_k = (2/N) Σ_{n=0}^{N−1} x[n] cos(2πkn/N) ,    bˆ_k = (2/N) Σ_{n=0}^{N−1} x[n] sin(2πkn/N)

From the properties of linear models the estimates are unbiased. The covariance matrix is :

    C_θˆ = σ^2 (H^T H)^{−1} = (2σ^2/N) I

Note that θˆ is Gaussian and C_θˆ is diagonal (the amplitude estimates are independent).

Example (System Identification)

Consider identification of a Finite Impulse Response (FIR) model h[k], k = 0, 1, . . . , p − 1, with input u[n] and output x[n] provided for n = 0, 1, . . . , N − 1 :

    x[n] = Σ_{k=0}^{p−1} h[k] u[n − k] + w[n] ,    n = 0, 1, . . . , N − 1

The FIR model can be represented by the linear model x = Hθ + w with

    θ = [h[0], h[1], . . . , h[p − 1]]^T    (p × 1)

    H = [ u[0]      0          · · ·  0
          u[1]      u[0]       · · ·  0
          ...
          u[N−1]    u[N−2]     · · ·  u[N−p] ]    (N × p)

The MVU estimate is θˆ = (H^T H)^{−1} H^T x with C_θˆ = σ^2 (H^T H)^{−1}.
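A minimal sketch of this FIR identification as a linear model (added for illustration, not part of the original slides; it assumes numpy, and the 3-tap impulse response, input and noise level are made up) :

```python
import numpy as np

# FIR identification as a linear model x = H*theta + w (illustrative sketch).
rng = np.random.default_rng(2)
h_true = np.array([0.8, -0.4, 0.2])       # assumed impulse response (p = 3)
p, N, sigma = len(h_true), 200, 0.1

u = rng.standard_normal(N)                # known input
H = np.zeros((N, p))                      # convolution matrix with H[n, k] = u[n-k]
for k in range(p):
    H[k:, k] = u[:N - k]

x = H @ h_true + sigma * rng.standard_normal(N)
h_hat = np.linalg.solve(H.T @ H, H.T @ x)
print("h_hat:", h_hat)
```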

Linear Models with Colored Gaussian Noise

Determine the MVU estimator for the linear model x = Hθ + w, where w is colored Gaussian noise with distribution N(0, C).

Whitening approach : Since C is positive definite, its inverse can be factored as C^{−1} = D^T D, where D is an invertible matrix. This matrix acts as a whitening transformation for w :

    E[(Dw)(Dw)^T] = D E(ww^T) D^T = DCD^T = D D^{−1} D^{−T} D^T = I

Now transform the linear model x = Hθ + w to :

    x' = Dx = DHθ + Dw = H'θ + w'

where w' = Dw ∼ N(0, I) is white, so we can compute the MVU estimator as :

    θˆ = (H'^T H')^{−1} H'^T x' = (H^T D^T D H)^{−1} H^T D^T D x

so we have :

    θˆ = (H^T C^{−1} H)^{−1} H^T C^{−1} x    with    C_θˆ = (H'^T H')^{−1} = (H^T C^{−1} H)^{−1}

Linear Models with Known Components

Consider a linear model x = Hθ + s + w, where s is a known signal. To determine the MVU estimator let x' = x − s, so that x' = Hθ + w is a standard linear model. The MVU estimator is :

    θˆ = (H^T H)^{−1} H^T (x − s)    with    C_θˆ = σ^2 (H^T H)^{−1}

Example : Consider a DC level and exponential in WGN : x[n] = A + r^n + w[n], where r is known. Then we have :

    x = 1 A + s + w ,    with 1 = [1, 1, . . . , 1]^T and s = [1, r, r^2, . . . , r^{N−1}]^T

The MVU estimator is :

    Aˆ = (H^T H)^{−1} H^T (x − s) = (1/N) Σ_{n=0}^{N−1} (x[n] − r^n)    with    var(Aˆ) = σ^2/N

Best Linear Unbiased Estimators (BLUE) Problems of finding the MVU estimators : The MVU estimator does not always exist or impossible to find. The PDF of data may be unknown. BLUE is a suboptimal estimator that : restricts estimates to be linear in data ; restricts estimates to be unbiased ;

θˆ = Ax

ˆ = AE (x) = θ E (θ)

minimizes the variance of the estimates ; needs only the mean and the variance of the data (not the PDF). As a result, in general, the PDF of the estimates cannot be computed. Remark : The unbiasedness restriction implies a linear model for the data. However, it may still be used if the data are transformed suitably or the model is linearized. (Best Linear Unbiased Estimators)

Estimation Theory

Spring 2013

52 / 152

Finding the BLUE (Scalar Case)

1. Choose a linear estimator for the observed data x[n], n = 0, 1, . . . , N − 1 :

    θˆ = Σ_{n=0}^{N−1} a_n x[n] = a^T x ,    where a = [a_0, a_1, . . . , a_{N−1}]^T

2. Restrict the estimate to be unbiased :

    E(θˆ) = Σ_{n=0}^{N−1} a_n E(x[n]) = θ

3. Minimize the variance :

    var(θˆ) = E{[θˆ − E(θˆ)]^2} = E{[a^T x − a^T E(x)]^2} = E{a^T [x − E(x)][x − E(x)]^T a} = a^T C a

Finding the BLUE (Scalar Case)

Consider the problem of amplitude estimation of a known signal in noise :

    x[n] = θ s[n] + w[n]

1. Choose a linear estimator : θˆ = Σ_{n=0}^{N−1} a_n x[n] = a^T x
2. Restrict the estimate to be unbiased : E(θˆ) = a^T E(x) = a^T s θ = θ, hence a^T s = 1, where s = [s[0], s[1], . . . , s[N − 1]]^T
3. Minimize a^T C a subject to a^T s = 1.

The constrained optimization can be solved using Lagrange multipliers :

    minimize J = a^T C a + λ(a^T s − 1)

The optimal solution is :

    θˆ = s^T C^{−1} x / (s^T C^{−1} s)    and    var(θˆ) = 1 / (s^T C^{−1} s)

Finding the BLUE (Vector Case)

Theorem (Gauss–Markov) : If the data are of the general linear model form x = Hθ + w, where w is a noise vector with zero mean and covariance C (the PDF of w is arbitrary), then the BLUE of θ is :

    θˆ = (H^T C^{−1} H)^{−1} H^T C^{−1} x

and the covariance matrix of θˆ is C_θˆ = (H^T C^{−1} H)^{−1}.

Remark : If the noise is Gaussian then the BLUE is the MVU estimator.

Finding the BLUE

Example : Consider the problem of a DC level in noise : x[n] = A + w[n], where w[n] has unspecified PDF with var(w[n]) = σ_n^2. We have θ = A and H = 1 = [1, 1, . . . , 1]^T. The covariance matrix is :

    C = diag(σ_0^2, σ_1^2, . . . , σ_{N−1}^2)    ⇒    C^{−1} = diag(1/σ_0^2, 1/σ_1^2, . . . , 1/σ_{N−1}^2)

and hence the BLUE is :

    θˆ = (H^T C^{−1} H)^{−1} H^T C^{−1} x = ( Σ_{n=0}^{N−1} x[n]/σ_n^2 ) / ( Σ_{n=0}^{N−1} 1/σ_n^2 )

and the minimum covariance is :

    C_θˆ = (H^T C^{−1} H)^{−1} = ( Σ_{n=0}^{N−1} 1/σ_n^2 )^{−1}
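A minimal sketch of this variance-weighted average in code (added for illustration, not part of the original slides; it assumes numpy, and the per-sample variances are made up) :

```python
import numpy as np

# BLUE for a DC level in noise of unequal variances (illustrative sketch).
rng = np.random.default_rng(3)
A = 2.0
sigma2 = np.array([0.1, 0.5, 1.0, 4.0, 0.2])        # var(w[n]) per sample
x = A + np.sqrt(sigma2) * rng.standard_normal(len(sigma2))

w = 1.0 / sigma2
A_blue = np.sum(w * x) / np.sum(w)   # sum(x[n]/sigma_n^2) / sum(1/sigma_n^2)
var_blue = 1.0 / np.sum(w)
print("A_blue =", A_blue, " var =", var_blue)
```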

Maximum Likelihood Estimation


Maximum Likelihood Estimation Problems : MVU estimator does not often exist or cannot be found. BLUE is restricted to linear models. Maximum Likelihood Estimator (MLE) : can always be applied if the PDF is known ; is optimal for large data size ; is computationally complex and requires numerical methods. Basic Idea : Choose the parameter value that makes the observed data, the most likely data to have been observed. Likelihood Function : is the PDF p(x; θ) when θ is regarded as a variable (not a parameter). ML Estimate : is the value of θ that maximizes the likelihood function. Procedure : Find log-likelihood function ln p(x; θ) ; differentiate w.r.t θ and set to zero and solve for θ. (Maximum Likelihood Estimation)

Estimation Theory

Spring 2013

58 / 152

Maximum Likelihood Estimation

Example : Consider a DC level in WGN, x[n] = A + w[n], with unknown variance. Suppose that A > 0 and σ^2 = A. The PDF is :

    p(x; A) = (1/(2πA)^{N/2}) exp( −(1/(2A)) Σ_{n=0}^{N−1} (x[n] − A)^2 )

Taking the derivative of the log-likelihood function, we have :

    ∂ ln p(x; A)/∂A = −N/(2A) + (1/A) Σ_{n=0}^{N−1} (x[n] − A) + (1/(2A^2)) Σ_{n=0}^{N−1} (x[n] − A)^2

What is the CRLB ? Does an MVU estimator exist ? The MLE can be found by setting the above equation to zero :

    Aˆ = −1/2 + √( (1/N) Σ_{n=0}^{N−1} x^2[n] + 1/4 )
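A minimal numerical sketch of this MLE (added for illustration, not part of the original slides; it assumes numpy, and the true A, N and grid are made up) :

```python
import numpy as np

# MLE for a DC level in WGN with A > 0 and sigma^2 = A
# (illustrative check of the closed-form solution on simulated data).
rng = np.random.default_rng(4)
A_true, N = 3.0, 1000
x = A_true + np.sqrt(A_true) * rng.standard_normal(N)

# Closed-form MLE: A_hat = -1/2 + sqrt(mean(x^2) + 1/4)
A_mle = -0.5 + np.sqrt(np.mean(x**2) + 0.25)

# Brute-force check: evaluate the log-likelihood on a grid of A values.
A_grid = np.linspace(0.5, 6.0, 2000)
ll = [-(N / 2) * np.log(2 * np.pi * A) - np.sum((x - A) ** 2) / (2 * A) for A in A_grid]
print("closed form:", A_mle, " grid search:", A_grid[np.argmax(ll)])
```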

Stochastic Convergence

Convergence in distribution : Let {p(x_N)} be a sequence of PDFs. If there exists a PDF p(x) such that

    lim_{N→∞} p(x_N) = p(x)

at every point x at which p(x) is continuous, we say that p(x_N) converges in distribution to p(x). We also write x_N →^d x.

Example : Consider N independent RVs x_1, x_2, . . . , x_N with mean µ and finite variance σ^2. Let x¯_N = (1/N) Σ_{n=1}^{N} x_n. Then, according to the Central Limit Theorem (CLT), z_N = √N (x¯_N − µ)/σ converges in distribution to z ∼ N(0, 1).

Stochastic Convergence Convergence in probability : Sequence of random variables {xN } converges in probability to the random variable x if for every > 0 lim Pr {|xN − x| > } = 0

N→∞

Convergence with probability 1 (almost sure convergence) : Sequence of random variables {xN } converges with probability 1 to the random variable x if and only if for all possible events Pr { lim xN = x} = 1 N→∞

Example Consider N independent RV x1 , x2 , . . . , xN with mean µ and finite variance N 1  xn . Then according to the Law of Large Numbers σ 2 . Let x¯N = N n=1 (LLN), z¯N converges in probability to µ. (Maximum Likelihood Estimation)

Estimation Theory

Spring 2013

61 / 152

Asymptotic Properties of Estimators Asymptotic Unbiasedness : Estimator θˆN is an asymptotically unbiased estimator of θ if : lim E (θˆN − θ) = 0 N→∞

Asymptotic Distribution : It refers to p(xn ) as it evolves from n = 1, 2, . . ., especially for large value of n (it is not the ultimate form of distribution, which may be degenerate). Asymptotic Variance : is not equal to lim var(θˆN ) (which is the N→∞

limiting variance). It is defined as : 1 lim E {N[θˆN − lim E (θˆN )]2 } asymptotic var(θˆN ) = N→∞ N N→∞ Mean-Square Convergence : Estimator θˆN converges to θ in a mean-squared sense, if : lim E [(θˆN − θ)2 ] = 0 N→∞

(Maximum Likelihood Estimation)

Estimation Theory

Spring 2013

62 / 152

Consistency

The estimator θˆ_N is a consistent estimator of θ if for every ε > 0

    plim(θˆ_N) = θ    ⇔    lim_{N→∞} Pr[|θˆ_N − θ| > ε] = 0

Remarks :
If θˆ_N is asymptotically unbiased and its limiting variance is zero, then it converges to θ in mean-square.
If θˆ_N converges to θ in mean-square, then the estimator is consistent.
Asymptotic unbiasedness does not imply consistency and vice versa.
plim can be treated as an operator, e.g. : plim(xy) = plim(x) plim(y) and plim(x/y) = plim(x)/plim(y).
The importance of consistency is that any continuous function of a consistent estimator is itself a consistent estimator.

Maximum Likelihood Estimation Properties : MLE may be biased and is not necessarily an efficient estimator. However : MLE is a consistent estimator meaning that lim Pr [|θˆ − θ| > ] = 0

N→∞

MLE asymptotically attains the CRLB (asymptotic variance is equal to CRLB). Under some “regularity” conditions, the MLE is asymptotically normally distributed a θˆ ∼ N (θ, I−1 (θ)) even if the PDF of x is not Gaussian. If an MVU estimator exists then ML procedure will find it.

(Maximum Likelihood Estimation)

Estimation Theory

Spring 2013

64 / 152

Maximum Likelihood Estimation

Example : Consider a DC level in WGN with known variance σ^2.

Solution :

    p(x; A) = (1/(2πσ^2)^{N/2}) exp( −(1/(2σ^2)) Σ_{n=0}^{N−1} (x[n] − A)^2 )

Then

    ∂ ln p(x; A)/∂A = (1/σ^2) Σ_{n=0}^{N−1} (x[n] − A) = 0    ⇒    Σ_{n=0}^{N−1} x[n] − N Aˆ = 0

which leads to Aˆ = x¯ = (1/N) Σ_{n=0}^{N−1} x[n].

MLE for Transformed Parameters (Invariance Property)

Theorem (Invariance Property of the MLE) : The MLE of the parameter α = g(θ), where the PDF p(x; θ) is parameterized by θ, is given by αˆ = g(θˆ), where θˆ is the MLE of θ. It can be proved using the property of consistent estimators.

If α = g(θ) is a one-to-one function, then

    αˆ = arg max_α p(x; g^{−1}(α)) = g(θˆ)

If α = g(θ) is not a one-to-one function, then the modified likelihood

    p̄_T(x; α) = max_{θ : α = g(θ)} p(x; θ)

is maximized, and αˆ = arg max_α p̄_T(x; α) = g(θˆ).

MLE for Transformed Parameters (Invariance Property)

Example : Consider a DC level in WGN and find the MLE of α = exp(A). Since g(θ) is a one-to-one function,

    αˆ = arg max_α p_T(x; α) = arg max_α p(x; ln α) = exp(x¯)

Example : Consider a DC level in WGN and find the MLE of α = A^2. Since g(θ) is not a one-to-one function,

    αˆ = arg max_{α ≥ 0} max{ p(x; √α), p(x; −√α) } = ( arg max_{−∞ < A < ∞} p(x; A) )^2 = x¯^2

Least Squares

For the linear LS problem, H^T H collects the inner products of the columns of H :

    H^T H = [ <h_1, h_1>   <h_1, h_2>   · · ·   <h_1, h_p>
              <h_2, h_1>   <h_2, h_2>   · · ·   <h_2, h_p>
              ...
              <h_p, h_1>   <h_p, h_2>   · · ·   <h_p, h_p> ]

If the columns of H are orthonormal then H^T H = I and θˆ = H^T x. In this case we have θˆ_i = h_i^T x, thus

    sˆ = Hθˆ = Σ_{i=1}^{p} θˆ_i h_i = Σ_{i=1}^{p} (h_i^T x) h_i

If we increase the number of parameters (the order of the linear model), we can easily compute the new estimate.

Choosing the Model Order Suppose that you have a set of data and the objective is to fit a polynomial to data. What is the best polynomial order ? Remarks : It is clear that by increasing the order Jmin is monotonically non-increasing. By choosing p = N we can perfectly fit the data to model. However, we fit the noise as well. We should choose the simplest model that adequately describes the data. We increase the order only if the cost reduction is significant. If we have an idea about the expected level of Jmin , we increase p to approximately attain this level. There is an order-recursive LS algorithm to efficiently compute a (p + 1)-th order model based on a p-th order one (See 8.6). (Least Squares)

Estimation Theory

Spring 2013

81 / 152

Sequential Least Squares

Suppose that θˆ[N − 1], based on x[N − 1] = [x[0], . . . , x[N − 1]], is available. If we get a new data sample x[N], we want to compute θˆ[N] as a function of θˆ[N − 1] and x[N].

Example : Consider the LS estimate of a DC level : Aˆ[N − 1] = (1/N) Σ_{n=0}^{N−1} x[n]. We have :

    Aˆ[N] = (1/(N + 1)) Σ_{n=0}^{N} x[n] = (1/(N + 1)) ( N · (1/N) Σ_{n=0}^{N−1} x[n] + x[N] )
          = (N/(N + 1)) Aˆ[N − 1] + (1/(N + 1)) x[N]
          = Aˆ[N − 1] + (1/(N + 1)) (x[N] − Aˆ[N − 1])

    new estimate = old estimate + gain × prediction error
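A minimal sketch of this recursion in code (added for illustration, not part of the original slides; it assumes numpy, and the DC level and noise level are made up) :

```python
import numpy as np

# Sequential (recursive) update of the sample-mean LS estimate of a DC level.
rng = np.random.default_rng(5)
x = 1.5 + 0.3 * rng.standard_normal(500)     # x[n] = A + w[n], A = 1.5 assumed

A_hat = x[0]
for N, x_new in enumerate(x[1:], start=1):
    gain = 1.0 / (N + 1)
    A_hat = A_hat + gain * (x_new - A_hat)   # old estimate + gain * prediction error

print("recursive:", A_hat, " batch mean:", x.mean())
```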

Sequential Least Squares

Example (DC level in uncorrelated noise) : x[n] = A + w[n] with var(w[n]) = σ_n^2. The WLS estimate (or BLUE) is :

    Aˆ[N − 1] = ( Σ_{n=0}^{N−1} x[n]/σ_n^2 ) / ( Σ_{n=0}^{N−1} 1/σ_n^2 )

Similar to the previous example we can obtain :

    Aˆ[N] = Aˆ[N − 1] + ( (1/σ_N^2) / Σ_{n=0}^{N} 1/σ_n^2 ) (x[N] − Aˆ[N − 1])
          = Aˆ[N − 1] + K[N] (x[N] − Aˆ[N − 1])

The gain factor K[N] can be reformulated as :

    K[N] = var(Aˆ[N − 1]) / ( var(Aˆ[N − 1]) + σ_N^2 )

Sequential Least Squares

Consider the general linear model x = Hθ + w, where w is uncorrelated noise with covariance matrix C. The BLUE (or WLS) estimator is :

    θˆ = (H^T C^{−1} H)^{−1} H^T C^{−1} x    with    C_θˆ = (H^T C^{−1} H)^{−1}

Let's define :

    C[n] = diag(σ_0^2, σ_1^2, . . . , σ_n^2)
    H[n] = [ H[n − 1] ; h^T[n] ]    (H[n − 1] is n × p and h^T[n] is 1 × p)
    x[n] = [x[0], x[1], . . . , x[n]]^T

The objective is to find θˆ[n], based on n + 1 data samples, as a function of θˆ[n − 1] and the new data x[n]. The batch estimator is :

    θˆ[n − 1] = Σ[n − 1] H^T[n − 1] C^{−1}[n − 1] x[n − 1]
    with Σ[n − 1] = (H^T[n − 1] C^{−1}[n − 1] H[n − 1])^{−1}

Estimator Update :

    θˆ[n] = θˆ[n − 1] + K[n] (x[n] − h^T[n] θˆ[n − 1])
    where K[n] = Σ[n − 1] h[n] / (σ_n^2 + h^T[n] Σ[n − 1] h[n])

Covariance Update :

    Σ[n] = (I − K[n] h^T[n]) Σ[n − 1]

The following lemma is used to compute the updates :

Matrix Inversion Lemma : (A + BCD)^{−1} = A^{−1} − A^{−1} B (C^{−1} + D A^{−1} B)^{−1} D A^{−1}

with A = Σ^{−1}[n − 1], B = h[n], C = 1/σ_n^2, D = h^T[n]. Initialization ?
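A minimal sketch of these update equations in code (added for illustration, not part of the original slides; it assumes numpy, and the 2-parameter regressor model and values are made up) :

```python
import numpy as np

# Sequential (recursive) least squares for x[n] = h[n]^T theta + w[n]
# (illustrative sketch; the 2-parameter line model is an assumed example).
rng = np.random.default_rng(6)
theta_true = np.array([1.0, 0.5])
sigma2 = 0.2**2

theta = np.zeros(2)                 # initial estimate
Sigma = 1e6 * np.eye(2)             # large initial covariance (little prior knowledge)

for n in range(200):
    h = np.array([1.0, n / 100.0])                 # regressor at time n
    x_n = h @ theta_true + np.sqrt(sigma2) * rng.standard_normal()
    K = Sigma @ h / (sigma2 + h @ Sigma @ h)       # gain
    theta = theta + K * (x_n - h @ theta)          # estimator update
    Sigma = (np.eye(2) - np.outer(K, h)) @ Sigma   # covariance update

print("theta_hat:", theta)
```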

Constrained Least Squares

Linear constraints of the form Aθ = b can be incorporated in the LS solution. The LS criterion becomes :

    J_c(θ) = (x − Hθ)^T (x − Hθ) + λ^T (Aθ − b)

    ∂J_c/∂θ = −2H^T x + 2H^T Hθ + A^T λ

Setting the gradient equal to zero produces :

    θˆ_c = (H^T H)^{−1} H^T x − (1/2)(H^T H)^{−1} A^T λ = θˆ − (1/2)(H^T H)^{−1} A^T λ

where θˆ = (H^T H)^{−1} H^T x is the unconstrained LSE. Now Aθˆ_c = b can be solved to find λ :

    λ = 2 [A (H^T H)^{−1} A^T]^{−1} (Aθˆ − b)

Nonlinear Least Squares

Many applications have an observation model s(θ) that is not linear in θ (s(θ) ≠ Hθ). This leads to a nonlinear optimization problem that can be solved numerically.

Newton-Raphson Method : Find a zero of the gradient of the criterion by linearizing the gradient around the current estimate :

    θˆ_{k+1} = θˆ_k − [ ∂g(θ)/∂θ ]^{−1} g(θ) |_{θ = θˆ_k}

where g(θ) = (∂s^T(θ)/∂θ) [x − s(θ)] and

    [∂g(θ)/∂θ]_{ij} = Σ_{n=0}^{N−1} [ (∂^2 s[n]/∂θ_i ∂θ_j) (x[n] − s[n]) − (∂s[n]/∂θ_i)(∂s[n]/∂θ_j) ]

Around the solution [x − s(θ)] is small, so the first term in the Jacobian can be neglected. This makes the method equivalent to the Gauss-Newton algorithm, which is numerically more robust.

Nonlinear Least Squares

Gauss-Newton Method : Linearize the signal model around the current estimate and solve the resulting linear problem :

    s(θ) ≈ s(θ_0) + (∂s(θ)/∂θ)|_{θ=θ_0} (θ − θ_0) = s(θ_0) + H(θ_0)(θ − θ_0)

The solution to the linearized problem is :

    θˆ = [H^T(θ_0) H(θ_0)]^{−1} H^T(θ_0) [x − s(θ_0) + H(θ_0) θ_0]
       = θ_0 + [H^T(θ_0) H(θ_0)]^{−1} H^T(θ_0) [x − s(θ_0)]

If we now iterate the solution, it becomes :

    θˆ_{k+1} = θˆ_k + [H^T(θˆ_k) H(θˆ_k)]^{−1} H^T(θˆ_k) [x − s(θˆ_k)]

Remark : Both the Newton-Raphson and the Gauss-Newton methods can have convergence problems.
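A minimal sketch of the Gauss-Newton iteration on a simple one-parameter model (added for illustration, not part of the original slides; it assumes numpy, and the exponential-decay model, values and starting point are made up) :

```python
import numpy as np

# Gauss-Newton iteration for the nonlinear model s[n] = exp(-theta*n)
# (illustrative sketch; a reasonable starting point is assumed, since the
# iteration may diverge from a poor initial guess).
rng = np.random.default_rng(7)
N, theta_true, sigma = 50, 0.1, 0.02
n = np.arange(N)
x = np.exp(-theta_true * n) + sigma * rng.standard_normal(N)

theta = 0.15                                 # initial guess
for _ in range(15):
    s = np.exp(-theta * n)                   # s(theta)
    H = (-n * s).reshape(-1, 1)              # Jacobian ds/dtheta, N x 1
    step = np.linalg.lstsq(H, x - s, rcond=None)[0]
    theta = theta + step[0]                  # theta_{k+1} = theta_k + (H^T H)^{-1} H^T (x - s)

print("theta_hat:", theta)
```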

Nonlinear Least Squares

Transformation of parameters : Transform into a linear problem by seeking an invertible function α = g(θ) such that

    s(θ) = s(g^{−1}(α)) = Hα

Then the nonlinear LSE is θˆ = g^{−1}(αˆ), where αˆ = (H^T H)^{−1} H^T x.

Example (Estimate the amplitude and phase of a sinusoidal signal) :

    s[n] = A cos(2πf_0 n + φ) ,    n = 0, 1, . . . , N − 1

The LS problem is nonlinear in (A, φ) ; however,

    A cos(2πf_0 n + φ) = A cos φ cos(2πf_0 n) − A sin φ sin(2πf_0 n)

If we let α_1 = A cos φ and α_2 = −A sin φ, the signal model becomes linear, s = Hα. The LSE is αˆ = (H^T H)^{−1} H^T x and g^{−1}(α) is :

    A = √(α_1^2 + α_2^2)    and    φ = arctan(−α_2/α_1)

Nonlinear Least Squares

Separability of parameters : If the signal model is linear in some of the parameters, try to write it as s = H(α)β. The LS error can then be minimized with respect to β in closed form :

    βˆ = [H^T(α) H(α)]^{−1} H^T(α) x

The remaining cost is a function of α only, which can be minimized by a numerical method (e.g. brute force) :

    J(α, βˆ) = x^T ( I − H(α)[H^T(α) H(α)]^{−1} H^T(α) ) x

Then

    αˆ = arg max_α  x^T H(α)[H^T(α) H(α)]^{−1} H^T(α) x

Remark : This method is interesting if the dimension of α is much less than the dimension of β.

Nonlinear Least Squares

Example (Damped Exponentials) : Consider the following signal model :

    s[n] = A_1 r^n + A_2 r^{2n} + A_3 r^{3n} ,    n = 0, 1, . . . , N − 1

with θ = [A_1, A_2, A_3, r]^T and 0 < r < 1 known to hold. Let β = [A_1, A_2, A_3]^T, so that s = H(r)β with :

    H(r) = [ 1          1            1
             r          r^2          r^3
             ...
             r^{N−1}    r^{2(N−1)}   r^{3(N−1)} ]

Step 1 : Maximize x^T H(r)[H^T(r) H(r)]^{−1} H^T(r) x to obtain rˆ.
Step 2 : Compute βˆ = [H^T(rˆ) H(rˆ)]^{−1} H^T(rˆ) x.
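A minimal sketch of this separable LS procedure in code (added for illustration, not part of the original slides; it assumes numpy, and the true r, amplitudes and grid are made up) :

```python
import numpy as np

# Separable nonlinear LS for s[n] = A1 r^n + A2 r^2n + A3 r^3n (illustrative sketch).
rng = np.random.default_rng(8)
N = 60
n = np.arange(N)
r_true, beta_true = 0.9, np.array([1.0, -0.5, 0.3])

def H(r):
    return np.column_stack([r**n, r**(2 * n), r**(3 * n)])

x = H(r_true) @ beta_true + 0.05 * rng.standard_normal(N)

# Step 1: brute-force search over r of x^T H (H^T H)^{-1} H^T x
def cost(r):
    Hr = H(r)
    return x @ Hr @ np.linalg.solve(Hr.T @ Hr, Hr.T @ x)

r_grid = np.linspace(0.5, 0.98, 500)
r_hat = r_grid[np.argmax([cost(r) for r in r_grid])]

# Step 2: linear LS for the amplitudes at r_hat
Hr = H(r_hat)
beta_hat = np.linalg.solve(Hr.T @ Hr, Hr.T @ x)
print("r_hat:", r_hat, " beta_hat:", beta_hat)
```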

The Bayesian Philosophy


Introduction Classical Approach : Assumes θ is unknown but deterministic. Some prior knowledge on θ cannot be used. Variance of the estimate may depend on θ. MVUE may not exist. In Monte-Carlo Simulations, we do M runs for each fixed θ and then compute sample mean and variance for each θ (no averaging over θ). Bayesian Approach : Assumes θ is random with a known prior PDF, p(θ). We estimate a realization of θ based on the available data. Variance of the estimate does not depend on θ. A Bayesian estimate always exists In Monte-Carlo simulations, we do M runs for randomly chosen θ and then we compute sample mean and variance over all θ values. (The Bayesian Philosophy)

Estimation Theory

Spring 2013

93 / 152

Mean Square Error (MSE)

Classical MSE : The classical MSE is a function of the unknown parameter θ and cannot be used for constructing estimators :

    mse(θˆ) = E{[θ − θˆ(x)]^2} = ∫ [θ − θˆ(x)]^2 p(x; θ) dx

Note that the expectation is with respect to the PDF of x.

Bayesian MSE : The Bayesian MSE is not a function of θ and can be minimized to find an estimator :

    Bmse(θˆ) = E{[θ − θˆ(x)]^2} = ∫∫ [θ − θˆ(x)]^2 p(x, θ) dx dθ

Note that the expectation is with respect to the joint PDF of x and θ.

Minimum Mean Square Error Estimator

Consider the estimation of A that minimizes the Bayesian MSE, where A is a random variable with uniform prior PDF p(A) = U[−A_0, A_0], independent of w[n] :

    Bmse(Aˆ) = ∫∫ [A − Aˆ]^2 p(x, A) dx dA = ∫ [ ∫ [A − Aˆ]^2 p(A|x) dA ] p(x) dx

where we used Bayes' theorem : p(x, A) = p(A|x) p(x). Since p(x) ≥ 0 for all x, we need only minimize the integral in brackets. We set its derivative to zero :

    ∂/∂Aˆ ∫ [A − Aˆ]^2 p(A|x) dA = −2 ∫ A p(A|x) dA + 2Aˆ ∫ p(A|x) dA

which results in

    Aˆ = ∫ A p(A|x) dA = E(A|x)

Bayesian MMSE Estimate : the conditional mean of A given the data x, i.e. the mean of the posterior PDF p(A|x).

Minimum Mean Square Error Estimator

How to compute the Bayesian MMSE estimate :

    Aˆ = E(A|x) = ∫ A p(A|x) dA

The posterior PDF can be computed using Bayes' rule :

    p(A|x) = p(x|A) p(A) / p(x) = p(x|A) p(A) / ∫ p(x|A) p(A) dA

p(x) is the marginal PDF defined as p(x) = ∫ p(x, A) dA. The integral in the denominator acts as a "normalization" of p(x|A) p(A) such that the integral of p(A|x) equals 1. p(x|A) has exactly the same form as p(x; A), which is used in classical estimation. The MMSE estimator always exists and can be computed by :

    Aˆ = ∫ A p(x|A) p(A) dA / ∫ p(x|A) p(A) dA

Minimum Mean Square Error Estimator

Example : For p(A) = U[−A_0, A_0] we have :

    Aˆ = [ (1/(2A_0)) ∫_{−A_0}^{A_0} A (1/(2πσ^2)^{N/2}) exp( −(1/(2σ^2)) Σ_{n=0}^{N−1}(x[n] − A)^2 ) dA ]
       / [ (1/(2A_0)) ∫_{−A_0}^{A_0} (1/(2πσ^2)^{N/2}) exp( −(1/(2σ^2)) Σ_{n=0}^{N−1}(x[n] − A)^2 ) dA ]

Before collecting the data, the mean of the prior PDF p(A) is the best estimate ; after collecting the data, the best estimate is the mean of the posterior PDF p(A|x). The choice of p(A) is crucial for the quality of the estimation. Only a Gaussian prior PDF leads to a closed-form estimator.

Conclusion : For an accurate estimator choose a prior PDF that can be physically justified. For a closed-form estimator choose a Gaussian prior PDF.

Properties of the Gaussian PDF

Theorem (Conditional PDF of a Bivariate Gaussian) : If x and y are distributed according to a bivariate Gaussian PDF,

    p(x, y) = (1/(2π √det(C))) exp( −(1/2) [x − E(x), y − E(y)] C^{−1} [x − E(x), y − E(y)]^T )

with covariance matrix

    C = [ var(x)       cov(x, y)
          cov(x, y)    var(y)   ]

then the conditional PDF p(y|x) is also Gaussian and :

    E(y|x) = E(y) + (cov(x, y)/var(x)) [x − E(x)]
    var(y|x) = var(y) − cov^2(x, y)/var(x)

Properties of the Gaussian PDF

Example (DC level in WGN with Gaussian prior PDF) : Consider the Bayesian model x[0] = A + w[0] (just one observation) with a prior PDF A ∼ N(µ_A, σ_A^2) independent of the noise. First we compute the covariance cov(A, x[0]) :

    cov(A, x[0]) = E{(A − µ_A)(x[0] − E(x[0]))} = E{(A − µ_A)(A + w[0] − µ_A)}
                 = E{(A − µ_A)^2 + (A − µ_A)w[0]} = var(A) + 0 = σ_A^2

The posterior PDF is also Gaussian with :

    Aˆ = µ_{A|x} = µ_A + (cov(A, x[0])/var(x[0])) [x[0] − µ_A] = µ_A + (σ_A^2/(σ^2 + σ_A^2)) (x[0] − µ_A)

    σ_{A|x}^2 = σ_A^2 − cov^2(A, x[0])/var(x[0]) = σ_A^2 ( 1 − σ_A^2/(σ^2 + σ_A^2) )

Properties of the Gaussian PDF

Theorem (Conditional PDF of a Multivariate Gaussian) : If x (of dimension k × 1) and y (of dimension l × 1) are jointly Gaussian with PDF

    p(x, y) = (1/((2π)^{(k+l)/2} √det(C))) exp( −(1/2) [x − E(x) ; y − E(y)]^T C^{−1} [x − E(x) ; y − E(y)] )

and covariance matrix

    C = [ C_xx   C_xy
          C_yx   C_yy ]

then the conditional PDF p(y|x) is also Gaussian and :

    E(y|x) = E(y) + C_yx C_xx^{−1} [x − E(x)]
    C_{y|x} = C_yy − C_yx C_xx^{−1} C_xy

Bayesian Linear Model Let the data be modeled as x = Hθ + w where θ is a p × 1 random vector with prior PDF N (µθ , Cθ ) and w is a noise vector with PDF N (0, Cw ). Since θ and w are independent and Gaussian, they are jointly Gaussian then the posterior PDF is also Gaussian. In order to find the MMSE estimator we should compute the covariance matrices : Cxx

= E {[x − E (x)][x − E (x)]T } = E {(Hθ + w − Hµθ )(Hθ + w − Hµθ )T } = E {(H(θ − µθ ) + w)(H(θ − µθ ) + w)T } = HE {(θ − µθ )(θ − µθ )T }HT + E (wwT ) = HCθ HT + Cw

Cθx

= E {(θ − µθ )[H(θ − µθ ) + w]T } = E {(θ − µθ )(θ − µθ )T }HT = Cθ HT

(The Bayesian Philosophy)

Estimation Theory

Spring 2013

101 / 152

Bayesian Linear Model

Theorem (Posterior PDF for the Bayesian General Linear Model) : If the observed data can be modeled as x = Hθ + w, where θ is a p × 1 random vector with prior PDF N(µ_θ, C_θ) and w is a noise vector with PDF N(0, C_w), then the posterior PDF p(θ|x) is Gaussian with mean

    E(θ|x) = µ_θ + C_θ H^T (H C_θ H^T + C_w)^{−1} (x − Hµ_θ)

and covariance

    C_{θ|x} = C_θ − C_θ H^T (H C_θ H^T + C_w)^{−1} H C_θ

Remark : In contrast to the classical general linear model, H need not be full rank.
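A minimal sketch of these posterior formulas in code (added for illustration, not part of the original slides; it assumes numpy, and the dimensions, prior and noise covariances are made up) :

```python
import numpy as np

# Posterior mean/covariance for a Bayesian linear model x = H*theta + w
# (illustrative sketch; dimensions and values are assumed examples).
rng = np.random.default_rng(9)
N, p = 30, 2
H = rng.standard_normal((N, p))
mu_theta = np.zeros(p)
C_theta = np.diag([1.0, 0.25])            # prior covariance
C_w = 0.5 * np.eye(N)                     # noise covariance

theta = rng.multivariate_normal(mu_theta, C_theta)
x = H @ theta + rng.multivariate_normal(np.zeros(N), C_w)

S = H @ C_theta @ H.T + C_w
K = C_theta @ H.T @ np.linalg.inv(S)
theta_mmse = mu_theta + K @ (x - H @ mu_theta)    # E(theta | x)
C_post = C_theta - K @ H @ C_theta                # C_{theta|x}
print("theta     :", theta)
print("theta_mmse:", theta_mmse)
```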

Bayesian Linear Model

Example (DC Level in WGN with Gaussian Prior PDF) :

    x[n] = A + w[n] ,  n = 0, . . . , N − 1 ,   A ∼ N(µ_A, σ_A^2) ,   w[n] ∼ N(0, σ^2)

We have the general Bayesian linear model x = HA + w with H = 1. Then

    E(A|x) = µ_A + σ_A^2 1^T (σ_A^2 11^T + σ^2 I)^{−1} (x − 1µ_A)
    var(A|x) = σ_A^2 − σ_A^2 1^T (σ_A^2 11^T + σ^2 I)^{−1} 1 σ_A^2

We use the Matrix Inversion Lemma to get :

    ( I + (σ_A^2/σ^2) 11^T )^{−1} = I − 11^T / (σ^2/σ_A^2 + 1^T 1) = I − 11^T / (σ^2/σ_A^2 + N)

which yields

    E(A|x) = µ_A + ( σ_A^2 / (σ_A^2 + σ^2/N) ) (x¯ − µ_A)    and    var(A|x) = (σ^2/N) σ_A^2 / (σ_A^2 + σ^2/N)

Nuisance Parameters Definition (Nuisance Parameter) Suppose that θ and α are unknown parameters but we are only interested in estimating θ. In this case α is called a nuisance parameter. In the classical approach we have to estimate both but in the Bayesian approach we can integrate it out. Note that in the Bayesian approach we can find p(θ|x) from p(θ, α|x) as a marginal PDF :  p(θ|x) = p(θ, α|x)dα We can also express it as p(x|θ)p(θ) p(θ|x) = " p(x|θ)p(θ)d θ

 where

p(x|θ) =

p(x|θ, α)p(α|θ)dα

Furthermore if α is independent of θ we have :  p(x|θ) = p(x|θ, α)p(α)dα (The Bayesian Philosophy)

Estimation Theory

Spring 2013

104 / 152

General Bayesian Estimators


General Bayesian Estimators

Risk Function : A general Bayesian estimator is obtained by minimizing the Bayes risk

    θˆ = arg min_θˆ R(θˆ)

where R(θˆ) = E{C(ε)} is the Bayes risk, C(ε) is a cost function and ε = θ − θˆ is the estimation error.

Three common risk functions :
1. Quadratic :    C(ε) = ε^2 ,          so  R(θˆ) = E{(θ − θˆ)^2}
2. Absolute :     C(ε) = |ε| ,          so  R(θˆ) = E{|θ − θˆ|}
3. Hit-or-Miss :  C(ε) = 0 if |ε| < δ, 1 if |ε| ≥ δ ,  so  R(θˆ) = E{C(ε)}

General Bayesian Estimators

The Bayes risk is

    R(θˆ) = E{C(ε)} = ∫∫ C(θ − θˆ) p(x, θ) dx dθ = ∫ g(θˆ) p(x) dx

where g(θˆ) = ∫ C(θ − θˆ) p(θ|x) dθ should be minimized for each x.

Quadratic : In this case R(θˆ) = E{(θ − θˆ)^2} = Bmse(θˆ), which leads to the MMSE estimator θˆ = E(θ|x), so θˆ is the mean of the posterior PDF p(θ|x).

General Bayesian Estimators

Absolute : In this case we have :

    g(θˆ) = ∫ |θ − θˆ| p(θ|x) dθ = ∫_{−∞}^{θˆ} (θˆ − θ) p(θ|x) dθ + ∫_{θˆ}^{∞} (θ − θˆ) p(θ|x) dθ

By setting the derivative of g(θˆ) equal to zero and using Leibniz's rule, we get :

    ∫_{−∞}^{θˆ} p(θ|x) dθ = ∫_{θˆ}^{∞} p(θ|x) dθ

Then θˆ is the median (area to the left = area to the right) of p(θ|x).

General Bayesian Estimators

Hit-or-Miss : In this case we have

    g(θˆ) = ∫_{−∞}^{θˆ−δ} 1 · p(θ|x) dθ + ∫_{θˆ+δ}^{∞} 1 · p(θ|x) dθ = 1 − ∫_{θˆ−δ}^{θˆ+δ} p(θ|x) dθ

For δ arbitrarily small, the optimal estimate is the location of the maximum of p(θ|x), i.e. the mode of the posterior PDF. This estimator is called the maximum a posteriori (MAP) estimator.

Remark : For a unimodal and symmetric posterior PDF (e.g. a Gaussian PDF), the mean, the mode and the median are the same.

Minimum Mean Square Error Estimators Extension to the vector parameter case : In general, like the scalar case, we can write :  i = 1, 2, . . . , p θˆi = E (θi |x) = θi p(θi |x)d θi Then p(θi |x) can be computed as a marginal conditional PDF. For example :      ˆ · · · p(θ|x)d θ2 · · · d θp d θ1 = θ1 p(θ|x)dθ θ1 = θ1 In vector form we have :   " p(θ|x)d θ θ 1 "  θ2 p(θ|x)d θ     θˆ =   = θp(θ|x)dθ = E (θ|x) ..   . " θp p(θ|x)d θ  ˆ Similarly Bmse(θi ) = [Cθ|x ]ii p(x)d x (General Bayesian Estimators)

Estimation Theory

Spring 2013

110 / 152

Properties of MMSE Estimators For Bayesian Linear Model, poor prior knowledge leads to MVU estimator. T −1 −1 T −1 θˆ = E (θ|x) = µθ + (C−1 θ + H Cw H) H Cw (x − Hµθ )

For no prior knowledge µθ → 0 and C−1 θ → 0, and therefore : −1 T −1 θˆ → [HT C−1 w H] H Cw x

Commutes over affine transformations. Suppose that α = Aθ + b then the MMSE estimator for α is α ˆ = E (α|x) = E (Aθ + b|x) = AE (θ|x) + b = Aθˆ + b Enjoys the additive property for independent data sets. Assume that θ, x1 , x2 are jointly Gaussian with x1 and x2 independent : −1 θˆ = E (θ) + Cθx1 C−1 x1 x1 [x1 − E (x1 )] + Cθx2 Cx2 x2 [x2 − E (x2 )]

(General Bayesian Estimators)

Estimation Theory

Spring 2013

111 / 152

Maximum A Posteriori Estimators In the MAP estimation approach we have : θˆ = arg max p(θ|x) θ

where p(θ|x) =

p(x|θ)p(θ) p(x)

An equivalent maximization is : θˆ = arg max p(x|θ)p(θ) θ

or θˆ = arg max[ln p(x|θ) + ln p(θ)] θ

If p(θ) is uniform or is approximately constant around the maximum of p(x|θ) (for large data length), we can remove p(θ) to obtain the Bayesian Maximum Likelihood : θˆ = arg max p(x|θ) θ

(General Bayesian Estimators)

Estimation Theory

Spring 2013

112 / 152

Maximum A Posteriori Estimators

Example (DC Level in WGN with Uniform Prior PDF) : The MMSE estimator cannot be obtained in explicit form due to the need to evaluate the integrals

    Aˆ = ∫_{−A_0}^{A_0} A exp( −(1/(2σ^2)) Σ_{n=0}^{N−1}(x[n] − A)^2 ) dA / ∫_{−A_0}^{A_0} exp( −(1/(2σ^2)) Σ_{n=0}^{N−1}(x[n] − A)^2 ) dA

The MAP estimator is given in closed form as :

    Aˆ = −A_0 if x¯ < −A_0 ;   x¯ if −A_0 ≤ x¯ ≤ A_0 ;   A_0 if x¯ > A_0

Remark : The main advantage of the MAP estimator is that for non jointly Gaussian PDFs it may lead to explicit solutions or less computational effort.
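A minimal sketch of this MAP estimator in code (added for illustration, not part of the original slides; it assumes numpy, and the prior bound, true level and noise are made up) :

```python
import numpy as np

# MAP estimator for a DC level in WGN with uniform prior U[-A0, A0]:
# the sample mean clipped to the prior support (illustrative sketch).
def map_dc_level(x, A0):
    return np.clip(np.mean(x), -A0, A0)

rng = np.random.default_rng(10)
A0, A_true, sigma, N = 1.0, 0.8, 2.0, 10
x = A_true + sigma * rng.standard_normal(N)
print("sample mean:", x.mean(), " MAP estimate:", map_dc_level(x, A0))
```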

Maximum A Posteriori Estimators Extension to the vector parameter case 

θˆ1 = arg max p(θ1 |x) = arg max θ1

θ1

 ···

p(θ|x)dθ2 . . . dθp

This needs integral evaluation of the marginal conditional PDFs. An alternative is the following vector MAP estimator that maximizes the joint conditional PDF : θˆ = arg max p(θ|x) = arg max p(x|θ)p(θ) θ

θ

This corresponds to a circular Hit-or-Miss cost function. In general, as N → ∞, the MAP estimator → the Bayesian MLE. If the posterior PDF is Gaussian, the mode is identical to the mean, therefore the MAP estimator is identical to the MMSE estimator. The invariance property in ML theory does not hold for the MAP estimator. (General Bayesian Estimators)

Estimation Theory

Spring 2013

114 / 152

Performance Description In the classical estimation the mean and the variance of the estimate (or its PDF) indicates the performance of the estimator. In the Bayesian approach the PDF of the estimate is different for each realization of θ. So a good estimator should perform well for every possible value of θ. The estimation error ε = θ − θˆ is a function of two random variables (θ and x). The mean and the variance of the estimation error, wrt two random variables, indicates the performance of the estimator. The mean value of the estimation error is zero. So the estimates are unbiased (in the Bayesian sense). ˆ = Ex,θ (θ − E (θ|x)) = Ex [Eθ|x (θ) − Eθ|x (θ|x)] = Ex (0) = 0 Ex,θ (θ − θ) The variance of the estimate is Bmse : ˆ 2 ] = Bmse(θ) ˆ var(ε) = Ex,θ (ε2 ) = Ex,θ [(θ − θ) (General Bayesian Estimators)

Estimation Theory

Spring 2013

115 / 152

Performance Description (Vector Parameter Case) The vector of the estimation error is ε = θ − θˆ and has a zero mean. Its covariance matrix is : Mθˆ = Ex,θ (εεT ) = Ex,θ {[θ − E (θ|x)][θ − E (θ|x)]T } & ' () = Ex Eθ|x [θ − E (θ|x)][θ − E (θ|x)]T = Ex (Cθ|x ) if x and θ are jointly Gaussian, we have : Mθˆ = Cθ|x = Cθθ − Cθx C−1 xx Cxθ since Cθ|x does not depend on x. For a Bayesian linear model we have : Mθˆ = Cθ|x = Cθ − Cθ HT (HCθ HT + Cw )−1 HCθ In this case ε is a linear transformation of x and θ and thus is Gaussian. Therefore : ε ∼ N (0, Mθˆ) (General Bayesian Estimators)

Estimation Theory

Spring 2013

116 / 152

Signal Processing Example

Deconvolution Problem : Estimate a signal transmitted through a channel with known impulse response :

    x[n] = Σ_{m=0}^{n_s − 1} h[n − m] s[m] + w[n] ,    n = 0, 1, . . . , N − 1

where s[n] is a WSS Gaussian process with known ACF and w[n] is WGN with variance σ^2. In matrix form, x = Hθ + w with θ = s = [s[0], s[1], . . . , s[n_s − 1]]^T and

    H = [ h[0]      0          · · ·  0
          h[1]      h[0]       · · ·  0
          ...
          h[N−1]    h[N−2]     · · ·  h[N−n_s] ]

The MMSE estimator is :

    sˆ = C_s H^T (H C_s H^T + σ^2 I)^{−1} x

where C_s is a symmetric Toeplitz matrix with [C_s]_{ij} = r_ss[i − j].
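A minimal sketch of this MMSE (Wiener) deconvolution in code (added for illustration, not part of the original slides; it assumes numpy, and the channel, signal ACF and sizes are made up) :

```python
import numpy as np

# MMSE deconvolution s_hat = Cs H^T (H Cs H^T + sigma^2 I)^{-1} x (illustrative sketch).
rng = np.random.default_rng(11)
ns, N, sigma = 50, 60, 0.1
h = np.array([1.0, 0.6, 0.3])                       # known channel impulse response

# Assumed AR(1)-like ACF r_ss[k] = a^|k| for the transmitted signal.
a = 0.9
idx = np.arange(ns)
Cs = a ** np.abs(np.subtract.outer(idx, idx))       # symmetric Toeplitz covariance

# Convolution matrix: H[n, m] = h[n - m] for 0 <= n - m < len(h).
H = np.zeros((N, ns))
for m in range(ns):
    for k, hk in enumerate(h):
        if m + k < N:
            H[m + k, m] = hk

s = rng.multivariate_normal(np.zeros(ns), Cs)
x = H @ s + sigma * rng.standard_normal(N)

s_hat = Cs @ H.T @ np.linalg.solve(H @ Cs @ H.T + sigma**2 * np.eye(N), x)
print("mean squared estimation error:", np.mean((s_hat - s) ** 2))
```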

Signal Processing Example Consider that H = I (no filtering), so the Bayesian model becomes x=s+w Classical Approach The MVU estimator is ˆs = (HT H)−1 HT x = x. Bayesian Approach The MMSE estimator is ˆs = Cs (Cs + σ 2 I)−1 x. Scalar case : We estimate s[0] based on x[0] : ˆs [0] =

η rss [0] x[0] x[0] = 2 rss [0] + σ η+1

where η = rss [0]/σ 2 is the SNR. Signal is a Realization of an Auto Regressive Process : Consider a first order AR process : s[n] = −a s[n − 1] + u[n], where u[n] is WGN with variance σu2 . The ACF of s is : rss [k] = (General Bayesian Estimators)

σu2 (−a)|k| 1 − a2



Estimation Theory

[Cs ]ij = rss [i − j] Spring 2013

118 / 152

Linear Bayesian Estimators


Introduction Problems with general Baysian estimators : difficult to determine in closed form, need intensive computations, involve multidimensional integration (MMSE) or multidimensional maximization (MAP), can be determined only under the jointly Gaussian assumption. What can we do if the joint PDF is not Gaussian or unknown ? Keep the MMSE criterion ; Restrict the estimator to be linear. This leads to the Linear MMSE Estimator : Which is a suboptimal estimator that can be easily implemented. It needs only the first and second moments of the joint PDF. It is analogous to BLUE in classical estimation. In practice, this estimator is termed Wiener filter. (Linear Bayesian Estimators)

Estimation Theory

Spring 2013

120 / 152

Linear MMSE Estimators (scalar case)

Goal : Estimate θ, given the data vector x. Assume that only the first two moments of the joint PDF of x and θ are available :

    mean [ E(θ) ; E(x) ]    and    covariance [ C_θθ  C_θx ; C_xθ  C_xx ]

LMMSE Estimator : Take the class of all affine estimators

    θˆ = Σ_{n=0}^{N−1} a_n x[n] + a_N = a^T x + a_N

where a = [a_0, a_1, . . . , a_{N−1}]^T, and minimize the Bayesian MSE : Bmse(θˆ) = E[(θ − θˆ)^2].

Computing a_N : Differentiate the Bmse with respect to a_N :

    ∂/∂a_N E[(θ − a^T x − a_N)^2] = −2 E(θ − a^T x − a_N)

Setting this equal to zero gives a_N = E(θ) − a^T E(x).

Linear MMSE Estimators (scalar case) Minimize the Bayesian MSE : ˆ = E [(θ − θ) ˆ 2 ] = E [(θ − aT x − E (θ) + aT E (x))2 ] Bmse(θ) ( ' T = E [a (x − E (x)) − (θ − E (θ))]2 = E [aT (x − E (x))(x − E (x))T a] − E [aT (x − E (x))(θ − E (θ))] −E [(θ − E (θ))(x − E (x))T a] + E [(θ − E (θ))2 ] = aT Cxx a − aT Cxθ − Cθx a + Cθθ This can be minimized by setting the gradient to zero : ˆ ∂Bmse(θ) = 2Cxx a − 2Cxθ = 0 ∂a T which results in a = C−1 xx Cxθ and leads to (Note that : Cθx = Cxθ ) : −1 T −1 θˆ = aT x + aN = CT xθ Cxx x + E (θ) − Cxθ Cxx E (x)

= E (θ) + Cθx C−1 xx (x − E (x)) (Linear Bayesian Estimators)

Estimation Theory

Spring 2013

122 / 152

Linear MMSE Estimators (scalar case) The minimum Bayesian MSE is obtained : ˆ = CT C−1 Cxx C−1 Cxθ − CT C−1 Cxθ Bmse(θ) xθ xx xx xθ xx −Cθx C−1 xx Cxθ + Cθθ

= Cθθ − Cθx C−1 xx Cxθ Remarks :

The results are identical to those of MMSE estimators for jointly Gaussian PDF. The LMMSE estimator relies on the correlation between random variables. If a parameter is uncorrelated with the data (but nonlinearly dependent), it cannot be estimated with an LMMSE estimator.

(Linear Bayesian Estimators)

Estimation Theory

Spring 2013

123 / 152

Linear MMSE Estimators (scalar case)

Example (DC Level in WGN with Uniform Prior PDF) : The data model is x[n] = A + w[n], n = 0, 1, . . . , N − 1, where A ∼ U[−A_0, A_0] is independent of w[n] (WGN with variance σ^2). We have E(A) = 0 and E(x) = 0. The covariances are :

    C_xx = E(xx^T) = E[(A1 + w)(A1 + w)^T] = E(A^2) 11^T + σ^2 I
    C_θx = E(Ax^T) = E[A(A1 + w)^T] = E(A^2) 1^T

Therefore the LMMSE estimator is :

    Aˆ = C_θx C_xx^{−1} x = σ_A^2 1^T (σ_A^2 11^T + σ^2 I)^{−1} x = (A_0^2/3) x¯ / (A_0^2/3 + σ^2/N)

where

    σ_A^2 = E(A^2) = ∫_{−A_0}^{A_0} A^2 (1/(2A_0)) dA = A_0^2/3

Linear MMSE Estimators (scalar case)

Example (DC Level in WGN with Uniform Prior PDF) : Comparison of the different Bayesian estimators :

MMSE :   Aˆ = ∫_{−A_0}^{A_0} A exp( −(1/(2σ^2)) Σ_{n=0}^{N−1}(x[n] − A)^2 ) dA / ∫_{−A_0}^{A_0} exp( −(1/(2σ^2)) Σ_{n=0}^{N−1}(x[n] − A)^2 ) dA

MAP :    Aˆ = −A_0 if x¯ < −A_0 ;   x¯ if −A_0 ≤ x¯ ≤ A_0 ;   A_0 if x¯ > A_0

LMMSE :  Aˆ = (A_0^2/3) x¯ / (A_0^2/3 + σ^2/N)

Geometrical Interpretation Vector space of random variables : The set of scalar zero mean random variables is a vector space. The zero length vector of the set is a RV with zero variance. For any real scalar a, ax is another zero mean RV in the set. For two RVs x and y , x + y = y + x is a RV in the set. For two RVs x and y , the inner product is : < x, y >= E (xy )  √ The length of x is defined as : x = < x, x > = E (x 2 ) Two RVs x and y are orthogonal if :

< x, y >= E (xy ) = 0.

For RVs, x1 , x2 , y and real numbers a1 and a2 we have : < a1 x1 + a2 x2 , y >= a1 < x1 , y > +a2 < x2 , y > E [(a1 x1 + a2 x2 )y ] = a1 E (x1 y ) + a2 E (x2 y ) The projection of y on x is : (Linear Bayesian Estimators)

< y, x > E (yx) x= x 2 x σx2

Estimation Theory

Spring 2013

126 / 152

Geometrical Interpretation The LMMSE estimator can be determined using the vector space viewpoint. We have : N−1  ˆ an x[n] θ= n=0

where aN is zero, because of zero mean assumption. θˆ belongs to the subspace spanned by x[0], x[1], . . . , x[N − 1]. θ is not in this subspace. A good estimator will minimize the MSE : ˆ 2 ] = 2 E [(θ − θ) where = θ − θˆ is the error vector. Clearly the length of the vector error is minimized when is orthogonal to the subspace spanned by x[0], x[1], . . . , x[N − 1] or to each data sample : & ) ˆ E (θ − θ)x[n] = 0 for n = 0, 1, . . . , N − 1 (Linear Bayesian Estimators)

Estimation Theory

Spring 2013

127 / 152

Geometrical Interpretation

The LMMSE estimator can be determined by solving the orthogonality equations :

    E{ (θ − Σ_{m=0}^{N−1} a_m x[m]) x[n] } = 0 ,    n = 0, 1, . . . , N − 1

or

    Σ_{m=0}^{N−1} a_m E(x[m] x[n]) = E(θ x[n]) ,    n = 0, 1, . . . , N − 1

In matrix form :

    [ E(x^2[0])        E(x[0]x[1])      · · ·  E(x[0]x[N−1])
      E(x[1]x[0])      E(x^2[1])        · · ·  E(x[1]x[N−1])
      ...
      E(x[N−1]x[0])    E(x[N−1]x[1])    · · ·  E(x^2[N−1]) ]  [a_0, a_1, . . . , a_{N−1}]^T = [E(θx[0]), E(θx[1]), . . . , E(θx[N−1])]^T

Therefore C_xx a = C_θx^T, i.e. a = C_xx^{−1} C_θx^T, and the LMMSE estimator is :

    θˆ = a^T x = C_θx C_xx^{−1} x

The vector LMMSE Estimator We want to estimate θ = [θ1 , θ2 , . . . , θp ]T with a linear estimator that minimize the Bayesian MSE for each element. θˆi =

N−1 

ain x[n] + aiN

  Bmse(θˆi ) = E (θi − θˆi )2

Minimize

n=0

Therefore :  θˆi = E (θi ) + Cθi x C−1 xx x − E (x) *+,- *+,- * +, 1×N N×N

i = 1, 2, . . . , p

N×1

The scalar LMMSE estimators can be combined into a vector estimator :  x − E (x) θˆ = E (θ) + Cθx C−1 xx * +, - *+,- *+,- * +, p×1

and similarly Bmse(θˆi ) = [M ˆ]ii θ

p×N N×N

where

Mθˆ = Cθθ − Cθx C−1 Cxθ *+,- *+,- *+,- xx *+,p×p

(Linear Bayesian Estimators)

N×1

Estimation Theory

p×p

p×N

N×p

Spring 2013

129 / 152

The vector LMMSE Estimator Theorem (Bayesian Gauss–Markov Theorem) If the data are described by the Bayesian linear model form x = Hθ + w where θ is a p × 1 random vector with mean E (θ) and covariance Cθθ , w is a noise vector with zero mean and covariance Cw and is uncorrelated with θ (the joint PDF of p(θ, w) is arbitrary), then the LMMSE estimator of θ is :  θˆ = E (θ) + Cθθ HT (HCθθ HT + Cw )−1 x − HE (θ) The performance of the estimator is measured by  = θ − θˆ whose mean is zero and whose covariance matrix is  −1 T −1 C = Mθˆ = Cθθ − Cθθ HT (HCθθ HT + Cw )−1 HCθθ = C−1 θθ + H Cw H (Linear Bayesian Estimators)


Sequential LMMSE Estimation

Objective : Given θ̂[n − 1] based on the data x[n − 1] = [x[0], . . . , x[n − 1]]^T, update the estimate θ̂[n] when the new sample x[n] arrives.

Example (DC Level in White Noise)
Assume that both A and w[n] have zero mean :
x[n] = A + w[n]   ⇒   Â[N − 1] = (σ_A² / (σ_A² + σ²/N)) x̄

Estimator Update :
Â[N] = Â[N − 1] + K[N] (x[N] − Â[N − 1])
where
K[N] = Bmse(Â[N − 1]) / (Bmse(Â[N − 1]) + σ²)

Minimum MSE Update :
Bmse(Â[N]) = (1 − K[N]) Bmse(Â[N − 1])
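A short Python sketch of this sequential update for the DC level (the variances are assumed values) :

```python
import numpy as np

# Sketch (assumed values): sequential LMMSE update for the DC level in
# white noise, A ~ N(0, sigma_A2), x[n] = A + w[n], var(w[n]) = sigma2.
rng = np.random.default_rng(2)
sigma_A2, sigma2 = 2.0, 1.0
A = rng.normal(0.0, np.sqrt(sigma_A2))

A_hat, M = 0.0, sigma_A2              # prior mean and prior Bmse
for n in range(50):
    x_n = A + rng.normal(0.0, np.sqrt(sigma2))
    K = M / (M + sigma2)              # gain K[n]
    A_hat = A_hat + K * (x_n - A_hat) # correction with the innovation
    M = (1 - K) * M                   # Bmse update
print(A_hat, A, M)
```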


Sequential LMMSE Estimation

Vector space view : If two observations are orthogonal, the LMMSE estimate of θ is the sum of the projection of θ on each observation.

1. Find the LMMSE estimator of A based on x[0], yielding Â[0] :
   Â[0] = (E(A x[0]) / E(x²[0])) x[0] = (σ_A² / (σ_A² + σ²)) x[0]
2. Find the LMMSE estimator of x[1] based on x[0], yielding x̂[1|0] :
   x̂[1|0] = (E(x[0]x[1]) / E(x²[0])) x[0] = (σ_A² / (σ_A² + σ²)) x[0]
3. Determine the innovation of the new data : x̃[1] = x[1] − x̂[1|0]. This error vector is orthogonal to x[0].
4. Add to Â[0] the LMMSE estimator of A based on the innovation (the projection of A on x̃[1]) :
   Â[1] = Â[0] + (E(A x̃[1]) / E(x̃²[1])) x̃[1] = Â[0] + K[1] (x[1] − x̂[1|0])


Sequential LMMSE Estimation

Basic Idea : Generate a sequence of orthogonal RVs, namely the innovations :
x̃[0] = x[0],  x̃[1] = x[1] − x̂[1|0],  x̃[2] = x[2] − x̂[2|0,1],  . . . ,  x̃[n] = x[n] − x̂[n|0,1,...,n−1]

Then, add the individual estimators to yield :
Â[N] = Σ_{n=0}^{N} K[n] x̃[n] = Â[N − 1] + K[N] x̃[N]   where   K[n] = E(A x̃[n]) / E(x̃²[n])

It can be shown that :
x̃[N] = x[N] − Â[N − 1]   and   K[N] = Bmse(Â[N − 1]) / (σ² + Bmse(Â[N − 1]))

and the minimum MSE can be updated as :
Bmse(Â[N]) = (1 − K[N]) Bmse(Â[N − 1])


General Sequential LMMSE Estimation

Consider the general Bayesian linear model x = Hθ + w, where w is an uncorrelated noise with the diagonal covariance matrix Cw. Let's define :
Cw[n] = diag(σ0², σ1², . . . , σn²)
H[n] = [ H[n − 1] ; h^T[n] ]   (the n × p matrix H[n − 1] stacked on the 1 × p row h^T[n])
x[n] = [x[0], x[1], . . . , x[n]]^T

Estimator Update :
θ̂[n] = θ̂[n − 1] + K[n] (x[n] − h^T[n] θ̂[n − 1])
where
K[n] = M[n − 1] h[n] / (σn² + h^T[n] M[n − 1] h[n])

Minimum MSE Matrix Update :
M[n] = (I − K[n] h^T[n]) M[n − 1]


General Sequential LMMSE Estimation

Remarks :
For the initialization of the sequential LMMSE estimator, we can use the prior information :
θ̂[−1] = E(θ),   M[−1] = C_θθ
For no prior knowledge about θ we can let C_θθ → ∞. Then we have the same form as the sequential LSE, although the approaches are fundamentally different.
No matrix inversion is required.
The gain factor K[n] weights confidence in the new data (measured by σn²) against all previous data (summarized by M[n − 1]). A sketch of the resulting recursion is given below.
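A Python sketch of the general sequential recursion, initialized with the prior as above (the regressors and noise variance are assumed for illustration) :

```python
import numpy as np

# Sketch (assumed setup): sequential LMMSE for x[n] = h[n]^T theta + w[n]
# with diagonal noise covariance, initialized with the prior information.
rng = np.random.default_rng(3)
p, N = 3, 200
theta_mean = np.zeros(p)
C_theta = np.eye(p)
theta = rng.multivariate_normal(theta_mean, C_theta)
sigma2 = 0.5                               # sigma_n^2 (constant here)

theta_hat, M = theta_mean.copy(), C_theta.copy()   # theta_hat[-1], M[-1]
for n in range(N):
    h = rng.normal(size=p)                 # regressor h[n]
    x_n = h @ theta + rng.normal(0.0, np.sqrt(sigma2))
    K = M @ h / (sigma2 + h @ M @ h)       # gain K[n], no matrix inversion
    theta_hat = theta_hat + K * (x_n - h @ theta_hat)
    M = (np.eye(p) - np.outer(K, h)) @ M   # MSE matrix update
print(theta_hat, theta)
```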


Wiener Filtering

Consider the signal model :
x[n] = s[n] + w[n],   n = 0, 1, . . . , N − 1
where the signal and noise are zero mean with known covariance matrices. There are three main problems concerning Wiener filters :

Filtering : Estimate θ = s[n] (scalar) based on the data set x = [x[0], x[1], . . . , x[n]]^T. The signal sample is estimated from the present and past data only.
Smoothing : Estimate θ = s = [s[0], s[1], . . . , s[N − 1]]^T (vector) based on the data set x = [x[0], x[1], . . . , x[N − 1]]^T. Each signal sample is estimated from the present, past and future data.
Prediction : Estimate θ = x[N − 1 + ℓ], for ℓ a positive integer, based on the data set x = [x[0], x[1], . . . , x[N − 1]]^T.

Remark : All these problems are solved using the LMMSE estimator θ̂ = C_θx Cxx^{−1} x, with minimum Bmse M_θ̂ = C_θθ − C_θx Cxx^{−1} C_xθ.


Wiener Filtering (Smoothing)

Estimate θ = s = [s[0], s[1], . . . , s[N − 1]]^T (vector) based on the data set x = [x[0], x[1], . . . , x[N − 1]]^T. Then
Cxx = Rxx = Rss + Rww
and
C_θx = E(s x^T) = E(s (s + w)^T) = Rss
Therefore :
ŝ = Csx Cxx^{−1} x = Rss (Rss + Rww)^{−1} x = W x

Filter Interpretation : The Wiener smoothing matrix can be interpreted as an FIR filter. Let's define :
W = [a0, a1, . . . , aN−1]^T
where a_n^T is the (n + 1)th row of W. Let's define also :
h_n = [h^(n)[0], h^(n)[1], . . . , h^(n)[N − 1]]^T
which is just the vector a_n when flipped upside down. Then
ŝ[n] = a_n^T x = Σ_{k=0}^{N−1} h^(n)[N − 1 − k] x[k]
This represents a time-varying, non-causal FIR filter.
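As an illustration, the following Python sketch builds the smoothing matrix W = Rss(Rss + Rww)^{−1} for an assumed AR(1) signal in white noise (the model values are not from the slides) :

```python
import numpy as np

# Sketch (assumed AR(1) signal in white noise): Wiener smoother
# s_hat = Rss (Rss + Rww)^{-1} x built from the model autocorrelations.
rng = np.random.default_rng(4)
N, a, sigma_u2, sigma_w2 = 100, 0.9, 1.0, 1.0
rss0 = sigma_u2 / (1 - a**2)                        # stationary AR(1) variance
k = np.arange(N)
Rss = rss0 * a ** np.abs(k[:, None] - k[None, :])   # Toeplitz E(s[m]s[n])
Rww = sigma_w2 * np.eye(N)

# one realization of the signal and the noisy data
s = np.zeros(N)
s[0] = rng.normal(0, np.sqrt(rss0))
for n in range(1, N):
    s[n] = a * s[n - 1] + rng.normal(0, np.sqrt(sigma_u2))
x = s + rng.normal(0, np.sqrt(sigma_w2), N)

W = Rss @ np.linalg.inv(Rss + Rww)   # smoothing matrix, rows are a_n^T
s_hat = W @ x
print(np.mean((s - s_hat) ** 2), np.mean((s - x) ** 2))  # smoother beats raw data
```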


Wiener Filtering (Filtering)

Estimate θ = s[n] based on the data set x = [x[0], x[1], . . . , x[n]]^T. We have again
Cxx = Rxx = Rss + Rww
and
C_θx = E(s[n] [x[0], x[1], . . . , x[n]]) = [rss[n], rss[n − 1], . . . , rss[0]] = (r'_ss)^T
Therefore :
ŝ[n] = (r'_ss)^T (Rss + Rww)^{−1} x = a^T x

Filter Interpretation : If we define
h_n = [h^(n)[0], h^(n)[1], . . . , h^(n)[n]]^T
which is just the vector a when flipped upside down, then :
ŝ[n] = a^T x = Σ_{k=0}^{n} h^(n)[n − k] x[k]
This represents a time-varying, causal FIR filter. The impulse response can be computed as :
(Rss + Rww) a = r'_ss   ⇒   (Rss + Rww) h_n = rss
where rss = [rss[0], rss[1], . . . , rss[n]]^T is the flipped version of r'_ss.
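A small Python sketch of the causal filter weights for an assumed AR(1) signal in white noise (the model values are illustrative) :

```python
import numpy as np

# Sketch (assumed AR(1) signal in white noise): causal Wiener filter weights a
# such that s_hat[n] = a^T [x[0], ..., x[n]]^T, from (Rss + Rww) a = r'_ss.
a_ar, sigma_u2, sigma_w2 = 0.9, 1.0, 1.0
rss0 = sigma_u2 / (1 - a_ar**2)

def wiener_filter_weights(n):
    """Weights a such that s_hat[n] = a @ x[0:n+1]."""
    k = np.arange(n + 1)
    Rss = rss0 * a_ar ** np.abs(k[:, None] - k[None, :])
    r_prime = rss0 * a_ar ** (n - k)      # [rss[n], rss[n-1], ..., rss[0]]
    return np.linalg.solve(Rss + sigma_w2 * np.eye(n + 1), r_prime)

print(wiener_filter_weights(4))  # weights concentrate on the most recent samples
```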


Wiener Filtering (Prediction)

Estimate θ = x[N − 1 + ℓ], for ℓ a positive integer, based on the data set x = [x[0], x[1], . . . , x[N − 1]]^T. We have again Cxx = Rxx and
C_θx = E(x[N − 1 + ℓ] [x[0], x[1], . . . , x[N − 1]]) = [rxx[N − 1 + ℓ], rxx[N − 2 + ℓ], . . . , rxx[ℓ]] = (r'_xx)^T
Therefore :
x̂[N − 1 + ℓ] = (r'_xx)^T Rxx^{−1} x = a^T x

Filter Interpretation : If we let again h[N − k] = a_k, we have
x̂[N − 1 + ℓ] = a^T x = Σ_{k=0}^{N−1} h[N − k] x[k]
This represents a time-invariant, causal FIR filter. The impulse response can be computed as :
Rxx h = rxx
where rxx is r'_xx flipped upside down. For ℓ = 1 these are the linear prediction equations of an autoregressive process.
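A Python sketch of the ℓ-step predictor for an assumed AR(1) observation process without additive noise; for ℓ = 1 the solution reduces (up to rounding) to x̂[N] = a x[N − 1], as expected :

```python
import numpy as np

# Sketch (assumed AR(1) observations, no additive noise): l-step Wiener
# predictor weights w such that x_hat[N-1+l] = w @ x, from Rxx w = r'_xx.
a, sigma_u2, N, l = 0.8, 1.0, 5, 1
rxx0 = sigma_u2 / (1 - a**2)
k = np.arange(N)
Rxx = rxx0 * a ** np.abs(k[:, None] - k[None, :])
r_prime = rxx0 * a ** (N - 1 + l - k)   # correlations of x[k] with x[N-1+l]

w = np.linalg.solve(Rxx, r_prime)       # x_hat[N-1+l] = w @ x
print(w)  # for l = 1, essentially only the last weight is nonzero (equal to a)
```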


Kalman Filters


Introduction

The Kalman filter is a generalization of the Wiener filter.
In the Wiener filter we estimate s[n] based on the noisy observation vector x[n], assuming that s[n] is a stationary random process with known mean and covariance matrix.
In the Kalman filter we assume that s[n] is a nonstationary random process whose mean and covariance matrix evolve according to a known dynamical model.
The Kalman filter is a sequential MMSE estimator of s[n]. If the signal and noise are jointly Gaussian, the Kalman filter is the optimal MMSE estimator; if not, it is the optimal LMMSE estimator.
The Kalman filter has many applications in control theory for state estimation.
It can be generalized to vector signals and noise (in contrast to the Wiener filter).


Dynamical Signal Models

Consider a first order dynamical model :
s[n] = a s[n − 1] + u[n],   n ≥ 0
where u[n] is WGN with variance σu², s[−1] ∼ N(µs, σs²), and s[−1] is independent of u[n] for all n ≥ 0. It can be shown that :
s[n] = a^{n+1} s[−1] + Σ_{k=0}^{n} a^k u[n − k]
E(s[n]) = a^{n+1} µs
and
Cs[m, n] = E[(s[m] − E(s[m]))(s[n] − E(s[n]))] = a^{m+n+2} σs² + σu² a^{m−n} Σ_{k=0}^{n} a^{2k}   for m ≥ n
and Cs[m, n] = Cs[n, m] for m < n.
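A Monte-Carlo check of these expressions (the model values are assumed for illustration) :

```python
import numpy as np

# Sketch (assumed values): Monte-Carlo check of the mean and variance of the
# first-order Gauss-Markov process s[n] = a s[n-1] + u[n].
rng = np.random.default_rng(5)
a, sigma_u2, mu_s, sigma_s2 = 0.95, 1.0, 2.0, 0.5
n_steps, n_runs = 30, 20000

s = rng.normal(mu_s, np.sqrt(sigma_s2), n_runs)   # s[-1] across realizations
for n in range(n_steps):
    s = a * s + rng.normal(0, np.sqrt(sigma_u2), n_runs)

n = n_steps - 1                                   # index of the last sample
mean_theory = a ** (n + 1) * mu_s
var_theory = a ** (2 * n + 2) * sigma_s2 + sigma_u2 * np.sum(a ** (2 * np.arange(n + 1)))
print(s.mean(), mean_theory)   # empirical vs. theoretical E(s[n])
print(s.var(), var_theory)     # empirical vs. theoretical Cs[n, n]
```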


Dynamical Signal Models

Theorem (Vector Gauss–Markov Model)
The Gauss–Markov model for a p × 1 vector signal s[n] is :
s[n] = A s[n − 1] + B u[n],   n ≥ 0
A is p × p, B is p × r, and all eigenvalues of A are less than 1 in magnitude. The r × 1 vector u[n] ∼ N(0, Q) and the initial condition is a p × 1 random vector s[−1] ∼ N(µs, Cs), independent of the u[n]'s. Then the signal process is Gaussian with mean E(s[n]) = A^{n+1} µs and covariance :
Cs[m, n] = A^{m+1} Cs (A^{n+1})^T + Σ_{k=m−n}^{m} A^k B Q B^T (A^{n−m+k})^T   for m ≥ n
and Cs[m, n] = Cs^T[n, m] for m < n. The covariance matrix is C[n] = Cs[n, n], and the mean and covariance propagation equations are :
E(s[n]) = A E(s[n − 1]),   C[n] = A C[n − 1] A^T + B Q B^T


Scalar Kalman Filter

Data model :
State equation :        s[n] = a s[n − 1] + u[n]
Observation equation :  x[n] = s[n] + w[n]

Assumptions :
u[n] is zero mean Gaussian noise with independent samples and E(u²[n]) = σu².
w[n] is zero mean Gaussian noise with independent samples and E(w²[n]) = σn² (time-varying variance).
s[−1] ∼ N(µs, σs²) (for simplicity we suppose that µs = 0).
s[−1], u[n] and w[n] are independent.

Objective : Develop a sequential MMSE estimator of s[n] based on the data X[n] = [x[0], x[1], . . . , x[n]]. This estimator is the mean of the posterior PDF :
ŝ[n|n] = E(s[n] | x[0], x[1], . . . , x[n])


Scalar Kalman Filter

To develop the equations of the Kalman filter we need the following properties :
Property 1 : For two uncorrelated jointly Gaussian data vectors, the MMSE estimator of θ (if it is zero mean) is given by :
θ̂ = E(θ | x1, x2) = E(θ | x1) + E(θ | x2)
Property 2 : If θ = θ1 + θ2, then the MMSE estimator of θ is :
θ̂ = E(θ | x) = E(θ1 + θ2 | x) = E(θ1 | x) + E(θ2 | x)
Basic Idea : Generate the innovation x̃[n] = x[n] − x̂[n|n − 1], which is uncorrelated with the previous samples X[n − 1]. Then use x̃[n] instead of x[n] for the estimation (X[n] is equivalent to {X[n − 1], x̃[n]}).


Scalar Kalman Filter

From Property 1, we have :
ŝ[n|n] = E(s[n] | X[n − 1], x̃[n]) = E(s[n] | X[n − 1]) + E(s[n] | x̃[n]) = ŝ[n|n − 1] + E(s[n] | x̃[n])

From Property 2, we have :
ŝ[n|n − 1] = E(a s[n − 1] + u[n] | X[n − 1]) = a E(s[n − 1] | X[n − 1]) + E(u[n] | X[n − 1]) = a ŝ[n − 1|n − 1]

The MMSE estimator of s[n] based on x̃[n] is :
E(s[n] | x̃[n]) = (E(s[n] x̃[n]) / E(x̃²[n])) x̃[n] = K[n] (x[n] − x̂[n|n − 1])
where x̂[n|n − 1] = ŝ[n|n − 1] + ŵ[n|n − 1] = ŝ[n|n − 1]. Finally we have :
ŝ[n|n] = a ŝ[n − 1|n − 1] + K[n] (x[n] − ŝ[n|n − 1])


Scalar State – Scalar Observation Kalman Filter

State Model :        s[n] = a s[n − 1] + u[n]
Observation Model :  x[n] = s[n] + w[n]

Prediction :
ŝ[n|n − 1] = a ŝ[n − 1|n − 1]
Minimum Prediction MSE :
M[n|n − 1] = a² M[n − 1|n − 1] + σu²
Kalman Gain :
K[n] = M[n|n − 1] / (σn² + M[n|n − 1])
Correction :
ŝ[n|n] = ŝ[n|n − 1] + K[n] (x[n] − ŝ[n|n − 1])
Minimum MSE :
M[n|n] = (1 − K[n]) M[n|n − 1]
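A Python sketch of this recursion on simulated data (the model values are assumed; the observation-noise variance is taken constant) :

```python
import numpy as np

# Sketch (assumed values): scalar Kalman filter for s[n] = a s[n-1] + u[n],
# x[n] = s[n] + w[n], with constant noise variances.
rng = np.random.default_rng(6)
a, sigma_u2, sigma_n2 = 0.95, 0.1, 1.0
N = 200

s_true, x = np.zeros(N), np.zeros(N)
s_prev = rng.normal(0.0, 1.0)                 # s[-1] ~ N(0, sigma_s2), mu_s = 0
for n in range(N):
    s_true[n] = a * s_prev + rng.normal(0, np.sqrt(sigma_u2))
    x[n] = s_true[n] + rng.normal(0, np.sqrt(sigma_n2))
    s_prev = s_true[n]

s_hat, M = 0.0, 1.0                           # s_hat[-1|-1] = mu_s, M[-1|-1] = sigma_s2
est = np.zeros(N)
for n in range(N):
    s_pred = a * s_hat                        # prediction
    M_pred = a**2 * M + sigma_u2              # minimum prediction MSE
    K = M_pred / (sigma_n2 + M_pred)          # Kalman gain
    s_hat = s_pred + K * (x[n] - s_pred)      # correction
    M = (1 - K) * M_pred                      # minimum MSE
    est[n] = s_hat
print(np.mean((s_true - est) ** 2), M)        # empirical MSE vs. steady-state M[n|n]
```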


Properties of the Kalman Filter

The Kalman filter is an extension of the sequential MMSE estimator, applied to time-varying parameters that are represented by a dynamical model.
No matrix inversion is required.
The Kalman filter is a time-varying linear filter.
The Kalman filter computes (and uses) its own performance measure, the Bayesian MSE M[n|n]. The prediction stage increases the error, while the correction stage decreases it.
The Kalman filter generates an uncorrelated sequence from the observations, so it can be viewed as a whitening filter.
The Kalman filter is optimal in the Gaussian case and is the optimal LMMSE estimator if the Gaussian assumption is not valid.
In the KF, M[n|n], M[n|n − 1] and K[n] do not depend on the measurements and can be computed off line.
At steady state (n → ∞) the Kalman filter becomes an LTI filter (M[n|n], M[n|n − 1] and K[n] become constant).


Vector State – Scalar Observation Kalman Filter

State Model :        s[n] = A s[n − 1] + B u[n]   with u[n] ∼ N(0, Q)
Observation Model :  x[n] = h^T[n] s[n] + w[n]    with s[−1] ∼ N(µs, Cs)

Prediction :
ŝ[n|n − 1] = A ŝ[n − 1|n − 1]
Minimum Prediction MSE (p × p) :
M[n|n − 1] = A M[n − 1|n − 1] A^T + B Q B^T
Kalman Gain (p × 1) :
K[n] = M[n|n − 1] h[n] / (σn² + h^T[n] M[n|n − 1] h[n])
Correction :
ŝ[n|n] = ŝ[n|n − 1] + K[n] (x[n] − h^T[n] ŝ[n|n − 1])
Minimum MSE Matrix (p × p) :
M[n|n] = (I − K[n] h^T[n]) M[n|n − 1]


Vector State – Vector Observation Kalman Filter

State Model :        s[n] = A s[n − 1] + B u[n]   with u[n] ∼ N(0, Q)
Observation Model :  x[n] = H[n] s[n] + w[n]      with w[n] ∼ N(0, C[n])

Prediction :
ŝ[n|n − 1] = A ŝ[n − 1|n − 1]
Minimum Prediction MSE (p × p) :
M[n|n − 1] = A M[n − 1|n − 1] A^T + B Q B^T
Kalman Gain Matrix (p × M) (a matrix inversion is needed!) :
K[n] = M[n|n − 1] H^T[n] (C[n] + H[n] M[n|n − 1] H^T[n])^{−1}
Correction :
ŝ[n|n] = ŝ[n|n − 1] + K[n] (x[n] − H[n] ŝ[n|n − 1])
Minimum MSE Matrix (p × p) :
M[n|n] = (I − K[n] H[n]) M[n|n − 1]
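A Python sketch of the vector filter on an assumed two-state model with two noisy measurements; the matrices and the constant observation-noise covariance C[n] = C are illustrative, not from the slides :

```python
import numpy as np

# Sketch (assumed model): vector state / vector observation Kalman filter
# implementing the equations above.
rng = np.random.default_rng(7)
A = np.array([[0.95, 0.10], [0.00, 0.90]])   # state transition (stable)
B = np.eye(2)
Q = np.diag([0.05, 0.05])                    # process noise covariance
H = np.array([[1.0, 0.0], [1.0, 1.0]])       # observation matrix H[n] (constant here)
C = np.diag([0.5, 0.5])                      # measurement noise covariance
N = 200

s = rng.multivariate_normal(np.zeros(2), np.eye(2))   # true s[-1]
s_hat, M = np.zeros(2), np.eye(2)                     # s_hat[-1|-1] = mu_s, M[-1|-1] = Cs
for n in range(N):
    s = A @ s + B @ rng.multivariate_normal(np.zeros(2), Q)
    x = H @ s + rng.multivariate_normal(np.zeros(2), C)

    s_pred = A @ s_hat                                       # prediction
    M_pred = A @ M @ A.T + B @ Q @ B.T                       # minimum prediction MSE
    K = M_pred @ H.T @ np.linalg.inv(C + H @ M_pred @ H.T)   # Kalman gain matrix
    s_hat = s_pred + K @ (x - H @ s_pred)                    # correction
    M = (np.eye(2) - K @ H) @ M_pred                         # minimum MSE matrix
print(s, s_hat, np.diag(M))
```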


Extended Kalman Filter

The extended Kalman filter is a sub-optimal approach used when the state and the observation equations are nonlinear.

Nonlinear data model : Consider the following nonlinear data model
s[n] = f(s[n − 1]) + B u[n]
x[n] = g(s[n]) + w[n]
where f(·) and g(·) are nonlinear (vector-to-vector) functions.

Model linearization : We linearize the model around the estimated value of s using a first order Taylor series :
f(s[n − 1]) ≈ f(ŝ[n − 1|n − 1]) + (∂f/∂s[n − 1]) (s[n − 1] − ŝ[n − 1|n − 1])
g(s[n]) ≈ g(ŝ[n|n − 1]) + (∂g/∂s[n]) (s[n] − ŝ[n|n − 1])
We denote the Jacobians by :
A[n − 1] = ∂f/∂s[n − 1] evaluated at ŝ[n − 1|n − 1],   H[n] = ∂g/∂s[n] evaluated at ŝ[n|n − 1]


Extended Kalman Filter

Now we use the linearized models :
s[n] = A[n − 1] s[n − 1] + B u[n] + f(ŝ[n − 1|n − 1]) − A[n − 1] ŝ[n − 1|n − 1]
x[n] = H[n] s[n] + w[n] + g(ŝ[n|n − 1]) − H[n] ŝ[n|n − 1]

There are two differences with the standard Kalman filter :
1. A is now time-varying.
2. Both equations have known terms added to them. This does not change the derivation of the Kalman filter.

The extended Kalman filter has exactly the same equations, with A replaced by A[n − 1]. A[n − 1] and H[n] must be computed at each sampling time, so the MSE matrices and the Kalman gain can no longer be computed off line.
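A scalar toy sketch of one EKF step; the nonlinearities f, g and the noise variances are assumed for illustration only :

```python
import numpy as np

# Sketch (assumed scalar toy model): one extended-Kalman-filter step for
# s[n] = f(s[n-1]) + u[n], x[n] = g(s[n]) + w[n], with the Jacobians
# A[n-1] = f'(s_hat[n-1|n-1]) and H[n] = g'(s_hat[n|n-1]).
f  = lambda s: 0.9 * s + 0.1 * np.sin(s)   # nonlinear state function (assumed)
df = lambda s: 0.9 + 0.1 * np.cos(s)       # its derivative
g  = lambda s: s**2                        # nonlinear observation (assumed)
dg = lambda s: 2.0 * s

def ekf_step(s_hat, M, x, sigma_u2=0.01, sigma_w2=0.1):
    A = df(s_hat)                          # A[n-1], evaluated at s_hat[n-1|n-1]
    s_pred = f(s_hat)                      # prediction
    M_pred = A * M * A + sigma_u2          # minimum prediction MSE
    H = dg(s_pred)                         # H[n], evaluated at s_hat[n|n-1]
    K = M_pred * H / (sigma_w2 + H * M_pred * H)   # Kalman gain
    s_new = s_pred + K * (x - g(s_pred))   # correction uses g(), not H s_pred
    M_new = (1 - K * H) * M_pred           # minimum MSE
    return s_new, M_new

s_hat, M = 1.0, 1.0
s_hat, M = ekf_step(s_hat, M, x=1.3)
print(s_hat, M)
```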
