A tutorial on Minimum Mean Square Error Estimation
September 21, 2015
Bingpeng Zhou and Qingchun Chen
[email protected],
[email protected] Key lab of information coding & transmission, Southwest Jiaotong University, Chengdu 610031, Sichuan Province, China. Abstract: In this tutorial, the parameter estimation problem and its various estimators in particular the minimum mean squared errors estimator are introduced and derived to provide an intuitive insight into their mechanisms.
1 Why Do We Do Parameter Estimation?

1.1 Introduction
Parameter estimation arises from the need to infer the desired control parameters of a 'black-box' system from its outputs (measurements). It is fundamental to many applications in wireless communication and signal processing, such as channel tracking, signal detection, decoding, image reconstruction, wireless localization, frequency offset estimation, etc.
1.2 The Goal
A typical parameter estimation problem is to estimate the unknown system parameter x ∈ R^D from a number of measurements {z_i : i = 1, …, M}, given the measurement model below,

z = f(x) + n,    (1)

where n is the additive measurement noise. The measurement function f(x) may be linear or nonlinear, and the additive noise n may or may not be Gaussian distributed.
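As a concrete illustration, the following Python sketch simulates noisy measurements from a nonlinear measurement function; the specific choice of f (range measurements to known anchors, as in wireless localization), the anchor positions and the noise level are illustrative assumptions, not taken from the text.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: x is a 2-D position, each measurement z_i is the
# distance to a known anchor plus zero-mean Gaussian noise (sigma assumed).
x_true = np.array([3.0, -1.0])                             # unknown parameter x
anchors = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])   # known anchor positions
sigma_n = 0.1                                              # assumed noise standard deviation

def f(x):
    """Nonlinear measurement function: ranges from x to all anchors."""
    return np.linalg.norm(anchors - x, axis=1)

z = f(x_true) + sigma_n * rng.standard_normal(len(anchors))  # z = f(x) + n
print(z)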
1.3 Discussion
There are a multitude of methods to estimate x from {z_i}, which can be roughly classified into statistics-based methods, e.g., maximum likelihood estimation (MLE), maximum a posteriori (MAP) estimation and minimum mean square error (MMSE) estimation, and statistics-free methods, e.g., (linear or nonlinear) least squares, best linear unbiased estimation (BLUE) and minimum variance unbiased (MVU) estimation. A statistics-based algorithm commonly provides an optimal parameter estimate in terms of minimum estimation error, while a statistics-free algorithm provides a simple and practical alternative when the statistical knowledge of the system is not available. Of course, no matter which type of algorithm (statistics-based or statistics-free) we use, unbiasedness and covariance are two important metrics for an estimator.
In addition, in some specific cases with regular properties (such as linearity, Gaussianity and unbiasedness), some of the statistics-based methods are equivalent to statistics-free ones; for example, maximum likelihood estimation coincides with least squares estimation for a linear and Gaussian system. More details are not included here.

According to how much statistical knowledge and which structural properties of the system are known, we obtain different types of statistics-based estimators. For example, if we know that the system measurement is linear and the measurement noise is a zero-mean Gaussian variable, i.e., z = Ax + n, where the linear coefficient matrix A ∈ R^{S×D} is given and n ∼ N(n|0, W), where W ∈ R^{S×S} is assumed to be the precision matrix, then we can estimate x using the MLE algorithm. Moreover, if the prior distribution p(x) of x is also given, then the linear and Gaussian MMSE algorithm can be used to estimate x. On the contrary, if no structural property and no statistical information are available, then the above estimators no longer apply, and one falls back to the MVU estimator (in other words, only the MVU estimator can be used in such a situation). The linear and Gaussian MMSE estimator, discussed in the following, is the case where the linear and Gaussian properties of the system are given.
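To make this discussion concrete, the sketch below simulates a linear and Gaussian system z = Ax + n and computes the MLE, which for this model coincides with weighted least squares; the dimensions, the matrix A and the noise level are assumed for illustration. The prior-based MMSE refinement is derived in Section 2.2.

import numpy as np

rng = np.random.default_rng(1)
S, D = 8, 3                                  # assumed dimensions
A = rng.standard_normal((S, D))              # known linear coefficient matrix
sigma_n = 0.3
W = np.eye(S) / sigma_n**2                   # noise precision matrix (noise covariance is W^{-1})
x_true = rng.standard_normal(D)

z = A @ x_true + sigma_n * rng.standard_normal(S)   # z = A x + n

# MLE for the linear Gaussian model = weighted least squares:
# x_MLE = (A^T W A)^{-1} A^T W z
x_mle = np.linalg.solve(A.T @ W @ A, A.T @ W @ z)
print("MLE estimate:", x_mle)
print("true x:      ", x_true)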
2 What is MMSE?

2.1 Definition
The MMSE method is the estimator that minimizes the mean squared error, and is therefore optimal in a statistical sense, given statistical information such as the prior p(x). The mean squared error (MSE) is defined (in a statistical sense) as

MSE = ∫_X p(x|z) (x̂ − x)⊤ (x̂ − x) dx,    (2)
where p(x|z) denotes the posterior distribution of x. Hence, the optimal MMSE estimator can be found by minimizing the MSE as follows,

x̂⋆_MMSE = arg min_{x̂} ∫_X p(x|z) (x̂ − x)⊤ (x̂ − x) dx.    (3)
By setting the associated derivative to zero, i.e.,

d/dx̂ ∫_X p(x|z) (x̂ − x)⊤ (x̂ − x) dx = 0,    (4)

the optimal MMSE estimator is derived as

x̂⋆_MMSE = ∫_X p(x|z) x dx.    (5)
Namely, the optimal MMSE estimator is exactly the posterior expectation of x. Generally, the posterior p(x|z) is obtained, based on Bayes' rule, as

p(x|z) = p(z|x) p(x) / p(z),    (6)
where p(z|x) is known as the likelihood function, p(x) is the prior and p(z) is the normalizing constant, which is given by

p(z) = ∫_X p(z|x) p(x) dx.    (7)
For a specific system or estimation problem, the remaining task of MMSE estimation is to specify these statistical densities, i.e., p(z|x) and p(x), by incorporating the prior knowledge and the structural properties that are known.
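As a sanity check of Eqs. (2)-(7), the following sketch evaluates the posterior of a scalar toy problem on a grid via Bayes' rule and confirms numerically that the posterior mean attains a smaller MSE than a perturbed estimate; the prior, the likelihood and the observed value are illustrative assumptions.

import numpy as np

# Scalar toy problem (all densities are assumed for illustration):
# prior       p(x)   = N(x | 0, 1)     (variance 1)
# likelihood  p(z|x) = N(z | 2x, 0.5)  (variance 0.5), observed z = 1.2
xs = np.linspace(-5, 5, 20001)
dx = xs[1] - xs[0]
z = 1.2

prior = np.exp(-0.5 * xs**2) / np.sqrt(2 * np.pi)
lik = np.exp(-0.5 * (z - 2 * xs)**2 / 0.5) / np.sqrt(2 * np.pi * 0.5)
post = prior * lik
post /= post.sum() * dx                   # Eqs. (6)-(7): normalize by p(z)

x_mmse = np.sum(xs * post) * dx           # Eq. (5): posterior mean

def mse(x_hat):
    # Eq. (2): posterior expected squared error of a candidate estimate
    return np.sum(post * (x_hat - xs)**2) * dx

print("posterior mean:", x_mmse)
print("MSE at posterior mean:   ", mse(x_mmse))
print("MSE at a shifted estimate:", mse(x_mmse + 0.3))   # strictly larger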
2.2 Derivation
In the following, we derive the optimal linear and Gaussian MMSE estimator, where the system is assumed to be linear and Gaussian, i.e.,

z = Ax + n,    (8)
where n ∼ N(n|0, W). In addition, the prior of the desired variable x is assumed to be Gaussian, i.e.,

p(x) = N(x|χ, Λ),    (9)
where χ and Λ are the associated expectation and precision matrix, respectively. Based on the above formulation, the likelihood can be cast as

p(z|x) = N(z|Ax, W).    (10)
Hence, based on the properties of the Gaussian distribution, the posterior is cast as

p(x|z) ∝ p(z|x) p(x)    (11)
       = N(z|Ax, W) N(x|χ, Λ)    (12)
       = N(x|A⁺z, W′) N(x|χ, Λ),    (13)
where (•)⁺ denotes the generalized inverse, and W′ is the equivalent precision, which is cast as (see Lemma 1)

W′ = A⊤ W A.    (14)
Furthermore, we can see from Eq. (13) that the posterior p(x|z) is cast as the product of two Gaussian distributions; hence, based on the properties of the Gaussian distribution, the posterior must also be Gaussian (see Lemma 2), namely,

p(x|z) = N(x|A⁺z, W′) N(x|χ, Λ)    (15)
       = N(x|χ♯, Λ♯),    (16)
where χ♯ is the posterior expectation and Λ♯ is the posterior precision, which are given by

χ♯ = (Λ♯)⁻¹ (W′ A⁺ z + Λχ),    (17)
Λ♯ = W′ + Λ.    (18)

By substituting Eq. (14) into (17) and (18), these can be further derived as

χ♯ = (A⊤ W A + Λ)⁻¹ (A⊤ W z + Λχ),    (19)
Λ♯ = A⊤ W A + Λ.    (20)
As aforementioned, the optimal (linear and Gaussian) MMSE estimate is the posterior expectation (see Eq. (5)); hence we have

x̂⋆_MMSE = χ♯ = (A⊤ W A + Λ)⁻¹ (A⊤ W z + Λχ).    (21)

Eq. (21) presents the generalized linear and Gaussian MMSE estimator, and Λ♯ is the corresponding estimation precision. For nonlinear or non-Gaussian cases, there are numerous approximation methods for obtaining the MMSE estimate, e.g., variational Bayesian inference, importance sampling-based approximation, sigma-point approximation (i.e., the unscented transformation), Laplace approximation and linearization, etc.
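A minimal numerical sketch of Eqs. (19)-(21) is given below; the dimensions, the matrix A, the noise precision W and the prior parameters χ, Λ are assumed for illustration. It draws x from its prior, forms the MMSE estimate, and compares it with the prior-free MLE; when the prior is informative, the MMSE estimate is typically closer to the true x.

import numpy as np

rng = np.random.default_rng(2)
S, D = 10, 4                          # assumed dimensions
A = rng.standard_normal((S, D))
sigma_n = 0.5
W = np.eye(S) / sigma_n**2            # measurement-noise precision
chi = np.zeros(D)                     # prior mean χ
Lam = 4.0 * np.eye(D)                 # prior precision Λ (prior covariance 0.25 I)

# Draw x from its prior and simulate z = A x + n
x_true = rng.multivariate_normal(chi, np.linalg.inv(Lam))
z = A @ x_true + sigma_n * rng.standard_normal(S)

# Eqs. (19)-(21): posterior precision and posterior mean (= MMSE estimate)
Lam_post = A.T @ W @ A + Lam                                   # Λ♯ = AᵀWA + Λ
x_mmse = np.linalg.solve(Lam_post, A.T @ W @ z + Lam @ chi)    # χ♯ = x̂⋆_MMSE

# Prior-free MLE (weighted least squares), for comparison
x_mle = np.linalg.solve(A.T @ W @ A, A.T @ W @ z)
print("MMSE error:", np.linalg.norm(x_mmse - x_true))
print("MLE  error:", np.linalg.norm(x_mle - x_true))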
2.3 Specific Case in Wireless Communications
In the context of wireless communications (WC), the prior mean of x is commonly zero (e.g., the mean of the channel or of the pilot symbols). Suppose the prior expectation of x is zero, i.e., χ = 0; then the optimal (linear and Gaussian) MMSE estimator can be further specified as

x̂⋆_MMSE = (A⊤ W A + Λ)⁻¹ A⊤ W z.    (22)
An alternative expression is given in terms of the covariance matrices (instead of the precision matrices). The covariance matrix is defined as the inverse of the associated precision matrix. Hence, we define the covariance Σn of the measurement noise n and the prior covariance Σx of the desired variable x as follows,

Σn = W⁻¹,    (23)
Σx = Λ⁻¹.    (24)

As a result, the above optimal MMSE estimator can also be cast as

x̂⋆_MMSE = (A⊤ Σn⁻¹ A + Σx⁻¹)⁻¹ A⊤ Σn⁻¹ z    (25)
         = (A + Σn (A⊤)⁻¹ Σx⁻¹)⁻¹ z    (26)
         ≈ (A⊤ A + Σn Σx⁻¹)⁻¹ A⊤ z    (27)
         = (A⊤ A + γ⁻¹ I)⁻¹ A⊤ z,    (28)
where γ denotes the SNR at the transmitter; since the background noise and the transmitted signal are both zero-mean, their covariances Σn and Σx are equal to the corresponding powers, so that Σn Σx⁻¹ = γ⁻¹ I. If the signal power is normalized, then we have γ ∝ σn⁻², where σn² is the noise power. Hence, the linear MMSE estimator in WC is finally specified as

x̂⋆_MMSE = (A⊤ A + σn² I)⁻¹ A⊤ z,    (29)
which is the expression commonly employed in wireless communication applications, such as channel estimation and signal detection.
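The following sketch applies Eq. (29) as a linear MMSE detector and contrasts it with the zero-forcing (plain least squares) solution (A⊤A)⁻¹A⊤z; the channel matrix, the SNR and the unit-power symbol model are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(3)
S, D = 8, 4
A = rng.standard_normal((S, D))           # assumed known channel / measurement matrix
x = rng.choice([-1.0, 1.0], size=D)       # normalized (unit-power) symbols
sigma_n = 0.7
z = A @ x + sigma_n * rng.standard_normal(S)

# Linear MMSE detector, Eq. (29): (AᵀA + σn² I)⁻¹ Aᵀ z
x_mmse = np.linalg.solve(A.T @ A + sigma_n**2 * np.eye(D), A.T @ z)

# Zero-forcing detector for comparison: (AᵀA)⁻¹ Aᵀ z
x_zf = np.linalg.solve(A.T @ A, A.T @ z)

print("transmitted:", x)
print("MMSE output:", np.round(x_mmse, 2))
print("ZF   output:", np.round(x_zf, 2))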
3 Conclusions
The MMSE estimator is optimal in terms of minimum estimation error, and it is exactly the posterior expectation of the desired variable. For the linear and Gaussian case, the final estimator and the associated estimation precision are given by Eqs. (19) and (20), respectively.
4 Useful Knowledge
Some useful conclusions regarding the Gaussian distribution are summarized as follows.

Lemma 1 (Equivalent density of the likelihood function). Given the likelihood function p(z|x) = N(z|Ax, W) of a linear and Gaussian system z = Ax + n associated with the objective variable x, the equivalent distribution of x (induced by the measurement z) is cast as p_{z|x}(x) = N(x|A⁺z, W′), where the equivalent precision is cast as

W′ = A⊤ W A.    (30)
The proof is not included here.
Remark 1. This equivalent distribution p_{z|x}(x) reflects the distributional information about x obtained from the measurements, and it retains all the statistical information about x contained in the likelihood density.

Lemma 2 (Product of two Gaussian densities). Given two independent Gaussian densities of x, i.e., N(x|χ1, Λ1) and N(x|χ2, Λ2), their product is (up to a normalizing constant) also a Gaussian distribution, denoted as N(x|χ3, Λ3), namely,

p(x) ∝ N(x|χ1, Λ1) N(x|χ2, Λ2) ∝ N(x|χ3, Λ3),    (31)

where the associated expectation and precision are given by

χ3 = Λ3⁻¹ (Λ1 χ1 + Λ2 χ2),    (32)
Λ3 = Λ1 + Λ2.    (33)
The proof follows from a straightforward (though somewhat tedious) expansion and completion of the square in the exponent; it is not included here.
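As a quick numerical check of Lemma 2 (the chosen means and precisions are arbitrary), the sketch below multiplies two scalar Gaussian densities on a grid, renormalizes the product, and compares its mean and precision with Eqs. (32)-(33).

import numpy as np

# Two scalar Gaussians with assumed means and precisions
chi1, lam1 = 1.0, 2.0      # N(x | χ1, Λ1), precision Λ1 (variance 1/Λ1)
chi2, lam2 = -0.5, 3.0     # N(x | χ2, Λ2)

xs = np.linspace(-6, 6, 40001)
dx = xs[1] - xs[0]
g1 = np.sqrt(lam1 / (2 * np.pi)) * np.exp(-0.5 * lam1 * (xs - chi1)**2)
g2 = np.sqrt(lam2 / (2 * np.pi)) * np.exp(-0.5 * lam2 * (xs - chi2)**2)

prod = g1 * g2
prod /= prod.sum() * dx                     # renormalize the product density

mean_num = np.sum(xs * prod) * dx           # numerical mean of the product
var_num = np.sum((xs - mean_num)**2 * prod) * dx

# Eqs. (32)-(33)
lam3 = lam1 + lam2
chi3 = (lam1 * chi1 + lam2 * chi2) / lam3
print("mean:      numerical", mean_num, " closed form", chi3)
print("precision: numerical", 1.0 / var_num, " closed form", lam3)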