Measure-Transformed Quasi Maximum Likelihood Estimation

arXiv:1511.00237v1 [stat.ME] 1 Nov 2015
Koby Todros and Alfred O. Hero
Abstract: In this paper the Gaussian quasi maximum likelihood estimator (GQMLE) is generalized by applying a transform to the probability distribution of the data. The proposed estimator, called measure-transformed GQMLE (MT-GQMLE), minimizes the empirical Kullback-Leibler divergence between a transformed probability distribution of the data and a hypothesized Gaussian probability measure. By judicious choice of the transform we show that, unlike the GQMLE, the proposed estimator can gain sensitivity to higher-order statistical moments and resilience to outliers, leading to significant mitigation of the model mismatch effect on the estimates. Under some mild regularity conditions we show that the MT-GQMLE is consistent, asymptotically normal and unbiased. Furthermore, we derive a necessary and sufficient condition for asymptotic efficiency. A data-driven procedure for optimal selection of the measure transformation parameters is developed that minimizes an empirical estimate of the asymptotic mean-squared-error. The MT-GQMLE is applied to a source localization problem in a simulation example that illustrates its sensitivity to higher-order statistical moments and resilience to outliers.

Index Terms: Higher-order statistics, probability measure transform, quasi maximum likelihood estimation, robust statistics, source localization.
(This work was partially supported by United States-Israel Binational Science Foundation grant 2014334 and United States MURI grant W911NF-15-1-0479.)

I. INTRODUCTION

Classical multivariate estimation [1], [2] deals with the problem of estimating a deterministic vector parameter using a sequence of multivariate samples from an underlying probability distribution. When the probability distribution is known to lie in a specified parametric family of probability measures,
the maximum likelihood estimator (MLE) [3], [4] can be implemented. In many practical scenarios an accurate parametric model is not available and alternatives to the MLE become attractive. A popular alternative to the MLE is the quasi maximum likelihood estimator (QMLE) [5], [6]. The QMLE minimizes the empirical Kullback-Leibler divergence [7] to the underlying probability distribution of the data over a misspecified parametric family of hypothesized probability measures. When the hypothesized parametric family is the class of Gaussian probability distributions, the well established Gaussian QMLE (GQMLE) [5], [8]-[13] is obtained, which reduces to fitting the mean vector and covariance matrix of the underlying parametric distribution to the sample mean vector (SMV) and sample covariance matrix (SCM) of the data. The GQMLE has gained popularity due to its low implementation complexity, simplicity of performance analysis, and computational advantages that are analogous to those of the first and second-order methods of moments [3], [4], [14].

Despite the possible model mismatch relative to the Gaussian distribution, the GQMLE is a consistent estimator of the true parameters of interest when the mean vector and covariance matrix are correctly specified [5], [10], [11]. However, in some circumstances, such as for certain types of non-Gaussian data, deviation from normality inflicts poor estimation performance on the GQMLE even though it may be consistent. This can occur when the mean vector and covariance matrix are weakly dependent on the parameter to be estimated and stronger dependency is present in higher-order cumulants, or in the case of heavy-tailed data, when the non-robust SMV and SCM provide poor estimates in the presence of outliers.

In order to overcome this limitation, several non-Gaussian QMLEs have been proposed in the econometric and signal processing literature that assume more complex distributional models [15]-[21]. Although non-Gaussian QMLEs can be successful in reducing the effect of model mismatch, they may suffer from several drawbacks. First, unlike the GQMLE, they may have cumbersome implementations with increased computational and sample complexities, and their performance analysis may not be tractable. Second, to be consistent and asymptotically normal they may require considerably more restrictive regularity conditions as compared to the GQMLE [11], [18], [22], [23]. Furthermore, when the hypothesized parametric family largely deviates from normality they may have low asymptotic relative efficiency [24] w.r.t. the MLE under a nominal Gaussian distribution of the data.

In this paper, a new generalization of the GQMLE is proposed that applies Gaussian quasi maximum likelihood estimation under a transformed probability distribution of the data. Under the proposed generalization we show that new estimators can be obtained that can gain sensitivity to higher-order statistical moments, resilience to outliers, and yet have the computational advantages of the first and second-order methods of moments. This generalization, called measure-transformed GQMLE (MT-GQMLE), is inspired
by the measure transformation approach that was recently applied to canonical correlation analysis [25], [26], and multiple signal classification [27], [28]. The proposed transform is structured by a non-negative function, called the MT-function, and maps the true probability distribution into a set of new probability measures on the observation space. By modifying the MT-function, classes of measure transformations can be obtained that have different useful properties that mitigate the effect of model mismatch on the estimates. Under the proposed transform we define the measure-transformed (MT) mean vector and covariance matrix, derive their strongly consistent estimates, and establish their sensitivity to higher-order statistical moments and resilience to outliers.

The MT-GQMLE minimizes the empirical Kullback-Leibler divergence to the transformed probability distribution of the data over a family of hypothesized Gaussian probability measures. This hypothesized class of Gaussian distributions is characterized by the MT-mean vectors and MT-covariance matrices corresponding to the true underlying parametric family of transformed probability measures. Under some mild regularity conditions we show that the MT-GQMLE is consistent, asymptotically normal and unbiased. Additionally, a necessary and sufficient condition for asymptotic efficiency is derived. Furthermore, a data-driven procedure for optimal selection of the MT-function within some parametric class of functions is developed that minimizes an empirical estimate of the asymptotic mean-squared-error (MSE).

We illustrate the MT-GQMLE for the problem of source localization in the presence of heavy-tailed spherically contoured noise. By specifying the MT-function within the family of zero-centered Gaussian-shaped functions parameterized by a scale parameter, we show that the MT-GQMLE is sensitive to higher-order statistical moments and resilient to outliers. More specifically, we show that the proposed MT-GQMLE outperforms the non-robust GQMLE [29] and attains MSE performance significantly closer to that obtained by the MLE, which, unlike the proposed estimator, requires complete knowledge of the likelihood function.

The paper is organized as follows. In Section II, the GQMLE is reviewed. In Section III, a transformation on the probability distribution of the data is developed. In Section IV we use this transformation to construct the MT-GQMLE. The proposed estimator is applied to a source localization problem in Section V. In Section VI, the main points of this contribution are summarized. The proofs of the propositions and theorems stated throughout the paper are given in the Appendix.
II. THE GAUSSIAN QUASI MAXIMUM LIKELIHOOD ESTIMATOR: REVIEW
A. Preliminaries

We define the measure space $(\mathcal{X}, \mathcal{S}_{\mathcal{X}}, P_{X;\theta_0})$, where $\mathcal{X} \subseteq \mathbb{C}^p$ is the observation space of a complex-valued random vector $\mathbf{X}$, $\mathcal{S}_{\mathcal{X}}$ is a $\sigma$-algebra over $\mathcal{X}$, and $P_{X;\theta_0}$ is a probability measure that belongs to some unknown parametric family of probability measures

$$\mathcal{P} \triangleq \{P_{X;\theta} : \theta \in \Theta \subseteq \mathbb{R}^m\} \quad (1)$$

on $\mathcal{S}_{\mathcal{X}}$. The vector $\theta_0$ denotes a fixed unknown parameter value to be estimated and the vector $\theta$ is a free parameter that indexes the parameter space $\Theta$. The vector $\theta_0$ will be called the true vector parameter and $P_{X;\theta_0}$ will be called the true probability distribution of $\mathbf{X}$. It is assumed that the family $\{P_{X;\theta} : \theta \in \Theta\}$ is absolutely continuous w.r.t. a dominating $\sigma$-finite measure $\rho$ on $\mathcal{S}_{\mathcal{X}}$, such that the Radon-Nikodym derivative [30]

$$f(\mathbf{x};\theta) \triangleq \frac{dP_{X;\theta}(\mathbf{x})}{d\rho(\mathbf{x})} \quad (2)$$
exists for all $\theta \in \Theta$. The function $f(\mathbf{x};\theta)$ is called the likelihood function of $\theta$ observed by the vector $\mathbf{x} \in \mathcal{X}$. Let $g : \mathcal{X} \to \mathbb{C}$ denote an integrable scalar function on $\mathcal{X}$. The expectation of $g(\mathbf{X})$ under $P_{X;\theta}$ is defined as

$$\mathrm{E}[g(\mathbf{X}); P_{X;\theta}] \triangleq \int_{\mathcal{X}} g(\mathbf{x})\, dP_{X;\theta}(\mathbf{x}). \quad (3)$$
The empirical probability measure $\hat{P}_X$ given a sequence of samples $\mathbf{X}_n$, $n = 1,\ldots,N$ from $P_{X;\theta}$ is specified by

$$\hat{P}_X(A) = \frac{1}{N}\sum_{n=1}^{N}\delta_{\mathbf{X}_n}(A), \quad (4)$$

where $A \in \mathcal{S}_{\mathcal{X}}$, and $\delta_{\mathbf{X}_n}(\cdot)$ is the Dirac probability measure at $\mathbf{X}_n$ [30].

B. The Gaussian QMLE

Given a sequence of samples from $P_{X;\theta_0}$, the GQMLE of $\theta_0$ minimizes the empirical Kullback-Leibler divergence (KLD) between the probability measure $P_{X;\theta_0}$ and a hypothesized complex circular Gaussian probability distribution [31] $\Phi_{X;\theta}$ that is characterized by the mean vector $\mu_X(\theta) \triangleq \mathrm{E}[\mathbf{X}; P_{X;\theta}]$ and the covariance matrix $\Sigma_X(\theta) \triangleq \mathrm{cov}[\mathbf{X}; P_{X;\theta}]$. The KLD between $P_{X;\theta_0}$ and $\Phi_{X;\theta}$ is defined as [7]:

$$D_{\mathrm{KL}}[P_{X;\theta_0}\|\Phi_{X;\theta}] \triangleq \mathrm{E}\left[\log\frac{f(\mathbf{X};\theta_0)}{\phi(\mathbf{X};\theta)}; P_{X;\theta_0}\right], \quad (5)$$
where $f(\mathbf{x};\theta_0)$ and

$$\phi(\mathbf{x};\theta) \triangleq \det^{-1}[\pi\Sigma_X(\theta)]\exp\left(-(\mathbf{x}-\mu_X(\theta))^H\Sigma_X^{-1}(\theta)(\mathbf{x}-\mu_X(\theta))\right) \quad (6)$$
are the density functions associated with $P_{X;\theta_0}$ and $\Phi_{X;\theta}$, respectively, w.r.t. the dominating $\sigma$-finite measure $\rho$ on $\mathcal{S}_{\mathcal{X}}$. An empirical estimate of (5) given a sequence of samples $\mathbf{X}_n$, $n = 1,\ldots,N$ from $P_{X;\theta_0}$ is defined as:

$$\hat{D}_{\mathrm{KL}}[P_{X;\theta_0}\|\Phi_{X;\theta}] \triangleq \frac{1}{N}\sum_{n=1}^{N}\log\frac{f(\mathbf{X}_n;\theta_0)}{\phi(\mathbf{X}_n;\theta)}. \quad (7)$$
The GQMLE of $\theta_0$ is obtained by minimization of (7) w.r.t. $\theta$. Since the density $f(\cdot;\theta_0)$ is $\theta$-independent, this minimization is equivalent to maximization of the term $\frac{1}{N}\sum_{n=1}^{N}\log\phi(\mathbf{X}_n;\theta)$, which by (6) amounts to maximization of the following objective function:

$$J(\theta) \triangleq -D_{\mathrm{LD}}\left[\hat{\Sigma}_X\|\Sigma_X(\theta)\right] - \left\|\hat{\mu}_X - \mu_X(\theta)\right\|^2_{\Sigma_X^{-1}(\theta)}, \quad (8)$$

where $D_{\mathrm{LD}}[\mathbf{A}\|\mathbf{B}] \triangleq \mathrm{tr}(\mathbf{A}\mathbf{B}^{-1}) - \log\det(\mathbf{A}\mathbf{B}^{-1}) - p$ is the log-determinant divergence [32] between positive definite matrices $\mathbf{A},\mathbf{B} \in \mathbb{C}^{p\times p}$, $\|\mathbf{a}\|_{\mathbf{C}} \triangleq \sqrt{\mathbf{a}^H\mathbf{C}\mathbf{a}}$ denotes the weighted Euclidean norm of a vector $\mathbf{a} \in \mathbb{C}^p$ with positive-definite weighting matrix $\mathbf{C} \in \mathbb{C}^{p\times p}$, and $\hat{\mu}_X \triangleq \frac{1}{N}\sum_{n=1}^{N}\mathbf{X}_n$ and $\hat{\Sigma}_X \triangleq \frac{1}{N}\sum_{n=1}^{N}(\mathbf{X}_n - \hat{\mu}_X)(\mathbf{X}_n - \hat{\mu}_X)^H$ denote the standard SMV and SCM. Hence, the GQMLE of $\theta_0$ is given by:

$$\hat{\theta} = \arg\max_{\theta\in\Theta} J(\theta). \quad (9)$$
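As a concrete illustration (ours, not from the paper), the following NumPy sketch evaluates the objective $J(\theta)$ of (8) from samples; mu_model and Sigma_model are hypothetical placeholders for the parametric maps $\mu_X(\theta)$ and $\Sigma_X(\theta)$, and the maximization (9) can then be approximated by a grid search.

```python
import numpy as np

def gqmle_objective(X, mu_model, Sigma_model, theta):
    """Evaluate the GQMLE objective J(theta) of Eq. (8) from samples X (N x p)."""
    N, p = X.shape
    mu_hat = X.mean(axis=0)                      # sample mean vector (SMV)
    Xc = X - mu_hat
    Sigma_hat = (Xc.conj().T @ Xc) / N           # sample covariance matrix (SCM)
    mu, Sigma = mu_model(theta), Sigma_model(theta)
    A = Sigma_hat @ np.linalg.inv(Sigma)
    d_ld = np.real(np.trace(A)) - np.linalg.slogdet(A)[1] - p   # D_LD[SCM || Sigma(theta)]
    diff = mu_hat - mu
    return -d_ld - np.real(diff.conj() @ np.linalg.solve(Sigma, diff))

# Eq. (9), approximated over a finite grid of candidate parameter values:
# theta_hat = max(theta_grid, key=lambda th: gqmle_objective(X, mu_model, Sigma_model, th))
```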
III. PROBABILITY MEASURE TRANSFORMATION
In this section, we develop a transform on the probability measure $P_{X;\theta}$. Under the proposed transform, we define the measure-transformed mean vector and covariance matrix, derive their strongly consistent estimates, and establish their sensitivity to higher-order statistical moments and resilience to outliers. These quantities will be used in the following section to construct the proposed measure-transformed extension of the GQMLE (9).

A. Probability measure transform

Definition 1. Given a non-negative function $u : \mathbb{C}^p \to \mathbb{R}_+$ satisfying

$$0 < \mathrm{E}[u(\mathbf{X}); P_{X;\theta}] < \infty, \quad (10)$$

a transform on $P_{X;\theta}$ is defined via the relation:

$$Q^{(u)}_{X;\theta}(A) \triangleq T_u[P_{X;\theta}](A) = \int_A \varphi_u(\mathbf{x};\theta)\, dP_{X;\theta}(\mathbf{x}), \quad (11)$$

where $A \in \mathcal{S}_{\mathcal{X}}$ and

$$\varphi_u(\mathbf{x};\theta) \triangleq \frac{u(\mathbf{x})}{\mathrm{E}[u(\mathbf{X}); P_{X;\theta}]}. \quad (12)$$

The function $u(\cdot)$ is called the MT-function.
Proposition 1 (Properties of the transform). Let $Q^{(u)}_{X;\theta}$ be defined by relation (11). Then

1) $Q^{(u)}_{X;\theta}$ is a probability measure on $\mathcal{S}_{\mathcal{X}}$.
2) $Q^{(u)}_{X;\theta}$ is absolutely continuous w.r.t. $P_{X;\theta}$, with Radon-Nikodym derivative [30]:

$$\frac{dQ^{(u)}_{X;\theta}(\mathbf{x})}{dP_{X;\theta}(\mathbf{x})} = \varphi_u(\mathbf{x};\theta). \quad (13)$$

[A proof is given in Appendix A]

The probability measure $Q^{(u)}_{X;\theta}$ is said to be generated by the MT-function $u(\cdot)$. By modifying $u(\cdot)$, such that the condition (10) is satisfied, virtually any probability measure on $\mathcal{S}_{\mathcal{X}}$ can be obtained.
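As a quick numerical sanity check of Property 1 (our illustration, not part of the paper), one can verify by Monte Carlo that $\mathrm{E}[\varphi_u(\mathbf{X}); P_{X;\theta}] = 1$, so that $Q^{(u)}_{X;\theta}$ has total mass one; the sampling distribution and MT-function below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200000, 3))               # samples from some P_X
u = lambda x: np.exp(-np.sum(x**2, axis=1) / 4.0)  # a non-negative MT-function obeying (10)
Eu = u(X[:100000]).mean()                          # Monte Carlo estimate of E[u(X); P_X]
phi = u(X[100000:]) / Eu                           # phi_u of Eq. (12) on an independent half
print(phi.mean())                                  # ~1.0, i.e., Q^(u) has total mass one
```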
B. The measure-transformed mean and covariance

According to (13) the mean vector and covariance matrix of $\mathbf{X}$ under $Q^{(u)}_{X;\theta}$, called the MT-mean and MT-covariance, are given by:

$$\mu^{(u)}_X(\theta) \triangleq \mathrm{E}[\mathbf{X}\varphi_u(\mathbf{X};\theta); P_{X;\theta}] \quad (14)$$

and

$$\Sigma^{(u)}_X(\theta) \triangleq \mathrm{E}\left[\left(\mathbf{X}-\mu^{(u)}_X(\theta)\right)\left(\mathbf{X}-\mu^{(u)}_X(\theta)\right)^H\varphi_u(\mathbf{X};\theta); P_{X;\theta}\right], \quad (15)$$

respectively. Equations (14) and (15) imply that $\mu^{(u)}_X(\theta)$ and $\Sigma^{(u)}_X(\theta)$ are weighted mean and covariance of $\mathbf{X}$ under $P_{X;\theta}$, with the weighting function $\varphi_u(\cdot;\cdot)$ defined in (12). By modifying the MT-function $u(\cdot)$, such that the condition (10) is satisfied, the MT-mean and MT-covariance are modified. In particular, by choosing $u(\cdot)$ to be any non-zero constant valued function we have $Q^{(u)}_{X;\theta} = P_{X;\theta}$, for which the standard mean vector $\mu_X(\theta)$ and covariance matrix $\Sigma_X(\theta)$ are obtained. Alternatively, when $u(\cdot)$ is a non-constant analytic function, which has a convergent Taylor series expansion, the resulting MT-mean and MT-covariance involve higher-order statistical moments of $P_{X;\theta}$.

C. The empirical measure-transformed mean and covariance

According to (14) and (15) the MT-mean and MT-covariance can be estimated using only samples
from the distribution $P_{X;\theta}$. In the following proposition, strongly consistent estimates of $\mu^{(u)}_X(\theta)$ and $\Sigma^{(u)}_X(\theta)$ are constructed based on $N$ i.i.d. samples of $\mathbf{X}$.
Proposition 2 (Strongly consistent estimates of the MT-mean and MT-covariance). Let $\mathbf{X}_n$, $n = 1,\ldots,N$ denote a sequence of i.i.d. samples from $P_{X;\theta}$. Define the empirical mean and covariance estimates:

$$\hat{\mu}^{(u)}_X \triangleq \sum_{n=1}^{N}\mathbf{X}_n\,\hat{\varphi}_u(\mathbf{X}_n) \quad (16)$$

and

$$\hat{\Sigma}^{(u)}_X \triangleq \sum_{n=1}^{N}\left(\mathbf{X}_n - \hat{\mu}^{(u)}_X\right)\left(\mathbf{X}_n - \hat{\mu}^{(u)}_X\right)^H\hat{\varphi}_u(\mathbf{X}_n), \quad (17)$$

respectively, where

$$\hat{\varphi}_u(\mathbf{X}_n) \triangleq \frac{u(\mathbf{X}_n)}{\sum_{n=1}^{N}u(\mathbf{X}_n)}. \quad (18)$$

If

$$\mathrm{E}\left[\|\mathbf{X}\|^2 u(\mathbf{X}); P_{X;\theta}\right] < \infty, \quad (19)$$

then $\hat{\mu}^{(u)}_X \xrightarrow{\text{w.p. 1}} \mu^{(u)}_X(\theta)$ and $\hat{\Sigma}^{(u)}_X \xrightarrow{\text{w.p. 1}} \Sigma^{(u)}_X(\theta)$ as $N\to\infty$, where "$\xrightarrow{\text{w.p. 1}}$" denotes convergence with probability (w.p.) 1 [33]. [A proof is given in Appendix B]
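Since (16)-(18) are simple self-normalized weighted moments, they admit a direct implementation; the following NumPy sketch (ours, not from the paper) computes the empirical MT-mean and MT-covariance for complex data X of shape (N, p) and an arbitrary MT-function u:

```python
import numpy as np

def mt_moments(X, u):
    """Empirical MT-mean (16) and MT-covariance (17) with the weights (18)."""
    w = u(X)                                   # u(X_n), n = 1..N
    phi = w / w.sum()                          # self-normalized weights, Eq. (18)
    mu = phi @ X                               # empirical MT-mean, Eq. (16)
    Xc = X - mu
    Sigma = (Xc * phi[:, None]).T @ Xc.conj()  # empirical MT-covariance, Eq. (17)
    return mu, Sigma

# With u(X) = 1 this returns the SMV and the (biased) SCM:
# mu_hat, Sigma_hat = mt_moments(X, lambda x: np.ones(len(x)))
```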
Note that for $u(\mathbf{X}) \equiv 1$ the estimators $\hat{\mu}^{(u)}_X$ and $\frac{N}{N-1}\hat{\Sigma}^{(u)}_X$ reduce to the standard unbiased SMV and SCM, respectively. Also notice that $\hat{\mu}^{(u)}_X$ and $\hat{\Sigma}^{(u)}_X$ can be written as statistical functionals $\eta^{(u)}_X[\cdot]$ and $\Psi^{(u)}_X[\cdot]$ of the empirical probability measure $\hat{P}_X$ defined in (4), i.e.,

$$\hat{\mu}^{(u)}_X = \frac{\mathrm{E}[\mathbf{X}u(\mathbf{X}); \hat{P}_X]}{\mathrm{E}[u(\mathbf{X}); \hat{P}_X]} \triangleq \eta^{(u)}_X[\hat{P}_X] \quad (20)$$

and

$$\hat{\Sigma}^{(u)}_X = \frac{\mathrm{E}\left[(\mathbf{X} - \eta^{(u)}_X[\hat{P}_X])(\mathbf{X} - \eta^{(u)}_X[\hat{P}_X])^H u(\mathbf{X}); \hat{P}_X\right]}{\mathrm{E}[u(\mathbf{X}); \hat{P}_X]} \triangleq \Psi^{(u)}_X[\hat{P}_X]. \quad (21)$$

By (12), (14), (15), (20) and (21), when $\hat{P}_X$ is replaced by $P_{X;\theta}$ we have $\eta^{(u)}_X[P_{X;\theta}] = \mu^{(u)}_X(\theta)$ and $\Psi^{(u)}_X[P_{X;\theta}] = \Sigma^{(u)}_X(\theta)$, which implies that $\hat{\mu}^{(u)}_X$ and $\hat{\Sigma}^{(u)}_X$ are Fisher consistent [34].
D. Robustness to outliers

Here, we study the robustness of the empirical MT-mean and MT-covariance to outliers using their vector and matrix valued influence functions [35]. Define the probability measure $P_{\epsilon} \triangleq (1-\epsilon)P_{X;\theta} + \epsilon\delta_{\mathbf{y}}$, where $0 \le \epsilon \le 1$, $\mathbf{y} \in \mathbb{C}^p$, and $\delta_{\mathbf{y}}$ is the Dirac probability measure at $\mathbf{y}$. The influence function of a Fisher consistent estimator with statistical functional $H[\cdot]$ at probability distribution $P_{X;\theta}$ is defined as [35]:

$$\mathrm{IF}_{H}(\mathbf{y}; P_{X;\theta}) \triangleq \lim_{\epsilon\to 0}\frac{H[P_{\epsilon}] - H[P_{X;\theta}]}{\epsilon} = \left.\frac{\partial H[P_{\epsilon}]}{\partial\epsilon}\right|_{\epsilon=0}. \quad (22)$$

The influence function describes the effect on the estimator of an infinitesimal contamination at the point $\mathbf{y}$. An estimator is said to be B-robust if for any fixed value of $\theta \in \Theta$ its influence function is bounded over $\mathbb{C}^p$ [35]. Using (20)-(22) one can verify that the influence functions of $\hat{\mu}^{(u)}_X$ and $\hat{\Sigma}^{(u)}_X$ are given by

$$\mathrm{IF}_{\eta^{(u)}_x}(\mathbf{y}; P_{X;\theta}) = \frac{u(\mathbf{y})\left(\mathbf{y} - \mu^{(u)}_X(\theta)\right)}{\mathrm{E}[u(\mathbf{X}); P_{X;\theta}]} \quad (23)$$

and

$$\mathrm{IF}_{\Psi^{(u)}_x}(\mathbf{y}; P_{X;\theta}) = \frac{u(\mathbf{y})\left[(\mathbf{y} - \mu^{(u)}_X(\theta))(\mathbf{y} - \mu^{(u)}_X(\theta))^H - \Sigma^{(u)}_X(\theta)\right]}{\mathrm{E}[u(\mathbf{X}); P_{X;\theta}]}, \quad (24)$$

respectively. The following proposition states a sufficient condition for boundedness of (23) and (24). This condition is satisfied for the Gaussian MT-function proposed in Section V.

Proposition 3. The influence functions (23) and (24) are bounded for any fixed value of $\theta \in \Theta$ if the MT-function $u(\mathbf{y})$ and the product $u(\mathbf{y})\|\mathbf{y}\|^2_2$ are bounded over $\mathbb{C}^p$.

[A proof is given in Appendix D]
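The role of Proposition 3 can be seen numerically; in the sketch below (ours, not from the paper), the influence function (23) of the MT-mean stays bounded in $\|\mathbf{y}\|$ for a Gaussian MT-function, while for $u \equiv 1$ (the SMV) it grows without bound. The MT-mean and $\mathrm{E}[u(\mathbf{X}); P_{X;\theta}]$ values below are placeholders:

```python
import numpy as np

# Our illustration of Proposition 3: with the Gaussian MT-function
# u(y) = exp(-||y||^2 / w^2), the influence function (23) of the MT-mean is
# bounded in ||y||, whereas for u = 1 (the SMV) it grows linearly.
w = 2.0
mu, Eu = np.zeros(2), 1.0                # placeholder MT-mean and E[u(X)] values
for r in [1.0, 10.0, 100.0]:
    y = np.array([r, 0.0])
    if_gauss = np.exp(-r**2 / w**2) * (y - mu) / Eu   # Eq. (23) with Gaussian u
    if_smv = (y - mu) / Eu                            # Eq. (23) with u = 1
    print(r, np.linalg.norm(if_gauss), np.linalg.norm(if_smv))
```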
IV. THE MEASURE-TRANSFORMED GAUSSIAN QUASI MAXIMUM LIKELIHOOD ESTIMATOR
In this section we extend the GQMLE (9) by applying the transformation (11) to the probability measure $P_{X;\theta}$. Similarly to the GQMLE, given a sequence of samples from $P_{X;\theta_0}$, the proposed MT-GQMLE of $\theta_0$ minimizes the empirical Kullback-Leibler divergence between the transformed probability measure $Q^{(u)}_{X;\theta_0}$ and a hypothesized complex circular Gaussian probability distribution. This Gaussian measure, denoted here as $\Phi^{(u)}_{X;\theta}$, is characterized by the MT-mean $\mu^{(u)}_X(\theta)$ and MT-covariance $\Sigma^{(u)}_X(\theta)$ corresponding to the transformed probability measure $Q^{(u)}_{X;\theta}$. The minimization is carried out w.r.t. the free vector parameter $\theta$. Regularity conditions for consistency, asymptotic normality and unbiasedness are derived. Additionally, we provide a closed-form expression for the asymptotic MSE and obtain a necessary and sufficient condition for asymptotic efficiency. Optimal selection of the MT-function $u(\cdot)$ out of some parametric class of functions is also discussed.

A. The MT-GQMLE

The KLD between $Q^{(u)}_{X;\theta_0}$ and $\Phi^{(u)}_{X;\theta}$ is defined as [7]:

$$D_{\mathrm{KL}}\left[Q^{(u)}_{X;\theta_0}\|\Phi^{(u)}_{X;\theta}\right] \triangleq \mathrm{E}\left[\log\frac{q^{(u)}(\mathbf{X};\theta_0)}{\phi^{(u)}(\mathbf{X};\theta)}; Q^{(u)}_{X;\theta_0}\right], \quad (25)$$

where $q^{(u)}(\mathbf{x};\theta_0)$ and

$$\phi^{(u)}(\mathbf{x};\theta) \triangleq \det^{-1}\left[\pi\Sigma^{(u)}_X(\theta)\right]\exp\left(-\left(\mathbf{x} - \mu^{(u)}_X(\theta)\right)^H\left(\Sigma^{(u)}_X(\theta)\right)^{-1}\left(\mathbf{x} - \mu^{(u)}_X(\theta)\right)\right) \quad (26)$$

are the density functions associated with $Q^{(u)}_{X;\theta_0}$ and $\Phi^{(u)}_{X;\theta}$, respectively. According to (13), $D_{\mathrm{KL}}[Q^{(u)}_{X;\theta_0}\|\Phi^{(u)}_{X;\theta}]$ can be estimated using only samples from $P_{X;\theta_0}$. Hence, similarly to (16) and (17), an empirical estimate of (25) given a sequence of samples $\mathbf{X}_n$, $n = 1,\ldots,N$ from $P_{X;\theta_0}$ is defined as:

$$\hat{D}_{\mathrm{KL}}\left[Q^{(u)}_{X;\theta_0}\|\Phi^{(u)}_{X;\theta}\right] \triangleq \sum_{n=1}^{N}\hat{\varphi}_u(\mathbf{X}_n)\log\frac{q^{(u)}(\mathbf{X}_n;\theta_0)}{\phi^{(u)}(\mathbf{X}_n;\theta)}, \quad (27)$$

where $\hat{\varphi}_u(\cdot)$ is defined in (18). The proposed estimator of $\theta_0$ is obtained by minimization of (27) w.r.t. $\theta$. Since the density $q^{(u)}(\cdot;\theta_0)$ is $\theta$-independent, this minimization is equivalent to maximization of the term $\sum_{n=1}^{N}\hat{\varphi}_u(\mathbf{X}_n)\log\phi^{(u)}(\mathbf{X}_n;\theta)$, which by (16), (17) and (26) amounts to maximization of the following objective function:

$$J_u(\theta) \triangleq -D_{\mathrm{LD}}\left[\hat{\Sigma}^{(u)}_X\|\Sigma^{(u)}_X(\theta)\right] - \left\|\hat{\mu}^{(u)}_X - \mu^{(u)}_X(\theta)\right\|^2_{(\Sigma^{(u)}_X(\theta))^{-1}}, \quad (28)$$

where the operators $D_{\mathrm{LD}}[\cdot\|\cdot]$ and $\|\cdot\|_{(\cdot)}$ are defined below (8). Thus, the proposed MT-GQMLE is given by:

$$\hat{\theta}_u = \arg\max_{\theta\in\Theta} J_u(\theta). \quad (29)$$
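A direct implementation of (28)-(29) reuses the empirical MT-moments from the sketch in Subsection III-C; the model maps mt_mu_model and mt_Sigma_model below are hypothetical placeholders for $\mu^{(u)}_X(\theta)$ and $\Sigma^{(u)}_X(\theta)$ (our sketch, not from the paper):

```python
import numpy as np

def mt_gqmle_objective(X, u, mt_mu_model, mt_Sigma_model, theta):
    """Evaluate J_u(theta) of Eq. (28); mt_moments is the sketch of Subsection III-C."""
    N, p = X.shape
    mu_hat, Sigma_hat = mt_moments(X, u)                  # Eqs. (16)-(17)
    mu, Sigma = mt_mu_model(theta), mt_Sigma_model(theta)
    A = Sigma_hat @ np.linalg.inv(Sigma)
    d_ld = np.real(np.trace(A)) - np.linalg.slogdet(A)[1] - p
    diff = mu_hat - mu
    return -d_ld - np.real(diff.conj() @ np.linalg.solve(Sigma, diff))

# Eq. (29), approximated over a grid:
# theta_hat = max(theta_grid, key=lambda th: mt_gqmle_objective(X, u, m, S, th))
```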
By modifying the MT-function $u(\cdot)$, such that the condition (10) is satisfied, the MT-GQMLE is modified, resulting in a family of estimators generalizing the GQMLE described in Subsection II-B. In particular, by choosing $u(\mathbf{x})$ to be any non-zero constant function over $\mathcal{X}$, then $Q^{(u)}_{X;\theta} = P_{X;\theta}$ and the standard GQMLE (9) is obtained.

B. Asymptotic performance analysis

Here, we study the asymptotic performance of the proposed estimator (29). For simplicity, we assume a sequence of i.i.d. samples $\mathbf{X}_n$, $n = 1,\ldots,N$ from $P_{X;\theta_0}$.

Theorem 1 (Strong consistency of $\hat{\theta}_u$). Assume that the following conditions are satisfied:

(A-1) The parameter space $\Theta$ is compact.
(A-2) $\mu^{(u)}_X(\theta_0) \ne \mu^{(u)}_X(\theta)$ or $\Sigma^{(u)}_X(\theta_0) \ne \Sigma^{(u)}_X(\theta)$ $\forall\theta \ne \theta_0$.
(A-3) $\Sigma^{(u)}_X(\theta)$ is non-singular $\forall\theta \in \Theta$.
(A-4) $\mu^{(u)}_X(\theta)$ and $\Sigma^{(u)}_X(\theta)$ are continuous over $\Theta$.
(A-5) $\mathrm{E}[\|\mathbf{X}\|^2 u(\mathbf{X}); P_{X;\theta_0}] < \infty$.

Then,

$$\hat{\theta}_u \xrightarrow{\text{w.p. 1}} \theta_0 \text{ as } N\to\infty. \quad (30)$$

[A proof is given in Appendix E]
Theorem 2 (Asymptotic normality and unbiasedness of $\hat{\theta}_u$). Assume that the following conditions are satisfied:

(B-1) $\hat{\theta}_u \xrightarrow{P} \theta_0$ as $N\to\infty$, where "$\xrightarrow{P}$" denotes convergence in probability [33].
(B-2) $\theta_0$ lies in the interior of $\Theta$, which is assumed to be compact.
(B-3) $\mu^{(u)}_X(\theta)$ and $\Sigma^{(u)}_X(\theta)$ are twice continuously differentiable in $\Theta$.
(B-4) $\mathrm{E}[u^2(\mathbf{X}); P_{X;\theta_0}] < \infty$ and $\mathrm{E}[\|\mathbf{X}\|^4_2 u^2(\mathbf{X}); P_{X;\theta_0}] < \infty$.

Then,

$$\hat{\theta}_u - \theta_0 \xrightarrow{D} \mathcal{N}(\mathbf{0}, C_u(\theta_0)) \text{ as } N\to\infty, \quad (31)$$

where "$\xrightarrow{D}$" denotes convergence in distribution [33]. The asymptotic MSE at $\theta_0$ takes the form:

$$C_u(\theta_0) = N^{-1}F^{-1}_u(\theta_0)G_u(\theta_0)F^{-1}_u(\theta_0), \quad (32)$$

where

$$G_u(\theta) \triangleq \mathrm{E}\left[u^2(\mathbf{X})\psi_u(\mathbf{X};\theta)\psi^T_u(\mathbf{X};\theta); P_{X;\theta_0}\right], \quad (33)$$

$$\psi_u(\mathbf{X};\theta) \triangleq \nabla_{\theta}\log\phi^{(u)}(\mathbf{X};\theta), \quad (34)$$

$$F_u(\theta) \triangleq -\mathrm{E}[u(\mathbf{X})\Gamma_u(\mathbf{X};\theta); P_{X;\theta_0}], \quad (35)$$

$$\Gamma_u(\mathbf{X};\theta) \triangleq \nabla^2_{\theta}\log\phi^{(u)}(\mathbf{X};\theta), \quad (36)$$

and it is assumed that $F_u(\theta)$ is non-singular at $\theta = \theta_0$.

[A proof is given in Appendix F]
The following proposition relates the asymptotic MSE (32) of $\hat{\theta}_u$ and the Cramér-Rao lower bound (CRLB) [36], [37].

Proposition 4 (Relation to the CRLB). Let

$$\eta(\mathbf{X};\theta) \triangleq \nabla_{\theta}\log f(\mathbf{X};\theta) \quad (37)$$

denote the gradient of the logarithm of the likelihood function (2) w.r.t. $\theta$. Assume that the following conditions are satisfied:

(C-1) For any $\theta \in \Theta$, the partial derivatives $\frac{\partial f(\mathbf{x};\theta)}{\partial\theta_k}$ and $\frac{\partial^2\phi^{(u)}(\mathbf{x};\theta)}{\partial\theta_k\partial\theta_j}$, $k,j = 1,\ldots,m$, exist $\rho$-a.e., where $\theta_k$, $k = 1,\ldots,m$, denote the entries of the vector parameter $\theta$.
(C-2) For any $\theta \in \Theta$, $v_k(\mathbf{x};\theta) \triangleq \frac{\partial\log\phi^{(u)}(\mathbf{x};\theta)}{\partial\theta_k}f(\mathbf{x};\theta)u(\mathbf{x}) \in L^1(\mathcal{X},\rho)$, $k = 1,\ldots,m$, where $L^1(\mathcal{X},\rho)$ denotes the space of absolutely integrable functions, defined on $\mathcal{X}$, w.r.t. the measure $\rho$.
(C-3) There exist dominating functions $r_{k,j}(\mathbf{x}) \in L^1(\mathcal{X},\rho)$, $k,j = 1,\ldots,m$, such that for any $\theta \in \Theta$, $\left|\frac{\partial v_k(\mathbf{x};\theta)}{\partial\theta_j}\right| \le r_{k,j}(\mathbf{x})$ $\rho$-a.e.
(C-4) The matrix $G_u(\theta)$ (33) and the Fisher information matrix

$$I_{\mathrm{FIM}}(\theta) \triangleq \mathrm{E}\left[\eta(\mathbf{X};\theta)\eta^T(\mathbf{X};\theta); P_{X;\theta}\right]$$

are non-singular at $\theta = \theta_0$.

Then,

$$C_u(\theta_0) \succeq N^{-1}I^{-1}_{\mathrm{FIM}}(\theta_0),$$

where equality holds if and only if

$$\eta(\mathbf{X};\theta_0) = I_{\mathrm{FIM}}(\theta_0)F^{-1}_u(\theta_0)\psi_u(\mathbf{X};\theta_0)u(\mathbf{X}) \text{ w.p. 1.} \quad (38)$$
[A proof is given in Appendix G]

One can verify that when $P_{X;\theta}$ is a Gaussian measure, the condition (38) is satisfied only for non-zero constant valued MT-functions, resulting in the standard GQMLE (9) that only involves first and second-order moments. This implies that in the Gaussian case, non-constant MT-functions will always lead to asymptotic performance degradation. In the non-Gaussian case, however, as will be illustrated in the sequel, there are many practical scenarios where a non-constant MT-function can significantly decrease the asymptotic MSE as compared to the GQMLE. This results in estimators with weighted mean and covariance that involve higher-order moments, as discussed in Subsection III-B, whose empirical estimates can gain robustness against outliers, as discussed in Subsection III-D.

By solving the differential equation (38) one can obtain the MT-function for which the resulting MT-GQMLE is asymptotically efficient. Unfortunately, in the non-Gaussian case, the solution is highly cumbersome and requires knowledge of the likelihood function $f(\mathbf{X};\theta)$. Therefore, as will be described in Subsection IV-C, we propose an alternative technique for optimizing the choice of the MT-function that is based on the following empirical estimate of the asymptotic MSE.

Theorem 3 (Empirical asymptotic MSE). Define the empirical asymptotic MSE:

$$\hat{C}_u(\hat{\theta}_u) \triangleq N^{-1}\hat{F}^{-1}_u(\hat{\theta}_u)\hat{G}_u(\hat{\theta}_u)\hat{F}^{-1}_u(\hat{\theta}_u), \quad (39)$$

where

$$\hat{G}_u(\theta) \triangleq N^{-1}\sum_{n=1}^{N}u^2(\mathbf{X}_n)\psi_u(\mathbf{X}_n;\theta)\psi^T_u(\mathbf{X}_n;\theta) \quad (40)$$

and

$$\hat{F}_u(\theta) \triangleq -N^{-1}\sum_{n=1}^{N}u(\mathbf{X}_n)\Gamma_u(\mathbf{X}_n;\theta). \quad (41)$$

Furthermore, assume that the following conditions are satisfied:

(D-1) $\hat{\theta}_u \xrightarrow{P} \theta_0$ as $N\to\infty$.
(D-2) $\mu^{(u)}_X(\theta)$ and $\Sigma^{(u)}_X(\theta)$ are twice continuously differentiable in $\Theta$, which is assumed to be compact.
(D-3) $\mathrm{E}[u^2(\mathbf{X}); P_{X;\theta_0}] < \infty$ and $\mathrm{E}[\|\mathbf{X}\|^4 u^2(\mathbf{X}); P_{X;\theta_0}] < \infty$.

Then,

$$\hat{C}_u(\hat{\theta}_u) \xrightarrow{P} C_u(\theta_0) \text{ as } N\to\infty. \quad (42)$$
[A proof is given in Appendix H]

C. Optimization of the choice of the MT-function

Here we restrict the class of MT-functions to some parametric family $\{u(\mathbf{X};\omega), \omega \in \Omega \subseteq \mathbb{C}^r\}$ that satisfies the conditions stated in Definition 1 and Theorem 3. For example, the Gaussian family of functions that satisfy the conditions in Proposition 3 is a natural choice for inducing outlier resistance. An optimal choice of the MT-function parameter $\omega$ would minimize the empirical asymptotic MSE (39) that is constructed from the same sequence of samples used for obtaining the MT-GQMLE (29), as sketched below.
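For a scalar parameter, the selection rule can be sketched as follows (our illustration, not the paper's implementation): the per-sample quantities $\psi_u$ (34) and $\Gamma_u$ (36) are obtained here by central finite differences of $\log\phi^{(u)}$ (26), a numerical shortcut rather than the closed-form derivatives; mt_mu and mt_Sigma are hypothetical model maps.

```python
import numpy as np

def log_phi_u(x, theta, mt_mu, mt_Sigma):
    """log of the Gaussian density (26) evaluated at a single sample x."""
    mu, Sigma = mt_mu(theta), mt_Sigma(theta)
    d = x - mu
    return (-np.linalg.slogdet(np.pi * Sigma)[1]
            - np.real(d.conj() @ np.linalg.solve(Sigma, d)))

def empirical_asymptotic_mse(X, u, theta_hat, mt_mu, mt_Sigma, h=1e-4):
    """Sandwich estimate (39) for a scalar parameter; psi_u (34) and Gamma_u (36)
    via central finite differences (a numerical shortcut, not the paper's)."""
    N = len(X)
    f0 = np.array([log_phi_u(x, theta_hat, mt_mu, mt_Sigma) for x in X])
    fp = np.array([log_phi_u(x, theta_hat + h, mt_mu, mt_Sigma) for x in X])
    fm = np.array([log_phi_u(x, theta_hat - h, mt_mu, mt_Sigma) for x in X])
    psi = (fp - fm) / (2 * h)            # per-sample score of log phi^(u)
    gam = (fp - 2 * f0 + fm) / h**2      # per-sample second derivative
    w = u(X)
    G = np.mean(w**2 * psi**2)           # Eq. (40), scalar case
    F = -np.mean(w * gam)                # Eq. (41), scalar case
    return G / (N * F**2)                # Eq. (39), scalar case

# Selecting omega: re-estimate theta for each candidate and keep the minimizer
# (u_family and estimate_theta below are hypothetical helpers):
# omega_opt = min(omega_grid, key=lambda om: empirical_asymptotic_mse(
#     X, lambda x: u_family(x, om), estimate_theta(X, om), mt_mu, mt_Sigma))
```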
V. APPLICATION: SOURCE LOCALIZATION
We illustrate the use of the proposed MT-GQMLE (29) for source localization in spherically contoured noise arising from the compound Gaussian (CG) family [38]. This family encompasses common heavy-tailed distributions, such as the t-distribution and the K-distribution, that have been widely adopted for modelling radar clutter [39]-[41].

We consider a uniform linear array (ULA) of $p$ sensors with half wavelength spacing that receive a signal generated by a narrowband far-field point source with azimuthal angle of arrival (AOA) $\theta_0$. Under this model the array output satisfies [13]:

$$\mathbf{X}_n = S_n\mathbf{a}(\theta_0) + \mathbf{W}_n, \quad n = 1,\ldots,N, \quad (43)$$

where $n$ is a discrete time index, $\mathbf{X}_n \in \mathbb{C}^p$ is the vector of received signals, $S_n \in \mathbb{C}$ is the emitted signal, $\mathbf{a}(\theta_0) \triangleq [1, \exp(-i\pi\sin(\theta_0)),\ldots,\exp(-i\pi(p-1)\sin(\theta_0))]^T$ is the steering vector of the array toward direction $\theta_0$, and $\mathbf{W}_n \in \mathbb{C}^p$ is an additive noise. We assume that the following conditions are satisfied: 1) $\theta_0 \in \Theta = [-\pi/2, \pi/2 - \delta]$, where $\delta$ is a small positive constant, 2) the emitted signal is symmetrically distributed about the origin, 3) $S_n$ and $\mathbf{W}_n$ are statistically independent and first-order stationary, and 4) the noise component is spherically contoured with stochastic representation [38]:

$$\mathbf{W}_n = \nu_n\mathbf{Z}_n, \quad (44)$$

where $\nu_n \in \mathbb{R}_{++}$ is a first-order stationary process and $\mathbf{Z}_n \in \mathbb{C}^p$ is a proper-complex wide-sense stationary Gaussian process with zero mean and scaled unit covariance $\sigma^2_Z\mathbf{I}$. The processes $\nu_n$ and $\mathbf{Z}_n$ are also assumed to be statistically independent.
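For illustration (ours, not from the paper), snapshots obeying (43)-(44) can be generated as follows; the texture law for $\nu_n$ below is one convenient heavy-tailed choice rather than the paper's exact parameterization, and $\sigma_S = \sigma_Z = 1$ are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
p, N, theta0 = 4, 3000, np.deg2rad(30.0)
a0 = np.exp(-1j * np.pi * np.arange(p) * np.sin(theta0))   # steering vector a(theta_0)
S = rng.choice([-1.0, 1.0], size=N)                        # BPSK emitted signal, sigma_S = 1
# Compound Gaussian noise (44): W_n = nu_n * Z_n with a heavy-tailed texture nu_n
nu = 1.0 / np.sqrt(rng.gamma(shape=1.0, scale=1.0, size=N))
Z = (rng.standard_normal((N, p)) + 1j * rng.standard_normal((N, p))) / np.sqrt(2)
X = S[:, None] * a0 + nu[:, None] * Z                      # array snapshots, Eq. (43)
```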
In order to gain robustness against outliers, as well as sensitivity to higher-order moments, we specify the MT-function in the zero-centred Gaussian family of functions parametrized by a width parameter $\omega$, i.e.,

$$u_G(\mathbf{x};\omega) \triangleq \exp\left(-\|\mathbf{x}\|^2/\omega^2\right), \quad \omega \in \mathbb{R}_{++}. \quad (45)$$

One can verify that the MT-function (45) satisfies the B-robustness conditions stated in Proposition 3. Using (14), (15) and (43)-(45) it can be shown that the MT-mean and MT-covariance under the transformed probability measure $Q^{(u_G)}_{X;\theta}$ are given by:

$$\mu^{(u_G)}_X(\theta;\omega) = \mathbf{0} \quad (46)$$

and

$$\Sigma^{(u_G)}_X(\theta;\omega) = r_S(\omega)\mathbf{a}(\theta)\mathbf{a}^H(\theta) + r_W(\omega)\mathbf{I}, \quad (47)$$

respectively, where $r_S(\omega)$ and $r_W(\omega)$ are some strictly positive functions of $\omega$. Hence, by substituting (45)-(47) into (28), the resulting MT-GQMLE (29) is obtained by solving the following maximization problem:

$$\hat{\theta}_{u_G}(\omega) = \arg\max_{\theta\in\Theta}\, \mathbf{a}^H(\theta)\hat{C}^{(u_G)}_X(\omega)\mathbf{a}(\theta),$$

where $\hat{C}^{(u_G)}_X(\omega) \triangleq \hat{\Sigma}^{(u_G)}_X(\omega) + \hat{\mu}^{(u_G)}_X(\omega)\hat{\mu}^{(u_G)H}_X(\omega)$.
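The resulting estimator admits a simple grid-search implementation (our sketch, reusing mt_moments from Subsection III-C):

```python
import numpy as np

def mt_gqmle_aoa(X, omega, grid_deg):
    """AOA estimate: maximize a(theta)^H C_hat^(uG)(omega) a(theta) over a grid."""
    p = X.shape[1]
    uG = lambda x: np.exp(-np.sum(np.abs(x)**2, axis=1) / omega**2)  # Eq. (45)
    mu, Sigma = mt_moments(X, uG)                   # Eqs. (16)-(17)
    C = Sigma + np.outer(mu, mu.conj())             # C_hat^(uG)(omega)
    best_val, best_th = -np.inf, None
    for th in np.deg2rad(grid_deg):
        a = np.exp(-1j * np.pi * np.arange(p) * np.sin(th))
        val = np.real(a.conj() @ C @ a)
        if val > best_val:
            best_val, best_th = val, th
    return np.rad2deg(best_th)

# e.g. theta_hat_deg = mt_gqmle_aoa(X, omega=5.0, grid_deg=np.arange(-90.0, 89.5, 0.1))
```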
Under the considered settings, it can be shown that the conditions stated in Theorems 1-3 are satisfied. The resulting asymptotic MSE (32) is given by:

$$C_{u_G}(\theta;\omega) = \frac{6\,\omega^2 p\,\mathrm{E}\left[\left(\nu^4\sigma^4_Z + \frac{\nu^2\sigma^2_Z\,\omega^2}{2\nu^2\sigma^2_Z + \omega^2}|S|^2\right)h\left(\sqrt{2p}\,S, \sqrt{2}\,\nu\sigma_Z, \omega\right); P_{S,\nu}\right]}{\pi^2\cos^2(\theta)(p^2 - 1)N\,\mathrm{E}^2\left[p\,|S|^2 h\left(\sqrt{p}\,S, \nu\sigma_Z, \omega\right); P_{S,\nu}\right]}, \quad (48)$$

where $h(S,\nu,\omega) \triangleq \left((\nu^2 + \omega^2)/\omega^2\right)^{-p-2}\exp\left(-|S|^2/(\nu^2 + \omega^2)\right)$. Furthermore, its empirical estimate (39) takes the form:

$$\hat{C}_{u_G}(\hat{\theta}_{u_G}(\omega);\omega) = \frac{\sum_{n=1}^{N}\alpha^2(\mathbf{X}_n;\hat{\theta}_{u_G}(\omega))\,u^2_G(\mathbf{X}_n;\omega)}{\left(\sum_{n=1}^{N}\beta(\mathbf{X}_n;\hat{\theta}_{u_G}(\omega))\,u_G(\mathbf{X}_n;\omega)\right)^2}, \quad (49)$$

where $\alpha(\mathbf{X};\theta) \triangleq 2\mathrm{Re}\left\{\dot{\mathbf{a}}^H(\theta)\mathbf{X}\mathbf{X}^H\mathbf{a}(\theta)\right\}$, $\beta(\mathbf{X};\theta) \triangleq 2\mathrm{Re}\left\{\ddot{\mathbf{a}}^H(\theta)\mathbf{X}\mathbf{X}^H\mathbf{a}(\theta) + |\dot{\mathbf{a}}^H(\theta)\mathbf{X}|^2\right\}$, $\dot{\mathbf{a}}(\theta) \triangleq d\mathbf{a}(\theta)/d\theta$ and $\ddot{\mathbf{a}}(\theta) \triangleq d^2\mathbf{a}(\theta)/d\theta^2$.
In the following example, we consider a BPSK signal with power $\sigma^2_S$ impinging on a 4-element ULA with half wavelength spacing from AOA $\theta_0 = 30°$. We consider the following $p$-variate complex compound Gaussian noise distributions with zero location parameter and isotropic dispersion $\sigma^2_Z\mathbf{I}$: Gaussian, Cauchy, t-distribution with $\lambda = 2$ degrees of freedom, and K-distribution with shape parameter $\lambda = 0.75$. Notice that unlike the Gaussian distribution, the other noise distributions are heavy tailed and produce outliers. The generalized signal-to-noise-ratio (GSNR) is defined as $\mathrm{GSNR} \triangleq 10\log_{10}(\sigma^2_S/\sigma^2_Z)$ and is used to index the estimation performance. For each noise type we performed two simulations.

In the first simulation example, we compared the asymptotic MSE (48) to its empirical estimate (49) as a function of $\omega$, obtained from a single realization of N = 3000 i.i.d. snapshots and a fixed value of GSNR. For the Gaussian, Cauchy, and t-distributed noise the GSNR value was set to −5 [dB]. For the K-distributed noise the GSNR was set to −15 [dB]. Observing Figs. 1(a), 2(a), 3(a) and 4(a) one sees that due to the consistency of (49), which follows from Theorem 3, the compared quantities are very close. This implies that the empirical asymptotic MSE can be reliably used for optimal choice of the MT-function parameter, as discussed in Subsection IV-C.

In the second simulation example, we compared the empirical, asymptotic (48) and empirical asymptotic (49) MSEs of the MT-GQMLE to the empirical MSEs obtained by the GQMLE [29] and the MLE that is used as a benchmark for performance evaluation. The optimal Gaussian MT-function parameter $\omega_{\mathrm{opt}}$ was obtained by minimizing (49) over the range $\Omega = [1, 30]$. Here, N = 3000 snapshots were used with averaging over 1000 Monte-Carlo simulations. The GSNR is used to index the performance as depicted in Figs. 1(b), 2(b), 3(b) and 4(b). Notice that for the Gaussian noise case, there is no significant performance gap between the compared methods. For the other noise distributions, the proposed MT-GQMLE outperforms the GQMLE and attains estimation performance significantly closer to that obtained by the MLE, which, unlike the MT-GQMLE, requires complete knowledge of the likelihood function. This indicates that the model mismatch errors caused by the normality assumption can be significantly mitigated by the proposed measure transformation approach.
[Figure 1: RMSE plots omitted; caption retained below.]

Fig. 1. Source localization in Gaussian noise: (a) Asymptotic RMSE (48) and its empirical estimate (49) versus the width parameter ω of the Gaussian MT-function (45). The true angle of arrival, number of snapshots and GSNR were set to θ = 30°, N = 3000 and GSNR = −5 [dB], respectively. Notice that due to the consistency of (49) the compared quantities are close. (b) The empirical, asymptotic (48) and empirical asymptotic (49) RMSEs of the MT-GQMLE versus GSNR as compared to the empirical RMSEs of the GQMLE and the MLE. The true angle of arrival and number of snapshots were set to θ = 30° and N = 3000, respectively. Notice that the MT-GQMLE performs similarly to the GQMLE when the noise is normally distributed.
[Figure 2: RMSE plots omitted; caption retained below.]

Fig. 2. Source localization in Cauchy noise: (a) Asymptotic RMSE (48) and its empirical estimate (49) versus the width parameter ω of the Gaussian MT-function (45). The true angle of arrival, number of snapshots and GSNR were set to θ = 30°, N = 3000 and GSNR = −5 [dB], respectively. Notice that due to the consistency of (49) the compared quantities are close. (b) The empirical, asymptotic (48) and empirical asymptotic (49) RMSEs of the MT-GQMLE versus GSNR as compared to the empirical RMSEs of the GQMLE and the MLE. The true angle of arrival and number of snapshots were set to θ = 30° and N = 3000, respectively. Notice that the MT-GQMLE outperforms the non-robust GQMLE and performs similarly to the MLE.
[Figure 3: RMSE plots omitted; caption retained below.]

Fig. 3. Source localization in t-distributed noise with 2 degrees of freedom: (a) Asymptotic RMSE (48) and its empirical estimate (49) versus the width parameter ω of the Gaussian MT-function (45). The true angle of arrival, number of snapshots and GSNR were set to θ = 30°, N = 3000 and GSNR = −5 [dB], respectively. Notice that due to the consistency of (49) the compared quantities are close. (b) The empirical, asymptotic (48) and empirical asymptotic (49) RMSEs of the MT-GQMLE versus GSNR as compared to the empirical RMSEs of the GQMLE and the MLE. The true angle of arrival and number of snapshots were set to θ = 30° and N = 3000, respectively. Notice that the MT-GQMLE outperforms the non-robust GQMLE and performs similarly to the MLE.
[Figure 4: RMSE plots omitted; caption retained below.]

Fig. 4. Source localization in K-distributed noise with shape parameter λ = 0.75: (a) Asymptotic RMSE (48) and its empirical estimate (49) versus the width parameter ω of the Gaussian MT-function (45). The true angle of arrival, number of snapshots and GSNR were set to θ = 30°, N = 3000 and GSNR = −15 [dB], respectively. Notice that due to the consistency of (49) the compared quantities are close. (b) The empirical, asymptotic (48) and empirical asymptotic (49) RMSEs of the MT-GQMLE versus GSNR as compared to the empirical RMSEs of the GQMLE and the MLE. The true angle of arrival and number of snapshots were set to θ = 30° and N = 3000, respectively. Notice that the MT-GQMLE outperforms the non-robust GQMLE and attains estimation performance significantly closer to that obtained by the MLE.
VI. CONCLUSION

In this paper a new multivariate estimator, called MT-GQMLE, was developed that applies Gaussian quasi maximum likelihood estimation under a transformed probability distribution of the data. We have shown that the MT-GQMLE can gain sensitivity to higher-order statistical moments and resilience to outliers when the MT-function associated with the transform is non-constant and satisfies some mild regularity conditions. After specifying the MT-function in the Gaussian family of functions, the proposed estimator was applied to source localization in non-Gaussian noise. Exploration of other classes of MT-functions may result in additional estimators in this family that have different useful properties.
APPENDIX

A. Proof of Proposition 1:

Property 1: Since $\varphi_u(\mathbf{x},\theta)$ is nonnegative, then by Corollary 2.3.6 in [33] $Q^{(u)}_{X;\theta}$ is a measure on $\mathcal{S}_{\mathcal{X}}$. Furthermore, $Q^{(u)}_{X;\theta}(\mathcal{X}) = 1$, so that $Q^{(u)}_{X;\theta}$ is a probability measure on $\mathcal{S}_{\mathcal{X}}$.

Property 2: Follows from Definitions 4.1.1 and 4.1.3 in [33].
B. Proof of Proposition 2:

By the real-imaginary decompositions of the MT-mean $\mu^{(u)}_X(\theta)$, the MT-covariance $\Sigma^{(u)}_X(\theta)$ and their empirical estimates $\hat{\mu}^{(u)}_X$ and $\hat{\Sigma}^{(u)}_X$, given in Lemmas 1 and 2 in Appendix C, it is sufficient to prove that $\hat{\mu}^{(g)}_Z \xrightarrow{\text{w.p. 1}} \mu^{(g)}_Z(\theta)$ and $\hat{\Sigma}^{(g)}_Z \xrightarrow{\text{w.p. 1}} \Sigma^{(g)}_Z(\theta)$ as $N\to\infty$. According to (16)-(18), $\hat{\mu}^{(g)}_Z$ and $\hat{\Sigma}^{(g)}_Z$ can be written as

$$\hat{\mu}^{(g)}_Z = \frac{\frac{1}{N}\sum_{n=1}^{N}\mathbf{Z}_n g(\mathbf{Z}_n)}{\frac{1}{N}\sum_{n=1}^{N}g(\mathbf{Z}_n)} \quad (50)$$

and

$$\hat{\Sigma}^{(g)}_Z = \frac{\frac{1}{N}\sum_{n=1}^{N}\mathbf{Z}_n\mathbf{Z}^T_n g(\mathbf{Z}_n)}{\frac{1}{N}\sum_{n=1}^{N}g(\mathbf{Z}_n)} - \hat{\mu}^{(g)}_Z\hat{\mu}^{(g)T}_Z, \quad (51)$$

respectively. Since $\{\mathbf{Z}_n\}_{n=1}^{N}$ is a sequence of i.i.d. samples from $P_{Z;\theta}$, then $\{\mathbf{Z}_n\mathbf{Z}^T_n g(\mathbf{Z}_n)\}_{n=1}^{N}$ in the r.h.s. of (51) define a sequence of i.i.d. samples of $\mathbf{Z}\mathbf{Z}^T g(\mathbf{Z})$. Moreover, if the condition in (19) is satisfied, then for any pair of entries $Z_k$, $Z_l$ of $\mathbf{Z}$ we have that

$$\mathrm{E}[|Z_k Z_l g(\mathbf{Z})|; P_{Z;\theta}] = \mathrm{E}\left[\left|Z_k g^{1/2}(\mathbf{Z})\right|\left|Z_l g^{1/2}(\mathbf{Z})\right|; P_{Z;\theta}\right] \le \mathrm{E}\left[|Z_k|^2 g(\mathbf{Z}); P_{Z;\theta}\right]^{1/2}\mathrm{E}\left[|Z_l|^2 g(\mathbf{Z}); P_{Z;\theta}\right]^{1/2} \le \mathrm{E}\left[\|\mathbf{X}\|^2 u(\mathbf{X}); P_{X;\theta}\right] < \infty,$$

where the first semi-inequality stems from the Hölder inequality for random variables [33], and the second one stems from the definitions of $\mathbf{Z}$ and $g(\mathbf{Z})$ in Lemma 1 in Appendix C. Therefore, by Khinchine's strong law of large numbers (KSLLN) [30]

$$\frac{1}{N}\sum_{n=1}^{N}\mathbf{Z}_n\mathbf{Z}^T_n g(\mathbf{Z}_n) \xrightarrow[N\to\infty]{\text{w.p. 1}} \mathrm{E}\left[\mathbf{Z}\mathbf{Z}^T g(\mathbf{Z}); P_{Z;\theta}\right]. \quad (52)$$

Similarly, it can be shown that if the condition in (19) is satisfied, then by the KSLLN

$$\frac{1}{N}\sum_{n=1}^{N}\mathbf{Z}_n g(\mathbf{Z}_n) \xrightarrow[N\to\infty]{\text{w.p. 1}} \mathrm{E}[\mathbf{Z}g(\mathbf{Z}); P_{Z;\theta}] \quad (53)$$

and

$$\frac{1}{N}\sum_{n=1}^{N}g(\mathbf{Z}_n) \xrightarrow[N\to\infty]{\text{w.p. 1}} \mathrm{E}[g(\mathbf{Z}); P_{Z;\theta}], \quad (54)$$

where $\mathrm{E}[g(\mathbf{Z}); P_{Z;\theta}] > 0$ by the assumption in (10). Since by (50) and (51) $\hat{\mu}^{(g)}_Z$ and $\hat{\Sigma}^{(g)}_Z$ are obtained by continuous mappings of the terms $\frac{1}{N}\sum_{n=1}^{N}g(\mathbf{Z}_n)$, $\frac{1}{N}\sum_{n=1}^{N}\mathbf{Z}_n g(\mathbf{Z}_n)$ and $\frac{1}{N}\sum_{n=1}^{N}\mathbf{Z}_n\mathbf{Z}^T_n g(\mathbf{Z}_n)$, then by (12), (52)-(54) and the Mann-Wald Theorem [42]

$$\hat{\mu}^{(g)}_Z \xrightarrow[N\to\infty]{\text{w.p. 1}} \mathrm{E}[\mathbf{Z}\varphi_g(\mathbf{Z}); P_{Z;\theta}] \quad (55)$$

and

$$\hat{\Sigma}^{(g)}_Z \xrightarrow[N\to\infty]{\text{w.p. 1}} \mathrm{E}\left[\mathbf{Z}\mathbf{Z}^T\varphi_g(\mathbf{Z}); P_{Z;\theta}\right] - \mathrm{E}[\mathbf{Z}\varphi_g(\mathbf{Z}); P_{Z;\theta}]\,\mathrm{E}\left[\mathbf{Z}^T\varphi_g(\mathbf{Z}); P_{Z;\theta}\right]. \quad (56)$$

Therefore, by (14), (15), (55) and (56) it is concluded that $\hat{\mu}^{(g)}_Z \xrightarrow{\text{w.p. 1}} \mu^{(g)}_Z(\theta)$ and $\hat{\Sigma}^{(g)}_Z \xrightarrow{\text{w.p. 1}} \Sigma^{(g)}_Z(\theta)$ as $N\to\infty$.

C. Real-imaginary decompositions of the MT-mean, MT-covariance and their empirical estimates

The real-imaginary decompositions of the MT-mean and MT-covariance of $\mathbf{X}$ are given in the following lemma, which follows directly from (12), (14) and (15).

Lemma 1. Let $\mathbf{U}$ and $\mathbf{V}$ denote the real and imaginary components of $\mathbf{X}$, respectively. Define the real random vector $\mathbf{Z} \triangleq [\mathbf{U}^T, \mathbf{V}^T]^T$ and the MT-function $g(\mathbf{Z}) \triangleq u(\mathbf{X})$. Let $[\mu^{(g)}_Z(\theta)]_1$ and $[\mu^{(g)}_Z(\theta)]_2$ denote the $p\times 1$ subvectors of the real-valued MT-mean $\mu^{(g)}_Z(\theta)$ satisfying

$$\mu^{(g)}_Z(\theta) = \begin{bmatrix} [\mu^{(g)}_Z(\theta)]_1 \\ [\mu^{(g)}_Z(\theta)]_2 \end{bmatrix}$$

and let $[\Sigma^{(g)}_Z(\theta)]_{1,1}$, $[\Sigma^{(g)}_Z(\theta)]_{1,2}$, $[\Sigma^{(g)}_Z(\theta)]_{2,1}$ and $[\Sigma^{(g)}_Z(\theta)]_{2,2}$ denote the $p\times p$ submatrices of the real-valued MT-covariance $\Sigma^{(g)}_Z(\theta)$ satisfying

$$\Sigma^{(g)}_Z(\theta) = \begin{bmatrix} [\Sigma^{(g)}_Z(\theta)]_{1,1} & [\Sigma^{(g)}_Z(\theta)]_{1,2} \\ [\Sigma^{(g)}_Z(\theta)]_{2,1} & [\Sigma^{(g)}_Z(\theta)]_{2,2} \end{bmatrix}.$$

The real-imaginary decompositions of the MT-mean and MT-covariance of $\mathbf{X}$ take the form:

$$\mu^{(u)}_X(\theta) = [\mu^{(g)}_Z(\theta)]_1 + i[\mu^{(g)}_Z(\theta)]_2 \quad (57)$$

and

$$\Sigma^{(u)}_X(\theta) = [\Sigma^{(g)}_Z(\theta)]_{1,1} + [\Sigma^{(g)}_Z(\theta)]_{2,2} + i\left([\Sigma^{(g)}_Z(\theta)]_{2,1} - [\Sigma^{(g)}_Z(\theta)]_{1,2}\right), \quad (58)$$

respectively.

Similarly, the real-imaginary decompositions of the empirical MT-mean and MT-covariance of $\mathbf{X}$ are given in the following lemma, which follows directly from (16)-(18).

Lemma 2. Let $\mathbf{X}_n$, $n = 1,\ldots,N$ denote a sequence of samples from $P_{X;\theta}$, and let $\mathbf{U}_n$, $\mathbf{V}_n$ denote the real and imaginary components of $\mathbf{X}_n$, respectively. Define the real random vector $\mathbf{Z}_n \triangleq [\mathbf{U}^T_n, \mathbf{V}^T_n]^T$ and the MT-function $g(\mathbf{Z}_n) \triangleq u(\mathbf{X}_n)$. Let $[\hat{\mu}^{(g)}_Z]_1$ and $[\hat{\mu}^{(g)}_Z]_2$ denote the $p\times 1$ subvectors of the real-valued empirical MT-mean $\hat{\mu}^{(g)}_Z$ satisfying

$$\hat{\mu}^{(g)}_Z = \begin{bmatrix} [\hat{\mu}^{(g)}_Z]_1 \\ [\hat{\mu}^{(g)}_Z]_2 \end{bmatrix}$$

and let $[\hat{\Sigma}^{(g)}_Z]_{1,1}$, $[\hat{\Sigma}^{(g)}_Z]_{1,2}$, $[\hat{\Sigma}^{(g)}_Z]_{2,1}$ and $[\hat{\Sigma}^{(g)}_Z]_{2,2}$ denote the $p\times p$ submatrices of the real-valued empirical MT-covariance $\hat{\Sigma}^{(g)}_Z$ satisfying

$$\hat{\Sigma}^{(g)}_Z = \begin{bmatrix} [\hat{\Sigma}^{(g)}_Z]_{1,1} & [\hat{\Sigma}^{(g)}_Z]_{1,2} \\ [\hat{\Sigma}^{(g)}_Z]_{2,1} & [\hat{\Sigma}^{(g)}_Z]_{2,2} \end{bmatrix}.$$

The real-imaginary decompositions of the empirical MT-mean and MT-covariance of $\mathbf{X}$ take the form:

$$\hat{\mu}^{(u)}_X = [\hat{\mu}^{(g)}_Z]_1 + i[\hat{\mu}^{(g)}_Z]_2 \quad (59)$$

and

$$\hat{\Sigma}^{(u)}_X = [\hat{\Sigma}^{(g)}_Z]_{1,1} + [\hat{\Sigma}^{(g)}_Z]_{2,2} + i\left([\hat{\Sigma}^{(g)}_Z]_{2,1} - [\hat{\Sigma}^{(g)}_Z]_{1,2}\right), \quad (60)$$

respectively.
D. Proof of Proposition 3:

The influence functions (23) and (24) can be written as:

$$\mathrm{IF}_{\eta^{(u)}_x}(\mathbf{y}; P_{X;\theta}) = c\,u(\mathbf{y})\left\|\mathbf{y} - \mu^{(u)}_X(\theta)\right\|\mathbf{b}(\mathbf{y};\theta) \quad (61)$$

and

$$\mathrm{IF}_{\Psi^{(u)}_x}(\mathbf{y}; P_{X;\theta}) = c\,u(\mathbf{y})\left(\left\|\mathbf{y} - \mu^{(u)}_X(\theta)\right\|^2\mathbf{B}(\mathbf{y};\theta) - \Sigma^{(u)}_X(\theta)\right), \quad (62)$$

respectively, where $\theta \in \Theta$ is fixed, $c \triangleq \mathrm{E}^{-1}[u(\mathbf{X}); P_{X;\theta}]$, $\mathbf{b}(\mathbf{y};\theta) \triangleq \left(\mathbf{y} - \mu^{(u)}_X(\theta)\right)\left\|\mathbf{y} - \mu^{(u)}_X(\theta)\right\|^{-1}$ and $\mathbf{B}(\mathbf{y};\theta) \triangleq \mathbf{b}(\mathbf{y};\theta)\mathbf{b}^H(\mathbf{y};\theta)$. Since $\|\mathbf{b}(\mathbf{y};\theta)\| = 1$ for any $\mathbf{y} \in \mathbb{C}^p$, the real and imaginary components of $\mathbf{b}(\mathbf{y};\theta)$ and $\mathbf{B}(\mathbf{y};\theta)$ are bounded. Thus, the influence functions (61) and (62) are bounded over $\mathbb{C}^p$ for any fixed $\theta \in \Theta$ if $u(\mathbf{y})$ and $u(\mathbf{y})\|\mathbf{y}\|^2$ are bounded.

E. Proof of Theorem 1:

Define the deterministic quantity
$$\bar{J}_u(\theta) \triangleq -D_{\mathrm{LD}}\left[\Sigma^{(u)}_X(\theta_0)\|\Sigma^{(u)}_X(\theta)\right] - \left\|\mu^{(u)}_X(\theta_0) - \mu^{(u)}_X(\theta)\right\|^2_{(\Sigma^{(u)}_X(\theta))^{-1}}. \quad (63)$$

According to Lemmas 3 and 4, stated below, $\bar{J}_u(\theta)$ is uniquely maximized at $\theta = \theta_0$ and the random objective function $J_u(\theta)$ defined in (28) converges uniformly w.p. 1 to $\bar{J}_u(\theta)$ as $N\to\infty$. Furthermore, by assumptions A-3 and A-4, $J_u(\theta)$ is continuous over the compact parameter space $\Theta$ w.p. 1. Therefore, by Theorem 3.4 in [43] we conclude that (30) holds.

Lemma 3. Let $\bar{J}_u(\theta)$ be defined by relation (63). If conditions A-1-A-4 are satisfied, then $\bar{J}_u(\theta)$ is uniquely maximized at $\theta = \theta_0$.

Proof: By (63) and assumptions A-3 and A-4, $\bar{J}_u(\theta)$ is continuous over the compact set $\Theta$. Therefore, by the extreme-value theorem [44] it must have a global maximum in $\Theta$. We now show that the global maximum is uniquely attained at $\theta = \theta_0$. According to (63),

$$\bar{J}_u(\theta_0) - \bar{J}_u(\theta) = D_{\mathrm{LD}}\left[\Sigma^{(u)}_X(\theta_0)\|\Sigma^{(u)}_X(\theta)\right] + \left\|\mu^{(u)}_X(\theta_0) - \mu^{(u)}_X(\theta)\right\|^2_{(\Sigma^{(u)}_X(\theta))^{-1}}. \quad (64)$$

Notice that the log-determinant divergence between positive-definite matrices $\mathbf{A}$, $\mathbf{B}$ satisfies $D_{\mathrm{LD}}[\mathbf{A}\|\mathbf{B}] \ge 0$ with equality if and only if $\mathbf{A} = \mathbf{B}$ [32]. Also notice that the weighted Euclidean norm of a vector $\mathbf{a}$ with positive-definite weighting matrix $\mathbf{C}$ satisfies $\|\mathbf{a}\|_{\mathbf{C}} \ge 0$ with equality if and only if $\mathbf{a} = \mathbf{0}$. Therefore, by assumptions A-2 and A-3 we conclude that the difference in (64) is finite and strictly positive for all $\theta \ne \theta_0$, which implies that $\bar{J}_u(\theta)$ is uniquely maximized at $\theta = \theta_0$.

Lemma 4. Let $J_u(\theta)$ and $\bar{J}_u(\theta)$ be defined by relations (28) and (63), respectively. If conditions A-1 and A-3-A-5 are satisfied, then

$$\sup_{\theta\in\Theta}\left|J_u(\theta) - \bar{J}_u(\theta)\right| \xrightarrow{\text{w.p. 1}} 0 \text{ as } N\to\infty. \quad (65)$$

Proof: Using (28), (63), the Cauchy-Schwartz inequality and its matrix extension [45], one can verify that

$$\sup_{\theta\in\Theta}\left|J_u(\theta) - \bar{J}_u(\theta)\right| \le \left(\left\|\Sigma^{(u)}_X(\theta_0) - \hat{\Sigma}^{(u)}_X\right\|_{\mathrm{Fro}} + \left\|\mu^{(u)}_X(\theta_0)\mu^{(u)H}_X(\theta_0) - \hat{\mu}^{(u)}_X\hat{\mu}^{(u)H}_X\right\|_{\mathrm{Fro}}\right)\sup_{\theta\in\Theta}\left\|\left(\Sigma^{(u)}_X(\theta)\right)^{-1}\right\|_{\mathrm{Fro}} + \left|\log\det\left[\hat{\Sigma}^{(u)}_X\left(\Sigma^{(u)}_X(\theta_0)\right)^{-1}\right]\right| + 2\left\|\mu^{(u)}_X(\theta_0) - \hat{\mu}^{(u)}_X\right\|\sup_{\theta\in\Theta}\left\|\left(\Sigma^{(u)}_X(\theta)\right)^{-1}\mu^{(u)}_X(\theta)\right\| \quad (66)$$

w.p. 1. Notice that by assumptions A-3 and A-4 and the compactness of $\Theta$, the terms $\sup_{\theta\in\Theta}\|(\Sigma^{(u)}_X(\theta))^{-1}\|_{\mathrm{Fro}}$ and $\sup_{\theta\in\Theta}\|(\Sigma^{(u)}_X(\theta))^{-1}\mu^{(u)}_X(\theta)\|$ are finite. Also notice that since $\mathbf{X}_n$, $n = 1,\ldots,N$ is a sequence of i.i.d. samples from $P_{X;\theta_0}$, then by Proposition 2 and assumption A-5, $\hat{\mu}^{(u)}_X \xrightarrow{\text{w.p. 1}} \mu^{(u)}_X(\theta_0)$ and $\hat{\Sigma}^{(u)}_X \xrightarrow{\text{w.p. 1}} \Sigma^{(u)}_X(\theta_0)$ as $N\to\infty$. Furthermore, note that the operators $\|\cdot\|_{\mathrm{Fro}}$, $\|\cdot\|$ and $\log\det[\cdot]$ define real continuous mappings of $\hat{\mu}^{(u)}_X$ and $\hat{\Sigma}^{(u)}_X$. Hence, by the Mann-Wald Theorem [42] the upper bound in (66) converges to zero w.p. 1 as $N\to\infty$, and therefore, the relation (65) must hold.

F. Proof of Theorem 2:

By assumptions B-1 and B-2, the estimator $\hat{\theta}_u$ is weakly consistent and the true parameter vector $\theta_0$ lies in the interior of $\Theta$, which is assumed to be compact. Therefore, we conclude that $\hat{\theta}_u$ lies in an open
neighbourhood $\mathcal{U} \subset \Theta$ of $\theta_0$ with sufficiently high probability as $N$ gets large, i.e., it does not lie on the boundary of $\Theta$. Hence, $\hat{\theta}_u$ is a maximum point of the objective function $J_u(\theta)$ (28), whose gradient satisfies

$$\nabla_{\theta}J_u(\hat{\theta}_u)\sum_{n=1}^{N}u(\mathbf{X}_n) = \sum_{n=1}^{N}u(\mathbf{X}_n)\psi_u(\mathbf{X}_n;\hat{\theta}_u) = \mathbf{0}. \quad (67)$$

By (26) and assumption B-3, the vector function $\psi_u(\mathbf{X};\theta)$ defined in (34) is continuous over $\Theta$ w.p. 1. Using (67) and the mean-value theorem [46] applied to each entry of $\psi_u(\mathbf{X};\theta)$, we conclude that

$$\mathbf{0} = \frac{1}{\sqrt{N}}\sum_{n=1}^{N}u(\mathbf{X}_n)\psi_u(\mathbf{X}_n;\hat{\theta}_u) = \frac{1}{\sqrt{N}}\sum_{n=1}^{N}u(\mathbf{X}_n)\psi_u(\mathbf{X}_n;\theta_0) - \hat{F}_u(\theta^*)\sqrt{N}\left(\hat{\theta}_u - \theta_0\right), \quad (68)$$

where $\hat{F}_u(\theta)$ defined in (41) is an unbiased estimator of the symmetric matrix function $F_u(\theta)$ (35), $\Gamma_u(\mathbf{X};\theta)$ is defined in (36) and $\theta^*$ lies on the line segment connecting $\hat{\theta}_u$ and $\theta_0$.

Since $\hat{\theta}_u$ is weakly consistent, $\theta^*$ must be weakly consistent as well. Therefore, by Lemma 5, stated below, $\hat{F}_u(\theta^*) \xrightarrow{P} F_u(\theta_0)$ as $N\to\infty$, where $F_u(\theta)$ is non-singular at $\theta = \theta_0$ by assumption. Hence, by the Mann-Wald theorem [42]

$$\hat{F}^{-1}_u(\theta^*) \xrightarrow{P} F^{-1}_u(\theta_0) \text{ as } N\to\infty, \quad (69)$$

which implies that $\hat{F}_u(\theta^*)$ is invertible with sufficiently high probability as $N$ gets large. Therefore, by (68) the equality

$$\sqrt{N}\left(\hat{\theta}_u - \theta_0\right) = \hat{F}^{-1}_u(\theta^*)\frac{1}{\sqrt{N}}\sum_{n=1}^{N}u(\mathbf{X}_n)\psi_u(\mathbf{X}_n;\theta_0) \quad (70)$$

holds with sufficiently high probability as $N$ gets large. Furthermore, by Lemma 7, stated below,

$$\frac{1}{\sqrt{N}}\sum_{n=1}^{N}u(\mathbf{X}_n)\psi_u(\mathbf{X}_n;\theta_0) \xrightarrow{D} \mathcal{N}(\mathbf{0}, G_u(\theta_0)) \text{ as } N\to\infty, \quad (71)$$

where $G_u(\theta)$ is defined in (33). Thus, by (69)-(71) and Slutsky's theorem [33] the relation (31) holds.

Lemma 5. Let $F_u(\theta)$ and $\hat{F}_u(\theta)$ be the matrix functions defined in (35) and (41), respectively. Furthermore, let $\theta^*$ denote an estimator of $\theta$. If $\theta^* \xrightarrow{P} \theta_0$ as $N\to\infty$ and conditions B-3 and B-4 are satisfied, then $\hat{F}_u(\theta^*) \xrightarrow{P} F_u(\theta_0)$ as $N\to\infty$.

Proof: By the triangle inequality,

$$\|\hat{F}_u(\theta^*) - F_u(\theta_0)\| \le \|\hat{F}_u(\theta^*) - F_u(\theta^*)\| + \|F_u(\theta^*) - F_u(\theta_0)\|. \quad (72)$$

Therefore, it suffices to prove that

$$\|\hat{F}_u(\theta^*) - F_u(\theta^*)\| \xrightarrow{P} 0 \text{ as } N\to\infty \quad (73)$$

and

$$\|F_u(\theta^*) - F_u(\theta_0)\| \xrightarrow{P} 0 \text{ as } N\to\infty. \quad (74)$$

By Lemma 6, stated below, $F_u(\theta)$ is continuous in $\Theta$. Therefore, since $\theta^* \xrightarrow{P} \theta_0$ as $N\to\infty$, by the Mann-Wald Theorem [42] we conclude that (74) holds. We further conclude by Lemma 6 that

$$\|\hat{F}_u(\theta^*) - F_u(\theta^*)\| \le \sup_{\theta\in\Theta}\|\hat{F}_u(\theta) - F_u(\theta)\| \xrightarrow{P} 0 \text{ as } N\to\infty.$$

Lemma 6. Let $F_u(\theta)$ and $\hat{F}_u(\theta)$ be the matrix functions defined in (35) and (41), respectively. If conditions B-3 and B-4 are satisfied, then $F_u(\theta)$ is continuous in $\Theta$ and $\sup_{\theta\in\Theta}\|\hat{F}_u(\theta) - F_u(\theta)\| \xrightarrow{P} 0$ as $N\to\infty$.

Proof: Notice that the samples $\mathbf{X}_n$, $n = 1,\ldots,N$ are i.i.d., the parameter space $\Theta$ is compact, and by (26) and assumption B-3 the matrix function $\Gamma_u(\mathbf{X};\theta)$ defined in (36) is continuous in $\Theta$ w.p. 1. Furthermore, using (26), (36), the triangle inequality, the Cauchy-Schwartz inequality, assumption B-3, and the compactness of $\Theta$, it can be shown that there exists a positive scalar $C$ such that $\left|[\Gamma_u(\mathbf{X};\theta)]_{k,j}\right| \le \left(1 + \|\mathbf{X}\|^2\right)C$ w.p. 1 $\forall k,j = 1,\ldots,p$. Therefore, by (35), (41), assumption B-4 and the uniform weak law of large numbers [47], we conclude that $F_u(\theta)$ is continuous in $\Theta$ and $\sup_{\theta\in\Theta}\|\hat{F}_u(\theta) - F_u(\theta)\| \xrightarrow{P} 0$ as $N\to\infty$.

Lemma 7. Given a sequence $\mathbf{X}_n$, $n = 1,\ldots,N$ of i.i.d. samples from $P_{X;\theta_0}$,

$$\frac{1}{\sqrt{N}}\sum_{n=1}^{N}u(\mathbf{X}_n)\psi_u(\mathbf{X}_n;\theta_0) \xrightarrow{D} \mathcal{N}(\mathbf{0}, G_u(\theta_0)) \text{ as } N\to\infty,$$

where $G_u(\theta)$ and $\psi_u(\mathbf{X};\theta)$ are defined in (33) and (34), respectively.

Proof: Since $\mathbf{X}_n$, $n = 1,\ldots,N$ are i.i.d. random vectors and the functions $u(\cdot)$ and $\psi_u(\cdot,\cdot)$ are real, the products $u(\mathbf{X}_n)\psi_u(\mathbf{X}_n,\theta_0)$, $n = 1,\ldots,N$, are real and i.i.d. Furthermore, by Lemmas 8 and 9, stated below, the definition of $G_u(\theta)$ in (33) and the compactness of $\Theta$, we have that $\mathrm{E}[u(\mathbf{X})\psi_u(\mathbf{X};\theta_0); P_{X;\theta_0}] = \mathbf{0}$ and $\mathrm{cov}[u(\mathbf{X})\psi_u(\mathbf{X};\theta_0); P_{X;\theta_0}] = G_u(\theta_0)$ is finite. Therefore, by the central limit theorem [24] we conclude that $\frac{1}{\sqrt{N}}\sum_{n=1}^{N}u(\mathbf{X}_n)\psi_u(\mathbf{X}_n;\theta_0)$ is asymptotically normal with zero mean and covariance $G_u(\theta_0)$.

Lemma 8. The random vector function $\psi_u(\mathbf{X};\theta)$ defined in (34) satisfies

$$\mathrm{E}[u(\mathbf{X})\psi_u(\mathbf{X};\theta); P_{X;\theta}] = \mathbf{0}. \quad (75)$$

Proof: Notice that

$$\mathrm{E}[u(\mathbf{X})\psi_u(\mathbf{X};\theta); P_{X;\theta}] = \mathrm{E}[\psi_u(\mathbf{X};\theta)\varphi_u(\mathbf{X};\theta); P_{X;\theta}]\,\mathrm{E}[u(\mathbf{X}); P_{X;\theta}], \quad (76)$$

where $\varphi_u(\cdot;\cdot)$ is defined in (12) and the expectation $\mathrm{E}[u(\mathbf{X}); P_{X;\theta}]$ is finite by assumption (10). Also notice that by (26) and (34) the $k$-th entry of the vector function $\psi_u(\mathbf{X};\theta)$ is given by:

$$[\psi_u(\mathbf{X};\theta)]_k = -\mathrm{tr}\left[\left(\Sigma^{(u)}_X(\theta)\right)^{-1}\frac{\partial\Sigma^{(u)}_X(\theta)}{\partial\theta_k}\right] + 2\mathrm{Re}\left\{\left(\mathbf{X} - \mu^{(u)}_X(\theta)\right)^H\left(\Sigma^{(u)}_X(\theta)\right)^{-1}\frac{\partial\mu^{(u)}_X(\theta)}{\partial\theta_k}\right\} + \mathrm{tr}\left[\left(\Sigma^{(u)}_X(\theta)\right)^{-1}\frac{\partial\Sigma^{(u)}_X(\theta)}{\partial\theta_k}\left(\Sigma^{(u)}_X(\theta)\right)^{-1}\left(\mathbf{X} - \mu^{(u)}_X(\theta)\right)\left(\mathbf{X} - \mu^{(u)}_X(\theta)\right)^H\right]. \quad (77)$$

Therefore, by (14), (15), (76) and (77) the relation (75) is easily verified.

Lemma 9. Let $G_u(\theta)$ and $\hat{G}_u(\theta)$ be the matrix functions defined in (33) and (40), respectively. If conditions B-3 and B-4 are satisfied, then $G_u(\theta)$ is continuous in $\Theta$ and $\sup_{\theta\in\Theta}\|\hat{G}_u(\theta) - G_u(\theta)\| \xrightarrow{P} 0$ as $N\to\infty$.

Proof: Notice that the samples $\mathbf{X}_n$, $n = 1,\ldots,N$ are i.i.d., the parameter space $\Theta$ is compact, and by (26) and assumption B-3 the vector function $\psi_u(\mathbf{X};\theta)$ defined in (34) is continuous in $\Theta$ w.p. 1. Using (26), (34), the triangle inequality, the Cauchy-Schwartz inequality, assumption B-3, and the compactness of $\Theta$, it can be shown that there exists a positive scalar $C$ such that

$$\left|\left[\psi_u(\mathbf{X};\theta)\psi^T_u(\mathbf{X};\theta)\right]_{k,j}\right| \le \left(1 + \|\mathbf{X}\| + \|\mathbf{X}\|^2 + \|\mathbf{X}\|^3 + \|\mathbf{X}\|^4\right)C \quad \forall k,j = 1,\ldots,p. \quad (78)$$

Therefore, by (33), (40), assumption B-4 and the uniform weak law of large numbers [47], we conclude that $G_u(\theta)$ is continuous in $\Theta$ and $\sup_{\theta\in\Theta}\|\hat{G}_u(\theta) - G_u(\theta)\| \xrightarrow{P} 0$ as $N\to\infty$.
Define ξ (X; θ) , u(X)ψ u (X; θ). Using the definitions of Cu (·), Gu (·) and Fu (·) in (32), (33) and (35), respectively, identity 1 stated below, the symmetricity of Fu (θ), the non-singularity of Gu (θ) at θ = θ 0 and the covariance semi-inequality [3] one can verify that
T −1 C−1 ξ (X; θ 0 ) ξ T (X; θ 0 ) ; PX;θ0 u (θ 0 ) = N E[η(X; θ 0 )ξ (X; θ 0 ) ; PX;θ0 ]E × E[ξ (X; θ 0 ) η T (X; θ 0 ); PX;θ0 ]
N E[η(X; θ 0 )η T (X; θ 0 ); PX;θ0 ] , N IFIM (θ 0 ) .
(79)
where equality holds if and only if the relation (38) holds. The proof is complete by taking the inverse of (79). Identity 1. Fu (θ) = E[u(X)ψ u (X; θ)η T (X; θ); PX;θ ] Proof: According to Lemma 8, stated in Appendix F, E [u (X) ψ u (X, θ) ; PX;θ ] = 0. Therefore, by (2), (3) and (34) 0 =
∂ ∂θ
Z
u (x)
X
=
∂ log φ(u) (x; θ) ∂θ
T
f (x; θ) dρ (x)
Z
u (x)
∂ 2 log φ(u) (x; θ) f (x; θ) dρ (x) ∂θ∂θ T
Z
u (x)
(80)
X
+
X
∂ log φ(u) (x; θ) ∂θ
T
∂ log f (x; θ) f (x; θ) dρ (x) . ∂θ
The second equality in (80) stems from assumptions C-1-C-3 and Theorem 2.40 in [48] according to which integration and differentiation can be interchanged. Therefore, by (2), (34)-(37) and (80) E[u(X)ψ u (X; θ)η T (X; θ); PX;θ ] = −E[u(X)Γu (X; θ); PX;θ ] , Fu (θ). November 3, 2015
(81) DRAFT
H. Proof of Theorem 3:

Under assumption D-1, $\hat{\theta}_u \xrightarrow{P} \theta_0$ as $N\to\infty$. Therefore, by Lemma 5, stated in Appendix F, and Lemma 10, stated below, which require conditions D-2 and D-3 to be satisfied, we have that $\hat{F}_u(\hat{\theta}_u) \xrightarrow{P} F_u(\theta_0)$ as $N\to\infty$ and $\hat{G}_u(\hat{\theta}_u) \xrightarrow{P} G_u(\theta_0)$ as $N\to\infty$. Hence, by the Mann-Wald theorem [42] we conclude that (42) is satisfied.

Lemma 10. Let $G_u(\theta)$ and $\hat{G}_u(\theta)$ be the matrix functions defined in (33) and (40), respectively. Furthermore, let $\theta^*$ denote an estimator of $\theta_0$. If $\theta^* \xrightarrow{P} \theta_0$ as $N\to\infty$ and conditions D-2 and D-3 are satisfied, then $\hat{G}_u(\theta^*) \xrightarrow{P} G_u(\theta_0)$ as $N\to\infty$.

Proof: The proof is similar to that of Lemma 5 stated in Appendix F. Simply replace $F_u(\theta)$ and $\hat{F}_u(\theta)$ with $G_u(\theta)$ and $\hat{G}_u(\theta)$, and use Lemma 9 instead of Lemma 6.
REFERENCES

[1] R. A. Fisher, "On the mathematical foundations of theoretical statistics," Philosophical Transactions of the Royal Society of London, Series A, vol. 222, pp. 309-368, 1922.
[2] R. A. Fisher, "Theory of statistical estimation," Proceedings of the Cambridge Philosophical Society, vol. 22, pp. 700-725, 1925.
[3] E. L. Lehmann and G. Casella, Theory of Point Estimation, Springer, 1998.
[4] S. M. Kay, Fundamentals of Statistical Signal Processing: Detection Theory, Prentice-Hall, 1998.
[5] H. White, "Maximum likelihood estimation of misspecified models," Econometrica: Journal of the Econometric Society, pp. 1-25, 1982.
[6] Q. Ding and S. Kay, "Maximum likelihood estimator under a misspecified model with high signal-to-noise ratio," IEEE Transactions on Signal Processing, vol. 59, no. 8, pp. 4012-4016, 2011.
[7] S. Kullback and R. A. Leibler, "On information and sufficiency," The Annals of Mathematical Statistics, vol. 22, pp. 79-86, 1951.
[8] O. Rousseaux, G. Leus, and P. Stoica, "Gaussian maximum-likelihood channel estimation with short training sequences," IEEE Transactions on Wireless Communications, vol. 4, no. 6, pp. 2945-2955, 2005.
[9] E. Weinstein, "Decentralization of the Gaussian maximum likelihood estimator and its applications to passive array processing," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 29, no. 5, pp. 945-951, 1981.
[10] T. Bollerslev and J. M. Wooldridge, "Quasi-maximum likelihood estimation and inference in dynamic models with time-varying covariances," Econometric Reviews, vol. 11, no. 2, pp. 143-172, 1992.
[11] C. Gourieroux, A. Monfort, and A. Trognon, "Pseudo maximum likelihood methods: Theory," Econometrica, vol. 52, no. 3, pp. 681-700, 1984.
[12] B. Ottersten, M. Viberg, P. Stoica, and A. Nehorai, "Exact and large sample maximum likelihood techniques for parameter estimation and detection," in Radar Array Processing, ch. 4, pp. 99-151, Springer-Verlag, 1993.
[13] H. Krim and M. Viberg, "Two decades of array signal processing research: the parametric approach," IEEE Signal Processing Magazine, vol. 13, no. 4, pp. 67-94, 1996.
[14] K. Pearson, "Contributions to the mathematical theory of evolution," Philosophical Transactions of the Royal Society of London, vol. A, pp. 71-110, 1894.
[15] T. Bollerslev, "A conditionally heteroskedastic time series model for speculative prices and rates of return," The Review of Economics and Statistics, vol. 69, no. 3, pp. 542-547, 1987.
[16] R. Baillie and T. Bollerslev, "The message in daily exchange rates: a conditional-variance tale," Journal of Business & Economic Statistics, vol. 7, pp. 297-305, 1989.
[17] D. A. Hsieh, "Modeling heteroscedasticity in daily foreign-exchange rates," Journal of Business & Economic Statistics, vol. 7, no. 3, pp. 307-317, 1989.
[18] J. Fan, L. Qi, and D. Xiu, "Quasi-maximum likelihood estimation of GARCH models with heavy-tailed likelihoods," Journal of Business & Economic Statistics, vol. 32, no. 2, pp. 178-191, 2014.
[19] D. B. Williams and D. H. Johnson, "Robust estimation of structured covariance matrices," IEEE Transactions on Signal Processing, vol. 41, no. 9, pp. 2891-2906, 1993.
[20] Y. Yardimci, A. E. Cetin, and J. A. Cadzow, "Robust direction-of-arrival estimation in non-Gaussian noise," IEEE Transactions on Signal Processing, vol. 46, no. 5, pp. 1443-1451, 1998.
[21] P. G. Georgiou and C. Kyriakakis, "Maximum likelihood parameter estimation under impulsive conditions, a sub-Gaussian signal approach," Signal Processing, vol. 86, no. 10, pp. 3061-3075, 2006.
[22] P. J. Huber, "The behaviour of maximum likelihood estimates under nonstandard conditions," Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, no. 1, pp. 221-233, 1967.
[23] W. K. Newey and D. G. Steigerwald, "Asymptotic bias for quasi-maximum-likelihood estimators in conditional heteroskedasticity models," Econometrica, vol. 65, no. 3, pp. 587-599, 1997.
[24] R. J. Serfling, Approximation Theorems of Mathematical Statistics, John Wiley & Sons, 1980.
[25] K. Todros and A. O. Hero, "On measure-transformed canonical correlation analysis," IEEE Transactions on Signal Processing, vol. 60, no. 9, pp. 4570-4585, 2012.
[26] K. Todros and A. O. Hero, "Measure transformed canonical correlation analysis with application to financial data," Proceedings of SAM 2012, pp. 361-364, 2012.
[27] K. Todros and A. O. Hero, "Robust multiple signal classification via probability measure transformation," IEEE Transactions on Signal Processing, vol. 63, no. 5, pp. 1156-1170, 2015.
[28] K. Todros and A. O. Hero, "Robust measure transformed MUSIC for DOA estimation," Proceedings of ICASSP 2014, pp. 4190-4194, 2014.
[29] H. L. Van Trees, Optimum Array Processing, p. 922, John Wiley & Sons, 2002.
[30] G. B. Folland, Real Analysis, John Wiley and Sons, 1984.
[31] P. Schreier and L. L. Scharf, Statistical Signal Processing of Complex-Valued Data: The Theory of Improper and Noncircular Signals, pp. 39, 165, Cambridge University Press, 2010.
[32] I. S. Dhillon and J. A. Tropp, "Matrix nearness problems with Bregman divergences," SIAM Journal on Matrix Analysis and Applications, vol. 29, no. 4, pp. 1120-1146, 2007.
[33] K. B. Athreya and S. N. Lahiri, Measure Theory and Probability Theory, Springer-Verlag, 2006.
[34] D. R. Cox and D. V. Hinkley, Theoretical Statistics, Chapman & Hall, 1974.
[35] F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw, and W. A. Stahel, Robust Statistics: The Approach Based on Influence Functions, John Wiley & Sons, 2011.
[36] H. Cramér, "A contribution to the theory of statistical estimation," Skand. Akt. Tidskr., vol. 29, pp. 85-94, 1946.
[37] C. R. Rao, "Information and accuracy attainable in the estimation of statistical parameters," Bull. Calcutta Math. Soc., vol. 37, pp. 81-91, 1945.
[38] E. Ollila, D. E. Tyler, V. Koivunen, and H. V. Poor, "Complex elliptically symmetric distributions: survey, new results and applications," IEEE Transactions on Signal Processing, vol. 60, no. 11, pp. 5597-5625, 2012.
[39] E. Conte and M. Longo, "Characterization of radar clutter as a spherically invariant random process," IEE Proceedings F (Communications, Radar and Signal Processing), vol. 134, no. 2, pp. 191-197, 1987.
[40] F. Gini and A. Farina, "Vector subspace detection in compound-Gaussian clutter, part I: Survey and new results," IEEE Transactions on Aerospace and Electronic Systems, vol. 38, no. 4, pp. 1295-1311, 2002.
[41] M. Rangaswamy, "Spherically invariant random processes for modelling non-Gaussian radar clutter," Proceedings of the 27th ASILOMAR Conference on Signals, Systems and Computers, pp. 1106-1110, 1993.
[42] H. B. Mann and A. Wald, "On stochastic limit and order relationships," Ann. Math. Stat., vol. 14, pp. 217-226, 1943.
[43] H. White, Estimation, Inference and Specification Analysis, Cambridge University Press, 1996.
[44] P. M. Fitzpatrick, Advanced Calculus, American Mathematical Society, 2006.
[45] X. Yang, "Some trace inequalities for operators," Journal of the Australian Mathematical Society (Series A), vol. 58, no. 3, pp. 281-286, 1995.
[46] C. H. Edwards, Advanced Calculus of Several Variables, Courier Corporation, 2012.
[47] W. K. Newey and D. McFadden, "Large sample estimation and hypothesis testing," Handbook of Econometrics, vol. 4, pp. 2111-2245, 1994.
[48] M. Giaquinta and G. Modica, Mathematical Analysis: An Introduction to Functions of Several Variables, Birkhäuser, p. 88, Boston, 2009.