IEEE SIGNAL PROCESSING LETTERS, VOL. 25, NO. 2, FEBRUARY 2018
A New Random Variable Normalizing Transformation With Application to the GLRT

Steven Kay, Life Fellow, IEEE, and Yazan Rawashdeh, Member, IEEE
Abstract—A means of converting a random variable into an approximate standard normal is described. It is an extension of the transformation inherent in the use of the exponential embedded family approach to multifamily likelihood ratio testing, which helps explain why the transformation employed corrects the deficiencies of the generalized likelihood ratio test.

Index Terms—Signal analysis, signal resolution.
I. INTRODUCTION

A well-known theorem in probability is the probability integral transformation [4], which asserts that the random variable $Y = F_X(X)$, where $F_X$ is the cumulative distribution function of $X$, is uniformly distributed on the interval $[0, 1]$. This result is used in theoretical work and also to simulate random variables with a specified distribution from uniform ones. In a similar vein, we present a transformation that converts an arbitrary random variable into a standard normal by the use of its cumulant generating function (CGF). The standard normal is, however, an approximate one, since the main theorem is of an asymptotic nature. It is important to note, however, that the asymptotic (as the data record length becomes large) approximation is quite good, even for shorter data records. In practice, this transformation can be found analytically for some important problems of interest. One, which has served as the motivation for the work described herein, is the use of a transformation to convert a chi-squared random variable with $k$ degrees of freedom into a chi-squared random variable with one degree of freedom. Its importance is that it allows the use of the generalized likelihood ratio test (GLRT) for model-order selection and for multifamily detection. Without it, the GLRT would always choose the most complex model. In effect, the transformation serves to equalize the various test statistics, which can then be examined to determine the maximum and, hence, the most likely hypothesis. Upon further examination of this transformation, it was found to be the convex conjugate function of the CGF, hence what is referred to as the Legendre transform (LT) [2]. In hindsight, this is not unexpected
since the LT is related to the saddlepoint evaluation of the probability density function (PDF) and produces what is also known as the tilted PDF [3]. The practical utility of the theorem to be presented is that test statistics with differing PDFs, used to make a decision by choosing the maximum statistic to indicate the true hypothesis, can be compared on an equal basis. Hence, the decision produces the true hypothesis for large data records. This problem arises in model-order selection, in testing of different families of PDFs using estimated parameters, and in anomaly detection where the background statistics differ, to name a few. The latter example also arises in multiresolution techniques, for which the window sizes, and hence the background statistics, differ markedly.

II. THEOREM STATEMENT

To begin, we give some preliminary definitions. Let $X$ be a scalar random variable with PDF $p_X(x)$ for $-\infty < x < \infty$ and CGF $K_X(\eta)$ defined as
$$K_X(\eta) = \ln \int_{-\infty}^{\infty} \exp(\eta x)\, p_X(x)\, dx.$$
The parameter $\eta$ takes on values in the interval $\mathcal{N} = \{\eta : |K_X(\eta)| < \infty\}$, where $\mathcal{N}$ includes a small interval about zero. The CGF can be shown to be convex. The convex conjugate function of the CGF is defined as [8]
$$K_X^*(x) = \sup_{\eta \in \mathcal{N}} \left(\eta x - K_X(\eta)\right) \tag{1}$$
and is also known as the LT of the CGF, referred to hereafter as the LT-CGF. It enjoys a host of convenient mathematical properties, some of which are described in Appendix A. Next, we define the LT of a random variable $X$ by the transformation
$$y = \begin{cases} 2K_X^*(x), & x > E[X] \\ 0, & x \le E[X] \end{cases}$$
where $E[X]$ is the expected value of $X$.
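For orientation, consider the Gaussian case, for which the transformation is exact. If $X \sim N(\mu, \sigma^2)$, then $K_X(\eta) = \mu\eta + \sigma^2\eta^2/2$, the supremum in (1) is attained at $\hat\eta = (x - \mu)/\sigma^2$, and
$$K_X^*(x) = \frac{(x - \mu)^2}{2\sigma^2}.$$
Hence $2K_X^*(X) = ((X - \mu)/\sigma)^2$ is exactly $\chi_1^2$ on $\{X > \mu\}$, and the signed square root of $2K_X^*(X)$ is exactly $N(0, 1)$; the theorem below asserts that this behavior holds asymptotically for sums of IID random variables.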
The theorem is as follows.

Theorem II.1 (PDF of the LT-CGF): Let $X_N = \sum_{i=1}^{N} U_i$, where the $U_i$'s are independent and identically distributed (IID) with $E[U_i] = \mu$ and $\mathrm{var}(U_i) = \sigma^2$. Also, assume that $E[|U_i - \mu|^3] < \infty$. Define the random variable $Y_N$ by the transformation of $X_N$ as
$$y = \begin{cases} 2K_{X_N}^*(x), & x > E[X_N] \\ 0, & x \le E[X_N]. \end{cases} \tag{2}$$
Then, as $N \to \infty$, the distribution of $Y_N$ converges to the PDF
$$p_{Y_N}(y) = \begin{cases} \dfrac{1}{2}\dfrac{1}{\sqrt{2\pi y}}\exp(-y/2), & y > 0 \\[6pt] \dfrac{1}{2}\,\delta(y), & y = 0 \end{cases}$$
and zero otherwise. Thus, asymptotically, $Y_N$ equals zero with probability $1/2$ and has a $\chi_1^2$ PDF with probability $1/2$. As a corollary to the theorem, if we define
$$z = \begin{cases} \sqrt{2K_{X_N}^*(x)}, & x > E[X_N] \\ -\sqrt{2K_{X_N}^*(x)}, & x \le E[X_N] \end{cases}$$
then asymptotically the PDF of $Z_N$ converges to the $N(0,1)$ PDF. A proof of the theorem is given in Appendix B.
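As a numerical illustration of the corollary, one can draw $X_N$ as a sum of IID Exponential(1) variables, for which $K_{X_N}(\eta) = -N\ln(1 - \eta)$, compute the supremum in (1) by one-dimensional maximization, and verify the approximate normality of $Z_N$. A minimal Python sketch (ours, not from the letter; the sample size and optimizer bounds are arbitrary choices):

```python
# Monte Carlo check of the corollary for X_N = sum of N IID Exponential(1)
# variables; here K_{X_N}(eta) = -N*ln(1 - eta), finite for eta < 1.
import numpy as np
from scipy import stats
from scipy.optimize import minimize_scalar

N = 10
K = lambda eta: -N * np.log1p(-eta)  # CGF of X_N

def lt_cgf(x):
    # K*(x) = sup_eta (eta*x - K(eta)), found by bounded 1-D maximization
    res = minimize_scalar(lambda eta: -(eta * x - K(eta)),
                          bounds=(-50.0, 0.9999), method="bounded")
    return max(-res.fun, 0.0)        # K* >= 0; guard tiny negative round-off

rng = np.random.default_rng(1)
xs = rng.exponential(size=(20_000, N)).sum(axis=1)  # draws of X_N; E[X_N] = N
z = np.array([np.sign(x - N) * np.sqrt(2.0 * lt_cgf(x)) for x in xs])
print(stats.kstest(z, stats.norm.cdf))  # small KS statistic => Z_N ~ N(0,1)
```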
III. EXAMPLE
We next give an example, but do not use any stochastic convergence analysis so as to simplify the presentation.
A. Example—Chi-Squared Random Variable

Assume that $X \sim \chi_N^2$. The CGF is well known to be
$$K_X(\eta) = \ln \frac{1}{(1 - 2\eta)^{N/2}}, \qquad \eta < 1/2.$$
Thus, $K_X^*(x) = \max_{\eta < 1/2}(\eta x - K_X(\eta))$ for $x > \mu = N$, and the maximizing value is found to be $\hat\eta = \frac{1}{2}(1 - N/x) > 0$ for $x > E[X] = N$. Thus, we have $K_X^*(x) = \frac{1}{2}\left[x - N(\ln(x/N) + 1)\right]$, and hence
$$y = 2K_X^*(x)\,u(x - N) = \left[x - N\left(\ln\frac{x}{N} + 1\right)\right]u(x - N) \tag{3}$$
where $u(\cdot)$ is the unit step function. Now, consider values of $x$ for $x > E[X] = N$ by letting $x = N + \sqrt{2N}\,\delta$ for $\delta > 0$. This is the region for which $(x - N)/\sqrt{2N} = (x - E[X])/\sqrt{\mathrm{var}(X)}$ is positive. Then, as $N \to \infty$,
$$\begin{aligned} 2K_X^*(x) &= N + \sqrt{2N}\,\delta - N\left[\ln\left(\frac{N + \sqrt{2N}\,\delta}{N}\right) + 1\right] \\ &= \sqrt{2N}\,\delta - N\ln\left(1 + \frac{\delta}{\sqrt{N/2}}\right) \\ &\to \sqrt{2N}\,\delta - N\left[\frac{\delta}{\sqrt{N/2}} - \frac{1}{2}\frac{\delta^2}{N/2}\right] = \delta^2 = \left(\frac{x - N}{\sqrt{2N}}\right)^2 \end{aligned}$$
and therefore the distribution of
$$Y = 2K_X^*(X)\,u(X - N) \to \left(\frac{X - N}{\sqrt{2N}}\right)^2 u(X - N) = \left[\frac{X - N}{\sqrt{2N}}\,u\!\left(\frac{X - N}{\sqrt{2N}}\right)\right]^2 \to \left(Z\,u(Z)\right)^2$$
where $Z \sim N(0,1)$, since, by the central limit theorem (CLT), $(X - N)/\sqrt{2N}$ converges in distribution to $N(0,1)$. Finally, we have that $Y$ has the asymptotic distribution, which is a $\chi_1^2$ with probability $1/2$ and equals zero with probability $1/2$, in accordance with the theorem.
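The theorem can be checked by Monte Carlo for this example; the short Python sketch below (ours, not the authors' code) applies (3) to $\chi_N^2$ draws and compares the positive part of the result with a $\chi_1^2$ law.

```python
# Monte Carlo check of Theorem II.1 for the chi-squared example of Section III.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N = 5                               # degrees of freedom
x = rng.chisquare(N, size=200_000)  # X ~ chi^2_N

# LT-CGF transformation (3): y = [x - N(ln(x/N) + 1)] u(x - N)
y = np.where(x > N, x - N * (np.log(x / N) + 1.0), 0.0)

print(f"P(Y = 0) ~ {np.mean(y == 0):.3f} (theorem: 1/2 as N -> infinity)")
# Conditioned on Y > 0, Y should be approximately chi^2_1:
ks = stats.kstest(y[y > 0], stats.chi2(df=1).cdf)
print(f"KS distance to chi^2_1 on {{Y > 0}}: {ks.statistic:.3f}")
```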
Fig. 1. LT-CGF transformed PDF, standardized chi-squared PDF, and the N(0, 1) approximation for N = 5.

IV. COMPARISON OF ASYMPTOTIC TO TRUE PDF

The true PDF of the LT-CGF transformed chi-squared random variable can be found, although it is complicated. In particular, we illustrate the difference between the true and asymptotic PDFs for the transformation $z = \sqrt{y}$, where $y$ is given by (3). The results are shown in the top panel of Fig. 1 for $N = 5$. Note that the asymptotic PDF is $N(0, 1)$ for $z > 0$. As a basis for comparison, the top panel of Fig. 1 also shows the PDF of a standardized chi-squared random variable, defined as $(X - N)/\sqrt{2N}$, which according to the CLT converges to an $N(0, 1)$ PDF as $N \to \infty$. It is seen that the LT-CGF transformation is much more accurate, especially in the tails of the transformed distribution. This is not unexpected, since the LT-CGF transformation is related to a saddlepoint approximation, in which the PDF is tilted so as to allow a better approximation in the tails [10]. To examine the two transformations to normality more carefully, the bottom panel of Fig. 1 compares the true PDFs of the LT-CGF random variable and the standardized chi-squared random variable against the $N(0, 1)$ PDF in the tail region. As expected, the CLT approximation is poor in the tail region, while the LT-CGF gives a very close approximation, even for a relatively small number of degrees of freedom. This is important, for example, if accurate signal detection thresholds are to be calculated using a normal approximation.
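The tail accuracy can also be quantified exactly, without simulation, from the $\chi_N^2$ CDF; the sketch below (ours; the tail point $t = 3$ is an arbitrary choice) inverts (3) numerically to obtain $P(Z > t)$ for the LT-CGF transform and compares it with the CLT standardization and the $N(0,1)$ tail.

```python
# Exact tail comparison for N = 5: P(Z > t) under the LT-CGF transform
# z = sqrt(y) versus the CLT standardization (X - N)/sqrt(2N), both computed
# from the exact chi^2_N survival function, against the N(0,1) tail.
import numpy as np
from scipy import stats
from scipy.optimize import brentq

N, t = 5, 3.0
# z > t  <=>  y = x - N(ln(x/N) + 1) > t^2 with x > N; y is increasing there.
x_t = brentq(lambda x: x - N * (np.log(x / N) + 1.0) - t**2, N + 1e-9, 1e4)
p_lt = stats.chi2(N).sf(x_t)                       # P(Z_LT-CGF > t)
p_clt = stats.chi2(N).sf(N + t * np.sqrt(2 * N))   # P((X-N)/sqrt(2N) > t)
print(f"N(0,1): {stats.norm.sf(t):.2e}  LT-CGF: {p_lt:.2e}  CLT: {p_clt:.2e}")
```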
V. DISCUSSION

The original motivation for studying the LT-CGF transformation was the observation that it was able to normalize the comparison between the GLRT statistics of different orders. In both [5] and [6], a transformation of twice the log-likelihood ratio, defined as
$$l_{G_i} = 2\ln\frac{p(\mathbf{x}; \hat{\boldsymbol\theta}_i)}{p(\mathbf{x}; \mathbf{0})} \tag{4}$$
where $p(\mathbf{x}; \boldsymbol\theta_i)$ is a PDF that depends upon an $i \times 1$ parameter vector $\boldsymbol\theta_i$ and $\hat{\boldsymbol\theta}_i$ is the maximum likelihood estimator (MLE) of that parameter, is utilized to achieve good results. Comparing the GLRT statistics directly for nested hypotheses is problematic in that the inherent maximization will always yield the statistic with the largest number of parameters. This well-known result disqualifies any model-order estimator that relies on maximizing the likelihood function; instead, a penalty factor must be appended to penalize the higher order models [1], [9], [11]. Note that the use of $p(\mathbf{x}; \mathbf{0})$ is only a normalization factor that does not change as the dimension of $\boldsymbol\theta_i$ changes and, therefore, does not change the maximizing value of $i$. The embedded exponential family (EEF) proposed in [5] effectively transforms $l_{G_i}$, which is asymptotically a $\chi_i^2$ random variable under $H_0$ (i.e., when $p(\mathbf{x}; \mathbf{0})$ is true), to a $\chi_1^2$ random variable independent of the model order $i$. This remarkable transformation is none other than the LT-CGF transformation of (1). In this way, the increasing number of degrees of freedom of $l_{G_i}$ is compensated for, with the result that the EEF model-order estimator is consistent.

For illustration, consider the problem of detecting a deterministic signal $s[n]$ with unknown values and length (duration). We denote the unknown length of the signal as $L$, so that only the signal samples $\{s[0], s[1], \ldots, s[L-1]\}$ are nonzero. Furthermore, we assume that $L$ can be any value in the range $1 \le L \le N$. Then, we can state the detection problem as the composite hypothesis test
$$\begin{aligned} H_0 &: x[n] = w[n], & n = 0, 1, \ldots, N-1 \\ H_1 &: x[n] = s_1[n] + w[n], & n = 0, 1, \ldots, N-1 \\ &\ \ \vdots \\ H_N &: x[n] = s_N[n] + w[n], & n = 0, 1, \ldots, N-1 \end{aligned}$$
where $w[n]$ is white Gaussian noise with known variance $\sigma^2$, and $s_i[n]$ is nonzero only for the samples $n = 0, 1, \ldots, i-1$. By using (4), we get $l_{G_i}(\mathbf{x}) = 2\ln\frac{p_i(\mathbf{x}; \hat{\mathbf{s}}_i)}{p_0(\mathbf{x})}$, where $\hat{\mathbf{s}}_i$ is the MLE of $s_i[n]$. Then, we use $l_{G_i}(\mathbf{x})$ in (3) to get the new test statistic. It was shown in [5] that under $H_0$ the test statistic $l_{G_i}(\mathbf{x})$ has a chi-squared PDF with $i$ degrees of freedom. Hence, by applying the transformation for each value of $i$, the test statistics are equalized to have an $N(0,1)$ PDF under $H_0$. Under $H_i$, however, the test statistic has a noncentral chi-squared PDF, and finding the exact transformation that would equalize it based on the LT-CGF is mathematically complicated. Yet the use of the transformation in (3), even though the random variables are not central chi-squared, has been shown to still work in both [5] and [6]. The transformation still penalizes $l_{G_i}(\mathbf{x})$ as the number of tested parameters increases and still produces the correct model order with a robust probability of detection. As an example, let the true signal be $s[n] = 2$ for $0 \le n \le 4$ and zero otherwise, with $N = 20$, and let the white Gaussian noise have variance $\sigma^2 = 1$. To show the performance of the detector using the transformation in (3), the receiver operating characteristics obtained by computer simulation are shown in Fig. 2; for a detailed discussion of this example, see [5]. The detection performance using the LT-CGF is seen to be superior: the GLRT will always choose the most complex test statistic, hence $L = 20$. Therefore, the transformation extends the GLRT detector and makes it more robust.
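A compact simulation of this detector is sketched below (our code, not the authors'; for this linear Gaussian model, (4) reduces to $l_{G_i}(\mathbf{x}) = \sum_{n=0}^{i-1} x^2[n]/\sigma^2$, and the function names are illustrative).

```python
# Sketch of the multi-hypothesis detector: each GLR statistic l_i is chi^2_i
# under H0, and transformation (3) equalizes them so the max over i is fair.
import numpy as np

def lt_cgf_stat(l, i):
    # Transformation (3) applied to a chi^2_i statistic l
    return np.where(l > i, l - i * (np.log(l / i) + 1.0), 0.0)

def detector(x, sigma2=1.0):
    l = np.cumsum(x**2) / sigma2          # l_i for i = 1..N (chi^2_i under H0)
    i = np.arange(1, len(x) + 1)
    return lt_cgf_stat(l, i).max()        # equalized statistic, maximized over i

rng = np.random.default_rng(2)
N, trials = 20, 5000
s = np.r_[2.0 * np.ones(5), np.zeros(N - 5)]  # true signal: s[n] = 2, 0 <= n <= 4
t0 = [detector(rng.standard_normal(N)) for _ in range(trials)]       # under H0
t1 = [detector(s + rng.standard_normal(N)) for _ in range(trials)]   # signal present
thr = np.quantile(t0, 0.99)                   # threshold for P_FA ~ 0.01
print(f"P_FA ~ 0.01, P_D ~ {np.mean(np.array(t1) > thr):.2f}")
```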
Fig. 2. Probability of detection versus probability of false alarm for the LT-CGF and GLRT.
APPENDIX A
PROPERTIES OF THE LT-CGF

In this appendix, we accumulate and prove some basic properties that are needed to prove Theorem II.1. Some of these properties are well known [2], [7] but are scattered throughout the literature and are not easily accessible, while others are new. They are the following.

P1. If $\mu = E[X]$, then $K_X^*(\mu) = 0$ and $K_X^{*\prime}(\mu) = 0$, where $K_X^{*\prime}$ denotes the first derivative with respect to $x$.

P2. If $\mu = E[X]$, then $K_X^{*\prime\prime}(\mu) = 1/\sigma^2$, where $K_X^{*\prime\prime}$ denotes the second derivative and $\sigma^2$ is the variance of $X$.

P3. The third derivative of $K_X^*(x)$ is given by
$$K_X^{*\prime\prime\prime}(x) = -\frac{K_X^{\prime\prime\prime}(\hat\eta)}{\left(K_X^{\prime\prime}(\hat\eta)\right)^3}$$
where $\hat\eta = \arg\max_\eta(\eta x - K_X(\eta))$.

P4. $\hat\eta_a(x) = \hat\eta(x/a)$ for $a > 0$, where $\hat\eta_a(x) = \arg\max_\eta(\eta x - aK_X(\eta))$.

The proofs are as follows.

P1. By definition of the LT, we have $K_X^*(x) = \max_\eta(\eta x - K_X(\eta))$, for which the maximizing value of $\eta$, termed $\hat\eta$, is obtained by simple differentiation to yield the equation $x = K_X'(\hat\eta)$, and thus the maximum value is $K_X^*(x) = \hat\eta K_X'(\hat\eta) - K_X(\hat\eta)$. Note that the solution for $\hat\eta$ is unique since $K_X(\eta)$, being a CGF, is convex over its region of convergence, and thus its first derivative is monotone increasing. Thus, we can write $\hat\eta = K_X'^{-1}(x)$, and in particular, when $x = \mu$, we have $\hat\eta = K_X'^{-1}(\mu)$, or $\mu = K_X'(\hat\eta)$. But it is well known that $K_X'(0) = \mu$, so that $\hat\eta = 0$. Finally, we have
$$K_X^*(\mu) = \hat\eta K_X'(\hat\eta) - K_X(\hat\eta) = 0 \cdot \mu - 0 = 0$$
which proves that $K_X^*(\mu) = 0$. Next, since $K_X^*(x) = \hat\eta x - K_X(\hat\eta)$, we have, upon differentiating with respect to $x$,
$$K_X^{*\prime}(x) = \hat\eta + x\frac{d\hat\eta}{dx} - K_X'(\hat\eta)\frac{d\hat\eta}{dx} = \hat\eta \tag{5}$$
since $x = K_X'(\hat\eta)$. Thus, $K_X^{*\prime}(\mu) = \hat\eta(\mu) = 0$ using the previous results.

P2. From (5), $K_X^{*\prime}(x) = \hat\eta$, and thus we have
$$K_X^{*\prime\prime}(x) = \frac{d\hat\eta}{dx} = \frac{1}{dx/d\hat\eta} = \frac{1}{K_X''(\hat\eta)}.$$
Finally, we have
$$K_X^{*\prime\prime}(\mu) = \frac{1}{K_X''(\hat\eta(\mu))} = \frac{1}{K_X''(0)} = \frac{1}{\sigma^2}.$$
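Properties P1 and P2 are easy to spot-check symbolically for the chi-squared example of Section III, where $K_X^*(x) = \frac{1}{2}[x - N\ln(x/N) - N]$ and $\mathrm{var}(X) = 2N$; a small sketch of ours, before proceeding to P3 and P4:

```python
# Symbolic spot-check of P1 and P2 using the chi^2_N LT-CGF from Section III.
import sympy as sp

x, N = sp.symbols("x N", positive=True)
Kstar = (x - N * sp.log(x / N) - N) / 2              # K*(x) for X ~ chi^2_N
print(Kstar.subs(x, N))                              # P1: K*(mu) = 0
print(sp.diff(Kstar, x).subs(x, N))                  # P1: K*'(mu) = 0
print(sp.simplify(sp.diff(Kstar, x, 2).subs(x, N) - 1 / (2 * N)))  # P2: 0
```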
P3. To find the third derivative, we first use the previous result $K_X^{*\prime\prime}(x) = 1/K_X''(\hat\eta)$ and differentiate to yield
$$K_X^{*\prime\prime\prime}(x) = -\frac{K_X'''(\hat\eta)}{\left(K_X''(\hat\eta)\right)^2}\frac{d\hat\eta}{dx} = -\frac{K_X'''(\hat\eta)}{\left(K_X''(\hat\eta)\right)^2}\frac{1}{dx/d\hat\eta} = -\frac{K_X'''(\hat\eta)}{\left(K_X''(\hat\eta)\right)^3}$$
since $dx/d\hat\eta = K_X''(\hat\eta)$.

P4.
$$\hat\eta_a(x) = \arg\max_\eta\left(\eta x - aK_X(\eta)\right) = \arg\max_\eta a\left(\eta x/a - K_X(\eta)\right) = \arg\max_\eta\left(\eta x/a - K_X(\eta)\right) = \hat\eta(x/a).$$
APPENDIX B
PROOF OF THE MAIN THEOREM

We consider the asymptotic distribution of
$$Y_N = 2K_{X_N}^*(X_N)\,u\!\left(\frac{X_N - \mu_N}{\sigma_N}\right)$$
where $X_N = \sum_{i=1}^{N} U_i$ for $U_i$'s IID, each having mean $\mu$ and variance $\sigma^2$, and where $\mu_N = E[X_N] = N\mu$ and $\sigma_N^2 = \mathrm{var}(X_N) = N\sigma^2$. Also, $u(\cdot)$ is the unit step function, so that $Y_N$ is the random variable that takes on the values
$$y = \begin{cases} 2K_{X_N}^*(x), & x > E[X_N] \\ 0, & x \le E[X_N]. \end{cases}$$
Using a Taylor expansion about $\mu_N$ together with P1 (which gives $K_{X_N}^*(\mu_N) = K_{X_N}^{*\prime}(\mu_N) = 0$) and P2, we have
$$\begin{aligned} 2K_{X_N}^*(X_N) &= K_{X_N}^{*\prime\prime}(\mu_N)(X_N - \mu_N)^2 + \frac{1}{3}K_{X_N}^{*\prime\prime\prime}(\xi_N)(X_N - \mu_N)^3 \\ &= \frac{(X_N - \mu_N)^2}{\sigma_N^2} + \frac{1}{3}K_{X_N}^{*\prime\prime\prime}(\xi_N)\,\sigma_N^3\left(\frac{X_N - \mu_N}{\sigma_N}\right)^3 \end{aligned}$$
where $\xi_N$ lies between $\mu_N$ and $X_N$. We next prove that the remainder term satisfies
$$\frac{1}{3}K_{X_N}^{*\prime\prime\prime}(\xi_N)\,\sigma_N^3\left(\frac{X_N - \mu_N}{\sigma_N}\right)^3 = O_p\!\left(\frac{1}{\sqrt{N}}\right).$$
The term $\left(\frac{X_N - \mu_N}{\sigma_N}\right)^3$ is $O_p(1)$ since $\mathrm{var}((X_N - \mu_N)/\sigma_N) = 1$ and, hence, is bounded in probability, and similarly for its cube. To prove that $W_N = K_{X_N}^{*\prime\prime\prime}(\xi_N)\,\sigma_N^3 = O_p(1/\sqrt{N})$, we use P3 and observe that $\hat\eta_N$ satisfies $K_{X_N}'(\hat\eta_N) = \xi_N$, so that, with $K_{X_N}(\eta) = NK_{X_1}(\eta)$,
$$W_N = -\frac{NK_{X_1}'''(\hat\eta_N)\,N^{3/2}\sigma^3}{N^3\left(K_{X_1}''(\hat\eta_N)\right)^3} = -\frac{1}{\sqrt{N}}\,\frac{\sigma^3 K_{X_1}'''(\hat\eta_N)}{\left(K_{X_1}''(\hat\eta_N)\right)^3}.$$
Now, we note that, by using P4 with $a = N$ and writing $\xi_N = \mu_N + \theta(X_N - \mu_N)$ for some $0 < \theta < 1$, we have
$$\hat\eta_N = \hat\eta_N(\mu_N + \theta(X_N - \mu_N)) = \hat\eta_1\!\left(\frac{\mu_N}{N} + \frac{\theta}{\sqrt{N}}\,\frac{X_N - \mu_N}{\sqrt{N}}\right) = \hat\eta_1\!\left(\mu + \frac{\theta\sigma}{\sqrt{N}}\,\frac{X_N - \mu_N}{\sqrt{N}\,\sigma}\right)$$
and clearly, since $(X_N - \mu_N)/(\sqrt{N}\sigma)$ is $O_p(1)$, the argument converges in probability to $\mu$, and so $\hat\eta_1 \xrightarrow{p} 0$; hence $\hat\eta_N \xrightarrow{p} 0$ as well. Using this, we have
$$\frac{K_{X_1}'''(\hat\eta_N)}{\left(K_{X_1}''(\hat\eta_N)\right)^3} \xrightarrow{p} \frac{K_{X_1}'''(0)}{\left(K_{X_1}''(0)\right)^3} = \frac{E[(U_1 - \mu)^3]}{(\sigma^2)^3}$$
which is finite by the assumption $E[|U_1 - \mu|^3] < \infty$. Thus, $W_N = O_p(1/\sqrt{N})$, and therefore
$$2K_{X_N}^*(X_N) = \left(\frac{X_N - \mu_N}{\sigma_N}\right)^2 + O_p\!\left(\frac{1}{\sqrt{N}}\right).$$
By the CLT, $(X_N - \mu_N)/\sigma_N$ converges in distribution to $Z \sim N(0,1)$, so that $Y_N$ converges in distribution to $(Z\,u(Z))^2$. The random variable $Z\,u(Z)$ equals zero with probability $1/2$ and has a PDF that is $N(0,1)$ for $z > 0$ and is zero for $z < 0$. Hence, $Y_N$ has an asymptotic distribution that for $y = 0$ is an impulse of strength $1/2$ (when $zu(z) = 0$) and for $y > 0$ is the distribution of a $\chi_1^2$ random variable scaled by $1/2$ (when $zu(z) > 0$).
REFERENCES

[1] H. Akaike, "A new look at the statistical model identification," IEEE Trans. Autom. Control, vol. AC-19, no. 6, pp. 716–723, Dec. 1974.
[2] O. Barndorff-Nielsen, Information and Exponential Families. New York, NY, USA: Wiley, 1978.
[3] O. Barndorff-Nielsen and D. R. Cox, Asymptotic Techniques for Use in Statistics. New York, NY, USA: Chapman & Hall, 1989.
[4] S. Kay, Intuitive Probability and Random Processes Using MATLAB. New York, NY, USA: Springer, 2006.
[5] S. Kay, "Embedded exponential families: A new approach to model order estimation," IEEE Trans. Aerosp. Electron. Syst., vol. 41, no. 1, pp. 333–345, Jan. 2005.
[6] S. Kay, "The multifamily likelihood ratio test for multiple signal model detection," IEEE Signal Process. Lett., vol. 12, no. 5, pp. 369–371, May 2005.
[7] P. McCullagh, Tensor Methods in Statistics. New York, NY, USA: Chapman & Hall, 1987.
[8] R. T. Rockafellar, Convex Analysis. Princeton, NJ, USA: Princeton Univ. Press, 1970.
[9] J. Rissanen, "Modeling by shortest data description," Automatica, vol. 14, pp. 465–478, 1978.
[10] T. A. Severini, Likelihood Methods in Statistics. New York, NY, USA: Oxford Univ. Press, 2000.
[11] P. Stoica and Y. Selen, "Model-order selection: A review of information criterion rules," IEEE Signal Process. Mag., vol. 21, no. 4, pp. 36–47, Jul. 2004.