arXiv:1110.3436v1 [math.ST] 15 Oct 2011
New Entropy Estimator with an Application to Test of Normality

Salim BOUZEBDA¹∗, Issam ELHATTAB¹†, Amor KEZIOU¹‡ and Tewfik LOUNIS²§

¹ L.S.T.A., Université Pierre et Marie Curie, 4 place Jussieu, 75252 Paris Cedex 05
² Laboratoire de Mathématiques Nicolas Oresme, CNRS UMR 6139, Université de Caen
Abstract

In the present paper we propose a new estimator of entropy based on smooth estimators of the quantile density. The consistency and asymptotic distribution of the proposed estimates are obtained. As a consequence, a new test of normality is proposed. A small power comparison is provided. A simulation study comparing, in terms of mean squared error, all the estimators under study is also performed.

AMS Subject Classification: 62F12; 62F03; 62G30; 60F17; 62E20.

Keywords: Entropy; Quantile density; Kernel estimation; Vasicek's estimator; Spacing-based estimators; Test of normality; Entropy test.
1 Introduction and estimation
Let X be a random variable [r.v.] with cumulative distribution function [cdf] F(x) := P(X ≤ x) for x ∈ R and a density function f(·) with respect to Lebesgue measure on R. Then its differential (or Shannon) entropy is defined by
$$H(X) := -\int_{\mathbb{R}} f(x)\log f(x)\,dx, \qquad (1.1)$$
∗ e-mail: [email protected] (corresponding author)
† e-mail: [email protected]
‡ e-mail: [email protected]
§ e-mail: [email protected]
where dx denotes Lebesgue measure on R. We assume that H(X) is properly defined by the integral (1.1), in the sense that
$$|H(X)| < \infty. \qquad (1.2)$$
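As a quick numerical illustration of (1.1) (a side check, not part of the original development), the entropy of the standard normal density can be computed by numerical integration and compared with the closed form log(σ√(2πe)) used later in Sections 4 and 5; a minimal R sketch:

```r
# Numerical check of (1.1) for the standard normal: H(X) = log(sigma*sqrt(2*pi*e)) with sigma = 1
f <- dnorm
H_num <- integrate(function(x) -f(x) * log(f(x)), -Inf, Inf)$value
c(numeric = H_num, closed_form = log(sqrt(2 * pi * exp(1))))   # both ~ 1.4189385
```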
The concept of differential entropy was introduced in Shannon's original paper [Shannon (1948)]. Since then, the notion of entropy has been the subject of great theoretical and applied interest. Entropy concepts and principles play a fundamental role in many applications, such as statistical communication theory [Gallager (1968)], quantization theory [Rényi (1959)], statistical decision theory [Kullback (1959)], and contingency table analysis [Gokhale and Kullback (1978)]. Csiszár (1962) introduced the concept of convergence in entropy and showed that this convergence concept implies convergence in L1. This property indicates that entropy is a useful concept for measuring "closeness in distribution", and it also heuristically justifies the use of sample entropy as a test statistic when designing entropy-based tests of goodness-of-fit. This line of research has been pursued by Vasicek (1976), Prescott (1976), Dudewicz and van der Meulen (1981), Gokhale (1983), Ebrahimi et al. (1992) and Esteban et al. (2001) [including the references therein]. The idea here is that many families of distributions are characterized by maximization of entropy subject to constraints (see, e.g., Jaynes (1957) and Lazo and Rathie (1978)). There is a huge literature on Shannon's entropy and its applications; it is not the purpose of this paper to survey it. We refer to Cover and Thomas (2006) (see their Chapter 8) for a comprehensive overview of differential entropy and its mathematical properties. Several proposals have been made in the literature to estimate entropy. Dmitriev and Tarasenko (1973) and Ahmad and Lin (1976) proposed estimators of the entropy using kernel-type estimators of the density f(·). Vasicek (1976) proposed an entropy estimator based on spacings. Inspired by the work of Vasicek (1976), several authors [van Es (1992), Correa (1995) and Wieczorkowski and Grzegorzewski (1999)] proposed modified entropy estimators, improving in some respects the properties of Vasicek's estimator (see Section 4 below). The reader will find in Beirlant et al. (1997) a detailed account of the theory as well as a survey of entropy estimators. This paper aims to introduce a new entropy estimator and to obtain its asymptotic properties. Simulations indicate that our estimator produces smaller mean squared error than the other well-known competitors considered in this work. As a consequence, we propose a new test of normality. First, we introduce some definitions and notation. For each distribution function F(·), we define the quantile function by Q(t) := inf{x : F(x) ≥ t}, 0 < t < 1. Let x_F := sup{x : F(x) = 0} and x^F := inf{x : F(x) = 1}, with −∞ ≤ x_F < x^F ≤ ∞.
We assume that the distribution function F(·) has a density f(·) (with respect to Lebesgue measure on R), and that f(x) > 0 for all x ∈ (x_F, x^F). Let q(x) := dQ(x)/dx = 1/f(Q(x)), 0 < x < 1, be the quantile density function [qdf]. The entropy H(X), defined by (1.1), can be expressed as a functional of the quantile density:
$$H(X) = \int_{[0,1]} \log\Big(\frac{d}{dx}Q(x)\Big)\,dx = \int_{[0,1]} \log q(x)\,dx. \qquad (1.3)$$
Vasicek's estimator was constructed by replacing the quantile function Q(·) in (1.3) by the empirical quantile function and using a difference operator instead of the differential one; the derivative (d/dx)Q(x) is then estimated by a function of spacings. In this work, we construct our estimator of entropy by replacing q(·) in (1.3) by an appropriate estimator q̂n(·). We shall consider the kernel-type estimator of q(·) introduced by Falk (1986) and studied by Cheng and Parzen (1997). Our choice is motivated by the good asymptotic properties of this estimator. Cheng and Parzen (1997) established the asymptotic behavior of q̂n(·) on any compact U ⊂ ]0,1[, which avoids boundary problems. Since the entropy is defined as an integral over ]0,1[ of a functional of q(·), it is not suitable to substitute q̂n(·) directly into (1.3) to estimate H(X). To circumvent the boundary effects, we proceed as follows. For small ε ∈ ]0, 1/2[, set
$$H_\varepsilon(X) := \varepsilon \log q(\varepsilon) + \varepsilon \log q(1-\varepsilon) + \int_\varepsilon^{1-\varepsilon} \log q(x)\,dx.$$
In view of (1.3), we can see that
$$|H(X) - H_\varepsilon(X)| = o\big(\eta(\varepsilon)\big), \qquad (1.4)$$
where η(ε) → 0 as ε → 0. Choosing ε close to zero guarantees the closeness of Hε(X) to H(X), so the problem of estimating H(X) reduces to that of estimating Hε(X). Given an independent and identically distributed [i.i.d.] sample X1, ..., Xn and ε ∈ ]0, 1/2[, an estimator of Hε(X) can be defined as
$$\widehat{H}_{\varepsilon;n}(X) = \varepsilon \log\big(\widehat{q}_n(\varepsilon)\big) + \varepsilon \log\big(\widehat{q}_n(1-\varepsilon)\big) + \int_\varepsilon^{1-\varepsilon} \log \widehat{q}_n(x)\,dx, \qquad (1.5)$$
where the estimator q̂n(·) of the qdf q(·) is defined as follows. Let X1;n ≤ ··· ≤ Xn;n denote the order statistics of X1, ..., Xn. The empirical quantile function Qn(·) based upon these random variables is given by
$$Q_n(t) := X_{k;n}, \qquad (k-1)/n < t \le k/n, \quad k = 1, \dots, n. \qquad (1.6)$$
Let {Kn(t, x), (t, x) ∈ [0,1] × ]0,1[}_{n≥1} be a sequence of kernels and {μn(·)}_{n≥1} a sequence of σ-finite measures on [0,1]. A smoothed version of Qn(·) (see, e.g., Cheng and Parzen (1997)) can be defined as
$$\widehat{Q}_n(t) := \int_0^1 Q_n(x) K_n(t,x)\,d\mu_n(x), \qquad t \in\, ]0,1[.$$
Finally, we estimate q(·) by
$$\widehat{q}_n(t) := \frac{d}{dt}\widehat{Q}_n(t) = \frac{d}{dt}\int_0^1 Q_n(x) K_n(t,x)\,d\mu_n(x), \qquad t \in\, ]0,1[. \qquad (1.7)$$
Clearly, in order to obtain a meaningful qdf estimator in this way, the sequence of kernels Kn(·,·) must satisfy certain differentiability conditions and, together with the sequence of measures μn(·), certain variational conditions. These conditions are detailed in the next section. A familiar example is the convolution-kernel estimator
$$\widehat{q}_n(t) := \frac{d}{dt}\int_0^1 h_n^{-1}\, Q_n(x)\, K\Big(\frac{t-x}{h_n}\Big)\,dx, \qquad t \in\, ]0,1[,$$
where K(·) denotes a kernel function, namely a measurable function integrating to 1 on R with bounded derivative, and {hn}_{n≥1} is a sequence of positive reals fulfilling hn → 0 and nhn → ∞ as n → ∞. In this case, Csörgő and Révész (1984) and Csörgő et al. (1991) define
$$\bar{q}_n(t) := h_n^{-1}\int_{1/(n+1)}^{n/(n+1)} K\Big(\frac{t-x}{h_n}\Big)\,dQ_n(x) = h_n^{-1}\sum_{i=1}^{n-1} K\Big(\frac{t - i/n}{h_n}\Big)\big(X_{i+1;n} - X_{i;n}\big), \qquad t \in [0,1].$$
Calculations using summation by parts show that, for all t ∈ ]0,1[,
$$\widehat{q}_n(t) = \bar{q}_n(t) + h_n^{-1}\Big[K\Big(\frac{t-1}{h_n}\Big) X_{n;n} - K\Big(\frac{t}{h_n}\Big) X_{1;n}\Big].$$
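To fix ideas, here is a minimal R sketch (an assumed implementation, not the authors' code) of the convolution-kernel quantile density estimator q̄n(·) with a Gaussian kernel K, together with the entropy estimate (1.5) obtained by evaluating the integral over [ε, 1−ε] on a regular grid; q̄n is used in place of q̂n, i.e. the boundary-correction term above is ignored, and h plays the role of hn.

```r
# Sketch: convolution-kernel quantile density estimator with Gaussian kernel K = dnorm,
# i.e. qbar_n(t) = h^{-1} * sum_i K((t - i/n)/h) * (X_{(i+1)} - X_{(i)})
qdf_hat <- function(t, x, h) {
  xs <- sort(x)
  n  <- length(xs)
  i  <- 1:(n - 1)
  sapply(t, function(u) sum(dnorm((u - i / n) / h) * diff(xs)) / h)
}

# Entropy estimate (1.5), with the integral over [eps, 1 - eps] evaluated on a grid
entropy_hat <- function(x, h, eps = 0.01, grid = 200) {
  u <- seq(eps, 1 - eps, length.out = grid)
  q <- qdf_hat(u, x, h)
  eps * log(q[1]) + eps * log(q[grid]) + (1 - 2 * eps) * mean(log(q))
}

set.seed(1)
entropy_hat(rnorm(50), h = 0.0333)   # target: log(sqrt(2*pi*e)) ~ 1.4189
```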
In the sequel, we shall consider the general family of qdf estimators defined in (1.7). The remainder of the present article is organized as follows. The consistency and asymptotic normality of our estimator are discussed in the next section. Our arguments used to establish the asymptotic normality make use of an original application of the invariance principle for the empirical quantile process. In Section 3, we briefly discuss the smoothed version of Parzen's entropy estimator. In Section 4, we investigate the finite-sample performance of the newly proposed estimator and compare it with the performances of existing estimators. In Section 5, a new test of normality is proposed and compared with other competitors. Some concluding remarks and possible future developments are mentioned in Section 6. To avoid interrupting the flow of the presentation, all mathematical developments are relegated to Section 7.
2 Main results
Throughout, U(ε) := [ε, 1−ε], where ε ∈ ]0, 1/2[ is arbitrarily fixed (free of the sample size n). The following conditions are used to establish the main results. We recall the notation of Section 1.

(Q.1) The quantile density function q(·) is twice differentiable in ]0,1[. Moreover, 0 < min{q(0), q(1)} ≤ ∞;

(Q.2) There exists a constant ς > 0 such that
$$\sup_{t\in\,]0,1[} t(1-t)\,|J(t)| \le \varsigma,$$
where J(t) := d(log q(t))/dt is the score function;

(Q.3) Either q(0) < ∞ or q(·) is nonincreasing in some interval (0, t_*), and either q(1) < ∞ or q(·) is nondecreasing in some interval (t^*, 1), where 0 < t_* < t^* < 1.

We will make the following assumptions on the sequence of kernels Kn(·,·).

(K.1) For each n ≥ 1 and each (t, x) ∈ U(ε) × ]0,1[, Kn(t, x) ≥ 0, and for each t ∈ U(ε), ∫_0^1 Kn(t, x) dμn(x) = 1;

(K.2) There is a sequence δn ↓ 0 such that
$$R_n := \sup_{t\in U(\varepsilon)} \Big(1 - \int_{t-\delta_n}^{t+\delta_n} K_n(t,x)\,d\mu_n(x)\Big) \to 0, \qquad \text{as } n \to \infty;$$

(K.3) For any function g(·) that is at least three times differentiable in ]0,1[,
$$\widehat{g}_n(t) := \int_0^1 g(x) K_n(t,x)\,d\mu_n(x)$$
is differentiable in t on U(ε), and
$$\sup_{t\in U(\varepsilon)} \Big| g(t) - \int_0^1 g(x)K_n(t,x)\,d\mu_n(x)\Big| = O\big(n^{-\alpha}\big), \qquad \alpha > 0;$$
$$\sup_{t\in U(\varepsilon)} \Big| g'(t) - \frac{d}{dt}\int_0^1 g(x)K_n(t,x)\,d\mu_n(x)\Big| = O\big(n^{-\beta}\big), \qquad \beta > 0.$$
Note that conditions (Q.1)–(Q.3) and (K.1)–(K.3), which we use to establish the convergence in probability of Ĥε;n(X), match those used by Cheng and Parzen (1997) to establish the asymptotic properties of q̂n(·).
Our result concerning the consistency of Ĥε;n(X) defined in (1.5), where the qdf estimator is given in (1.7), is as follows.
Theorem 2.1 Let X1, ..., Xn be i.i.d. r.v.'s with a quantile density function q(·) fulfilling (Q.1)–(Q.3). Let Kn(·,·) fulfill (K.1)–(K.3). Then we have
$$\big|\widehat{H}_{\varepsilon;n}(X) - H(X)\big| = O_P\Big( n^{-1/2} M(q;K_n) + n^{-\beta} + \eta(\varepsilon)\Big), \qquad (2.1)$$
where
$$M(q;K_n) := M_q \widehat{M}_n(1)\sqrt{\delta_n \log \delta_n^{-1}} + \big(M_{q'} + \widehat{M}_n(q^2)\big) R_n'(1) + n^{-1/2} A_\gamma(n)\, M_q \widehat{M}_n(1),$$
$$\widehat{M}_n(g) := \sup_{u\in U(\varepsilon)} \int_0^1 |g(x) K_n(u,x)|\,d\mu_n(x), \qquad R_n'(g) := \sup_{u\in U(\varepsilon)} \int_{[0,1]\setminus U(\varepsilon+\delta_n)} |g(x) K_n(u,x)|\,d\mu_n(x),$$
$$M_g := \sup_{u\in U(\varepsilon)} |g(u)|,$$
and δn is given in condition (K.2). The proof of Theorem 2.1 is postponed to Section 7.
To establish the asymptotic normality of Ĥε;n(X), we will use the following additional condition.

(Q.4) E{log² q(F(X))} < ∞.

Let, for all ε ∈ ]0, 1/2[,
$$\mathbb{E}\big\{\log^2 q(F(X))\big\} - \Big( \varepsilon \log^2 q(\varepsilon) + \varepsilon \log^2 q(1-\varepsilon) + \int_\varepsilon^{1-\varepsilon} \log^2 q(x)\,dx \Big) = o\big(\vartheta(\varepsilon)\big),$$
where ϑ(ε) → 0 as ε → 0. The main result concerning the normality of Ĥε;n(X), to be proved here, may now be stated precisely as follows.

Theorem 2.2 Assume that conditions (Q.1)–(Q.4) and (K.1)–(K.3) hold with α > 1/2 and β > 1/2 in (K.3). Then we have
$$\sqrt{n}\big(\widehat{H}_{\varepsilon;n}(X) - H_\varepsilon(X)\big) - \psi_n(\varepsilon) = O_P\Big(\big\{2\varepsilon \log \varepsilon^{-1}\big\}^{1/2}\Big) + o_P(1), \qquad (2.2)$$
where
$$\psi_n(\varepsilon) := \int_{U(\varepsilon)} \big(q'(x)/q(x)\big) B_n(x)\,dx$$
is a centered Gaussian random variable with variance equal to
$$\mathrm{Var}\big(\psi_n(\varepsilon)\big) = \mathrm{Var}\big\{\log q(F(X))\big\} + o\big(\vartheta(\varepsilon) + \eta(\varepsilon)\big).$$
The proof of Theorem 2.2 is postponed to Section 7.

Remark 2.3 Condition (Q.4) is extremely weak and is satisfied by all commonly encountered distributions, including many important heavy-tailed distributions for which the moments do not exist; the interested reader may refer to Song (2000) and the references therein.

Remark 2.4 We mention that condition (K.3) is satisfied, with α, β > 1/2, for the difference kernels Kn(t, x) dμn(x) = h_n^{-1} k((t − x)/hn) dx with hn = O(n^{-ν}), where 1/4 < ν < 1. A typical example of a difference kernel satisfying these conditions is the Gaussian kernel.
3 The smoothed Parzen estimator of entropy
We mention that the notation of this section is similar to that used in Parzen (1979); changes have been made in order to adapt it to our setting. Let X1, ..., Xn be a random sample of a continuous random variable X with distribution function F(·). In this section we work under the following hypothesis: for all x ∈ R,
$$H^*: \quad F(x) = F_0\Big(\frac{x-\mu}{\sigma}\Big),$$
where μ and σ are parameters to be estimated. It is convenient to transform the Xi to
$$\widehat{U}_i = F_0\Big(\frac{X_i - \widehat{\mu}_n}{\widehat{\sigma}_n}\Big),$$
where μ̂n and σ̂n are efficient estimators of μ and σ. It is more convenient to base tests of the hypothesis on the deviations from the identity function t of the empirical distribution function of Û1, ..., Ûn. Let D̃n,0(t), 0 ≤ t ≤ 1, denote the empirical quantile function of Û1, ..., Ûn; it can be expressed in terms of the sample quantile function Q̃n(·) of the original data X1, ..., Xn by
$$\widetilde{D}_{n,0}(t) := F_0\Big(\frac{\widetilde{Q}_n(t) - \widehat{\mu}_n}{\widehat{\sigma}_n}\Big),$$
where
$$\widetilde{Q}_n(t) = n\Big(\frac{i}{n} - t\Big) X_{i-1;n} + n\Big(t - \frac{i-1}{n}\Big) X_{i;n}, \qquad \frac{i-1}{n} \le t \le \frac{i}{n}, \quad i = 1, \dots, n.$$
In our case, we may consider the smoothed version of the quantile density given in Parzen (1979), defined by
$$\widetilde{d}_{n,0}(t) := \frac{d}{dt}\widetilde{D}_{n,0}(t) = f_0\Big(\frac{\widetilde{Q}_n(t) - \widehat{\mu}_n}{\widehat{\sigma}_n}\Big)\,\frac{1}{\widehat{\sigma}_n}\,\frac{d}{dt}\widetilde{Q}_n(t).$$
Consequently, define
$$\widetilde{d}_n(t) := f_0(Q_0(t))\,\frac{d}{dt}\widetilde{Q}_n(t)\,\frac{1}{\widehat{\sigma}_{0,n}},$$
where
$$\widehat{\sigma}_{0,n} := \int_0^1 f_0(Q_0(t))\,\frac{d}{dt}\widetilde{Q}_n(t)\,dt.$$
This is an alternative approach to estimating q(·) when one desires a goodness-of-fit test of a location-scale parametric model; we refer to Parzen (1991) for more details. We mention that in (Parzen, 1979, Section 7) the smoothed version of d̃n(t) is defined, for the convolution kernel, by
$$\widetilde{d}_n(t) := \int_0^1 \widetilde{d}_n(u)\,\frac{1}{h_n} K\Big(\frac{t-u}{h_n}\Big)\,du.$$
This suggests the following estimator of entropy:
$$H_n(X) := \int_0^1 \log\big(\widetilde{d}_n(t)\big)\,dt. \qquad (3.1)$$
Under conditions similar to those of Theorem 2.1, we could derive, under H∗, that, as n → ∞,
$$|H_n(X) - H(X)| = o_P(1). \qquad (3.2)$$
For more details on goodness-of-fit tests in connection with Shannon's entropy, the interested reader may refer to Parzen (1991). Note that normality is an example of a location-scale parametric model.

Remark 3.1 Recall that there exist several ways to obtain smoothed versions of d̃n(·). Indeed, we can choose the following smoothing method. Keep in mind the definition
$$\widehat{q}_n(t) = \frac{d}{dt}\widehat{Q}_n(t) = \frac{d}{dt}\int_0^1 Q_n(x) K_n(t,x)\,d\mu_n(x), \qquad t \in\, ]0,1[.$$
Then we can use the estimator
$$\widetilde{d}_n^{\,*}(t) := f_0(Q_0(t))\,\widehat{q}_n(t)\,\frac{1}{\widehat{\sigma}_{0,n}}.$$
Finally, we estimate the entropy under H∗ by
$$H_n^*(X) := \int_0^1 \log\big(\widetilde{d}_n^{\,*}(t)\big)\,dt. \qquad (3.3)$$
Under the conditions of Theorem 2.1, we could derive, under H∗, that, as n → ∞,
$$|H_n^*(X) - H(X)| = o_P(1). \qquad (3.4)$$
It would be interesting to provide a complete investigation of Hn(X) and Hn*(X) and to study tests of normality based on them. This would go well beyond the scope of the present paper.

Remark 3.2 The nonparametric approach that we use to estimate Shannon's entropy treats the density as parameter-free and thus offers the greatest generality. This is in contrast with the parametric approach, where one assumes that the underlying distribution follows a certain parametric model and the problem reduces to estimating a finite number of parameters describing the model.
4 A comparison study by simulations
Assuming that X1, ..., Xn is the sample, the estimator proposed by Vasicek (1976) is given by
$$H_{m,n}^{(V)} := \frac{1}{n}\sum_{i=1}^n \log\Big(\frac{n}{2m}\big\{X_{i+m;n} - X_{i-m;n}\big\}\Big),$$
where m is a positive integer fulfilling m ≤ n/2. Vasicek proved that his estimator is consistent, i.e.,
$$H_{m,n}^{(V)} \overset{P}{\longrightarrow} H(X), \qquad \text{as } n \to \infty,\ m \to \infty \text{ and } m/n \to 0.$$
Vasicek's estimator is also asymptotically normal under appropriate conditions on m and n. van Es (1992) proposed a new estimator of entropy given by
$$H_{m,n}^{(Van)} := \frac{1}{n-m}\sum_{i=1}^{n-m} \log\Big(\frac{n+1}{m}\big\{X_{i+m;n} - X_{i;n}\big\}\Big) + \sum_{k=m}^{n} \frac{1}{k} + \log(m) - \log(n+1).$$
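For reference, a direct R transcription of these two spacing-based estimators might look as follows (a sketch; the truncation of indices at 1 and n corresponds to the boundary convention X_{i;n} = X_{1;n} for i < 1 and X_{i;n} = X_{n;n} for i > n stated below for Correa's estimator).

```r
# Vasicek (1976) spacing-based entropy estimator
vasicek <- function(x, m) {
  xs <- sort(x); n <- length(xs)
  up <- xs[pmin(1:n + m, n)]   # X_{i+m;n}, truncated at the sample maximum
  lo <- xs[pmax(1:n - m, 1)]   # X_{i-m;n}, truncated at the sample minimum
  mean(log(n * (up - lo) / (2 * m)))
}

# van Es (1992) estimator
vanes <- function(x, m) {
  xs <- sort(x); n <- length(xs)
  i <- 1:(n - m)
  mean(log((n + 1) / m * (xs[i + m] - xs[i]))) + sum(1 / (m:n)) + log(m) - log(n + 1)
}
```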
van Es (1992) established, under suitable conditions, the consistency and asymptotic normality of this estimator. Correa (1995) suggested a modification of Vasicek's estimator. In order to estimate the density f(·) of F(·) in the interval (X_{i−m;n}, X_{i+m;n}), he used a local linear model based on 2m + 1 points:
$$F(X_{j;n}) = \alpha + \beta X_{j;n}, \qquad j = i-m, \dots, i+m.$$
This produces an estimator with smaller mean squared error. Correa's estimator is defined by
$$H_{m,n}^{(Cor)} := -\frac{1}{n}\sum_{i=1}^{n-m} \log\Bigg(\frac{\sum_{j=i-m}^{i+m}\big(X_{j;n} - \bar{X}^{(i)}\big)(j-i)}{n \sum_{j=i-m}^{i+m}\big(X_{j;n} - \bar{X}^{(i)}\big)^2}\Bigg),$$
where
$$\bar{X}^{(i)} := \frac{1}{2m+1}\sum_{j=i-m}^{i+m} X_{j;n}.$$
Here, m ≤ n/2 is a positive integer, X_{i;n} = X_{1;n} for i < 1 and X_{i;n} = X_{n;n} for i > n. By simulations, Correa showed that, for some selected distributions, his estimator produces smaller mean squared error than Vasicek's estimator and van Es' estimator. Wieczorkowski and Grzegorzewski (1999) modified Vasicek's estimator by adding a bias correction. Their estimator is given by
$$H_{m,n}^{(WG)} := H_{m,n}^{(V)} - \log n + \log(2m) - \Big(1 - \frac{2m}{n}\Big)\Psi(2m) + \Psi(n+1) - \frac{2}{n}\sum_{i=1}^{m}\Psi(i+m-1),$$
where Ψ(x) is the digamma function defined by (see, e.g., Gradshteyn and Ryzhik (2007))
$$\Psi(x) := \frac{d}{dx}\log\Gamma(x) = \frac{\Gamma'(x)}{\Gamma(x)}.$$
For integer arguments, we have
$$\Psi(k) = \sum_{i=1}^{k-1}\frac{1}{i} - \gamma,$$
where γ = 0.57721566490... is Euler's constant.
A series of experiments was conducted in order to compare the performance of our estimator, in terms of efficiency and robustness, with the following entropy estimators: Vasicek's estimator, van Es' estimator, Correa's estimator and the Wieczorkowski–Grzegorzewski estimator. We provide numerical illustrations regarding the mean squared error of each of the above estimators of the entropy H(X). The computing program codes were implemented in R. We have considered the following distributions; for each distribution, we give the value of H(X).

• Standard normal distribution N(0, 1): H(X) = log(σ√(2πe)) = log(√(2πe)).

• Uniform distribution: H(X) = 0.

• Weibull distribution with shape parameter 2 and scale parameter 0.5: H(X) = −0.09768653. Recall that the entropy of the Weibull distribution, with shape parameter k > 0 and scale parameter λ > 0, is given by
$$H(X) = \gamma\Big(1 - \frac{1}{k}\Big) + \log\Big(\frac{\lambda}{k}\Big) + 1.$$

• Exponential distribution with parameter λ = 1: H(X) = 1 − log(λ) = 1.

• Student's t-distribution with 1, 3 and 5 degrees of freedom:
H(X) = 2.53102425 (1 degree of freedom, i.e. the Cauchy distribution),
H(X) = 1.77347757 (3 degrees of freedom),
H(X) = 1.62750267 (5 degrees of freedom).
Recall that the entropy of Student's t-distribution with ν degrees of freedom is given by
$$H(X) = \frac{\nu+1}{2}\Big(\Psi\Big(\frac{1+\nu}{2}\Big) - \Psi\Big(\frac{\nu}{2}\Big)\Big) + \log\Big(\sqrt{\nu}\,\mathrm{Beta}\Big(\frac{\nu}{2},\frac{1}{2}\Big)\Big),$$
where Ψ(·) is the digamma function.

For sample sizes n = 10, n = 20 and n = 50, 5000 samples were generated. All the considered spacing-based estimators depend on the parameter m ≤ n/2; the optimal choice of m for a given sample size n is still an open problem. Here, we have chosen the parameter m according to Correa (1995): m = 3 for n = 10 and n = 20, and m = 4 for n = 50. For our estimator Ĥε;n(X), we have used the standard Gaussian kernel, and the bandwidth was chosen by numerical optimization of the MSE of Ĥε;n(X) with respect to hn. In Tables 1, 2 and 3 we consider the normal distribution N(0, 1) with sample sizes n = 10, n = 20 and n = 50. In Tables 4–9, we consider the uniform distribution, the Weibull distribution, the exponential distribution with parameter 1, the Student t-distribution with 3 and 5 degrees of freedom, and the Cauchy distribution, respectively, with sample size n = 50. In all cases, and for each estimator considered, we compute the bias, variance and MSE by Monte Carlo over the 5000 replications. From Tables 1–9, we can see that our estimator works better than all the others, in the sense that its MSE is smaller. We think that the simulation results may be substantially improved by choosing, for example, the Beta or Gamma kernel, which has the advantage of taking the boundary effects into account. It appears that our estimator, based on the smoothed quantile density estimate, behaves better than the traditional ones in terms of efficiency.
Estimator          Estimate      Bias           Variance      MSE
H^{(V)}_{m,n}      0.85912656    -0.55981198    0.07087153    0.38419010
H^{(Van)}_{m,n}    1.18654642    -0.23239211    0.08296761    0.13689074
H^{(Cor)}_{m,n}    1.03589137    -0.38304716    0.07197488    0.21862803
H^{(WG)}_{m,n}     1.28060252    -0.13833601    0.07087153    0.08993751
Ĥ_{ε;n}(X)         1.33450763    -0.08443091    0.07002134    0.07707990

Table 1: Results for n = 10, m = 3, hn = 0.157, Normal distribution N(0, 1)

Estimator          Estimate      Bias           Variance      MSE
H^{(V)}_{m,n}      1.10631536    -0.31262317    0.03182798    0.12952939
H^{(Van)}_{m,n}    1.24420922    -0.17472931    0.03532923    0.06582424
H^{(Cor)}_{m,n}    1.24195337    -0.17698517    0.03230411    0.06359555
H^{(WG)}_{m,n}     1.36008222    -0.05885632    0.03182798    0.03526021
Ĥ_{ε;n}(X)         1.43214893     0.01321040    0.02920368    0.02934899

Table 2: Results for n = 20, m = 3, hn = 0.081, Normal distribution N(0, 1)

Estimator          Estimate      Bias           Variance      MSE
H^{(V)}_{m,n}      1.26025240    -0.15868613    0.01126393    0.03643395
H^{(Van)}_{m,n}    1.29349009    -0.12544844    0.01210094    0.02782615
H^{(Cor)}_{m,n}    1.35839575    -0.06054278    0.01148899    0.01514293
H^{(WG)}_{m,n}     1.40287628    -0.01606226    0.01126393    0.01151066
Ĥ_{ε;n}(X)         1.42053167     0.00159314    0.01085020    0.01084189

Table 3: Results for n = 50, m = 4, hn = 0.0333, Normal distribution N(0, 1)

Estimator          Estimate        Bias            Variance       MSE
H^{(V)}_{m,n}      -0.142936202    -0.142936202    0.001730435    0.022161020
H^{(Van)}_{m,n}    -0.000811817    -0.000811817    0.003428982    0.003429298
H^{(Cor)}_{m,n}    -0.048455430    -0.048455430    0.001780981    0.004128732
H^{(WG)}_{m,n}     -0.0003123278   -0.0003123278   0.0017304350   0.0017303596
Ĥ_{ε;n}(X)         -0.001503714    -0.001503714    0.001005806    0.001007967

Table 4: Results for n = 50, m = 4, hn = 0.522, Uniform distribution
We now turn to comparing the robustness of the above estimators of the entropy H(X). Robustness here is understood with respect to contamination, and not in terms of influence functions or breakdown points, which make sense only under parametric or semiparametric settings.
Estimator          Estimate        Bias            Variance       MSE
H^{(V)}_{m,n}      -0.260890919    -0.163204390    0.009863565    0.036497265
H^{(Van)}_{m,n}    -0.20651205     -0.10882552     0.01139025     0.02323097
H^{(Cor)}_{m,n}    -0.163758480    -0.066071951    0.009953612    0.014317124
H^{(WG)}_{m,n}     -0.118267044    -0.020580515    0.009863565    0.010285150
Ĥ_{ε;n}(X)         -0.08547838      0.01220815     0.01945246     0.01959761

Table 5: Results for n = 50, m = 4, hn = 0.6104, Weibull distribution with the shape parameter equal to 2 and the scale parameter equal to 0.5

Estimator          Estimate       Bias            Variance       MSE
H^{(V)}_{m,n}      0.85568859     -0.14431141     0.02233983     0.04316114
H^{(Van)}_{m,n}    0.92199497     -0.07800503     0.02313389     0.02921405
H^{(Cor)}_{m,n}    0.95566152     -0.04433848     0.02266589     0.02462726
H^{(WG)}_{m,n}     0.998312466    -0.001687534    0.022339826    0.022338205
Ĥ_{ε;n}(X)         1.31010002      0.31010002     0.06795533     0.16410376

Table 6: Results for n = 50, m = 4, hn = 0.712, Exponential distribution with parameter 1

Estimator          Estimate       Bias            Variance       MSE
H^{(V)}_{m,n}      1.63721785     -0.13625972     0.02903662     0.04759752
H^{(Van)}_{m,n}    1.58175430     -0.19172327     0.02360710     0.06036019
H^{(Cor)}_{m,n}    1.74658234     -0.02689523     0.03048823     0.03120548
H^{(WG)}_{m,n}     1.779841726     0.006364154    0.029036620    0.029071315
Ĥ_{ε;n}(X)         1.75882993     -0.01464764     0.02592497     0.02613434

Table 7: Results for n = 50, m = 4, hn = 0.0336, Student's t-distribution with the number of degrees of freedom equal to 3

Estimator          Estimate       Bias            Variance       MSE
H^{(V)}_{m,n}      1.48096844     -0.14653424     0.02047265     0.04194084
H^{(Van)}_{m,n}    1.46376243     -0.16374024     0.01830929     0.04511650
H^{(Cor)}_{m,n}    1.58556888     -0.04193379     0.02135329     0.02310746
H^{(WG)}_{m,n}     1.623592312    -0.003910361    0.020472653    0.020483849
Ĥ_{ε;n}(X)         1.621348438    -0.006154234    0.018522682    0.018556852

Table 8: Results for n = 50, m = 4, hn = 0.344, Student's t-distribution with the number of degrees of freedom equal to 5

Estimator          Estimate      Bias           Variance      MSE
H^{(V)}_{m,n}      2.51810441    -0.01291983    0.09321374    0.09336202
H^{(Van)}_{m,n}    2.24258072    -0.28844353    0.05651327    0.13970163
H^{(Cor)}_{m,n}    2.65247722     0.12145298    0.09936347    0.11409442
H^{(WG)}_{m,n}     2.66072829     0.12970404    0.09321374    0.11001824
Ĥ_{ε;n}(X)         2.49016605    -0.04085819    0.07746897    0.07912286

Table 9: Results for n = 50, m = 4, hn = 0.0235, Cauchy distribution

According to Huber and Ronchetti (2009): "Most approaches to robustness are based on the following intuitive requirement: A discordant small minority should never be able to override the evidence of the majority of the observations." In the same reference, it is mentioned that a resistant statistical procedure, i.e., one whose value is insensitive to small changes in the underlying sample, is equivalent to robustness for practical purposes in view of Hampel's theorem; refer to (Huber and Ronchetti, 2009, Sections 1.2.3 and 2.6). Typical examples of the notion of robustness in the nonparametric setting are the sample mean and the sample median, which are the nonparametric estimates of the population mean and median, respectively. Although nonparametric, the sample mean is highly sensitive to outliers; therefore, for symmetric distributions and contaminated data, the sample median is more appropriate for estimating the population mean or median; refer to Huber and Ronchetti (2009) for more details. In our simulations, we consider data generated from the N(0, 1) distribution where a "small" proportion ǫ of observations is replaced by atypical ones generated from a contaminating distribution F∗(·). We consider two cases, ǫ = 4% and ǫ = 10%, and we choose the contaminating distribution F∗(·) to be the uniform distribution on [0, 1]; the results are presented in Table 10 for ǫ = 4% and in Table 11 for ǫ = 10%. Let f(·) denote the density function of N(0, 1) and f∗(·) the density of the contaminating distribution F∗(·). The contaminated sample X1, ..., Xn can be seen as if it had been generated from the density fǫ(·) := (1 − ǫ)f(·) + ǫf∗(·). Since the sample is contaminated, all the above estimators tend to H(fǫ) and not to H(f). The objective here is to obtain the best (in the sense that the corresponding MSE is the smallest) estimate of the entropy H(f) from the contaminated data X1, ..., Xn. We consider the same five estimates as above and compute their bias, variance and MSE by Monte Carlo using 5000 replications. From Tables 10–11, we can see that our estimator is the best one.
Estimator          Estimate      Bias           Variance      MSE
H^{(V)}_{m,n}      1.24299668    -0.17594185    0.01115852    0.04210290
H^{(Van)}_{m,n}    1.27422798    -0.14471055    0.01143207    0.03236179
H^{(Cor)}_{m,n}    1.34170579    -0.07723274    0.01160997    0.01756326
H^{(WG)}_{m,n}     1.38562055    -0.03331798    0.01115852    0.01225745
Ĥ_{ε;n}(X)         1.40066144    -0.01827709    0.01054644    0.01086995

Table 10: Results for n = 50, m = 4, hn = 0.0333, ǫ = 4%, Normal distribution N(0, 1)

Estimator          Estimate      Bias           Variance      MSE
H^{(V)}_{m,n}      1.22256006    -0.19637848    0.01245586    0.05100791
H^{(Van)}_{m,n}    1.24540126    -0.17353727    0.01355086    0.04365249
H^{(Cor)}_{m,n}    1.32178766    -0.09715087    0.01243944    0.02186529
H^{(WG)}_{m,n}     1.36518393    -0.05375460    0.01245586    0.01533296
Ĥ_{ε;n}(X)         1.37585099    -0.04308754    0.01219767    0.01404201

Table 11: Results for n = 50, m = 4, hn = 0.0333, ǫ = 10%, Normal distribution N(0, 1)
Remark 4.1 To understand well the behavior of the proposed estimator, it will be interesting to consider several families of distributions:

1. Distributions with support (−∞, ∞) and symmetric.
2. Distributions with support (−∞, ∞) and asymmetric.
3. Distributions with support (0, ∞).
4. Distributions with support ]0, 1[.

Extensive simulations for these families will be undertaken elsewhere. In the following remark, we indicate how to choose the smoothing parameter in practice.

Remark 4.2 The choice of the smoothing parameter plays an instrumental role in practical estimation. We recall that the smoothing parameter hn, in our simulations, has been chosen to minimize the MSE of the estimator, assuming that the underlying distribution is known. In the more general case, without assuming any knowledge of the underlying distribution function, one can use, among others, the selection procedure proposed in Jones (1992). Jones (1992) derived the asymptotic MSE of q̂n(·), in the case of the convolution-kernel estimator, which is given by
$$\mathrm{AMSE}\big(\widehat{q}_n(t)\big) = \frac{h_n^4}{4}\,\big(q''(t)\big)^2\Big(\int_{\mathbb{R}} x^2 K(x)\,dx\Big)^2 + \frac{1}{nh_n}\,q^2(t)\int_{\mathbb{R}} K^2(x)\,dx.$$
Minimizing this expression with respect to hn, we find the asymptotically optimal bandwidth for q̂n(·):
$$h_n^{opt} = \Bigg\{\frac{q^2(t)\int_{\mathbb{R}} K^2(x)\,dx}{n\,\big(q''(t)\big)^2\big(\int_{\mathbb{R}} x^2 K(x)\,dx\big)^2}\Bigg\}^{1/5}.$$
Note that h_n^{opt} depends on the unknown functions
$$q(t) = \frac{1}{f(Q(t))} \qquad \text{and} \qquad q''(t) = \frac{f''(Q(t))f(Q(t)) - 3\,[f'(Q(t))]^2}{f^5(Q(t))}.$$
These functions may be estimated in the classical way; refer to Jones (1992), Silverman (1986) and Cheng and Sun (2006) for more details on the subject and the references therein. Another way to estimate the optimal value of hn is to use a cross-validation type method.

Remark 4.3 The main problem in using entropy estimates such as (1.5) is to choose the smoothing parameter hn properly. With a lot more effort, we could derive analogous results for Ĥε;n(X) using the methods in Bouzebda and Elhattab (2009, 2010, 2011), as well as the modern empirical process tools developed by Einmahl and Mason (2005) in their work on uniform in bandwidth consistency of kernel-type estimators.
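As a rough illustration of the plug-in rule of Remark 4.2 (an assumed sketch, not the authors' procedure), one may evaluate h_n^{opt} at a reference point t with a normal pilot density whose mean and standard deviation are estimated from the data; for the Gaussian kernel, ∫K²(x)dx = 1/(2√π) and ∫x²K(x)dx = 1.

```r
# Plug-in bandwidth of Remark 4.2 at a point t, with a normal pilot density for f
# (a crude reference rule; q and q'' could instead be estimated nonparametrically)
h_opt_normal_pilot <- function(x, t = 0.5) {
  n  <- length(x); mu <- mean(x); s <- sd(x)
  Qt <- qnorm(t, mu, s)
  f  <- dnorm(Qt, mu, s)
  f1 <- -(Qt - mu) / s^2 * f                   # f'(Q(t)) for the normal pilot
  f2 <- ((Qt - mu)^2 / s^4 - 1 / s^2) * f      # f''(Q(t)) for the normal pilot
  q   <- 1 / f                                 # q(t)
  q2  <- (3 * f1^2 - f2 * f) / f^5             # q''(t)
  RK  <- 1 / (2 * sqrt(pi))                    # int K^2(x) dx for the Gaussian kernel
  mu2 <- 1                                     # int x^2 K(x) dx for the Gaussian kernel
  (q^2 * RK / (n * q2^2 * mu2^2))^(1 / 5)
}
```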
5 Test for normality
A well-known theorem of information theory [see, e.g., Shannon (1948), p. 55] states that among all distributions that possess a density function f(·) and have a given variance σ², the entropy H(X) is maximized by the normal distribution. The entropy of the normal distribution with variance σ² is log(σ√(2πe)). As pointed out by Vasicek (1976), this property can be used for tests of normality. Towards this aim, one can use the estimate Ĥε;n(X) of H(X) as follows. Let X1, ..., Xn be independent random replicæ of a random variable X with quantile density function q(·), and let Ĥε;n(X) be the estimator of H(X) as in (1.5). Let Tn denote the statistic
$$T_n := \log\sqrt{2\pi\sigma_n^2} + 0.5 - \widehat{H}_{\varepsilon;n}(X), \qquad (5.1)$$
where σn² is the sample variance based on X1, ..., Xn, defined by
$$\sigma_n^2 := \frac{1}{n}\sum_{i=1}^n\big(X_i - \bar{X}_n\big)^2, \qquad \text{where}\quad \bar{X}_n := n^{-1}\sum_{i=1}^n X_i.$$
The normality hypothesis will be rejected whenever the observed value of Tn is significantly greater than 0, in the sense made precise below. The exact critical values Tα of Tn at significance level α ∈ ]0,1[ are defined by the equation P(Tn ≥ Tα) = α under the null hypothesis. The distribution of Tn under the null hypothesis cannot readily be obtained analytically. To evaluate the critical values Tα, we have used a Monte Carlo method, for sample sizes 10 ≤ n ≤ 50 and the significance value α = 0.05. Namely, for each n ≤ 50, we simulate 20000 samples of size n from the standard normal distribution; since α = 0.05 = 1000/20000, we take the critical value Tn,0.05 to be the 1000-th largest of the 20000 simulated values of Tn. Our results are presented in Table 12.
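A sketch of the resulting procedure in R (hypothetical code building on the `entropy_hat` sketch of Section 1; the upper-tail calibration below matches the critical values reported in Table 12):

```r
# Test statistic (5.1)
T_stat <- function(x, h, eps = 0.01) {
  s2 <- mean((x - mean(x))^2)                         # sigma_n^2
  log(sqrt(2 * pi * s2)) + 0.5 - entropy_hat(x, h, eps)
}

# Monte Carlo critical value under the null (standard normal), upper tail
critical_value <- function(n, h, alpha = 0.05, B = 20000) {
  t0 <- replicate(B, T_stat(rnorm(n), h))
  unname(quantile(t0, 1 - alpha))
}

# Example: for a sample x of size 50, reject normality at the 5% level if
#   T_stat(x, h = 0.0333) > critical_value(50, h = 0.0333)
```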
Sample size n    Percentage level
                 0.1           0.05          0.025         0.01          0.005
35               0.03660258    0.05896114    0.07819581    0.1041396     0.1229790
40               0.03641781    0.05732028    0.07655416    0.1003001     0.1177802
45               0.03011983    0.04910612    0.06554724    0.0844859     0.1004404
50               0.02534047    0.04232442    0.05874272    0.07830533    0.09371921

Table 12: Critical values of Tn.
Table 12: Critical Values of Tn . Park and Park (2003) establish the entropy-based goodness of fit test statistics based on the nonparametric distribution functions of the sample entropy and modified sample entropy Ebrahimi et al. (1994), and compare their performances for the exponential and normal distributions. The authors consider n 1X n Ebr Hm,n := log {Xi+m;n − Xi−m;n } , n i=1 ci m where
1+ ci := 2 1+
i−1 m n−i n
if if if
1 ≤ i ≤ m, m + 1 ≤ i ≤ n − m, n − m ≤ i ≤ n.
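A short R transcription of this corrected spacing estimator (again a sketch, with the usual truncation of the order-statistic indices at the sample minimum and maximum):

```r
# Ebrahimi, Pflughoeft & Soofi (1994): spacing estimator with boundary weights c_i
ebrahimi <- function(x, m) {
  xs <- sort(x)
  n  <- length(xs)
  i  <- 1:n
  ci <- ifelse(i <= m, 1 + (i - 1) / m,
               ifelse(i <= n - m, 2, 1 + (n - i) / m))
  up <- xs[pmin(i + m, n)]
  lo <- xs[pmax(i - m, 1)]
  mean(log(n * (up - lo) / (ci * m)))
}
```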
Yousefzadeh and Arghami (2008) use a new cdf estimator to obtain a nonparametric entropy estimate and use it for testing exponentiality and normality. They introduce the following estimator:
$$H_{m,n}^{You} := \sum_{i=1}^{n} \frac{\widehat{F}_n(X_{i+m;n}) - \widehat{F}_n(X_{i-m;n})}{\sum_{i=1}^n\big(\widehat{F}_n(X_{i+m;n}) - \widehat{F}_n(X_{i-m;n})\big)}\;\log\Bigg(\frac{X_{i+m;n} - X_{i-m;n}}{\widehat{F}_n(X_{i+m;n}) - \widehat{F}_n(X_{i-m;n})}\Bigg),$$
where
$$\widehat{F}_n(x) := \begin{cases} \dfrac{n-1}{n(n+1)}\Big(\dfrac{n}{n-1} + \dfrac{x - X_{0;n}}{X_{2;n} - X_{0;n}} + \dfrac{x - X_{1;n}}{X_{3;n} - X_{1;n}}\Big) & \text{for } X_{1;n} \le x \le X_{2;n},\\[2ex] \dfrac{n-1}{n(n+1)}\Big(i + \dfrac{1}{n-1} + \dfrac{x - X_{i-1;n}}{X_{i+1;n} - X_{i-1;n}} + \dfrac{x - X_{i;n}}{X_{i+2;n} - X_{i;n}}\Big) & \text{for } X_{i;n} \le x \le X_{i+1;n},\ i = 2, \dots, n-2,\\[2ex] \dfrac{n-1}{n(n+1)}\Big(n - 1 + \dfrac{1}{n-1} + \dfrac{x - X_{n-2;n}}{X_{n;n} - X_{n-2;n}} + \dfrac{x - X_{n-1;n}}{X_{n+1;n} - X_{n-1;n}}\Big) & \text{for } X_{n-1;n} \le x \le X_{n;n},\end{cases}$$
and
$$X_{0;n} := X_{1;n} - \frac{n}{n-1}\,\big(X_{2;n} - X_{1;n}\big), \qquad X_{n+1;n} := X_{n;n} + \frac{n}{n-1}\,\big(X_{n;n} - X_{n-1;n}\big).$$
To compare the power of the proposed test, a simulation was conducted. We used tests based on the following entropy estimators: H^{(WG)}_{m,n}, H^{Ebr}_{m,n}, H^{You}_{m,n}, and our test (5.1). In the power comparison, we consider the following alternatives.

• The Weibull distribution with density function
$$f(x; \lambda, k) = \frac{k}{\lambda}\Big(\frac{x}{\lambda}\Big)^{k-1}\exp\Big(-\Big(\frac{x}{\lambda}\Big)^k\Big)\,\mathbb{1}\{x > 0\},$$
where k > 0 is the shape parameter and λ > 0 is the scale parameter of the distribution, and where 1{·} stands for the indicator function of the set {·}.

• The uniform distribution with density function f(x) = 1, 0 < x < 1.

• The Student t-distribution with density function
$$f(x;\nu) = \frac{\Gamma((\nu+1)/2)}{\Gamma(\nu/2)\sqrt{\nu\pi}}\,\frac{1}{(1 + x^2/\nu)^{(\nu+1)/2}}, \qquad \nu > 2, \ -\infty < x < \infty.$$
Remark 5.1 We mention that the values reported in Table 13 for the statistics based on H^{Ebr}_{m,n} and H^{You}_{m,n} are the same as those calculated in Table 6, p. 1493, of Yousefzadeh and Arghami (2008), for the same alternatives that we consider in our comparison.
                 Statistics based on
Alternatives     Ĥ_{ε;n}(X)    H^{(WG)}_{4,n}    H^{(WG)}_{8,n}    H^{Ebr}_{m,n}    H^{You}_{m,n}
Uniform          0.9999        0.9262            0.96685           0.9275           0.8768
Weibull(2)       0.8264        0.3297            0.33795           0.4211           0.3444
Student t5       0.9306        0.1358            0.05530           0.1484           0.2345
Student t3       1.0000        0.3696            0.18245           0.3736           0.5124

Table 13: Power estimates of 0.05 tests against alternatives of the normal distribution, based on 20 000 replications for sample size n = 50.
Alternatives     hn
Uniform          0.5297
Weibull(2)       0.6555
Student t5       0.0310
Student t3       0.0189

Table 14: Choice of the smoothing parameter for Ĥ_{ε;n}(X).
In Table 13 we report the results of the power comparison for sample size 50. We made 20000 Monte Carlo simulations to compare the powers of the proposed test against 4 alternatives. From Table 13, we can see that the proposed test Tn shows better power than all the other statistics for the alternatives that we consider. It is natural that our test performs well for unbounded support, since kernel-type estimators behave well in this situation.
6 Concluding remarks and future works
We have proposed a new estimator of entropy based on the kernel quantile density estimator. Simulations show that this estimator behaves better than the other competitors, both with and without contamination. More precisely, its MSE is consistently smaller than the MSE of the spacing-based estimators considered in our study. It will be interesting to compare theoretically the power and the robustness of the test of normality based on the proposed estimator with those considered in Esteban et al. (2001). The study of entropy in the presence of censored data is possible using a methodology similar to the one presented here. It would be interesting to provide a complete investigation of the choice of the parameter hn for the kernel difference-type estimator, which requires nontrivial mathematics; this would go well beyond the scope of the present paper. The problems and the methods described here are all inherently univariate. A natural and useful multivariate extension is the use of the copula function. We propose to extend the
results of this paper to the multivariate case in the following way. Consider an R^d-valued random vector X = (X1, ..., Xd) with joint cdf F(x) := F(x1, ..., xd) := P(X1 ≤ x1, ..., Xd ≤ xd) and marginal cdf's Fj(xj) := P(Xj ≤ xj) for j = 1, ..., d. If the marginal distribution functions F1(·), ..., Fd(·) are continuous, then, according to Sklar's theorem [Sklar (1959)], the copula function C(·) pertaining to F(·) is unique and
$$\mathbf{C}(\mathbf{u}) := \mathbf{C}(u_1,\dots,u_d) := \mathbf{F}\big(Q_1(u_1), \dots, Q_d(u_d)\big), \qquad \text{for } \mathbf{u} \in [0,1]^d,$$
where, for j = 1, ..., d, Qj(u) := inf{x : Fj(x) ≥ u}, u ∈ (0,1], with Qj(0) := lim_{t↓0} Qj(t) =: Qj(0+), is the quantile function of Fj(·). The differential entropy may be represented via the copula density c(·) and the quantile densities qi(·) as follows:
$$H(\mathbf{X}) = \sum_{i=1}^d H(X_i) + H(c), \qquad (6.1)$$
where
$$H(X_i) = \int_{[0,1]} \log q_i(u)\,du, \qquad (6.2)$$
qi(u) = dQi(u)/du, for i = 1, ..., d, and H(c) is the copula entropy defined by
$$H(c) = -\,\mathbb{E}\big[\log c\big(F_1(X_1),\dots,F_d(X_d)\big)\big] = -\int_{[0,1]^d} c(u_1,\dots,u_d)\log c(u_1,\dots,u_d)\,d\mathbf{u} = -\int_{[0,1]^d} c(\mathbf{u})\log c(\mathbf{u})\,d\mathbf{u}, \qquad (6.3)$$
where
$$c(u_1,\dots,u_d) = \frac{\partial^d}{\partial u_1\cdots\partial u_d}\,\mathbf{C}(u_1,\dots,u_d)$$
is the copula density.
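As a quick numerical sanity check of the decomposition (6.1) (a side illustration, not part of the paper's development): for a standard bivariate normal vector with correlation ρ, the copula is Gaussian with copula entropy H(c) = ½ log(1 − ρ²), each margin has entropy log√(2πe), and the joint entropy is log(2πe) + ½ log(1 − ρ²), so (6.1) holds exactly:

```r
# Check H(X) = H(X1) + H(X2) + H(c) for a standard bivariate normal with correlation rho
rho   <- 0.6
H_X   <- log(2 * pi * exp(1)) + 0.5 * log(1 - rho^2)  # joint differential entropy
H_Xi  <- log(sqrt(2 * pi * exp(1)))                   # entropy of each N(0,1) margin
H_cop <- 0.5 * log(1 - rho^2)                         # Gaussian copula entropy
c(lhs = H_X, rhs = 2 * H_Xi + H_cop)                  # equal
```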
7 Proofs
This section is devoted to the proofs of our results. The previously defined notation continues to be used below.

Proof of Theorem 2.1. Recall that we set, for all ε ∈ ]0, 1/2[, U(ε) = [ε, 1−ε], and
$$H_\varepsilon(X) = \varepsilon \log q(\varepsilon) + \int_{U(\varepsilon)} \log q(x)\,dx + \varepsilon\log q(1-\varepsilon).$$
Therefore, by (1.4), we have
$$\big|\widehat{H}_{\varepsilon;n}(X) - H(X)\big| \le \big|\widehat{H}_{\varepsilon;n}(X) - H_\varepsilon(X)\big| + |H_\varepsilon(X) - H(X)| = \big|\widehat{H}_{\varepsilon;n}(X) - H_\varepsilon(X)\big| + o\big(\eta(\varepsilon)\big). \qquad (7.1)$$
We evaluate the first term on the right-hand side of (7.1). By the triangle inequality, we have
$$\big|\widehat{H}_{\varepsilon;n}(X) - H_\varepsilon(X)\big| = \Big|\varepsilon\big(\log \widehat{q}_n(\varepsilon) - \log q(\varepsilon)\big) + \int_{U(\varepsilon)}\big(\log \widehat{q}_n(x) - \log q(x)\big)\,dx + \varepsilon\big(\log \widehat{q}_n(1-\varepsilon) - \log q(1-\varepsilon)\big)\Big|$$
$$\le \varepsilon\big|\log \widehat{q}_n(\varepsilon) - \log q(\varepsilon)\big| + \int_{U(\varepsilon)}\big|\log \widehat{q}_n(x) - \log q(x)\big|\,dx + \varepsilon\big|\log \widehat{q}_n(1-\varepsilon) - \log q(1-\varepsilon)\big|.$$
We note that, for z > 0, |log z| ≤ |z − 1| + |1/z − 1|. Therefore, we have
$$\big|\log \widehat{q}_n(x) - \log q(x)\big| = \Big|\log \frac{\widehat{q}_n(x)}{q(x)}\Big| \le \frac{|\widehat{q}_n(x) - q(x)|}{q(x)} + \frac{|\widehat{q}_n(x) - q(x)|}{\widehat{q}_n(x)}.$$
Under conditions (Q.1)–(Q.3) and (K.1)–(K.3), we may apply Theorem 2.2 of Cheng and Parzen (1997): for any fixed ε ∈ ]0,1/2[, recalling M(q;Kn) defined in Theorem 2.1, we have, as n → ∞,
$$\sup_{x\in U(\varepsilon)} |\widehat{q}_n(x) - q(x)| = O_P\big(n^{-1/2} M(q;K_n) + n^{-\beta}\big). \qquad (7.2)$$
We infer that, uniformly over x ∈ U(ε), q̂n(x) ≥ (1/2) q(x) for all n large enough. This fact implies that
$$\big|\widehat{H}_{\varepsilon;n}(X) - H_\varepsilon(X)\big| \le 4 \sup_{x\in U(\varepsilon)} |\widehat{q}_n(x) - q(x)|,$$
in probability, which implies that
$$\big|\widehat{H}_{\varepsilon;n}(X) - H(X)\big| = O_P\big(n^{-1/2}M(q;K_n) + n^{-\beta} + \eta(\varepsilon)\big).$$
Thus the proof is complete. ✷

Proof of Theorem 2.2. Throughout, we make use of the fact (see, e.g., Csörgő and Révész (1978, 1981)) that we can define X1, ..., Xn on a probability space which carries a sequence of Brownian bridges {Bn(t) : t ∈ [0,1], n ≥ 1} such that
$$\sqrt{n}\big(Q_n(t) - Q(t)\big) - q(t)B_n(t) = q(t)e_n(t) \qquad (7.3)$$
and
$$\limsup_{n\to\infty} \frac{n^{1/2}}{A_\gamma(n)}\, \sup_{0\le t\le 1}\big|q(t)e_n(t)\big| \le C, \qquad (7.4)$$
with probability one, where C is a universal constant and
$$A_\gamma(n) := \begin{cases} \log n & \text{if } \max\{q(0), q(1)\} < \infty \text{ or } \gamma \le 2,\\ (\log\log n)^{\gamma}\,(\log n)^{(1+\nu)(\gamma-1)} & \text{if } \gamma > 2,\end{cases}$$
with γ the constant in (Q.2) and ν an arbitrary positive number. Using a Taylor expansion we get, for some λ ∈ ]0,1[,
$$\sqrt{n}\big(\widehat{H}_{\varepsilon;n}(X) - H_\varepsilon(X)\big) = \int_{U(\varepsilon)} \sqrt{n}\big(\log \widehat{q}_n(x) - \log q(x)\big)\,dx + \sqrt{n}\,\varepsilon\big(\log \widehat{q}_n(\varepsilon) - \log q(\varepsilon)\big) + \sqrt{n}\,\varepsilon\big(\log \widehat{q}_n(1-\varepsilon) - \log q(1-\varepsilon)\big)$$
$$= \int_{U(\varepsilon)} \sqrt{n}\,\frac{\widehat{q}_n(x) - q(x)}{\lambda\widehat{q}_n(x) + (1-\lambda)q(x)}\,dx + L_n(\varepsilon) =: T_n(\varepsilon) + L_n(\varepsilon). \qquad (7.5)$$
Under our assumptions, q̂n(x) → q(x) in probability, which implies that λq̂n(x) + (1−λ)q(x) → q(x) in probability. Thus, by Slutsky's theorem, as n tends to infinity, Tn(ε) has the same limiting law as
$$\widetilde{T}_n(\varepsilon) := \int_{U(\varepsilon)} (1/q(x))\,\sqrt{n}\,\big(\widehat{q}_n(x) - q(x)\big)\,dx. \qquad (7.6)$$
We have the following decomposition:
$$\widetilde{T}_n(\varepsilon) = \int_{U(\varepsilon)} (1/q(x))\,\sqrt{n}\Big(\frac{d}{dx}\int_0^1 Q_n(v)K_n(x,v)\,d\mu_n(v) - \frac{d}{dx}Q(x)\Big)\,dx$$
$$= \int_{U(\varepsilon)} (1/q(x))\,\frac{d}{dx}\int_0^1 \sqrt{n}\big(Q_n(v) - Q(v)\big)K_n(x,v)\,d\mu_n(v)\,dx + \int_{U(\varepsilon)} (1/q(x))\,\sqrt{n}\Big(\frac{d}{dx}\int_0^1 Q(v)K_n(x,v)\,d\mu_n(v) - \frac{d}{dx}Q(x)\Big)\,dx.$$
Using condition (K.3) and
$$\int_{U(\varepsilon)} (1/q(x))\,dx \le 1,$$
we have
$$\Big|\int_{U(\varepsilon)} (1/q(x))\,\sqrt{n}\Big(\frac{d}{dx}\int_0^1 Q(v)K_n(x,v)\,d\mu_n(v) - \frac{d}{dx}Q(x)\Big)\,dx\Big| = O\big(n^{1/2-\beta}\big).$$
This, in turn, implies, in connection with (7.3), that
$$\widetilde{T}_n(\varepsilon) = \int_{U(\varepsilon)} (1/q(x))\,\frac{d}{dx}\int_0^1 \big(q(v)B_n(v) + q(v)e_n(v)\big)K_n(x,v)\,d\mu_n(v)\,dx + O\big(n^{1/2-\beta}\big).$$
Recalling the arguments of the proof of Lemma 3.2 of Cheng (1995), for In(u) = [u − δn, u + δn] and Inc(u) = [0,1]\In(u), we can show that
$$\Big|\int_{U(\varepsilon)} (1/q(x))\,\frac{d}{dx}\int_0^1 q(v)e_n(v)K_n(x,v)\,d\mu_n(v)\,dx\Big| \le \int_{U(\varepsilon)}(1/q(x))\,dx\;\sup_{x\in U(\varepsilon)}\Big|\frac{d}{dx}\int_0^1 q(v)e_n(v)K_n(x,v)\,d\mu_n(v)\Big|.$$
It then follows that
$$\Big|\int_{U(\varepsilon)} (1/q(x))\,\frac{d}{dx}\int_0^1 q(v)e_n(v)K_n(x,v)\,d\mu_n(v)\,dx\Big| \le \sup_{x\in U(\varepsilon)}\Big|\frac{d}{dx}\int_0^1 q(v)\,e_n(v)\,K_n(x,v)\,d\mu_n(v)\Big|$$
$$\le \sup_{x\in U(\varepsilon)}\int_0^1 |e_n(v)|\,q(v)\,\Big|\frac{d}{dx}K_n(x,v)\Big|\,d\mu_n(v) \le C n^{-1/2} A_\gamma(n)\,\sup_{x\in U(\varepsilon)}\int_0^1 q(v)\,\Big|\frac{d}{dx}K_n(x,v)\Big|\,d\mu_n(v)$$
$$\le C n^{-1/2} A_\gamma(n)\,\sup_{x\in U(\varepsilon-\delta_n)}|q(x)|\;\sup_{x\in U(\varepsilon)}\int_{I_n(x)}\Big|\frac{d}{dx}K_n(x,v)\Big|\,d\mu_n(v) + C n^{-1/2} A_\gamma(n)\,\sup_{x\in U(\varepsilon)}\int_{I_n^c(x)} q(v)\,\Big|\frac{d}{dx}K_n(x,v)\Big|\,d\mu_n(v) =: R_n^{(1)}(\varepsilon) + R_n^{(2)}(\varepsilon).$$
Since q(·) is continuous on ]0,1[, it is bounded on the compact interval U(ε − δn) ⊂ ]0,1[, which gives, by condition (K.2),
$$R_n^{(1)}(\varepsilon) \le C n^{-1/2} A_\gamma(n)\,\sup_{x\in U(\varepsilon-\delta_n)}|q(x)|\;O(1) = C n^{-1/2} A_\gamma(n)\,O(1) = o(1) = o_P(1).$$
Let s = 1 + δ with δ > 0 arbitrary, and let r = 1 − 1/s. Then Hölder's inequality implies
$$R_n^{(2)}(\varepsilon) \le C n^{-1/2} A_\gamma(n)\,\sup_{x\in U(\varepsilon)}\Bigg\{\Big(\int_{[0,1]} [q(v)]^r\,\Big|\frac{d}{dx}K_n(x,v)\Big|\,d\mu_n(v)\Big)^{1/r}\Big(\int_{[0,1]} \mathbb{1}_{I_n^c(x)}(v)\,\Big|\frac{d}{dx}K_n(x,v)\Big|\,d\mu_n(v)\Big)^{1/s}\Bigg\}.$$
Under condition (K.3), we can see that
$$\sup_{x\in U(\varepsilon)}\Big(\int_{[0,1]} [q(v)]^r\,\Big|\frac{d}{dx}K_n(x,v)\Big|\,d\mu_n(v)\Big)^{1/r} \le \Big(\sup_{x\in U(\varepsilon)}\Big| q(x) - \int_{[0,1]} [q(v)]^r\,\frac{d}{dx}K_n(x,v)\,d\mu_n(v)\Big|\Big)^{1/r} = O\big(n^{-\beta}\big).$$
This, in connection with condition (K.2), implies
$$R_n^{(2)}(\varepsilon) = C n^{-1/2} A_\gamma(n)\,O(1)\,O\big(R_n^{1/(1+\delta)}\big) = o(1) = o_P(1).$$
Hence
$$\int_{U(\varepsilon)} (1/q(x))\,\frac{d}{dx}\int_0^1 q(v)e_n(v)K_n(x,v)\,d\mu_n(v)\,dx = o_P(1).$$
Then, we conclude that
$$\widetilde{T}_n(\varepsilon) = \int_{U(\varepsilon)} (1/q(x))\,\frac{d}{dx}\int_0^1 q(v)B_n(v)K_n(x,v)\,d\mu_n(v)\,dx + o_P(1).$$
We have, by Cheng and Parzen (1997), p. 297 (we can refer also to Stadtmüller (1988, 1986) and Xiang (1994) for related results on the smoothed Wiener process and Brownian bridge),
$$\sup_{x\in U(\varepsilon)}\Big|\int_0^1 q(v)B_n(v)K_n(x,v)\,d\mu_n(v) - q(x)B_n(x)\Big| \le \sup_{x\in U(\varepsilon)}\int_0^1 q(v)\,|B_n(v) - B_n(x)|\,K_n(x,v)\,d\mu_n(v) + \sup_{x\in U(\varepsilon)}|B_n(x)|\int_0^1 |q(v) - q(x)|\,K_n(x,v)\,d\mu_n(v)$$
$$= O_P\big((2\delta_n \log \delta_n^{-1})^{1/2}\big) + O_P\big(R_n^{1/(1+\delta)}\big) + O_P\big(n^{-\beta}\big) = o_P(1).$$
This gives, under condition (Q.1),
$$\int_{U(\varepsilon)} (1/q(x))\,\frac{d}{dx}\int_0^1 q(v)B_n(v)K_n(x,v)\,d\mu_n(v)\,dx = \Big[(1/q(x))\int_0^1 q(v)B_n(v)K_n(x,v)\,d\mu_n(v)\Big]_\varepsilon^{1-\varepsilon} - \int_{U(\varepsilon)} \frac{d}{dx}\big(1/q(x)\big)\int_0^1 q(v)B_n(v)K_n(x,v)\,d\mu_n(v)\,dx$$
$$= \Big[(1/q(x))\,q(x)B_n(x)\Big]_\varepsilon^{1-\varepsilon} - \int_{U(\varepsilon)} \frac{d}{dx}\big(1/q(x)\big)\,q(x)B_n(x)\,dx + o_P(1) = \int_{U(\varepsilon)} \big(q'(x)/q(x)\big)B_n(x)\,dx + B_n(1-\varepsilon) - B_n(\varepsilon) + o_P(1).$$
Making use of (Csörgő and Révész, 1981, Theorem 1.4.1), for ε sufficiently small, |Bn(1 − ε) − Bn(1)| = oP(1) and |Bn(ε) − Bn(0)| = oP(1), which implies that
$$\widetilde{T}_n(\varepsilon) = \int_{U(\varepsilon)} \big(q'(x)/q(x)\big)B_n(x)\,dx + o_P(1).$$
By (Cheng and Parzen, 1997, Theorem 2.2), we have, for β > 1/2 (β in condition (K.3)),
$$\sup_{u\in U(\varepsilon)} \sqrt{n}\,\big|\log \widehat{q}_n(u) - \log q(u)\big| = o_P(1), \qquad (7.7)$$
which implies, by using a Taylor expansion, that
$$L_n(\varepsilon) = o_P(1), \qquad (7.8)$$
where Ln(ε) is given in (7.5). This gives
$$\widetilde{T}_n(\varepsilon) + L_n(\varepsilon) = \int_{U(\varepsilon)} \big(q'(x)/q(x)\big)B_n(x)\,dx + o_P(1).$$
It follows that
$$\sqrt{n}\big(\widehat{H}_{\varepsilon;n}(X) - H_\varepsilon(X)\big) \overset{d}{=} \int_{U(\varepsilon)} \big(q'(x)/q(x)\big)B_n(x)\,dx + o_P(1).$$
Here and elsewhere we denote by "=ᵈ" the equality in distribution. Note that
$$\int_{U(\varepsilon)} \big(q'(x)/q(x)\big)B_n(x)\,dx$$
is a Gaussian random variable with mean
$$\mathbb{E}\Big[\int_{U(\varepsilon)}\big(q'(u)/q(u)\big)B_n(u)\,du\Big] = \int_{U(\varepsilon)}\big(q'(u)/q(u)\big)\,\mathbb{E}\big(B_n(u)\big)\,du = 0,$$
and variance
$$\begin{aligned}
&\mathbb{E}\Big[\int_{U(\varepsilon)}\big(q'(u)/q(u)\big)B_n(u)\,du \int_{U(\varepsilon)}\big(q'(v)/q(v)\big)B_n(v)\,dv\Big]\\
&\quad= \mathbb{E}\Big[\int_{U(\varepsilon)}\int_{U(\varepsilon)} \big(q'(u)/q(u)\big)\big(q'(v)/q(v)\big)B_n(u)B_n(v)\,du\,dv\Big]\\
&\quad= \int_{U(\varepsilon)}\int_{U(\varepsilon)} \big(q'(u)/q(u)\big)\big(q'(v)/q(v)\big)\,\mathbb{E}\big(B_n(u)B_n(v)\big)\,du\,dv\\
&\quad= \int_{U(\varepsilon)}\int_{U(\varepsilon)} \big(q'(u)/q(u)\big)\big(q'(v)/q(v)\big)\big(\min(u,v) - uv\big)\,du\,dv\\
&\quad= \int_{U(\varepsilon)} \big(q'(v)/q(v)\big)\Big\{(1-v)\int_0^v \big(q'(u)/q(u)\big)\,u\,du + v\int_v^1 \big(q'(u)/q(u)\big)(1-u)\,du\Big\}\,dv\\
&\quad= -\varepsilon\log(q(\varepsilon))\log(q(1-\varepsilon)) + \varepsilon\log^2(q(\varepsilon)) - \log(q(1-\varepsilon))\int_{U(\varepsilon)}\log(q(x))\,dx + H_\varepsilon(X)\log(q(1-\varepsilon)) - H_\varepsilon^2(X)\\
&\qquad\quad + \int_{U(\varepsilon)}\log^2(q(x))\,dx + \varepsilon\log^2(q(\varepsilon)) + \varepsilon\log^2(q(1-\varepsilon)) - A^2(\varepsilon)\\
&\quad= \int_{[0,1]}\log^2(q(x))\,dx + o\big(\vartheta(\varepsilon)\big) - \big(H(X)\big)^2 + o\big(\eta(\varepsilon)\big)\\
&\quad= \mathrm{Var}\big\{\log q(F(X))\big\} + o\big(\vartheta(\varepsilon) + \eta^2(\varepsilon)\big).
\end{aligned}$$
Thus the proof is complete. ✷
References Ahmad, I. A. and Lin, P. E. (1976). A nonparametric estimation of the entropy for absolutely continuous distributions. IEEE Trans. Information Theory, IT-22(3), 372–375. Beirlant, J., Dudewicz, E. J., Györfi, L., and van der Meulen, E. C. (1997). Nonparametric entropy estimation: an overview. Int. J. Math. Stat. Sci., 6(1), 17–39. Bouzebda, S. and Elhattab, I. (2009). A strong consistency of a nonparametric estimate of entropy under random censorship. C. R. Math. Acad. Sci. Paris, 347(13-14), 821–826. Bouzebda, S. and Elhattab, I. (2010). Uniform in bandwidth consistency of the kernel-type estimator of the Shannon's entropy. C. R. Math. Acad. Sci. Paris, 348(5-6), 317–321. Bouzebda, S. and Elhattab, I. (2011). Uniform-in-bandwidth consistency for kernel-type estimators of Shannon's entropy. Electron. J. Stat., 5, 440–459. Cheng, C. (1995). Uniform consistency of generalized kernel estimators of quantile density. Ann. Statist., 23(6), 2285–2291. Cheng, C. and Parzen, E. (1997). Unified estimators of smooth quantile and quantile density functions. J. Statist. Plann. Inference, 59(2), 291–307. Cheng, M.-Y. and Sun, S. (2006). Bandwidth selection for kernel quantile estimation. J. Chin. Statist. Assoc., 44, 271–395. Correa, J. C. (1995). A new estimator of entropy. Comm. Statist. Theory Methods, 24(10), 2439–2449. Cover, T. M. and Thomas, J. A. (2006). Elements of information theory. Wiley-Interscience [John Wiley & Sons], Hoboken, NJ, second edition.
Csörgő, M. and Révész, P. (1978). Strong approximations of the quantile process. Ann. Statist., 6(4), 882–894. Csörgő, M. and Révész, P. (1981). Strong approximations in probability and statistics. Probability and Mathematical Statistics. Academic Press Inc. [Harcourt Brace Jovanovich Publishers], New York. Csörgő, M. and Révész, P. (1984). Two approaches to constructing simultaneous confidence bounds for quantiles. Probab. Math. Statist., 4(2), 221–236.
Csörgő, M., Horváth, L., and Deheuvels, P. (1991). Estimating the quantile-density function. In Nonparametric functional estimation and related topics (Spetses, 1990), volume 335 of NATO Adv. Sci. Inst. Ser. C Math. Phys. Sci., pages 213–223. Kluwer Acad. Publ., Dordrecht. Csiszár, I. (1962). Informationstheoretische Konvergenzbegriffe im Raum der Wahrscheinlichkeitsverteilungen. Magyar Tud. Akad. Mat. Kutató Int. Közl., 7, 137–158. Dmitriev, J. G. and Tarasenko, F. P. (1973). The estimation of functionals of a probability density and its derivatives. Teor. Verojatnost. i Primenen., 18, 662–668. Dudewicz, E. J. and van der Meulen, E. C. (1981). Entropy-based tests of uniformity. J. Amer. Statist. Assoc., 76(376), 967–974. Ebrahimi, N., Habibullah, M., and Soofi, E. (1992). Testing exponentiality based on Kullback-Leibler information. J. Roy. Statist. Soc. Ser. B, 54(3), 739–748. Ebrahimi, N., Pflughoeft, K., and Soofi, E. S. (1994). Two measures of sample entropy. Statist. Probab. Lett., 20(3), 225–234. Einmahl, U. and Mason, D. M. (2005). Uniform in bandwidth consistency of kernel-type function estimators. Ann. Statist., 33(3), 1380–1403. Esteban, M. D., Castellanos, M. E., Morales, D., and Vajda, I. (2001). Monte Carlo comparison of four normality tests using different entropy estimates. Comm. Statist. Simulation Comput., 30(4), 761–785. Gallager, R. (1968). Information theory and reliable communication. New York-LondonSydney-Toronto: John Wiley & Sons, Inc. XVI, 588 p. Gokhale, D. (1983). On entropy-based goodness-of-fit tests. Comput. Stat. Data Anal., 1, 157–165. Gokhale, D. V. and Kullback, S. (1978). The information in contingency tables, volume 23 of Statistics: Textbooks and Monographs. Marcel Dekker Inc., New York. Gradshteyn, I. S. and Ryzhik, I. M. (2007). Table of integrals, series, and products. Elsevier/Academic Press, Amsterdam, seventh edition. Falk, M. (1986). On the estimation of the quantile density function. Statist. Probab. Lett., 4(2), 69–73. Huber, P. J. and Ronchetti, E. M. (2009). Robust statistics. Wiley Series in Probability and Statistics. John Wiley & Sons Inc., Hoboken, NJ, second edition.
Jaynes, E. T. (1957). Information theory and statistical mechanics. Phys. Rev. (2), 106, 620–630. Jones, M. (1992). Estimating densities, quantiles, quantile densities and density quantiles. Ann. Inst. Stat. Math., 44(4), 721–727. Kullback, S. (1959). Information theory and statistics. John Wiley and Sons, Inc., New York. Lazo, A. C. and Rathie, P. N. (1978). On the entropy of continuous probability distributions. IEEE Trans. Inf. Theory, 24, 120–122. Park, S. and Park, D. (2003). Correcting moments for goodness of fit tests based on two entropy estimates. J. Stat. Comput. Simul., 73(9), 685–694. Parzen, E. (1979). Nonparametric statistical data modeling. J. Amer. Statist. Assoc., 74(365), 105–131. With comments by John W. Tukey, Roy E. Welsch, William F. Eddy, D. V. Lindley, Michael E. Tarter and Edwin L. Crow, and a rejoinder by the author. Parzen, E. (1991). Goodness of fit tests and entropy. J. Combin. Inform. System Sci., 16(2-3), 129–136. Prescott, P. (1976). On a test for normality based on sample entropy. J. R. Stat. Soc., Ser. B, 38, 254–256. Rényi, A. (1959). On the dimension and entropy of probability distributions. Acta Math. Acad. Sci. Hungar., 10, 193–215 (unbound insert). Sklar, A. (1959). Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris, 8, 229–231. Shannon, C. E. (1948). A mathematical theory of communication. Bell System Tech. J., 27, 379–423, 623–656. Silverman, B. W. (1986). Density estimation for statistics and data analysis. Monographs on Statistics and Applied Probability. Chapman & Hall, London. Song, K.-S. (2000). Limit theorems for nonparametric sample entropy estimators. Statist. Probab. Lett., 49(1), 9–18. Stadtmüller, U. (1986). Asymptotic properties of nonparametric curve estimates. Period. Math. Hungar., 17(2), 83–108. Stadtmüller, U. (1988). Kernel approximations of a Wiener process. Period. Math. Hungar., 19(1), 79–90.
van Es, B. (1992). Estimating functionals related to a density by a class of statistics based on spacings. Scand. J. Statist., 19(1), 61–72. Vasicek, O. (1976). A test for normality based on sample entropy. J. Roy. Statist. Soc. Ser. B, 38(1), 54–59. Wieczorkowski, R. and Grzegorzewski, P. (1999). Entropy estimators—improvements and comparisons. Comm. Statist. Simulation Comput., 28(2), 541–567. Xiang, X. (1994). A law of the logarithm for kernel quantile density estimators. Ann. Probab., 22(2), 1078–1091. Yousefzadeh, F. and Arghami, N. R. (2008). Testing exponentiality based on type II censored data and a new cdf estimator. Comm. Statist. Simulation Comput., 37(8-10), 1479–1499.