LABORATOIRE INFORMATIQUE, SIGNAUX ET SYSTÈMES DE SOPHIA ANTIPOLIS, UMR 6070

CONSISTENCY OF A MINIMUM-ENTROPY ESTIMATOR OF LOCATION

Eric Wolsztynski, Eric Thierry, Luc Pronzato
Projet TOPMODEL
Rapport de recherche ISRN I3S/RR–2004-38–FR
Novembre 2004

LABORATOIRE I3S: Les Algorithmes / Euclide B – 2000 route des Lucioles – B.P. 121 – 06903 Sophia-Antipolis Cedex, France – Tél. (33) 492 942 701 – Télécopie : (33) 492 942 898
http://www.i3s.unice.fr/I3S/FR/

RÉSUMÉ :

MOTS CLÉS : aberrations, efficacité, entropie, estimation adaptative, estimation paramétrique, modèle semi-paramétrique, robustesse

ABSTRACT:

KEY WORDS: adaptive estimation, efficiency, entropy, parameter estimation, semi-parametric models, robustness, outliers

Consistency of a Minimum-Entropy Estimator of Location∗

Eric Wolsztynski†, Eric Thierry†, Luc Pronzato†

November 10, 2004

Abstract

In regression problems where the density f of the errors is not known, maximum likelihood is inapplicable, and the use of alternative techniques (least squares, robust M-estimation, ...) generally results in inefficient estimation of the parameters. We consider here an alternative parametric estimator that was first presented in [13, 12], using an empirical estimate of the entropy of the residuals as the criterion to minimize. Adaptivity of this estimator, that is, the property that the estimator remains asymptotically efficient independently of the knowledge of f (see in particular [16, 17, 4] and the review paper [10]), is discussed in [14], where we consider a direct approach, as opposed to the Stone-Bickel two-step estimator, which involves a preliminary $\sqrt{n}$-consistent estimator. In the following we give a detailed proof of consistency of the estimator and discuss the steps mentioned in [14] to prove its adaptivity.

Keywords: adaptive estimation, efficiency, entropy, parameter estimation, semi-parametric models, robustness, outliers

1 Semi-parametric minimum-entropy estimation in nonlinear regression: problem statement

We consider the problem of parameter estimation in general nonlinear regression models. The observations Y_i for such models correspond to the measures made from a parametric model η(θ, x), which is a known function of θ and the design variable x ∈ X ⊂ IR^d, corrupted by some additive noise ε of density f:
\[ Y_i = \eta(\bar\theta, X_i) + \varepsilon_i \,, \quad i = 1, \dots, n \,, \tag{1} \]

†{wolsztyn,et,pronzato}@i3s.unice.fr, Laboratoire I3S, 2000 route des Lucioles, 06903 Sophia Antipolis
∗This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the authors' views.


where θ̄ denotes the unknown value of the model parameters θ ∈ Θ ⊂ IR^p. We first introduce some notation. For F a function Θ → IR, ∇F(θ) and ∇²F(θ) will denote its first and second order derivatives with respect to θ, respectively a p-dimensional vector and a p × p symmetric matrix. For g a function IR → IR, the first, second and third order derivatives are simply denoted g′, g″ and g‴. Also, the following assumptions will hold throughout the paper. We suppose that θ̄ ∈ int(Θ), $\Theta = \overline{\mathrm{int}(\Theta)}$, and that η(θ, x) is bounded on Θ × X and two times continuously differentiable w.r.t. θ ∈ int(Θ) for any x ∈ X, with ∇η(θ, x) and ∇²η(θ, x) bounded on int(Θ) × X. The additive noise (ε_i) forms a sequence of independent and identically distributed (i.i.d.) random variables with probability density function (p.d.f.) f (with respect to the Lebesgue measure) that we suppose to be symmetric about zero and of unbounded support. f is also assumed to be two times continuously differentiable, with bounded derivatives f′(.), f″(.) and f‴(.), and we suppose that the Fisher information for location
\[ I(f) = \int_{-\infty}^{\infty} \frac{[f'(u)]^2}{f(u)} \, du \]

exists. For a given measure µ on the design variable x, the Fisher information matrix M_F(θ) associated with f and θ is given by
\[ M_F(\theta) = I(f) \int_{\mathcal{X}} \nabla\eta(\theta, x)\,[\nabla\eta(\theta, x)]^\top \mu(dx) \,. \tag{2} \]
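To make (2) concrete, here is a small numerical sketch (not from the report): it approximates I(f) by quadrature and assembles M_F(θ) for a hypothetical scalar-response model η(θ, x) = θ₁ exp(−θ₂ x), a Gaussian error density and a three-point design measure µ, all of which are illustrative assumptions.

```python
import numpy as np

def fisher_location(f, fprime, grid):
    """Numerical approximation of I(f) = int [f'(u)]^2 / f(u) du."""
    return np.trapz(fprime(grid) ** 2 / f(grid), grid)

def fisher_matrix(theta, xs, weights, grad_eta, I_f):
    """M_F(theta) = I(f) * sum_x mu(x) grad_eta(theta,x) grad_eta(theta,x)^T."""
    M = np.zeros((len(theta), len(theta)))
    for x, w in zip(xs, weights):
        g = grad_eta(theta, x)
        M += w * np.outer(g, g)
    return I_f * M

# Hypothetical model eta(theta, x) = theta1 * exp(-theta2 * x) (not from the paper)
grad_eta = lambda th, x: np.array([np.exp(-th[1] * x), -th[0] * x * np.exp(-th[1] * x)])

sigma = 0.5                                   # assumed Gaussian error scale
f  = lambda u: np.exp(-u**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
fp = lambda u: -u / sigma**2 * f(u)
u = np.linspace(-10, 10, 20001)
I_f = fisher_location(f, fp, u)               # should be close to 1/sigma^2 = 4

xs, weights = np.array([0.5, 1.0, 2.0]), np.array([1/3, 1/3, 1/3])  # design measure mu
M_F = fisher_matrix(np.array([1.0, 0.8]), xs, weights, grad_eta, I_f)
print(I_f, M_F)
```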

We suppose that M_F(θ̄) has full rank and that the identifiability condition
\[ \int_{\mathcal{X}} [\eta(\theta, x) - \eta(\bar\theta, x)]^2 \,\mu(dx) = 0 \;\Rightarrow\; \theta = \bar\theta \tag{3} \]

is satisfied. We know that, under standard assumptions, the Maximum Likelihood (ML) estimator $\hat\theta^n_{ML}$ is asymptotically efficient, that is,
\[ \sqrt{n}\,(\hat\theta^n_{ML} - \bar\theta) \xrightarrow{d} \mathcal{N}\big(0, M_F^{-1}(\bar\theta)\big) \,. \]

In the present context, f is not known beyond the assumptions made above, which implies that we are not in the classical situation where ML estimation is possible, and we need an estimator that does not require the knowledge of f. The model (1) is then called semi-parametric (as Stein first termed it in [16]), with θ and f respectively its parametric and non-parametric parts, and f can be considered as an infinite-dimensional nuisance parameter for the estimation of θ. The ultimate goal is then to define an estimator of θ that is asymptotically efficient, i.e. asymptotically normal with minimum variance, without requiring the knowledge of f. In general this proves to be difficult: the presence of a nuisance parameter of infinite dimension induces a loss of efficiency. One can refer to [14] for a review of the developments on adaptive estimation. Our present goal is to provide a proof of consistency (along with other technical

results) of an estimator we presented earlier, which consists in minimizing the entropy of a kernel estimate constructed from the symmetrized residuals in the regression model (1). In the next section, we recall some definitions for p.d.f. estimation and entropy estimation and we introduce some notations used later. Section 3 contains the main result on the consistency of the estimator for the simple location model. We also detail further technical points, such as the convergence of the second order derivative of the estimation criterion to the Fisher information matrix (subsection 3.2 provides a brief reminder of the rather standard approach we first considered for proving adaptivity, which motivated these developments). Finally, some perspectives are given in Section 4, where extension to the general non-linear regression model is considered, although not in full detail.

2 Estimation by minimizing the entropy of the residuals

2.1 Entropy of the true density

Consider the residuals e_i(θ) obtained from the observations in the regression model (1),
\[ e_i(\theta) = Y_i - \eta(\theta, X_i) = \varepsilon_i + \eta(\bar\theta, X_i) - \eta(\theta, X_i) \,. \]
When f is known, the Maximum Likelihood estimator $\hat\theta^n_{ML}$ minimizes

\[ \bar{H}_n(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \log f[e_i(\theta)] \tag{4} \]
with respect to θ ∈ Θ. Since $\bar{H}_n(\bar\theta) = -(1/n)\sum_{i=1}^{n} \log f(\varepsilon_i)$ is an empirical version of the (Shannon) entropy
\[ H(f) = -\int_{-\infty}^{\infty} \log[f(u)]\, f(u)\, du \,, \]

an intuitive idea is to base the estimation criterion on entropy (minimizing the entropy of the residuals forces them to gather). The density of e_i(θ), given X_i, is
\[ f_{e,X_i}(u) = f\big(u - \eta(\bar\theta, X_i) + \eta(\theta, X_i)\big) \,. \]
Since entropy is invariant by translation, we shall consider the 2n symmetrized residuals e_i(θ), −e_i(θ), with corresponding density given X_i
\[ f^s_{e,X_i}(u) = \frac{1}{2}\Big[ f\big(u - \eta(\bar\theta, X_i) + \eta(\theta, X_i)\big) + f\big(u + \eta(\bar\theta, X_i) - \eta(\theta, X_i)\big) \Big] \,. \tag{5} \]

Other techniques could be considered; we could for instance use un-symmetrized residuals with the constraint that their median, or their mean, is zero; however, symmetrization is conceptually simpler and requires less numerical calculation.


We can show that the entropy of the marginal distribution of the symmetrized residuals,
\[ f^s_e(u) = \int_{\mathcal{X}} f^s_{e,x}(u)\,\mu(dx) \,, \tag{6} \]
is minimum for θ = θ̄. For this purpose we can use the following lemma.

Lemma 1 ([1], Lemma 8.3.1 p.238) Let p(x) and q(x) be arbitrary probability density functions.

a. If $-\int_{-\infty}^{\infty} p(x)\log q(x)\,dx$ is finite, then $-\int_{-\infty}^{\infty} p(x)\log p(x)\,dx$ exists, and furthermore
\[ -\int_{-\infty}^{\infty} p(x)\log p(x)\,dx \;\le\; -\int_{-\infty}^{\infty} p(x)\log q(x)\,dx \,, \]
with equality if and only if p(x) = q(x) for almost all x (with respect to Lebesgue measure).

b. If $-\int_{-\infty}^{\infty} p(x)\log p(x)\,dx$ is finite, then $-\int_{-\infty}^{\infty} p(x)\log q(x)\,dx$ exists, and the above inequality holds.

It gives

\begin{align*}
H(f^s_e) &= -\int_{\mathcal{X}} \left[ \int_{-\infty}^{\infty} \tfrac{1}{2}\, f[u - \eta(\bar\theta,x) + \eta(\theta,x)]\, \log[f^s_e(u)]\, du \right] \mu(dx) \\
&\quad - \int_{\mathcal{X}} \left[ \int_{-\infty}^{\infty} \tfrac{1}{2}\, f[u + \eta(\bar\theta,x) - \eta(\theta,x)]\, \log[f^s_e(u)]\, du \right] \mu(dx) \\
&\ge -\int_{\mathcal{X}} \left[ \int_{-\infty}^{\infty} f(u)\, \log[f(u)]\, du \right] \mu(dx) = H(f) \,.
\end{align*}
Equality is obtained only if, for µ-almost all x and almost all u (Lebesgue),
\[ f[u - \eta(\bar\theta,x) + \eta(\theta,x)] = f[u + \eta(\bar\theta,x) - \eta(\theta,x)] = f^s_e(u) \,. \]
From the identifiability condition (3), this implies θ = θ̄. The same is true for the conditional entropy of the symmetrized residuals
\begin{align*}
E_\mu\{H(f^s_{e,X})\} &= -\int_{\mathcal{X}} \left[ \int_{-\infty}^{\infty} f^s_{e,x}(u)\, \log[f^s_{e,x}(u)]\, du \right] \mu(dx) \\
&\ge -\int_{\mathcal{X}} \left[ \int_{-\infty}^{\infty} f(u)\, \log[f(u)]\, du \right] \mu(dx) = H(f) \,,
\end{align*}
with equality attained only if H(f^s_{e,x}) = H(f) for µ-almost all x, which again implies θ = θ̄. Recall now the following classical result in information theory, see, e.g., [1] p. 239. Consider a random variable B with conditional p.d.f. f^s_{e,x} given X and unconditional p.d.f. f^s_e. Then H(B|X) = E_µ[H(f^s_{e,X})] ≤ H(B) = H(f^s_e) (conditioning reduces entropy). We thus have
\[ E_\mu\{H(f^s_{e,X})\} \le H(f^s_e) \,. \]
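The minimizing property above can be checked numerically. The sketch below (an illustration only; the Gaussian error density, the true location θ̄ = 2 and the integration grid are assumptions made for the example) evaluates H(f_e^{s,θ}) for the location model over a grid of θ and verifies that the minimum is attained at θ = θ̄.

```python
import numpy as np

sigma, theta_bar = 1.0, 2.0          # assumed error scale and true location
f = lambda u: np.exp(-u**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

u = np.linspace(-15, 15, 30001)      # integration grid

def entropy_sym(theta):
    """H(f_e^{s,theta}) = -int f_e^s log f_e^s du for the location model."""
    d = theta_bar - theta
    fs = 0.5 * (f(u + d) + f(u - d))
    fs = np.clip(fs, 1e-300, None)   # avoid log(0) in the tails
    return -np.trapz(fs * np.log(fs), u)

thetas = np.linspace(theta_bar - 2, theta_bar + 2, 41)
H = np.array([entropy_sym(t) for t in thetas])
print("minimizer:", thetas[np.argmin(H)])       # ~ theta_bar
print("H at theta_bar:", entropy_sym(theta_bar),
      "Gaussian entropy:", 0.5 * np.log(2 * np.pi * np.e * sigma**2))
```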


Finally, we can perform a local study of H(f^s_e) around θ = θ̄. Direct calculations give
\[ \nabla f^s_e(u)\big|_{\theta=\bar\theta} = 0 \,, \quad \text{and} \quad \nabla^2 f^s_e(u)\big|_{\theta=\bar\theta} = f''(u) \int_{\mathcal{X}} \nabla\eta(\bar\theta, x)\,[\nabla\eta(\bar\theta, x)]^\top \mu(dx) \,. \]

The entropy of f^s_e satisfies
\begin{align*}
\nabla H(f^s_e) &= -\int_{-\infty}^{\infty} [1 + \log f^s_e(u)]\, \nabla f^s_e(u)\, du \,, \\
\nabla^2 H(f^s_e) &= -\int_{-\infty}^{\infty} \frac{1}{f^s_e(u)}\, \nabla f^s_e(u)\,[\nabla f^s_e(u)]^\top du - \int_{-\infty}^{\infty} [1 + \log f^s_e(u)]\, \nabla^2 f^s_e(u)\, du \,.
\end{align*}

Since (f log f)″ = (1 + log f) f″ + (f′)²/f, we obtain
\[ \nabla H(f^s_e)\big|_{\theta=\bar\theta} = 0 \,, \quad \nabla^2 H(f^s_e)\big|_{\theta=\bar\theta} = M_F(\bar\theta) \,, \tag{7} \]
that is, the entropy H(f^s_e) is locally convex with zero derivative at θ = θ̄. Similar results are obtained for the conditional entropy E_µ{H(f^s_{e,X})}:
\[ \forall x \in \mathcal{X} \,, \quad \nabla f^s_{e,x}(u)\big|_{\theta=\bar\theta} = 0 \,, \quad \nabla^2 f^s_{e,x}(u)\big|_{\theta=\bar\theta} = f''(u)\, \nabla\eta(\bar\theta, x)\,[\nabla\eta(\bar\theta, x)]^\top \,, \]
and
\[ \nabla E_\mu\{H(f^s_{e,X})\}\big|_{\theta=\bar\theta} = 0 \,, \quad \nabla^2 E_\mu\{H(f^s_{e,X})\}\big|_{\theta=\bar\theta} = M_F(\bar\theta) \,. \]

2.2 Estimating the entropy of the residuals

Neither $\bar{H}_n(\theta)$ given by (4), nor H(f^s_e), nor E_µ{H(f^s_{e,X})} can be used as criteria for parameter estimation, since f and θ̄ are unknown. We thus need to define a criterion approaching H(f^s_e). We can construct an estimate of this last quantity by simply plugging a symmetric kernel estimate $\hat f^\theta_n$ of f^s_e into the expression of the exact entropy H(f^s_e). For technical reasons (related to the proof of consistency of the associated estimators) we introduce a truncation to [−A_n, A_n], and the criterion to be minimized for the estimation of θ is
\[ \hat{H}_n(\theta) = -\int_{-A_n}^{A_n} \log[\hat f^\theta_n(u)]\, \hat f^\theta_n(u)\, du \,, \tag{8} \]

where (A_n) is a suitably (slowly) increasing sequence of positive numbers (to be chosen in accordance with the decrease of the bandwidth h_n of the kernel estimate $\hat f^\theta_n$, see [13, 12]). Similarly, when a kernel estimate $f^{i,\theta}_n$ of the conditional distribution $f^s_{e,X_i}$ can be constructed (that is, for designs with replications, see [14]), we can plug it into H to approach E_µ{H(f^s_{e,X_i})}.

A second estimator can be defined based on the empirical estimate $-(1/n)\sum_i \log f(X_i)$, which converges to H(f) by the law of large numbers. We also

plug in the kernel estimate of the density f, yielding the following criterion to be minimized:
\[ \hat{H}_n(\theta) = -\frac{1}{n} \sum_i \log \hat f^\theta_n(X_i) \,. \tag{9} \]
Noting that using cross-validated kernels would reduce the bias for finite samples, we can also define a variant of that last estimator by
\[ \hat{H}_n(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \log \hat f^\theta_{n,i}(X_i) \,, \tag{10} \]
where $\hat f_{n,i}$ uses all but the i-th kernel. Justifications of the estimators defined above can be found in [14]. In what follows we only consider the estimator (10). Extension to (8) and (9) can be obtained following similar steps. Only the case of the location model is considered; extension to a general nonlinear regression model will be discussed elsewhere.
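As an illustration of how (10) can be evaluated and minimized in the location case treated in Section 3 (this is a sketch, not code from the paper; the Gaussian kernel, the bandwidth rule and the grid search are all assumed choices), one may proceed as follows.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_bar, n = 2.0, 200
Y = theta_bar + rng.standard_normal(n)          # location model Y_i = theta_bar + eps_i

K = lambda u: np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)   # Gaussian kernel

def H_hat(theta, h):
    """Criterion (10): -(1/n) sum_i log f_hat^theta_{n,i}(e_i(theta)),
    with f_hat^theta_{n,i} the symmetrized kernel estimate leaving out e_i."""
    e = Y - theta                                # residuals e_i(theta)
    D = (e[:, None] - e[None, :]) / h            # (e_i - e_j)/h
    S = (e[:, None] + e[None, :]) / h            # (e_i + e_j)/h, symmetrized part
    W = 0.5 * (K(D) + K(S))                      # contribution of the pair +/- e_j at e_i
    np.fill_diagonal(W, 0.0)                     # leave-one-out: drop j = i
    f_loo = W.sum(axis=1) / ((n - 1) * h)
    return -np.mean(np.log(f_loo))

h = 1.06 * Y.std() * n ** (-1 / 5)               # assumed bandwidth rule
thetas = np.linspace(theta_bar - 1, theta_bar + 1, 401)
theta_hat = thetas[np.argmin([H_hat(t, h) for t in thetas])]
print("minimum-entropy estimate:", theta_hat)
```

The grid search is used only to keep the sketch short; any scalar minimizer would do once the criterion is available.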

3 Estimation of location

3.1 Model and further assumptions

The observations satisfy Y_i = θ̄ + ε_i, i = 1, ..., n, and are i.i.d. with density f(x − θ̄). We consider the parameters θ ∈ Θ with Θ satisfying the following assumption:

A1 Θ is a compact set with $\Theta \subset \overline{\mathrm{int}(\Theta)}$, θ̄ ∈ int(Θ). We denote by $L = \max_{(\theta,\theta') \in \Theta^2} |\theta - \theta'|$ the diameter of Θ.

For any θ ∈ Θ we form the n residuals e_i(θ) = Y_i − θ, i = 1, ..., n, and use the symmetrized sample −e_i(θ), e_i(θ), which yields the density
\[ f^s_e(x) = f^{s,\theta}_e(x) = \frac{1}{2}\big[ f(x + \bar\theta - \theta) + f(x - \bar\theta + \theta) \big] \]
(f^s_e depends on θ since the residuals are functions of the parameter, but we omit the index in further notations whenever it does not affect clarity). We construct the following estimate of f^s_e(u),
\[ \hat f^\theta_n(u) = \frac{1}{2}\big[ k^\theta_n(u) + k^\theta_n(-u) \big] \,, \]
to be used to compute $\hat{H}_n(\theta)$ given by (10), with
\[ k^\theta_n(u) = \frac{1}{n\, h_n} \sum_{i=1}^{n} K\!\left( \frac{u - e_i(\theta)}{h_n} \right) \tag{11} \]

where K(.) satisfies the following standard conditions (see for instance [15]):

Conditions K

K1 K(.) is symmetric about zero;
K2 $\int_{-\infty}^{\infty} |u|\, K(u)\, du < \infty$;
K3 K is two times continuously differentiable with derivatives of bounded variation;
K4 the bandwidth h_n of the kernels satisfies h_n → 0, n h_n → ∞ as n → ∞.

$\hat f^\theta_n$ is then a kernel density estimate based on the 2n symmetrized residuals ±e_i(θ). A classical choice is the normal density $K(u) = (1/\sqrt{2\pi}) \exp(-u^2/2)$. We also define a smooth truncation for large residuals. Let U_n be such that U_n(z) = U(|z|/A_n − 1) with
\[ U(z) = 1 \ \text{for } z \le 0 \,, \quad U(z) = 0 \ \text{for } z \ge 1 \,, \quad U(z) \ \text{varying smoothly between 0 and 1} \,, \tag{12} \]
with U′(0) = U′(1) = 0, max_z |U′(z)| = d_1 < ∞, max_z |U″(z)| = d_2 < ∞, and (A_n) a sequence of (slowly) increasing positive numbers. We consider a truncated version of the estimator (10) using (12),

n X

θ log fˆn, i [ei (θ)] Un [ei (θ)]

(13)

i=1

θ where fˆn, i is similar to (11), but does not use ei (θ), that is,

¤ 1£ θ θ θ fˆn, k (u) + kn, i (u) = i (−u) 2 n, i with θ kn, i (u) =

1 (n − 1) hn

n X

µ K

j=1, j6=i

u − ej (θ) hn

(14)

¶ , i = 1, . . . , n .

We shall make the following assumptions on f . Conditions F F0 f is symmetric about 0, f (x) > 0 ∀x ∈ IR and f (x) and |f 0 (x)| are bounded. R F1 | log f (x)|f (x − a)dx < ∞ ∀a ∈ [−2L, 2L]. F2 there exists a positive constant C such that for any x > C, f satisfies f (x) < 1, f (x) and |f 0 (x)| are decreasing. 7

F2' |f″(x)| and |f‴(x)| are bounded for x ∈ IR and decreasing for x > C.

F3 there exists a strictly increasing function B such that for all u ∈ IR, B(u) ≥ sup_{|y| …

3.2 A classical approach to adaptivity

A first-order expansion of ∇Ĥ_n between θ̂^n and θ̄ gives
\[ \nabla\hat{H}_n(\hat\theta^n) = 0 = \nabla\hat{H}_n(\bar\theta) + \nabla^2\hat{H}_n\big[\alpha_n \hat\theta^n + (1-\alpha_n)\bar\theta\big]\,(\hat\theta^n - \bar\theta) \,, \]
with α_n ∈ [0, 1] (see [8] who uses a similar approach for LS estimation). (C) then implies asymptotic normality, that is,
\[ \sqrt{n}\,(\hat\theta^n - \bar\theta) \xrightarrow{d} \mathcal{N}\big(0, M_2^{-1} M_1 M_2^{-1}\big) \,. \]
The adaptivity of θ̂^n would then directly follow from $M_2^{-1} M_1 M_2^{-1} = M_F^{-1}(\bar\theta)$, the inverse of the Fisher information matrix (2). One may notice that step (C) allows some freedom in the choice of the function H̄_n(θ), even though it would be natural to pick (4), for which the asymptotic normality $\sqrt{n}\,\nabla\bar{H}_n(\bar\theta) \xrightarrow{d} \mathcal{N}(0, M_1)$ holds under standard assumptions, with M_1 = M_F(θ̄) (asymptotic properties of the ML estimator). Also, according to the review [2], √n-consistency of Ĥ_n(θ) is difficult to obtain, but notice that it is not a prerequisite for √n-consistency of θ̂^n (we only need $\sqrt{n}\,\Delta_n(\bar\theta) \xrightarrow{p} 0$, n → ∞). A key step to prove adaptivity of θ̂^n at step (C) would be to show that
\[ -\frac{2}{\sqrt{n}} \sum_{i=1}^{n} \frac{(k^{\bar\theta}_{n,i})'(-\varepsilon_i)}{k^{\bar\theta}_{n,i}(\varepsilon_i) + k^{\bar\theta}_{n,i}(-\varepsilon_i)}\; U_n(\varepsilon_i) \xrightarrow{d} \mathcal{N}\big(0, I(f)\big) \,, \]
the term on the left-hand side being the major contribution to $\sqrt{n}\,\nabla\hat{H}_n(\bar\theta)$ when Ĥ_n(θ) is given by (13). The conditions required on the functions f, K and U for this to hold are currently under investigation. The rest of this paper is dedicated to proving points A and B.

3.3 Main results

The first result concerns the consistency of the estimator defined by the criterion (13) and is given by the following theorem. Its proof is given in Appendix C.

Theorem 1 Consider observations having the density f(x − θ̄), θ̄ ∈ Θ, with Θ satisfying A1, and with f satisfying conditions F0-F2, F3, F4. Let (δ_n) be a positive sequence such that δ_n → 0, n^γ δ_n → ∞ ∀ γ > 0. Let Ĥ_n be as defined in (13), where the kernels K(.) satisfy K0-K4 with h_n = n^{−α} δ_n² in K4 and where U_n satisfies (12) with B(3A_n) = n^α, with B(.) defined in F3. Then, for α < 1/3, $\hat H_n(\theta) \xrightarrow{\theta,\, p} H(\theta)$, n → ∞.

Since Ĥ_n(θ) is continuous in θ for any n and H(θ̄) < H(θ) for any θ ≠ θ̄, Theorem 1 implies that the estimator minimizing (13) is consistent: $\hat\theta^n \xrightarrow{p} \bar\theta$ when n → ∞. Similarly, with slightly stronger conditions on f, we can prove the following concerning the second order derivative of the criterion. The proof is given in Appendix C.

Theorem 2 Consider observations having the density f(x − θ̄), θ̄ ∈ Θ, with Θ satisfying A1, and with f satisfying all conditions F. Let (δ_n) be a positive sequence such that δ_n → 0, n^γ δ_n → ∞ ∀ γ > 0. Let Ĥ_n be as defined by (13), where the kernels K(.) satisfy K with h_n = n^{−α} δ_n in K4 and where U_n satisfies (12) with B(3A_n) = n^{α/3}, with B(.) defined in F3. Then, for α < 1/7 the criterion Ĥ_n satisfies $\nabla^2\hat H_n(\theta) \xrightarrow{\theta,\, p} \nabla^2 H(\theta)$ as n → ∞.

Note that under the conditions of Theorems 1 and 2,
\[ \nabla^2\hat{H}_n(\hat\theta^n) \xrightarrow{p} \nabla^2 H(\bar\theta) = M_F(\bar\theta) \,, \]
see (7).
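In the location case M_F(θ̄) reduces to the scalar I(f), so Theorems 1 and 2 suggest that the numerical curvature of Ĥ_n near its minimizer should be close to I(f) for large n. A rough Monte Carlo check could look as follows (illustrative assumptions throughout: Gaussian errors with I(f) = 1, the same kernel criterion as in the earlier sketch, a grid search and a central second difference).

```python
import numpy as np

rng = np.random.default_rng(2)
theta_bar, n = 0.0, 500
Y = theta_bar + rng.standard_normal(n)           # standard Gaussian errors: I(f) = 1

K = lambda u: np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def H_hat(theta, h):
    e = Y - theta
    W = 0.5 * (K((e[:, None] - e[None, :]) / h) + K((e[:, None] + e[None, :]) / h))
    np.fill_diagonal(W, 0.0)
    return -np.mean(np.log(W.sum(axis=1) / ((n - 1) * h)))

h = 1.06 * Y.std() * n ** (-1 / 5)
thetas = np.linspace(-0.5, 0.5, 201)
theta_hat = thetas[np.argmin([H_hat(t, h) for t in thetas])]

d = 0.05                                          # step for the central second difference
curv = (H_hat(theta_hat + d, h) - 2 * H_hat(theta_hat, h) + H_hat(theta_hat - d, h)) / d**2
print("theta_hat:", theta_hat, " curvature:", curv, " I(f) = 1")
```

Smoothing and finite-sample effects bias the curvature somewhat, so only rough agreement with I(f) should be expected here.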

4 Conclusion and extensions

Extension of the present approach to general non-linear models should lead to results similar to those given here, and will be developed elsewhere. As explained in [14], two situations can be considered in this case, involving either the conditional entropy of residuals resulting from repetitions at fixed points, or the entropy of all residuals mixed together. The main difference with the location problem is the dependence of the densities on the model function η(θ, X), and thus on the regressors X. The general approach should however remain the same.

The dependence of the kernel estimates (11) or (14) on θ makes the derivation of the asymptotic properties of θ̂^n minimizing (8) or (13) much more difficult than for the minimizer of an approximation of the score function used in ML estimation, or equivalently of the ML criterion (4). In particular, the adaptivity of θ̂^n is still an open question (see point C in section 3.2).

Several methods exist for entropy estimation, and each of them could be used to define a minimum-entropy estimator. Some do not require kernel smoothing of the empirical density of residuals, which could be considered as an advantage over the plug-in estimates used in this paper, see, e.g., [18] for a sample-spacing method and [9] for an approach relying on nearest neighbors (in particular, the latter applies to samples in any dimension k, and could be used for minimum-entropy estimation in multiple regression where Y_i is then a k-dimensional vector). However, investigating the asymptotic properties of the associated minimum-entropy estimators seems a very difficult task. Another direction would be to consider recent developments in parametric estimation via divergence minimization. In a parametric context (which, for regression models, means that f is known), the asymptotically efficient estimator of [3], based on minimizing the Hellinger distance, requires smoothing of the empirical distribution in order to compute its distance to a distribution with a density. On the other hand, the approach used in [5, 6] is based on a duality property that permits the estimation of the divergences of interest without requiring smoothing. The application to the semi-parametric problem considered in this paper is an open and motivating issue.
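To illustrate the spacing-based alternative mentioned above, the sketch below replaces the plug-in entropy of the symmetrized residuals by a classical m-spacing (Vasicek-type) estimate and minimizes it over θ; this particular construction is only an example and is not necessarily the estimator of [18].

```python
import numpy as np

def spacing_entropy(x, m=None):
    """m-spacing estimate of H(f) from a sample x (Vasicek-type construction)."""
    x = np.sort(np.asarray(x))
    n = len(x)
    if m is None:
        m = max(1, int(np.sqrt(n)))
    upper = np.minimum(np.arange(n) + m, n - 1)
    lower = np.maximum(np.arange(n) - m, 0)
    spacings = np.clip(x[upper] - x[lower], 1e-12, None)
    return np.mean(np.log(n / (2 * m) * spacings))

rng = np.random.default_rng(3)
theta_bar = 1.5
Y = theta_bar + rng.standard_normal(400)

def crit(theta):
    e = Y - theta
    return spacing_entropy(np.concatenate([e, -e]))   # entropy of symmetrized residuals

thetas = np.linspace(0.5, 2.5, 201)
print("spacing-based estimate:", thetas[np.argmin([crit(t) for t in thetas])])
```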

Appendix

Appendix A: useful expressions and inequalities

We indicate below the expressions of the first two derivatives of the true density of the residuals and of the kernel density estimators of f^s_e, $\hat f^\theta_n(e_i(\theta))$ and $\hat f^\theta_{n,i}(e_i(\theta))$, w.r.t. θ, along with an upper bound on their distances. We also give the first two derivatives of the kernel estimator
\[ \hat g_n(x) = \frac{1}{n h_n} \sum_{i=1}^{n} K\!\left( \frac{x - Y_i}{h_n} \right) \tag{15} \]

of the true density f w.r.t. x. For e_i(θ) = Y_i − θ, we have:
\begin{align}
\frac{\partial}{\partial\theta}\{f^s_e(e_i(\theta))\} &= -f'(Y_i + \bar\theta - 2\theta) \tag{16} \\
\frac{\partial^2}{\partial\theta^2}\{f^s_e(e_i(\theta))\} &= 2 f''(Y_i + \bar\theta - 2\theta) \tag{17} \\
\frac{\partial}{\partial\theta}\big\{\hat f^\theta_n(e_i(\theta))\big\} &= -\frac{1}{n h_n^2} \sum_{j=1}^{n} K'\!\left( \frac{Y_i + Y_j - 2\theta}{h_n} \right) \tag{18} \\
\frac{\partial}{\partial\theta}\big\{\hat f^\theta_{n,i}(e_i(\theta))\big\} &= -\frac{1}{(n-1) h_n^2} \sum_{j=1,\, j\ne i}^{n} K'\!\left( \frac{Y_i + Y_j - 2\theta}{h_n} \right) \tag{19} \\
\frac{\partial^2}{\partial\theta^2}\big\{\hat f^\theta_n(e_i(\theta))\big\} &= \frac{2}{n h_n^3} \sum_{j=1}^{n} K''\!\left( \frac{Y_i + Y_j - 2\theta}{h_n} \right) \tag{20} \\
\frac{\partial^2}{\partial\theta^2}\big\{\hat f^\theta_{n,i}(e_i(\theta))\big\} &= \frac{2}{(n-1) h_n^3} \sum_{j=1,\, j\ne i}^{n} K''\!\left( \frac{Y_i + Y_j - 2\theta}{h_n} \right) \tag{21} \\
\hat g_n'(x) &= \frac{1}{n h_n^2} \sum_{j=1}^{n} K'\!\left( \frac{x - Y_j}{h_n} \right) \tag{22} \\
\hat g_n''(x) &= \frac{1}{n h_n^3} \sum_{j=1}^{n} K''\!\left( \frac{x - Y_j}{h_n} \right) \tag{23}
\end{align}
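Expressions such as (18) are easy to check numerically; the sketch below (an illustrative verification with arbitrary simulated data and a Gaussian kernel) compares (18) with a central finite difference of $\hat f^\theta_n(e_i(\theta))$ in θ.

```python
import numpy as np

rng = np.random.default_rng(4)
n, h, theta = 50, 0.4, 0.3
Y = rng.standard_normal(n)

K  = lambda u: np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
Kp = lambda u: -u * K(u)                       # K'(u) for the Gaussian kernel

def f_hat(i, th):
    """Symmetrized kernel estimate f_hat^theta_n evaluated at e_i(theta) = Y_i - theta."""
    e = Y - th
    return np.mean(0.5 * (K((e[i] - e) / h) + K((e[i] + e) / h))) / h

i = 7
exact = -np.sum(Kp((Y[i] + Y - 2 * theta) / h)) / (n * h**2)        # formula (18)
d = 1e-5
numeric = (f_hat(i, theta + d) - f_hat(i, theta - d)) / (2 * d)     # central difference
print(exact, numeric)                          # the two values should agree closely
```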

Notice that
\[ \hat f^\theta_n(x) - \hat f^\theta_{n,i}(x) = \frac{1}{2 n h_n} \left[ K\!\left( \frac{x - e_i(\theta)}{h_n} \right) + K\!\left( \frac{x + e_i(\theta)}{h_n} \right) \right] - \frac{\hat f^\theta_{n,i}(x)}{n} \,, \]
yielding
\[ \sup_{i,\,\theta} \big| \hat f^\theta_n(e_i(\theta)) - \hat f^\theta_{n,i}(e_i(\theta)) \big| \le \frac{K(0)}{n h_n} + \frac{K(0)}{n h_n} = \frac{2 K(0)}{n h_n} \,. \tag{24} \]
In the same way we obtain:
\[ \sup_{i,\,\theta} \left| \frac{\partial}{\partial\theta}\big\{\hat f^\theta_n(e_i(\theta))\big\} - \frac{\partial}{\partial\theta}\big\{\hat f^\theta_{n,i}(e_i(\theta))\big\} \right| \le \frac{K'(0)}{n h_n^2} + \frac{K'(0)}{n h_n^2} = \frac{2 K'(0)}{n h_n^2} \tag{25} \]
and
\[ \sup_{i,\,\theta} \left| \frac{\partial^2}{\partial\theta^2}\big\{\hat f^\theta_n(e_i(\theta))\big\} - \frac{\partial^2}{\partial\theta^2}\big\{\hat f^\theta_{n,i}(e_i(\theta))\big\} \right| \le \frac{2 K''(0)}{n h_n^3} + \frac{2 K''(0)}{n h_n^3} = \frac{4 K''(0)}{n h_n^3} \,. \tag{26} \]

Appendix B: two useful lemmas

In this appendix we give two lemmas that we refer to in the proofs of Theorems 1 and 2.

Lemma 2 (Dmitriev and Tarasenko, 1973, p.632) If f and its first r+1 derivatives are bounded, and {ε_n} is a sequence of positive numbers such that h_n = o(ε_n), there exists a positive constant c < ∞ such that for sufficiently large n,
\[ P\left\{ \sup_y \big| \hat g_n^{(r)}(y) - f^{(r)}(y - \bar\theta) \big| > \varepsilon_n \right\} \le \frac{c}{n h_n^{2r+1} \varepsilon_n^2} \,. \]

Lemma 3 (Newey, 1991, Corollary 3.1) Consider a sequence of functions q_t(Y_t, θ) of the observations Y_t and parameter vector θ ∈ Θ. Define $\hat Q_n(\theta) = \sum_{t=1}^{n} q_t(Y_t, \theta)/n$ and $\bar Q_n(\theta) = \sum_{t=1}^{n} E[q_t(Y_t, \theta)]/n$. Suppose that Θ is a compact metric space and that for each θ ∈ Θ, $\hat Q_n(\theta) - \bar Q_n(\theta) = o_p(1)$. Let h(.) denote a function h : [0, ∞) → [0, ∞) with h(0) = 0 and h continuous at 0, and suppose there exists a function b_t(Y_t) such that $\sum_{t=1}^{n} E[b_t(Y_t)]/n = O(1)$ and $|q_t(Y_t, \tilde\theta) - q_t(Y_t, \theta)| \le b_t(Y_t)\, h(d(\tilde\theta, \theta))$ for $\tilde\theta, \theta \in \Theta$. Then $\sup_{\theta\in\Theta} |\hat Q_n(\theta) - \bar Q_n(\theta)| = o_p(1)$.

We shall use this lemma with $h(d(\tilde\theta, \theta)) = d(\tilde\theta, \theta) = |\tilde\theta - \theta|$.

Appendix C: proofs of Theorems 1 and 2

Proof of Theorem 1. The proof is based on Lemmas 2 and 3. We can break down the difference $\Delta_n(\theta) = \hat H_n(\theta) - H(\theta)$ into three separate terms, $\Delta_n(\theta) = \Delta^1_n(\theta) + \Delta^2_n(\theta) + \Delta^3_n(\theta)$, with
\begin{align*}
\Delta^1_n(\theta) &= -\frac{1}{n} \sum_{i=1}^{n} \Big[ \log \hat f^\theta_{n,i}(e_i(\theta)) - \log f^s_e(e_i(\theta)) \Big]\, U_n(e_i(\theta)) \\
\Delta^2_n(\theta) &= -\frac{1}{n} \sum_{i=1}^{n} \log f^s_e(e_i(\theta))\, \big[ U_n(e_i(\theta)) - 1 \big] \\
\Delta^3_n(\theta) &= -\frac{1}{n} \sum_{i=1}^{n} \log f^s_e(e_i(\theta)) + \int f^s_e(x)\, \log f^s_e(x)\, dx
\end{align*}

a) Uniform convergence of Δ¹_n(θ) to 0. Applying Lemma 2, for h_n = o(ε_n), there exists a positive constant c such that for ĝ_n as defined by (15),
\[ P\left\{ \sup_z |\hat g_n(z) - f(z - \bar\theta)| > \varepsilon_n \right\} \le \frac{c}{n h_n \varepsilon_n^2} \,. \]
Since the convergence is uniform over Θ, shifting the densities ĝ_n and f by θ does not change the last probability, and we can also write
\[ P\left\{ \sup_{x,\,\theta} |\hat f^\theta_n(x) - f^s_e(x)| > \varepsilon_n \right\} \le \frac{c}{n h_n \varepsilon_n^2} \,, \]
which yields, for ε′_n = ε_n + 2K(0)/(n h_n), see (24),
\[ P\left\{ \sup_{x,\,\theta,\, i} |\hat f^\theta_{n,i}(x) - f^s_e(x)| > \varepsilon'_n \right\} \le \frac{c}{n h_n \varepsilon_n^2} \,. \tag{27} \]

Suppose now that $\sup_{x,\,\theta,\,i} |\hat f^\theta_{n,i}(x) - f^s_e(x)| < \varepsilon'_n$. We have

\begin{align*}
\sup_\theta |\Delta^1_n(\theta)| &\le \sup_{\theta,\, i} \Big| \log \hat f^\theta_{n,i}(e_i(\theta)) - \log f^s_e(e_i(\theta)) \Big|\, U_n(e_i(\theta)) \\
&\le \sup_{\theta,\, i,\, |x| < 2A_n} \Big| \log \hat f^\theta_{n,i}(x) - \log f^s_e(x) \Big| \,.
\end{align*}
For |x| < 2A_n, |x ± (θ − θ̄)| < 3A_n and f^s_e(x) > 1/B(3A_n) = 1/B_n for n large enough (since A_n → ∞ and B is increasing). Recalling that |log X| < |1/X − 1| + |X − 1|, we thus have
\[ \sup_\theta |\Delta^1_n(\theta)| \le \sup_{\theta,\, i,\, |x| < 2A_n} \big| \hat f^\theta_{n,i}(x) - f^s_e(x) \big| \left[ \frac{1}{\hat f^\theta_{n,i}(x)} + \frac{1}{f^s_e(x)} \right] . \]
For x > C + L,
\[ h(x) = \sup_{|z| \le L} \left| \log \frac{f(x-z) + f(x+z)}{2} \right| f(x-z) < |\log f(x+L)|\, f(x-L) \,. \]
For |x| < C + L, we have h(x) …

\[ P\left\{ \sup_z |\hat g'_n(z) - f'(z - \bar\theta)| > \varepsilon_{n1} \right\} \le \frac{c_1}{n h_n^3 \varepsilon_{n1}^2} \,. \]
Using (16), (18) and (22), we can also write
\[ P\left\{ \sup_{i,\,\theta} \Big| \frac{\partial}{\partial\theta}\big\{\hat f^\theta_n(e_i(\theta))\big\} - \frac{\partial}{\partial\theta}\big\{f^s_e(e_i(\theta))\big\} \Big| > \varepsilon_{n1} \right\} \le \frac{c_1}{n h_n^3 \varepsilon_{n1}^2} \,, \]
which yields, for ε′_{n1} = ε_{n1} + 2K′(0)/(n h_n²) (see (25)),
\[ P\left\{ \sup_{i,\,\theta} \Big| \frac{\partial}{\partial\theta}\big\{\hat f^\theta_{n,i}(e_i(\theta))\big\} - \frac{\partial}{\partial\theta}\big\{f^s_e(e_i(\theta))\big\} \Big| > \varepsilon'_{n1} \right\} \le \frac{c_1}{n h_n^3 \varepsilon_{n1}^2} \,. \]
Suppose now that
\[ \sup_{i,\,\theta} \Big| \frac{\partial}{\partial\theta}\big\{\hat f^\theta_{n,i}(e_i(\theta))\big\} - \frac{\partial}{\partial\theta}\big\{f^s_e(e_i(\theta))\big\} \Big| < \varepsilon'_{n1} \,. \tag{28} \]

From here, supposing, as in the proof of Theorem 1, that $\sup_{i,\,\theta} |\hat f^\theta_{n,i}(e_i(\theta)) - f^s_e(e_i(\theta))| < \varepsilon'_n = \varepsilon_n + 2K(0)/(n h_n)$, the calculations are similar to those of point (a) for the term $E_{n1}(\theta)$. We have
\begin{align*}
\sup_\theta \big|F_{n1}(\theta)\big| &\le \sup_{i,\,\theta} \left| \left( \frac{\frac{\partial}{\partial\theta}\hat f^\theta_{n,i}(e_i(\theta))}{\hat f^\theta_{n,i}(e_i(\theta))} \right)^{\!2} - \left( \frac{\frac{\partial}{\partial\theta}\{f^s_e(e_i(\theta))\}}{f^s_e(e_i(\theta))} \right)^{\!2} \right| U_n(e_i(\theta)) \\
&\le \sup_{i,\,\theta} \left| \left( \frac{\partial}{\partial\theta}\{\hat f^\theta_{n,i}(e_i(\theta))\} \right)^{\!2} - \left( \frac{\partial}{\partial\theta}\{f^s_e(e_i(\theta))\} \right)^{\!2} \right| \frac{U_n(e_i(\theta))}{\big|\hat f^\theta_{n,i}(e_i(\theta))\big|^2} \\
&\quad + \sup_{i,\,\theta} \left| \frac{1}{\big(\hat f^\theta_{n,i}(e_i(\theta))\big)^2} - \frac{1}{\big(f^s_e(e_i(\theta))\big)^2} \right| \left| \frac{\partial}{\partial\theta}\{f^s_e(e_i(\theta))\} \right|^2 U_n(e_i(\theta)) \\
&\le \sup_{i,\,\theta} \left| \frac{\partial}{\partial\theta}\{\hat f^\theta_{n,i}(e_i(\theta))\} - \frac{\partial}{\partial\theta}\{f^s_e(e_i(\theta))\} \right| \left| \frac{\partial}{\partial\theta}\{\hat f^\theta_{n,i}(e_i(\theta))\} + \frac{\partial}{\partial\theta}\{f^s_e(e_i(\theta))\} \right| \frac{U_n(e_i(\theta))}{\big|\hat f^\theta_{n,i}(e_i(\theta))\big|^2} \\
&\quad + \sup_{i,\,\theta} \left| \frac{1}{\big(\hat f^\theta_{n,i}(e_i(\theta))\big)^2} - \frac{1}{\big(f^s_e(e_i(\theta))\big)^2} \right| \big|f'(e_i(\theta) + \bar\theta - \theta)\big|^2\, U_n(e_i(\theta)) \\
&\le \sup_{i,\,\theta}\, \varepsilon'_{n1} \left( \left| \frac{\partial}{\partial\theta}\{\hat f^\theta_{n,i}(e_i(\theta))\} \right| + \left| \frac{\partial}{\partial\theta}\{f^s_e(e_i(\theta))\} \right| \right) \frac{U_n(e_i(\theta))}{\big|\hat f^\theta_{n,i}(e_i(\theta))\big|^2} \\
&\quad + \sup_{i,\,\theta} \left| \frac{1}{\big(\hat f^\theta_{n,i}(e_i(\theta))\big)^2} - \frac{1}{\big(f^s_e(e_i(\theta))\big)^2} \right| \big|f'(e_i(\theta) + \bar\theta - \theta)\big|^2\, U_n(e_i(\theta)) \\
&\le \sup_{i,\,\theta}\, \varepsilon'_{n1} \big( \varepsilon'_{n1} + 2\big|f'(e_i(\theta) + \bar\theta - \theta)\big| \big) \frac{U_n(e_i(\theta))}{\big|\hat f^\theta_{n,i}(e_i(\theta))\big|^2} \\
&\quad + \sup_{i,\,\theta} \left| \frac{1}{\big(\hat f^\theta_{n,i}(e_i(\theta))\big)^2} - \frac{1}{\big(f^s_e(e_i(\theta))\big)^2} \right| \big|f'(e_i(\theta) + \bar\theta - \theta)\big|^2\, U_n(e_i(\theta)) \\
&\le \sup_{i,\,\theta,\, |z|} \frac{1}{\big|\hat f^\theta_{n,i}(z)\big|^2}\, \varepsilon'_{n1} \big( \varepsilon'_{n1} + 2\big|f'(z + \bar\theta - \theta)\big| \big) \;\cdots
\end{align*}