Estimation in nonlinear functional error-in-variables models

Silvelyn Zwanzig
University of Hamburg
Institute of Mathematical Stochastics
Bundesstrasse 55, D-20146 Hamburg

Habilitationsschrift
November 1997
Contents

1 Introduction
  1.1 Implicit and explicit models
  1.2 Structural and functional models
  1.3 Review of Literature
    1.3.1 Linear models
    1.3.2 Generalized linear models
    1.3.3 Nonlinear functional relations
    1.3.4 Structural nonlinear models
    1.3.5 Inconsistency of the functional m.l.e.
    1.3.6 Numerical algorithms
    1.3.7 M-estimators
    1.3.8 Additional information
    1.3.9 Nonparametric relationship
  1.4 Examples
    1.4.1 Continuous culture in microbiology
    1.4.2 Astrometric analysis of Schmidt plates
    1.4.3 Polymerization
  1.5 Scope and content of the paper

2 Notation
  2.1 Scalar products, norms, empirical measure
  2.2 Derivatives

I Estimation in the parametric error-in-variables model

3 Parametric model
  3.1 Nonlinear functional relation model
  3.2 Repeated observation model
  3.3 Model with known order of design points
  3.4 Nonlinear semiparametric functional relation model
  3.5 Vector model

4 Estimation of the parameter of interest
  4.1 Estimating functions and classes of estimates
  4.2 Minimum contrast estimates
  4.3 Maximum likelihood estimates
    4.3.1 Maximum likelihood and minimum contrast estimates
    4.3.2 Estimating functions and maximum likelihood
    4.3.3 Estimates under a factorization condition
    4.3.4 Reparameterization

5 Least squares estimator
  5.1 Definition of the l.s.e.
  5.2 Geometrical interpretation
  5.3 Naive least squares estimator
  5.4 Estimating functions and least squares
  5.5 Contrast functions and least squares

6 Auxiliary results

7 Consistency of the l.s.e.
  7.1 Consistency of the l.s.e. under vanishing variances
    7.1.1 Consistency of the l.s.e. in the repeated observation model
    7.1.2 Consistency of the l.s.e. under σ² → 0
  7.2 Consistency of the l.s.e. under an entropy condition
  7.3 Discussion of the entropy condition
  7.4 Consistency in special models
    7.4.1 Consistency of the l.s.e. in the repeated observation model
    7.4.2 Consistency of the l.s.e. in the semiparametric functional relation model
    7.4.3 Consistency of the l.s.e. in the functional relation model with known order of the design points

8 Inconsistency of the l.s.e.
  8.1 Nuisance parameters
  8.2 Weighted l.s.e. in linear models
  8.3 Parameter of interest

9 Asymptotic normality of the l.s.e.
  9.1 Assumptions
  9.2 The main result
  9.3 Proof of the asymptotic normality
    9.3.1 The formal stochastic expansion of the l.s.e. for the parameter of interest
    9.3.2 Application of the normal approximation to the leading term
    9.3.3 Estimation of the remainder term in the formal expansion
    9.3.4 Derivation of the approximation bound
  9.4 Outlook

10 Alternative estimators
  10.1 The main idea
  10.2 The approximate alternative estimator
  10.3 Consistency
  10.4 Asymptotic normality
  10.5 The corrected least squares estimator
    10.5.1 Polynomial functional relation model
    10.5.2 Exponential model
    10.5.3 Gaussian regression curve
    10.5.4 Laplace distribution
    10.5.5 Application of the Fourier transform method
  10.6 On a c-MCE for rational regression functions

11 Efficiency
  11.1 A minimax bound of Hajek type
  11.2 Comparison of the estimators
    11.2.1 Efficiency of the l.s.e.
    11.2.2 Inefficiency of the alternative estimator
    11.2.3 Outlook
  11.3 Efficiency in the replication model

II Estimation in the nonparametric model

12 Orthogonal series estimators
  12.1 Introduction
  12.2 Model assumptions
  12.3 Construction of orthogonal series estimators

13 Location Submodel

14 Consistency

15 Rate of Convergence
  15.1 Lower bound
    15.1.1 Discussion of the rate

16 Appendix
Chapter 1

Introduction

Error-in-variables models are used to describe situations in which different series of random variables are observed whose mean values are coupled by a function. They differ from regression models in one central aspect. In setting up a regression model one has to decide which variables are independent and which are dependent, and the results of regression theory do depend on this choice. In practice it is often very difficult to decide which way the dependence goes. This problem is avoided in error-in-variables models. Generally one could say that whenever it is not clear which variable is the independent one, error-in-variables models are appropriate. The observations are denoted by

(yi, xi),  i = 1, ..., n,        (1.1)

with expected values

E yi = ηi   and   E xi = ξi.        (1.2)

The aim of the experiment is not the estimation of the unknown expected values (ηi, ξi) themselves. The only issue is inference about the relationship between the expected values on the basis of the observations (1.1).
1.1 Implicit and explicit models
First let us describe some different types of error-in-variables models. The implicit form of the error-in-variables model is given by

f (ηi, ξi) = 0.        (1.3)

Equation (1.3) illustrates the equilibrium between both series of random variables yi, i = 1, ..., n, and xi, i = 1, ..., n. The implicit model can also be written in an explicit form, such that

ηi = g (ξi).        (1.4)
The difference between explicit and implicit models comes into being at the very moment that model assumptions on the unknown relationship are made. In parametric inference one works with known classes of functions depending on an unknown parameter of fixed dimension:

f ∈ M_impl = {f (., ., β) : β ∈ Θ ⊆ Rᵖ}        (1.5)

or

g ∈ M_expl = {g (., β) : β ∈ Θ ⊆ Rᵖ}.        (1.6)

If the equation

f (ηi, ξi, β) = 0,  for all β ∈ Θ,  i = 1, ..., n,        (1.7)

can be solved explicitly for ηi, f (ηi, ξi, β) = ηi − g (ξi, β), then both models coincide. In the following we will use the explicit form (1.4) only and impose conditions, to be specified below, on the function g. In Part I we will consider the parametric case (1.6). In Part II we will treat the nonparametric case, where g is a member of a known Hölder class of functions with smoothness degree ν,

g ∈ Mν ⊂ LG ([0, 1]).        (1.8)
If we now introduce unobservable additive error variables ε1i, ε2i, such that

ε1i = yi − ηi   and   ε2i = xi − ξi,

we have the following explicit model equations:

yi = g (ξi) + ε1i        (1.9)
xi = ξi + ε2i.           (1.10)

The first equation (1.9) describes a regression model with

E (yi /ξi) = g (ξi).        (1.11)

In the second equation (1.10) one has the error in the variables: the "independent" variables ξi are observed only with a measurement error ε2i.
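To fix ideas, the following minimal simulation sketch generates data from the explicit functional model (1.9), (1.10); the regression function, the parameter values and the error variances are placeholders chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 100
xi = np.linspace(0.5, 3.0, n)             # fixed, unobserved design points xi_i
beta = (1.0, 0.8)                         # placeholder parameter of interest

def g(t, beta):
    return beta[0] * np.exp(beta[1] * t)  # placeholder regression function g(., beta)

sigma1, sigma2 = 0.2, 0.1                 # assumed error standard deviations
y = g(xi, beta) + sigma1 * rng.standard_normal(n)   # equation (1.9)
x = xi + sigma2 * rng.standard_normal(n)             # equation (1.10): xi_i observed with error
# only (y, x) are available to the statistician; xi and beta are unknown
```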
1.2 Structural and functional models
Similarly to regression theory we have both types of models, the random coefficient model with

ξi ∼ Pξ  i.i.d.        (1.12)

and the fixed coefficient model with design points

ξ1, ..., ξn  not random.        (1.13)

Kendall (1952) introduced the terminology for both models in [95] and [96]. He called the random coefficient model (1.12) the structural and the fixed coefficient model (1.13) the functional one. To unify the approach, Dolby (1976), [44], considered the general ultrastructural model with

ξi ∼ Pi.        (1.14)

If Var (ξi) = 0, this yields the functional model. In the case of independent and identical distributions we obtain the structural model (1.12). In this context one should also mention the papers of Gleser (1985), [67], and Srivastava and Shalabh (1997), [154]. In their book Carroll, Ruppert and Stefanski (1995), [31], distinguished between two different types of inference: functional and structural modeling. In the former they generalized the usual notion of functional modeling in that they allowed the ξi, i = 1, ..., n, to be random, provided only minimal assumptions are made about the distribution Pξ. In the latter case the inference is mainly based on the distribution Pξ. Under (1.12) the observations (1.1) have a distribution Pyi xi of mixture type

Pyi xi (A) = ∫ Pyi xi ξi (A) dPξ,        (1.15)

which is called Model II in the papers [14], [15], [16] by Bhanja and Ghosh (1992). The functional model is related to the conditional distribution Pyi xi (./ξi). The unknown expected values ξi of xi play the role of nuisance or incidental parameters. Bhanja and Ghosh denoted (1.13) as Model I. The main characteristic of Model I is that the number of nuisance parameters increases with the sample size. The estimation theory deals with the problem of eliminating the nuisance parameters. Different approaches and their applicability to the nonlinear parametric model are discussed in Chapter 4. A detailed study of the relation between these types of general models is given by Pfanzagl (1993), [134].
In the remainder of this paper we are only interested in the functional case (1.13). It is in some sense the stronger model. If we have a convergence result for arbitrary ξi in (1.13), we will also have one for the structural model (1.15). On the other hand, it is possible to derive minimax efficiency bounds in the functional model from the structural one by virtue of

max_{ξi} Pyi xi (A/ξi) ≥ ∫ Pyi xi ξi (A) dPξi.

The Bayesian approach lies between both types of models. Reilly and Patino-Leal (1981), [137], introduced the Bayesian idea in this context and calculated the joint posterior distribution for the nuisance parameter and the parameter of interest. The marginal distribution of the latter then gives an estimating criterion for the parameter of interest. This approach is adapted to the case with replicated observations and unknown error structure by Keller and Reilly (1991), [93]. In Chapter 8 of their book, Carroll, Ruppert and Stefanski (1995), [31], gave an overview of these methods and discussed their application.
1.3 Review of Literature

In this paper the main topic is the estimation of the functional explicit relation without additional data sets. Nevertheless let us briefly discuss the results and ideas given in the literature for general measurement error models, also for implicit or structural ones. This makes the review quite extensive, as it contains many results on structural models as well. However, many ideas carry over since the models, though different, are closely related. The theory of error-in-variables models has a long history. A short historical review is given by Sprent (1990), [153]. There he suspected Gauss of having had such models in mind when proposing the least squares procedure! Considerable progress has been made in constructing consistent and efficient estimators for the regression slopes in linear error-in-variables models. Progress has been much slower for nonlinear models, especially in the functional case.
1.3.1 Linear models

It is fair to say that the main theory for the linear relationship ηi = α + βξi is well developed for structural and functional and for multivariate models; see for instance Chapter 29 of Kendall and Stuart (1979), [97], the book of Fuller (1987), [58], the chapter by Nussbaum in Humak (1983), [85], the textbook by Schneeweiß and Mittag (1986), [145], the 1982 Wald Memorial Lecture of T. W. Anderson (1982), [10], the survey paper by Chi-Lun Cheng and Van Ness (1994), [36], the lecture notes in statistics by Nagelkerke (1992), [126], and the "proefschrift" by Hillegers (1987), [80]. In the linear case efficient estimators for β exist also in the case where the incidental parameters are not consistently estimable. Estimating the linear relationship can also be considered as a special case of efficient and adaptive estimation in semiparametric models and is therefore included in the books of Bickel et al. (1993), [19], and van der Vaart (1987), [171]. The linear structural model with unknown Pξ is an example of the results of van der Vaart (1996), [172], on efficient m.l.e. in semiparametric mixture models. Consistency results for the m.l.e. in the linear semiparametric structural model with unknown Pξ are given in Murphy and van der Vaart (1996), [125]. This paper also deals with the linear functional model, where the empirical measure generated by the nuisance parameters lies in a neighborhood of the unknown Pξ.
1.3.2 Generalized linear models

The main characteristic of the generalized linear model is that the unknown parameter β and ξi are involved in the relationship only through βᵀξi, such that (1.7) is of the form

f (ηi, βᵀξi) = 0,  for all β ∈ Θ,  i = 1, ..., n.

In the literature generalized linear models are introduced by the relation

yi = h (βᵀξi, ε1i),        (1.16)

where h is called the link function, which can be known or unknown. If the error ε1i is additive, then the model (1.16) corresponds to the explicit generalized linear model

g (ξi, β) = h (βᵀξi).        (1.17)

The generalized linear model has become a widely used tool of data analysis in the structural and in the functional context. The models without error in the variables are the scope of the book of McCullagh and Nelder (1994), [122]. Under structural model assumptions Armstrong (1985), [11], proposed a transformation method for the construction of maximum quasi likelihood estimators in (1.17). Schafer (1987), [143], followed up the suggestions of [11] and used an EM algorithm to obtain estimators of "pseudo maximum likelihood" type.

One important special case of (1.17) is the logistic regression model, where the link function is

h (t) = 1 / (1 + exp (−t))        (1.18)

and the response yi ∈ {0, 1} is a Bernoulli variable. Carroll et al. (1984), [33], investigated the conditional maximum likelihood estimator in the structural case in a Monte Carlo study. Stefanski (1989), [156], proved the inconsistency of the functional maximum likelihood estimator in (1.18) and proposed a general strategy for reducing measurement-error-induced bias. Under conditions that are appropriate when the measurement error is small, Stefanski and Carroll (1985), [161], introduced estimators for the model (1.18). The same authors (1990), [160], studied the model (1.18) in the structural context with normal distribution Pξ and with unknown distribution Pξ of the ξi and compared the asymptotic efficiencies. In the case of unknown Pξ the existence of a validation study, see (1.28), is assumed. Following the ideas of Gatto and Ronchetti (1996), [62], Gatto (1996), [61], derived saddlepoint approximations for the density of the bias corrected estimator of Stefanski and Carroll (1985), [161], in the binary logistic regression model.

The basic idea of the generalized linear model, when the regressor variable is free of error, is explained in the paper of Brillinger (1977), [24]. He showed that the naive ordinary linear least squares estimator provides useful estimates of the coefficients of the linear combination up to a constant of proportionality for arbitrary unknown link functions h. The idea of Brillinger (1977) is extended to error-in-variables models by Carroll and Li (1992), [30]. They considered a structural generalized linear error-in-variables model (1.16) with unknown link function h. The common conditional n-dimensional distribution of (yi)_{i=1,...,n} given (ξi)_{i=1,...,n} depends on (ξi)_{i=1,...,n} only through a small number p of projected variables. The idea is to find the space spanned by β without parametric or nonparametric model fitting. Carroll and Li (1992), [30], developed techniques for estimating the regression slope up to a constant of proportionality when ξi is subject to error. Under the symmetry condition on the measurement error distribution,

E (bᵀξi /βᵀξi) = α + βᵀξi,

they showed that the inverse regression curve E (xi /yi) degenerates to a straight line. Standard linear least squares regression methods are used to estimate this line. Stefanski and Carroll (1990), [164], constructed tests for the hypothesis of no relationship, based on a paper of Tosteson and Tsiatis (1988), [169]. The same authors, Carroll and Stefanski (1987), [162], derived the efficient score functions for such models in the sense of Bickel and Ritov (1987), [18]. Under the additional knowledge of instrumental variables Stefanski and Buzas (1995), [159], proposed instrumental variable estimators in the binary regression model.
1.3.3 Nonlinear functional relations

An essential assumption which is often made in the nonlinear case is that of repeated observations,

yij = ηi + ε1ij        (1.19)
xij = ξi + ε2ij        (1.20)

with j = 1, ..., ri and i = 1, ..., n. Under these assumptions the relation between the number of observations N = Σ_{i=1}^n ri and the number of nuisance parameters n is much more favorable. One early paper dealing with the nonlinear parametric case is that of Villegas (1969), [174]. He considered a model (1.19), (1.20) with a fixed number of design points n and an increasing number of replications r at each point and showed the consistency of the least squares estimator. Thus he avoided the problem of an increasing number of nuisance parameters. Note that this was the same year the paper by Jennrich (1969), [91], was published, which together with Malinvaud (1970), [118], marks the beginning of the asymptotic theory for the nonlinear regression model.

The estimating equations for the maximum likelihood estimators are derived under normal distribution assumptions by Dolby (1972), [42], for univariate nonlinear functional relationships and by Dolby and Freedman (1975), [45], for the multivariate case. In Dolby and Lipton (1972), [46], a discussion of the convergence of the proposed iterative method for the calculation of the maximum likelihood estimator is included. Dolby (1976), [43], established the connection between the method of Britt and Luecke (1973), [25], for the implicit functional model and the method of Dolby (1972), [42], for the explicit model. Egerton and Laycock (1979), [48], proposed an improvement of the iterative method of Dolby (1972), [42], and Dolby and Freedman (1975), [45]. A summarized representation of those methods is included in the chapter on errors-in-variables models of the book of Seber and Wild (1989), [149], in Hillegers (1986), [80], in Humak (1983), [85], and in Nagelkerke (1992), [126].

The first paper with consistency results in a nonlinear explicit functional model with an increasing number of design points n is that by Wolter and Fuller (1982), [178]. They assumed balanced replications ri = r, such that rn → ∞, and presented an iterative estimation procedure. Starting from the naive estimator β at the average design points x̄i = (1/r) Σ_{j=1}^r xij, they improved β by applying the first iteration step for solving the maximum likelihood equation. In a further paper Amemiya and Fuller (1988), [8], considered the parametric implicit model (1.5) with repeated observations and showed the consistency of the maximum likelihood estimator for β for r → ∞. Under the assumption that the number of replications r increases faster than the number of design points n, such that r/n → ∞, the maximum likelihood estimator is asymptotically normally distributed. For the same model Schnell (1990), [146], presented a maximum likelihood estimator for the covariance matrix and a likelihood ratio test based on this estimator.

Because of the unknown design points, even polynomial models are nonlinear. Wolter and Fuller (1982), [179], constructed a consistent estimator for the quadratic functional relationship, which uses the knowledge of the error variance. In deriving this result no replication is assumed. They applied this estimator to a data set of the locations of earthquakes near the Tonga trench. Wolter and Fuller's (1982), [179], estimator was extended to the polynomial error-in-variables model of arbitrary order p, that is

g (ξi, β) = Σ_{k=0}^{p} βk (ξi)^k        (1.21)

by Hausman et al. (1991), [76]. The estimator depends on the moments of the measurement errors up to the p-th order. Hausman et al. (1991), [76], applied it to an econometric model and presented an adaptive estimator in the case that additional information on the unknown parameters of the error distribution is available.
1.3.4 Structural nonlinear models

In recent years progress has been made in nonlinear structural models. In the structural case the experiment is i.i.d. with the mixture distribution (1.15). Under conditions that ensure the identification of the parameter β, Kiefer and Wolfowitz (1956), [98], showed the consistency of the maximum likelihood estimator. Their proof is a modification of Wald's, and its fundamental ideas can be found in Wald (1948), [175]. But in contrast to the linear model with normally distributed errors the likelihood function is not given explicitly, even in the case of known Pξ and known error distribution. A notable exception is the binary logistic regression model (1.17), (1.18); see Carroll et al. (1984), [33]. Hsiao (1989), [82], proposed an estimator based on the conditional expected value

E (yi /xi) = G (xi, β).        (1.22)

Under identification conditions he applied the least squares estimator to the nonlinear regression model

yi = G (xi, β) + vi.        (1.23)

Schafer (1990), [144], discussed the problem of finding the explicit representation of G (xi, β). Gleser (1990), [68], showed that an explicit solution exists for the exponential regression model

g (ξi, β) = β1 exp (β2 ξi).        (1.24)

In the corresponding model (1.23) the least squares estimator is consistent. Further, Gleser (1990), [68], considered the case where the conditional expected value and the conditional expected variance are both given only approximately. Then the model (1.23) corresponds to a misspecified nonlinear regression model (see, for example, Gallant (1986), [60], or Zwanzig (1980), [182]) and the least squares estimator is biased.

There is also a number of papers studying the behavior of the naive estimator, that is the estimator of the model without error-in-variables, in the nonlinear error-in-variables model. Griliches and Ringstad (1970), [66], discussed the application of the naive estimator in econometric nonlinear models and concluded that "errors in variables are bad enough in linear models. They are likely disastrous to any attempts to estimate additional nonlinearity or curvature parameters". Gleser (1990), [68], also started with proposals for improvements to the naive approach. Whittemore and Keller (1988), [176], used the quasi-likelihood approach to improve the naive estimator for small measurement errors. The basic idea is that the m.l.e. or m.q.l.e. (maximum quasi likelihood estimator) can be expanded around the naive estimator with remainder terms which are small for small error variances. The approximate quasi likelihood approach is further developed for more general structural models in Carroll and Stefanski (1990), [34]. Cook and Stefanski (1994), [38], proposed an alternative general method relying on computer simulation and extrapolation (SIMEX) for improving the naive estimator. The main idea is to simulate new observations with additional measurement errors and to trace the curve of the naive estimators as a function of the measurement error variance. Extrapolating this curve back to zero measurement error variance then gives the desired new estimator (a schematic sketch is given at the end of this subsection). The asymptotic properties of SIMEX are studied for small measurement errors in the nonlinear structural relation model in Carroll et al. (1996), [29]. In the monograph of Carroll, Ruppert and Stefanski (1995), [31], Chapter 4 is dedicated to this method. A graphical method for detecting nonlinearity in replicated structural models is proposed by Carroll and Spiegelman (1992), [32]. In a recent paper Chanda (1996), [12], considered the estimation problem of a polynomial structural model, where the unobservables form a sequence of autoregressive random variables.
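The following is a minimal sketch of the SIMEX idea under simplifying assumptions: the measurement error variance σ² is known, the added pseudo-errors are normal, and a quadratic extrapolant in λ is used; the function naive_fit stands for any naive estimator supplied by the user.

```python
import numpy as np

def simex(y, x, naive_fit, sigma, lambdas=(0.5, 1.0, 1.5, 2.0), B=50, seed=0):
    """Simulation-extrapolation sketch: naive_fit(y, x) returns the naive
    parameter estimate; sigma is the (assumed known) measurement error s.d."""
    rng = np.random.default_rng(seed)
    lams, betas = [0.0], [np.asarray(naive_fit(y, x))]   # naive estimate at lambda = 0
    for lam in lambdas:
        reps = []
        for _ in range(B):                               # add extra noise of variance lam * sigma^2
            x_b = x + np.sqrt(lam) * sigma * rng.standard_normal(x.shape)
            reps.append(naive_fit(y, x_b))
        lams.append(lam)
        betas.append(np.mean(reps, axis=0))              # simulation step: average over replicates
    lams, betas = np.asarray(lams), np.vstack(betas)
    # extrapolation step: fit a quadratic in lambda componentwise, evaluate at lambda = -1
    return np.array([np.polyval(np.polyfit(lams, betas[:, j], 2), -1.0)
                     for j in range(betas.shape[1])])
```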
1.3.5 Inconsistency of the functional m.l.e.

In the general model with no replications and no additional assumptions the consistency of the maximum likelihood estimator was an unsolved problem. Already Neyman and Scott (1948), [128], showed that "maximum-likelihood estimates of the structural parameter relating to a partially consistent series of observation need not be consistent". For the circular implicit model

f (ξi, ηi, β) = β² − ξi² + ηi²

the inconsistency of the maximum likelihood estimator is known; see for instance Section 2.IV.E in the booklet of Nagelkerke, [126]. Also for the logistic regression model (1.17), (1.18), Carroll et al. (1984), [33], argued that the functional maximum likelihood estimator is not consistent even in the case when all variances are known. Stefanski (1989), [156], showed how to prove the inconsistency of the m.l.e. for β and gave a proof for the logistic error-in-variables model (1.17), (1.18). Nevertheless, methods based on a stepwise linearization of the functional relationship

g (ξi, β) ≈ g (ξi^(k), β) + gξ (ξi^(k), β) (ξi^(k+1) − ξi^(k))        (1.25)

still appear in the literature also in the general case, where ξi is not consistently estimable. Dolby (1972), [42], started to use such a method to calculate the maximum likelihood estimator, and it is useful under conditions which guarantee the consistency of the estimator for the nuisance parameters. Otherwise it is only possible to speak about an approximately local linear model, as Linssen and Hillegers (1989), [115], did. Already Hillegers (1986), [80], discussed the inconsistency of this approach.

Replacing xi and yi by the averages of the repeated observations in (1.9) and (1.10) one obtains σ² = O(ri⁻¹). This led to the study of the asymptotics of small measurement errors

σ² = Var (xi) → 0,        (1.26)

an approach widespread in the literature on inference in nonlinear models. In Chapter 3 of Fuller (1987), [58], the corresponding results for the structural model are summarized. Under (1.26) the maximum likelihood estimator is consistent and asymptotically normally distributed. For the logistic regression models results of this kind are given in Stefanski and Carroll (1985), [161]. Most of the asymptotic results in structural nonlinear relations are with respect to small measurement errors, see [176], [34], [29], [156], [31]. For general structural models Chesher (1991), [37], published a discussion paper on the effect of measurement errors on the information produced by statistical procedures. He derived small variance approximations for distributions in the measurement error models.
1.3.6 Numerical algorithms

In the numerical literature the maximum likelihood estimator under a normal distribution, or the weighted least squares estimator, is well studied under the names total least squares, orthogonal regression, or orthogonal distance estimation. Numerical algorithms for solving the nonlinear minimization problem, which are globally and locally convergent, are presented by Schwetlick and Tiller (1985), [148], and by Boggs, Byrd and Schnabel (1987), [22]. Procedures are already implemented in statistical software libraries, as discussed by Boggs and Rogers (1990), [23]. Southwell (1990), [152], derived an algorithm which allows the use of standard numerical derivatives. A numerical procedure which solves nonlinear unbiased estimating equations of M-estimators is considered by Tak Mak (1993) in [117].
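As a minimal numerical sketch of the underlying minimization problem (not of any of the specialized algorithms cited above), the orthogonal least squares criterion can be handed directly to a general nonlinear least squares routine, jointly over the parameter of interest and the n nuisance parameters; the regression function g, the starting values and the weight w are assumptions of the example.

```python
import numpy as np
from scipy.optimize import least_squares

def orthogonal_lse(y, x, g, beta0, w=1.0):
    """Minimize sum_i [(y_i - g(xi_i, beta))^2 + w * (x_i - xi_i)^2]
    jointly over beta and the nuisance parameters xi_1, ..., xi_n."""
    p = len(beta0)

    def residuals(theta):
        beta, xi = theta[:p], theta[p:]
        return np.concatenate([y - g(xi, beta), np.sqrt(w) * (x - xi)])

    theta0 = np.concatenate([beta0, x])        # start the xi_i at the observed x_i
    fit = least_squares(residuals, theta0)
    return fit.x[:p], fit.x[p:]                # estimates of beta and of the design points

# example call with an exponential regression function (placeholder data y, x):
# beta_hat, xi_hat = orthogonal_lse(y, x, lambda xi, b: b[0] * np.exp(b[1] * xi), beta0=[1.0, 0.5])
```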
1.3.7 M-estimators

Another problem is to find alternative estimators which are consistent also in the case where the nuisance parameters vary freely and the number of nuisance parameters grows at the same rate as the sample size. Stefanski (1989), [155], introduced M-estimators for special models of functional and structural type, such as the exponential model (1.24), the polynomial model (1.21) and the generalized linear models (1.17). The functional generalized linear model is also considered in Stefanski, Carroll (1987), [162], and Stefanski (1989), [156]. Tsuyoshi Nakamura (1990), [127], presented a method which, under the assumption of the existence of a corrected score function, yields M-estimates for functional and structural error-in-variables models. The M-estimates are consistent and asymptotically normally distributed. He applied this method to some special models, of which the generalized linear model in the canonical form of McCullagh and Nelder (1994), [122], is the most interesting in the present context. Independently the same approach was also proposed by Stefanski (1985) in a technical report, which was published in 1989 as [157]. There he discussed the existence of corrected scores for normally distributed measurement errors and derived asymptotic results in the sense of (1.26). The corrected score functions are given for the exponential model (1.24) and for polynomial models (1.21). In [26] Buzas and Stefanski (1996) extended the corrected-score method studied by Nakamura (1990), [127], and Stefanski (1989), [157], to a large class of generalized linear measurement error models. They assumed normal errors and an expansion of the known link functions together with a lemma of Stein (1981), [166]. Hanfelt and Kung-Yee Liang (1997), [75], proposed a correction of the conditional quasi-likelihood function for the generalized linear model.
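To illustrate the corrected-score idea in its simplest form (a sketch, not the constructions of the cited papers): for normally distributed measurement errors with known variance σ² one can replace powers and exponentials of the observed xi by "corrected" versions whose conditional expectation given ξi equals the corresponding function of ξi; these corrections are then inserted into the score or least squares criterion.

```python
import numpy as np

def corrected_powers(x, sigma2):
    """Corrected powers t_k(x) with E[t_k(x_i) | xi_i] = xi_i**k, k = 0,...,3,
    for x_i = xi_i + eps_i with eps_i ~ N(0, sigma2)."""
    return {0: np.ones_like(x),
            1: x,
            2: x**2 - sigma2,
            3: x**3 - 3.0 * sigma2 * x}

def corrected_exponential(x, b, sigma2):
    """Corrected exponential with E[exp(b*x_i - b**2*sigma2/2) | xi_i] = exp(b*xi_i),
    as used for the exponential model (1.24)."""
    return np.exp(b * x - 0.5 * b**2 * sigma2)
```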
1.3.8 Additional information

In this paper only the primary data set (1.1) is assumed to be given. Additional information is required in the form of assumptions on the distributions or on the parameter sets. In the literature it is frequently assumed that external or independent studies are carried out and that additional information in the form of additional data sets is given. The main reason for this is the complicated structure of error-in-variables models. In the functional models there are 2n observations and n + p parameters.
On the other hand, in the case of structural models the unknown distribution Pξ comes in as an additional nonparametric component. Carroll and Stefanski (1990), [34], tried to give a taxonomy of the data sets likely to be available in measurement error studies; see also Section 1.4 of Carroll, Ruppert, Stefanski (1995), [31]. One possibility is to require replicated observations as in (1.19), (1.20). Another frequent assumption is that additional information is available in the form of observations of instrumental variables,

wij,  i = 1, ..., n,  j = 1, ..., r,        (1.27)

which are correlated with the observations xij, i = 1, ..., n, j = 1, ..., r, and uncorrelated with yij, i = 1, ..., n, j = 1, ..., r. Amemiya introduced instrumental variable estimators and studied their asymptotic properties for r → ∞ and n → ∞ in [5], [7], [6]. This approach is often introduced as a new type of estimator, but it is rather a new model with a larger data set.

Another important source of information is a validation study, where the pairs

(xj, ξj),  j = 1, ..., s,        (1.28)

are observed. Here the second line of the error-in-variables model, (1.10), plays the role of an external regression model. In that case the relation between the variables ξj and their observations xj can also be supposed to be nonlinear,

xj = gv (ξj, α) + ε2j,  j = 1, ..., s.        (1.29)

Then an estimate ξ̃i (xi) of ξi, with respect to the observed xi, can be constructed by the usual nonlinear calibration methods. Carroll, Ruppert, Stefanski (1995), [31], described this regression calibration method in Chapter 3. The replacement method uses the approximation

E (g (ξi, β) /xi) ≈ g (ξ̃i (xi), β)        (1.30)

in (1.23). The small variance asymptotics for methods based on the prediction of ξi is given in Carroll, Stefanski (1990), [34].

Lee and Sepanski (1995), [108], considered an orthogonal decomposition of the first equation of the explicit structural nonlinear model (1.9), yi = g (ξi, β) + ε1i, into

yi = xiᵀ γ (β) + εi*        (1.31)

with

γ (β) = (E x1ᵀ x1)⁻¹ E (x1ᵀ g (ξ1, β))

and

E (xiᵀ εi*) = 0.

Then the estimate β̂ of β is defined by the least squares estimator

β̂ = arg min_β Σ_{i=1}^n (yi − xiᵀ γ (β))²

in the regression model (1.31). The validation data set (1.28) is used to estimate γ (β) by γ̃ = γ̃ (β),

γ̃ = arg min_γ Σ_{j=1}^s (g (ξj, β) − xjᵀ γ)².
This idea is connected to the approach (1.23) of Hsiao (1989), [82], where the conditional expected value (1.22) is approximated linearly with the help of the validation data set. In a general nonlinear structural error-in-variables model Sepanski and Carroll (1991), [150], used the validation data set (1.28) for kernel estimates of the first two conditional moments of the distribution Pyi/x and estimated the parameter of interest β by adaptive quasi likelihood methods.

In the case of validation studies it is also possible to consider models with less information on the underlying relationship. Carroll and Wand (1991), [35], studied the binary logistic regression model (1.17), (1.18), with unknown error distributions and with an expanded validation data set, where yj is observed also,

(yj, xj, ξj),  j = 1, ..., s.        (1.32)

This data set is used for a kernel estimate of the conditional distribution Py/x. Then the regression parameter β is estimated semiparametrically by an adaptive likelihood method which also uses the observations of the primary data set (1.1). The order of the kernel depends on the dimension of the xi. This semiparametric estimation method was further developed by Carroll, Knickerbocker, Wang (1995), [28]. They supposed a random subset of the primary data set (1.1), where ξi is observed also, such that the validation data set is of type (1.32). The problem is the high dimension of ξi, which requires a kernel of higher order and with it nonpositive kernels. Here the idea is to assume that the conditional distribution of ξi given xi depends on xi only through a linear combination xiᵀγ. Given a √n-consistent estimate for γ, the problem is reduced to the one-dimensional estimated observations xiᵀγ̂ and the nonpositive kernel can be avoided. It is shown that the method of Carroll and Wand (1991), [35], can be applied to these estimated observations and that it gives asymptotically equivalent results.
1.3.9 Nonparametric relationship
Up to now only nonparametric error-in-variables models with i.i.d. random ξi, that is the structural nonparametric relationship, have been considered in the literature.

Deconvolution

Under structural model assumptions the error-in-variables equation (1.10) describes a convolution,

x1 = ξ1 ∗ ε21.        (1.33)

The estimation of the unknown density fξ of Pξ from the i.i.d. observations xi, i = 1, ..., n, is called deconvolution. To make this nonparametric problem identifiable, it is assumed that the errors ε21 are independent of ξ1 and i.i.d. with the known density fε. The problem of deconvolution is of interest independently of error-in-variables models. In engineering statistics it is the problem of extracting a signal when one observes the signal plus noise. In empirical Bayes theory the deconvolution problem is encountered when ξ1 is the location parameter and Pξ represents the unknown prior distribution. The estimation of mixing distributions for location families also involves the deconvolution problem. Therefore, in all these fields results on the deconvolution problem have been published. An application to the analysis of cell DNA content is discussed in Mendelsohn and Rice (1982), [124].

Deconvolution estimates based on Fourier methods are proposed by several authors, for instance Liu and Taylor (1989), [70], Stefanski and Carroll (1990), [163], and Cun-Hui Zhang (1990), [181]. In Stefanski and Carroll (1990), [163], a small historical review is given. The estimates are constructed as follows. Because of (1.33), the characteristic function ϕξ of ξ1 is the quotient of the characteristic function ϕx of x1 and the characteristic function ϕε of ε21,

ϕξ = ϕx / ϕε.

The characteristic function ϕx of x1 is estimated by ϕ̃x, which is the characteristic function of a kernel density estimate of fx with kernel K. Because a kernel density estimate is the convolution of the kernel function K with the empirical measure Gx generated by the data xi, i = 1, ..., n, the estimator ϕ̃x of ϕx is the product of the characteristic function ϕK of the kernel K and the empirical characteristic function ϕn,

ϕ̃x = ϕK ϕn.

Thus the estimator ϕ̃ξ of ϕξ has the same product structure as ϕ̃x,

ϕ̃ξ = (ϕK / ϕε) ϕn.

Using the inversion formula one obtains that the estimator for fξ is a kernel estimator with kernel K*, which is determined by the characteristic function

ϕK* = ϕK / ϕε.        (1.34)
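A minimal numerical sketch of the deconvolution kernel construction (1.34), assuming Laplace-distributed measurement errors and a standard Gaussian kernel K; the bandwidth, the integration grid and the data are placeholders, and the Fourier inversion is done by simple quadrature.

```python
import numpy as np

def deconv_kernel_density(x_obs, grid, h, b):
    """Deconvolution kernel density estimate of f_xi on `grid` with bandwidth h,
    for a Gaussian kernel K and Laplace(0, b) measurement errors:
      phi_K(t)   = exp(-t^2 / 2)
      phi_eps(t) = 1 / (1 + b^2 t^2)
      K*(z)      = (1/2pi) * Int exp(-i t z) phi_K(t) / phi_eps(t / h) dt
    """
    t = np.linspace(-40.0, 40.0, 4001)                   # grid for the Fourier inversion
    phi_K = np.exp(-t ** 2 / 2.0)
    inv_phi_eps = 1.0 + (b * t / h) ** 2                 # 1 / phi_eps(t / h) for Laplace errors

    def K_star(z):                                       # the kernel K* of (1.34), rescaled by h
        return np.trapz(np.cos(t * z) * phi_K * inv_phi_eps, t) / (2.0 * np.pi)

    n = len(x_obs)
    return np.array([sum(K_star((s - xj) / h) for xj in x_obs) / (n * h) for s in grid])
```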
Liu and Taylor (1989), [70], derived formulas for the variance and bias of the deconvolution estimator and recommended special choices of the bandwidth. Devroye (1989), [41], showed that whenever the error distribution has an almost everywhere positive characteristic function, there exists a consistent estimator for fξ. The estimate constructed there is based on the kernel method, but differs by truncation from the deconvolution kernel estimates of the other papers. Density deconvolution estimation for dependent observations was considered in Masry (1993), [120], and in Hesse (1995), [78]. Another approach to the estimation of a component of a convolution is given by Gaffey (1959), [59]. He used an inversion formula for distribution functions. Mendelsohn and Rice (1982), [124], approximated a nonparametric estimator fn of the density fx of x1 by a convolution with unknown component fξ. The estimator of fξ is the spline minimizing the distance between fn and ∫ fε (s − t) fξ (t) dt.

Consistency rates

Consistency rates depend mainly on the tail behavior of the characteristic function ϕε, because of (1.34). Deconvolving a density with smooth measurement error is intrinsically difficult, with convergence rates much slower than those usually encountered in density estimation. Carroll and Hall (1988), [27], showed that if the density fξ of ξ1 has k bounded derivatives and the errors are normal, the fastest rate of convergence of any density estimator in the deconvolution problem is only (log n)^(−k/2). This is the rate achieved by the kernel density estimator above. Zhang (1990), [181], discussed the optimal rates of convergence under the L2-norm and derived lower and upper bounds on the rates. In Stefanski (1990), [158], consistency rates are also considered. Fan (1991), [54], characterized the optimal rates of convergence for two types of error distributions: ordinary smooth and supersmooth distributions. The deconvolution problem can be understood as estimating a functional of the density. In that setting Fan (1993), [55], derived global lower minimax bounds to obtain the optimal rates of convergence. If only part of the observations is contaminated by an additional error, then Hesse (1995), [79], showed that the attainable convergence rates for the deconvolution estimates are equal to the existing optimal rates with uncontaminated observations. Prakash Patil (1996), [132], considered a model with replicated observations and obtained the usual rates of nonparametric density estimators for the deconvolution problem.
Nonparametric structural relation

Fan and Truong (1993), [56], published the first results for the nonparametric error-in-variables model. They assumed structural explicit relations (1.9), (1.10) with (1.8) and (1.12). The main idea is the construction of a kernel estimator ĝ of the regression function g with the deconvolution kernel K*, defined by (1.34). Consistency is shown for the supremum norm and for Lp-norms. An interesting feature is that the convergence rates are the same for both types of norms. This is not true for ordinary nonparametric regression (see Stone (1982), [167]). The rates are the same as in the deconvolution case and depend mainly on the type of error distribution. In Fan and Truong (1993), [56], the lower bounds for the convergence rates of all possible estimators are derived, and the proposed kernel estimator achieves them. Convergence and rates had already been considered by Stefanski and Carroll (1991), [165]. In the context of testing the absence of association in a generalized linear model (1.16), they applied the estimator with kernel K* for estimating the expected value E (ξ1 /x1) in the efficient score test. The kernel estimator of Fan and Truong (1993), [56], was studied by Masry (1993), [121], in the multivariate regression model of stationary random processes with error-in-variables. He established strong consistency and uniform convergence rates under ordinary smooth and supersmooth error distributions. Hausman et al. (1995), [77], proposed a nonparametric polynomial series estimator for structural nonlinear models with replicated observations. The idea is to estimate the regression function nonparametrically by a polynomial and to choose as estimator for the regression parameter the best least squares approximation to the nonparametric estimator.
1.4 Examples

The motivation for studying error-in-variables models comes from applications. In the papers of Carroll, Ruppert, Stefanski et al. several long-term medical studies lead to structural nonlinear error-in-variables models, and the exploration of these data is the aim in presenting new methods of estimation and testing. In Carroll, Ruppert, Stefanski (1995), [31], some of them, like the Framingham data set and the NHANES example, are presented and studied. Rudemo et al. (1989), [138], considered the application of nonlinear structural relationships to bioassay and gave a small error approximation. In one of the classical papers on functional models, Neyman and Scott (1948), [128], considered problems of astronomy connected with the study of the dynamics of the galaxy. Here let us briefly introduce some examples which may be modeled as nonlinear functional error-in-variables models.
1.4.1 Continuous culture in microbiology

Schulze and Lipe (1964), [147], described an experiment in a continuous flow culture under complete mixing conditions. The aim is to establish the relationship between the substrate concentration S and the growth rate k1 for a given type of microorganisms. The basic equation relating substrate concentration and growth rate is given theoretically by

k1 = km S / (S + S0),        (1.35)

where km and S0 are unknown parameters. The main interest lies in estimating S0, which corresponds to the Michaelis-Menten constant and represents the substrate concentration at which the growth rate reaches one half of its maximum value. In each experiment a steady state i is reached in the reactor with continuous supply and flow off, when the concentration of the microorganisms in the reactor is constant. The steady state i depends on the starting cell concentration of microorganisms and the starting substrate concentration in the feed solution. As long as the steady state conditions are maintained, the growth rate k1 is equal to the dilution rate. The dilution rate is measured by yi for each steady state i. The substrate concentration S in the reactor is equal to the effluent rate, which is measured by xi at each steady state i. This type of experiment corresponds to the model (1.9), (1.10) with parametric model assumptions (1.6). Through the starting conditions the scientist has some influence on the nuisance parameters ξi, which here are the concentrations of the substrate inside the reactor for the steady state i. The assumption of a functional model with unknown but fixed design points seems to be reasonable.
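A minimal sketch of fitting the Michaelis-Menten relation (1.35) when both variables are measured with error, using the orthogonal distance regression wrapper in scipy (ODRPACK, cf. the numerical algorithms cited in Section 1.3.6); the data arrays and starting values are placeholders.

```python
import numpy as np
from scipy.odr import Model, RealData, ODR

def michaelis_menten(beta, s):
    km, s0 = beta                       # beta = (k_m, S_0) as in (1.35)
    return km * s / (s + s0)

x = np.array([0.5, 1.0, 2.0, 4.0, 8.0, 16.0])       # measured substrate concentrations (placeholder)
y = np.array([0.18, 0.28, 0.40, 0.52, 0.60, 0.64])  # measured dilution rates (placeholder)

data = RealData(x, y)                   # both variables treated as observed with error
fit = ODR(data, Model(michaelis_menten), beta0=[0.7, 2.0]).run()
print("estimated (k_m, S_0):", fit.beta)
```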
1.4.2 Astrometric analysis of Schmidt plates

Schmidt plates are photographic plates of stars. For a given system of reference stars i, i = 1, ..., n, the celestial positions ηi = (ηi1, ηi2) are well known. The real position ξi = (ξi1, ξi2) of the star i on the plate is measured by xi = (xi1, xi2) with an error. The aim is to reduce the position on the plate to the celestial one, in order to determine the dynamics of stars by comparison of plates from different time points. The relationship between ξi and ηi is given by a third order polynomial, which includes the terms for the inclination and distortion, see Hirte (1989), [81],

ηij = β0j + β1j ξi1 + β2j ξi2 + β3j ξi1 ξi2 + β4j ξi1² + β5j ξi2² + ... + β9j ξi2³ + εij.

The errors in the equation εij describe atmospherical disturbances; here they are treated like the errors of yij in (1.9). The unknown parameters βkj, k = 0, ..., 9, j = 1, 2, are the plate constants one is interested in. The comparison of two overlapping plates is considered in Eichhorn (1960), [49], and delivers a similar relationship. In a series of papers by Eichhorn (1978), [50], Eichhorn (1985), [51], Eichhorn (1988), [52], and by Jefferys (1980), [89], Jefferys (1981), [90], the application of nonlinear functional relations to astrometric problems is considered, the least squares estimator is proposed, and methods for linearization are given. In this special problem there is no influence on the nuisance parameters, the real positions of the stars on the plate. The assumption of the star positions being i.i.d. is also not acceptable. Thus we have here a special case of a polynomial functional relation model. In this model the hope is to have some more information about the measurement error distribution, especially the knowledge of this distribution up to moments of sixth order. For this special set-up the formulas of alternative estimators are given in Zwanzig (1997), [188].
1.4.3 Polymerization
The application of error-in-variables models in chemistry, especially in polymerization, was another starting point for the study of nonlinear functional relations. Britt and Luecke (1973), [25], discussed the chemical reaction between two components, whose reaction rate is given by

r = A P1 P2 / (1 + B P1 + C P2),

where r is the reaction rate and P1 and P2 are the pressures of the two components, all measured with errors. A, B, C are the unknown rate constants of interest. Here the authors proposed an iterative algorithm for the calculation of the orthogonal least squares estimator. They studied the statistical properties of the least squares estimator in the local linear case.

For terpolymerization Duever et al. (1983), [47], had the model

P1 / P3 = [M1 (M1 r23 r23 + M2 r31 r23 + M3 r32 r21) (M1 r12 r13 + M2 r13 + M3 r12)] / [M3 (M1 r13 r23 + M2 r13 r21 + M3 r12 r21) (M3 r31 r32 + M1 r32 + M2 r31)]

P2 / P3 = [M2 (M1 r32 r13 + M2 r13 r31 + M3 r12 r31) (M2 r21 r23 + M1 r23 + M3 r21)] / [M3 (M1 r13 r23 + M2 r13 r21 + M3 r12 r21) (M3 r31 r32 + M1 r32 + M2 r31)],

where the Mi are the molar concentrations of monomer i in the liquid phase and the Pi are the molar concentrations of the monomers in the polymer phase, i = 1, 2, 3. The reactivity ratios rij, i ≠ j, i, j = 1, 2, 3, are the parameters of interest. The concentrations are measured with errors. In their paper the authors compared the ordinary least squares estimator with the estimator in the implicit errors-in-variables model in a simulation study and gave preference to the error-in-variables method.

Linssen and Hillegers (1989), [113], began with an application in copolymerization models (see Van der Meer, Linssen, German (1978), [123]). They presented an estimator for the nonlinear functional relation model based on a linearization like (1.25). A detailed study of the application of functional nonlinear error-in-variables models, instead of models with no measurement error, in polymerization models is given in the last chapter of Hillegers (1986), [80].

The application of error-in-variables models to copolymerization was also a consulting problem at the University of Hamburg. Copolymerization is the reaction of two monomers M1, M2 to a polymer consisting of both. The concentration of the monomer j at time point t is denoted by Mj (t). The experiments are carried out for the determination of the copolymerization parameters

r1 = k11 / k12,   r2 = k22 / k21,

where kjj is the reaction rate of the polymerization of the monomer Mj and kjk is the reaction rate when the polymerization changes from the monomer Mj to Mk. The copolymerization equation is given by

dM1 (t) / dM2 (t) = (M1 (t) / M2 (t)) · (r1 M1 (t) / M2 (t) + 1) / (M1 (t) / M2 (t) + r2).        (1.36)

The experiment is repeated for n different mixtures i of the concentrations of the monomers in the solvent solution and constant reaction time, where

ηi = dM1 (t) / dM2 (t)   and   ξi = M1 (t) / M2 (t)

are measured with errors. Therefore the copolymerization can also be described by a nonlinear functional relation model. An influence on the design is possible by the choice of the starting concentrations of the monomers. Keeler and Reilly (1992), [94], also considered the copolymerization model (1.36) in the context of functional error-in-variables models and presented an extension of the concept of D-optimal design.
1.5 Scope and content of the paper

The main topic of this paper is the explicit functional relation model (1.9), (1.10) under parametric and nonparametric model assumptions. The only data set available is the primary one (1.1). Additional information is included in the form of assumptions on the set of parameters or on the error distributions. For instance the replicated model (1.19), (1.20) can be formulated as an unreplicated model with sample size N and a nuisance parameter set which is an n-dimensional linear subspace of Rᴺ.
The most interesting asymptotic approach is that of increasing sample size,

n → ∞,

which includes an increasing number of nuisance parameters. Most asymptotic results in Chapter 6, Chapter 7 and Chapter 9 are given in the form of inequalities, where the bounds depend on the sample size and on the error variance. Thus the small error asymptotics, σ² → 0, is also included. Moreover it is possible to obtain rates for combinations of both, σ² (n) → 0 as n → ∞.

Under parametric model assumptions a considerable part is dedicated to the weighted least squares estimator in (1.19), (1.20), also called the orthogonal regression estimator, because of its nice geometrical background and its widespread use in applications. Using entropy notation, conditions on the nuisance parameter set are found under which the least squares estimator is consistent. The proof is based on methods of empirical process theory. The conditions are weaker than the known ones in the literature. One special case is that under a known order of the unknown design points ξi the least squares estimator is consistent. This additional knowledge about the nuisance parameter set is for instance given in chemistry. On the other hand, conditions are given in the general functional explicit model (1.19), (1.20) which imply the inconsistency of the least squares estimator. This is the case, for instance, for nonlinear relationships where the design points vary freely in some interval. Unfortunately the problem of consistency or inconsistency is not completely solved. There remains a grey area where the behavior of this estimator is unknown, but for most cases relevant for practical purposes the answer is given now.

Under conditions ensuring the consistency of the least squares estimator the asymptotic normality is proved in the form of a Berry-Esseen inequality. For normally distributed errors the maximum likelihood estimator and the least squares estimator coincide. Under the conditions of the asymptotic normality the least squares estimator is asymptotically efficient in a local minimax sense.

The inconsistency of the least squares estimator in important cases motivates the search for alternative consistent estimators. In Chapter 10 alternative estimators are presented. The idea behind this is connected with deconvolution integral equalities and with the approach of Stefanski (1989), [157], of correcting score functions. In contrast to the known M-estimators in error-in-variables models, here the starting point is the least squares method and a correction of this distance measure.
The alternative estimator can be constructed for polynomial and for exponential functions. Both sets of functions generate orthonormal systems for nonparametric model classes. This is used in the second part of the paper, where orthogonal series estimators for the unknown relationship are presented. The consistency rates are derived and compared with those of Fan and Truong (1993), [56]. Up to now, no nonparametric estimators for the functional nonparametric model have been given in the literature.
Chapter 2

Notation

In this chapter we collect the notation we will use throughout the paper. Nevertheless we also try to explain notation and concepts at the place where they are needed.
2.1 Scalar products, norms, empirical measure

‖·‖ denotes the Euclidean norm for vectors and also for matrices. Because we have vectors and matrices with an increasing dimension, it is useful to introduce the following normalized norms. The notation introduced here depends on a sequence of weights. These are the same weights which will occur in Definition 5.1 of the weighted least squares estimator. Given an n-dimensional vector of nonnegative weights

w = (w1, ..., wn)ᵀ,  wi ≥ 0,  Σ_{i=1}^n wi = 1,        (2.1)

we define the weighted scalar product for X, Y ∈ Rⁿ by

(X, Y)_w = Σ_{i=1}^n wi xi yi.        (2.2)

We write (X, Y)_n = (X, Y)_w for w = (1/n, ..., 1/n)ᵀ. The norms generated by the scalar products are

|X|²_w = (X, X)_w,        (2.3)

|X|²_n = (X, X)_n.        (2.4)

Then the weighted Euclidean distance for X, Y ∈ Rⁿ is

|X − Y|²_w = Σ_{i=1}^n wi (xi − yi)²,        (2.5)

and for wi = 1/n we have for X, Y ∈ Rⁿ

|X − Y|²_n = (1/n) Σ_{i=1}^n (xi − yi)².        (2.6)
For an $n \times p$ dimensional matrix
$$Z = (z_{ij})_{i=1,\ldots,n,\; j=1,\ldots,p}, \qquad (2.7)$$
we define the scalar products (and corresponding norms) with $X \in \mathbb{R}^n$ by
$$(Z, X)_w = \left(\sum_{i=1}^n w_i x_i z_{ij}\right)_{j=1,\ldots,p} \in \mathbb{R}^p. \qquad (2.8)$$
Analogously, for an $n \times p \times p$ dimensional array
$$Z = (z_{ikj})_{i=1,\ldots,n,\; k=1,\ldots,p,\; j=1,\ldots,p} \qquad (2.9)$$
we write for $X \in \mathbb{R}^n$
$$(Z, X)_w = \left(\sum_{i=1}^n w_i x_i z_{ikj}\right)_{k=1,\ldots,p,\; j=1,\ldots,p}. \qquad (2.10)$$
Further it is convenient to introduce a weighted empirical measure generated by the sequence of design points $\xi \in \mathbb{R}^n$. Given weights satisfying (2.1), we define for all intervals $A$ of $\mathbb{R}$
$$G_w(A) = \sum_{i=1}^n w_i I_A(\xi_i), \qquad (2.11)$$
where $I_A(\cdot)$ is the indicator function,
$$I_A(\xi_i) = \begin{cases} 1 & \text{for } \xi_i \in A, \\ 0 & \text{for } \xi_i \notin A. \end{cases} \qquad (2.12)$$
Thus it holds
$$|G(\xi, \beta)|_w^2 = \sum_{i=1}^n w_i \left(g(\xi_i, \beta)\right)^2 = \int g(x, \beta)^2 \, dG_w(x), \qquad (2.13)$$
and for $\beta, \bar\beta \in \Theta$
$$\left(G(\xi, \beta), G(\xi, \bar\beta)\right)_w = \int g(x, \beta)\, g(x, \bar\beta)\, dG_w(x) \qquad (2.14)$$
and
$$\left|G(\xi, \beta) - G(\xi, \bar\beta)\right|_w^2 = \int \left(g(x, \beta) - g(x, \bar\beta)\right)^2 dG_w(x). \qquad (2.15)$$
We also use the unweighted empirical measure
$$G_n(A) = G_w(A) = \frac{1}{n}\sum_{i=1}^n I_A(\xi_i), \quad \text{for } w = \left(\tfrac{1}{n}, \ldots, \tfrac{1}{n}\right)^T. \qquad (2.16)$$
Thus it holds
$$|G(\xi, \beta)|_n^2 = \frac{1}{n}\sum_{i=1}^n \left(g(\xi_i, \beta)\right)^2 = \int g(x, \beta)^2\, dG_n(x), \qquad (2.17)$$
and for $\beta, \bar\beta \in \Theta$
$$\left(G(\xi, \beta), G(\xi, \bar\beta)\right)_n = \int g(x, \beta)\, g(x, \bar\beta)\, dG_n(x). \qquad (2.18)$$
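As a purely illustrative aid — not part of the original text, and with all variable names and values being assumptions of the example — the following short Python sketch evaluates the weighted scalar product (2.2), the weighted distance (2.5) and the weighted empirical measure (2.11) for a small data set.

```python
import numpy as np

def weighted_scalar_product(x, y, w):
    # (X, Y)_w = sum_i w_i x_i y_i, see (2.2); weights are assumed normalized as in (2.1)
    return np.sum(w * x * y)

def weighted_distance_sq(x, y, w):
    # |X - Y|_w^2 = sum_i w_i (x_i - y_i)^2, see (2.5)
    return np.sum(w * (x - y) ** 2)

def empirical_measure(xi, w, a, b):
    # G_w([a, b]) = sum_i w_i I_{[a,b]}(xi_i), see (2.11)
    return np.sum(w * ((xi >= a) & (xi <= b)))

# hypothetical design points and equal weights w_i = 1/n, giving (., .)_n and G_n
xi = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
w = np.full(xi.size, 1.0 / xi.size)
x, y = np.sin(xi), np.cos(xi)
print(weighted_scalar_product(x, y, w))    # (X, Y)_n
print(weighted_distance_sq(x, y, w))       # |X - Y|_n^2, see (2.6)
print(empirical_measure(xi, w, 0.0, 0.5))  # G_n([0, 0.5])
```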
2.2 Derivatives
In this subsection the notation for the partial derivatives is given. Let
$$h : (x, \beta) \in \mathbb{R} \times \Theta \to \mathbb{R} \qquad (2.19)$$
be a twice continuously differentiable real valued function of $p + 1$ variables. We denote the partial derivative with respect to the $j$-th component of $\beta = (\beta_1, \ldots, \beta_p)^T$ by
$$h^{\beta_j}(x, \beta) = \frac{\partial}{\partial \beta_j} h(x, \beta) \qquad (2.20)$$
and the $p$ dimensional vector whose components are the partial derivatives with respect to $\beta_j$ by
$$h^{\beta}(x, \beta) = \begin{pmatrix} h^{\beta_1}(x, \beta) \\ \vdots \\ h^{\beta_p}(x, \beta) \end{pmatrix}. \qquad (2.21)$$
Analogously, we will use for the second partial derivatives with respect to $\beta_j$ and $\beta_k$
$$h^{\beta_j \beta_k}(x, \beta) = \frac{\partial^2}{\partial \beta_j \partial \beta_k} h(x, \beta) \qquad (2.22)$$
and for the $p \times p$ dimensional matrix whose elements are the partial derivatives of second order with respect to $\beta_j$ and $\beta_k$
$$h^{\beta\beta}(x, \beta) = \begin{pmatrix} h^{\beta_1 \beta_1}(x, \beta) & \ldots & h^{\beta_1 \beta_p}(x, \beta) \\ \vdots & & \vdots \\ h^{\beta_p \beta_1}(x, \beta) & \ldots & h^{\beta_p \beta_p}(x, \beta) \end{pmatrix}. \qquad (2.23)$$
The partial derivative with respect to $x$ we denote by
$$h^{x}(x, \beta) = \frac{\partial}{\partial x} h(x, \beta) \qquad (2.24)$$
and the derivative of second order with respect to $x$ by
$$h^{xx}(x, \beta) = \frac{\partial^2}{\partial x \partial x} h(x, \beta). \qquad (2.25)$$
The derivatives of second order with respect to $x$ and to $\beta_j$ are
$$h^{x\beta_j}(x, \beta) = \frac{\partial^2}{\partial x \partial \beta_j} h(x, \beta), \quad j = 1, \ldots, p. \qquad (2.26)$$
For the $p$ dimensional vector of them we write
$$h^{x\beta}(x, \beta) = \begin{pmatrix} h^{x\beta_1}(x, \beta) \\ \vdots \\ h^{x\beta_p}(x, \beta) \end{pmatrix}. \qquad (2.27)$$
We will use the common notation for classes of smooth functions:
$$\mathcal{C}_{k,\alpha}[0, 1] = \left\{ f : [0, 1] \to \mathbb{R} : \; f \text{ is } k \text{ times differentiable, } f^{(k)} \text{ is H\"olderian with exponent } \alpha \right\}. \qquad (2.28)$$
Part I

Estimation in the parametric error-in-variables model
Chapter 3 Parametric model

3.1 Nonlinear functional relation model
Suppose we have $n$ independent, but in general not identically distributed, two dimensional real valued observations $(y_1, x_1), \ldots, (y_n, x_n)$, generated by
$$y_i = g(\xi_i, \beta) + \varepsilon_{1i}, \qquad (3.1)$$
$$x_i = \xi_i + \varepsilon_{2i}, \qquad (3.2)$$
with
$$i = 1, \ldots, n. \qquad (3.3)$$
The first equation describes a nonlinear regression model, the second the error in the variables. The distribution of each observation $(y_i, x_i)$ is $P_{\xi_i \beta}$. The common distribution of the whole sample is
$$\left((y_1, x_1), \ldots, (y_n, x_n)\right) \sim \prod_{i=1}^n P_{\xi_i \beta} = P_{\xi\beta}.$$
In this part we assume that the model for the regression function is given parametrically. That means the regression function is a member of a known parametric class $\mathcal{M}$ of smooth functions:
$$\mathcal{M} = \left\{ g(\cdot, \beta) : \mathcal{X} \subseteq \mathbb{R} \to \mathbb{R} : \beta \in \Theta \right\}. \qquad (3.4)$$
The regression parameter $\beta \in \Theta \subset \mathbb{R}^p$ is the parameter of interest. The dimension $p$ of $\beta$ does not depend on the sample size $n$. Compare for instance the example in Section 1.4.1, where $g(\cdot, \beta)$ is the Michaelis-Menten curve (1.35). The design points or variables $\{\xi_1, \ldots, \xi_n\} \subset \mathbb{R}$ are unknown and fixed. The $\xi_i$ are the nuisance parameters, whose number grows with the sample size $n$. We write the nuisance parameters as components of a column vector of dimension $n$:
$$\xi^{(n)} = (\xi_1, \ldots, \xi_n)^T \in \mathcal{F}^{(n)} \subseteq \mathbb{R}^n. \qquad (3.5)$$
The errors $\varepsilon_{1i}, \varepsilon_{2i}$ are independent, but not necessarily identically distributed, with expected value zero and positive variances $\sigma_{ji}^2$, $j = 1, 2$; $i = 1, \ldots, n$. The error distributions do not depend on the parameter $\beta$. The distributions of the error terms are allowed to depend on the sample size $n$. Then the asymptotic approach $\sigma_{\max}^2 \to 0$ can also be included. For instance, this is the case for the averaged model (3.11), (3.12). Let us introduce this model separately in the following section.
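For concreteness, the following small simulation sketch (illustrative only; the concrete curve, the parameter values and the noise levels are assumptions, not taken from the text) generates data from the functional relation (3.1), (3.2) with a Michaelis-Menten type regression function, as mentioned in Section 1.4.1.

```python
import numpy as np

rng = np.random.default_rng(0)

def g(xi, beta):
    # Michaelis-Menten type curve g(xi, beta) = beta1 * xi / (beta2 + xi)
    b1, b2 = beta
    return b1 * xi / (b2 + xi)

n = 50
beta0 = np.array([1.0, 0.5])            # true parameter of interest (assumed values)
xi0 = np.linspace(0.1, 2.0, n)          # unknown, fixed design points (nuisance parameters)
sigma1, sigma2 = 0.05, 0.05             # error standard deviations (assumed)

y = g(xi0, beta0) + sigma1 * rng.standard_normal(n)   # equation (3.1)
x = xi0 + sigma2 * rng.standard_normal(n)             # equation (3.2)
```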
3.2 Repeated observation model
We assume that the experiment is repeated independently $r_i$ times at each design point $\xi_i$, $i = 1, \ldots, q$. This means we have only $q$ different design points and $n = \sum_{i=1}^q r_i$ observations $(y_{11}, x_{11}), \ldots, (y_{1r_1}, x_{1r_1}), \ldots, (y_{qr_q}, x_{qr_q})$, mutually independent but in general not identically distributed, generated by
$$y_{ik} = g(\xi_i, \beta) + \varepsilon_{1ik}, \qquad (3.6)$$
$$x_{ik} = \xi_i + \varepsilon_{2ik}, \qquad (3.7)$$
where
$$i = 1, \ldots, q, \quad k = 1, \ldots, r_i. \qquad (3.8)$$
The regression function also fulfills the assumption (3.4). The errors $\varepsilon_{ji1}, \ldots, \varepsilon_{jir_i}$ are i.i.d. with expected value zero and variance $\sigma_{ji}^2$. The weighted average of the $\varepsilon_{ji1}, \ldots, \varepsilon_{jir_i}$ is denoted by
$$\bar\varepsilon_{ji} = \frac{1}{w_{ji}} \sum_{k=1}^{r_i} w_{jik}\, \varepsilon_{jik}, \qquad (3.9)$$
where the weights are $w_{jik} \ge 0$ and
$$w_{ji} = \sum_{k=1}^{r_i} w_{jik}. \qquad (3.10)$$
We also use the notation $\bar x_i$, $\bar y_i$, respectively. After averaging, we have two dimensional real valued observations $(\bar y_1, \bar x_1), \ldots, (\bar y_q, \bar x_q)$, independent and in general not identically distributed, generated by
$$\bar y_i = g(\xi_i, \beta) + \bar\varepsilon_{1i}, \qquad (3.11)$$
$$\bar x_i = \xi_i + \bar\varepsilon_{2i}, \qquad (3.12)$$
with $i = 1, \ldots, q$. The errors $\bar\varepsilon_{ji}$, $j = 1, 2$, $i = 1, \ldots, q$, are independent, not identically distributed, with expected value zero and bounded variances depending on $r_i$,
$$E\left(\bar\varepsilon_{ji}\right)^2 = \mathrm{Var}\left(\frac{1}{w_{ji}}\sum_{k=1}^{r_i} w_{jik}\, \varepsilon_{jik}\right) = \frac{1}{(w_{ji})^2}\sum_{k=1}^{r_i} (w_{jik})^2 \sigma_{ji}^2. \qquad (3.13)$$
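Purely as an illustration of the averaging step (3.9)-(3.13) — not part of the original text, and with equal weights $w_{jik} = 1$ assumed — the following sketch forms the averaged observations (3.11), (3.12) from replicated data.

```python
import numpy as np

rng = np.random.default_rng(1)

def g(xi, beta):
    # the same hypothetical Michaelis-Menten type curve as above
    return beta[0] * xi / (beta[1] + xi)

q, r = 10, 5                              # q design points, r replications each (assumed)
beta0 = np.array([1.0, 0.5])
xi0 = np.linspace(0.2, 2.0, q)
sigma1, sigma2 = 0.1, 0.1

# replicated observations (3.6), (3.7)
y = g(xi0, beta0)[:, None] + sigma1 * rng.standard_normal((q, r))
x = xi0[:, None] + sigma2 * rng.standard_normal((q, r))

# averaged model (3.11), (3.12); with equal weights the average is the arithmetic mean,
# and by (3.13) the error variance shrinks like sigma^2 / r
y_bar, x_bar = y.mean(axis=1), x.mean(axis=1)
```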
As the number of replications increases, the error variances tend to zero. Here the model (3.11), (3.12) is called the averaged model. It has the same structure as the original one, (3.1), (3.2), with error distributions depending on the additional parameters $r_i$. Another way of embedding the model (3.6), (3.7) into the model (3.1), (3.2) is the following one. In (3.6), (3.7) we have $n$ observations, and the $n$ dimensional vector of the nuisance parameters (3.5) has only $q$ different components; or, more generally, the parameter set $\mathcal{F}^{(n)}$ lies in a $q$ dimensional linear subspace $L_q$ of the $n$ dimensional Euclidean space $\mathbb{R}^n$:
$$\mathcal{F}^{(n)} \subseteq L_q \subset \mathbb{R}^n. \qquad (3.14)$$
The next two special models are characterized by a special structure of the set of nuisance parameters.
3.3 Model with known order of design points
The model is the same as in (3.1), (3.2). The observations $(y_1, x_1), \ldots, (y_n, x_n)$ are independent and generated by
$$y_i = g(\xi_i, \beta) + \varepsilon_{1i}, \qquad (3.15)$$
$$x_i = \xi_i + \varepsilon_{2i}, \qquad (3.16)$$
with $i = 1, \ldots, n$. For the regression function we assume (3.4); $\beta$ is the parameter of interest with fixed dimension $p$. The key assumption is that the order of the unknown design points is known. Instead of (3.5), we require
$$\mathcal{F}^{(n)} = \left\{ \xi = (\xi_1, \ldots, \xi_n)^T : 0 \le \xi_1 \le \xi_2 \le \ldots \le \xi_{n-1} \le \xi_n \le 1 \right\}. \qquad (3.17)$$
This assumption may seem artificial, but in a lot of applications it is useful. For instance, in the examples from biology and chemistry in Sections 1.4.1 and 1.4.3 there are unknown design points $\xi_i$, namely the different levels of concentration. The experimenter measures this concentration level with error, but he does have some influence on the levels of the concentration and can guarantee with high security that the concentration level of the next experiment will be higher. In Chapter 7 we will see that under the additional information (3.17) the least squares estimator is consistent. If we assume that the design points have an arbitrary position in the interval $[0, 1]$, then the least squares estimator is consistent in the linear model only; otherwise it is inconsistent. This will be shown in Chapter 8. The assumption (3.17) can be rewritten as
$$\mathcal{F}^{(n)} = \left\{ \xi = \left( f\left(\tfrac{1}{n}\right), \ldots, f\left(\tfrac{i}{n}\right), \ldots, f(1) \right)^T : f : [0, 1] \to [0, 1],\; f \text{ increasing} \right\}. \qquad (3.18)$$
Such additional information is used in the following semiparametric model.
3.4 Nonlinear semiparametric functional relation model
The name semiparametric may be misleading in the context of error-in-variables models, because all models of this kind are semiparametric due to the increasing number of nuisance parameters. Here we call those models semiparametric where the design points are generated by an unknown smooth function of known design points. Without loss of generality we assume that the hidden design is equidistant. The set of nuisance parameters can be interpreted as the set of all smooth Ylvisaker designs, see Sacks and Ylvisaker (1984), [139]. The observations $(y_1, x_1), \ldots, (y_n, x_n)$ are independent and generated by
$$y_i = g\left(f\left(\tfrac{i}{n}\right), \beta\right) + \varepsilon_{1i}, \qquad (3.19)$$
$$x_i = f\left(\tfrac{i}{n}\right) + \varepsilon_{2i}, \qquad (3.20)$$
with $i = 1, \ldots, n$. For the regression function $g$ we assume (3.4), and $\beta$ is the parameter of interest with fixed dimension $p$. The first equation (3.19) describes a usual nonlinear semiparametric model, the second (3.20) a usual nonparametric regression model with fixed design. We assume
$$f \in \mathcal{F}_{m,\alpha}(C, L),$$
where
$$\mathcal{F}_{m,\alpha}(C, L) = \left\{ f \in \mathcal{C}_{m,\alpha}[0, 1] : \; \left|f^{(k)}(x)\right| \le C, \; k = 0, 1, \ldots, m; \quad \left|f^{(m)}(x_1) - f^{(m)}(x_2)\right| \le L |x_1 - x_2|^\alpha \right\}. \qquad (3.21)$$
In this case we can show the consistency of the least squares estimator. That is done in Section 7.4.2.
3.5 Vector model
Sometimes, especially within proofs, it is useful to write the model (3.1), (3.2) in vector notation. Let
$$Y = (y_1, \ldots, y_n)^T, \quad X = (x_1, \ldots, x_n)^T, \quad \xi = \xi^{(n)} = (\xi_1, \ldots, \xi_n)^T \qquad (3.22)$$
and
$$\varepsilon_1 = (\varepsilon_{11}, \ldots, \varepsilon_{1n})^T, \quad \varepsilon_2 = (\varepsilon_{21}, \ldots, \varepsilon_{2n})^T \qquad (3.23)$$
and
$$G(\xi, \beta) = \left(g(\xi_1, \beta), \ldots, g(\xi_n, \beta)\right)^T. \qquad (3.24)$$
Then the vector model becomes
$$Y = G(\xi, \beta) + \varepsilon_1, \qquad (3.25)$$
$$X = \xi + \varepsilon_2. \qquad (3.26)$$
Note: the lengths of the vectors $X, Y$ depend on the sample size $n$. In order to simplify the notation we suppress the dependence on $n$.
Chapter 4 Estimation of the parameter of interest

The nonlinear functional relation model, (3.1) and (3.2), is a special case of a model with a parameter of interest and nuisance parameters. The estimation of an interesting parameter in the presence of nuisance parameters, especially when the number of nuisance parameters is large, is a long standing and important problem in statistics. The classic reference is Neyman and Scott (1948), [128]. In this chapter we will give a short overview of different approaches to handling the case of nuisance parameters, and discuss which of them are useful for the nonlinear functional relation model.
4.1 Estimating functions and classes of estimates
One of the general approaches described in this section is that of estimating functions. Godambe (1960), [70], was the first to introduce the concept of estimating functions. Godambe and Thompson (1974), [74], generalized it to the estimation problem with nuisance parameters.

Definition 4.1 A real function $f_i(\cdot, \cdot, \cdot)$ on $\mathbb{R} \times \mathbb{R} \times \Theta$ is called an unbiased estimating function iff
$$E_{\xi_i \beta}\left(f_i(\cdot, \cdot, \cdot)\right) = 0, \quad \text{for all } \xi_i \text{ with } \xi \in \mathcal{F}^{(n)}, \text{ for all } \beta \in \Theta, \qquad (4.1)$$
and
$$E_{\xi_i \beta}\left(f_i(\cdot, \cdot, \cdot)^2\right) < \infty, \quad \text{for all } \xi_i \text{ with } \xi \in \mathcal{F}^{(n)}, \text{ for all } \beta \in \Theta. \qquad (4.2)$$ □

Then the estimator for the parameter of interest is defined as the measurable solution of the estimating equation.
Definition 4.2 The estimator $\tilde\beta$ with respect to $f$ is a measurable solution of
$$\frac{1}{n}\sum_{i=1}^n f_i\left(x_i, y_i, \tilde\beta\right) = 0. \qquad (4.3)$$ □
Let us consider the Taylor expansion
$$\frac{1}{n}\sum_{i=1}^n f_i\left(x_i, y_i, \tilde\beta\right) = \frac{1}{n}\sum_{i=1}^n f_i(x_i, y_i, \beta) + \frac{1}{n}\sum_{i=1}^n f_i^{\beta}(x_i, y_i, \beta')\left(\beta - \tilde\beta\right), \qquad (4.4)$$
where $\beta'$ is an intermediate value between $\beta$ and $\tilde\beta$. Under regularity conditions on the unbiased estimating functions, ensuring the existence of the following limits,
$$\frac{1}{n}\sum_{i=1}^n f_i(x_i, y_i, \beta) = \frac{1}{n}\sum_{i=1}^n E_{\xi_i \beta} f_i(x_i, y_i, \beta) + o_P(1) = o_P(1) \qquad (4.5)$$
and
$$\frac{1}{n}\sum_{i=1}^n f_i^{\beta}(x_i, y_i, \beta') = \frac{1}{n}\sum_{i=1}^n E_{\xi_i \beta} f_i^{\beta}(x_i, y_i, \beta') + o_P(1) = V(f, \beta) + o_P(1), \qquad (4.6)$$
where $V(f, \beta) \succ 0$, we obtain the consistency of $\tilde\beta$. Under further regularity conditions we also get the stochastic expansion
$$\sqrt{n}\left(\beta - \tilde\beta\right) = \frac{1}{\sqrt{n}}\, V^{-1}(f, \beta) \sum_{i=1}^n f_i(x_i, y_i, \beta) + o_P(1) \qquad (4.7)$$
and the asymptotic normality
$$\sqrt{n}\left(\beta - \tilde\beta\right) \longrightarrow N_p\left(0, V^{-1} B V^{-1}\right), \qquad (4.8)$$
with
$$V = \lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^n E_{\xi_i \beta} f_i^{\beta}(x_i, y_i, \beta) \quad \text{and} \quad B = \lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^n E_{\xi_i \beta} f_i(x_i, y_i, \beta)^2. \qquad (4.9)$$
Godambe (1980), [72], defined the information of an estimating function as a measure of how $f_i$ may be used to estimate the parameter of interest.
Definition 4.3 The information of $f_i$ about $\beta$ is defined as
$$I(f_i, \beta) = \frac{\left(E_{\xi_i \beta}\, f_i^{\beta}(x_i, y_i, \beta)\right)^2}{E_{\xi_i \beta}\, f_i(x_i, y_i, \beta)^2}. \qquad (4.10)$$ □
The optimum estimating functions $f_i^*$ are defined with respect to Definition 4.3 as those estimating functions with highest information. Under regularity conditions Godambe and Thompson (1974), [74], derived the form of the optimum estimating function $f_i^*$ for one observation $(x_i, y_i)$ with density $p_i$ and with the parameter $(\xi_i, \beta)$,
$$f_i^*(x_i, y_i, \beta) = C_1(\xi_i, \beta)\, \frac{\partial \log p_i}{\partial \beta} + C_2(\xi_i, \beta)\left( \left(\frac{\partial \log p_i}{\partial \xi_i}\right)^2 + \frac{\partial^2 \log p_i}{\partial \xi_i^2} \right), \qquad (4.11)$$
provided the constants $C_1(\xi_i, \beta), C_2(\xi_i, \beta)$ are such that the resulting estimating function $f_i^*$ is independent of $\xi_i$. Unfortunately, they had no result on the existence of estimating functions and especially of the optimum estimating function. In a follow-up paper, Godambe (1976), [71], showed the uniqueness of the optimum estimating function, if it exists. In the simple linear functional relation model with standard normally distributed errors
$$x_i \sim N(\xi_i, 1), \quad y_i \sim N(\xi_i \beta, 1), \qquad (4.12)$$
we have for instance the unbiased estimating function
$$f_i(x_i, y_i, \beta) = (y_i - \beta x_i)(x_i + \beta y_i). \qquad (4.13)$$
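For illustration only (not from the original text): in the model (4.12) the estimating equation (4.3) built from (4.13), $\sum_i (y_i - \beta x_i)(x_i + \beta y_i) = 0$, is a quadratic equation in $\beta$ and can be solved explicitly. A minimal sketch with simulated data and assumed parameter values:

```python
import numpy as np

rng = np.random.default_rng(2)
n, beta0 = 200, 0.7
xi = rng.uniform(1.0, 3.0, n)                 # unknown design points (assumed)
x = xi + rng.standard_normal(n)               # x_i ~ N(xi_i, 1)
y = beta0 * xi + rng.standard_normal(n)       # y_i ~ N(xi_i * beta0, 1)

# sum_i (y_i - b x_i)(x_i + b y_i) = 0  is equivalent to
#   Sxy * b^2 + (Sxx - Syy) * b - Sxy = 0,  with Sxy = sum x_i y_i etc.
Sxx, Syy, Sxy = np.sum(x * x), np.sum(y * y), np.sum(x * y)
a, b, c = Sxy, Sxx - Syy, -Sxy
disc = np.sqrt(b * b - 4 * a * c)
candidates = np.array([(-b + disc) / (2 * a), (-b - disc) / (2 * a)])
beta_tilde = candidates[np.sign(candidates) == np.sign(Sxy)][0]
print(beta_tilde)
```

The two roots multiply to $-1$; the sketch keeps the root whose sign agrees with the sign of the empirical cross moment.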
But already in this simple case there is no solution for the constants $C_1(\xi_i, \beta), C_2(\xi_i, \beta)$ in (4.11) such that the estimating function in (4.11) is independent of $\xi_i$. Let us discuss this point in detail. For (4.12) we have
$$\frac{\partial \log p_i}{\partial \beta} = (y_i - \xi_i \beta)\, \xi_i \qquad (4.14)$$
and
$$\left(\frac{\partial \log p_i}{\partial \xi_i}\right)^2 + \frac{\partial^2 \log p_i}{\partial \xi_i^2} = \left((y_i - \xi_i \beta)\beta + (x_i - \xi_i)\right)^2 - 1 - \beta^2. \qquad (4.15)$$
In particular,
$$f_i^*(0, 0, 0) = -C_2(\xi_i, 0)\left(\xi_i^2 - 1\right), \qquad f_i^*(1, 0, 0) = \left(\xi_i^2 - 2\xi_i\right) C_2(\xi_i, 0),$$
and for all $\xi_i^2 \neq 1$ this implies
$$f_i^*(1, 0, 0) = f_i^*(0, 0, 0)\, \frac{2\xi_i - \xi_i^2}{\xi_i^2 - 1},$$
which contradicts the independence of $f_i^*(1, 0, 0)$ from $\xi_i$.

Kumon and Amari (1984), [105], considered the orthogonal decomposition of the estimating function into an information, an ancillary and a normal component. In particular, they studied that subset of all estimating functions in which the
coefficient of the information component is independent of the nuisance parameters, which they call uniform informative. They derived a new lower bound for the asymptotic variance of the estimators with respect to uniform informative estimating functions. In the simple linear functional relation model (4.12) the estimating function (4.13) is uniform informative and optimal in the sense that the corresponding estimator attains this new lower bound for the variances. On the basis of differential geometry and the concept of Hilbert bundles, Amari and Kumon (1988), [4], constructed a decomposition of the set of all regular estimating functions. They also gave abstract sufficient and necessary conditions for the existence of an optimal estimating function. Their results are applied to a special type of $\xi$-exponential family,
$$p_{\xi_i \beta}(x_i, y_i) = \exp\left(\xi_i\, s(x_i, y_i, \beta) + r(x_i, y_i, \beta) - \psi(\xi_i, \beta)\right). \qquad (4.16)$$
The simple linear model (4.12) is a member of this class, since
$$s(x_i, y_i, \beta) = x_i + y_i \beta, \qquad r(x_i, y_i, \beta) = -\frac{1}{2}\left(x_i^2 + y_i^2\right)$$
and
$$\psi(\xi_i, \beta) = \frac{1}{2}\, \xi_i^2 \left(1 + \beta^2\right) + \ln(2\pi).$$
It is easily seen that the nonlinear functional relation model - even with normal errors - is not in this class. Amari and Kumon (1988), [4], gave an explicit expression for the optimal estimating function in the model (4.16). For the simple linear functional relation model (4.12) they arrived at the old result that there is no optimal estimator in the class of all unbiased estimating functions. The conditions of these universal theorems are not easy to verify in general models other than the $\xi$-exponential classes. The way out used by Amari and Kawanabe (1996), [3], [2], is to restate the problem by regarding the nuisance parameters as i.i.d. random variables. The problem in the nonlinear functional relation model is the initial one - the construction of unbiased estimating functions that do not depend on the nuisance parameters. The question arises how to eliminate the nuisance parameters in the joint parametric estimating equation such that (4.1) is satisfied. In Chapter 8 we will see that in the general nonlinear case the least squares estimator is not an estimator in the sense of (4.1). The alternative estimator introduced in Chapter 10 will be of this kind. The M-estimators for error-in-variables models introduced by Stefanski (1985), [155], are also of this kind, see the references in Section 1.3.7. Unfortunately, up to now the construction of Stefanski's M-estimators has been carried out only for some special models, like the generalized linear, the exponential or the polynomial ones. The alternative estimator presented here also depends on additional conditions, which are still quite restrictive and seem to exclude a number of models.
4.2 Minimum contrast estimates
Another approach is that of minimum contrast estimation. The estimator is defined as a solution of a minimization problem. For a moment let us pass to the case that we want to estimate both types of parameters simultaneously and disregard the different importance of the parameter of interest and the nuisance parameters. We consider the composite $n + p$ dimensional parameter $\theta = (\xi, \beta) \in \mathcal{F}^{(n)} \times \Theta = \Xi \subseteq \mathbb{R}^{n+p}$. Let $\Xi^c$ denote the compactification of $\Xi$ in the compactified $n+p$ dimensional Euclidean space $\overline{\mathbb{R}}^{n+p}$.

Definition 4.4 A nonrandom positive real function $C_n : \theta \in \Xi^c \to \mathbb{R}_+$ is called a contrast for $\theta$ at $\theta_{(n)}$ iff it is lower semicontinuous and
$$\theta_{(n)} = \arg\min_{\theta \in \Xi^c} C_n(\theta). \qquad (4.17)$$ □

Examples in the nonlinear functional relation model are
$$C_n(\theta) = \sum_{i=1}^n w_i \left( g(\xi_i, \beta) - g(\xi_i, \beta^0) \right)^2 \qquad (4.18)$$
or
$$C_n(\theta) = C(\beta) = \int \left( g(\xi, \beta) - g(\xi, \beta^0) \right)^2 dG(\xi), \qquad (4.19)$$
where $\beta_{(n)} = \beta^0$ with $\beta^0$ satisfying (3.1). The first one depends on the unknown design points $\xi \in \mathcal{F}^{(n)}$; the second on an asymptotic design $G$. If we have a unique parameterization of the regression function, each distance measure $d$ for functions $g(\cdot, \beta) \in \mathcal{M}$ seems to be a useful contrast at $\beta_{(n)} = \beta^0$,
$$C_n(\theta) = C_n(\beta) = d\left(g(\cdot, \beta), g(\cdot, \beta^0)\right). \qquad (4.20)$$
$C_n$ measures the difference between any parameter $\beta \in \Theta$ and the true parameter $\beta^0$ satisfying (3.1). The idea of the minimum contrast approach now consists in an appropriate estimation of this contrast and in the search for its minimal points.

Definition 4.5 Suppose a measurable function
$$\widetilde{C}_n(\cdot, \cdot, \cdot) : \mathbb{R}^n \times \mathbb{R}^n \times \Xi^c \to \mathbb{R}_+ \qquad (4.21)$$
is continuous with respect to $\theta$; then $\widetilde{C}_n(X, Y, \theta) =: \widetilde{C}_n(\theta)$ is called a contrast function. □
In the general statistical experiment, which includes random processes and random fields as well, Liese and Vajda (1995), [111], introduced a more general concept and called the function corresponding to (4.21) a contrast principle; compare also Liese and Vajda (1995), [110]. Note that, in order to simplify the notation, we will suppress the dependence on the sample and let the tilde hint at this. We then define the corresponding estimator as follows.

Definition 4.6 A measurable solution $\tilde\theta : \mathbb{R}^n \times \mathbb{R}^n \to \Xi^c$ is called a minimum contrast estimator iff
$$\tilde\theta \in \arg\min_{\theta \in \Xi^c} \widetilde{C}_n(\theta), \qquad (4.22)$$
where $\Xi^c$ denotes the compactification of the parameter set $\mathcal{F}^{(n)} \times \Theta$ in $\overline{\mathbb{R}}^{n+p}$. □

Under the model assumptions above, the existence of minimum contrast estimators is given by Lemma 2 of Liese and Vajda (1995), [111]. If we require an averaging structure of the contrast function, such that
$$\widetilde{C}_n(\theta) = \frac{1}{n}\sum_{i=1}^n F_i(x_i, y_i, \theta), \qquad (4.23)$$
and if we set
$$C_n(\theta) = \frac{1}{n}\sum_{i=1}^n E_{\xi_i \beta^0} F_i(x_i, y_i, \theta), \qquad (4.24)$$
then the minimum contrast estimator is the same as in the notation of Pfanzagl (1969), [133]. Stefanski (1985), [155], called estimates based on (4.23) M-estimates. The classical notion of M-estimates in the sense of Huber (1981), [83], refers to minimum contrast functions which depend on the observations and the parameters via the differences $(x_i - \xi_i)$ and $(y_i - g(\xi_i, \beta))$. (Unfortunately, these types of contrast functions depend on the nuisance parameters.) Sometimes the name generalized M-estimator is also used for estimates corresponding to (4.23), see the discussion in Liese and Vajda (1995), [111]. The following lemma gives the connection between the consistency of the minimum contrast estimator and the uniformly consistent approximation of the contrast $C_n(\theta)$ by the contrast function $\widetilde{C}_n(\theta)$. It is a version of an "argmin" result, like the argmax theorem for i.i.d. experiments in van der Vaart and Wellner (1996), [173]. Consider the differences of the contrasts and the contrast functions,
$$\Delta C_n(\theta) = C_n(\theta) - C_n\left(\theta_{(n)}\right) \quad \text{and} \quad \Delta\widetilde{C}_n(\theta) = \widetilde{C}_n(\theta) - \widetilde{C}_n\left(\theta_{(n)}\right). \qquad (4.25)$$
Lemma 4.1 Let $\rho : \mathbb{R}_+ \to \mathbb{R}_+$ be strictly increasing with $\rho(0) = 0$. Let $d(\cdot, \cdot)$ be a semimetric on $\Xi^c$. For any $\epsilon > 0$, define the set
$$\Xi_n(\epsilon) = \Xi^c \cap \left\{ \theta : d\left(\theta, \theta_{(n)}\right) > \epsilon \right\}. \qquad (4.26)$$
Let $C_n$ be a contrast for which
$$\Delta C_n(\theta) = C_n(\theta) - C_n\left(\theta_{(n)}\right) \ge \rho\left(d\left(\theta, \theta_{(n)}\right)\right). \qquad (4.27)$$
Then for all $\epsilon > 0$
$$P\left( d\left(\tilde\theta, \theta_{(n)}\right) > \epsilon \right) \le P\left( \sup_{\theta \in \Xi_n(\epsilon)} \frac{\Delta C_n(\theta) - \Delta\widetilde{C}_n(\theta)}{\rho\left(d\left(\theta, \theta_{(n)}\right)\right)} \ge 1 \right). \qquad (4.28)$$ □

Proof.
From (4.27) it follows that, for all $\theta \in \Xi_n(\epsilon)$,
$$C_n(\theta) - C_n\left(\theta_{(n)}\right) \ge \rho\left(d\left(\theta, \theta_{(n)}\right)\right) > \rho(\epsilon) > 0 \qquad (4.29)$$
and
$$\inf_{\theta \in \Xi_n(\epsilon)} \frac{\Delta C_n(\theta)}{\rho\left(d\left(\theta, \theta_{(n)}\right)\right)} \ge 1. \qquad (4.30)$$
We have $\widetilde{C}_n(\tilde\theta) \le \widetilde{C}_n\left(\theta_{(n)}\right)$ and, under $\tilde\theta \in \Xi_n(\epsilon)$, we also have $\rho\left(d\left(\tilde\theta, \theta_{(n)}\right)\right) > \rho(\epsilon) > 0$; thus
$$\frac{\Delta\widetilde{C}_n\left(\tilde\theta\right)}{\rho\left(d\left(\tilde\theta, \theta_{(n)}\right)\right)} \le 0. \qquad (4.31)$$
Hence, under $\tilde\theta \in \Xi_n(\epsilon)$, we obtain from (4.30) and (4.31) the following chain of inequalities:
$$1 \le \inf_{\theta \in \Xi_n(\epsilon)} \frac{\Delta C_n(\theta)}{\rho\left(d\left(\theta, \theta_{(n)}\right)\right)} - \frac{\Delta\widetilde{C}_n\left(\tilde\theta\right)}{\rho\left(d\left(\tilde\theta, \theta_{(n)}\right)\right)} \le \sup_{\theta \in \Xi_n(\epsilon)} \frac{\Delta C_n(\theta) - \Delta\widetilde{C}_n(\theta)}{\rho\left(d\left(\theta, \theta_{(n)}\right)\right)}, \qquad (4.32)$$
n 1X |yi − g (ξi , β)| + |xi − ξi | , n i=1
46
CHAPTER 4. ESTIMATION OF THE PARAMETER OF INTEREST
or generally for all contrasts satisfying the separation condition (4.27) and e (θ) ∆Cn (θ) − ∆C n P sup ≥ 1 → 0.
θ∈Ξn (ǫ)
ρ d θ, θ(n)
(4.33)
This general approach for nonparametric problems is studied by Van de Geer (1990), [63], and in Birge and Massart (1993), [20]. They all introduced contrast function of average structure like in (4.23) and applied results from empirical process theory. Under an entropy condition on the parameter space, the approximation property (4.33) is satisfied. The derived consistency rates of minimum contrast estimates in an abstract model, which is primarily a generalization of the nonparametric regression model, although it is intended to describe various other models as well. The parameter is chosen abstract but indepedent of the sample size n. The semiparametric functional relation model (3.19), (3.20), is included in their general approach. The purpose of the paper of Birge and Massart (1993), [20] is the global approach to the consistency of minimum contrast estimators and they paid the price of more complex assumptions. In another paper Birge and Massart (1994), [21], developed the general theory of minimum contrast estimators further for estimators defined on sieves. The method of sieves is much more closely related to our model because a sequence of the parameter sets are considered, which can be chosen to depend on the sample size. Thus it is possible to interpret the l.s.e. (5.5) as a method of sieves, where the dimension of the parameter space increases in the same order as the sample size. Van de Geer, (1995), [64], discussed the sieves approach for the least squares estimators in the nonparametric regression model and for the nonparametric maximum likelihood estimator. Generally speaking the method of sieves is a very interesting one in the nonlinear functional relation model, when the aim is the estimation of the nuisance parameters as well. The L1 −approach for the nonlinear functional model is considered in Zwanzig (1997), [189] and is based on results of van de Geer (1990), [63]. In this paper we will consider the least squares estimator only because of its common use. The case of general contrast functions which includes the consistent estimation of the nuisance parameter is not within the scope of this paper. Now let us come back to our aim of estimating the parameter of interest β only. The main problem is to find a reasonable contrast Cn (θ) at θ(n) = (ξ 0 , β 0 ) which can be estimated by a contrast function depending of β only, that is, Cen (θ) = Cen (β) . The connection to the estimating functions, defined in (4.2), (4.10), is found when in (4.23) Fiβ (xi , yi , θ) = Fiβ (xi , yi , β) is independent of ξ and when Eθ Fiβ (xi , yi , β) = 0,
47
4.3. MAXIMUM LIKELIHOOD ESTIMATES then Fiβ (xi , yi , β) = fi (xi , yi , β) .
Hence, equation (4.3) is the normal equation of the minimizing problem (4.22). A modified version of Lemma 4.1 restricted to the estimation of the parameter of interest can be stated as follows: Lemma 4.2 Let ρ : R+o→ R+ be strictly increasing , with ρ (0) = 0, let Ξn (ǫ) =
n
c Ξ ∩ θ : β − β(n)
> ǫ , Let Cn be a contrast function with
Cn (β) − Cn β(n) ≥ ρ
β − β(n)
.
Then ∀ǫ > 0 2
(4.34)
P
βe − β(n)
> ǫ ≤ P sup ∆Cn (β) − ∆Cen (β) > ρ (ǫ) . (4.35) θ∈Ξn (ǫ)
This case of a contrast function which does not depend of the nuisance parameter is considered in Chapter 10. The alternative estimator introduced there is based on such a construction of a contrast function independent of the nuisance parameters.
4.3
Maximum likelihood estimates
A classical reference for an overview on methods of eliminating nuisance parameters from the likelihood function is Kalbfleisch and Sprott (1970), [92]. In order to simplify the notation, let us assume for a moment that εji , j = 1, 2; i = 1, ..., n are i.i.d. with density p and l = ln p. σji
(4.36)
The density of the sample (x1 , y1 ) , ..., (xn , yn ) generated by (3.1) and (3.2) is !
n Y
!
xi − ξ i yi − g (ξi , β) p , p σ2i σ2i i=1
(4.37)
and the log-likelihood function is L (ξ, β) =
n X i=1
Li (ξi , β)
!
!
xi − ξ i yi − g (ξi , β) +l . with Li (ξi , β) = l σ1i σ2i (4.38)
48
CHAPTER 4. ESTIMATION OF THE PARAMETER OF INTEREST
Definition 4.7 A measurable solution βb of
βb ∈ arg max max L (ξ, β)
(4.39)
is a maximum likelihood estimator for β and a measurable solution ξb of ξb ∈ arg max max L (ξ, β)
(4.40)
β∈Θ ξ∈F (n)
β∈Θ ξ∈F (n)
is a maximum likelihood estimator for ξ.2
In stepwise calculating the maximum likelihood estimator for β it is useful to introduce the maximum likelihood estimator for ξ for any given fixed β. Suppose ξb (β) ∈ arg max L (ξ, β) , (4.41) ξ∈F (n)
and
Lp (β) = max L (ξ, β) = L ξb (β) , β . ξ∈F (n)
(4.42)
Lp (β) is called profile likelihood function. Then the maximum likelihood estimator can be calculated with the help of the profile likelihood: βb ∈ arg max max L (ξ, β) , with β∈Θ ξ∈F (n)
max max L (ξ, β) = max L ξb (β) , β . β∈Θ ξ∈F (n)
β∈Θ
(4.43) Kalbfleisch and Sprott (1970), [92], discussed the use of the maximum relative likelihood function: L ξb (β) , β . maxβ∈Θ L ξb (β) , β
We are now interested in seeing how the maximum likelihood function is related to the estimating functions and to the minimum contrast functions.
4.3.1 Maximum likelihood and minimum contrast estimates
There are two possible connections between the maximum likelihood approach and the idea of minimum contrast estimation; one to consider the negative likelihood function as contrast function for the composite parameter and the other to set the negative profile likelihood as an contrast function independent on the nuisance parameters. We will discuss both. In the first case Cen (θ) = − n1 L (ξ, β) the question arises what is the corresponding contrast? This means, we are looking for a deterministic function Cn (θ) such that 1 ∆L (ξ, β) = −∆Cn (θ) + oPθ0 (1) n
and
β 0 = arg min Cn (θ) . β∈Θ
(4.44)
49
4.3. MAXIMUM LIKELIHOOD ESTIMATES Under regularity conditions which ensure that n 1X (Li (ξi , β) − Eθ0 Li (ξi , β)) = oPθ0 (1) , n i=1
the corresponding contrast is the expected value − n1 Eθ0 L (ξ, β). Under normal distributed errors then the contrast is n g (ξi0 , β 0 ) − g (ξi , β) 1 X Cn (θ) = 2n i=1 σ1i
!2
ξ 0 − ξi + i σ2i
!2
.
The problem now is to verify the separation condition (4.27) in Lemma 4.1 for 1 E 0 L (ξ, β) in order to obtain a consistency result for the maximum likelihood n θ estimator of β. For the case of normal distributed errors this is done in Lemma 5.1 in the context of the least squares estimation. Consider now Cen (β) = − n1 Lp (β), where the negative profile likelihood is treated as a contrast function. What, then, is the corresponding contrast? To illustrate the matter let us consider the simple linear functional relation model (4.12) with the joint density of the sample (yi − ξi β)2 (xi − ξi )2 exp − − 2 2 i=1 2π n Y 1
!
(4.45)
and ξbi (β) =
βyi + xi . 1 + β2
(4.46)
Then the profile likelihood function is Lp (β) =
n X i=1
!
1 (yi − xi β)2 − ln (2π) . 2 1+β
(4.47)
One has 1 ∆Lp (β) = −∆Cn (θ) + oPθ0 (1) n with
n 2 1X 1 0 Cn (θ) = . ξ β − ξ β i i n i=1 1 + β 2
(4.48)
Here, we have β 0 = arg minβ∈Θ Cn (θ) . The distance (4.48) is a special weighted P sum of squares differences. Under n1 ξi2 > √1n , the separation condition (4.27) is satisfied.
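As an illustrative aside (not part of the original text; data and parameter values are assumptions), the profile likelihood (4.47) of the simple linear functional relation model (4.12) can be evaluated and maximized numerically, for instance by a crude grid search:

```python
import numpy as np

rng = np.random.default_rng(3)
n, beta0 = 200, 0.7
xi = rng.uniform(1.0, 3.0, n)
x = xi + rng.standard_normal(n)
y = beta0 * xi + rng.standard_normal(n)

def profile_loglik(b):
    # L_p(b) = sum_i ( -(y_i - x_i b)^2 / (2 (1 + b^2)) - ln(2 pi) ), cf. (4.46), (4.47)
    return np.sum(-(y - x * b) ** 2 / (2.0 * (1.0 + b ** 2)) - np.log(2 * np.pi))

grid = np.linspace(-5.0, 5.0, 20001)
beta_hat = grid[np.argmax([profile_loglik(b) for b in grid])]
print(beta_hat)
```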
4.3.2 Estimating functions and maximum likelihood
The connection to the estimating functions of the subsection above is given by the normal equations. In the functional relation model with the likelihood function (4.38) the β−score function is defined by !
∂ yi − g (ξi , β) xi − ξ i Ui = u (xi , yi , ξi , β) = l +l ∂β σ1i σ2i that is
yi − g (ξi , β) σ1i
Ui = −l′
!
(4.49)
!
∂ yi − g (ξi , β) xi − ξ i Vi = v (xi , yi , ξi , β) = l +l ∂ξi σ1i σ2i Vi = −l
′
yi − g (ξi , β) σ1i
!
1 ξ xi − ξ i g (ξi , β) − l′ σ1i σ2i
!
!!
1 . σ2i
Suppose the components ξbi (β) of (4.41) are such that
,
1 β g (ξi , β) . σ1i
The ξi −score function is defined by
that is
!!
v xi , yi , ξbi (β) , β = 0.
,
(4.50)
(4.51)
This may be for instance the case when the set of nuisance parameters F (n) is an open product set of the Rn . Set Ubi = u xi , yi , ξbi (β) , β . (4.52)
In the nonlinear functional relation model with normal distributed errors and F (n) , such that (4.51) holds, we have Ubi =
with ξbi (β) given implicitly by
yi − g ξbi (β) , β 2 σ1i
yi − g ξbi (β) , β 2 σ1i
g β ξbi (β) , β
xi − ξbi (β) g ξ ξbi (β) , β + = 0. 2 σ2i
If Ubi is an estimating function, that is
Eβξi Ubi = 0,
(4.53)
(4.54)
(4.55)
and if certain regularity conditions hold, then the maximum likelihood function is consistent. The nonlinear case with (4.51) does not satisfy (4.55). We will show
51
4.3. MAXIMUM LIKELIHOOD ESTIMATES
this in Section 8.3. But in the simple linear functional relation model (4.12) we have ξbi (β) given explicitly by (4.46) and Because of
Ubi =
1 (yi − xi β) (yi β + xi ) . (1 + β 2 )
Eβξi yi2 = 1 + ξi2 β 2 ,
Eβξi yi xi = ξi2 β,
(4.56)
Eβξi x2i = 1 + ξi2 ,
(4.57)
we have in this special case (4.55). Note, that (4.56) yields the estimating function (4.13).
4.3.3 Estimates under a factorization condition
There exists a number of papers on the estimation of the parameter of interest β provided the existence of a sufficient statistic T for the nuisance parameter for given β. Suppose the density of the joint sample distribution (4.37) admits a factorization n Y
q (xi , yi, β) p0 (T (xi , yi , β) , β, ξi ) .
(4.58)
i=1
Then a conditional maximum likelihood estimator c.m.l.e. is defined by βb ∈ arg max β∈Θ
n X
ln q (xi , yi, β) .
(4.59)
i=1
Under identification conditions on the conditional density q (xi , yi, β) Andersen (1970), [9], showed the consistency and asymptotical normality of the c.m.l.e., see also Pfanzagl (1993), [135]. Note that the exponential family (4.16) satisfies (4.58). Especially, in the simple linear functional relation model (4.12) we have for the density (4.45) !
n Y 1
1 (1 + β 2 ) 2 exp − (y − x β) − (T (xi , yi , β) − ξi )2 . i i 2 2 (1 + β ) 2 i=1 2π
(4.60)
Then T (xi , yi , β) is the sufficient statistic for ξi for given β, T (xi , yi , β) =
βyi + xi . 1 + β2
(4.61)
In that case the conditional maximum likelihood is defined by n X
1 (yi − xi β)2 . 2) β∈Θ (1 + β i=1
βb = arg min
(4.62)
Comparing (4.62) to the profile likelihood in (4.47), we see that the maximum likelihood estimator and the conditional maximum likelihood estimator coincide.
52
CHAPTER 4. ESTIMATION OF THE PARAMETER OF INTEREST Godambe (1976), [71], introduced the conditional score function as Wi (β, ξi ) = Ui − E (Ui | Ti ) ,
(4.63)
with the β− score Ui given in (4.49). Under the factorization condition (4.58), the conditional score is independent on ξi and an estimating function in the sense of Definition 4.1. Lindsay (1982), [112], discussed the computational and conceptional advantages of (4.63) over the conditional likelihood approach. Mantel and Godambe (1993), [119], found the conditional linear optimal unbiased estimating function for general models with a complete sufficient statistic for the nuisance parameters. They considered as an example the simple linear functional relation model and obtained that (4.13) is the optimal one. The generalized linear measurement error model (1.16) with normal error distributions fulfills the condition (4.58). Stefanski and Carroll (1987), [162], derived the conditional score function (4.63) for such models. In the general nonlinear case the problem is open to find a nontrivial factorization of the type (4.58), which yields a consistent conditional maximum likelihood estimator.
4.3.4 Reparameterization
Cox and Reid (1987), [39], proposed a new parameterization (β, λi ) for the density of (xi , yi ) by writing ξi = hi (β, λi ) , (4.64) chosen to make the new parameter β, λi orthogonal in the sense of the Fisher information metric. There are a number of statistical advantages of the orthogonal parameterization discussed by Cox and Reid (1987), [39]. The most important one in our context is that the m.l.e. of β when λi is given varies only slowly with λi . For further references see also Cox and Reid (1993), [40], and Liang (1987), [109]. Under (4.64) the new log-likelihood function of (yi , xi )is !
!
yi − g (hi (β, λi ) , β) xi − hi (β, λi ) Li (λi , β) = l +l , σ1i σ2i
(4.65)
with Iβλ = Eβξi Lβi (λi , β) Lλi (λi , β) = 0.
(4.66)
Note, that under (4.64), we have a general nonlinear functional relation model of the type yi = g (hi (β, λi ) , β) + ε1i xi = hi (β, λi ) + ε2i .
53
4.3. MAXIMUM LIKELIHOOD ESTIMATES
Specializing the condition (4.66) to the functional relation model with standard normal distributed errors, we obtain the differential equation hβi = −
g ξ (hi , β) g β (hi , β) . 1 + g ξ (hi , β)2
(4.67)
In the simple linear model (4.12) this equation has the form hβi = −
β hi , (1 + β 2 )
the new parameter λi orthogonal to β is λi = λi (ξi , β) = and the new linear model is yi = √
q
1 + β 2 ξi ,
β λi + ε1i , 1 + β2
(4.68)
1 λi + ε2i . (4.69) 1 + β2 Then we have the for β− score function in (4.68) and (4.69) with standard normal distributed errors: xi = √
Lβi
(λi (β) , β) =
1 1 + β2
!3
2
λi (yi − βxi )
(4.70)
and
yi Lβi λi βe , βe = 0, for βe = . xi The estimating function involved in (4.70),
fi∗ (yi , xi , β) = yi − βxi , is an optimum estimation function in the sense of Godambe’s information (4.10) because fi∗ is of type (4.11) with 3
(1 + β 2 ) 2 , C1 (λi , β) = λi
C2 (λi , β) = 0.
Note, this is no contradiction to the discussion in Section 1.1 because here we have the different , more general, model (4.68), (4.69). The estimator for β with respect to fi∗ basing on the whole sample (yi , xi ) , i = 1, .., n, is the moment estimator, which uses the first moments only, Pn yi e β = Pni=1 . i=1
xi
54
CHAPTER 4. ESTIMATION OF THE PARAMETER OF INTEREST
Unfortunately this estimator loses its importance already in the case of more than one parameter, for instance if, in the original model, g (ξi , β) = β1 + β2 ξi . In the nonlinear model with standard normal errors and with parameters satisfying (4.67) a straightforward transformation of this approach gives no new idea. Let us shortly discuss this point. Here we have Lβi (λi (β) , β) =
gβ ξ ξ y − g − x g + hg . i i 1 + (g ξ )2
Setting analogously C1 (λi , β)−1 =
gβ , 1 + (g ξ )2
C2 (λi , β) = 0,
in (4.11), we obtain fi∗ (yi , xi , β) = yi − g − xi g ξ + hg ξ . If the lefthand side is independent on λi for all xi , yi , β, then also fi∗ (0, 0, β) = hi g ξ (hi , β) − g (hi , β) = const. This implies that we have the linear case, g (hi , β) = hi g1 (β) + g2 (β) . Already Cox and Reid (1987), [39], mentioned that it is in general not possible to find a totally orthogonal parameterization; they proposed instead a local approach, compare also Godambe (1991), [73]. This requires that we operate in the neighborhood of the ”true” nuisance parameter; but we cannot ensure that in the general nonlinear case, where the nuisance parameters are not consistently estimable. Nevertheless this approach seems to be useful also in the nonlinear case to improve consistent estimators. It is worth to think about it in further studies - but we will not do that here.
Chapter 5 Least squares estimator

In this chapter we introduce the least squares estimators for both the parameter of interest and the nuisance parameters. In the following we also use the abbreviation l.s.e. for the least squares estimator. In contrast to the discussion in the previous chapter, we do not distinguish between nuisance parameters and the parameter of interest in this l.s.e. approach. Both types of parameters are estimated by the least squares principle. The least squares estimator has a nice heuristic background and is common in applications. In the numerical literature it is called the orthogonal regression estimator, and its numerical properties are well studied, see also Section 1.3.6 in the introduction. Nevertheless, in Chapter 8 the l.s.e. will turn out to be inconsistent already in very simple nonlinear cases.
5.1
Definition of the l.s.e.
In this paper we consider weighted squares. Introduce two series of normalized weights w1 = (w11 , ..., w1n )T , w1i ≥ 0, w2 = (w21 , ..., w2n )T , w2i ≥ 0 and
n X
w1i +
n X
w2i = 1, w1 max = max w1i , w2 max = max w2i . i=1,..,n
i=1
i=1
i=1,..,n
(5.1)
(5.2)
Then the weighted sum of squares is Q (ξ, β) =
n X i=1
w1i (yi − g (ξi , β))2 +
n X i=1
w2i (xi − ξi )2 .
(5.3)
Definition 5.1 The least squares estimator βb is a measurable solution of the minimization problem βb ∈ arg minc min β∈Θ
ξ∈(F (n) )c
n X i=1
w1i (yi − g (ξi , β))2 + 55
n X i=1
w2i (xi − ξi )2 ,
(5.4)
56
CHAPTER 5. LEAST SQUARES ESTIMATOR
where (F (n) )c ⊆ (Rn )c and Θc ⊆ (Rp )c are the compactified parameter sets in the corresponding compactified Euclidean spaces.2 Further let us introduce the l.s.e. for the nuisance parameters. Definition 5.2 We define the least squares estimator ξb as a measurable solution of the minimization problem ξb ∈ arg minc min
n X
β∈Θ ξ∈(F (n) )c i=1
w1i (yi − g (ξi , β))2 +
n X i=1
w2i (xi − ξi )2 ,
(5.5)
where (F (n) )c ⊆ (Rn )c and Θc ⊆ (Rp )c are the compactified parameter sets in the corresponding compactified Euclidean spaces.2 For normal distributed errors
2 εji ∼ N 0, σji ,
(5.6)
with j = 1, 2; i = 1, ..., n, the log-likelihood function is L (ξ, β) =
n X i=1
!
1 1 − 2 (yi − g (ξi , β))2 − 2 (xi − ξi )2 − ln (2πσ1i σ2i ) . (5.7) 2σ1i 2σ2i
Then the maximum likelihood estimators for β and ξ are the measurable solutions of
βb ∈ arg maxc
max
n X
β∈Θ ξ∈(F (n) )c i=1
1 1 − 2 (yi − g (ξi , β))2 − 2 (xi − ξi )2 σ1i σ2i
!
(5.8)
and ξb ∈ arg maxc max
n X
β∈Θ ξ∈(F (n) )c i=1
!
1 1 − 2 (yi − g (ξi , β))2 − 2 (xi − ξi )2 . σ1i σ2i
(5.9)
In the case of normal distributed errors the maximum likelihood estimator is the weighted least squares estimator with the special weights w1∗ and
=
v v , ..., 2 2 σ11 σ1n n X 1
2 i=1 σ1i
!T
+
,
w2∗
=
n X 1
2 i=1 σ2i
v v , ..., 2 2 σ21 σ2n 1 = . v
!T
(5.10)
(5.11)
57
5.1. DEFINITION OF THE L.S.E.
Note that if the errors are i.i.d. and normally distributed, then the maximum likelihood estimator is the least squares estimator with
1 1 wj = , ..., 2n 2n
T
, j = 1, 2.
(5.12)
Analogously to the profile likelihood,(4.42), we introduce a projected sum of squares and, analogously to (4.41), we will define the least squares estimator of ξ for any given β. Definition 5.3 We call Qp (β) =
min
ξ∈(F (n) )c
n X i=1
w1i (yi − g (ξi , β))2 +
n X i=1
w2i (xi − ξi )2
(5.13)
the projected sum of squares.2 Definition 5.4 For any given β ∈ Θ we define the least squares projection ξb (β) as a measurable solution of the minimization problem ξb (β) ∈ arg
min
ξ∈(F (n) )c
n X i=1
2
w1i (yi − g (ξi , β)) +
n X i=1
w2i (xi − ξi )2 ,
(5.14)
where (F (n) )c ⊆ (Rn ) is the compactified parameter set in the compactified Euclidean space. 2 We have Qp (β) =
n X i=1
w1i
yi − g ξbi (β) , β
2
+
n X i=1
w2i xi − ξbi (β)
2
,
(5.15)
where ξbi (β) , i = 1, ..., n are the components of ξb (β) . If we are also interested in the least squares estimator of βb (ξ) for any given ξ ∈ F (n) , this leads to the least squares estimator in the usual nonlinear regression model defined by the equation (3.1) only. Definition 5.5 For any given ξ ∈ F (n) we define the nonlinear least squares estimator βb (ξ) as a measurable solution of the minimization problem βb (ξ) ∈ arg min β∈Θ
2
n X i=1
w1i (yi − g (ξi , β))2 .
(5.16)
58
CHAPTER 5. LEAST SQUARES ESTIMATOR
Note that βb (ξ 0 ), where the components of ξ 0 fulfill (3.6), is the weighted least squares estimator in the nonlinear regression model. Otherwise βb (ξ) is the weighted least squares estimator in the inadequate nonlinear regression model. In Zwanzig (1980), [182], it is shown that βb (ξ) = β (ξ) + oPξo βo (1) , where β (ξ) ∈ arg min β∈Θ
n X i=1
w1i g ξi0 , β 0 − g (ξi , β)
2
.
(5.17)
Obviously, we have
βb ξb = βb .
ξb βb = ξb and
(5.18)
Remark 5.1 For continuous regression functions g (., .) the existence of measurable solutions of the minimization problems (5.4), (5.5), (5.14), (5.16), follows from Theorem 3.10 in Pfanzagl (1969), [133].
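The following sketch is an illustration only (not from the original text): it implements Definitions 5.1, 5.3 and 5.4 with equal weights for an assumed Michaelis-Menten type curve, computing the projected sum of squares by a one dimensional minimization over each design point and then minimizing it over β. The projection interval and all numerical values are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import minimize_scalar, minimize

rng = np.random.default_rng(4)

def g(t, beta):
    return beta[0] * t / (beta[1] + t)        # assumed Michaelis-Menten type curve

n, beta0 = 40, np.array([1.0, 0.5])
xi0 = np.linspace(0.2, 2.0, n)
x = xi0 + 0.05 * rng.standard_normal(n)
y = g(xi0, beta0) + 0.05 * rng.standard_normal(n)
w1 = w2 = np.full(n, 1.0 / (2 * n))           # equal weights as in (5.12)

def Qp(beta):
    # projected sum of squares (5.13): minimize over each xi_i separately, cf. (5.14), (5.15)
    total = 0.0
    for xi_i, yi, w1i, w2i in zip(x, y, w1, w2):
        obj = lambda t: w1i * (yi - g(t, beta)) ** 2 + w2i * (xi_i - t) ** 2
        total += minimize_scalar(obj, bounds=(0.0, 3.0), method='bounded').fun
    return total

# weighted l.s.e. of Definition 5.1: minimize the projected sum of squares over beta
beta_hat = minimize(Qp, x0=np.array([0.8, 0.4]), method='Nelder-Mead').x
print(beta_hat)
```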
5.2
Geometrical interpretation
A nice geometrical intuition lies behind the least squares estimator. Let us assume in this context, that the set of nuisance parameters F (n) is a product set in Rn , F (n) = F1 × ... × Fn , Fi ⊂ R.
(5.19)
Then we have min Q (ξ, β) =
ξ∈(F (n) )c
n X i=1
h
i
min c w1i (yi − g (ξi , β))2 + w2i (xi − ξi )2 .
ξi ∈(Fi )
Now we consider the case when each component of the projection
satisfies the normal equation
w1i yi − g ξbi (β) , β
ξb (β) = ξbi (β)
i=1,..,n
g ξ ξbi (β) , β + w2i xi − ξbi (β) = 0.
The tangent t (., β) of the regression curve g (., β) at ξbi (β) is
t (x, β) = g ξbi (β) , β + x − ξbi (β) g ξ ξbi (β) , β .
(5.20)
(5.21)
Define a turned tangent tw (., β) of the regression curve g (., β) at ξbi (β) by w 1i ξ b tw (x, β) = g ξbi (β) , β + x − ξbi (β) g ξi (β) , β . w2i
(5.22)
59
5.3. NAIVE LEAST SQUARES ESTIMATOR From (5.20) it follows that for all β ∈ Θ and for all x ∈ R xi yi
!
ξbi (β) ⊥ − b g ξi (β) , β
x tw (x, β)
!
b ξi (β) . − tw ξbi (β) , β
(5.23)
w1i For w = 1 the line between the observation point (xi , yi )and the regression curve 2i at the projected design point ξbi (β) is orthogonal to the tangent to the curve at this point. This orthogonality holds for all regression parameters β. The least squares estimator βb for β is the one that minimizes the sum of the orthogonal distances between the observation point and the curve. This is the empirical procedure, when one draws manually a curve through a cloud of observations by eye. Especially if the curve g (., β) is steeply increasing, the heuristic idea is to fit the curve to points (yi , xi ) by minimizing the Euclidean distance in the (x, y) −plane. In the non-weighted case the l.s.e. βb is the orthogonal regression estimator known from the literature on numerical analysis. By choosing different weights wji for different i = 1, ..., n the regions of the design area can be distinguished. The ratio of the weights w1i , w2i influences the angle between the line from the observation point to the regression curve at the projected design point and the tangent at this point.
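As a small numerical check of the orthogonality (5.23) — not part of the original text, with an assumed curve, an assumed observation point and equal weights $w_{1i} = w_{2i}$ — the residual vector at the least squares projection is perpendicular to the tangent (5.21):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def g(t, beta):
    return beta[0] * t / (beta[1] + t)              # assumed Michaelis-Menten type curve

def g_xi(t, beta):
    return beta[0] * beta[1] / (beta[1] + t) ** 2   # derivative of g with respect to the design point

beta = np.array([1.0, 0.5])
xi_obs, yi_obs = 0.9, 0.75                          # one observation point (assumed values)

# least squares projection (5.14) with equal weights w1i = w2i
res = minimize_scalar(lambda t: (yi_obs - g(t, beta)) ** 2 + (xi_obs - t) ** 2,
                      bounds=(0.0, 3.0), method='bounded')
t_hat = res.x

# normal equation (5.20) and orthogonality (5.23): the residual vector is
# perpendicular to the tangent direction (1, g_xi) at the projected point
residual = np.array([xi_obs - t_hat, yi_obs - g(t_hat, beta)])
tangent = np.array([1.0, g_xi(t_hat, beta)])
print(np.dot(residual, tangent))                    # approximately 0
```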
5.3
Naive least squares estimator
Several different kinds of notation can be found in the literature. Sometimes the estimator defined in (5.4) is called orthogonal regression estimator. The estimator of the nonlinear regression model, which ignores the errors in the variables, is named least squares estimator. We will introduce the latter one as the naive least squares estimator. This definition is useful in studying the effect of the errors in the variables. Moreover this estimator is important in the replication model, (3.6),(3.7) and for the approach of small measurement errors. Definition 5.6 The naive least squares estimator β is defined as the measurable solution of β ∈ arg min β∈Θ
2
n X i=1
w1i (yi − g (xi , β))2 .
(5.24)
Here the distances are minimized on the vertical lines (xi , .) only. The naive least squares estimator in the replicated model is β ∈ arg min β∈Θ
q X i=1
w1i (y i − g (xi , β))2 .
(5.25)
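A minimal sketch (illustrative; not from the original text, with an assumed curve and assumed noise levels) of the naive least squares estimator (5.24), which treats the observed $x_i$ as if they were the true design points:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)

def g(t, beta):
    return beta[0] * t / (beta[1] + t)        # assumed Michaelis-Menten type curve

n, beta0 = 200, np.array([1.0, 0.5])
xi0 = rng.uniform(0.5, 2.0, n)
x = xi0 + 0.2 * rng.standard_normal(n)        # noisy design points
y = g(xi0, beta0) + 0.1 * rng.standard_normal(n)
w1 = np.full(n, 1.0 / n)

def naive_sum_of_squares(beta):
    # (5.24): the errors in the x_i are ignored
    return np.sum(w1 * (y - g(x, beta)) ** 2)

beta_naive = minimize(naive_sum_of_squares, x0=np.array([0.8, 0.4]), method='Nelder-Mead').x
print(beta_naive, "vs true", beta0)
```

Comparing beta_naive with the true value in such simulations gives a feeling for the effect of ignoring the errors in the variables for nonlinear g.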
60
CHAPTER 5. LEAST SQUARES ESTIMATOR
Note that if in the model, (3.6), (3.7) the replications at ξi are equally weighted w1ik =
1 w1i , for all k = 1, .., ri ri
then
q X ri X
Q (x, β) =
i=1 k=1
=
q X ri X
i=1 k=1
(5.26)
w1ik (yik − g (xi , β))2
w1ik (yik − y i )2 +
q X i=1
w1i (y i − g (xi , β))2
(5.27)
and the estimator defined in (5.16) taken at ξ = x and the naive least squares estimator of the averaged model β coincide: βb (x) = β.
5.4
(5.28)
Estimating functions and least squares
The connection between the theory of least squares and the approach based on the estimating functions defined in (4.1) and (4.2), is given by the normal equations of the projected sum of squares (5.13). Consider the vector of first derivatives of (5.15) Qβp (β) = (5.29) −2
Pn
i=1
g β ξbi (β) , β + g ξi ξbi (β) , β ξbiβ (β) w1i yi − g ξbi (β) , β P −2 ni=1 w2i xi − ξbi (β) ξbiβ (β) .
In the general case the unbiasedness condition,
Eβξ Qβp (β) = 0, for all β ∈ Θ, ξ ∈ F (n) ,
(5.30)
is hard to check with the exception of the linear case. For instance in the simple linear functional model g (ξi , β) = ξi β, with arbitrary error distribution we have analogously to the result (4.47) on the projected likelihood function that the least squares projection is explicitly given by
and that
ξbi (β) = Qp (β) =
n X
w2i xi + w1i yi β , w2i + w1i β 2
w2i w1i (yi − xi β)2 2 w + w β 1i i=1 2i
(5.31)
satisfies condition (5.30). Under the (5.20) we will show in Section 8.3 that the Qβp (β) tends to zero for n → ∞ only in the case of linear regression functions, or Qβp (β) tends to zero for 2 σmax → 0.
61
5.5. CONTRAST FUNCTIONS AND LEAST SQUARES
5.5
Contrast functions and least squares
Here we give the connection between the minimum contrast approach, introduced in Section 4.2 and least squares idea. Similarly to the discussion in Section 4.31 of the relation between maximum likelihood and minimum contrast approach we see here also two ways of applying the minimum contrast approach : first to use the projected sum of squares , (5.15), as a contrast function independent of the nuisance parameters Cen (β) = Qp (β) ; second to apply the minimum contrast idea on the composite parameter and to choose the sum of squares, (5.3) as the minimum contrast function Cen (θ) = Q (ξ, β). Consider the first case: Cen (β) = Qp (β) . The aim in models with nuisance parameters is to find a contrast function independent of the nuisance parameters and to apply the Lemma 4.2. The problem now is to derive the deterministic contrast function corresponding to Qp (β) and to check the separation condition. This way is more complicated than it seems on the first view, because the least squares projection ξbi (β) , defined in (5.14) depends on the data set and is only implicitly given. Therefore let us try to study the behavior of ξbi (β) for all β. ξbi (β) is a minimum contrast estimate with respect to Q (ξ, β) for arbitrary β ∈ Θ and we want to apply Lemma 4.1. Then the question arises, what is the corresponding contrast, (4.17) corresponding to Q (ξ, β)? Using the quadratic structure of Q (ξ, β) we obtain that under conditions on (n) F ,which imply for j = 1, 2 and for arbitrary continuous functions h (., .) sup sup
n X
β∈Θ ξ∈F (n) i=1
wji εji h (ξi , β) = oP (1) ,
(5.32)
that Q (ξ, β) =
n X i=1
w1i g
+
ξi0 , β 0
n X X
− g (ξi , β)
2
+
n X i=1
w2i ξi0 − ξi
2
(5.33)
2 wji σji + oPξ0 β0 (1) ,
j=1,2 i=1
with ξ 0 = (ξi0 )i=1,..,n . Hence for any given β ∈ Θ the corresponding contrast is Cn (ξ, β) =
n X i=1
w1i g
ξi0 , β 0
− g (ξi , β)
2
+
n X i=1
w2i ξi0 − ξi
2
.
(5.34)
In order to apply Lemma 4.1 we consider ξ(n) (β) ∈ arg min
n X
ξ∈(F (n) )c i=1
w1i g ξi0 , β 0 − g (ξi , β)
2
+
n X i=1
w2i ξi0 − ξi
2
. (5.35)
62
CHAPTER 5. LEAST SQUARES ESTIMATOR
In general we have Cn (ξ, β) − Cn (η, β) = Pn
2
i=1
w1i (g (ξi0 , β 0 ) − g (ξi , β)) − +
If the points ηi , ξi
Pn
i=1
2
Pn
i=1
w2i (ξi0 − ξi ) − ξi g (ξi , β)
!
w1i (g (ξi0 , β 0 ) − g (ηi , β))
Pn
2
(5.36)
2
i=1
w2i (ξi0 − ηi ) . ηi g (ηi , β)
and
!
lie on an ellipsoid around the center !
ξi0 g (ξi0 , β 0 )
,
then for different ξi and ηi one has 2
2
w1i (g (ξi0 , β 0 ) − g (ξi , β)) + w2i (ξi0 − ξi ) = 2 2 0 0 w1i (g (ξi , β ) − g (ξi (β) , β)) + w2i (ξi0 − ξi (β)) , the minimum point in (5.35) is not unique and the separation condition (4.27) of Lemma 4.1 is violated. This means we cannot obtain the consistency of ξb (β) for any given β by Lemma 4.1. Note, this problem occurs only in case of nonlinear regression functions. In the linear case we have (5.31) and this approach works. For illustration let us consider the case β = β0 . The situation is then much more easier, ξ(n) (β 0 ) = ξ 0 holds and Cn (ξ, β 0 ) − Cn (ξ 0 , β 0 ) = Pn
2
i=1
w1i (g (ξi0 , β 0 ) − g (ξi , β)) + ≥
Then Lemma 4.1 yields Pξ0 β 0
n X
w2i
i=1
with
Pn
i=1
w2i (ξi0 − ξi )
2
2
i=1
2 ξbi β 0 − ξi0 > ε
Pn
!
w2i (ξi0 − ξi ) .
≤ P max sup
j=1,2 ξ∈F (n)
h1 ξi , β 0 = g ξi0 , β 0 − g ξi , β 0 , and
n X i=1
wji εji hj ξi , β
0
>ǫ
!
h2 (ξi , β) = ξi0 − ξi .
Thus the consistency of ξb (β 0 ) follows from (5.32). Unfortunately, this case is the uninteresting one, because our final aim is to estimate the parameter of interest β 0.
63
5.5. CONTRAST FUNCTIONS AND LEAST SQUARES
At this point let us mention what happens on the other side for any given ξ with βb (ξ) defined in (5.16)? The contrast is the same as in (5.34) and β (ξ) ∈ arg minc β∈Θ
n X i=1
w1i g ξi0 , β 0 − g (ξi , β)
2
(5.37)
with β (ξ 0 ) = β 0 . We are in the situation of the inadequate nonlinear regression model, compare Zwanzig (1980), [182], and for ξ = ξ 0 we are in the usual nonlinear regression model. Hence let us come back to the starting point of this subsection and take this e time the Cn (θ) = Q (ξ, β) . We will show the consistency of second way. Set ξb = ξb βb and of βb = βb ξb simultaneously. This means we consider that both types of parameters are of the ”same interest” and disregard the difference in the meaning of the parameter β, ξ . Under (5.32) the contrast corresponding to Q (ξ, β) is also Cn (ξ, β) as given in (5.34). We have
ξ 0 , β 0 ∈ arg min min Cn (ξ, β) .
(5.38)
β∈Θ ξ∈F (n)
and Cn (ξ 0 , β 0 ) = 0. Then the separation condition (4.27) achieves the form
with the notation
2 2 Cn (ξ, β) ≥ ρ
β − β 0
+ ξ − ξ 0
w2
,
(5.39)
n 2 2 X ξ − ξ 0 = w2i ξi0 − ξi w2
i=1
introduced in (2.5). We write also
2 2 Cn (ξ, β) = G (ξ, β) − G ξ 0 , β 0 + ξ − ξ 0 . w1
w2
In order to show (5.39), we will follow the line of the technical report Zwanzig (1990), [184]. The following Lipschitz condition with respect to the nuisance parameters is used. L1 ∃n0 ∃L1 , L1 < ∞ ∀n ≥ n0 ∀β ∈ Θc ∀ξ, ξ ′ ∈ F (n)c ,
2
2
|G (ξ, β) − G (ξ ′ , β)|w1 ≤ L1 |ξ − ξ ′ |w2 .
Define
2
Ln (ξ, β) = G (ξ, β) − G ξ 0 , β
Lemma 5.1 Under L1,
1 ∃n0 ∃ τ ∈ 0, 2 such that
w1
2
+ G ξ 0 , β − G ξ 0 , β 0
w1
2
+ ξ − ξ 0 . w2 (5.41)
∀n ≥ n0 ∀ξ, ξ 0 ∈ F (n)c ∀β, β 0 ∈ Θc
2 2 G (ξ, β) − G ξ 0 , β 0 + ξ − ξ 0 ≥ τ Ln (ξ, β) . w1
2
(5.40)
w2
(5.42)
64
CHAPTER 5. LEAST SQUARES ESTIMATOR
Proof. Within this proof let us use the abbreviations G (ξ 0 , β 0 ) = G00 and G (ξ 0 , β) = G0 . In this notation (5.41) becomes
2
Ln (ξ, β) = G − G0
w1
2
2
+ G00 − G0
2
+ ξ − ξ 0
w1
By adding ±G0 in |G − G00 |w1 , we obtain
w2
.
(5.43)
Cn (ξ, β) = Ln (ξ, β) (1 − 2∆n (ξ, β)) with ∆n (ξ, β) =
(5.44)
(G0 − G00 , G0 − G)w1
(5.45)
|G − G0 |2w1 + |G00 − G0 |2w1 + |ξ − ξ 0 |2w2
for Ln (ξ, β) > 0 and ∆n (ξ, β) = 0 otherwise. It remains to show that there exists a constant τ > 0 such that 1 sup sup ∆n (ξ, β) ≤ − τ. c β∈Θc 2 ξ∈(F (n) )
(5.46)
2
( 1 −τ1 ) Let c = 2 L1 with L1 from (5.40) and τ1 such that 0 < τ1 < distinguish two cases: i) 2 2 ξ − ξ 0 ≤ c G0 − G00
2 2 ξ − ξ 0 > c G0 − G00 . w2
We will
(5.47)
w1
w2
and ii)
1 . 2
(5.48)
w1
Under i) we apply the Cauchy-Schwarz inequality and the assumption (5.40) 2
2
|G0 − G|w1 |G0 − G00 |w1 ∆n (ξ, β) ≤ L2n 2
2
2
4
L1 |ξ − ξ 0 |w2 |G0 − G00 |w1 L1 c |G0 − G00 |w1 ≤ ≤ , L2n L2n 2
with Ln = Ln (ξ, β) given in (5.41). Note |G0 − G00 |w1 ≤ Ln . We have
2
2 |G0 − G00 |w1 χc |G0 − G00 |w1 1 , ≤ L c ≤ ∆n (ξ, β)2 ≤ ≤ L c − τ 1 1 1 L2n Ln 2 4
2
and thus in case i) (5.46) follows. Consider case ii). Because of
2 0 ≤ G0 − G00 − G0 − G
w1
65
5.5. CONTRAST FUNCTIONS AND LEAST SQUARES
we have
2
= G0 − G00
w1
2
+ G0 − G
2 G0 − G00 , G0 − G
w1
w1
− 2 G0 − G00 , G0 − G
2
≤ G0 − G00
w1
Using this and (5.41) we obtain for ∆n = ∆n (ξ, β) 2
w1
2
+ G0 − G
w1
,
.
2
2
|ξ − ξ 0 |w2 |G0 − G00 |w1 + |G0 − G|w1 ≤1− . Ln Ln From the assumption (5.40) we get 2∆n ≤
2
2∆n ≤ 1 −
|ξ − ξ 0 |w2
|G0 − G00 |2w1 + (1 + L1 ) |ξ − ξ 0 |2w2
.
x is increasing in x. Using (5.48) we For positive a, the function f (x) = a+(1+L 1 )x c have ca ≤ x and f (x) ≥ 1+c+cL1 . We obtain c . 2∆n ≤ 1 − 1 + (1 + L1 ) c
For τ2 , 0 < τ2
ǫ ≤ Pβ 0 ,ξ0 w2
sup
(ξ,β)∈Ξc (ǫ) j=1,2 i=1
with Ξc (ǫ) = ((ξ, β) : Ln (ξ, β) > ǫ) and h1i (ξ, β) = 2
n X X
g (ξi0 , β 0 ) − g (ξi , β) , and Ln (ξ, β)
wji εji hji (ξ, β) > τ ,
h2i (ξ, β) =
ξi0 − ξi . Ln (ξ, β)
(5.49)
66
CHAPTER 5. LEAST SQUARES ESTIMATOR 2
2
Proof. Set Cen (θ) = Q (ξ, β) and Cn (ξ, β) = |G (ξ, β) − G (ξ 0 , β 0 )|w1 +|ξ − ξ 0 |w2 in Lemma 4.1. We have θ0 = θ(n) ∈ arg min Cn (ξ, β) and ∆Cn (ξ, β) = Cn (ξ, β) . Using the quadratic structure of the estimation criterion we have Q (ξ, β) = 2 2 G ξ 0 , β 0 − G (ξ, β) + |ε2 |w2 + 2 ε1 , G ξ 0 , β 0 − G (ξ, β) w1
2 + ξ 0 − ξ + |ε2 |2w2 + 2 ε2 , ξ 0 − ξ w2
Thus
∆Cen (ξ, β) = ∆Cn (ξ, β)+2
n X i=1
w2
w1
.
w1i ε1i g ξi0 , β 0 − g (ξi , β) +2
n X i=1
w2i ε2 ξi0 − ξi .
From Lemma 5.1 follows that the separation condition is fulfilled for
ρ d θ, θ0 2
= τ Ln (ξ, β) .
Since Ln (ξ, β) ≥ |ξ − ξ 0 |w2 , Lemma 4.1 implies
b 0 2 Pβ 0 ,ξ0 ξ − ξ > ǫ w2
b β > ǫ ≤ P 0 sup ≤ Pβ 0 ,ξ0 Ln ξ, θ
θ∈Ξc (ǫ)
e ∆Cn (ξ, β) − ∆Cn (ξ, β)
τ Ln (ξ, β)
≥ 1
with Ξc (ǫ) = ((ξ, β) : Ln (ξ, β) > ǫ) ⊇ F (n)c (ǫ) × Θc . Using (5.49) we obtain the statement. 2 If we want to estimate the parameter of interest we need the known contrast condition Con of the Jennrich type , compare Jennrich [91], with respect to the parameter of interest. Con ∃n0 ∀n ≥ n0 ∃an , 0 < an < ∞, ∀β, β ′ ∈ Θ 2 2 G ξ 0 , β − G ξ 0 , β ≥ an kβ − β ′ k . w1
√ Wu (1981), [180], showed that a contrast condition of type Con with an n → ∞ is necessary for the consistency of the least squares estimator in nonlinear regression. (Compare also the fundamental paper of Shepp (1965), [151].) We obtain that the contrast (5.34) satisfies the separation condition (4.27) of Lemma 4.1 for the distance measure τ Ln (ξ, β) and that 2
ρ (d (θ, θ′ )) ≥ an kβ − β ′ k . Stated precisely, we have:
Theorem 5.3 Under L1 and Con with a_n: ∃n_0 ∃τ ∈ (0, 1/2) ∀n ≥ n_0 ∀ξ⁰ ∈ F^(n) ∀β⁰ ∈ Θ such that

P_{β⁰,ξ⁰}(a_n ‖β̂ − β⁰‖² > ǫ) ≤ P_{β⁰,ξ⁰}( sup_{(ξ,β)∈Ξ^c(ǫ)} Σ_{j=1,2} Σ_{i=1}^n w_{ji} ε_{ji} h_{ji}(ξ, β) > τ )

with Ξ^c(ǫ) = {(ξ, β) : L_n(ξ, β) > ǫ} and h_{ji}(ξ, β) given in (5.49). 2

Proof. Instead of applying L_n(ξ, β) ≥ |ξ − ξ⁰|²_{w_2} as in the proof of Theorem 5.2, we use the contrast condition and have L_n(ξ, β) ≥ a_n ‖β − β⁰‖². Thus

P_{β⁰,ξ⁰}(a_n ‖β̂ − β⁰‖² > ǫ) ≤ P_{β⁰,ξ⁰}(L_n(ξ̂, β̂) > ǫ).
The remaining part of the proof is the same as that of Theorem 5.2. 2

For proving consistency of the l.s.e., it remains to show

P_{θ⁰}( max_{j=1,2} sup_{θ∈Ξ^c(ǫ)} Σ_{i=1}^n w_{ji} ε_{ji} h_{ji}(θ) > τ ) → 0   (5.50)

with Ξ^c(ǫ) = Ξ^c ∩ {(ξ, β) : L_n(ξ, β) > ǫ} and

h_{1i}(θ) = (g(ξ_i⁰, β⁰) − g(ξ_i, β)) / L_n(ξ, β),   h_{2i}(θ) = (ξ_i⁰ − ξ_i) / L_n(ξ, β).

This will be done in the following chapter by an auxiliary result stated independently of the nonlinear functional relation model.
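Before moving on, it may help to see the estimator whose consistency is studied here in computational form. The following sketch is only an illustration under assumptions not made in the text: a toy regression function g(ξ, β) = β_1 exp(β_2 ξ), equal weights, and a general-purpose numerical minimizer; it is not the numerical procedure of the thesis.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

def g(xi, beta):
    # toy regression function (an assumption for illustration only)
    return beta[0] * np.exp(beta[1] * xi)

# simulate the functional error-in-variables model (3.1), (3.2):
# y_i = g(xi_i, beta0) + eps_1i,  x_i = xi_i + eps_2i
n, beta0 = 50, np.array([1.0, 0.5])
xi0 = np.linspace(0.0, 2.0, n)
y = g(xi0, beta0) + 0.05 * rng.standard_normal(n)
x = xi0 + 0.05 * rng.standard_normal(n)
w1 = w2 = np.full(n, 1.0 / (2 * n))          # normalized weights, sum over all = 1

def Q(theta):
    # least squares criterion Q(xi, beta) = |y - G(xi, beta)|^2_{w1} + |x - xi|^2_{w2}
    beta, xi = theta[:2], theta[2:]
    return np.sum(w1 * (y - g(xi, beta)) ** 2) + np.sum(w2 * (x - xi) ** 2)

# joint minimization over the parameter of interest beta and the n nuisance parameters xi
theta_start = np.concatenate([[0.5, 0.1], x])   # naive start: xi replaced by the observed x
fit = minimize(Q, theta_start, method="L-BFGS-B")
beta_hat, xi_hat = fit.x[:2], fit.x[2:]
print("beta_hat =", beta_hat)
```

The point of the sketch is only to make concrete that β̂ and ξ̂ are obtained by one joint minimization over n + p parameters, which is what makes the consistency question nonstandard.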
Chapter 6

Auxiliary results

In nonlinear regression theory the quadratic structure of the least squares criterion leads to sums of independent, not necessarily identically distributed random vectors of the structure

S_N(β) = Σ_{i=1}^n ε_i a_i(β),   (6.1)

where the weights a_i(β) are nonrandom vectors depending continuously on β, each multiplied by a random scalar ε_i. The random scalars ε_i are mutually independent with expected value zero and positive variances σ_i². Then E S_N = 0. Further, the weights are such that Cov(S_N) = Σ_{i=1}^n σ_i² a_i(β) a_i(β)^T is positive definite. The asymptotic behavior of the least squares estimator depends mainly on the asymptotic properties of the sum (6.1). Ivanov and Zwanzig (1983), [88], derived a stochastic expansion of the least squares estimator, which is a polynomial in terms of the structure (6.1). Läuter (1990), [107], proved the strong consistency of the least squares estimator in the nonlinear regression model and showed as an auxiliary result the almost sure convergence of the sums

sup_{β∈Θ} Σ_{i=1}^n ε_i a_i(β),

with one-dimensional weights a_i(β). In the previous chapter we already showed that it suffices to derive (5.50) for the consistency of the least squares estimator. In contrast to nonlinear regression theory, in the error-in-variables models the regression parameter β ∈ Θ is the parameter of interest and the unknown design points appear additionally as nuisance parameters. This means we are also interested in sums of independent, not identically distributed random values, but these are additionally parameterized with an increasing number of parameters. The results given in this chapter are independent of the error-in-variables model (3.1), (3.2).
Let us introduce the special type of sum

S_N(ξ, β) = Σ_{i=1}^N w_i ε_i h_i(ξ, β) = (ε, H(ξ, β))_w,   (6.2)

where H(·,·) : X^(n) × A → R^N with X^(n) ⊆ F^(n) ⊆ R^n and A ⊆ Θ ⊆ R^p,

H(ξ, β) = (h_1(ξ, β), ..., h_N(ξ, β))^T,

and h_i(ξ, β) is nonrandom and continuous on X^(n) × A. The dimension n may depend on the sample size and increases with N. The dimension p is assumed to be fixed. The random variables ε_i are mutually independent with expected value zero and variances σ_i². The weights w_i are known, positive and normalized, that is

Σ_{i=1}^N w_i = 1.

Also

E S_N(ξ, β) = 0   (6.3)

and

Var(S_N(ξ, β)) = Σ_{i=1}^N w_i² σ_i² h_i²(ξ, β) ≤ m_N |H(ξ, β)|²_w   (6.4)

with

m_N = max_{i≤N} w_i σ_i².   (6.5)

We will be interested in the behavior of

sup_{ξ∈X^(n)} sup_{β∈A} S_N(ξ, β).   (6.6)
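The quantity (6.6) can be approximated numerically by maximizing S_N(ξ, β) over a grid; the sketch below does this for an assumed toy choice of h_i and a finite grid standing in for the suprema over X^(n) and A. None of the concrete choices (h_i, grids, error distribution) come from the text; the sketch only makes the object of study concrete.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200
w = np.full(N, 1.0 / N)                 # normalized weights, sum w_i = 1
sigma = 0.3 * np.ones(N)
eps = sigma * rng.standard_normal(N)    # independent, mean-zero errors

t = np.arange(1, N + 1) / N

def h(xi_shift, beta):
    # assumed toy choice h_i(xi, beta) = sin(beta * (t_i + xi_shift)); not from the text
    return np.sin(beta * (t + xi_shift))

def S_N(xi_shift, beta):
    return np.sum(w * eps * h(xi_shift, beta))

# crude grid approximation of sup_xi sup_beta S_N(xi, beta)
grid_xi = np.linspace(-1.0, 1.0, 50)
grid_beta = np.linspace(0.5, 5.0, 50)
sup_val = max(S_N(a, b) for a in grid_xi for b in grid_beta)
print("approximate sup S_N =", sup_val)
```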
Here we have similarities to the theory of empirical processes. There, a sample of i.i.d. random variables x_1, ..., x_N and a function f, member of some smooth functional class F, are given, and one is interested in the behavior of

sup_{f∈F} [ (1/N) Σ_{i=1}^N (f(x_i) − E f(x_1)) ],   (6.7)

compare for instance the recent overview on empirical processes and their applications, Gine (1996), [65]. Unfortunately, we cannot apply most of the results on empirical processes directly, because we are not in the i.i.d. case. Even if we assume the errors to be i.i.d. and the nuisance parameters to be generated by a smooth function f as for the Ylvisaker design points introduced in (3.21), ξ_i = f(i/n), we do not have the desired structure, because the transformation depends on i. Otherwise we could understand the terms as transformations of independent, not identically distributed random variables z_i, i = 1, ..., n,

ε_i h_i(f(i/n), β) = F_{fβ}(z_i),
and could apply results of Alexander (1984), [1]. But we will use the special structure of the terms in the sum, that is, the random variable ε_i times the weight w_i h_i(ξ, β). This gives the chance to formulate the assumptions separately on the distribution of the ε_i and on the functions h_i. Thus we do not face the same problem as in (6.7) but a strongly related one. We will apply some ideas of empirical process theory. In particular, we assume an entropy type condition on the set X^(n), to control the size of X^(n). This is the same approach as taken in empirical process theory with respect to the size of the functional space F.

Let us recall the notion of the entropy of a set X^(n) ⊆ (R^n)^c. We will use the entropy locally and with respect to the weighted distance defined in (2.5), namely

d²(ξ, ξ^r) = |ξ − ξ^r|²_w = Σ_{i=1}^N w_i (ξ_i − ξ_i^r)².   (6.8)

Consider the balls X(ξ^r, ǫ) in X^(n) ⊆ R^n with radius ǫ and center ξ^r = (ξ_1^r, ..., ξ_n^r)^T ∈ X^(n) with regard to the weighted distance defined in (2.5),

X(ξ^r, ǫ) = {ξ : |ξ − ξ^r|_w ≤ ǫ} ∩ X^(n).   (6.9)

Definition 6.1 {ξ¹, ..., ξ^m} is an ǫ-covering set of X(ξ⁰, D) iff

X(ξ⁰, D) ⊆ ∪_{r=1}^m X(ξ^r, ǫ).

N(ǫ, D) is the ǫ-covering number iff it is the smallest value of m, the cardinal number of an ǫ-covering set of X(ξ⁰, D). The ǫ-entropy H(ǫ, D) of X^(n) is defined by H(ǫ, D) = ln N(ǫ, D). 2

We can introduce the same concept for arbitrary subnorms d of the common parameter set Ξ. In that case we write H_d(ǫ, D) for the ǫ-entropy. We will use the index d only when we choose a distance other than the one in (6.8). Since we wish to obtain exponential bounds, we introduce a cumulant condition of Statulevicius type on the random terms and on the quadratic random terms.
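Before turning to these cumulant conditions, the covering construction of Definition 6.1 may be illustrated numerically. The sketch below is an illustration only: it works on a finite point cloud (a stand-in for X(ξ⁰, D)) and uses a greedy construction, which need not be minimal and therefore only yields an upper bound for N(ǫ, D).

```python
import numpy as np

def greedy_covering_number(points, w, eps):
    """Upper bound for the eps-covering number of a finite point set with respect to
    the weighted distance |xi - xi_r|_w = sqrt(sum_i w_i (xi_i - xi_r_i)^2), cf. (6.8)."""
    remaining = list(range(len(points)))
    centers = []
    while remaining:
        c = remaining[0]                      # pick an uncovered point as a new center
        centers.append(points[c])
        dist = np.sqrt(((points[remaining] - points[c]) ** 2) @ w)
        remaining = [r for r, d in zip(remaining, dist) if d > eps]
    return len(centers)

rng = np.random.default_rng(2)
n = 20                                         # dimension of the nuisance parameter
w = np.full(n, 1.0 / n)
cloud = rng.uniform(-1.0, 1.0, size=(500, n))  # finite stand-in for X(xi0, D)
for eps in (0.5, 0.25, 0.1):
    m = greedy_covering_number(cloud, w, eps)
    print(f"eps={eps}: covering number <= {m}, entropy <= {np.log(m):.2f}")
```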
S1 For the k'th cumulants χ_k of ε_i/σ_i:

∃γ ≥ 0 ∃C_S ∃H_S ∀i ∀k = 3, ...   |χ_k(ε_i/σ_i)| ≤ (k!)^{1+γ} H_S C_S^{k−2}.

S2 For the k'th cumulants χ_k of (ε_i/σ_i)²:

∃γ′ ≥ 0 ∃C_{2S} ∃H_{2S} ∀i ∀k = 3, ...   |χ_k((ε_i/σ_i)²)| ≤ (k!)^{1+γ′} H_{2S} C_{2S}^{k−2}.   (6.10)

These conditions are related to each other. From S1 with γ, H_S ≥ 1, C_S ≥ 1 follows S2 with γ′ = 2γ + 1 and some H_{2S}, C_{2S}; see Lemma 16.5 in the Appendix. For normally distributed errors, the Statulevicius condition S1 with γ = 0 is valid. The following lemma shows that under normally distributed errors the Statulevicius condition S2 with γ′ = 0 also holds.

Lemma 6.1 Under ε_i ∼ N(0, σ_i²) the Statulevicius condition S2 with γ′ = 0 holds. 2
Proof. Under ε_i ∼ N(0, σ_i²), (ε_i/σ_i)² is Chi squared distributed with one degree of freedom. The cumulants of the Chi squared distribution with F degrees of freedom are

χ_k = F 2^{k−1} (k − 1)!   (6.11)

and the Statulevicius condition S2 is fulfilled with γ′ = 0 and

H_{2S} = 4F and C_{2S} = 2.   (6.12)

2
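For completeness, the cumulant formula (6.11) follows from the cumulant generating function of the Chi squared distribution; this short derivation is standard and not part of the original argument. For X ∼ χ²_F one has

K(t) = ln E e^{tX} = −(F/2) ln(1 − 2t),   t < 1/2,

so that K^{(k)}(t) = F 2^{k−1} (k − 1)! (1 − 2t)^{−k} and hence

χ_k = K^{(k)}(0) = F 2^{k−1} (k − 1)!,

which is (6.11).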
Further we will assume that the error variances are bounded.

V For all N:   |σ|²_w ≤ var < ∞.

For the function H(ξ, β) we assume Lipschitz properties with respect to the parameter of interest and with respect to the nuisance parameters.

H1 ∃N_0 ∀N ≥ N_0 ∃C_{1L} = C_{1L}(N) ∀β ∈ A ∀ξ′, ξ ∈ X^(n):
|H(ξ, β) − H(ξ′, β)|²_w ≤ C_{1L} |ξ − ξ′|²_w.

H2 ∃N_0 ∀N ≥ N_0 ∃C_{2L} = C_{2L}(N) ∀β′, β ∈ A ∀ξ ∈ X^(n):
|H(ξ, β) − H(ξ, β′)|²_w ≤ C_{2L} ‖β − β′‖².

The next lemmata refer to the sum S_N(ξ, β) introduced in (6.2). Introduce also the sum S of independent random variables with expected value zero, which is independent of the parameter (ξ, β):

S = |ε|²_w − |σ|²_w = Σ_{i=1}^N w_i σ_i² ((ε_i/σ_i)² − 1).   (6.13)
Lemma 6.2 For arbitrary subsets Ξ ⊆ (F^(n))^c × Θ^c such that

sup_{(ξ,β)∈Ξ} |H(ξ, β)|²_w ≤ ∆   with   ∆^{−1} τ² − |σ|²_w ≥ d,   (6.14)

one has, for all N,

P( sup_{(ξ,β)∈Ξ} S_N(ξ, β) > τ ) ≤ P(S > d). 2

Proof. Applying the Cauchy-Schwarz inequality to S_N(ξ, β) = (ε, H(ξ, β))_w and (6.14), we have

sup_{(ξ,β)∈Ξ} (S_N(ξ, β))² ≤ |ε|²_w sup_{(ξ,β)∈Ξ} |H(ξ, β)|²_w ≤ ∆ |ε|²_w.

Because of (6.14), we get

P( sup_{(ξ,β)∈Ξ} S_N(ξ, β) > τ ) ≤ P(∆ |ε|²_w > τ²)   (6.15)
≤ P(|ε|²_w − |σ|²_w > ∆^{−1} τ² − |σ|²_w) ≤ P(S > d). 2

Let us introduce balls A(β^r, η) in A ⊆ Θ^c with center β^r and radius η:

A(β^r, η) = A ∩ {β : ‖β − β^r‖ ≤ η}.   (6.16)
Lemma 6.3 Suppose the function H(·,·) fulfills the Lipschitz condition H2 with the constant C_{2L} = C_{2L}(N). Then for all N and for all η such that

η^{−2} C_{2L}^{−1} τ² − |σ|²_w ≥ d,   (6.17)

P( sup_{ξ∈X^(n)} sup_{β∈A(β^r,η)} |S_N(ξ, β) − S_N(ξ, β^r)| > τ ) ≤ P(S > d).   (6.18)

2
Proof.
We have by the Cauchy Schwarz inequality
|SN (ξ, β) − SN (ξ, β r )|2 = (ε, H (ξ, β) − H (ξ, β r ))2w ≤ |ε|2w |H (ξ, β) − H (ξ, β r )|2w . (6.19) From the Lipschitz condition we obtain that for sufficiently large N there exists a constant C2L such that sup
|H (ξ, β) − H (ξ, β r )|2w ≤ C2L η 2 .
(6.20)
|SN (ξ, β) − SN (ξ, β r )|2 ≤ C2L |ε|2w η 2 .
(6.21)
sup
ξ∈X (n) β∈A(β r ,η)
Thus sup
sup
ξ∈X (n) β∈A(β r ,η)
We get P
sup
sup
ξ∈X (n) β∈A(β r ,η)
r
2
|SN (ξ, β) − SN (ξ, β )| > τ
2
!
−1 2 ≤ P S > η −2 C2L τ − |σ|2w .
(6.22)
Now under (6.17), the statement follows. 2

Analogously, we are interested in balls of the nuisance parameters given in (6.9).

Lemma 6.4 Suppose the function H(·,·) fulfills the Lipschitz condition H1 with the constant C_{1L} = C_{1L}(N). Then for all N and for all ǫ with

ǫ^{−2} C_{1L}^{−1} τ² − |σ|²_w ≥ d,   (6.23)

P( sup_{ξ∈X(ξ^r,ǫ)} sup_{β∈Θ} |S_N(ξ, β) − S_N(ξ^r, β)| > τ ) ≤ P(S > d).   (6.24)

2

Proof. This proof is the same as the proof of Lemma 6.3. Instead of (6.20) we use

sup_{ξ∈X(ξ^r,ǫ)} sup_{β∈Θ} |H(ξ, β) − H(ξ^r, β)|²_w ≤ C_{1L} ǫ².

2

The next lemma specifies where the Statulevicius condition S2 on the errors is used.

Lemma 6.5 Under the Statulevicius condition S2 with constants γ′, C_{2S}, H_{2S}, for S defined in (6.13) there exist constants C, C_1, C_2 such that for all N and all d > 0

P(S > d) ≤ exp(−d² C N^{−1} m_N^{−2})   for d ≤ C_1 N^{(1+γ′)/(1+2γ′)} m_N C_{2S}^{−1/(1+2γ′)},
P(S > d) ≤ exp(−(d C_2 / (C_{2S} m_N))^{1/(1+γ′)})   for d ≥ C_1 N^{(1+γ′)/(1+2γ′)} m_N C_{2S}^{−1/(1+2γ′)}.   (6.25)

Proof.
2 S = |ε|2w −|σ| random variables with expected of independent w is a sum
value zero. Note χk
εi σi
2
− 1 = χk
χk (S) ≤
N X i=1
εi σi
k wi σi2
2
for k ≥ 3 and
χk
εi σi
2
−1
!
Under the Statulevicius condition S2 with constants γ ′ , H2S , C2S in (6.10) the cumulants of S are estimated by k
χk (S) ≤ N (mN ) (k!) with
1+γ ′
k−2 H2S C2S
k! 2
≤
!1+γ ′
H Dk−2
(6.26)
′
H = 21+γ H2S N m2N and D = C2S mN . Using Corollary 2.1. of Bentkus and Rudskis (1980), [13], see also Lemma 16.8 in the Appendix, we obtain for
′
d ≤ H 1+γ D−1
1 1+2γ ′
= const N
1+γ ′ 1+2γ ′
d2 P (S > d) ≤ exp − 4H and otherwise
2
dC2 mN P (S > d) ≤ exp − C2S
−
!
!−
1
mN C2S1+2γ
′
(6.27) (6.28)
1 1+γ ′
.
Here, for the first time an explicit exponential rates of convergence of type
1
R1N = exp −const N − m−2 N or R2N
dC2 mN = exp − C2S
!−
1 1+γ ′
occur. In order to understand the meaning of these rates let us consider at this point some special cases for mN . We have R1N → 0 iff N m2N → 0. We know that in general mN = max wi σi2 ≤ |σ|2w , (6.29) i≤N
because of the normalization have
P
wi = 1. In the case of unweighted distances we wi =
1 N
and
2 σmax . i≤N N For asymptotically unweighted distances, such that for all N
mN = N −1 max σi2 =
max (wi ) ≤ const , min (wi ) one has
(6.30)
2 mN = O σmax N −1 .
(6.31)
Another interesting special case for the weights is σ −2 wi∗ = PN i −2 . i=1 σi
Then also
m N = PN
1
−2 i=1 σi
(6.32)
2 ≤ N −1 σmax .
(6.33)
If we consider the approach of vanishing variances in the sense that
2 σmax = max σi2 → 0, i=1...N
(6.34)
then we have an additional benefit in the speed of convergence in (6.25). Summarizing, we have: Remark 6.1 In the case of asymptotically equal weights wi ≍
1 N
(6.35)
and for S2 with γ ′ ≥ 0, there exists constants C, C1 , C2 ∀N ∀d > 0, such that
−4 exp (−d2 CN σmax ) d ≤ C1 N P (S > d) ≤ 1 −2 1+γ ′ exp −C2 (d N σmax ) d ≥ C1 N
γ′ 1+2γ ′ γ′ 1+2γ ′
2 σmax 2 σmax
.
(6.36)
We need one lemma more for fixed ξ ∈ X (ξ 0 , D) and β ∈ A (β 0 , R). The main reason for stating Lemma 6.6 in such a general form is that we want to include unbounded design regions. Set for ξ ∈ X (ξ 0 , D) and for β ∈ A (β 0 , R) the bounds H(N ) and hmax (N ) such that sup
|H (ξ, β)|2w ≤ H (N )
(6.37)
max (|hi (ξ, β)|) ≤ hmax (N ) .
(6.38)
sup
ξ∈X (ξ0 ,D) β∈A(β0 ,R)
sup
sup
ξ∈X (ξ0 ,D) β∈A(β0 ,R) i≤N
77 In general, we have
max wi |hi (ξ, β)|2 ≤ |H (ξ, β)|2w ≤ N max wi |hi (ξ, β)|2 . i≤N
i≤N
(6.39)
It is possible to obtain different rates for fixed τ, 0 < τ < 12 , depending on the behavior of the ratio hmax (N ) . H (N ) Lemma 6.6 Let ξ ∈ X (ξ 0 , D) and β ∈ A (β 0 , R) arbitrary but fixed. Then under the Statulevicius condition S1 with γ ≥ 0, ∃C, c ∃N0 ∀N > N0 P (SN (ξ, β) > τ )
−1 exp −Cτ 2 m−1 (N ) N H
exp −C max
≤ 2
τ i≤N (wi σi )hmax (N )
1 1+γ
for τ 1+2γ ≤ for τ 1+2γ ≥
c(mN H(N ))1+γ . maxi≤N (wi σi )hmax (N ) c(mN H(N ))1+γ . maxi≤N (wi σi )hmax (N )
.
(6.40)
Proof. We will apply the result of Bentkus and Rudskis (1980), [13], quoted as Lemma 16.8 in the Appendix, directly to SN (ξ, β) . The k’th cumulant of SN (ξ, β) is χk (SN (ξ, β)) . We have χk (SN (ξ, β)) =
N X
wik σik χk
i=1
and |χk (SN with
εi hk (ξ, β) . σi i
(6.41)
X εi N k k k (ξ, β))| ≤ max χk w σ h (ξ, β) i i i i≤N
σi
i=1
N N X X 2 2 2 k k k w σ h (ξ, β) (wi σi |hi (ξ, β)|)k−2 w σ h (ξ, β) ≤ i i i i i i i=1 i=1
≤ max wi σi2 |H (ξ, β)|2w max (wi σi |hi (ξ, β)|)k−2 i≤N
i≤N
≤ mN H (N ) max (wi σi ) hmax (N ) i≤N
k−2
.
From the Statulevicius condition S1, with constants γ, HS , CS , we obtain |χk (SN (ξ, β))| ≤ (k!)
1+γ
Hs Csk−2 mN
H (N ) max (wi σi ) hmax (N ) i≤N
k−2
.
(6.42)
Now set H = 21+γ Hs mN H (N ) and D = Cs max (wi σi ) hmax (N ) . i≤N
(6.43)
Then from Lemma 16.8, (6.40) follows. 2 Sometimes we will use a rough version of Lemma 6.6 for arbitrary small positive τ > 0, independent of N. Lemma 6.7 Let ξ ∈ X (ξ 0 , D) and β ∈ A (β 0 , R) arbitrary but fixed. Then under the Statulevicius condition S1 with γ ≥ 0, ∃C, c > 0 ∃N0 ∀c > τ > 0 ∀N > N0 and under c (mN H (N ))1+γ →0 maxi≤N (wi σi ) hmax (N )
(6.44)
mN H (N ) P (SN (ξ, β) > τ ) ≤ exp −C maxi≤N (wi σi )2 hmax (N )2
!
1 1+2γ
and under
∃N0 ∃c00 ∀N > N0
(6.45)
(mN H (N ))1+γ > c00 > 0 maxi≤N (wi σi ) hmax (N ) !
1 . P (SN (ξ, β) > τ ) ≤ exp −c mN H (N )
(6.46)
2 Proof.
From Lemma 6.6 follows (6.46) for τ < τ (N ) =
c (mN H (N ))1+γ maxi≤N (wi σi ) hmax (N )
1 ! 1+2γ
.
Under (6.44), ∃N0 ∀N > N0 : τ > τ (N )
!
τ (N )2 P (SN (ξ, β) > τ ) ≤ P (SN (ξ, β) > τ (N )) ≤ exp −C . mN H (N )
We have τ (N )2 = mN H (N )
mN H (N ) maxi≤N (wi σi )2 hmax (N )2
!
1 1+2γ
.
2

In the case of normally distributed errors we do not need this statement. There we obtain the following result directly.

Lemma 6.8 Let ξ ∈ X(ξ⁰, D) and β ∈ A(β⁰, R) be arbitrary but fixed and ε_i ∼ N(0, σ_i²). Then

P(S_N(ξ, β) > τ) ≤ exp(−τ² / (4 m_N |H(ξ, β)|²_w)).

2

Proof. Under normally distributed errors we have

S_N(ξ, β) ∼ N(0, Σ_{i=1}^N w_i² σ_i² h_i²(ξ, β))

and

Var(S_N(ξ, β)) ≤ max_{i≤N}(w_i σ_i²) |H(ξ, β)|²_w.

The statement follows from the exponential estimate of the tail probability for a normally distributed r.v. X ∼ N(0, Var(X)), namely

P(X ≥ τ) ≤ (1/√2) exp(−τ²/(4 Var(X))) ≤ exp(−τ²/(4 Var(X))).
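This Gaussian tail bound is easy to check by simulation; the following sketch is an illustration only, with an arbitrarily chosen variance and threshold, comparing the empirical tail probability with exp(−τ²/(4 Var X)).

```python
import numpy as np

rng = np.random.default_rng(3)
var_x, tau, n_sim = 0.5, 1.2, 10 ** 6
x = rng.normal(0.0, np.sqrt(var_x), size=n_sim)

empirical = np.mean(x >= tau)                 # Monte Carlo estimate of P(X >= tau)
bound = np.exp(-tau ** 2 / (4.0 * var_x))     # the bound used in the proof above
print(f"P(X >= tau) ~ {empirical:.4f} <= bound {bound:.4f}")
```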
2 Of course we can also argue, that for normal distributed errors the Statulevicius condition S1 with γ = 0 is valid. Now we combine the results of Lemma 6.2, Lemma 6.3, Lemma 6.4, Lemma 6.5 and Lemma 6.7. Set −1 2 r (N ) = m−1 , (N mN )−1 . (6.47) N min τ H (N ) Theorem 6.9 Assume the Statulevicius conditions S1 with γ and S2 with γ ′ and V and the Lipschitz conditions H1, H2 with constants C1L = C1L (N ) , C2L = C2L (N ) . Add to this the following assumptions:
c
1. Let X (n) ⊆ F (n) and A ⊆ Θ be any subsets for which there exist constants c1 , c2 such that for all ξ 0 ∈ F (n) and all β 0 ∈ Θ ∃R0 ∀R ≥ R0 sup
sup |H (ξ, β)|2w ≤ c1 R−2
(6.48)
ξ∈X (n) β∈A(R)
and
n o for A (R) = A ∩ β :
β − β 0
≥ R
∃D0 ∀D ≥ D0 sup sup |H (ξ, β)|2w ≤ c2 D−2
(6.49)
ξ∈X (D) β∈A
o n for X (D) = X ∩ ξ : ξ − ξ 0 ≥ D . w
2. For all ξ 0 ∈ F (n) , all β 0 ∈ Θ and the bounds H (N ) given in (6.37) and hmax (N ) given in (6.38) ∃c00 ∃N0 ∀N ≥ N0
0 < c00 ≤
(mN H (N ))1+γ . maxi≤N (wi σi ) hmax (N )
(6.50)
3.
∀c > 0 ∃N0 ∀N ≥ N0 ln C2L (N ) N
4. The set X (n) ⊆ F (n)
c
τ2 C1L (d+|σ|2w )
m2N
≤ c r (N ) .
(6.51)
has an ǫ−entropy H (ǫ, D), satisfying
∀c > 0 ∃N0 ∀N ≥ N0 where ǫ2 ≤
2+2γ ′ 1+2γ ′
and D2 ≥ c2
H (ǫ, D) ≤ c r (N ) ,
d+|σ|2w τ2
(6.52)
with some d = const N
1+γ ′ 1+2γ ′
mN .
Then there exists a positive constant c0 for all 0 < τ < c00 sup sup SN (ξ, β) ≥ τ
P
ξ∈X (n) β∈A
!
≤ exp (−c0 r (N )) .
(6.53)
2 Proof. Consider a covering system for X (n) × A with X (n) ⊆ c (Rn ) and A ⊆ Θ ⊂ Rp , from (6.48), (6.49). Choose η > 0, R > 0 as η2 =
τ2
2C2L d +
|σ|2w
R2 =
and
2c1 d + |σ|2w τ2
F (n)
c
⊆
,
such that (6.17) and, for ∆ = c1 R−2 , (6.14) holds with dN = C2 N
1+γ ′ 1+2γ ′
mN .
(6.54)
Under V q q R d + |σ|2w ≤ 2 c1 C2L ≤ const C2L (N )dN τ −2 . η τ2
We have Θ⊆
s n [
r=1
r 2
β : kβ − β k ≤ η
with s=
2R √ p η
such that A⊆ n
2
!p
2
o
∪
2 0 2 β : β − β ≥ R
p
s [
(6.55)
< const dpN C2L (N ) 2 τ −2p
(6.56)
A (β r , η) ∪ A (R)
(6.57)
r=1
o
with A (R) = β : kβ − β 0 k ≥ R2 ∩ A and A (β r , η) defined in (6.16).
81 Further take ǫ > 0 and D > 0 such that (6.23) and for ∆ = c2 D−2 (6.14) with d given in (6.54) hold. We have for X (D) =
2 0 2 ξ : ξ − ξ ≥ D ∩ X (n)
and X (ξ r , ǫ) given in (6.9) that
w
N (ǫ,D)
X (n) ⊆
[
r=1
X (ξ r , ǫ) ∪ X (D) ,
(6.58)
where N (ǫ, D) is the covering number of X (n) , defined in Definition 6.1. Summarizing we get X (n) × A ⊂
s [
N (ǫ,D)
[
r=1 d=1
X ξ d , ǫ × A (β r , η) ∪ (X (D) × A) ∪ X (n) × A (R) . (6.59)
Using the general relation P
sup f (x) ≥ τ
x∈A∪B
!
!
!
≤ P sup f (x) ≥ τ + P sup f (x) ≥ τ , x∈A
x∈B
we obtain the estimate P
sup sup SN (ξ, β) ≥ τ
ξ∈X (n) β∈A
≤ +P
sup
(ǫ,D) s NX X
r=1 d=1
sup SN (ξ, β) ≥ τ P sup r ξ∈X (ξ d ,ǫ) β∈A(β ,η) !
sup SN (ξ, β) ≥ τ + P
ξ∈X (n) β∈A(R)
Now we consider
!
(6.60) !
sup sup SN (ξ, β) ≥ τ .
ξ∈X (D) β∈A
sup SN (ξ, β) ≥ τ P sup r ξ∈X (ξ d ,ǫ) β∈A(β ,η)
(6.61)
sup f (x) ≤ sup |f (x) − f (a)| + f (a)
(6.62)
and apply the following, given here in a more general form, simple inequalities, x∈A
and
x∈A
τ τ P (x1 + x2 ≥ τ ) ≤ P x1 ≥ + P x2 ≥ . 2 2
(6.63)
Thus (6.61) is smaller than
τ sup |SN (ξ, β) − SN (ξ, β r )| ≥ ≤P sup r 2 ξ∈X (ξ d ,ǫ) β∈A(β ,η)
τ r +P sup |SN (ξ, β )| ≥ . 2 ξ∈X (ξ d ,ǫ)
Using (6.62) and (6.63) once more, we get that (6.61) is smaller than
τ sup |SN (ξ, β) − SN (ξ, β r )| ≥ P sup r 2 ≤ ξ∈X (ξ d ,ǫ) β∈A(β ,η) |
P sup + ξ∈X (ξ d ,ǫ) |
{z
Lemma 6.3
τ SN (ξ, β r ) − SN ξ d , β r ≥
4
{z
}
Lemma 6.4
(6.64)
}
τ d r + P SN ξ , β ≥ .
4
For the chosen η and ǫ, the assumptions (6.17) (6.23) are fulfilled and we apply Lemma 6.3 on the first term in (6.64); Lemma 6.4 to the second term. Hence (6.61) is smaller than ≤
2P |
|ε|2w
−
|σ|2w
{z
τ d r P SN ξ , β ≥ >d 4} . } + | {z
Lemma 6.5
(6.65)
Lemma 6.6
1+γ ′
Now we apply Lemma 6.5 on the first term for d = C1 N 1+2γ ′ mN . Under (6.50) we use the first case of Lemma 6.6, to estimate the last term in (6.65). Hence there exists a constant C such that
sup SN (ξ, β) ≥ τ P sup r ξ∈X (ξ d ,ǫ) β∈A(β ,η)
≤ 2 exp −C N
−1 m2N
(6.66)
+ exp −Cτ 2 (mN H (N ))−1
−1 2 ≤ 3 exp −C m−1 , (N mN )−1 N min τ H (N )
≤ 3 exp (−C r (N ))
with r (N ) given in (6.47). Come back to (6.60).
83 Under (6.48) and (6.49) for the chosen D and R we can estimate the last two terms in (6.60) by Lemma 6.2. We have P
sup
!
sup SN (ξ, β) ≥ τ + P
ξ∈X (n) β∈A(R)
sup sup SN (ξ, β) ≥ τ
ξ∈X (D) β∈A
≤ 2P (S > d) ≤ 2 exp −C N m2N
−1
.
!
(6.67)
The last inequality follows from Lemma 6.5 with d given in (6.54). Setting (6.66) and (6.67) in (6.60) we obtain P
sup sup SN (ξ, β) ≥ τ
ξ∈X (n) β∈A
!
≤ 4sN (ǫ, D) exp (−Cr (N ))
ln (4s) H (ǫ, D) ≤ exp −r (N ) C − − r (N ) r (N )
!!
.
(6.68)
The bound ln s with s given in (6.56) divided by the rate r (N ) is sufficiently small, because of (6.51). Under the entropy condition (6.52) the last term in (6.68) is sufficiently small also. Hence there exists a positive constant c0 such that ln (4s) H (ǫ, D) − ≥ c0 > 0, (6.69) C− r (N ) r (N ) and the statement of the theorem holds. 2 Let us state a special case of the Theorem 6.9 for asymptotically unweighted distances and bounded Lipschitz constants and bounded factors hi (ξ, β) . and under stronger Statulevicius conditions S1, S2. Corollary 6.10 Assume the Statulevicius conditions S1, S2 with γ = γ ′ = 0, V and the Lipschitz conditions H1, H2 with constants C1L (N ) ≤ const, C2L (N ) ≤ const and 1 2 wi ≍ , max σi2 = σmax ≤ const. i≤N N Add to this the following assumptions: 1. X (n) ⊆ {ξ : |ξ − ξ 0 |N ≤ D} and A ⊆ Θ, Θ are bounded. 2. sup sup max |hi (ξ, β)| ≤ const
ξ∈X (n) β∈A i≤N
(6.70)
3. The set X^(n) has an ǫ-entropy H(ǫ, D), which satisfies for ǫ² ≤
τ2
C1L d0 + |σ|2N
∀c ∃N0 ∀N ≥ N0 ,
with some d0 ≥ 0
2 σmax N −1 H (ǫ, D) ≤ c.
(6.71)
Then for arbitrary positive small constant τ there exists a positive constant c0 such that ! P
sup sup SN (ξ, β) ≥ τ
ξ∈X (n) β∈A
−2 ≤ exp −c0 N σmax .
(6.72)
2 Proof. For sets X (n) ⊆ {ξ : |ξ − ξ 0 |N ≤ D} and A ⊆ Θ, Θ bounded , the conditions (6.48) and (6.49) hold vacuously. The asymptotically equal weights imply −1 N 2 m−1 = max ≤ const 2 . w σ i i N i≤N σmax The d in (6.54) is chosen as fixed constant. Because of (6.39) we set under (6.70) H(N ) = N sup sup max(wi |hi (ξ, β)|2 ) ξ∈X (n) β∈A i≤N
= constN
1 sup sup max |hi (ξ, β)|2 = const. N ξ∈X (n) β∈A i≤N
Then the bound in (6.50) is independent from N. The rate in (6.47) is −2 r (N ) ≤ c4 N σmax .
Assumption (6.51) is fulfilled for constants C2L and the above rate. The entropy condition (6.52) becomes (6.71). 2 Note, the case of vanishing variances is included in Theorem 6.9. Now we assume (6.74), the opposite case to (6.50), and give the analog version of the Theorem 6.9, for uniformly increasing functions hi . Let
τ r0 (N ) = min maxi≤N (wi σi ) hmax (N )
!
1 1+γ
, N m2N
−1
(6.73)
Theorem 6.11 Assume the Statulevicius conditions S1 with γ and S2 with γ ′ , V and the Lipschitz conditions H1, H2 with constants C1L = C1L (N ) , C2L = C2L (N ) . Add to this the following assumptions:
c
1. X (n) ⊆ F (n) and A ⊆ Θ are subsets, such that there exist constants c1 , c2 such that for all ξ 0 ∈ F (n) and all β 0 ∈ Θ ∃R0 ∀R ≥ R0 sup
sup |H (ξ, β)|2w ≤ c1 R−2
ξ∈X (n) β∈A(R)
o n for A (R) = A ∩ β :
β − β 0
≥ R
and
∃D0 ∀D ≥ D0 sup sup |H (ξ, β)|2w ≤ c2 D−2 ξ∈X (D) β∈A
for X (D) = X ∩ {ξ : |ξ − ξ0 |w ≥ D} . 2. For all ξ 0 ∈ F (n) and all β 0 ∈ Θ for the bounds given in (6.37) and (6.38) it holds ∃c ∃N0 ∀N ≥ N0
τ 1+2γ ≥ c
(mN H (N ))1+γ . maxi≤N (wi σi ) hmax (N )
(6.74)
and ∀c > 0 ∃N0 ∀N ≥ N0
3. The set X (n) ⊆ F (n)
c
ln C2L (N ) N
where ǫ ≤
τ2 C1L (d+|σ|2w )
m2N
≤ c r0 (N ) .
(6.75)
has an ǫ−entropy H (ǫ, D), which fulfills
∀c ∃N0 ∀N ≥ N0 2
2+2γ ′ 1+2γ ′
2
and for D ≥
H (ǫ, D) ≤ c r0 (N ) . d+|σ|2 c2 τ 2 w
with some d = const N
1+γ ′ 1+2γ ′
mN .
Then the following assertion holds: There exists a positive constant c0 for all τ with (6.74) such that P
sup sup SN (ξ, β) ≥ τ
ξ∈X (n) β∈A
!
≤ exp (−c0 r0 (N )) .
2 Proof. The proof follows that of Theorem 6.9. The only difference is that under (6.74) the last term in (6.64) is estimated by the second case of Lemma 6.6. Instead of (6.66) therefore we obtain
sup SN (ξ, β) ≥ τ P sup r ξ∈X (ξ d ,ǫ) β∈A(β ,η)
≤ 2 exp −C N m2N
−1
+ exp −C
τ maxi≤N (wi σi ) hmax (N )
τ ≤ 3 exp −C min maxi≤N (wi σi ) hmax (N )
!
1 1+γ
, N m2N
!
−1
1 1+γ
≤ 3 exp (−C r0 (N )) with r0 (N ) from (6.73). 2 Consider now the asymptotically unweighted case with increasing functions hi under the stronger Statulevicius conditions S1, S2. Corollary 6.12 Assume the Statulevicius conditions S1, S2 with γ = γ ′ = 0, V and the Lipschitz conditions H1, H2 with constants C1L , C2L and wi ≍
1 2 , max σi2 = σmax ≤ const. N i≤N
Then for arbitrary positive bounded τ add to this the following assumptions: 1. X (n) ⊆ {ξ : |ξ − ξ0 |N ≤ D} and A ⊆ Θ, Θ bounded. 2. hmax (N ) →∞ σmax H (N )
(6.76)
The set X (n) has an ǫ−entropy H (ǫ, D), which fulfills σmax hmax (N ) N −1 H (ǫ, D) ≤ c,
∀c ∃N0 ∀N ≥ N0 , where ǫ2 ≤
τ2
C1L d0 + |σ|2N
with some d0 ≥ 0.
Then the following assertion holds: There exists a positive constant c0 , such that for any fixed τ, 0 < τ < 12 , ∃N0 ∀N ≥ N0 , P
sup sup SN (ξ, β) ≥ τ
ξ∈X (n) β∈A
2
!
!
N . ≤ exp −c0 σmax hmax (N )
(6.77)
87 Proof. Under (6.76) we apply Theorem 6.11. For sets X (n) ⊆ {ξ : |ξ − ξ0 |N ≤ D} and A ⊆ Θ, Θ bounded, the conditions (6.48) and (6.49) hold vacuously. Then under the special conditions here the rate is −1 2 r0 (N ) = max max (wi σi ) hmax (N ) , N mN
i≤N
≥ N max max (σi ) hmax (N ) , max 2
i≤N
i≤N
2 σi2
−1
≥ N max (σi ) hmax (N ) i≤N
−1
.
We see that an increasing bound hmax (N ) makes the rate slow. Let us close this chapter with a common specialization of Theorem 6.9 and Theorem 6.11 for normal distributed errors. In both cases the rate is the same as in Theorem 6.9:
−1 2 r (N ) = m−1 , (N mN )−1 . N min τ H (N )
Theorem 6.13 Assume
εi ∼ N 0, σi2
and V and the Lipschitz conditions H1, H2 with the constants C1L = C1L (N ) , C2L = C2L (N ) . Add to this the following assumptions:
c
1. Let X (n) ⊆ F (n) and A ⊆ Θ be any subsets for which there exist constants c1 , c2 such that for all ξ 0 ∈ F (n) and all β 0 ∈ Θ ∃R0 ∀R ≥ R0 sup
sup |H (ξ, β)|2w ≤ c1 R−2
ξ∈X (n) β∈A(R)
n
for A (R) = A ∩ β : β − β 0 ≥ R
and
o
∃D0 ∀D ≥ D0 sup sup |H (ξ, β)|2w ≤ c2 D−2 ξ∈X (D) β∈A
n o for X (D) = X ∩ ξ : ξ − ξ 0 ≥ D . w
2.
∀c > 0 ∃N0 ∀N ≥ N0 ln C2L (N ) N 2 m2N ≤ c r (N ) .
3. The set X (n) ⊆ F (n)
c
has an ǫ−entropy H (ǫ, D), which fulfills
∀c > 0 ∃N0 ∀N ≥ N0
where ǫ2 ≤
τ2
C1L (d+|σ|2w )
and D2 ≥ c
H (ǫ, D) ≤ c r (N ) ,
d+|σ|2w 2 τ2
with some d.
Then the assertion holds that there exists a positive constant c0 for all 0 < τ P
sup sup SN (ξ, β) ≥ τ
ξ∈X (n) β∈A
2
!
≤ exp (−c0 r (N )) .
Proof. Let us follow the line of the proof of Theorem 6.9. For normal distributed errors the Statulevicius conditions are fulfilled with γ = γ ′ = 0, compare Lemma 6.1. We can choose d = const in (6.54). For the estimation of
τ P SN ξ d , β r ≥ 4
in (6.65) we apply Lemma 6.8 instead of Lemma 6.6. Therefore we don’t need the assumption (6.50). The case of increasing function hi is included also. We obtain 1 τ d r P SN ξ , β ≥ ≤ exp − 3
4
τ2 4 mn |H (ξ d , β r )|2w
!
τ2 1 ≤ exp − 3 4 mn H (N )
!
with H(N ) introduced in (6.37). The remainder part of the proof does not change. 2
Chapter 7

Consistency of the l.s.e.

We come back to the error-in-variables model (3.1), (3.2). Theorem 5.2 and Theorem 5.3 give the link between the deviation of the least squares estimator from the true parameter and the uniform convergence of the sum u_n(ξ, β) of independent, not identically distributed random values:

u_n(ξ, β) = Σ_{j=1,2} Σ_{i=1}^n w_{ji} ε_{ji} h_{ji}(ξ, β),   (7.1)

with

h_{1i}(ξ, β) = (g(ξ_i⁰, β⁰) − g(ξ_i, β)) / L_n(ξ, β)   and   h_{2i}(ξ, β) = (ξ_i⁰ − ξ_i) / L_n(ξ, β),   (7.2)

defined for L_n(ξ, β) > 0, where

L_n(ξ, β) = |G(ξ, β) − G(ξ⁰, β)|²_{w_1} + |G(ξ⁰, β) − G(ξ⁰, β⁰)|²_{w_1} + |ξ − ξ⁰|²_{w_2}.   (7.3)
Lemma 7.1 For |H(ξ, β)|²_w = Σ_j Σ_i w_{ji} h_{ji}(ξ, β)², with h_{ji} defined in (7.1), (7.2), and L_n(ξ, β) given in (7.3), for all ǫ > 0 and for each set Ξ_a ⊆ Ξ^c(ǫ) = {(ξ, β) : L_n(ξ, β) > ǫ}

sup_{(ξ,β)∈Ξ_a} |H(ξ, β)|²_w ≤ 2 / inf_{(ξ,β)∈Ξ_a} L_n(ξ, β).   (7.4)

2

Proof. We have

|H(ξ, β)|²_w = (|G − G⁰⁰|²_{w_1} + |ξ − ξ⁰|²_{w_2}) / L_n².   (7.5)

Because of |G − G⁰⁰|²_{w_1} ≤ 2|G − G⁰|²_{w_1} + 2|G⁰ − G⁰⁰|²_{w_1} and (7.3),

|H(ξ, β)|²_w ≤ (2|G − G⁰|²_{w_1} + 2|G⁰ − G⁰⁰|²_{w_1} + |ξ − ξ⁰|²_{w_2}) / L_n² ≤ 2 / L_n,

and hence

sup_{(ξ,β)∈Ξ_a} |H(ξ, β)|²_w ≤ 2 / inf_{(ξ,β)∈Ξ_a} L_n(ξ, β).   (7.6)

2
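The normalization by L_n(ξ, β) and the bound (7.4) can be made concrete numerically. The sketch below evaluates L_n and |H|²_w for a toy regression function and random parameter values; the concrete g, weights and ranges are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
w1 = w2 = np.full(n, 1.0 / (2 * n))      # normalized weights

def g(xi, beta):
    # toy regression function, an assumption for illustration only
    return np.exp(beta * xi)

beta0, xi0 = 0.7, np.linspace(0.0, 1.0, n)

def L_n(xi, beta):
    # (7.3): |G(xi,b)-G(xi0,b)|^2_{w1} + |G(xi0,b)-G(xi0,b0)|^2_{w1} + |xi-xi0|^2_{w2}
    return (np.sum(w1 * (g(xi, beta) - g(xi0, beta)) ** 2)
            + np.sum(w1 * (g(xi0, beta) - g(xi0, beta0)) ** 2)
            + np.sum(w2 * (xi - xi0) ** 2))

def H_norm_sq(xi, beta):
    # (7.1), (7.2): |H|^2_w = sum_j sum_i w_ji h_ji^2 with the L_n normalization
    Ln = L_n(xi, beta)
    h1 = (g(xi0, beta0) - g(xi, beta)) / Ln
    h2 = (xi0 - xi) / Ln
    return np.sum(w1 * h1 ** 2) + np.sum(w2 * h2 ** 2)

eps = 0.05
for _ in range(5):
    xi = xi0 + rng.uniform(-0.5, 0.5, n)
    beta = beta0 + rng.uniform(-0.5, 0.5)
    if L_n(xi, beta) > eps:                                   # (xi, beta) in Xi^c(eps)
        print(H_norm_sq(xi, beta) <= 2.0 / L_n(xi, beta))     # the bound (7.4), always True
```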
The Statulevicius conditions S1, S2 on the k'th cumulants of the normalized errors take the forms:

S1' ∃γ > 0 ∃C_S ∃H_S ∀j = 1, 2 ∀i = 1, 2, ... ∀k = 3, ...:
|χ_k(ε_{ji}/σ_{ji})| ≤ (k!)^{1+γ} H_S C_S^{k−2}.   (7.7)

S2' ∃γ′ > 0 ∃C_{2S} ∃H_{2S} ∀j = 1, 2 ∀i = 1, 2, ... ∀k = 3, ...:
|χ_k((ε_{ji}/σ_{ji})²)| ≤ (k!)^{1+γ′} H_{2S} C_{2S}^{k−2}.   (7.8)

Sometimes we need only the following weaker moment condition M0:

M0   max_{i,j} E(ε_{ji}/σ_{ji})⁴ ≤ κ_0 < ∞.   (7.9)

Now we are in the position to show different consistency results with the help of the auxiliary results in the chapter above. First we consider the case of "vanishing variances".
7.1 Consistency of the l.s.e. under vanishing variances

The condition of "vanishing variances" means that the weighted average of the variances |σ_1|²_{w_1} + |σ_2|²_{w_2} becomes arbitrarily small. This can be required for n tending to infinity, as in the following condition:

Var   lim_{n→∞} (|σ_1|²_{w_1} + |σ_2|²_{w_2}) = 0.

For instance, in the average model (3.11), (3.12) with weights w_{jik} = 1/(2q r_i), w̄_{ji} = 1/(2q), and σ̄_{ji}² = Var(ε̄_{ji}) we have

Σ_{i=1}^q (1/(2q)) E(ε̄_{1i})² + Σ_{i=1}^q (1/(2q)) E(ε̄_{2i})² = Σ_{i=1}^q (σ_{1i}² + σ_{2i}²)/(2q r_i) ≤ σ_max² / r_min,   (7.10)

where max_{i,j} σ_{ji}² = σ_max² and min_i r_i = r_min. Then for uniformly increasing repetitions, r_min = r(n) → ∞, the assumption Var is fulfilled. Remember the overview in the introduction: the model with repeated observations was the first one in which consistent estimators in nonlinear error-in-variables models were found. Another approach is the asymptotics of small error variances. There the asymptotic inference is done not in the sense of an increasing number of observations but rather for vanishing variances σ_max² ≤ σ² → 0; see also the literature review in Section 1.3.5. Then the assumption of vanishing variances takes the following form:

Var'   ∀d > 0 ∃σ_0 ∀σ ≤ σ_0:   |σ_1|²_{w_1} + |σ_2|²_{w_2} ≤ d.   (7.11)
The main point is that under these assumptions of vanishing variances, Var and Var', we do not need any result on the sum u_n(ξ, β) and therefore we do not require any entropy condition. The key instrument is Lemma 6.2. We reduce the consistency of the l.s.e. to the convergence of the sum

S = |ε_1|²_{w_1} + |ε_2|²_{w_2} − (|σ_1|²_{w_1} + |σ_2|²_{w_2}),   (7.12)

which is independent of the parameters (ξ, β). Then the consistency results are based on Lemma 6.5. This connection to the sum given in (7.12) is made explicit in the following lemma. We state it under Var and under Var' simultaneously.

Lemma 7.2 In the model (3.1), (3.2), under the condition of vanishing variances Var (Var') and under the Lipschitz condition L1:

∃τ ∈ (0, 1/2) ∀ǫ > 0 ∃n_0 ∀n ≥ n_0 (∃σ_0 ∀σ ≤ σ_0) ∀β⁰ ∈ Θ ∀ξ⁰ ∈ F^(n)

1. For ξ̂, defined in (5.5),

P_{ξ⁰β⁰}(|ξ̂ − ξ⁰|²_{w_2} > ǫ) ≤ P(S ≥ ǫτ²/4).   (7.13)

2. If in addition the contrast condition Con with a_n is satisfied, then for β̂, defined in (5.4),

P_{ξ⁰β⁰}(a_n ‖β̂ − β⁰‖² > ǫ) ≤ P(S ≥ ǫτ²/4).   (7.14)

2
Proof. Take ǫ > 0 arbitrary but fixed. From Theorem 5.2 it follows that there exists a constant τ ∈ (0, 1/2) with

P_{ξ⁰β⁰}(|ξ̂ − ξ⁰|²_{w_2} > ǫ) ≤ P( sup_{(ξ,β)∈Ξ^c(ǫ)} u_n(ξ, β) > τ ).

Under Con we obtain from Theorem 5.3 for the same constant τ as above

P_{ξ⁰β⁰}(a_n ‖β̂ − β⁰‖² > ǫ) ≤ P( sup_{(ξ,β)∈Ξ^c(ǫ)} u_n(ξ, β) > τ ).

Now we will use Lemma 6.2. Let us check (6.14) with d = ǫτ²/4. We have

inf_{(ξ,β)∈Ξ^c(ǫ)} L_n(ξ, β) ≥ ǫ.

Then from the inequality (7.4) it follows that

sup_{(ξ,β)∈Ξ^c(ǫ)} |H(ξ, β)|²_w ≤ 2/ǫ.

Because of Var, we find an n_0 independent of ξ⁰ and β⁰ such that for all n ≥ n_0

ǫτ²/2 − (|σ_1|²_{w_1} + |σ_2|²_{w_2}) ≥ ǫτ²/4 > 0   (7.15)

and (6.14) is fulfilled. Analogous arguments hold under Var'. In this case there exists a σ_0 independent of ξ⁰ and β⁰ such that for all σ ≤ σ_0 (7.15) is valid. Then Lemma 6.2 yields that

P( sup_{(ξ,β)∈Ξ^c(ǫ)} u_n(ξ, β) > τ ) ≤ P(S ≥ ǫτ²/4),   (7.16)

with S defined in (7.12). Hence (7.13) and (7.14) are valid. 2

Now let us formulate our first result with respect to ξ̂, defined in (5.5), and β̂, defined in (5.4), under Var and under Var' simultaneously. In contrast to the result on the nuisance parameter estimator ξ̂, for β̂ we need the additional contrast condition Con on the regression function. Remember the notation m_n = max_{ji} w_{ji} σ_{ji}².

Theorem 7.3 In the model (3.1), (3.2) assume the condition of vanishing variances Var (Var') and M0 with κ_0 and the Lipschitz condition L1. Then:

∀ǫ > 0 ∃c_V > 0 ∃n_0 ∀n ≥ n_0 (∃σ_0 ∀σ ≤ σ_0) ∀β⁰ ∈ Θ ∀ξ⁰ ∈ F^(n)
1. For ξ̂, defined in (5.5),

P_{ξ⁰β⁰}(|ξ̂ − ξ⁰|²_{w_2} > ǫ) ≤ c_V κ_0 m_n (|σ_1|²_{w_1} + |σ_2|²_{w_2}) / ǫ².   (7.17)

2. If in addition the contrast condition Con with a_n is satisfied, then for β̂, defined in (5.4),

P_{ξ⁰β⁰}(a_n ‖β̂ − β⁰‖² > ǫ) ≤ c_V κ_0 m_n (|σ_1|²_{w_1} + |σ_2|²_{w_2}) / ǫ².   (7.18)

2

Proof. Because of Lemma 7.2 it remains to estimate P(S ≥ ǫτ²/4). Under M0 it follows from the Chebychev inequality and the Bonferroni inequality that

P(S ≥ ǫτ²/4) ≤ (4/(ǫτ²))² E( Σ_{j=1,2} (|ε_j|²_{w_j} − |σ_j|²_{w_j}) )²
≤ 2 (4/(ǫτ²))² Σ_{j=1,2} Σ_{i=1}^n w_{ji}² (E ε_{ji}⁴ − σ_{ji}⁴)
≤ 2 (4/(ǫτ²))² max_{i,j} E(ε_{ji}/σ_{ji})⁴ max_{ji}(w_{ji} σ_{ji}²) (|σ_1|²_{w_1} + |σ_2|²_{w_2})   (7.19)
≤ c_V κ_0 m_n (|σ_1|²_{w_1} + |σ_2|²_{w_2}) / ǫ²

with c_V = 2(4/τ²)², and we obtain the result. 2

The next theorem contains an exponential probability bound. We obtain this result under the Statulevicius condition S2' instead of the moment condition M0.

Theorem 7.4 In the model (3.1), (3.2) let the Statulevicius condition S2' with γ′ and C_{2S}, the Lipschitz condition L1 and the condition of vanishing variances Var (Var') be satisfied. Then:

∀ǫ > 0 ∃c_{e1}, c_{e2}, C_e > 0 ∃n_0 ∀n ≥ n_0 (∃σ_0 ∀σ ≤ σ_0) ∀β⁰ ∈ Θ ∀ξ⁰ ∈ F^(n):
1. For ξ̂, defined in (5.5),

P_{ξ⁰β⁰}(|ξ̂ − ξ⁰|²_{w_2} > ǫ) ≤ exp(−c_{e1} ǫ² n^{−1} m_n^{−2})   for ǫ ≤ C_e m_n n^{(1+γ′)/(1+2γ′)} C_{2S}^{−1/(1+2γ′)},
P_{ξ⁰β⁰}(|ξ̂ − ξ⁰|²_{w_2} > ǫ) ≤ exp(−c_{e2} (ǫ/(C_{2S} m_n))^{1/(1+γ′)})   for ǫ ≥ C_e m_n n^{(1+γ′)/(1+2γ′)} C_{2S}^{−1/(1+2γ′)}.   (7.20)

2. If in addition the contrast condition Con with a_n is satisfied, then for β̂, defined in (5.4), the same bounds hold with P_{ξ⁰β⁰}(a_n ‖β̂ − β⁰‖² > ǫ) in place of P_{ξ⁰β⁰}(|ξ̂ − ξ⁰|²_{w_2} > ǫ).   (7.21)

Proof. Because of Lemma 7.2 we have to estimate (7.13), (7.14). Instead of the Chebychev inequality as above we apply Lemma 6.5 with the constants C, C_1, C_2 given there. We set

C_e = 4C_1/τ²,   c_{e1} = C(τ²/4)²,   c_{e2} = C_2(τ²/4)^{1/(1+γ′)}.

2

Which type of rate is valid for an arbitrary fixed ǫ > 0 depends on the asymptotic behavior of m_n n^{(1+γ′)/(1+2γ′)} C_{2S}^{−1/(1+2γ′)}. Under the general assumptions taken here, both cases are possible. In the following we give the results of Theorem 7.4 and Theorem 7.3 explicitly for the main special cases of Var and Var' once more. First we consider the repeated observation model (3.6), (3.7), where Var is valid.
7.1.1 Consistency of the l.s.e. in the repeated observation model

Consider the model (3.6), (3.7):

y_{ik} = g(ξ_i, β) + ε_{1ik},   (7.22)
x_{ik} = ξ_i + ε_{2ik},   (7.23)

with i = 1, ..., q, k = 1, ..., r_i, E ε_{jik}² = σ_{ji}², j = 1, 2. The number of observations is n = Σ_{i=1}^q r_i, q is the number of unknown design points, and r_i replications are taken at each point ξ_i. The weights are w_{jik}, j = 1, 2, i = 1, ..., q, k = 1, ..., r_i, with Σ_{jik} w_{jik} = 1. Introduce

w̄_{ji} = Σ_{k=1}^{r_i} w_{jik}   and   w_max = max_{j=1,2} max_{i≤q} max_{h≤r_i} w_{jih}.

The main idea here is to show that the least squares estimators in the model (7.22), (7.23) are the same as in the average model (3.11), (3.12), and that in the average model the condition of vanishing variances Var is valid. Then we can apply Lemma 7.2 to the average model, and the problem of consistency becomes the problem of convergence of the sum of independent, not identically distributed random values, independent of (ξ, β):

S̄ = |ε̄_1|²_{w̄_1} + |ε̄_2|²_{w̄_2} − |σ̄_1|²_{w̄_1} − |σ̄_2|²_{w̄_2}

with

ε̄_{ji} = (1/w̄_{ji}) Σ_{k=1}^{r_i} w_{jik} ε_{jik}   and   σ̄_{ji}² = Var(ε̄_{ji}).
R (n) = max j=1,2
q X i=1
max wijk → 0 f orn → ∞ k≤ri
b defined in (5.5), Then for the least squares estimator ξ,
1 ∃τ ∈ (0, ) ∀ǫ > 0 ∃n0 ∀n ≥ n0 ∀β 0 ∈ Θ ∀ξ 0 ∈ F (n) 2 2 0 0 Pξ β ξb − ξ 0
w2
!
ǫτ 2 . >ǫ ≤P S≥ 4
b defined in and if in addition the contrast Con with an is satisfied, then for β, (5.4),
1 ∃τ ∈ (0, ) ∀ǫ > 0 ∃n0 ∀n ≥ n0 ∀β 0 ∈ Θ ∀ξ 0 ∈ F (n) 2 Pξ0 β 0 2
2
aq βb − β 0 > ǫ ≤ P
!
ǫτ 2 . S≥ 4
Proof. The least squares estimators ξ̂ and β̂, defined in (5.5) and (5.4), with weights w_{jik} in the replication model (3.6), (3.7) are the same as the least squares estimators ξ̂ and β̂ in the average model (3.11), (3.12) with weights w̄_{ji}, since the function Q(ξ, β), which is to be minimized with respect to ξ and β in (5.5) and (5.4), is

Q(ξ, β) = Σ_{i=1}^q Σ_{k=1}^{r_i} w_{1ik}(y_{ik} − g(ξ_i, β))² + Σ_{i=1}^q Σ_{k=1}^{r_i} w_{2ik}(x_{ik} − ξ_i)²
= Σ_{i=1}^q Σ_{k=1}^{r_i} w_{1ik}(y_{ik} − ȳ_i)² + Σ_{i=1}^q Σ_{k=1}^{r_i} w_{2ik}(x_{ik} − x̄_i)²   (7.24)
+ Σ_{i=1}^q w̄_{1i}(ȳ_i − g(ξ_i, β))² + Σ_{i=1}^q w̄_{2i}(x̄_i − ξ_i)²,

where

ȳ_i = (1/w̄_{1i}) Σ_{k=1}^{r_i} w_{1ik} y_{ik}   and   x̄_i = (1/w̄_{2i}) Σ_{k=1}^{r_i} w_{2ik} x_{ik}.

We will apply Lemma 7.2 to the average model (3.11), (3.12). It remains to show that Var is valid in (3.11), (3.12) under the above conditions. In the average model the variances are

σ̄_{1i}² = Var(ȳ_i) = (σ_{1i}²/w̄_{1i}²) Σ_{k=1}^{r_i} w_{1ik}²   and   σ̄_{2i}² = Var(x̄_i) = (σ_{2i}²/w̄_{2i}²) Σ_{k=1}^{r_i} w_{2ik}²,   (7.25)

compare also (3.13). We have

|σ̄_1|²_{w̄_1} + |σ̄_2|²_{w̄_2} = Σ_{i=1}^q w̄_{1i} σ̄_{1i}² + Σ_{i=1}^q w̄_{2i} σ̄_{2i}² ≤ 2κ max_{j} Σ_{i=1}^q (1/w̄_{ji}) Σ_{k=1}^{r_i} w_{jik}² ≤ 2κ R(n) → 0.   (7.26)

Thus the condition of vanishing variances Var is fulfilled and Lemma 7.2 gives the result. 2

We state the consistency results for the least squares estimators ξ̂ and β̂.

Corollary 7.6 Suppose the following conditions:
1. Assume the model (7.22), (7.23) with the Lipschitz condition L1. 2. R (n) = max j=1,2
q X i=1
max wijk → 0 for n → ∞
3. 2 max max σji j=1,2 i≤q
≤ κ < ∞ and
(7.27)
k≤ri
εjik max max max E j=1,2 i≤q k≤ri σji
!4
≤κ
(7.28)
Then for the least squares estimator ξ̂, defined in (5.5),
∀ǫ > 0 ∃cVr > 0 ∃n0 ∀n ≥ n0 ∀β 0 ∈ Θ ∀ξ 0 ∈ F (n)
2
Pξ0 β 0 ξb − ξ 0
w2
> ǫ ≤ c Vr
wmax R (n) . ǫ2
(7.29)
b defined in and if in addition the contrast Con with an is satisfied, then for β, (5.4)
∀ǫ > 0 ∃cVr > 0 ∃n0 ∀n ≥ n0 ∀β 0 ∈ Θ ∀ξ 0 ∈ F (n) Pξ0 β 0 2
wmax R (n)
b 0 2 aq β − β > ǫ ≤ c V r . 2
(7.30)
ǫ
For wjik =
1 2qri
the condition (7.27) is fulfilled if rmin → ∞. Furthermore we have wmax R (n) ≤
1 2 2rmin q
.
2
Proof. Because of Corollary 7.5 we have to estimate P S ≥ ǫτ4 . We use the line of the proof of Theorem 7.3. It remains to check the condition M0 in the average model (3.11), (3.12). Under (7.28) we have E (εji )4 =
X
k1 ,k2, k3, k4
=
X k
wjik wji
!4
wjik1 wjik2 wjik3 wjik4 E (εjik1 εjik2 εjik3 εjik4 ) (wji )4
X E (εji1 )4 + 3 k
wjik wji
Hence M0 holds with κ0 = κ + 3, since
4
!2 2
4 σji
≤
X k
wjik wji
!4
E (εji1 )4 + 3σ 4ji .
P
2 maxk wjik εji κ k (wjik )4 E q ≤ P + 3 ≤ κ P 2 2P 2 + 3 ≤ κ + 3. σ 2ji k (wjik ) k (wjik ) k (wjik )
The results (7.29), (7.30) follow now from the chain of inequalities (7.19) and
mn = max wji σ 2ji j,i
!
ri 2 X σ2i = max (w2ik )2 ≤ κ max max wjik ≤ κwmax . (7.31) j,i j,i k w2i k=1
We set cVr = 2
4
2 τ
κ2 (κ + 3) .
2 If we assume the Statulevicius condition, we get the following result. Corollary 7.7 Assume the replication model (7.22), (7.23) where the Lipschitz condition L1 and the Statulevicius condition S2’ for εijk with γ ′ ≥ 0 are satisfied with 2 max max σji ≤κ 0 ∃cS1 , cS2 , C0 > 0 ∃n0 ∀n ≥ n0 ∀β0 ∈ Θ ∀ξ0 ∈ F (n) b defined in (5.5), 1. For the least squares estimator ξ, 2 Pξ0 β 0 ξb − ξ 0
w2
≤
2
exp −cS1 q ǫm2
n
ǫ
1
exp −cS2 ( C2S mn ) 1+γ ′
>ǫ
1+γ ′
1
1+γ ′
1
ε ≤ C0 mn q 1+2γ ′ rmin 1+2γ ′ ε ≥ C0 mn q 1+2γ ′ rmin 1+2γ ′
.
(7.33)
b 2. If in addition the contrast condition Con with aq is satisfied, then for β, defined in (5.4)
Pξ0 β 0
2
2
exp −cS1 q ǫm2
b 0 2 aq β − β > ǫ
n 1 ǫ ′ 1+γ exp −cS2 ( C2S mn )
≤
1+γ ′
1
1+γ ′
1
ε ≤ C0 mn q 1+2γ ′ rmin 1+2γ ′ ε ≥ C0 mn q 1+2γ ′ rmin 1+2γ ′
.
(7.34)
Proof. The first arguments are the same as in the proof of Corollary 7.5. We will apply Theorem 7.4 to the average model. First we show that the Statulevicius condition S2' for ε_{jik} implies a Statulevicius condition S2' for the averaged errors ε̄_{ji}. We have
εji χk σ ji
≤
k k X wjik wjik 1 2
k1 =k2
(wji σ ji )2k
!2 k k X wjik wjik 1 2 = 2k χk (εjik1 εjik2 ) k1 ,k2
2k σji (k!)
1+γ ′
(7.35)
(wji σ ji )
k k X wjik wjik 1 2
k−2 H2S C2S +
k1 6=k2
(wji σ ji )2k
2k σji χk
εjik1 εjik2 2 σji
!
.
Using Lemma 16.6 and Lemma 16.7 of the Appendix we get constants C and H such that ! ′′ εjik1 εjik2 ≤ (k!)1+2γ HC k−2 χk σji σji with
!
γ′ − 1 γ′ γ = max ,0 ≤ . 2 2 ′′
We obtain
εji χk σ ji
!2
2 σji wjik1 wjik2
X
≤
k1 ,k2
(wji σ ji )
2
!k
′
k−2 (k!)1+γ Hmax Cmax
with Hmax = max (H2S , H) and Cmax = max (C2S , C) . Remember (7.25) we have 2 σji
(wji σ ji )
2
and under (7.32) X
k1 ,k2
2 σji wjik1 wjik2 (wji σ ji )2
!k
Summarizing we get
P
ri h=1
k wjih
h=1
2 wijh
= P ri
εji χk σ ji
!2
2
= P ri
k ≤
1
h=1
2 wijh
2 (wmax )
k−2
P
′
≤ (k!)1+γ Hmax
P
ri h=1
ri h=1
2 wijh
Cmax ri
2 wjih
k
2
≤
1
const. rik−2
k−2
and in the average model the Statulevicius condition S2’ with γ ′ and C 2S = Cmax ri is fulfilled. Now we apply Theorem 7.4 and obtain the result. 2 In the case of asymptotical balanced design, that is rmin ≥ crmax , and a stronger Statulevicius condition with γ ′ = 0 the result of Corollary 7.7 becomes a more convenient form. Let us give it separately.
Corollary 7.8 Assume the replication model (7.22), (7.23) where the Lipschitz condition L1 and the Statulevicius condition S2’ for εijk with γ ′ = 0 are satisfied and where 2 max max σji ≤κ 0 ∃c > 0 ∃n0 ∀n ≥ n0 ∀β0 ∈ Θ ∀ξ0 ∈ F (n) b defined in (5.5), 1. For the least squares estimator ξ,
Pξ0 β 0
2 b ξ − ξ 0
w2
!
ǫ2 . > ǫ ≤ exp −c 2 q wmax
(7.38)
b 2. If in addition the contrast condition Con with aq is satisfied, then for β, defined in (5.4),
Pξ0 β 0 2
b 0 2 aq β − β > ǫ ≤ exp −c
!
ǫ2 . 2 q wmax
(7.39)
Proof. The condition (7.37) and the balanced design condition yield the existence of a constant c0 , such that mn q
1+γ ′ 1+2γ ′
rmin
1 1+2γ ′
= mn qrmin !
ri σ2 X ≥ const max qrmin 2i (w2ik )2 i w2i k=1
!
2 2 σmax σ2i 2 wmax rmin ≥ const2 qrmin wmax ≥ c0 n. ≥ constqrmin max i w2i rmax wmax
Hence for any ǫ > 0 there exists an n0 , such that ǫ ≤ c0 n and the first case in the rate of Corollary 7.7 is valid. Furthermore ǫ2 exp −cS1 qm2n
!
ǫ2 ≤ exp −cS1 2 2 κ qwmax
!
!
ǫ2 ≤ exp −c 2 . qwmax
2 Note, that we do not require q → ∞. The only limit is taken with respect to the vanishing variance condition. Also we have no rates for ǫ here, the ǫ are
7.1. CONSISTENCY OF THE L.S.E. UNDER VANISHING VARIANCES 101 arbitrary small but independent on n. The value of n0 is chosen with respect to ǫ, such that (7.15) is valid. Otherwise we don’t require any assumptions on the growth of the number of repetitions rmin in relation to the growth of the number of design points. In the case of equiweighted least squares we get the exponential probability bound 2 exp −cqrmin . At first this bound seems to be much stronger than normally expected, because under the balanced design condition ( 7.36)
2 exp −cqrmin ≤ exp (−c1 nrmin ) ≤ exp (−cn) .
The reason lies in the special structure of S = |ε1 |2w1 + |ε2 |2w2 − |σ 1 |2w1 − |σ 2 |2w2 .
The mean value of S is zero, and the variance of V ar S tends also to zero. Let us discuss this for the case of
εjik ∼ N 0, σ 2
i.i.d.,
and balanced design and equal weights. Then S= with
σ2 εji ∼ N 0, r
The sum
q 1 1 X X ε2ji − σ 2 2q j=1,2 i=1 r
!
i.i.d. and V ar S =
1 4 σ . qr2
q r X X S = 2 ε2ji ∼ χ2q σ j=1,2 i=1 ∗
is Chi squared distributed with 2q degrees of freedom, with cumulants are given in (6.11). Thus the cumulants of S are derived by
χk S =
1 2 σ 2qr
!k
∗
χk (S ) =
where C= and
1 2 σ 2qr
!k
2q 2k−1 (k − 1)! ≤
1 2 1 σ and H = 2 2 σ 2 qr qr
ǫτ 2 ≤ HC −1 = 2r → ∞. 4
k! k−2 C H 2
Applying the first case of Lemma 16.8 of Bentkus, Rudskis (1980), [13], we obtain directly

P(S̄ ≥ ǫτ²/4) ≤ exp(−(ǫτ²/4)² / (4H)) = exp(−const q r²).
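The variance claim Var S̄ = σ⁴/(q r²) behind this strong bound is easy to confirm by simulation; the following sketch is an illustration only, with i.i.d. normal errors, balanced design and equal weights as above, and arbitrarily chosen values of σ, q and r.

```python
import numpy as np

rng = np.random.default_rng(6)
q, r, sigma, n_sim = 10, 4, 0.8, 20000

# S_bar = (1/(2q)) * sum_{j=1,2} sum_{i=1}^q (eps_bar_ji^2 - sigma^2 / r),
# with eps_bar_ji ~ N(0, sigma^2 / r) independent
eps_bar = rng.normal(0.0, sigma / np.sqrt(r), size=(n_sim, 2, q))
S_bar = (eps_bar ** 2 - sigma ** 2 / r).sum(axis=(1, 2)) / (2 * q)

print("simulated Var S_bar:", S_bar.var())
print("sigma^4 / (q r^2)  :", sigma ** 4 / (q * r ** 2))
```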
Corollary 7.9 Assume the same conditions as in Corollary 7.8 and that −2 q −1 wmax ≥ const n.
(7.40)
Then:
lim ξb − ξ 0 n→∞
and
lim n→∞ 2
√
w2
Pξ0 β 0 − a.s.
=0
aq
βb − β 0
= 0
(7.41)
Pξ0 β 0 − a.s..
(7.42)
Proof. Take ǫ > 0, arbitrary but fixed. Under (7.40), we obtain from Corollary 7.8 the rate ! ǫ2 exp −c 2 ≤ exp (−c1 n) q wmax in (7.38) and (7.39). Thus we have for n ≥ n0 : exp (−c1 n) ≤ a−n , a < 1 and ∞ X
n=1
Pξ0 β 0
2 b ξ − ξ 0
w2
>ǫ ≤
∞ X
n=1
exp (−c1 n) ≤ const
∞ X
n=1
a−n < ∞,
and the same holds for βb such that ∞ X
n=1
Pξ0 β 0
b 0 2 aq β − β > ǫ < ∞.
This entails the a.s. convergence of ξb − ξ 0 and of
βb − β 0
to zero by the w2 Lemma of Borel Cantelli, compare also Gnedenko (1991) p.186, [69]. 2
7.1.2 Consistency of the l.s.e. under σ² → 0

The next special cases concern the vanishing error approach. It is not required that the number of observations tends to infinity. The only limit is taken with respect to the variances. Thus a fixed number of nuisance parameters is allowed. Of course the more interesting case is that of an increasing number of nuisance parameters, but nevertheless it is reasonable to include this case as well. We assume the variant Var' of the vanishing variance condition.
Corollary 7.10 Assume the model (3.1), (3.2) with the Lipschitz condition L1 and

max_{j=1,2} max_{i≤n} σ_{ij}² = σ² → 0   (7.43)
and vanishing fourth moments max max E (εji )4 ≤ σ 4 κ0 .
(7.44)
j=1,2 i≤q
Then: 1. For the least squares estimator ξb for the nuisance parameters ξ, defined in (5.5), ∀ǫ > 0 ∃cV > 0 ∃σ0 ∀σ ≤ σ0 ∀β 0 ∈ Θ ∀ξ 0 ∈ F (n) ∀n
2 σ 4 wmax . Pξ0 β 0 ξb − ξ 0 > ǫ ≤ cV κ0 w2 ǫ2
(7.45)
b 2. If in addition the contrast condition Con with an is satisfied, then for β, defined in (5.4)
∀ǫ > 0 ∃cV > 0 ∃σ0 ∀σ ≤ σ0 ∀β 0 ∈ Θ ∀ξ 0 ∈ F (n) ∀n Pξ0 β 0 2
σ 4 wmax
b 0 2 an β − β > ǫ ≤ c V κ 0 . 2
ǫ
(7.46)
Proof. In order to use Theorem 7.3 we have to check the conditions Var’ and M0. Under (7.43) we have |σ1 |2w1 + |σ2 |2w2 ≤ 2 max max σij2 → 0. j=1,2 i≤n
(7.47)
and Var’ holds. Furthermore under (7.44) M0 is valid. 2 The last special case concerns the exponential probability bound. Corollary 7.11 Assume the model (3.1), (3.2) with the Lipschitz condition L1 and Statulevicius condition S2’ with γ ′ ≥ 0 and max max σij2 = σ 2 → 0 j=1,2 i≤n
(7.48)
Then: ∀ǫ > 0 ∃c > 0 ∀n ∃σ0 ∀σ ≤ σ0 ∀β 0 ∈ Θ ∀ξ 0 ∈ F (n) 1. For the least squares estimator ξb for the nuisance parameters ξ, defined in (5.5), 1 2 − 1+γ ′ − 1+γ b 0 2 ′ wmax Pξ0 β 0 ξ − ξ > ǫ ≤ exp −cσ . (7.49) w2
b 2. If in addition the contrast condition Con with an is satisfied, then for β, defined in (5.4),
Pξ0 β 0 2
1
2 − 1+γ ′ − 1+γ
b 0 2 ′ an β − β > ǫ ≤ exp −cσ . wmax
(7.50)
Proof. Because of (7.47) under (7.48) Var’ is satisfied. Take ǫ > 0, n > 0 arbitrary fixed. In Theorem 7.4 we apply the second case, because ∃σ0 ∀σ ≤ σ0 such that 1+γ ′
−
1
1+γ ′
′
mn n 1+2γ ′ C2S1+2γ ≤ σ 2 wmax n 1+2γ ′ const ≤ ǫ holds. Then the results are consequences of Theorem 7.4 with
exp −ce2
ǫ
2 C2S maxj=1,2 maxi≤n wji σji
1 1+γ ′
1 2 − 1+γ ′ − 1+γ ′ . wmax ≤ exp −cσ
2
7.2 Consistency of the l.s.e. under an entropy condition

In this section we will give our main consistency result for the l.s.e. under an entropy condition on the nuisance parameter space. The key tools are the auxiliary results stated in Theorem 6.9 and the identification result in Lemma 5.1. The condition of vanishing variances is not used, but we will need that the variances are bounded.

V' ∃ var < ∞ ∀n:   |σ_1|²_{w_1} + |σ_2|²_{w_2} ≤ var < ∞.

In the following we will require an entropy condition on the compactified nuisance parameter set (F^(n))^c ⊆ R^n. Let us introduce the entropy concept once more for the special case here. Consider the balls F(ξ^r, ǫ) in (F^(n))^c ⊆ R^n with radius ǫ and center ξ^r = (ξ_1^r, ..., ξ_n^r)^T ∈ F^(n) with regard to the weighted distance defined in (2.5),

F(ξ^r, ǫ) = {ξ : |ξ − ξ^r|_{w_2} ≤ ǫ} ∩ (F^(n))^c.   (7.51)

Definition 7.1 We define N(ǫ, D), the ǫ-covering number, as the smallest value of m for which there exist ξ¹, ..., ξ^m such that

F(ξ⁰, D) ⊆ ∪_{r=1}^m F(ξ^r, ǫ),   (7.52)

that is, {ξ¹, ..., ξ^m} forms a minimal covering set of F(ξ⁰, D). The ǫ-entropy H(ǫ, D) of (F^(n))^c is given by

H(ǫ, D) = ln N(ǫ, D).   (7.53)

2

Note that according to this definition the entropy H depends on the dimension n and on the chosen center ξ⁰ ∈ F^(n). In Chapter 6 we already introduced the Lipschitz condition L1 with respect to ξ. Further we will also need a Lipschitz condition with respect to β:

L2 ∃L_2 < ∞ ∃n_0 ∀n > n_0 ∀ǫ > 0 ∀β, β′ ∈ Θ^c ∀ξ ∈ (F^(n))^c:
|G(ξ, β) − G(ξ, β′)|²_{w_1} ≤ L_2 ‖β − β′‖².
(7.54)
the condition L2 with the constant L2 entails the Lipschitz condition H2 on X (n) × A ⊆ Ξc (ǫ) with the constant C2L = 66 L2 2
1 . ǫ2
Proof.
Because of (7.1) and (7.2), we have G (ξ, β) − G (ξ 0 , β 0 ) ξ − ξ0
1 H (ξ, β) = Ln (ξ, β) Adding 1 ± Ln (ξ, β)
G (ξ, β ′ ) − G (ξ 0 , β 0 ) ξ − ξ0
!
!
.
(7.55)
,
in |H (ξ, β) − H (ξ, β ′ )|2w and we get that
2
|H (ξ, β) − H (ξ, β ′ )|w ≤
(7.56)
2 ′ 2 2 |G (ξ, β) − G (ξ, β )|w1 + 2A Ln (ξ, β)
(7.57)
with 2 1 1 ′ 00 2 0 2 − A = G (ξ, β ) − G + ξ − ξ . w1 w2 Ln (ξ, β) Ln (ξ, β ′ )
(7.58)
Applying L2 and (7.54) in (7.57) we obtain 2
|H (ξ, β) − H (ξ, β ′ )|w ≤
2 L2 kβ − β ′ k + 2A. ǫ2
It remains to estimate A given in (7.58). Because of (7.3) and 2 2 2 G (ξ, β ′ ) − G00 ≤ 2 G ξ 0 , β ′ − G00 + G (ξ, β ′ ) − G ξ 0 , β ′ w1
we have
w1
2 2 G (ξ, β ′ ) − G00 + ξ − ξ 0 ≤ 2Ln (ξ, β ′ ) . w1
Hence
Furthermore
and
w1
w2
2 1 1 Ln (ξ, β ′ ) . − A ≤ 2 ′ Ln (ξ, β) Ln (ξ, β )
1 1 |Ln (ξ, β) − Ln (ξ, β ′ )| − , ≤ Ln (ξ, β) Ln (ξ, β ′ ) Ln (ξ, β) Ln (ξ, β ′ )
A≤2 From (7.3) we get
|Ln (ξ, β) − Ln (ξ, β ′ )|2 . Ln (ξ, β)2 Ln (ξ, β ′ ) 2
|Ln (ξ, β) − Ln (ξ, β ′ )| ≤
(7.59)
(7.60)
(7.61)
We apply
2 2 ′ 0 0 ′ 2 2 G (ξ, β) − G ξ , β − G ξ, β − G ξ , β w1 w1 2 2 2 +2 G ξ 0 , β − G ξ 0 , β 0 − G ξ 0 , β ′ − G ξ 0 , β 0 . w1 w1 2 2 2 2 2 2 kak − kbk ≤ ka − bk 2 kak + kbk
and
(7.62)
2
|Ln (ξ, β) − Ln (ξ, β ′ )| ≤
2 ′ 0 0 ′ 2 ≤ 4M1 G ξ , β − G ξ , β + G ξ, β − G (ξ, β) w1
(7.63)
w1
2 +4M2 G ξ 0 , β − G ξ 0 , β ′
w1 ′ 2
with
≤ 4 (2M1 + M2 ) L2 kβ − β k
2 2 ′ M1 = G (ξ, β) − G ξ 0 , β + G ξ, β − G ξ 0 , β ′ . w1
and
w1
2 2 M2 = G ξ 0 , β − G ξ 0 , β 0 + G ξ 0 , β ′ − G ξ 0 , β 0 . w1
w1
Remember that for all β
2
|G (ξ, β) − G (ξ 0 , β)|w1 ≤ 1, Ln (ξ, β)
2
|G (ξ 0 , β) − G (ξ 0 , β 0 )|w1 ≤ 1, Ln (ξ, β)
thus for i = 1, 2 Mi ≤ Ln (ξ, β ′ ) Ln (ξ, β) Hence 1 A≤ Ln (ξ, β)
(7.64)
!
1 1 + . Ln (ξ, β) Ln (ξ, β ′ ) !
1 1 1 2 ′ 2 16L kβ − β k ≤ + 32L2 kβ − β ′ k . 2 ′ 2 Ln (ξ, β) Ln (ξ, β ) ǫ (7.65) Thus from (7.57) follows 1 2 2 sup |H (ξ, β) − H (ξ, β ′ )|w ≤ 66L2 2 kβ − β ′ k . ǫ ξ∈X (n) 2 Lemma 7.13 For (ξ, β) ∈ Ξc (ǫ) = {(ξ, β) : Ln (ξ, β) > ǫ}
(7.66)
the condition L1 with the constant L1 entails the Lipschitz condition H1 on X (n) × A ⊆ Ξc (ǫ) with the constant C1L = ǫ−2 18 (L1 + 1) .
2
Proof.
Remember (7.55) and add 1 ± Ln (ξ, β)
G (ξ ′ , β) − G (ξ 0 , β 0 ) ξ′ − ξ0
!
,
in |H (ξ ′ , β) − H (ξ, β)|2w . Then we estimate 2
|H (ξ ′ , β) − H (ξ, β)|w
≤
(7.67)
2 2 ′ 2 ′ + |ξ − ξ | |G (ξ, β) − G (ξ , β)| w2 + 2B w1 Ln (ξ, β)2 1 2 ≤ 2 2 (L1 + 1) |ξ − ξ ′ |w2 + 2B ǫ
(7.68)
with 2 1 1 ′ 0 2 ′ 0 0 2 B = − G (ξ , β) − G ξ , β + ξ − ξ . w2 w1 Ln (ξ, β) Ln (ξ ′ , β)
(7.69)
Because of (7.3) we have
2 2 G (ξ ′ , β) − G ξ 0 , β 0 + ξ ′ − ξ 0 ≤ 2Ln (ξ ′ , β) . w1
Hence
w2
|Ln (ξ, β) − Ln (ξ ′ , β)|2 . B≤2 Ln (ξ, β)2 Ln (ξ ′ , β)
From (7.3) we get
(7.70)
2
|Ln (ξ, β) − Ln (ξ ′ , β)| ≤
(7.71)
2 2 2 0 ′ 0 2 G (ξ, β) − G ξ , β − G (ξ , β) − G ξ , β w1 w1 2 2 2 +2 ξ − ξ 0 − ξ ′ − ξ 0 w2
(7.72) (7.73)
w2
Applying (7.62) to (7.72) and (7.73), we get that (7.71) is smaller than 2
2
2
≤ 2M3 |G (ξ ′ , β) − G (ξ, β)|w1 +2M4 |ξ − ξ ′ |w2 ≤ 2 (M3 L1 + M4 ) |ξ − ξ ′ |w2 (7.74) with
2 2 M3 = G (ξ, β) − G ξ 0 , β + G (ξ ′ , β) − G ξ 0 , β . w1
and
2
M4 = ξ − ξ 0
From (7.64) follows for i = 3, 4
Mi2 ≤ Ln (ξ ′ , β) Ln (ξ, β)
w2
w1
2
+ ξ ′ − ξ 0
w2
.
!
1 1 + . Ln (ξ, β) Ln (ξ ′ , β)
Hence
1 1 2 + (L1 + 1) |ξ − ξ ′ |w2 . ′ Ln (ξ, β) Ln (ξ , β)
1 B≤4 Ln (ξ, β)
(7.75)
Thus from (7.68) follows 2
2
sup |H (ξ ′ , β) − H (ξ, β)|w ≤ ǫ−2 18 (L1 + 1) |ξ − ξ ′ |w2 . β∈A
2 In order to check the behavior of hmax (N ) and H (N ) in Theorem 6.9 and Theorem 6.11 we need one small result more. Define a common bound for the regression function and the design points by B (n) := max (1, B (n)∗ ) ,
(7.76)
where B (n)∗ :=
sup
sup
max g (ξi , β) − g ξi0 , β 0 + ξi − ξi0 ,
i≤n 0 ξ∈{ξ:|ξ−ξ 0 |w 0. ǫ (n)1+2γ
(7.82) c
3. The entropy H (., .) of the set of nuisance parameters F (n) satisfies for some positive sufficiently small constants c and all sufficiently large con1+γ ′
stants C and Dn2 = C mn n 1+2γ ′ lim n→∞
H (cǫ (n) , Dn ) = 0. r (n)
(7.83)
Then the following two assertions hold: 1. If either the parameter set of the regression parameter Θ is bounded or the contrast condition Con with a_n ≥ n^{−1} const is valid, then there exists a constant c_0 > 0 such that ∀β_0 ∈ Θ ∀ξ_0 ∈ F^(n)
2 Pξ0 β 0 ξb − ξ 0 > ǫ (n) ≤ exp (−c0 r (n) ) . w2
(7.84)
2. Under the contrast condition Con with an ≥ n−1 const, there exists a constant c0 > 0 such that ∀β0 ∈ Θ ∀ξ0 ∈ F (n)
2
Pξ0 β 0 an
βb − β 0
> ǫ (n) ≤ exp (−c0 r (n) ) .
2
(7.85)
Proof. Consider first the assertion (7.84). We obtain from Theorem 5.2, that for some τ, 0 < τ < 12 , and ǫ = ǫ (n) Pξ0 β 0 with
b 0 2 ξ − ξ > ǫ ≤ P
sup
w2
(ξ,β)∈Ξc (ǫ)
!
un (ξ, β) > τ ,
Ξc (ǫ) = {(ξ, β) : Ln (ξ, β) > ǫ}
(7.86)
and un (ξ, β) given in ( 7.1) and (7.2). For the second assertion (7.85) we need the contrast condition Con with an . Then Theorem 5.3 yields Pξ0 β 0
2
an
βb − β 0
> ǫ ≤ P
sup (ξ,β)∈Ξc (ǫ)
!
un (ξ, β) > τ .
Remember the definition of Ln (ξ, β) in (7.3), we have under L1 (ξ, β) ∈ Ξc (ǫ) that 2
ǫ < (L1 + 1) ξ − ξ 0
and Ξc (ǫ) ⊆
with A1 =
(n) X1
× A1 ∪
w2
(n) X2
2
+ G0 − G00
× A2 ⊆ Ξc
1 0 00 2 β ∈ Θ : G − G > ǫ , c
w1
for all
2
w1
ǫ 2 (L1 + 1) A2 = Θc
!
(7.87)
and (n) X1
= (F
(n) c
),
(n) X2
= ξ ∈ (F
−1 ǫ 0 2 . ) : ξ − ξ ≥ (L1 + 1)
(n) c
We obtain P
sup (ξ,β)∈Ξc (ǫ)
un (ξ, β) > τ
!
≤
X
l=1,2
2
w2
P sup sup un (ξ, β) > τ . (n)
ξ∈Xl
β∈Al
(7.88)
The result follows from Theorem 6.9. It remains to show the conditions of Theorem 6.9. S1’, S2’, V’ are the corresponding variants of S1, S2, V. Under c (ξ, β) ∈ Ξ 2(L1ǫ+1) from Lemma 7.12 and Lemma 7.13 follows, that the Lipschitz conditions L1, L2 with constants L1 and L2 entail the conditions H1 and H2 with constants C1L = ε−2 9(L1 + 1)3 and C2L = ǫ−2 33L2 (L1 + 1)2 . Let us check (6.48). We have for l = 1, 2
n o Al (R) = Al ∩ β :
β − β 0
> R .
If the parameter set Θ is bounded, then we can find a constant R0 for all β 0 ∈ Θ and
n o Θc ⊂ β :
β − β 0
≤ R0 .
In that case Al (R) for R > R0 is empty and (6.48) holds vacuously. If the parameter set Θ is unbounded, then we assume the contrast condition Con. Then
2
2 Ln (ξ, β) ≥ G ξ 0 , β − G ξ 0 , β 0 ≥ an
β − β 0
, w1
and because of Lemma 7.1 we get sup (n)
ξ∈Xl
sup |H (ξ, β)|2w ≤
β∈Al (R)
2 . an R 2
Hence the condition (6.48) holds with c1 = 2a−1 n . With out loss of generality we choose c1 ≥ 1. Now consider (6.49). We define the sets for l = 1, 2 Xl (D) =
(n) Xl
∩
From Lemma 7.1 follows for l = 1, 2
ξ : ξ − ξ 0
w2
>D .
sup sup |H (ξ, β)|2w ≤ 2D−2 .
ξ∈Xl (D) β∈Al
This means that (6.49) holds with c2 = 2.
(7.89)
Under the condition (7.82) we are in the situation of Theorem 6.9. Indeed, let us check (6.50) and recall the definition in (7.76). Because of (7.87) we have sup |H(ξ, β)|²_w
sup (n)
ξ∈Xl
≤
h
2
2
supξ∈X (n) (Dn ) supβ∈Al (R) |G (ξ, β) − G (ξ 0 , β 0 )|w1 + |ξ − ξ 0 |w2 l
inf ξ∈X (n) (Dn ) inf β∈Al (R) Ln (ξ, β)2 l
≤
(Dn ) β∈Al (R)
h
2
2
supξ∈X (n) (Dn ) supβ∈Al (R) |G (ξ, β) − G (ξ 0 , β 0 )|w1 + |ξ − ξ 0 |w2 l
inf
(ξ,β)∈Ξc
ǫ 2(L1 +1)
L (ξ, β)2 n
i i
n 4 (L1 + 1)2 X X 0 2 0 0 2 ξ − ξ + g (ξ , β) − g ξ , β ≤ w sup sup max i i ji i i i≤n ǫ2 (n) j=1,2 i=1 ξ∈X (Dn ) β∈Al (R) l
≤ We set
n 4 (L1 + 1)2 X X 4 (L1 + 1)2 2 w B (n) ≤ B (n)2 ji 2 ǫ2 ǫ j=1,2 i=1
4 (L1 + 1)2 B (n)2 H (N ) := , ǫ2
and obtain sup (n) ξ∈Xl (Dn )
sup |H (ξ, β)|2w ≤ H (N ) .
(7.90) (7.91)
β∈Al (R)
Furthermore we introduce hmax (N ) :=
2 (L1 + 1) B (n) , ǫ
(7.92)
since sup (n)
ξ∈Xl
≤
sup max (|hi (ξ, β)|)
(Dn ) β∈Al (R)
i≤n
(7.93)
supξ∈X (n) (Dn ) supβ∈Al (R) maxi≤n (|g (ξi , β) − g (ξi0 , β 0 )| + |ξi − ξi0 |) l
inf
(ξ,β)∈Ξc
ǫ 2(L1 +1)
L (ξ, β) n
2 (L1 + 1) B (n) = hmax (N ) . ǫ Hence under the assumption (7.82) we have that (6.50) holds for arbitrary small τ, because ≤
2+2γ (2 (L1 + 1))2+2γ m1+γ (mN H (N ))1+γ n ǫ B (n) = max (wji σji ) hmax (N ) 2 (L1 + 1) max (wji σji ) B (n) ǫ2+2γ
114
CHAPTER 7. CONSISTENCY OF THE L.S.E.
max (wji σji )γ B (n)1+2γ ≥ cτ const. (7.94) ǫ1+2γ Furthermore we obtain our rate in (7.80) as the special case of the rate (6.47) in Theorem 6.9 since under (7.90) 1+γ ≥ const σmin
min τ 2 H (N )
−1
, N −1 m−1 N
!
τ 2 4 (L1 + 1)2 2 . = min ǫ (n) , n−1 m−1 n B (n)2
Now consider (6.51); that is, we have to show that
$$ \frac{ \ln\left( c_1\, C_{2L}(N)\, N^{\frac{2+2\gamma'}{1+2\gamma'}}\, m_N^2 \right) }{ r(N) } $$
becomes arbitrarily small. In the case when the contrast condition Con is needed we have $c_1 = 2 a_n^{-1} \geq 1$, otherwise $c_1 = 1$. Thus under Con
$$ C_n := \frac{ \ln\left( a_n^{-1}\, \epsilon(n)^{-2}\, n^{\frac{2+2\gamma'}{1+2\gamma'}}\, m_n^2 \right) }{ r(n) } $$
has to become arbitrarily small for all $n \geq n_0$. If $r(n) = B(n)^{-2}\, \epsilon(n)^2\, m_n^{-1}$, then for $B(n) \geq 1$, $m_n \leq 1$ and $a_n^{-1} \leq \mathrm{const}\, n$,
$$ C_n = \frac{ \ln\left( a_n^{-1}\, \epsilon(n)^{-1}\, m_n\, B(n)^{-2}\, n^{\frac{2+2\gamma'}{1+2\gamma'}}\, r(n)^{-1} \right) }{ r(n) } \leq \mathrm{const}\, \frac{ \ln\left( n^{\frac{2+2\gamma'}{1+2\gamma'}+1}\, r^{-1}(n) \right) }{ r(n) } \leq \mathrm{const}\left( \frac{\ln\left( r^{-1}(n) \right)}{r(n)} + \frac{\left( \frac{2+2\gamma'}{1+2\gamma'} + 1 \right) \ln(n)}{r(n)} \right) \to 0 $$
because of (7.81). Otherwise, for $r(n) = n^{-1} m_n^{-2} \leq B(n)^{-2}\, \epsilon(n)^2\, m_n^{-1}$, we have
$$ \epsilon^2(n) \geq B(n)^2\, n^{-1}\, m_n^{-1} $$
and obtain from (7.81)
$$ C_n = \frac{ \ln\left( a_n^{-1}\, \epsilon(n)^{-2}\, n^{\frac{2+2\gamma'}{1+2\gamma'}}\, m_n^2 \right) }{ n^{-1} m_n^{-2} } \leq \frac{ \ln\left( a_n^{-1}\, B(n)^{-2}\, n^{2+\frac{2+2\gamma'}{1+2\gamma'}}\, m_n^4 \right) }{ n^{-1} m_n^{-2} } \leq \mathrm{const}\left( \frac{\ln\left( r^{-2}(n) \right)}{r(n)} + \frac{\left( \frac{2+2\gamma'}{1+2\gamma'} + 1 \right) \ln(n)}{r(n)} \right) \to 0 $$
for $n \to \infty$. The last condition of Theorem 6.9 that we have to show is the entropy condition (6.52). We have V', $C_{2L} = 33\,\epsilon^{-2} L_2$, $c_2 = 2$ and $d_n = \mathrm{const}\, m_n\, n^{\frac{1+\gamma'}{1+2\gamma'}}$. Hence there exist constants $c$ and $C$ such that
$$ \frac{\tau^2}{C_{2L}} \geq c\, \epsilon(n)^2 \qquad (7.95) $$
and, with $D_n^2 \geq 2\left( |\sigma_1|^2_{w_1} + |\sigma_2|^2_{w_2} \right)$,
$$ D_n^2 = C\, m_n\, n^{\frac{1+\gamma'}{1+2\gamma'}} \geq \frac{ d_n + |\sigma_1|^2_{w_1} + |\sigma_2|^2_{w_2} }{ \tau^2 }. $$
This means the entropy condition (7.83) implies the entropy condition (6.52) of Theorem 6.9. $\Box$

The condition (7.82) is satisfied, for instance, under $\gamma = 0$, $\sigma^2_{\min} > 0$ and $\epsilon(n) \leq B(n)$. The next theorem concerns the case when (7.82) is violated.

Theorem 7.16 Suppose the model (3.1), (3.2), where the regression functions are Lipschitz continuous in the sense of L1 and L2 with constants $L_1$, $L_2$. The error distributions satisfy the Statulevicius conditions S1' with $\gamma$ and S2' with $\gamma'$, and the boundedness condition V'. Add the following assumptions:

1. Set the rate
$$ r_0(n) = \min\left( \left( \frac{\epsilon(n)}{B(n)\, \max_{ji}(w_{ji}\sigma_{ji})} \right)^{\frac{1}{1+\gamma}},\; n^{-1} m_n^{-2} \right), \qquad (7.96) $$
and with $\epsilon(n) \geq n^{-1}$ assume
$$ \lim_{n\to\infty} \frac{\ln n}{r_0(n)} = 0. \qquad (7.97) $$

2. For some sufficiently large constants $C$ and $R$ and $D_n^2 = C\, m_n\, n^{\frac{1+\gamma'}{1+2\gamma'}}$,
$$ \lim_{n\to\infty} \frac{ m_n^{1+\gamma}\, B(n)^{1+2\gamma} }{ \max(w_{ji}\sigma_{ji})\, \epsilon^{1+2\gamma} } = 0. \qquad (7.98) $$

3. For the entropy $H(\cdot,\cdot)$ of the set of nuisance parameters $(\mathcal{F}^{(n)})^c$, for some sufficiently small positive constant $c$ and all sufficiently large constants $C$ and $D_n^2 = C\, m_n\, n^{\frac{1+\gamma'}{1+2\gamma'}}$,
$$ \lim_{n\to\infty} \frac{ H(c\,\epsilon(n), D_n) }{ r_0(n) } = 0. \qquad (7.99) $$

Then the following assertions hold:

1. If either the parameter set $\Theta$ of the regression parameter is bounded or the contrast condition Con with $a_n \geq n^{-1}\,\mathrm{const}$ is valid, then there exists a constant $c_0 > 0$ such that for all sufficiently large $n$, $n > n_0$, and all $\beta^0 \in \Theta$, $\xi^0 \in \mathcal{F}^{(n)}$,
$$ P_{\xi^0\beta^0}\left( \bigl|\hat\xi - \xi^0\bigr|^2_{w_2} > \epsilon(n) \right) \leq \exp\left(-c_0\, r_0(n)\right). \qquad (7.100) $$

2. Under the contrast condition Con with $a_n \geq n^{-1}\,\mathrm{const}$ there exists a constant $c_0 > 0$ such that for all sufficiently large $n$, $n > n_0$, and all $\beta^0 \in \Theta$, $\xi^0 \in \mathcal{F}^{(n)}$,
$$ P_{\xi^0\beta^0}\left( a_n \bigl\|\hat\beta - \beta^0\bigr\|^2 > \epsilon(n) \right) \leq \exp\left(-c_0\, r_0(n)\right). \qquad (7.101) $$
Proof. The proof goes along the lines of the proof of Theorem 7.15 above. The difference is that instead of Theorem 6.9 we apply Theorem 6.11, whose assumptions differ mainly in (6.74). Because of (7.98) we have in (6.74)
$$ \frac{ (H(N) m_N)^{1+\gamma} }{ \max(w_i\sigma_i)\, h_{\max}(N) } = \mathrm{const}\, \frac{ m_n^{1+\gamma}\, B(n)^{2+2\gamma}\, \epsilon }{ \max(w_{ji}\sigma_{ji})\, B(n)\, \epsilon^{2+2\gamma} } = \mathrm{const}\, \frac{ m_n^{1+\gamma}\, B(n)^{1+2\gamma} }{ \max(w_{ji}\sigma_{ji})\, \epsilon^{1+2\gamma} } \to 0. $$
Then there exists an $n_0$ such that the constant $\tau$ from Lemma 5.1 fulfills the assumption (6.74) of Theorem 6.11. Further, under (7.92) it holds for all $n \geq n_0$ that
$$ \min\left( \left( \frac{\tau}{\max(w_i\sigma_i)\, h_{\max}(N)} \right)^{\frac{1}{1+\gamma}},\; N^{-1} m_N^{-2} \right) = \min\left( \left( \frac{\tau\,\epsilon(n)}{2(L_1+1)\, \max(w_{ji}\sigma_{ji})\, B(n)} \right)^{\frac{1}{1+\gamma}},\; n^{-1} m_n^{-2} \right). $$
Then the corresponding rate of Theorem 6.11 is given by (7.96). The entropy condition required here respects the condition of Theorem 6.11. It remains to check condition (6.75). For $r_0(n) = n^{-1} m_n^{-2}$ the arguments are the same as in the proof of Theorem 7.15. Consider the case $r_0(n) = \left( \epsilon(n)\left( \max(w_{ji}\sigma_{ji}) B(n) \right)^{-1} \right)^{\frac{1}{1+\gamma}}$ with (7.97); then, because of $\epsilon(n) \geq n^{-1}$, $m_n \leq \mathrm{const}$, $a_n \geq n^{-1}\mathrm{const}$,
$$ \frac{ \ln\left( c_1\, C_{2L}(N)\, N^{\frac{2+2\gamma'}{1+2\gamma'}}\, m_N^2 \right) }{ r_0(n) } \leq \mathrm{const}\, \frac{ \ln\left( a_n^{-1}\, \epsilon(n)^{-2}\, n^{\frac{2+2\gamma'}{1+2\gamma'}}\, m_n^2 \right) }{ r_0(n) } \leq \mathrm{const}\, \frac{ \left( \frac{2+2\gamma'}{1+2\gamma'} + 3 \right) \ln n }{ r_0(n) } \to 0. $$
$\Box$
Both theorems contain a consistency result with some rate in the case of arbitrarily small $\epsilon = \epsilon(n)$, and also a result for $\epsilon(n) \to \infty$. The rates are mainly determined by the corresponding entropy conditions (7.83) and (7.99). Let us once more summarize the results for the case
$$ \exists\, \mathrm{const}\ \forall n:\quad \frac{\sigma_{\max}}{\sigma_{\min}} \leq \mathrm{const}, \qquad w_{\max} \asymp n^{-a}, \quad \frac{1}{2} < a \leq 1. \qquad (7.102) $$
This means all variances are asymptotically of the same order $\sigma^2$, such that
$$ \sigma^2_{ji} \asymp \sigma^2. \qquad (7.103) $$
The normalization $\sum_{j}\sum_{i=1}^n w_{ji} = 1$ implies
$$ w_{\min} \leq \frac{1}{n} \leq w_{\max}. \qquad (7.104) $$
The assumption (7.102) allows an asymptotically different order for the weights. Then
$$ m_n = \max_{ji}\left( w_{ji}\sigma^2_{ji} \right) \asymp \sigma^2\, n^{-a} \qquad (7.105) $$
and
$$ \max_{ji}\left( w_{ji}\sigma_{ji} \right) \asymp \sigma\, n^{-a}. $$
The condition $\frac{1}{2} < a \leq 1$ in (7.102) arises from (7.104) and from the requirement $r(n) \leq n^{-1} m_n^{-2} = \sigma^{-4}\, n^{2a-1} \to \infty$.

Then the assumptions taken in (7.82) and (7.98), respectively, with respect to the growth of $B(n)$ read
$$ \lim_{n\to\infty} \sigma^{1+2\gamma}\, \frac{B(n)^{1+2\gamma}}{n^{a\gamma}\, \epsilon(n)^{1+2\gamma}} > 0 \qquad\text{and}\qquad \lim_{n\to\infty} \sigma^{1+2\gamma}\, \frac{B(n)^{1+2\gamma}}{n^{a\gamma}\, \epsilon(n)^{1+2\gamma}} = 0. $$
Under (7.102) we now give a unifying formulation for both cases.
Corollary 7.17 Suppose the model (3.1), (3.2) and (7.102). The regression functions are Lipschitz continuous in the sense of L1 and L2 with constants $L_1$, $L_2$. The error distributions satisfy $\sigma^2_{ji} = \sigma^2 < \infty$ and the Statulevicius conditions S1' with $\gamma$ and S2' with $\gamma'$. Set the rate
$$ r_0(n) = \min\left( \left( \frac{\epsilon(n)\, n^a}{B(n)\, \sigma} \right)^{\frac{1}{1+\gamma}},\; n^{2a-1}\sigma^{-4} \right). $$
For the entropy $H(\cdot,\cdot)$ of the set of nuisance parameters $(\mathcal{F}^{(n)})^c$, for some sufficiently small positive constant $c$ and all sufficiently large constants $C$ and $D_n^2 = C\, \sigma^2\, n^{\frac{\gamma'}{1+2\gamma'}}$, assume
$$ \lim_{n\to\infty} \frac{H(c\,\epsilon, D_n)}{r_0(n)} = 0 \qquad\text{and}\qquad \lim_{n\to\infty} \frac{\ln(n)}{r_0(n)} = 0. \qquad (7.106) $$

Then the following assertions hold:

1. If either the parameter set $\Theta$ of the regression parameter is bounded or the contrast condition Con with $a_n \geq n^{-1}\,\mathrm{const}$ is valid, then there exists a constant $c_0 > 0$ such that for all sufficiently large $n$, $n > n_0$, and all $\beta^0 \in \Theta$, $\xi^0 \in \mathcal{F}^{(n)}$,
$$ P_{\xi^0\beta^0}\left( \bigl|\hat\xi - \xi^0\bigr|^2_{w_2} > \epsilon(n) \right) \leq \exp\left(-c_0\, r_0(n)\right). \qquad (7.107) $$

2. Under the contrast condition Con with $a_n \geq n^{-1}\,\mathrm{const}$ there exists a constant $c_0 > 0$ such that for all sufficiently large $n$, $n > n_0$, and all $\beta^0 \in \Theta$, $\xi^0 \in \mathcal{F}^{(n)}$,
$$ P_{\xi^0\beta^0}\left( a_n \bigl\|\hat\beta - \beta^0\bigr\|^2 > \epsilon(n) \right) \leq \exp\left(-c_0\, r_0(n)\right). $$
Proof. First note that the rate is chosen such that $\epsilon(n) \geq n^{-1}$; otherwise we would have $r_0(n) \leq c_0^{-1}$, which contradicts (7.106).

1. Under (7.106) and
$$ \lim_{n\to\infty} \sigma^{1+2\gamma}\, \frac{B(n)^{1+2\gamma}}{n^{a\gamma}\, \epsilon^{1+2\gamma}} = 0 $$
Theorem 7.16 is valid.

2. Under
$$ \lim_{n\to\infty} \sigma^{1+2\gamma}\, \frac{B(n)^{1+2\gamma}}{n^{a\gamma}\, \epsilon^{1+2\gamma}} = c_\tau > 0 \qquad (7.108) $$
Theorem 7.15 holds. For $\epsilon(n) \leq n^{\frac{a-1}{2}} B(n)\, \sigma^{-1}$ the rate is $r(n) = \epsilon^2(n)\, B(n)^{-2}\, n^a\, \sigma^{-2}$. Further, (7.108) yields
$$ \sigma^{1+2\gamma}\, \frac{B(n)^{1+2\gamma}}{n^{a\gamma}\, \epsilon^{1+2\gamma}} = \left( \frac{\sigma^2 B^2(n)}{n^a \epsilon^2} \right)^{1+\gamma} \frac{n^a \epsilon}{B(n)\, \sigma} = \left( \frac{r_{1,0}(n)}{r(n)} \right)^{1+\gamma} \to c_\tau > 0, $$
with
$$ \left( \frac{n^a \epsilon}{B(n)\, \sigma} \right)^{\frac{1}{1+\gamma}} = r_{1,0}(n), $$
and hence, for sufficiently large $n$,
$$ (2 c_\tau)^{\frac{1}{1+\gamma}}\, r(n) \geq r_{1,0}(n) \geq r_0(n). \qquad (7.109) $$
For $\epsilon(n) \geq n^{\frac{a-1}{2}} B(n)\, \sigma^{-1}$ the rate is $r(n) = n^{2a-1}\sigma^{-4} \geq r_0(n)$. $\Box$
Here we see that an increasing bound $B(n)$ is possible, but it has an influence on the consistency rate. In cases where we have an estimate $\exp(-2\ln n)$ of the probability we get strong consistency results. Let us state the strong consistency result without rate. For simplicity we do this under (7.102) and under the contrast condition Con with $a_n = \mathrm{const}$ only.

Corollary 7.18 Suppose the model (3.1), (3.2) and (7.102). The error distributions satisfy $\sigma^2_{ji} = \sigma^2 < \infty$ and the Statulevicius conditions S1' with $\gamma$ and S2' with $\gamma'$. The regression functions are Lipschitz continuous in the sense of L1 and L2 with constants $L_1$, $L_2$, and Con holds with $a_n = \mathrm{const}$. The bound $B(n)$, defined in (7.76) with $D_n^2 = C\, \sigma^2\, n^{\frac{\gamma'}{1+2\gamma'}}$, fulfills
$$ \lim_{n\to\infty} \frac{B(n)\, (\ln n)^{1+\gamma}}{n^a} = 0. \qquad (7.110) $$
The entropy $H(\cdot,\cdot)$ of the set of nuisance parameters $(\mathcal{F}^{(n)})^c$, for some sufficiently small positive constant $c$ and all sufficiently large constants $C$ and $D_n^2 = C\, \sigma^2\, n^{\frac{\gamma'}{1+2\gamma'}}$, satisfies
$$ \lim_{n\to\infty} \frac{H(c\,\epsilon, D_n)}{\min\left( \left( \frac{n^a}{B(n)} \right)^{\frac{1}{1+\gamma}},\; n^{2a-1} \right)} = 0. \qquad (7.111) $$
Then the following assertions hold: for all $\beta^0 \in \Theta$ and all $\xi^0 \in \mathcal{F}^{(n)}$
$$ \lim_{n\to\infty} \bigl|\hat\xi - \xi^0\bigr|^2 = 0 \quad P_{\xi^0\beta^0}\text{-a.s.} \qquad\text{and}\qquad \lim_{n\to\infty} \bigl\|\hat\beta - \beta^0\bigr\| = 0 \quad P_{\xi^0\beta^0}\text{-a.s.} $$

Proof. Take $\epsilon > 0$ arbitrary but fixed. Corollary 7.17 implies
$$ \sum_{n=1}^{\infty} P\left( \bigl|\hat\xi - \xi^0\bigr|^2 > \epsilon \right) \leq \mathrm{const}(n_0(\epsilon)) \sum_{n=n_0}^{\infty} \exp\left(-c_0\, r_0(n)\right), \qquad (7.112) $$
with
$$ r_0(n) = \min\left( \left( \frac{\epsilon\, n^a}{B(n)\, \sigma} \right)^{\frac{1}{1+\gamma}},\; n^{2a-1} \right). $$
For $r_0(n) = \left( \frac{\epsilon\, n^a}{B(n)\, \sigma} \right)^{\frac{1}{1+\gamma}}$ there exists, because of (7.110), an $n_{00} > n_0$ such that
$$ \left( \frac{\epsilon\, n^a}{B(n)\, \sigma} \right)^{\frac{1}{1+\gamma}} \geq \frac{2\ln n}{c_0} \qquad (7.113) $$
for all $n > n_{00}$. For $r_0(n) = n^{2a-1}$ we also have $n^{2a-1} \geq \frac{2}{c_0} \ln n$. Hence we estimate (7.112) by
$$ \leq \mathrm{const}(n_{00}) \sum_{n=1}^{\infty} \exp(-2\ln n) \leq \mathrm{const} \sum_{n=1}^{\infty} n^{-2} < \infty. $$
We obtain the statement by the Borel-Cantelli lemma. The same arguments hold for $\bigl\|\hat\beta - \beta^0\bigr\|$. $\Box$

Let us close the section with the consistency results for bounded regression functions and bounded design points. In other words, we assume that there exist a constant $R$ and a constant $B$ such that for all $n$
$$ B(n) \leq \sup_{\xi\in\mathcal{F}^{(n)}}\ \sup_{\beta\in\{\beta:\|\beta-\beta^0\|\leq R\}}\ \max_{i\leq n}\left( \bigl|g(\xi_i,\beta) - g(\xi_i^0,\beta^0)\bigr| + \bigl|\xi_i - \xi_i^0\bigr| \right) \leq B. \qquad (7.114) $$
The result here is independent of the constants $\gamma$ and $\gamma'$ in the Statulevicius conditions S1' and S2'. Therefore it is enough to require only one of them; compare Lemma 16.6 and Lemma 16.5 in the Appendix.
Corollary 7.19 Assume the model (3.1), (3.2) and (7.102). The regression functions are Lipschitz continuous in the sense of L1 and L2 with constants $L_1$, $L_2$ and bounded in the sense of (7.114). The error distributions satisfy $\sigma^2_{ji} = \sigma^2 < \infty$ and the Statulevicius condition S2' with $\gamma'$. For the entropy $H(\cdot,\cdot)$ of the set of nuisance parameters $(\mathcal{F}^{(n)})^c$, for some sufficiently small positive constant $c$, all sufficiently large constants $C$ and $D$, and
$$ \epsilon(n) \leq n^{\frac{a-1}{2}}\, \sigma^2, \qquad (7.115) $$
suppose
$$ \lim_{n\to\infty} \frac{H(c\,\epsilon, D)\, \sigma^2}{\epsilon(n)^2\, n^a} = 0 \qquad (7.116) $$
and
$$ \lim_{n\to\infty} \frac{\ln(n)\, \sigma^2}{\epsilon(n)^2\, n^a} = 0. \qquad (7.117) $$

Then the following assertions hold:

1. If either the parameter set $\Theta$ of the regression parameter is bounded or the contrast condition Con with $a_n \geq n^{-1}\,\mathrm{const}$ is valid, then there exists a constant $c_0 > 0$ such that for all sufficiently large $n$, $n > n_0$, and all $\beta^0 \in \Theta$, $\xi^0 \in \mathcal{F}^{(n)}$,
$$ P_{\xi^0\beta^0}\left( \bigl|\hat\xi - \xi^0\bigr|^2_{w_2} > \epsilon(n) \right) \leq \exp\left( -c_0\, \frac{\epsilon(n)^2 n^a}{\sigma^2} \right) $$
and
$$ \lim_{n\to\infty} \bigl|\hat\xi - \xi^0\bigr|^2 = 0 \quad P_{\xi^0\beta^0}\text{-a.s.} $$

2. Under the contrast condition Con with $a_n \geq n^{-1}\,\mathrm{const}$ there exists a constant $c_0 > 0$ such that for all sufficiently large $n$, $n > n_0$, and all $\beta^0 \in \Theta$, $\xi^0 \in \mathcal{F}^{(n)}$,
$$ P_{\xi^0\beta^0}\left( a_n \bigl\|\hat\beta - \beta^0\bigr\|^2 > \epsilon(n) \right) \leq \exp\left( -c_0\, \frac{\epsilon(n)^2 n^a}{\sigma^2} \right) $$
and
$$ \lim_{n\to\infty} \bigl\|\hat\beta - \beta^0\bigr\| = 0 \quad P_{\xi^0\beta^0}\text{-a.s.} $$
Proof. First we show the exponential probability bounds. We apply Theorem 7.15 directly. The condition (7.114) includes the boundedness of the nuisance parameter set. Therefore it is enough to require the entropy condition for a constant $D$ which is independent of $n$ and $\gamma'$. Under (7.115) the rate in (7.80) is $r(n) = \epsilon(n)^2\, m_n^{-1}$, and under (7.102) that is $r(n) = \epsilon(n)^2\, n^a\, \sigma^{-2}$. Further, under (7.102) and (7.114) the condition (7.82) is satisfied, since
$$ \max(w_{ji}\sigma)^{\gamma}\, \epsilon(n)^{-1-2\gamma} \geq n^{a\gamma}\, n^{\frac{1-a}{2}(1+2\gamma)}\, \sigma^{-2-\gamma} \geq n^{\frac{1-a}{2}+\gamma}\, \sigma^{-2-\gamma} \geq c_\tau. $$
Hence the exponential probability bounds follow from Theorem 7.15.

Second, we show the strong consistency results. Let $\epsilon > 0$ be arbitrary but fixed. Assume for a moment $a < 1$. Then there exists an $n_0$ such that
$$ \epsilon \geq n^{\frac{a-1}{2}}\, \sigma^2 $$
for all $n \geq n_0$. The entropy condition (7.116) implies the entropy condition for fixed $\epsilon$ as well. We apply Theorem 7.15 for that case and obtain the rate $r(n) = n^{2a-1}\sigma^{-4}$. For $a = 1$ and $\epsilon \geq \sigma^2$ we get the same rate. For $a = 1$ and $\epsilon \leq \sigma^2$ we obtain the rate $r(n) = \mathrm{const}\, n^a = \mathrm{const}\, n$. In all cases there exists an $n_{00} > n_0$ such that
$$ r(n) \geq \frac{2}{c_0} \ln n $$
for all $n > n_{00}$. Then the strong consistency is derived by the same arguments as above in Corollary 7.18. $\Box$

The theorems in this section are formulated under relatively general conditions. In each theorem the most restrictive condition is that on the entropy of the nuisance parameter set, which has to be of smaller asymptotic order than the rate of convergence, as required in (7.83), (7.99), (7.106), (7.111) and (7.116). The theorems will be applied in such a way that for special nuisance parameter sets bounds for the entropy are derived, and then the rates $\epsilon(n)$ are determined so that the entropy condition is fulfilled. Of course the results only make sense for arbitrarily small $\epsilon(n)$.
7.3 Discussion of the entropy condition

In this section we give bounds for the $\epsilon$-entropy $H(\epsilon, D)$ of $\left(\mathcal{F}^{(n)}\right)^c \cap \mathcal{F}^{(n)}(\xi^0, D)$ with respect to the norm $|\cdot|_{w_2}$ defined in (7.53). The proofs of the statements are based on the approach of Kolmogoroff, Tichomirow (1960), [99]. In contrast to the results there, we have to take into account the weighted and normalized norm $|\cdot|_{w_2}$. For a review on the entropy see also Lorentz (1966), [114]. The aim of this section is not to give the sharpest possible bounds; rather, we need upper bounds in order to check the entropy conditions ((7.83), (7.99), (7.106), (7.111), (7.116)) in the consistency theorems above. With the help of the lower bounds we will show some inconsistency results.

Generally we have
$$ n\, w_{2\min} \leq \sum_{i=1}^n w_{2i} \leq n\, w_{2\max}, $$
and under $\sum_{i=1}^n w_{2i} = 1$,
$$ w_{2\min} \leq \frac{1}{n} \leq w_{2\max}. \qquad (7.118) $$

Lemma 7.20 Assume
$$ \sum_{i=1}^n w_{2i} = 1. \qquad (7.119) $$
1. For the $n$-dimensional cube of side length $2a \leq 2D$,
$$ \mathcal{F}^{(n)} = [-a, a]^n, \qquad (7.120) $$
$$ n \ln\left( \left[ \frac{2a\sqrt{w_{2\min}}}{\epsilon} \right] \right) \leq H(\epsilon, D) \leq n \ln\left( \frac{2a}{\epsilon} + 1 \right), \qquad (7.121) $$
where $[\cdot]$ denotes the Gaussian bracket.

2. For the $n$-dimensional ball of radius $d \leq D$,
$$ \mathcal{F}^{(n)} = \mathcal{F}^{(n)}(\xi^0, d) = \left\{ \xi : \bigl|\xi - \xi^0\bigr|_{w_2} \leq d \right\}, \qquad (7.122) $$
$$ n \ln\left( \left[ \frac{2d\sqrt{w_{2\min}}}{\epsilon} \right] \right) \leq H(\epsilon, D) \leq n \ln\left( \left[ \frac{2d}{\sqrt{w_{2\min}}\,\epsilon} \right] + 1 \right). \qquad (7.123) $$

3. For $q$-dimensional linear subspaces
$$ \mathcal{F}^{(n)} = L_q = \{ \xi : \xi = A\eta,\ \eta \in \mathbb{R}^q \}, \qquad (7.124) $$
where $A$ is an $n \times q$ matrix with rank $q$ and
$$ \lambda_{\min} = \lambda_{\min}\left( A^T A \right), \qquad \lambda_{\max} = \lambda_{\max}\left( A^T A \right), \qquad (7.125) $$
$$ H(\epsilon, D) \leq q \ln\left( \frac{2D}{\epsilon} \sqrt{ \frac{\lambda_{\max}\, q\, w_{2\max}}{\lambda_{\min}\, w_{2\min}} } + 1 \right). \qquad (7.126) $$
Proof.
Define the cube W (ξ r , ǫ) as
W (ξ r , ǫ) = ξ : max |ξi − ξir | ≤ ǫ . i≤n
(7.127)
Because of (7.119) we have max |ξi − ξir |2 w2 min ≤ i≤n
n X i=1
w2i (ξi − ξir )2 = |ξ − ξ r |2w2 ≤ max |ξi − ξir |2 , i≤n
(7.128)
and thus √
!
ǫ F (ξ , ǫ w2 min ) ⊆ W (ξ , ǫ) ⊆ F (ξ , ǫ) ⊆ W ξ , √ . w2 min r
r
r
r
(7.129)
1. Consider F (n) = W (0, a) . We use the composition of a cube by smaller cubes Wl [
r=1
W (ξ r , ǫ) ⊂ W (0, a) ⊂
W [u
W (ξ r , ǫ) ,
(7.130)
r=1
where the upper bound for the ǫ−covering number with respect to the supremum norm is Wu = Wu (ǫ, a) =
2a +1 ǫ
n
,
(7.131)
and the lower bound for the ǫ−covering number with respect to the supremum norm is n 2a Wl = Wl (ǫ, a) = . (7.132) ǫ Note, [.] denotes the Gaussian brackets. Applying (7.129), we get the chain Wl [
r=1
r
√
F (ξ , ǫ w2 min ) ⊂ F
(n)
⊂
W [u
r=1
F (ξ r , ǫ) .
Hence the ǫ−covering number N (ǫ, D) satisfies Wl (ǫ, a) =
2a ǫ
n
√ ≤ N (ǫ w2 min , D)
and N (ǫ, D) ≤ Wu (ǫ, a) . Thus we obtain bounds (7.121) for the entropy by ln Wl
ǫ ,a √ w2 min
!!
≤ H (ǫ, D) ≤ ln (Wu (ǫ, a)) .
125
7.3. DISCUSSION OF THE ENTROPY CONDITION 2. Now F (n) = F (n) (ξ 0 , d) . Then from (7.129) follows
0
W ξ ,d ⊂ F
(n)
!
d ⊂W ξ ,√ . w2 min 0
(7.133)
Using the composition (7.130) for the inner and outer cube and using (7.129), we obtain Wl [
r=1
Wl W W [u [u [ √ F (ξ r , ǫ) W (ξ r , ǫ) ⊂ W (ξ r , ǫ) ⊂ F (n) ⊂ F (ξ r , ǫ w2 min ) ⊂ r=1
r=1
r=1
(7.134)
with Wl = Wl (ǫ, d)
and
Wu = Wu
!
d ǫ, √ . w2 min
(7.135)
We obtain the bounds in (7.123) from ln Wl
ǫ ,d √ w2 min
!!
≤ H (ǫ, D) ≤ ln Wu
d ǫ, √ w2 min
!!
and (7.131) and (7.132) . 3. Set F (n) = Lq . Let W be the n × n diagonal matrix with the elements w21 , ..., w2n . Introduce A (D) as
A (D) = (Lq )c ∩ F (n) ξ 0 , D
= ξ : ξ − ξ0
T
= ξ = Aη : η − η 0
W ξ − ξ 0 ≤ D2 , ξ = Aη, η ∈ Rq
T
AT W A η − η 0 ≤ D2 , η ∈ Rq .
From the relation for the eigenvalues
λmin w2 min = λmin AT A w2 min ≤ λmin AT W A
(7.137)
λmax AT W A ≤ λmax AT A w2 max = λmax w2 max we get the chain of inequalities
λmin w2 min max ηi − ηi0 i≤q
≤ λmax
≤ η − η0
T
2
2 ≤ λmin AT W A
η − η 0
2
AT W A η − η 0 = ξ − ξ 0
(7.138)
(7.139)
w2
2 2 AT W A
η − η 0
≤ qλmax w2 max max ηi − ηi0 . i≤q
(7.136)
(7.140)
126
CHAPTER 7. CONSISTENCY OF THE L.S.E. Here k.k is the Euclidean norm of the Rq . Denote a cube in Lq ⊂ Rn with length 2ǫ and center ξ r = Aη r by
W (q) (η r , ǫ) = ξ : ξ = Aη, max (ηi − ηir )2 ≤ ǫ2 i≤q
and a ball in Lq with radius ǫ and center ξ r = Aη r by
o
n
A(q) (η r , ǫ) = ξ : ξ = Aη, |ξ − ξ r |2w2 ≤ ǫ2 .
From (7.139) and (7.140) we obtain W
(q)
ǫ η ,√ λmax qw2 max r
!
(q)
⊆A
r
(η , ǫ) ⊆ W
!
ǫ η ,√ . λmin w2 min (7.141) r
(q)
Analogously to (7.130) and (7.131) we have A (D) ⊂ W (q)
D η0, √ λmin w2 min
with Wu(q)
=
Wu(q)
D ǫ, √ λmin w2 min
!
=
!
⊂
"
2D √ +1 ǫ λmin w2 min
(q)
W u [
W (q) (η r , ǫ)
(7.142)
r=1
#
!q
.
(7.143)
Because of (7.141) we have W
(q)
r
(q)
(η , ǫ) ⊂ A
r
η , ǫ λmax qw2 max = F
Hence A (D) ⊂ with
q
N [u
r=1
F (n) (ξ r , ǫ) ∩ Lq ⊂
ǫ D √ ,√ λmax qw2 max λmin w2 min
Nu = Wu(q)
(n)
!
N [u
r=1
q
r
ξ , ǫ λmax qw2 max ∩ Lq . F (n) (ξ r , ǫ)
2D = ǫ
s
q
λmax qw2 max + 1 . λmin w2 min (7.144)
We obtain upper bound for the entropy by H (ǫ, D) ≤ ln (Nu ) . 2 In the following lemma wecalculate the entropy in the models of semiparametric type where ξi = f ni , compare (3.18). The connection between the parameter sets F (n) ⊂ Rn and the set of functions Fk,α (C, L) is given by Z 2 ξ − ξ 0 = w2
1 0
f (x) − f 0 (x)
2
dGw ≤ max f i
2
i i − f0 n n
where Gw is the discrete measure, which gives each point
i n
(7.145)
the weight w2i .
127
7.3. DISCUSSION OF THE ENTROPY CONDITION Lemma 7.21 For F
(n)
= ξ
(n)
i : ξi = f , f ∈ Fk,α (C, L) n
(7.146)
with (m) ≤ C, m = 0, ..., k f (x) Fk,α (C, L) = f ∈ Ck,α [0, 1] : (k) f (x1 ) − f (k) (x2 ) ≤ L |x1 − x2 |
and with Ck,α [0, 1] given in (2.28) there exists a constant c = c (C, L, D) , such that 1 1 k+α H (ǫ, D) ≤ c . (7.147) ǫ 2 Proof. have
⊂
Here we use a result of Kolmogoroff, Tichomirow (1960) , [99] . We
(
i f n
F (n) ∩ F (n) ξ 0 , D : i=1,...,n
Z
1 0
0
f (x) − f (x)
2
)
2
dGw ≤ D , f ∈ Fk,α (C, L) .
Because of the smoothness conditions in Fk,α (C, L) for all f ∈ Fk,α (C, L) it holds ∀x ∈ [0, 1]
∃ ix ∈ {1, ..., n}
such that
max f (x) − f 0 (x)
x∈[0,1]
≤3
"
ix∗ f n
−f
0
ix∗ n
2
2
ix f (x) − f n
2
= f (x∗ ) − f 0 (x∗ )
ix∗ + f (x ) − f n ∗
2
0
WFk,α (C,L) f , D =
(
C2 n2
2
(7.148)
∗
0
+ f (x ) − f
i i 2 1 C2 ≤ 3 max f − f0 + 6 2 ≤ 6C 2 1 + 2 i n n n n Introduce the cube in the set of smooth functions
≤
0
0
W f ,D =
ix∗ n
≤ (4C)2 .
f ∈ Fk,α (C, L) : max f (x) − f 0 (x) ≤ D x∈[0,1]
and the corresponding cube in the semiparametric model by
(
i f n
i=1,...,n
0
: f ∈ WFk,α (C,L) f , D
)
.
)
.
2 #
128
CHAPTER 7. CONSISTENCY OF THE L.S.E.
Then (7.148) yields
F (n) ∩ F (n) ξ 0 , D ⊂ F (n) ⊂ W f 0 , 4C , and (7.145) gives
(7.149)
W f 0 , ǫ ⊆ F (n) ξ 0 , ǫ .
(7.150)
From Theorem XV of Kolmogoroff, Tichomirow (1960), [99], p. 31 we know that the ǫ−entropy of Fk,α (C, L) with respect of the supremum norm is smaller than a constant times
1 ǫ
1 k+α
. Hence
WFk,α (C,L) f 0 , D ⊂ with
W [
r=1
1 W = exp c ǫ
1 k+α
Hence from (7.151) and (7.150)
F (n) ⊂ W f 0 , 4C ⊆
W [
r=1
WFk,α (C,L) f 0 , ǫ ,
(7.151)
.
W f 0, ǫ ⊆
(7.152)
W [
r=1
F (n) ξ 0 , ǫ ,
with W given in (7.152). 2 The last result of this section concerns the model (3.19), (3.20). It is related to the entropy of isotonic functions, because of the relation (3.18) between the ordered set of design points and the set of isotonic functions. In van de Geer (1990), [63], Example 2.1, an entropy bound for isotonic function is given, there the norm of the functions is defined with respect to the empirical measure Gn n 1X IA (xi ) , Gn (A) = n i=1
with fixed design points xi . In our case we have the measure Gw (A) =
n X
w2i IA
i=1
i . n
In the following lemma we derive a rough entropy bound in a direct, constructive way for ordered sets.

Lemma 7.22 For the set of nuisance parameters with a known order,
$$ \mathcal{F}^{(n)} = \left\{ \xi^{(n)} : 0 \leq \xi_1 \leq \xi_2 \leq \ldots \leq \xi_n \leq 1 \right\}, \qquad (7.153) $$
the entropy is bounded by
$$ H(\epsilon, D) \leq \ln\left( \left( n \left( \frac{2}{\epsilon} + 2 \right) \right)^{\frac{2}{\epsilon} + 2} \right). \qquad (7.154) $$
7.3. DISCUSSION OF THE ENTROPY CONDITION
129
Proof. For ξ1 , .., ξn ∈ [0, 1] there are g different ξi1 , ..., ξig which have a distance from each other of more than ǫ. From Kolmogoroff, Tichomirow (1960), [99], p. 8 formula (3) we know that the largest cardinal number Mǫ ([0, 1]) of ǫ−distinguishable points in [0, 1] is the 2ǫ −covering number N 2ǫ ([0, 1]) Mǫ ([0, 1]) = N 2ǫ ([0, 1]) ≤ Hence
2 + 1. ǫ
2 + 1. ǫ Let g be arbitrary fixed. We define a cluster system g = 1, ...,
(7.155)
G1 = {ξ1 , ..., ξn1 : |ξ1 − ξn1 | ≤ ǫ} G2 = {ξn1 +1 , ..., ξn2 : |ξn1 +1 − ξn2 | ≤ ǫ}
such that
...
o Gg = ξng−1 +1 , ..., ξn : ξng−1 +1 − ξn ≤ ǫ , n
|ξ1 − ξn1 +1 | > ǫ, ..., ξng−1 − ξn > ǫ.
The first set contains all points whose distance to ξ1 is less than ǫ .. The following set is defined by the first point which has a distance greater than ǫ to the starting point of the set before and so on. This procedure uses essentially the known order 0 ≤ ξ1 ≤ ξ2 ≤ ...ξn ≤ 1 in the set F (n) . Define an ǫ−covering system of [0, 1] [0, 1] ⊂
Nǫ ([0,1])
[
k=1
[(k − 1) ǫ, kǫ] .
(7.156)
Because of Gl ⊂ [0, 1] there exists an kl and Gl ⊂ [(kl − 1) ǫ, kl ǫ] ∪ [kl ǫ, kl+1 ǫ] . Gl is covered at by two sets because each set [(k − 1) ǫ, kǫ] has the length ǫ and Gl has the length ǫ. The least of Gl lies on kl . n favorable case is that the center o (n) (n) Thus for all ξ ∈ F = ξ : 0 ≤ ξ1 ≤ ξ2 ≤ ...ξn ≤ 1 we find ξ r = ξ (g, G1 , ..., Gg , k1 , ..., kg ) ∈ Rn
given by
r ξ =
k1 ǫ : k1 ǫ
n1
: kg ǫ : n g
kg ǫ
130
CHAPTER 7. CONSISTENCY OF THE L.S.E.
where
g X
nl = 1,
l=1
such that for ξi ∈ Gli it holds ξir = kli ǫ and |ξi − kli ǫ| ≤ ǫ and |ξ − ξ r |2w2 ≤ max |ξi − ξ r |2 ≤ ǫ2 . i≤n
Summarizing we have a ǫ−covering system of F (n) F (n) =
[
r=r(g,G1 ,...,Gg ,k1 ,...,kg )
F (ξ r , ǫ) .
In order to get an upper bound for the entropy it remains to calculate how many r (g, G1 , ..., Gg , k1 , ..., kg ) we need. The number of g is given by (7.155). The set Gl is determined by the center kl and the position of the border to the following set Gl+1 , which lies between the two points ξnl , ξnl +1 . There are n−g ≤ n possibilities for the position of the border. The center h i k1 0. For
For $d = \mathrm{const}\left( \frac{B}{A} \right)^{\frac{1}{2k}}$ the minimum of $F(d)$ is attained,
$$ \min_d F(d) = \mathrm{const}\, A^{\frac{p}{2k}}\, B^{\frac{2k-p}{2k}}. $$
Applying this to (10.26) for
$$ A = \frac{1}{\epsilon^{2k}\, \mu^{\gamma_2(k)}}, \qquad B = \frac{n^{-\frac{k}{2}}}{\epsilon^{k}\, \mu^{\gamma_1(k)}}, $$
we obtain (10.24). $\Box$

Choosing now the constants $k, \mu, r$ in a convenient form, we get the following auxiliary result:

Lemma 10.3 Let $\Theta_0$ be a compact subset of $\mathbb{R}^p$. Suppose QM with $k \geq 2$ and QL with $k \geq 2$. Then for
$$ k = \frac{p}{2} + 2\delta_1, \quad \delta_1 > 0, \qquad (10.27) $$
and $\delta_1$ such that
$$ \mu = \mathrm{const}\, n^{-r}, \quad r > 0, \qquad (10.28) $$
with
$$ r = \frac{(p + 4\delta_1)(\delta_1 - \kappa)}{2\left( \gamma_1(k)\, 4\delta_1 + \gamma_2(k)\, p \right)}, \qquad (10.29) $$
there exists a constant const such that for all $\epsilon$
$$ P_{\theta^0}\left( \sup_{\beta\in\Theta_0} \left| \frac{1}{n}\sum_{i=1}^n \left( q_\mu(x_i, y_i, \beta) - E_{\theta^0} q_\mu(x_i, y_i, \beta) \right) \right| \geq \epsilon \right) \leq \mathrm{const}\, n^{-\kappa}. \qquad (10.30) $$

Proof. In Lemma 10.2 the constants $\mu, k$ are chosen such that
$$ \mu^{-\frac{\gamma_1(k)(2k-p) + \gamma_2(k)\, p}{2k}}\, n^{-\frac{k}{2} + \frac{p}{4}} \leq n^{-\kappa}. $$
$\Box$

Combining the results of both lemmata and using the Borel-Cantelli lemma for $\kappa = 1 + \delta$, we obtain the strong consistency of the ac-MCE $\tilde\beta_{\mu(n)}$ defined in (10.16).

Theorem 10.4 Let $\Theta$ be a compact subset of $\mathbb{R}^p$.

1. Suppose in (10.14) that
$$ \mu = \mu(n) \to 0 \qquad (10.31) $$
and Con', QM, QL with $k$ given in (10.27), (10.28), (10.29) with $\kappa > 0$. Then for all $\tau > 0$
$$ \sup_{\beta^0\in\Theta} P_{\theta^0}\left( \bigl\| \tilde\beta_{\mu(n)} - \beta^0 \bigr\| > \tau \right) \leq \mathrm{const}\, n^{-\kappa}. \qquad (10.32) $$

2. Suppose Con', QM, QL with $k$ given in (10.27), (10.28), (10.29) with $\kappa > 1$. Then
$$ \tilde\beta_{\mu(n)} \to \beta^0 \quad \text{a.s.} \qquad (10.33) $$
195
10.3. CONSISTENCY Proof.
From Lemma 10.1 it follows ∀τ > 0 Pθ0
e
βµ − β 0 > τ ≤ Pθ0
sup 2 |Sn (β)| ≥ ρ (τ ) − µ
β∈Θ(τ )
!
!
≤ Pθ0 sup 2 |Sn (β)| ≥ ρ (τ ) − µ . β∈Θ
For 2ǫ = ρ (τ ) − µ we apply Lemma 10.3 and obtain !
Pθ0 sup |Sn (β)| ≥ ρ (τ ) − µ ≤ const n−κ . β∈Θ
When we choose κ = 1 + δ then Borel Cantelli implies the strong consistency result (10.33). 2 We close this section by formulating the consistency result for the exact corrected MCE. Introduce the related versions of QM and QL for the exact corrected estimation function q, satisfying (10.9). QM0 There exist a constant c and a real number k, k ≥ 2
n 2 1X sup Eθ0 |q (xi , yi , β) − Eθ0 q (xi , yi , β)|k k ≤ c. β∈B n i=1
QL0 There exist a random variable M(n) , a constant c and a real number k ≥ 2 , such that for all n and for all β, β ′ ∈ Θ n 1X 2 |q (xi , yi , β) − Eθ0 q (xi , yi , β) − (q (xi , yi , β ′ ) − Eθ0 q (xi , yi , β ′ ))| n i=1
′ 2 ≤ M(n)
β − β
,
with
sup Eθ0 M(n) k ≤ c. θ0
Theorem 10.5 1. Suppose Con’ with ρ (ǫ) , QM0, QL0 with k ≥ 2. Then
p p k sup Pθ0
βe − β 0
> ǫ ≤ const ρ (ǫ)− 2 −k n− 2 + 4 .
(10.34)
β 0 ∈B
2. Suppose Con’, QM0, QL0 with k ≥ 2 and p k > +1 2 Then βe → β 0 a.s.. 2
(10.35)
196
CHAPTER 10. ALTERNATIVE ESTIMATORS
Proof. The proof goes along the line above. We can interpret the assumptions QM0, QL0 as QM with γ1 (k) = 0, QL with γ2 (k) = 0. Because of γ1 (k) = 0 and γ2 (k) = 0, we have in (10.26) Pθ0
! n 1 X sup (q (xi , yi , β) − Eθ0 q (xi , yi , β)) ≥ ρ (ǫ) β∈Θ0 n i=1
p
k
p
≤ const ρ (ǫ)− 2 −k n− 2 + 4 ,
which delivers together with Lemma 10.1 the statement (10.34). For k > p2 + 1 the rate in (10.34) is more than n−1 . Then (10.35) follows from the Lemma of Borel Cantelli. 2 Let us write one corollary more, where we have some convergence rates. We will need it as auxiliary result for proving the asymptotic normality. Corollary 10.6 Suppose Con’ with ρ (ǫ) ≥ cǫ, QM0 , QL0 with k ≥ 2 and k=
3p + δ1, 2
Then ∀τ > 0
lim sup Pθ0 n→∞ β 0 ∈B
δ1 > 0.
2 √
e
n β − β 0 > τ = 0.
Proof. The statement follows from (10.34) with for some δ > 0 and ρ (ǫ) = 1 +δ n 4 for some δ > 0 . 2
10.4 Asymptotic normality

For proving asymptotic normality we need conditions which ensure the consistency of the ac-MCE, and additional regularity conditions which imply the central limit theorem. One of them may be the Lyapunov condition, which includes the boundedness of some moments,
$$ \frac{1}{n} \sum_{i=1}^n E_{\theta^0} \left\| q^{\beta}_{\mu}(x_i, y_i, \beta) \right\|^{2+\delta} < \mathrm{const}. $$
But we cannot require this for our approximate estimators: in QM, QL we allow that the analogous moments increase with the approximation rate $\mu$, $\mu \to 0$. We will need this property of slowly increasing moments for special cases. That is the reason why we show the asymptotic normality only for the exactly adjusted alternative estimators (c-MCE). The asymptotic normality is derived by methods commonly used in nonlinear regression. In contrast to nonlinear regression we require the existence of more
197
10.4. ASYMPTOTIC NORMALITY
moments than usual. The reason lies in the application of the consistency result in (10.24). For exact corrected MCE another way for proving consistency should be possible, which do not need such high order of moments. Otherwise, because of the adjusting procedure, we have already strong assumptions on the error distribution, which include the existence of moments of high order in many cases. First let us introduce the additional regularity assumptions: Int β0 ∈ intΘ,
Θ is a compact set in Rp .
The following conditions all related to the corrected estimation criterion. They are the regularity conditions which delivers the existence and asymptotic normality of n 1X Ce β (β) = q β (xi , yi , β) , n i=1
where Ce β (β) denotes the p−dimensional vector of first derivatives of Ce (β) , introduced in (10.15). QDiff ∀i = 1, .... ∀yi ∀xi q (xi , yi , .) is two times partially differentiable. The derivatives are denoted by q β (xi , yi , β) and q ββ (xi , yi , β) respectively. QCov1 ∃n0 ∀n ≥ n0 ∀β ∈ Θ λmin (Dn (β)) > 0 with Dn (β) =
n 1X Covθ q β (xi , yi , β) . n i=1
(10.36)
A Lyapunov type condition are required in the following version. QLap ∃const ∃δ > 0 ∃n0 ∀n ≥ n0 ∀β ∈ Θ ∀ξ ∈ F (n) n
2+δ 1X Eθ
q β (xi , yi , β)
< const. n i=1
The average of the matrices of second derivatives has to satisfy the following Lipschitz type condition which is similar to OL, namely.
198
CHAPTER 10. ALTERNATIVE ESTIMATORS
QDiffL There exist a random variable M(n) , a constant c and a real number k ≥ 2 , such that for all n and for all β, β ′ ∈ Θ
with
n
2 1X ′ 2
ββ
q (xi , yi , β) − q ββ (xi , yi , β ′ ) ≤ M(n) β − β , n i=1
sup Eθ M(n) k ≤ const. θ
The following one can be named as moment condition on the second derivatives of q. QDiffM There exists a constant c, such that for all n and for all β ∈ Θ and all ξ ∈ F (n) n
2 1X
ββ
q (xi , yi , β) − Eθ q ββ (xi , yi , β) ≤ c. n i=1 Further we need for the existence of the asymptotic covariance matrix: QCov2 ∃n0 ∀n ≥ n0 ∀µ ∀β ∈ Θ λmin (Vn (β)) > 0, with Vn (β) =
n 1X Eθ q ββ (xi , yi , β) . n i=1
(10.37)
Now we are able to state a result on the asymptotic normality of the c-MCE, and we give an outline of the proof.

Theorem 10.7 Suppose the conditions of Corollary 10.6 and, in addition, Int, QDiff, QDiffL, QDiffM, QLap, QCov1, QCov2. Then it holds that
$$ \sqrt{n}\, A_n(\beta)^{\frac{1}{2}} \left( \tilde\beta - \beta \right) \longrightarrow N_p(0, I_p), \qquad (10.38) $$
with
$$ A_n(\beta)^{-1} = V_n(\beta)^{-1}\, D_n(\beta)\, V_n(\beta)^{-1}, $$
where $D_n(\beta)$ is given in (10.36) and $V_n(\beta)$ in (10.37). $\Box$
(10.38)
199
10.4. ASYMPTOTIC NORMALITY Proof.
We have the Taylor expansion for all observations (yi , xi ) n √ eβ 1 X q β (xi , yi , β) nC (β) = √ n i=1
=
√ √ e β e √ e nC β + n β − β Ce ββ (β) + n βe − β Ce ββ β − Ce ββ (β) ,
where β is an intermediate point between βe and β. Because of Int and Theorem 10.5 we have by usual arguments like in nonlinear regression that t √ e β e (10.39) nC β = op (1) .
We obtain the expansion
n √ eβ 1 X nC (β) = √ q β (xi , yi , β) n i=1
=
√ e n β − β Eθ Ce ββ (β) + R + oP (1) ,
where the remainder term is √ √ R = n βe − β Ce ββ β − Ce ββ (β) + n βe − β Ce ββ (β) − Eθ Ce ββ (β) = R1 + R2 .
Using QDiffL, we obtain R1 ≤ ≤
q
√
e n β − β
Ce ββ β − Ce ββ (β)
2 q √ √ M(n) n
βe − β
β − β
≤ M(n) n
βe − β
.
From Corollary 10.6 follows
2 √
e n β − β
= op (1)
(10.40)
and from Chebychev inequality and QDiffL we get M(n) = OP (1) . Hence R1 = op (1) . Estimate now the second remainder term under QDiffM by Chebychev inequality and by (10.40), then also R2 = op (1) . The key point of the adjusting procedure was to ensure, that Eθ Ce β (β) = 0.
200
CHAPTER 10. ALTERNATIVE ESTIMATORS Summing up, we have under QCov2 −1 √ e √ n β − β = n E Ce ββ (β) Ce β (β) − Eθ Ce β (β) + op (1) .
Under QLap, QCov1, and we have √ Hence
1
nDn− 2 (β) Ce β (β) − Eθ Ce β (β) → N (0, Ip ) .
1 √ e 1 −1 −1 2 2 η + op (1) , n β − β = V (β) Dn (β) V (β)
with η → N (0, Ip ) . 2 Note, for nuisance parameters sets F (n) , which imply the positive definiteness in (10.36), (10.37) for all ξ ∈ F (n) , the other assumptions are chosen, such that the remainder term tends uniformly with respect to β, ξ to zero.
10.5
The corrected least squares estimator
In this paper we are mostly interested in least squares methods. That is the reason why we start the consideration of special cases for the corrected least squares estimator (c-l.s.e.), satisfying the Definition 10.1 with the contrast given in (10.10), namely Cn (θ) =
n 2 1X g ξi , β 0 − g (ξi , β) . n i=1
Because of the quadratic structure we can reformulate the deconvolution equation (10.9) to the asymmetrical deconvolution equation (10.11), but it is also possible to specialize it one step more. The key assumption for the new estimator is the requirement of the existence of two continuous functions f and h : Ex ∀ξ1 ∃ f (ξ1 , .) ∈ C (Θ) ∀ β ∈ Θ E f (ξ1 + ε21 , β) = g (ξ1 , β)
(10.41)
∀ξ1 ∃ h (ξ1 , .) ∈ C (Θ) ∀ β ∈ Θ E h (ξ1 + ε21 , β) = (g (ξ1 , β))2 .
(10.42)
201
10.5. THE CORRECTED LEAST SQUARES ESTIMATOR
Note the expected value is taken with respect to ε21 , whose distribution is independent of the parameter θ. Under Ex we have that q (x1 , y1 , β) = (y1 − f (x1 , β))2 + h (x1 , β) − (f (x1 , β))2 , satisfies (10.11), because for all y1 the expected value of q with respect to ε21 is Eq (ξ1 + ε21 , y1 , β) = y12 − 2y1 Ef (ξ1 + ε21 , β) + Ef (ξ1 + ε21 , β)2 + Eh (ξ1 + ε21 , β) − Ef (ξ1 + ε21 , β)2 = y12 − 2y1 Ef (ξ1 + ε21 , β) + Eh (ξ1 + ε21 , β) = y12 − 2y1 g (ξ1 , β) + g (ξ1 , β)2 = (y1 − g (ξ1 , β))2 .
Then the c-l.s.e. βe is given as a measurable solution of the optimization problem: βe ∈ arg min β∈Θ
n h i 1X (yi − f (xi , β))2 + h (xi , β) − (f (xi , β))2 . n i=1
(10.43)
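Purely as an illustration (not part of the original text), the corrected criterion (10.43) can be evaluated and minimised numerically once correction functions satisfying Ex are available. The following minimal Python sketch assumes user-supplied functions `f(x, beta)` and `h(x, beta)`; the optimiser and starting value are illustrative choices, not prescriptions of the text.

```python
import numpy as np
from scipy.optimize import minimize

def corrected_ls_objective(beta, x, y, f, h):
    """Corrected least squares criterion of (10.43):
    average of (y_i - f(x_i, beta))^2 + h(x_i, beta) - f(x_i, beta)^2."""
    fx = f(x, beta)
    return np.mean((y - fx) ** 2 + h(x, beta) - fx ** 2)

def corrected_lse(x, y, f, h, beta_start):
    """Minimise the corrected criterion over beta (sketch only)."""
    result = minimize(corrected_ls_objective, beta_start, args=(x, y, f, h))
    return result.x
```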
The first term of (10.43) can be interpreted as the least squares criterion in the transformed (projected) regression model yi = f (xi , β0 ) + ui , with Eui = 0,
(10.44)
where ui = f (xi , β0 ) − g (ξi , β0 ) + ε1i . The other terms of (10.43) are the correction for the covariance between the f (xi , β0 ) and the transformed error term ui . We have Covθ (ui f (xi , β)) = Eθ (f (xi , β) − g (ξi , β))2 = Eθ (f (xi , β)2 ) − g (ξi , β)2 . The central point is the construction of the transformations f and h in (10.41) and (10.42). The following lemma gives some hint, how to derive them. We will denote the Fourier transform of an arbitrary function f ∈ L2 (R) by fb, such that pb denotes the characteristic function of ε21 : fb (t) :=
Z
1 Z exp (−itx) fb (t) dt exp (itx) f (x) dx and f (x) = 2π
(10.45)
Let us further suppose for the Fourier transform of the regression functions g (x, β) and of the squared regression functions (g (x, β))2 = G(2) (x, β):
202
CHAPTER 10. ALTERNATIVE ESTIMATORS
ExFour ∀β∈Θ:
Z
gb (t, β) pb (t)
!2
dt < ∞ and
Z
b (t, β) G (2) pb (t)
!2
dt < ∞ . (10.46)
This assumption ExFour is really strong. We will give only one example with respect to the Gamma distribution and very smooth regression functions . Lemma 10.8 Suppose for all β ∈ Θ : g (., β) ∈ L4 (D) and ε21 has a known density p2 , with p2 (ε21 ) = p2 (−ε21 ) . Then under ExFour the transformations f (x1 , β) and h (x1 , β) in Ex are given by 1 Z gb (t, β) f (x1 , β) = exp (−itx1 ) dt. (10.47) 2π pb (t) and
h (x1 , β) =
b (t, β) G 1 Z (2) exp (−itx1 ) dt. 2π pb (t)
(10.48)
Proof. In the course of the proof we will suppress the indices 1 and 21. The formula (10.41) describes the convolution f ∗p=g
(10.49)
or otherwise ∀β∈Θ:
Z
f (ξ − ε, β) p (−ε) dε =
Z
f (ξ − ε, β) p (ε) dε = g (ξ, β) .
From (10.49) it follows by the theorem on the Fourier transform of convolutions that ∀ β ∈ Θ : fb (t, β) pb (t) = gb (t, β) . (10.50)
Now we calculate the transformation f (x, β) by the inversion of its Fourier transform fb (t, β) in (10.50) 1 Z gb (t, β) f (x, β) = exp (−itx) dt. 2π pb (t)
The same is done for h (x, β) . 2 Note, that is exactly the same approach as used to deconvolution, compare for instance, Stefanski, Carroll (1990), [163].
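As a rough numerical illustration of (10.47) (not from the original text), the deconvolution integral can be approximated by quadrature when the regression function is smooth and decays quickly enough for ExFour to hold. The truncation limits and grid sizes below are arbitrary tuning constants, and `g` and `p_char` (the characteristic function of the design error) are assumed to be supplied by the user.

```python
import numpy as np

def deconvolved_f(g, p_char, x, beta, t_max=20.0, n_t=801, x_max=20.0, n_x=1601):
    """Sketch of f(x, beta) = (1/2pi) * int exp(-itx) g_hat(t, beta) / p_hat(t) dt."""
    t = np.linspace(-t_max, t_max, n_t)
    xs = np.linspace(-x_max, x_max, n_x)
    # g_hat(t) = int exp(i t x) g(x, beta) dx, truncated trapezoidal rule
    g_hat = np.trapz(np.exp(1j * np.outer(t, xs)) * g(xs, beta), xs, axis=1)
    # inverse transform divided by the characteristic function of the error
    integrand = np.exp(-1j * t * x) * g_hat / p_char(t)
    return np.real(np.trapz(integrand, t)) / (2.0 * np.pi)
```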
10.5. THE CORRECTED LEAST SQUARES ESTIMATOR
10.5.1
203
Polynomial functional relation model
The polynomial model was one of the first where alternative estimators introduced. For a quadratic relation Wolter, Fuller (1982) , [179] introduced one, in order to study the earth quakes in the vicinity of the Tonga trench. Hausman et al (1991) [76] and Hausman et al (1995) [77] used alternative estimators for polynomial Engels curves. Consider for the regression function a polynomial of k’th order: g (ξ1 , β) =
k X
βl (ξ1 )r
(10.51)
r=0
Note, because g (., β) is nonlinear in the nuisance parameter ξ1 , it is a nonlinear model in this context. Further we need the complete knowledge of all moments up to the 2k ′ th order of the error-in-variables distribution: Mom For some k the moments µr = E (ε21 )r , r = 1, ..., 2k are known. The correction for a real valued monomial z r of order r : Efr (z + ε) = z r
(10.52)
It is given by f0 (z) = 1, and for all r ≥ 2
r
fr (z) = z −
f1 (z) = z
r−2 X
r l
l=0
!
fl (z) µr−l ,
(10.53)
(10.54)
compare for instance Part II or Zwanzig (1996) [187]. (10.52) is easily checked by the Binomial formula r
(z + ε) =
r X
r l
l=0
!
z l εr−l ,
because: Efr (z + ε) r
= E (z + ε) − =E
r X l=0
r l
!
r−2 X l=0
!
z l εr−l −
r l
!
r−2 X l=0
Efl (z + ε) µr−l r l
!
Ez l Eεr−l = z r .
204
CHAPTER 10. ALTERNATIVE ESTIMATORS
Please do not mix the moments µr and the parameter for the approximation in (10.14), also the here introduced fr are not the function introduced in Ex. For the model (10.51) the functions in Ex are f (ξ1 , β) =
k X
βr fr (ξ1 )
(10.55)
r=0
and h (ξ1 , β) =
k k X X
βr1 βr2 fr1 +r2 (ξ1 ) .
(10.56)
r1 =0 r2 =0
For the application of polynomials models to astrometric plate reduction, compare Section1.4.2, we have two dimensional polynomials in ξ1 = (ξ11 , ξ12 ) of third order. The correction formulas corresponding to (10.55) and (10.56) for the special astrometric setting up are given in Zwanzig (1997), [188].
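The recursion (10.53), (10.54) is straightforward to implement. The following sketch (an illustration only, with hypothetical function names) computes the correction functions $f_r$ and the corrected polynomial $f(x,\beta)$ of (10.55); it assumes centred design errors, so that `moments[1] = 0`, and that the moments $\mu_r$ up to the required order are known as in Mom.

```python
import numpy as np
from math import comb

def correction_polynomials(z, moments, k):
    """f_0,...,f_k from (10.53)-(10.54); moments[r] = E eps^r, moments[0] = 1."""
    z = np.asarray(z, dtype=float)
    f = [np.ones_like(z), z.copy()]          # f_0(z) = 1, f_1(z) = z (centred errors)
    for r in range(2, k + 1):
        corr = sum(comb(r, l) * f[l] * moments[r - l] for l in range(r - 1))
        f.append(z ** r - corr)               # f_r(z) = z^r - sum_{l<=r-2} C(r,l) f_l(z) mu_{r-l}
    return f

def corrected_design_polynomial(x, beta, moments):
    """f(x, beta) = sum_r beta_r f_r(x), cf. (10.55)."""
    f = correction_polynomials(x, moments, len(beta) - 1)
    return sum(b * fr for b, fr in zip(beta, f))
```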
10.5.2
Exponential model
Consider the model g (ξ1 , β) = β0 + β1 exp (β2 ξ1 ) .
(10.57)
Assume the knowledge of the exponential moments of the error-in-variables distribution. ExpMom ε2i
i.i.d. and E exp αε21 = m (α) .
We have Z
exp (β2 (ξ1 + ε21 )) p (ε21 ) dε21 = exp (β2 ξ1 ) m (β2 )
For the model (10.57) the functions introduced in Ex are f (x1 , β) = β0 + β1 m (β2 )−1 exp (β2 x1 )
(10.58)
and h (x1 , β) = β02 + 2β1 β0 m (β2 )−1 exp (β2 x1 ) + β12 m (2β2 )−1 exp (2β2 x1 ) . (10.59)
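For completeness, here is a small illustrative sketch (not from the original text) of the correction functions (10.58), (10.59) for the exponential model; `m` is the assumed known exponential moment function of ExpMom (for instance, for centred normal design errors with variance $\sigma_2^2$ one may take `m = lambda a: np.exp(0.5 * a**2 * sigma2_sq)`).

```python
import numpy as np

def exp_model_corrections(x, beta, m):
    """Corrections f, h for g(xi, beta) = b0 + b1 * exp(b2 * xi), cf. (10.58)-(10.59)."""
    b0, b1, b2 = beta
    f = b0 + b1 / m(b2) * np.exp(b2 * x)
    h = (b0 ** 2
         + 2.0 * b1 * b0 / m(b2) * np.exp(b2 * x)
         + b1 ** 2 / m(2.0 * b2) * np.exp(2.0 * b2 * x))
    return f, h
```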
10.5.3
Gaussian regression curve
Consider
and assume
!
(ξ1 − β2 )2 g (ξ1 , β) = β1 exp − , β3 > 2σ22 2β3
(10.60)
205
10.5. THE CORRECTED LEAST SQUARES ESTIMATOR Nor ε2i
i.i.d. ∼ N 0, σ22 .
For the model (10.60) the functions introduced in Ex are √ β1 β3
!
(10.61)
√ β12 β3
!
(10.62)
(ξ1 − β2 )2 q exp − f (x1 , β) = 2 (β3 − σ22 ) β3 − σ22 and
(ξ1 − β2 )2 exp − . h (x1 , β) = q β3 − 2σ22 β3 − 2σ22
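The Gaussian-curve corrections (10.61), (10.62) can likewise be coded directly; the following sketch is only an illustration under the assumption Nor, with `sigma2_sq` the known design-error variance and $\beta_3 > 2\sigma_2^2$ as required.

```python
import numpy as np

def gaussian_curve_corrections(x, beta, sigma2_sq):
    """Corrections f, h for the Gaussian regression curve (10.60), cf. (10.61)-(10.62)."""
    b1, b2, b3 = beta
    f = b1 * np.sqrt(b3 / (b3 - sigma2_sq)) * np.exp(-(x - b2) ** 2 / (2.0 * (b3 - sigma2_sq)))
    h = b1 ** 2 * np.sqrt(b3 / (b3 - 2.0 * sigma2_sq)) * np.exp(-(x - b2) ** 2 / (b3 - 2.0 * sigma2_sq))
    return f, h
```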
10.5.4
Laplace distribution
If we assume for the error-in-variables distribution, the Laplace one, we can derive the corrected least squares criterion for a great class of smooth regression functions. La ε2i
i.i. Laplace distributed with density p, p (u) =
1 exp (− |ui |) . 2
Consider a function g : R −→ R, g ∈ C 2 (R) with G
Z
|g (t)| exp (− |t|) dt < ∞,
Z
lim g (t) exp (− |t|) = 0,
|t|→∞
|g ′′ (t)| exp (− |t|) dt < ∞
and |t|→∞
lim g ′ (t) exp (− |t|) = 0.
Lemma 10.9 Consider a function g : R −→ R, g ∈ C 2 (R) . Assume G. Then for all x ∈ R 1Z (g (x + t) − g ′′ (x + t)) exp (− |t|) dt = g (x) . 2 2
206
CHAPTER 10. ALTERNATIVE ESTIMATORS
Proof. The proof is done directly by computing the respected integrals with partial integration and using for p (t) = that for t 6= 0
1 exp (− |t|) 2
(10.63)
p′ (t) = −p (t) sgn (t)
(10.64)
lim p′ (t) = p (0) = − lim p′ (t) .
(10.65)
and that t→−0
t→+0
2 For regressions functions g (., β) satisfying G from Lemma 10.9 follows that the functions in Ex are f (x1 , β) = g (x, β) − g ξξ (x, β) and
h (x1 , β) = g (x, β)2 − 2 g ξ (x, β)
2
− 2g (x, β) g ξξ (x, β) .
(10.66)
(10.67)
An approximate corrected least squares estimator Let us give at least one example for an corrected least squares estimator. We assume La with and for the regression function g (ξ1 , β) = β |ξ1 | ,
β ∈ [−d, d] .
We get the formal solution of the first deconvolution equation (10.41) in Ex from the following Lemma: Lemma 10.10 Consider a function g : R −→ R, g ∈ C 2 (R) and g is strictly increasing. The inverse function of g is denoted by g −1 . Assume G. Then for all x, y ∈ R 1Z q (x + t) exp (− |t|) dt = |x| , 2
with q (x) = |x| − 2δ (x) , where δ is the δ−function. Here formally defined by Z
2
δ (x + t) exp (− |t|) dt = exp (− |x|) .
(10.68)
207
10.5. THE CORRECTED LEAST SQUARES ESTIMATOR
Proof. The proof is done directly by computing the respected integral with partial integration and using (10.65), (10.64). Compare Lemma 8.1 by Kukush, Zwanzig (1997), [104], with y = 0 and g (ξ) = ξ. 2 That means the formal solution of (10.41) is f (x1 , β) = β (|x1 | − 2δ (x1 )) . We introduce a nonnegative, normalized, smooth kernel function ̟ : R → R with bounded support, sup p (̟) ⊆ [−1, 1]
Z
3
̟ ∈ C (R) ,
̟ (t) dt = 1.
(10.69)
Then we approximate the δ−function by !
1 t δµ (t) := ̟ . µ µ
(10.70)
Hence we get for the approximate criterion fµ (x1 , β) = β (|x1 | − 2δµ (x1 )) . The second function h in (10.42) of Ex is related to g (ξ)2 = β 2 ξ 2 , which satisfies the conditions G of Lemma 10.9. We get
h (x1 , β) = β x2 − 2 . Summing up, we obtain for the approximate corrected estimation criterion (10.15). Cµ (β) =
n X i=1
(yi − β (|xi | − 2δµ (xi )))2 − 2β 2 −2 |xi | δµ (xi ) + 2δµ2 (xi ) + 1 =
n X i=1
yi2 − 2yi fµ (xi , β) + β x2i − 2
.
It remains to show (10.14). We have |Efµ (x1 , β) − Ef (x1 , β)| = |Efµ (ξ1 + ε11 , β) − β |ξ1 || ≤ 2 |β|
10.5.5
µ ≤ const µ. 2
Application of the Fourier transform method
Because of the strong condition ExFour the application of the Fourier transform method only is useful for heavy tail error distributions. Let us consider the Gamma distribution, which is a member of the group of ordinary smooth distributions, concerning the classification of Fan(1991), [54].
208
CHAPTER 10. ALTERNATIVE ESTIMATORS
Gam Let ε2i be i.i. centered Gamma distributed with parameters ϑ > 0, r natural number and density p, p (u) =
(
ϑr (r−1)!
u− 0
r ϑ
r−1
exp −ϑ u −
r ϑ
for u ≥ ϑr . else
Further we need that the regression function has the following properties, first all derivatives with respect to ξ1 exist, the k ′ th we denote by g (k) (., β) and second the derivatives satisfies the boundness condition 1 1 + |ξ1 |m
l X (k) g (ξ1 , β)
k=0
!
≤ cm,l ,
l = 0, 1, 2, .... m = 0, 1, ....
That means we require, that the regression function is a member of S (R) , see for instance Triebel (1972), [170], p.97. Smoo For all β ∈ Θ :
g (., β) ∈ S (R)
It follows from Lemma 10.8, by using the formulas for the characteristic function of a Gamma distribution, 1 ϑr , pb (λ) = √ eiλm (ϑ − iλ)r 2π
that the functions in Ex are
!
r 1 X r k −k (k) f (x, β) = r x − ,β ϑ g ϑ k=0 r ϑ
and
! (k) r r k −k 2 1 X g x − ,β ϑ , h (x, β) = r ϑ k=0 r ϑ
where g (k) (x, β) denotes the k ′ th derivatives with respect to the first argument.
10.6
On a c-MCE for rational regression functions
Consider the special case of regression functions g= Then we have the model yi =
g1 . g2
g1 (ξi , β) + ε1i g2 (ξi , β)
(10.71)
10.6. ON A C-MCE FOR RATIONAL REGRESSION FUNCTIONS
209
xi = ξi + ε2i , with inf inf g2 (ξi , β) ≥ d > 0, ξ1
β
(10.72)
which we can rewrite to an implicit model G (ηi , β) = η1i −
g1 (η2i , β) =0 g2 (η2i , β)
where η1i =
g1 (ξi , β) g2 (ξi , β)
and η2i = ξi .
Then under (10.72) this model is equivalent to G0 (ηi , β) = (g2 (η2i , β) η1i − g1 (η2i , β))2 = 0. A reasonable contrast for implicit models is n 1X (G0 (ηi , β))2 = Cn (θ) = n i=1
=
n 1X (g2 (η2i , β) η1i − g1 (η2i , β))2 . n i=1
(10.73)
Under (10.72) we have
Cn η 0 , β − Cn η 0 , β 0 =
n 2 1X 0 0 0 g2 η2i , β η1i − g1 η2i ,β n i=1
n 2 1X g ξi0 , β 0 − g ξi0 , β . ≥d n i=1
Thus the contrast condition Con’ for Cn (θ) is fulfilled under the Jennrich type contrast condition on the regression function g n
2 2 1X g ξi0 , β 0 − g ξi0 , β ≥ const
β 0 − β
. n i=1
Models of the special form (10.71) occur in the biometry, for instance in the context of growth curves. We will give for one special case the corrected MCE related to the contrast (10.73).
210
CHAPTER 10. ALTERNATIVE ESTIMATORS
A c-MCE for a growth curve Consider the regression function g (ξ1 , β) =
exp (β1,1 ξ1 ) , g2 (ξ1 , β)
where g2 is a polynomial of K’th order, K X
g2 (ξ1 , β) =
βk ξ1k ,
k=0
satisfying (10.72). Assume ExpMom with m (β) , then the estimation criterion for the c-MCE related to the contrast in (10.73) is given by q (y1 , x1 , β)
= m (2β1,1 )−1 exp (2β1,1 x1 ) + y12 − σ12 h (x1 , β) − 2y exp (β1,1 x1 ) f (x1 , β) where the function f and h
f (x1 , β) =
K X
βk fk∗ (x1 )
k=0
and h (x1 , β) =
K K X X
βk1 βk2 fk∗1 +k1 (x1 )
k1 =0 k2 =0
relate to the polynomial g2 (ξ1 , β) . The auxiliary functions fk∗ (x1 ) has to fulfill Efk∗ (x1 ) exp (β11 x1 ) = exp (β11 ξ1 ) ξ1k for k = 0, ..., 2K. They correspond to the auxiliary function introduced in (10.52), (10.53) and ∗ k (10.54) with µk = E exp (β1,1 ε21 ) ε21 , that means m (β11 ) f0∗ (z) = 1 and for all r ≥ 2 m (β11 ) fr∗ (z) = z r −
r−2 X l=0
r l
!
fl∗ (z) µ∗r−l .
10.6. ON A C-MCE FOR RATIONAL REGRESSION FUNCTIONS
211
Application to copolymerization In application to chemistry models of type (10.71) also important. In Section 1.4.3 we discussed the application of error-in-variables models to copolymerization. The copolymerization equation, given in (1.36), yields the model g (ξi , β) =
ξi (r1 ξi + 1) ξ i + r2
with
ξi =
M1 (t) . M2 (t)
The corrected minimum contrast estimator with respect to the contrast (10.73) is given by n 1X e e q (xi , yi , r1 , r2 ) (r1 , r2 ) ∈ arg min n i=1
with
q (xi , yi , r1 , r2 ) =
(r1 )4 f4 (xi ) − 2r1 r2 x2i − σ22 yi + (r2 )2 yi2 − σ12
+2r1 f3 (xi ) (1 − yi ) + 2r2 yi2 − yi − σ12 xi .
212
CHAPTER 10. ALTERNATIVE ESTIMATORS
Chapter 11

Efficiency

In this chapter we consider the nonlinear functional relation model (3.1), (3.2) with normally distributed errors. A Hajek-type local asymptotic minimax bound is derived, and different estimators are compared. It turns out that the least squares estimator with optimal weights is the best one, of course only under assumptions that ensure asymptotic normality. This is not surprising because, under the normal distribution, the optimal weighted least squares estimator is the maximum likelihood estimator. As we have seen in the previous chapters, the assumptions for asymptotic normality of the l.s.e. are quite restrictive, and not only for the purpose of the proofs: in Chapter 8 the inconsistency of the l.s.e. is shown for interesting models. Therefore alternative estimators were introduced in Chapter 10 which are asymptotically normal under less restrictive assumptions. On the other hand, these can only be calculated for special models. Under the normal distribution we have alternative estimators, for example, for polynomial regression curves, for exponential models and for some growth curves. We will see that these estimators do not attain the Hajek bound. Further, the model with replications is investigated in more detail. Here the least squares estimator attains the bound. In that model the naive least squares estimator is consistent and asymptotically normal, but it is asymptotically optimal only in the linear case.
11.1 A minimax bound of Hajek type

A minimax bound of Hajek type is derived, based on the method of local least favorable alternatives. This result is given in Zwanzig (1989), [183]. In the linear functional relation model asymptotic minimax bounds of Hajek's type were obtained in Ibragimov, Hasminski (1983), [86], in Nussbaum (1983), [129], and in Nussbaum, Zwanzig (1988), [131]. The result presented here is a generalization of the local minimax theorem for linear models in [131].

Denote by $\mathcal{L}$ the class of loss functions $l : \mathbb{R}^p \to \mathbb{R}_+$ with $l(0) = 0$ and $l(-x) = l(x)$, such that for all $c$ the set $\{x : l(x) > c\}$ is convex in $\mathbb{R}^p$ and $\ln(l(x)) \leq o(\|x\|^2)$. The local neighborhood of the nuisance parameters is here defined by the distance
$$ \mu(\xi, \xi') = \frac{1}{\lambda_n} \sum_{i=1}^n (\xi_i - \xi_i')^2 = \frac{n}{\lambda_n} \bigl|\xi - \xi'\bigr|^2_n, \qquad (11.1) $$
where
λn = λmin Φ = Φ=
n X i=1
1
1 λ0 (n) v2 β βT
2 2 σ1i + σ2i giξ
2 g i g i P
=
(11.2) 1 Φw∗ ,σ . v2
(11.3)
P
−2 with Φw∗ ,σ derived in (9.43) and v −1 = j=1,2 ni=1 σji given in (9.40). In the following minimax theorem, the minimal eigenvalue λn will describe the best rate of convergence which can be achieved for some estimator βe for β. For ”common” models we expect that λn is of order n,
λn ≍ n.
(11.4) √
Then we have the usual parametric convergence rate of n. Otherwise in the nonlinear regression theory several models are known with a other rates, for instance for the regression function g (t, θ) = cos (2πθt) ,
t = 1, ..., n
3 Kundu (1993), [106], showed for the l.s.e. that n 2 θb − θ is asymptotically normal distributed. Consider the Fisher information
In (β) =
n T 1X giβ giβ . n i=1
(11.5)
2 Because of the relation (9.12) and (11.2), we have for σji = σ2
λn = σ 2
1 σ2 λ (n) ≤ λmin (In (β)) . 0 n2 n
Hence it makes sense to maintain the minimal eigenvalue λn as a normalizing term in the norm and in the following regularity assumption, and to give up the requirement (11.4). The main idea least favorable alternatives is to n of the method of local o n o pass (n) from the model Pf,t : f ∈ F , t ∈ Θ to a smaller model Pf (t),t : t ∈ Θ with special parameterized nuisance parameters f (t) , where f (t) is chosen as least favorable in the local neighborhood of (ξ, β) with f (β) = ξ. In the course of the proof we will see, that a useful choice is fi (t) = ξi + ViT (t − β) ,
(11.6)
215
11.1. A MINIMAX BOUND OF HAJEK TYPE with Vi =
2 σ2i
2 ξ −σ2i gi
2 giξ
+
2 σ1i
giβ ,
(11.7)
where all derivatives are taken at (ξi , β) . We need the smoothness condition Diff and the following regularity assumption. Reg There exists a constant κ ≥ 1, independent of n and of (ξ, β), such that n 1 X kVi k2 ≤ κ, λn i=1
(11.8)
n 2 1 X kVi k2 giξξ ≤ κ, , λn i=1 n
2 1 X kVi k2
giξβ
≤ κ, , λn i=1
n 1 ββ 2 1 X ≤ κ, . tr gi 2 λn i=1 σ1i
Note, the boundness condition Bound and the condition inf infn
β∈Θ ξ∈F
λn ≥ const n
(11.9)
imply the regularity condition Reg. Let us now state the minimax result. Theorem 11.1 In the model (3.1) and (3.2) with
2 εji ∼ N 0, σji
and with Diff and Reg and lim λn = ∞
n→∞
for any loss function l ∈ L and for any estimator βe it holds lim lim n→∞ inf
δ→0
2
sup
sup
{t:kt−βk≤δ} {f :µ(ξ,f )≤δ}
Et,f l Φ
1 2
βe − t
≥
Z
ldN.
(11.10)
216
CHAPTER 11. EFFICIENCY
Proof.
The proof follows the line of that in Zwanzig (1989), [183]. We have
µ (ξ, f ) =
n n 2 1 X 1 X (ξi − fi (t))2 = ViT (t − β) ≤ κ kt − βk2 . λn i=1 λn i=1
and because of (11.8). Then we estimate sup
sup
kβ−tk≤δκ µ(ξ,ξ ′ )≤δκ
≥ sup
1 Et,ξ′ l Φ 2 βe − t
sup
kβ−tk≤δ µ(ξ,ξ ′ )≤δκ
1 Et,ξ′ l Φ 2 βe − t
1 ≤ sup Et,f (t) l Φ 2 βe − t
kβ−tk≤δ
.
These inequalities give the bridge: n o from the semiparametric model Pf,t : f ∈ F (n) , t ∈ Θ and to the parametric n
o
model Pf (t),t : t ∈ Θ .
n
o
It remains to check the LAN condition in the model Pf (t),t : t ∈ Θ at (ξ, β) . Then Theorem 12.1 of Ibragimov, Hasminski (1983), [87], p.162 delivers the statement of the theorem. Introduce the local alternatives for β by 1
t = β + Φ− 2 u, then
(11.11) 1
fi (t) = ξi + ViT Φ− 2 u,
(11.12)
and consider the likelihood quotient Λ = ln
dPt,f (t) . dPξ,β
Under normal error distribution we have Λ=− +
n n 1X 1 1X 1 2 (g (ξ , β) − g (f , t)) − (ξi − fi )2 i i 2 2 2 i=1 σ1i 2 i=1 σ2i
n X 1 i=1
2 σ1i
ε1i (g (ξi , β) − g (fi , t)) +
n X 1 i=1
2 σ2i
ε2i (ξi − fi ) .
(11.13)
Using the Taylor expansion and (11.11), (11.12) we get
g (fi , t) − g (ξi , β) = giξ Vi + giβ where the derivatives are taken at (ξi , β) and
T
1
Φ− 2 u + ∆i ,
1 1 1 ξβ T ββ T V V + 2g V + g Φ− 2 u ∆i = uT Φ− 2 g ξξ i i i i i i 2
(11.14)
217
11.1. A MINIMAX BOUND OF HAJEK TYPE
ββ where g ββ ξ i , β with ξ i − ξi ≤ |fi − ξi | and β − β ≤ kt − βk and g ξξ i = g i ,
g ξi are analogously defined. We obtain
# " n T 1 1 1 T −1 X 1 ξ β ξ β T g i Vi + g i g i Vi + g i + 2 Vi Vi Φ− 2 u + η T u + ∆, Λ=− u Φ 2 2 2 σ2i i=1 σ1i (11.15) where ! n n X X 1 1 ξ β − 12 η=Φ ε g i Vi + g i + ε V . (11.16) 2 1i 2 2i i i=1 σ1i i=1 σ2i
and ∆=−
n n n X 1 1X 1 2 1 T −1 X 1 ξ β T 2 u Φ ∆ − ∆ + ε ∆. g V + g i i i i i 2 2 2 2i i 2 i=1 σ1i 2 i=1 σ1i i=1 σ2i
(11.17)
The key idea of the local least favorable alternative is the clever choice of Vi in (11.7) such that T
− 21
u Φ
"
n X 1 2 σ1i
i=1
giξ Vi
+
giβ
giξ Vi
+
#
T giβ
1 1 + 2 Vi ViT Φ− 2 u = uT u. σ2i
(11.18)
Let us show (11.18). Using (11.7), we obtain giξ Vi + giβ =
2 σ2i
and 1 V VT = 2 i i σ2i
1
= Φ− 2
n X i=1
2 σ1i
+
2 σ2i
2 σ2i giξ
2
2 giξ
2 σ2i giξ
2 σ2i giξ
Thus
2 σ1i
2
+
2
2 σ1i
2 + σ1i
giβ
β βT
2 gi gi
n X β βT − 1 −1 2 g i g i Φ 2 = Φ 2 2
2 giξ
i=1
+ σ1i
1
2 σ2i giξ
1
2
2 + σ1i
− giβ giβT Φ 2
1
= Φ− 2 ΦΦ− 2 = I. Hence (11.18) is valid. The random vector η in (11.16) is normally distributed, because of the normal distribution of the error terms. From (11.18) we get − 21
Cov (η) = Φ
n X 1 2 i=1 σ1i
giξ Vi
+
giβ
giξ Vi
+
T giβ
+
n X 1
V VT 2 i i σ i=1 2i
!
1
Φ− 2 = I.
1
218
CHAPTER 11. EFFICIENCY
Thus η is a standard normally distributed vector of dimension p. Summarizing, we have in (11.15) 1 Λ = − uT u + η T u + ∆, with η ∼ N (0, I) 2 Hence for the LAN-property it remains to show that the remainder term ∆, given (11.17) tends in probability to zero. The regularity condition Reg is chosen such that ∆ → 0 in probability for n → ∞. Using the relations λmax AT BA ≤
λmax AT A λmax (B) and especially
1
1
λmax Φ− 2 BΦ− 2 ≤ λmax Φ−1 λmax (B) ≤ λmin (Φ)−1 λmax (B) and λmax (B) ≤ tr (B), from (11.14) we obtain for the first term in (11.17) that n X 1 i=1
2 σ2i
2
(∆i ) ≤
n X 1 1 i=1
2 σ2i
2
T
− 21
u Φ
T g ξξ i Vi Vi
+
T 2g ξβ i Vi
+
g ββ i
n 1 X 1 ξβ T ββ 2 ξξ T g V + g ≤ kuk 2 λ V V + 2g max i i i i i i 2 λn i=1 σ2i 4
n 1 X 1 ≤ const kuk 2 2 λn i=1 σ2i 4
− 21
Φ
u
2
!
!
ξξ 2
ξβ 2 4 2 ββ 2 g i kVi k + 2 g i kVi k + tr g i
≤ kuk4 κλ−1 n
The second term in (11.17) are estimated by the same arguments. Consider the last term in (11.17), we have V ar
n X 1
2 i=1 σ2i
!
ε2i ∆i =
n X 1
2 i=1 σ2i
(∆i )2 → 0.
2 For non normal error distributions fulfilling smoothness conditions we can show the LAN property also. Then by using the Theorem 12.1 of Ibragimov, Hasminki (1983), [87], the Hajek bound can be derived for non normal error distributions. In order to show LAN, we have to apply the Taylor expansion to both to the densities in likelihood quotient and to the regression function. The random leading term will not be normally distributed, but is a sum of independent random variables; under regularity conditions the asymptotic normality can be shown by the central limit theorem. The least favorable sequence of nuisance parameters has to be determined in a similar way as above. In this paper we confine ourselves mainly to least squares methods; and we know from the regression theory that the least squares estimator attains the bound only under normal distribution.
219
11.2. COMPARISON OF THE ESTIMATORS
Interesting may be the comparison of the bound in Theorem 11.1 with that in nonlinear regression. There we have under regularity conditions √
lim lim n→∞ inf sup Eξ,β l
δ→0
kt−βk≤δ
nIn (β)
1 2
βe − β
≥
Z
ldN,
see for instance Zwanzig (1994), [186]. Recall (9.12), we have Φ ≺ nIn (β) and that means in nonlinear regression the bound is sharper, and it cannot be attained in the functional relation model. The unknown nuisance parameters have essential influence. The error-in-variables models are not adaptive, as already discussed in the linear case by Nussbaum, Zwanzig (1988), [130].
11.2
Comparison of the estimators
11.2.1
Efficiency of the l.s.e.
Here we show that if the l.s.e. in the nonlinear model is asymptotically normally distributed, then it attains the minimax bound.

Theorem 11.2 Under the assumptions of Theorem 9.2 and Bound, the least squares estimator with optimal weights $w^*_{ji} = \frac{v}{\sigma^2_{ji}}$, where $v$ is such that $\sum w^*_{ji} = 1$, attains the Hajek bound of Theorem 11.1 for all bounded loss functions $l \in \mathcal{L}$.

Proof. Theorem 9.2 yields the convergence
$$ D_{w^*}^{-\frac{1}{2}} \left( \hat\beta - \beta \right) \to N_p(0, I). $$
Under Bound the constants in (9.22) are independent of $(\xi, \beta)$; thus the above convergence holds uniformly. Further, we derived in (9.44) that $D_{w^*}^{-1} = v^{-1} \Phi_{w^*} = \Phi$. Then the statement follows from Theorem 8 in the Appendix of Ibragimov, Hasminski (1983), [87]. $\Box$
11.2.2
Inefficiency of the alternative estimator
The following result is on the alternative estimator and may be a little bit disappointing. In general the alternative estimator does not attain the minimax bound. Let us explain it in more detail.
220
CHAPTER 11. EFFICIENCY Theorem 10.7 states the convergence of √
1 2
1
nVn (β) 2 Dn−1 (β) Vn (β) → Np (0, I) .
Then the bound in (11.10) is attained for √
1 2
1
1
nVn (β) 2 Dn−1 (β) Vn (β) = Φ 2 .
(11.19)
The alternative estimators are not M-estimators in the classical sense, because the estimating functions $q(x_1,y_1,\beta) \ne q\big(y_i - g(\xi_i,\beta),\ x_i - \xi_i\big)$. They are solutions of the deconvolution equation and have a complicated structure. That is why we consider special cases only. The alternative estimator for the linear model has no practical importance, but the estimator should be efficient in this case as well. Thus consider $g(\xi_1,\beta) = \xi_1\beta$; then
$$h(x_1,\beta) = x_1^2\beta^2 - \sigma^2\beta^2, \qquad f(x_1,\beta) = x_1\beta,$$
thus
$$q(x_1,y_1,\beta) = (y_1 - x_1\beta)^2 - \sigma^2\beta^2 = y_1^2 - 2y_1x_1\beta + x_1^2\beta^2 - \sigma^2\beta^2.$$
We have
$$A(\beta) = V_n(\beta)^{\frac12}\, D_n^{-1}(\beta)\, V_n(\beta)^{\frac12} = \frac{\sum E\, q_{\beta\beta}}{\sum \mathrm{Cov}\,(q_\beta)}$$
and
$$\Phi = \frac{1}{\sigma^2\left(1+\beta^2\right)}\sum \xi_i^2 .$$
Simple calculation gives that under the normal distribution with $\sigma_1^2 = \sigma_2^2 = 1$
$$nA(\beta) = \frac{n\sum\xi_i^2}{\left(1+2\beta^2\right)\left(\sum\xi_i^2+n\right)},$$
which is asymptotically equivalent to $\frac{\sum\xi_i^2}{1+2\beta^2}$ in the sense that
$$\lim_{n\to\infty}\left(\frac{n\sum\xi_i^2}{\left(1+2\beta^2\right)\left(\sum\xi_i^2+n\right)} - \frac{\sum\xi_i^2}{1+2\beta^2}\right)^2 = 0.$$
We have
$$\frac{\sum\xi_i^2}{1+2\beta^2} \ \le\ \frac{1}{1+\beta^2}\sum\xi_i^2 ,$$
with strict inequality for $\beta\ne 0$, so the asymptotic information of the alternative estimator stays below $\Phi$ and the bound of Theorem 11.1 is not attained.
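As a purely illustrative check of this calculation (not part of the text), the following Monte Carlo sketch compares the alternative estimator for the linear model, here computed from the explicit root $\hat\beta = \sum x_iy_i / (\sum x_i^2 - n\sigma^2)$ of the corrected contrast, with the information bound $1/\Phi$; the design points, sample size and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
beta, sigma, n, reps = 1.0, 1.0, 400, 2000
xi = rng.uniform(0.2, 1.0, size=n)                    # fixed design points (illustrative choice)

est = np.empty(reps)
for r in range(reps):
    x = xi + sigma * rng.standard_normal(n)           # x_i = xi_i + eps_2i
    y = beta * xi + sigma * rng.standard_normal(n)    # y_i = xi_i * beta + eps_1i
    # root of the corrected contrast sum((y - x b)^2 - sigma^2 b^2) -> min
    est[r] = np.sum(x * y) / (np.sum(x ** 2) - n * sigma ** 2)

phi = np.sum(xi ** 2) / (sigma ** 2 * (1 + beta ** 2))  # efficient information Phi
print("empirical variance of estimator:", est.var())
print("efficient bound 1/Phi          :", 1.0 / phi)
```

For $\beta \ne 0$ the empirical variance visibly exceeds $1/\Phi$, in line with the inequality above.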
11.2.3 Outlook
Summarizing the results of the above sections: under restrictive assumptions we have a consistent and efficient estimator, and it is the l.s.e. The alternative estimator is reasonable under much less restrictive assumptions, but it is not efficient. The open problem is: how should the efficient estimator be adjusted in order to be consistent and to maintain its efficiency? The alternative estimators proposed in Chapter 10 are based on deconvolution integral equations, which correct the error in the variables only. This is an asymmetric procedure for a symmetric model, which makes no difference between both types of errors. Considering the random leading term (11.16) of the expansion of the likelihood ratio (11.13), we see that it depends on both errors $\varepsilon_{1i}, \varepsilon_{2i}$. Therefore we have to search for a symmetric adjusting procedure. Maybe the way out is to apply the methods of Chapter 10 to the ''local least favorable'' model $\left\{P_{f(t),t} :\ t\in\Theta,\ f_i(t) = \xi_i + V_i(\xi,\beta)(t-\beta),\ \xi\in F^{(n)}\right\}$, where $V_i(\xi,\beta) = V_i$ is defined in (11.7). This could give a chance to approach the true nuisance parameter from the ''right'' direction.
11.3 Efficiency in the replication model
Let us consider the averaged model in (3.11) and (3.12) with normally distributed errors. There we have
$$\bar y_i = g(\xi_i,\beta) + \bar\varepsilon_{1i}, \qquad (11.20)$$
$$\bar x_i = \xi_i + \bar\varepsilon_{2i}, \qquad (11.21)$$
$$\bar\varepsilon_{ji} \sim N\!\left(0,\ \tfrac{1}{r}\,\sigma_0^2\right) \quad\text{i.i.d.},\quad i = 1,\dots,q, \qquad (11.22)$$
with
$$\beta\in\mathrm{int}\,\Theta,\qquad \xi_i\in\mathrm{int}\,[0,1], \qquad (11.23)$$
and with bounded regression function and bounded derivatives. In this model $\bar x_i \to \xi_i$ for $r\to\infty$. Here it makes sense to apply the naive estimator. Recall that the naive estimator was introduced in (5.24) as the solution of
$$\bar\beta = \arg\min_{\beta\in\Theta}\ \sum_{i=1}^q w_{1i}\left(\bar y_i - g(\bar x_i,\beta)\right)^2. \qquad (11.24)$$
We can show by the usual techniques that $\bar\beta$ is asymptotically normally distributed. Assume

Pos' For all $q > q_0$ the Fisher matrix $I_q(\beta)$ given in (11.5) is positive definite, $\lambda_{\min} I_q(\beta) > 0$.
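To make the averaged replication model concrete, here is a small simulation sketch of the naive estimator (11.24); the regression function $g(\xi,\beta)=\exp(\beta\xi)$, the grid minimisation and all constants are illustrative choices of my own, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
q, r, sigma0, beta_true = 50, 200, 0.5, 1.3            # q design points, r replicates each
xi = np.linspace(0.05, 0.95, q)                        # unknown design points (simulated)
g = lambda t, b: np.exp(b * t)                         # illustrative regression function

# averaged observations as in the model (11.20)-(11.22)
y_bar = g(xi, beta_true) + sigma0 / np.sqrt(r) * rng.standard_normal(q)
x_bar = xi + sigma0 / np.sqrt(r) * rng.standard_normal(q)

# naive weighted least squares (11.24) with w_1i = 1/q, minimised over a grid
grid = np.linspace(0.5, 2.5, 4001)
crit = [np.mean((y_bar - g(x_bar, b)) ** 2) for b in grid]
beta_naive = grid[int(np.argmin(crit))]
print("naive estimator:", beta_naive)
```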
Let us quote here the result from Zwanzig (1989), [183].
Theorem 11.3 Suppose model (11.20)-(11.23) with Diff, Bound, Pos' and $w_{1i} = q^{-1}$ and $n = qr$ such that
$$\frac{q}{r}\to 0. \qquad (11.25)$$
Then for the estimator defined in (11.24) it holds that
$$\lim_{n\to\infty}\ \sup_{\beta\in\Theta}\ \sup_{\xi\in F^{(q)}}\ \left| P_{\xi,\beta}\!\left( U^{\frac12}(\xi,\beta)\big(\bar\beta-\beta\big)\in C \right) - N(C) \right| = 0, \qquad (11.26)$$
where
$$U(\xi,\beta) = \frac{q^2 r}{\sigma_0^2}\, I_q(\beta)\, D^{-1}\, I_q(\beta) \qquad (11.27)$$
with
$$D = \sum_{i=1}^q \left(1+\left(g_{i\xi}\right)^2\right) g_{i\beta}\, g_{i\beta}^T$$
and with $I_q(\beta)$ given in (11.5). 2

Note that under Pos' the inverse of $D$ exists, because
$$\lambda_{\min}\!\left(\sum_{i=1}^q \left(1+\left(g_{i\xi}\right)^2\right) g_{i\beta}\, g_{i\beta}^T\right) \ \ge\ q\,\lambda_{\min} I_q(\beta)\ >\ 0,$$
since the additional term $\sum_{i=1}^q \left(g_{i\xi}\right)^2 g_{i\beta}\, g_{i\beta}^T$ is positive semidefinite.

Corollary 11.4 Under the assumptions of Theorem 11.3 and
$$g(\xi_1,\beta) = \xi_1\, h(\beta) \qquad (11.28)$$
the naive estimator $\bar\beta$ attains the minimax bound for bounded loss functions. 2

From Theorem 11.3 it follows that for all bounded loss functions $l$
$$E_{\xi,\beta}\, l\!\left( U^{\frac12}(\xi,\beta)\big(\bar\beta-\beta\big)\right) \to \int l\, dN,$$
compare Theorem 8 in the Appendix of Ibragimov, Hasminski (1981), [87]. Because of Theorem 11.1 it remains to show $U(\xi,\beta)=\Phi$. In the linear case (11.28) we have $g_{i\xi} = h(\beta)$ and
$$\Phi = \sum_{i=1}^q \frac{1}{\sigma_{1i}^2 + \sigma_{2i}^2\left(g_{i\xi}\right)^2}\, g_{i\beta}\, g_{i\beta}^T = \frac{rq}{\sigma_0^2\left(1+(h(\beta))^2\right)}\, I_q(\beta),$$
$$D = \sum_{i=1}^q \left(1+\left(g_{i\xi}\right)^2\right) g_{i\beta}\, g_{i\beta}^T = \left(1+(h(\beta))^2\right) q\, I_q(\beta).$$
Thus
$$U(\xi,\beta) = \frac{rq}{\sigma_0^2\left(1+(h(\beta))^2\right)}\, I_q(\beta)\, I_q^{-1}(\beta)\, I_q(\beta) = \frac{rq}{\sigma_0^2\left(1+(h(\beta))^2\right)}\, I_q(\beta) = \Phi.$$
Corollary 11.5 Under the assumptions of Theorem 11.3 and
$$\exists\, i_1, i_2:\quad \left(g^{\xi}(\xi_{i_1},\beta)\right)^2 \ne \left(g^{\xi}(\xi_{i_2},\beta)\right)^2 \quad\text{and}\quad \left\|g^{\beta}(\xi_{i_1},\beta)\right\| > 0,\ \left\|g^{\beta}(\xi_{i_2},\beta)\right\| > 0, \qquad (11.29)$$
the naive estimator $\bar\beta$ does not attain the minimax bound. 2

Proof. From Theorem 11.3 it follows that for all bounded loss functions $l$
$$E_{P_{\xi,\beta}}\, l\!\left( U^{\frac12}(\xi,\beta)\big(\bar\beta-\beta\big)\right) \to \int l\, dN,$$
compare Theorem 8 in the Appendix of Ibragimov, Hasminski (1981), [87]. For
$$U(\xi,\beta) \prec \Phi \qquad (11.30)$$
this implies that $\bar\beta$ does not attain the bound.

We have the $q\times q$ diagonal matrix
$$W(\sigma) = \mathrm{diag}\!\left( \frac{1}{\sigma_{11}^2+\sigma_{21}^2\left(g_{1\xi}\right)^2},\ \dots,\ \frac{1}{\sigma_{1q}^2+\sigma_{2q}^2\left(g_{q\xi}\right)^2} \right)$$
and the $p\times q$ matrix of first derivatives
$$G^\beta = \left(g_{i\beta_l}\right)_{l=1,\dots,p;\ i=1,\dots,q}.$$
Then we rewrite
$$\Phi = G^\beta W(\sigma)\, G^{\beta T}, \qquad \frac{\sigma_0^2}{r}\, D = G^\beta W^{-1}(\sigma)\, G^{\beta T}, \qquad q\, I_q(\beta) = G^\beta G^{\beta T},$$
$$U = G^\beta G^{\beta T}\left(G^\beta W^{-1}(\sigma)\, G^{\beta T}\right)^{-1} G^\beta G^{\beta T}.$$
Then
$$\Phi - U = G^\beta W(\sigma)\, G^{\beta T} - G^\beta G^{\beta T}\left(G^\beta W^{-1}(\sigma)\, G^{\beta T}\right)^{-1} G^\beta G^{\beta T}$$
$$= G^\beta W^{\frac12}(\sigma)\left( I - P \right) W^{\frac12}(\sigma)\, G^{\beta T},$$
where
$$P = W^{-\frac12}(\sigma)\, G^{\beta T}\left(G^\beta W^{-1}(\sigma)\, G^{\beta T}\right)^{-1} G^\beta W^{-\frac12}(\sigma)$$
is a projection matrix and positive semidefinite. Let us now continue the proof indirectly: assume $\Phi = U$. Then
$$P\, W^{\frac12}(\sigma)\, G^{\beta T} = W^{\frac12}(\sigma)\, G^{\beta T}. \qquad (11.31)$$
This implies
$$R\!\left(W^{\frac12}(\sigma)\, G^{\beta T}\right) \subset R\!\left(W^{-\frac12}(\sigma)\, G^{\beta T}\right)$$
or, respectively,
$$R\!\left(G^{\beta T}\right) \subset R\!\left(W^{-1}(\sigma)\, G^{\beta T}\right), \qquad (11.32)$$
where $R(X) = \{Xb,\ b\in\mathbb R^p\}\subset\mathbb R^n$ for $n\times p$ matrices $X$. By $R^\perp(X)$ we denote the orthogonal complement of $R(X)$. The relation (11.32) implies
$$R^\perp\!\left(W^{-1}(\sigma)\, G^{\beta T}\right) \subset R^\perp\!\left(G^{\beta T}\right). \qquad (11.33)$$
Take
$$y \in R^\perp\!\left(W^{-1}(\sigma)\, G^{\beta T}\right);$$
then
$$y^T W^{-1}(\sigma)\, G^{\beta T} a = 0 \quad\text{for all } a,$$
and
$$y^T W^{-1}(\sigma)\, G^{\beta T} a = \sum_{i=1}^q y_i\left(1+\left(g_{i\xi}\right)^2\right) g_{i\beta}^T a = \sum_{i=1}^q y_i\, g_{i\beta}^T a + \sum_{i=1}^q y_i\left(g_{i\xi}\right)^2 g_{i\beta}^T a = 0.$$
Thus, under the above assumption (11.29), it follows from
$$y^T G^{\beta T} a = -\sum_{i=1}^q y_i\left(g_{i\xi}\right)^2 g_{i\beta}^T a \ne 0$$
that
$$y \notin R^\perp\!\left(G^{\beta T}\right),$$
which contradicts (11.33). Thus (11.31) does not hold. 2

Otherwise, under the same assumptions, we have Corollary 11.4 and Theorem 11.2; both together imply that the l.s.e. is efficient. Thus in the linear case both the naive estimator and the l.s.e. are efficient, but in the nonlinear case only the least squares estimator is efficient. Hence in the nonlinear replication model the advice is: take the l.s.e.
Part II Estimation in the nonparametric model
Chapter 12 Orthogonal series estimators

12.1 Introduction
This part concerns the nonparametric functional relation model. In contrast to Part I, no parametric form of the regression function is assumed; we require only that the regression function is a member of a smooth function class. Our model is now a ''nonparametric-semiparametric'' one: nonparametric because of the nonparametric regression model involved, and semiparametric because of the increasing dimension of the nuisance parameter. As in Part I we consider the unknown design points as deterministic, but we will require that the asymptotic design is known. This knowledge is important for defining orthogonality and for constructing an orthogonal basis of the space in which the regression function lies. The main connection to Part I is that the estimators of the orthonormal basis functions we use are the alternative ones of Chapter 10. This part includes the results for the orthogonal polynomial estimator of Zwanzig (1996), [187]. Furthermore, a more general approach for orthogonal series estimators is given, as presented in a contributed talk at the Fourth World Congress of the Bernoulli Society (1996). The main new example given in this talk are Fourier series estimators. The orthogonal series estimator is based on an orthonormal system of polynomials. The main idea is the construction of an unbiased estimator of the polynomial at the unknown design points; this problem is considered separately in the location submodel. The consistency of the orthogonal polynomial series estimator is shown. The lower bound on the convergence rate is derived by applying the method of worst cases; for technical reasons it holds only for a class of estimators. The volume of the set of nuisance parameters essentially determines the rate. The proposed polynomial series estimator achieves the bound.
12.2 Model assumptions
Suppose we have observations $(y_1,x_1),\dots,(y_n,x_n)$, independently but not identically distributed, generated by the errors-in-variables model
$$y_i = g(\xi_i) + \varepsilon_{1i}, \qquad (12.1)$$
$$x_i = \xi_i + \varepsilon_{2i}, \qquad (12.2)$$
where $i = 1,\dots,n$. The design points (or variables) $\xi_1,\dots,\xi_n$ are unknown and fixed. Without loss of generality let $-1 \le \xi_i \le 1$. The errors $\varepsilon_{1i}$ are i.i.d. with expected value zero and finite variance $\sigma_1^2$; the absolute moments of $\varepsilon_{1i}$ are denoted by $E|\varepsilon_{11}|^k = \mu_{1k}$. The errors $\varepsilon_{2i}$ are i.i.d. with expected value zero and all absolute moments finite,
$$E|\varepsilon_{21}|^k = \mu_{2k} < \infty. \qquad (12.3)$$
In contrast to the errors in the first model equation, we suppose that the $\mu_{2k}$ are known. The knowledge of $\mu_{2k}$ is used explicitly in the construction of the estimator. We use a generalized moment condition of Bernstein type, B1, on the error distribution in the regression equation, and the same condition on the error-in-variables distribution, called B2:

B1 $\quad \forall k\ge 2\quad \mu_{1k} \le \Delta^k\,(k!)^{1+\gamma}$

B2 $\quad \forall k\ge 2\quad \mu_{2k} \le \Delta^k\,(k!)^{1+\gamma}$

The condition B2 is equivalent to the following Linnik condition,
$$\exists a\quad E\exp\!\left(a\,|\varepsilon_2|^{\frac{1}{1+2\gamma}}\right) < \infty, \qquad (12.4)$$
and if $\gamma = 0$, then B2 is equivalent to the following condition on the characteristic function $\varphi_\varepsilon(t)$ of $\varepsilon$:
$$\exists H\ \forall t\in[-H,H]\quad \varphi_\varepsilon(t) \le \exp\!\left(a t^2\right), \qquad (12.5)$$
compare Bentkus, Rudskis (1980), [13]. Further, the Bernstein condition is equivalent to the Statulevicius condition S1; the exact formulation of this equivalence is given in the Appendix by Lemma 16.2 and Lemma 16.3.
Model assumption

For the regression function $g : [-1,1]\to\mathbb R$ we make the model assumption Mod: $g$ is $k$ times differentiable, the $k$-th derivative is $\alpha$-Hölderian with constant $L$, and all regression functions and their derivatives are bounded by the same constant $B$. Denote the degree of smoothness by $\nu = k+\alpha$.

Mod $\quad g \in M_\nu(B) = M_\nu \qquad (12.6)$
$$M_\nu = \left\{ g \in C^\nu_{[-1,1]} :\ \left\|g^{(m)}\right\| \le B,\ m = 0,\dots,k;\quad \max_x \left|g^{(k)}(x) - g^{(k)}(x+\delta)\right| \le L\,\delta^\alpha \right\} \qquad (12.7)$$

Design assumption
In the following we introduce the design assumption Des. Let us denote by $G_n$ the empirical measure on $[-1,1]$ induced by the unknown design points $\xi_1,\dots,\xi_n$:
$$G_n(A) = \frac{1}{n}\sum_{i=1}^n I_A(\xi_i), \qquad (12.8)$$
where $I_A(\xi_i) = 1$ for $\xi_i\in A$ and $I_A(\xi_i) = 0$ for $\xi_i\notin A$. We assume that for the unknown design points $\xi^{(n)} = (\xi_1,\dots,\xi_n)$ the empirical measure $G_n$ has an asymptotic design measure $G$ in the following sense:

Des $\quad \exists\,\delta\in(0,1)\quad \sup_{f\in M}\left|\int f\, dG_n - \int f\, dG\right| \le D\,n^{-\delta} \qquad (12.9)$

The set $D_n$ of all design points which fulfill Des for some $\delta\in(0,1)$,
$$D_n = \left\{ (\xi_1,\dots,\xi_n) :\ \sup_{f\in M}\left|\int f\, dG_n - \int f\, dG\right| \le D\,n^{-\delta} \right\}, \qquad (12.10)$$
is not empty. For instance the Ylvisaker design
$$\xi_i^0 = G^{-1}\!\left(\frac{i}{n}\right) \qquad (12.11)$$
fulfills
$$\sup_{f\in M}\left|\int f\, dG_n - \int f\, dG\right| \le 2B\,n^{-1}.$$
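As a quick numerical illustration of Des (a toy check of my own, with $G$ the uniform distribution on $[-1,1]$ and a few smooth bounded test functions standing in for the class $M$), one can verify that the Ylvisaker design has discrepancy of order $n^{-1}$:

```python
import numpy as np

# G uniform on [-1,1]; Ylvisaker design xi_i = G^{-1}(i/n) = 2*i/n - 1
tests = [np.cos, np.sin, lambda t: t ** 3, lambda t: np.exp(t) / np.e]   # smooth, |f| <= 1
grid = np.linspace(-1, 1, 200001)                                        # fine grid for int f dG

for n in (50, 500, 5000):
    xi = 2.0 * np.arange(1, n + 1) / n - 1.0
    disc = max(abs(f(xi).mean() - np.trapz(f(grid), grid) / 2.0) for f in tests)
    print(n, disc, "  compare C/n =", 2.0 / n)
```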
The key assumption Key of this part is that the asymptotic design is known:

Key $\quad G$ is known. $\qquad$ (12.12)
The knowledge of G is important for the construction of the estimator.
Orthogonal system

We have $M_\nu \subset L^2_{[-1,1]}(G)$, and there exists an orthonormal system of polynomials $\{1, p_1, p_2, \dots\}$ which spans $L^2_{[-1,1]}(G)$, compare for instance Szegő (1959), [168], p. 26, where
$$p_m(x) = \sum_{l=0}^m a_{m,l}\, x^l \qquad (12.13)$$
and
$$\int p_m(x)\, p_j(x)\, dG = \begin{cases} 1 & \text{for } m = j \\ 0 & \text{for } m \ne j \end{cases}. \qquad (12.14)$$
We require a second condition on $G$ with respect to this orthonormal system:

Pol $\quad \exists\, c_G\ \forall m\quad \max_{l=0,\dots,m}\, m\,|a_{m,l}|^2 \le \exp\!\left(c_G\, m\ln m\right). \qquad (12.15)$

In the case of a uniform asymptotic design, $G$ equals the uniform distribution over $[-1,1]$. The Legendre polynomials form the orthonormal system and the condition Pol is fulfilled, compare Sansone (1959), [140], p. 174. For a uniform asymptotic design we can also apply the Fourier basis
$$\left\{ \frac{1}{\sqrt{2\pi}},\ \frac{1}{\sqrt{\pi}}\cos(\xi),\ \frac{1}{\sqrt{\pi}}\sin(\xi),\ \dots,\ \frac{1}{\sqrt{\pi}}\cos(m\xi),\ \frac{1}{\sqrt{\pi}}\sin(m\xi),\ \dots \right\}. \qquad (12.16)$$
The aim is to estimate $g\in M_\nu$ consistently with respect to the norm of $L^2_{[-1,1]}(G)$. The main point is the construction of alternative estimators for the basis functions.
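For the uniform design the orthonormal polynomials can be written down explicitly as rescaled Legendre polynomials, $p_m = \sqrt{2m+1}\,P_m$. The following sketch (an illustration of my own, not part of the text) checks the orthonormality relation (12.14) numerically:

```python
import numpy as np
from numpy.polynomial import legendre

# orthonormal polynomials w.r.t. G = uniform distribution on [-1,1]
nodes, weights = legendre.leggauss(60)                 # Gauss-Legendre quadrature on [-1,1]

def p(m, x):
    return np.sqrt(2 * m + 1) * legendre.Legendre.basis(m)(x)

gram = np.array([[np.sum(weights * p(m, nodes) * p(j, nodes)) / 2.0   # dG = dx/2
                  for j in range(6)] for m in range(6)])
print(np.round(gram, 10))                              # approximately the identity, cf. (12.14)
```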
12.3 Construction of orthogonal series estimators
In this section we describe the construction of the orthogonal series estimators in the functional relation model. Assume that we have an arbitrary normalized system of functions which spans the model space $M_\nu$ and which is orthogonal with respect to the $L^2$-norm in $L^2_{[-1,1]}(G)$. This means that for all $g\in M_\nu$ we have
$$g(x) = \sum_{m=0}^\infty \beta_m\, p_m(x). \qquad (12.17)$$
Note that the coefficients of $p_m$ depend on the asymptotic design $G$. The coefficients in (12.17) describe the unknown regression function $g$ uniquely. It holds that
$$\beta_m = \int g(x)\, p_m(x)\, dG. \qquad (12.18)$$
The idea of the orthogonal series estimators consists of a convenient truncation of the series in (12.17) at $q(n)$ and of an estimation of the main coefficients $\beta_0,\dots,\beta_{q(n)}$:
$$\hat g(x) = \sum_{m=0}^{q(n)} \hat\beta_m\, p_m(x). \qquad (12.19)$$
In nonparametric regression theory one possibility of estimating the coefficients is
$$\tilde\beta_m = \frac{1}{n}\sum_{i=1}^n y_i\, p_m(\xi_i). \qquad (12.20)$$
In the errors-in-variables model we cannot choose this way, because the design points $\xi_1,\dots,\xi_n$ are unknown. In the following we describe the construction of $\hat\beta_m$. The main idea is to use the adjusting procedure for estimating the basis functions at the true unknown design points. That means functions $h_m(x_i)$ are wanted which solve the deconvolution equations $E\, h_m(x_i) = p_m(\xi_i)$ for all $i = 1,\dots,n$. Rewritten as an integral,
$$\int h_m(\varepsilon_{2i}+\xi_i)\, p(\varepsilon_{2i})\, d\varepsilon_{2i} = p_m(\xi_i), \quad\text{for all } m = 0,\dots$$
For i.i.d. error terms, as supposed here, it is enough to require the above integral equation in the following form:
$$\int h_m(\varepsilon+\xi_1)\, p(\varepsilon)\, d\varepsilon = p_m(\xi_1), \quad\text{for all } m = 0,\dots\ \text{and all } \xi_1\in[-1,1].$$
This is the same equation as in (10.41); the index $m = 0,\dots$ now corresponds to the free parameter $\beta$ in equation (10.41). This means we can use the results of Chapter 10. We will discuss here only two cases, the polynomial series estimators and the Fourier series estimators. In particular, under the Laplace error distribution we have a free choice of orthonormal system, because of Lemma 10.9. For the polynomial series estimator we apply the functions $f_m(x)$, defined in (10.53), (10.54), with
$$E\, f_m(x_i) = (\xi_i)^m \quad\text{for all } i = 1,\dots,n. \qquad (12.21)$$
Then we introduce functions $h_m(x)$, based on (12.13),
$$h_m(x) = \sum_{l=0}^m a_{ml}\, f_l(x) \qquad (12.22)$$
with
$$E\, h_m(x_i) = \sum_{l=0}^m a_{ml}\, E f_l(x_i) = \sum_{l=0}^m a_{ml}\, \xi_i^l = p_m(\xi_i) \quad\text{for all } i = 1,\dots,n. \qquad (12.23)$$
For the Fourier series estimators we use (10.58) with $\beta_0 = 0$, $\beta_1 = \frac{1}{\sqrt\pi}$ and $\beta_2 = im$. If we have symmetric error distributions, then the exponential moments required in ExpMom are real and equal the characteristic function $\varphi_\varepsilon(m)$ of $\varepsilon$:
$$m(\beta_2) := \int \exp(im\varepsilon)\, p(\varepsilon)\, d\varepsilon = \int \left(\cos(m\varepsilon)+i\sin(m\varepsilon)\right) p(\varepsilon)\, d\varepsilon = \int \cos(m\varepsilon)\, p(\varepsilon)\, d\varepsilon = \varphi_\varepsilon(m).$$
The unbiased estimators, given by (10.58), of the Fourier basis (12.16) are
$$E\, h^{(1)}_m(x_1) = \frac{1}{\sqrt\pi}\cos(m\xi_1), \qquad E\, h^{(2)}_m(x_1) = \frac{1}{\sqrt\pi}\sin(m\xi_1)$$
with
$$h^{(1)}_m(x_1) = \frac{1}{\sqrt\pi\,\varphi_\varepsilon(m)}\cos(m x_1), \qquad h^{(2)}_m(x_1) = \frac{1}{\sqrt\pi\,\varphi_\varepsilon(m)}\sin(m x_1).$$
Now we define the estimator $\hat\beta_m$ of $\beta_m = \int g(x)\, p_m(x)\, dG(x)$ by
$$\hat\beta_m = \frac{1}{n}\sum_{i=1}^n y_i\, h_m(x_i). \qquad (12.24)$$
The estimator $\hat\beta_m$ has the same expected value as $\tilde\beta_m$ in (12.20), because
$$E\tilde\beta_m = \frac{1}{n}\sum_{i=1}^n E y_i\, p_m(\xi_i) = \frac{1}{n}\sum_{i=1}^n g(\xi_i)\, p_m(\xi_i) \qquad (12.25)$$
and, because of (12.21),
$$E\hat\beta_m = \frac{1}{n}\sum_{i=1}^n E y_i\, E h_m(x_i) = \frac{1}{n}\sum_{i=1}^n g(\xi_i)\, p_m(\xi_i). \qquad (12.26)$$
Let us denote this expected value by
$$\beta_m^{(n)} = \int g(x)\, p_m(x)\, dG_n(x). \qquad (12.27)$$
We see that the bias
$$\int g(x)\, p_m(x)\, d(G - G_n)(x) \qquad (12.28)$$
will be estimated by assumption Des.
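The following toy sketch illustrates the Fourier-type adjustment for normally distributed measurement errors, where $\varphi_\varepsilon(m) = \exp(-m^2\sigma^2/2)$; the regression function, the random stand-in for the fixed design points and all constants are illustrative assumptions of mine, not from the text.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, n, m = 0.3, 20000, 3
xi = rng.uniform(-1, 1, size=n)                 # stand-in for the fixed design points
g = lambda t: t ** 2                            # illustrative smooth regression function
y = g(xi) + 0.2 * rng.standard_normal(n)
x = xi + sigma * rng.standard_normal(n)

phi_eps = np.exp(-0.5 * (m * sigma) ** 2)       # characteristic function of N(0, sigma^2) at m

beta_tilde = np.mean(y * np.cos(m * xi)) / np.sqrt(np.pi)                # uses the unknown xi_i
beta_plug  = np.mean(y * np.cos(m * x)) / np.sqrt(np.pi)                 # plug in x_i: biased
beta_hat   = np.mean(y * np.cos(m * x)) / (np.sqrt(np.pi) * phi_eps)     # adjustment via h_m^(1)
print(beta_tilde, beta_plug, beta_hat)
```

The adjusted coefficient agrees with the infeasible one on average, while the unadjusted plug-in coefficient is shrunk by the factor $\varphi_\varepsilon(m)$.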
Chapter 13 Location Submodel

Let us regard the following location model
$$x = \xi + \varepsilon, \qquad (13.1)$$
where $\xi\in[-1,1]$ is an unknown fixed parameter and $\varepsilon$ is a random variable with expected value zero and known moments $\mu_k$, which fulfill the moment condition of Bernstein type

B $\quad \exists\Delta\ \exists\beta\in(0,1]\ \forall k\ge 3\quad \mu_k \le \Delta^k\,(k!)^{\frac1\beta}.$

For instance, if we assume that $\varepsilon$ is normally distributed and the variance goes to zero, $\sigma^2\to 0$, we have $\Delta\to 0$. We introduce the bound
$$\Delta(m) = (\Delta\vee 1)^m\,(\Delta\wedge 1) = \begin{cases} \Delta^m & \text{for } \Delta > 1 \\ \Delta & \text{for } \Delta\le 1 \end{cases}, \qquad (13.2)$$
which goes to zero if $\Delta\to 0$. We define the estimator $f_m(x)$ of $\xi^m$ by
$$f_0(x) = 1, \qquad f_1(x) = x, \qquad f_2(x) = x^2 - \sigma^2, \qquad (13.3)$$
$$f_m(x) = x^m - \sum_{l=0}^{m-2}\binom{m}{l}\, f_l(x)\,\mu_{m-l}. \qquad (13.4)$$
In Theorem 13.1 the unbiasedness and, under $\Delta\to 0$, the consistency are shown.

Theorem 13.1 In the model (13.1) with B, for $f_m$ defined in (13.3), (13.4) it holds that

1. $E f_m(x) = \xi^m$, and

2. for all $k$, all $m$ and all $D \ge 3\ln k + \frac32$, $\qquad (13.5)$
$$E\left|f_m(x)-\xi^m\right|^k \le (k!)^{\frac1\beta}\,\Delta^k(m)\,\exp\!\left(\frac1\beta\, kD\,(m\ln m + 1)\right). \qquad (13.6)$$
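Before the proof, a small Monte Carlo illustration of part 1 (a sketch of my own, assuming normally distributed errors so that the even moments $\mu_k = \sigma^k(k-1)!!$ are known and the odd ones vanish):

```python
import numpy as np
from math import comb

def moments_normal(sigma, kmax):
    # mu_k = E eps^k for eps ~ N(0, sigma^2): 0 for odd k, sigma^k (k-1)!! for even k
    mu = [1.0, 0.0]
    for k in range(2, kmax + 1):
        mu.append(0.0 if k % 2 else sigma ** k * np.prod(np.arange(k - 1, 0, -2, dtype=float)))
    return mu

def f(m, x, mu):
    # recursion (13.3)-(13.4): f_0 = 1, f_1 = x, f_m = x^m - sum_{l<=m-2} C(m,l) f_l mu_{m-l}
    vals = [np.ones_like(x), x]
    for j in range(2, m + 1):
        vals.append(x ** j - sum(comb(j, l) * vals[l] * mu[j - l] for l in range(j - 1)))
    return vals[m]

rng = np.random.default_rng(3)
xi, sigma, m = 0.7, 0.4, 4
mu = moments_normal(sigma, m)
x = xi + sigma * rng.standard_normal(2_000_000)
print(f(m, x, mu).mean(), "vs", xi ** m)        # E f_m(x) = xi^m  (Theorem 13.1, part 1)
```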
2

Proof. First we show part 1 by induction; $f_0(x), f_1(x), f_2(x)$ fulfill it obviously. Using the binomial formula for $x^m = (\xi+\varepsilon)^m$ in (13.4), where we set $\mu_1 = E\varepsilon = 0$, we obtain
$$f_m(x) - \xi^m = \sum_{l=0}^{m-1}\binom{m}{l}\,\xi^l\left(\varepsilon^{m-l}-\mu_{m-l}\right) - \sum_{l=0}^{m-2}\binom{m}{l}\left(f_l(x)-\xi^l\right)\mu_{m-l}. \qquad (13.7)$$
Under part 1 for all $l = 1,\dots,m-2$ we get
$$E\left(f_m(x)-\xi^m\right) = \sum_{l=0}^{m-1}\binom{m}{l}\,\xi^l\, E\left(\varepsilon^{m-l}-\mu_{m-l}\right) - \sum_{l=0}^{m-2}\binom{m}{l}\, E\left(f_l(x)-\xi^l\right)\mu_{m-l} = 0. \qquad (13.8)$$
It remains to prove part 2. We will do this by induction also. Denote the bound in (13.6) by
$$B(m) = (k!)^{\frac1\beta}\,\Delta^k(m)\,\exp\!\left(\frac1\beta\, kD\,(m\ln m+1)\right). \qquad (13.9)$$
For $m = 0$ we have $E\left|f_0(x)-\xi^0\right|^k = 0 \le B(0)$. Because of B we have for $m = 1$
$$E\left|f_1(x)-\xi^1\right|^k = E|\varepsilon|^k \le (k!)^{\frac1\beta}\exp(k\ln\Delta) \le B(1).$$
Now we assume (13.6) holds for all l = 1, ..., m − 2. Remember (13.7) and applying |a + b|k ≤ 2k−1 |a|k + |b|k (13.10) and
k
E εm−l − µm−l ≤ 2k−1 E |ε|k(m−l) + µkm−l ≤ 2k µk(m−l)
we obtain E |fm (x) − ξ m |k ≤ k
2k−1 (m) max
l=0,..,m−1
m l
!k
kl
|ξ| 2k µk(m−l) +
max
l=0,..,m−2
m l
!k
(13.11)
µk(m−l) B (l) .
(13.12)
m k
From (B) and |ξ| ≤ 1 we get E |fm (x) − ξ | ≤ 2
k−1
k
(m) (m!)
k
max ∆
l=0,..,m−1
k(m−l)
M1 (l) +
max ∆
l=0,..,m−2
k(m−l)
k
∆ (l) M2 (l)
(13.13)
235 with M1 (l) = 1 (m − l)!l!
M2 (l) =
!k
2 (m − l)!l!
!k
1
((k (m − l))!) β
(13.14) !
1 ((m − l)!) (k!) exp kD (l ln l + 1) . β k β
1 β
(13.15)
We distinguish the case ∆ ≤ 1 and ∆ > 1 and obtain max ∆k(m−l) ≤ (∆ ∨ 1)km (∆ ∧ 1)k = ∆k (m)
(13.16)
max ∆k(m−l) ∆ (l) ≤ ∆k (m)
(13.17)
l=0,..,m−1
and l=0,..,m−2
such that E |fm (x) − ξ m |k ≤ 2
k−1
k
k
k
(m) (m!) ∆ (m)
max M1 (l) +
l=0,..,m−1
max M2 (l)
l=0,..,m−2
First we estimate M1 (l) by using the Stirling formula with c1 =
√
(13.18) 2πe
1 1 exp n ln n − n + ln n ≤ n! ≤ c1 exp n ln n − n + ln n . 2 2 We get k (F11 (l) + F12 (l) + F13 (l)) M1 (l) ≤ 2 c1 exp β k
1 β
!
(13.19)
(13.20)
with F11 (l) = (1 − β) (m − l) ln (m − l) − βl ln l ≤ (1 − β) m ln m
(13.21)
F12 (l) = (β − 1 + ln (k)) m + (1 − ln (k)) l ≤ (β + ln (k)) m
(13.22)
and
and F13 (l) = Because of
ln (k) + ln (m − l) ln (m − l) ln l ln m ln (k) − − ≤ + . 2k 2β 2β 2k 2k
ln x x
0 we have ln m ln (k) 1 F3 (l) ≤ + ≤ m 2km 2km 2
for all k ≥ 1and all m ≥ 1. Hence
(13.24)
1 k (1 − β) m ln m + m β + ln (k) + max M1 (l) ≤ 2 c1 exp l=0,..,m−1 β 2 k
1 β
!
.
(13.25)
Now we estimate the second term M2 (l)of (13.13). We apply the induction the Stirling formula on (13.15) also and obtain !
(13.26)
F21 (l) = (1 − β) (m − l) ln (m − l) + (D − β) l ln l
(13.27)
k M2 (l) ≤ (k!) exp (F21 (l) + F22 (l) + F23 (l)) β 1 β
with 1 1 1 F22 (l) = − β ln l + (1 − β) ln (l − m) ≤ (1 − β) ln (m) 2 2 2 F23 (l) = m (β − 1) + l + D + ln c1 ≤ mβ + D + ln c1 − 2 ≤ mβ + D.
(13.28) (13.29)
We get in (13.27) by the convexity of the function f (x) = (m − x) ln (m − x) + x ln x, x ∈ [0, m] max F21 (l) ≤ (1 − β) m ln m + (D − 1) (m − 2) ln (m − 2) .
l=0,..,m−2
(13.30)
For m = 2, 3 we have max F21 (l) ≤ (1 − β) m ln m.
(13.31)
l=0,..,m−2
In the other cases we apply ln (m − 2) / ln (m) ≤ 1 and obtain max F21 (l) ≤ (D − 1 − β) m ln mD − (D − 1) 2 ln (m − 2) .
l=0,..,m−2
(13.32)
Summarizing we get max [F21 (l) + F22 (l) + F23 (l)] ≤ (1 − β) m ln m + mβ −
l=0,..,m−2
β ln m + ∆0 (m) 2 (13.33)
with ∆0 (m) =
(
, for m = 2, 3 D + 12 ln (m) . (D − 1) m ln m + D + 21 ln m − (D − 1) 2 ln (m − 2) , else (13.34)
such that β k (1 − β) m ln m + mβ − ln m + ∆0 (m) max M2 (l) ≤ exp l=0,..,m−2 β 2 Now we compare (13.25) with (13.35) . 3 3 ln k + 2 we have
!!
.
(13.35) Because of the condition on D ≥
β 1 − ln m + ∆0 (m) ≥ m ln k + 2 2
(13.36)
237 and the bound of M1 (l) is less than the bound of M2 (l) and E |fm (x) − ξ m |k ≤
!!
k β (k!) 2 (m) (m!) ∆ (m) exp (1 − β) m ln m + mβ + − ln m + ∆0 (m) . β 2 (13.37) Applying the Stirling formula (13.19) once more we obtain 1 β
k
k
k
k
k E |fm (x) − ξ | ≤ (k!) ∆ (m) exp (Dm ln m + D + ∆1 ) β m k
1 β
k
!
(13.38)
with (
(1 − D) mln m + 12 ln m + β ln 2 + β ln m + ln c1 , for m = 2, 3 , else β ln 2 + β + 12 ln m + ln c − (D − 1) 2 ln (m − 2) (13.39) 3 and for D ≥ 3 ln k + 2 we have for all k, m that ∆1 ≤ 0 ,which implies the proposition (13.6). 2 ∆1 =
Corollary 13.2 It exists a constant c ≤ 68 , such that E |fm (x) − ξ m |k ≤ (k!)1+γ with 1+γ = Proof.
(13.40)
1 cm ln m. β
We have for all m, k ≥ 2
(13.41)
1 3 1 kD (m ln m + 1) ≤ k ln km ln m 3 + β β 2 ln k
1 1+ . m ln m
Thus (13.5) and (13.6) imply the existence of a constant c1 ≤ 17 such that !
1 E |fm (x) − ξ | ≤ (k!) exp c1 k ln km ln m . β m k
1 β
(13.42)
From the Stirling formula (13.19) we know exp (k ln k) ≤ k! exp (k) ≤ (k!)2 . Thus 1
1
1
2
E |fm (x) − ξ m |k ≤ (k!) β exp (k ln k) β c1 m ln m ≤ (k!) β (k!) β c1 m ln m and 2
1 1 1 1 ≤ 4c1 m ln m (2c1 m ln m + 1) ≤ m ln m 2c1 + β β m ln m β
Chapter 14 Consistency

Now we state our first nonparametric consistency result. Here we use only the bound on the second moments of $f_m$ from Theorem 13.1.

Theorem 14.1 If there exists a sufficiently small constant $d$ such that
$$q(n)\ln(q(n)) \le d\ln(n), \qquad (14.1)$$
then under the assumptions Key with $G$, Pol with $c_G$, B, M it holds for the orthogonal polynomial series estimator
$$\hat g(x) = \sum_{m=0}^{q(n)} \hat\beta_m\, p_m(x) \qquad (14.2)$$
with $\hat\beta_m$ from (12.24) that
$$\lim_{D\to\infty}\ \limsup_{n\to\infty}\ \sup_{g\in M_\nu}\ \sup_{\xi^{(n)}\in\mathcal G_n}\ D^{-1}\, q(n)^{2\nu}\, E_g\!\int |\hat g - g|^2\, dG = 0 \qquad (14.3)$$
with
$$\mathcal G_n = \mathcal G_n(g,q) = \left\{ (\xi_1,\dots,\xi_n):\ \sum_{m=0}^{q}\left(\int g\, p_m\, d(G_n-G)\right)^2 \le \frac12\, D\, q^{-2\nu}\right\}. \qquad (14.4)$$
2
Proof. We split the mean integrated squared error (MISE) into a variance term and a bias term:
$$E_g\!\int|\hat g - g|^2\, dG = E_g\!\int|\hat g - E\hat g|^2\, dG + \int|E\hat g - g|^2\, dG. \qquad (14.5)$$
In Lemma 14.2 we will show the consistency of the variance term,
$$\lim_{D\to\infty}\ \limsup_{n\to\infty}\ \sup_{g\in M_\nu}\ \sup_{\xi^{(n)}\in\mathcal G_n}\ D^{-1}\, q(n)^{q(n)}\, E_g\!\int|\hat g - E\hat g|^2\, dG = 0. \qquad (14.6)$$
Note that under $q\ln q \asymp \ln n$ we have a rate of
$$q(n)^{q(n)} = \exp\!\left(q(n)\ln q(n)\right) \asymp \exp(d\ln n) \asymp n^d. \qquad (14.7)$$
In Lemma 14.3 we will derive the rate for the bias term,
$$\lim_{D\to\infty}\ \limsup_{n\to\infty}\ \sup_{g\in M_\nu}\ \sup_{\xi^{(n)}\in\mathcal G_n}\ D^{-1}\, q(n)^{2\nu}\int|E\hat g - g|^2\, dG = 0, \qquad (14.8)$$
which is smaller than the rate of the variance term, because $q(n) \le \ln n$ under (14.1). Both together give the statement. 2

Note that in this model the rate is determined by the bias; the variance term determines the growth of $q(n)$. In nonparametric regression with no errors in the variables the optimal rate is obtained by balancing bias and variance so that they have the same order.
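As a side remark (an illustration of my own, not part of the text): condition (14.1) forces $q(n)$ to grow extremely slowly. The following snippet computes the largest admissible $q$ for an arbitrary illustrative constant $d$:

```python
import numpy as np

def q_of_n(n, d=0.05):
    # largest integer q with q*log(q) <= d*log(n), cf. (14.1); d is an illustrative value
    q = 1
    while (q + 1) * np.log(q + 1) <= d * np.log(n):
        q += 1
    return q

for n in (10 ** 3, 10 ** 6, 10 ** 9, 10 ** 12):
    print(n, q_of_n(n))
```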
−1
2
≥ 2 ln B +
σ12
1 + cG + 48 β
!
+ 1,
and q (n) ln (q (n)) ≤ d ln (n) then under the assumptions Key with G, Pol with cG , B2, M , it holds lim n→∞ lim sup
D→∞
Proof.
g∈Mν
sup D−1 q (n)q(n) Eg ξ (n) ∈G
n
Z
|gb − E gb|2 dG = 0.
Recall (12.27), the Parseval identity gives Z
|gb − E gb|2 dG =
q X
m=0
(n) βbm − βm
2
.
(14.9)
The triangle inequality implies q X
m=0
(n) βbm − βm
2
≤2
q X
m=0
βbm − βem
with βem from (12.20). First we estimate V1 =
q X
m=0
βbm − βem
2
=
q X
m=0
2
+2
q X
m=0
2
.
(14.10)
!2
.
(14.11)
(n) βem − βm
n 1X yi (hm (xi ) − pm (ξi )) n i=1
241 We have that yi (hm (xi ) − pm (ξi ))are independently distributed r.v. with expected value zero, remember (12.23), we get q X
n 1X EV1 = E yi (hm (xi ) − pm (ξi )) n i=1 m=0
=
q n X 1 X
m=0
n2
i=1
!2
E (yi (hm (xi ) − pm (ξi )))2 .
(14.12)
Further yi is independent of (hm (xi ) − pm (ξi )) . Thus from (P ) and Corollary 1 with k = 2 we obtain
2
2
E (yi (hm (xi ) − pm (ξi ))) ≤ g (ξi ) +
≤ B 2 + σ12
2
≤ B +
σ12
exp
σ12
m X
E
l=0
am,l fl (xi ) −
max m (am,l )2 E fl (xi ) − ξil
l=0.,,,.m
!
!
ξil
!2
2
(14.13) (14.14)
1 m ln m ≤ exp (C1 m ln m) cG + c ln 2 β
(14.15)
with C1 ≤ ln (B 2 + σ12 ) + cG + c ln 2 β1 . Hence EV1 ≤
q n X 1 X
m=0
n2
i=1
exp (C1 m ln m) ≤ exp (ln q − ln n + C1 q ln q)
(14.16)
≤ exp (− ln n + 2C1 q ln q) . Now let us estimate the second term of (14.10) V2 =
q X
m=0
(n) βem − βm
2
=
q X
m=0
n 1X (yi − g (ξi )) pm (ξi ) n i=1
!2
.
(14.17)
We have EV2 =
q n X 1 X
m=0
n2
i=1
σ12 pm (ξi )2 ≤
q n X 1 X
m=0
n2
i=1
2
σ12 max m (am,l )2 ξil l=0.,,,.m
.
(14.18)
Using (P ) and |ξi | ≤ 1 it follows EV2 ≤ exp (ln q − ln n + C2 q ln q)
(14.19)
with C2 ≤ (ln (σ12 ) + cG ) ≤ C1 , such that E
Z
|gb − E gb|2 dG ≤ 4 exp (− ln n + 2C1 q ln q) .
(14.20)
242
CHAPTER 14. CONSISTENCY
Note, the constant C1 is independent of g and the ξi . We have sup
−1
sup D q (n)
q(n)
g∈Mν ξ (n) ∈Gn
Eg
Z
|gb − E gb|2 dG ≤ D−1 4 exp (− ln n + (2C1 + 1) q ln q)
q ln q ≤ D−1 4 exp − ln n 1 − (2C1 + 1) ln n
(14.21)
!!
≤ D−1 4 exp (− ln n (1 − (2C1 + 1) d)) (14.22)
For − ln n + (2C1 + 1) q ln q ≤ const
(14.23)
1 − (2C1 + 3) d = δ ∈ (0, 1) ,
(14.24)
we get the proposition . 2 For d in (14.1), such that
(14.22) implies sup
sup q (n)
q(n)
g∈Mν ξ (n) ∈Gn
Eg
Z
|gb − E gb|2 dG ≤ n−δ .
Now we estimate the bias term.
(14.25)
Lemma 14.3 Under the assumptions Key with G, Pol with cG , M it holds lim n→∞ lim sup
D→∞
g∈Mν
sup D−1 q (n)2ν ξ (n) ∈G
n
Z
2 Proof.
|g − E gb|2 dG = 0.
(14.26)
Recall (12.27), the Parseval identity gives Z
with
2
|g − E gb| dG = βm =
Z
q X
m=0
βm −
gpm dG
(n) 2 βm
(n) βm =
Z
∞ X
+
(βm )2 .
(14.27)
m=q+1
gpm dGn .
For ξ (n) ∈ Gn we estimate the first term in (14.27) by q X
m=0
βm −
(n) 2 βm
X Z
q(n)
=
m=0
gpm d (Gn − G)
2
1 ≤ Dq −2ν . 2
(14.28)
We apply the Weierstraß theorem on the second term in (14.27). The approximation theorem of Weierstraß implies, that for each function g ∈ C k+α ([−1, 1]) , it exists a polynomial p (x) of order h + k and constants c1 , such that sup |g (x) − p (x)| ≤ c1
x∈[−1,1]
k
1 h
̟k
1 ≤ ch−(k+α) , h
(14.29)
243
where ̟k h1 denotes the continuity module of g (x) . Further we have the representation for the (h + k) − polynomial with respect to the orthonormal system {p0 , p1 , ....} : p (x) =
h+k X
αl pl (x) .
(14.30)
l=0
Thus for
q (n) = h + k
(14.31)
Z
|p − g|2 dG ≤ ch−2(k+α) . (14.32)
we have ∞ X
m=q(n)+1
q(n) 2 βm ≤
X l=0
(αl − βl )2 +
∞ X
βl2 =
l=q(n)+1
For g ∈ Mν it exists a constant D , such that Dq (n)−2ν ≥ 2c (q (n) − k)−2ν and ∞ X
1 2 βm ≤ Dq (n)−2ν . 2 m=q(n)+1
(14.33)
From (14.33) and (14.28) it follows, that for sufficiently large D and sufficiently large n Z sup sup |g − E gb|2 dG ≤ Dq (n)−2ν . g∈Mν ξ (n) ∈Gn
2
In the following theorem we will give an estimation of the convergence rate of the probability. In difference to Theorem 13.2 we use the bounds of all moments in Theorem 13.1, and we assume the Bernstein condition for the first error terms in regression equation also. Further we require for the q (n) a little bit harder condition. Theorem 14.4 If there exists a constant d sufficiently small, such that q (n) (ln q (n))2 ≤ d ln (n)
(14.34)
then under the assumptions Key with G, Pol with cG , B1, B2, M it holds for the orthogonal series estimator q(n)
gb (x) =
X
m=0
βbm pm (x)
with βbk from (12.24), that there exists a constant c0 > 0, such that for sufficiently large D and for sufficiently large n sup
sup Pg
g∈Mν ξ (n) ∈Gn
Z
|gb − g|2 dG ≥ Dq (n)−2ν ≤ exp −c0 exp
with Gn defined in (14.4) .2
1 d
(14.35)
244
CHAPTER 14. CONSISTENCY
Proof.
From the Parseval identity we get Z
q(n) 2
|gb − g| dG =
X
m=0
βbm − βm
2
∞ X
+
2 βm .
(14.36)
m=q(n)+1
Because of (14.33) it remains to estimate
Pg
q(n)
X
m=0
βbm − βm
2
Remember (12.20), (12.27) ,we have
1 ≥ Dq (n)−2ν . 2
q(n)
X
m=0 q(n)
X
m=0
βb
m
− βe
m
2
q(n)
+
X
m=0
βe
m
−
βbm − βm
(n) 2 βm +
2
≤
q(n)
X
m=0
(14.37)
(n) βm − βm
2
= P1 + P2 + P3 . (14.38)
The last term P3 is not random. It holds (14.28) . We have to show for i = 1, 2 Pg
1 1 Pi ≥ Dq (n)−ν ≤ exp −c0 exp 6 2d
.
(14.39)
This will be done in the following lemmata. 2 Let us state the first one. It gives an estimation for the difference of the estimators q(n)
ge =
X
m=0
βem pm , defined in (12.20),
(14.40)
βbm pm , defined in (12.24),
(14.41)
for g in the usual regression model and q(n)
gb =
X
m=0
in the errors-in-variables model. We have Z
q(n)
(gb − ge)2 dG =
X
m=0
βbm − βem
2
.
(14.42)
Lemma 14.5 If there exists a constant d sufficiently small, such that (14.34) holds, then under the assumptions Key with G, Pol with cG , B1, B2, M there exists a constant c0 > 0 , such that for sufficiently large n and arbitrary h
sup
g∈Mν
2
sup Pg
ξ (n) ∈Gn
q(n)
X
m=0
βb
m
− βe
m
2
≥ q (n)
−h
≤ exp −c0 exp
1 d
(14.43)
245 Proof. The difference of βbm −βem is a sum of independently distributed random variables with expected value zero: βbm − βem =
n 1X yi (pm (ξi ) − hm (xi )) . n i=1
(14.44)
The kth cumulant of βbm − βem is χk
βb
m
− βe
m
n 1 X = k χk (yi (pm (ξi ) − hmk (xi ))) . n i=1
(14.45)
Now we consider the moments. Because of the independence of yi and xi we have E |yi (pm (ξi ) − hm (xi ))|k ≤ E |yi |k E |pm (ξi ) − hm (xi )|k .
(14.46)
From (B1 ) it follows k
k
E |yi | ≤ E (|g (ξi )| + |ε1i |) ≤
k X l=0
k l
1
!
|g (ξi )|k−l µ1l
(14.47)
1
≤ (k!) β (|g (ξi )| + ∆)k ≤ (k!) β (B + ∆)k .
(14.48)
From (12.22) we get m X k l E |pm (ξi ) − hm (xi )| = E aml ξi − fl (xi ) k
(14.49)
l=0
k
≤ max |maml |k max E ξil − fl (xi ) . l=0,...,m
l=0,...,m
Under Pol there exists a constant c1 , such that max |maml |k ≤ exp
l=0,...,m
1 k cG m ln m + (k − 1) ln m 2
(14.50)
(14.51)
≤ exp (c1 km ln m) Applying Corollary 1 for all k we obtain for some constant c > c1 E |pm (ξi ) − hm (xi )|k ≤ (k!)1+γ1 . with 1 + γ1 =
1 c (m ln m) . β
(14.52)
(14.53)
Put (14.48) and (14.52) in (14.46) we get E |yi (pm (ξi ) − hm (xi ))|k ≤ (k!)1+γ (H)k
(14.54)
246
CHAPTER 14. CONSISTENCY
with 1+γ =
2 c (m ln m) β
(14.55)
and H = B + ∆.
(14.56)
Because of the link between moments and cumulants, we have |χk (yi (pm (ξi ) − hm (xi )))| ≤ (k!)1+γ (2H)k
(14.57)
and n 1 X (k!)1+γ (2H)k ≤ χk βbm − βem ≤ k
n
i=1
1 nk−1
(k!)1+γ (2H)k ≤
1 nk−1
(k!)1+γ . (14.58)
with
3 c (m ln m) (14.59) β The result follows from the Lemma 16.10 in the Appendix. 2 The following lemma gives an uniform consistency result for the regression estimator ge in the usual model respects to its expected value. We have 1+γ =
Z
q(n)
X
(ge − E ge)2 dG =
m=0
(n) βem − βm
2
q(n)
with E ge =
X
(n) βm pm .
(14.60)
m=0
Lemma 14.6 If it exists a constant d sufficiently small, such that (14.34), then under the assumptions Key with G, Pol with cG , B2, M there exists a constant c0 > 0, such that for arbitrary h and for sufficiently large n sup
sup Pg
g∈Mν ξ (n) ∈Gn
2
q(n)
X
m=0
βe
m
−
(n) 2 βm
≥ q (n)
−h
≤ exp −c0 exp
1 d
.
(14.61)
Proof. The proof goes along the lines of the proof of Lemma 14.5. (n) The difference of βem − βm is a sum of independently distributed random variables with expected value zero: (n) βem − βm =
n n 1X 1X (yi − g (ξi )) pm (ξi ) = ε1i pm (ξi ) n i=1 n i=1
(n) The kth cumulant of βem − βm is
n 1 X (n) χk βem − βm = k χk (ε1i ) pkm (ξi ) n i=1
(14.62)
(14.63)
247 We have
(n) ≤ χk βem − βm
1
1
i nk−1 Under Pol we can estimate the polynomials by (14.51)
(k!) β (2∆)k max pkm (ξi ) .
(14.64)
max pkm (ξi ) ≤ exp ( c1 km ln m) i
and we obtain
(n) ≤ χk βem − βm
with
1+γ =
1
nk−1
(k!)1+γ
(14.65)
1 + c1 (m ln m) . β
The result follows from the Lemma 16.10 in the Appendix. 2 Corollary 14.7 Under q ln q ≤ d ln n,Pol with cG and Des with δ ≥ 2dcG the propositions in Theorem 14.1 and Theorem 14.4 hold uniformly for ξ (n) ∈ Dn with δ, where Dn is defined in (12.10). 2 Proof. It remains to show, that for all g ∈ Mν Dn ⊆ Gn . Under the assumptions of the corollary we have for ξ (n) ∈ Dn Z
gpm d (Gn − G)
2
≤
≤ exp (cG q ln q) max
l=0,..,q
≤ exp (cG q ln q) sup
f ∈M
Z
Z
g
Z
m X l=0
l
aml ξ d (Gn − G)
g ξ l d (Gn − G)
f d (Gn − G)
!2
!2
(14.66)
2
≤ exp (cG q ln q) n−δ
Hence X Z
q(n)
m=0
gpm d (Gn − G)
2
!!
cG q ln q (v + 1) ln q ≤ q exp − ln n δ − − ln n ln n (14.67) 1 ≤ Dq (n)−2ν . 2 −ν
and ξ (n) ∈ Gn . 2 In the following corollary we choose a special zero sequence for d and obtain a estimation for the probability in Theorem 14.4, which is sufficient for a result with probability 1.
248
CHAPTER 14. CONSISTENCY
Corollary 14.8 Under the assumptions of Theorem 14.4 with d−1 = ln ln n + ln (kco )
(14.68)
it exists a constant c, such that lim sup sup
sup q (n)
2ν
n→∞ g∈Mν ξ (n) ∈Gn
Proof.
Z
|gb − g|2 dG ≤ c
a.s.
(14.69)
For d defined in (14.68) we have
exp −c0 exp
1 d
≤ n−k .
(14.70)
The statement follows by the Lemma of Borel Cantelli for k ≥ 2. For all ε > 0 it exists a D ≥ (c + ε) , such that ∞ X
n=1
P
sup sup q (n) g∈M ξ (n) ∈Gn
2 Remark: For (14.68) is fulfilled. 2.
2ν
Z
2
!
|gb − g| dG − c > ε ≤
q (n) = (ln n)1−δ , δ ∈ (0, 1)
∞ X
n=1
n−2 < ∞.
(14.71)
(14.72)
Chapter 15 Rate of Convergence

We derive a lower bound for the rate of convergence in nonparametric functional models. We consider the same model as in the above chapter. In contrast to the condition Pol on the orthonormal system of polynomials in (12.13), we require P2, which is similar but not comparable to Pol. This condition is fulfilled for Jacobi polynomials with parameters $\alpha, \beta$, $\max(\alpha,\beta) \le 2\nu$, compare Szegő (1959), [168]. Recall that $\nu$ denotes the smoothness of the model (12.6).

15.1 Lower bound

P2 There exists a constant const, independent of $m$, such that
$$\max_{\xi\in[-1,1]}\left| m^{-2\nu}\, p_m^{(l)}(\xi)\right| \le \mathrm{const} \quad\text{for } l = 0,\dots,k+1. \qquad (15.1)$$
For technical reasons we show the lower bound not for all estimators. Let us introduce the class C of estimators
C=
ge, E βem =
Z
ge =
∞ X
m=0
βem pm
gpm dGn ∀ m = 1, ...M,
For all M (n) the estimator
is included in C, because Eg,ξ(n)
ge =
M (n)
X
m=0
(15.2) βe
m
= 0 ∀ m > M, M = 0, ...
(15.3)
1X yi pm (ξi ) pm n
1X 1X yi pm (ξi ) = g (ξi ) pm (ξi ) . n n 249
(15.4)
250
CHAPTER 15. RATE OF CONVERGENCE 1
For M = q (n) it is the estimator ge defined in (14.40), for M (n) = n− 2ν+1 the estimator, which achieves the optimal rate in the nonparametric regression model with known design points ξi , i = 1, ...n. Because of (12.26) our estimator gb is included in C also.
Theorem 15.1 Under M,P2 and q ≤ n, n sufficiently large, there exists a positive constant d0 > 0 such that inf sup
e g ∈C
g∈Mν
sup Eg,ξ(n)
ξ (n) ∈G
n
Z
with Gn = G (g, q) defined in (14.4). 2 Proof.
(15.5)
Remember (14.5), we estimate the MISE by the bias term:
inf sup
e g ∈C
(ge − g)2 dG ≥ d0 q −2ν
g∈Mν
sup ξ (n) ∈G(g,q)
Eg,ξ(n)
Z
(ge − g)2 dG ≥ inf sup e g ∈C
The Parseval identity gives ≥ inf sup
sup
M X
e g ∈C g∈Mν ξ (n) ∈G(g,q) m=0
g∈Mν
E βem − βm
2
sup ξ (n) ∈G(g,q)
+
∞ X
Z
(E ge − g)2 dG
(15.6)
2 βm .
(15.7)
m=M +1
Now we distinguish two cases ge ∈ C1 for M < q and ge ∈ C2 for M ≥ q. First we regard M < q. That is the case, where normally the bias is large because of the bad approximation by the polynomial of low order M . We estimate (15.7) by ≥ inf sup
∞ X
sup
e g ∈C1 g∈Mν ξ (n) ∈G(g,q) m=M +1
2 βm ≥ inf sup
∞ X
e g ∈C1 g∈Mν m=M +1
2 βm .
(15.8)
Let us consider a worse g ∈ Mν . We take a positive constant d depending on B and set g = dq −2ν pM +1 .
(l)
(l)
Because of P2 we have g (l) ≤ dq −2ν pM +1 ≤ d (M + 1)−2ν pM +1 ≤ B and R R g ∈ Mν . Further we know βM +1 = gpM +1 dG = dq −2ν p2M +1 dG = dq −2ν and βm = 0 for all m 6= M + 1. Hence we can estimate (15.8) by −2ν 2 . ≥ inf βM +1 ≥ dq
e g ∈C1
Let us come back to (15.7) and regard the case ge ∈ C2 , M ≥ q. This is the case, where normally the variance term is the leading one for the convergence rate.
251
15.1. LOWER BOUND
Here the volume of the neighborhood of the nuisance parameters determines the rate. We estimate (15.7) by inf sup
M X
sup
e g ∈C2 g∈Mν ξ (n) ∈G(g,q) m=0
E βem − βm
2
≥ sup
sup
q X
g∈Mν ξ (n) ∈G(g,q) m=0
E βem − βm
2
. (15.9)
Now we choose a special function g ∈ Mν and a special design ξ (n) ∈ G (g, q) . Let g be a function which Ris orthogonal to all {1, p1 , ..., pq } , this means βm = 0 for all m = 0, ..., q. Then gdG = 0 and it exists a point ξ 0 , such that g (ξ 0 ) = 0. Further let be 4Dq −2ν g (0)2 = Pq (15.10) 2. m=0 pm (0)
We have p0 ≡ 1 and
Pq
m=0
pm (0)2 ≥ 1. It holds
4Dq −2ν −2ν g (0) = Pq ≤ B2. 2 ≤ 4Dq m=0 pm (0) 2
Thus (15.9) has the lower bound ≥
q X
m=0
E βe
m
2
=
q X
m=0
n 1X g (ξi ) pm (ξi ) n i=1
!2
.
Further we take the design in G (g, q) with the two different points ξ 0 and 0, both are repeated n2 times and obtain q X
m=0
E βe
m
2
=
q X 1
m=0
2
g ξ
0
pm ξ
0
1 + g (0) pm (0) 2
2
q X 1 pm (0)2 = Dq −2ν . = g (0)2 4 m=0
Take the constant d0 = min (d, D) and the proof is completed. 2 If we compare the results of Theorem 14.1 and Theorem 15.1, then we can sum up it to the following proposition. Proposition: Under M, B2, Key, P2 and for sufficiently small d ,with q ln q < d ln n for all g ∈ Mν and all ξ (n) ∈ G (g, q) the estimator gb , defined in (12.19) and (12.24) achieves the optimal rate of convergence. 2.
15.1.1 Discussion of the rate

The rate of convergence in the nonparametric functional error-in-variables model is very slow. If we choose $q(n) = (\ln n)^{1-\delta}$ as in (14.72), we obtain the optimal rate
$$(\ln n)^{-(1-\delta)2\nu}, \qquad (15.11)$$
where $\delta\in(0,1)$ can be chosen small. This is much slower than in the nonparametric regression model with no error in the variables, where the optimal rate is
$$n^{-\frac{2\nu}{2\nu+1}}. \qquad (15.12)$$
The rate in our model is essentially determined by the volume of the nuisance parameter set $\mathcal G(g,q)$; in the nonparametric regression model there are no nuisance parameters. As discussed in Chapter 11, already in the parametric approach the errors-in-variables models are not adaptive: the best bound in the Hajek-type inequality in the functional model is higher than that of the regression model. In the nonparametric case we see this difference in the different rates of convergence.

In the structural case, $\xi_i \sim G$ i.i.d., Fan (1991), [53], derived lower bounds of convergence. They depend mainly on the type of error distribution of $\varepsilon_{21}$. For a ''super smooth'' error distribution with smoothness parameter $\beta$ the rate is
$$(\ln n)^{-\frac{2\nu}{\beta}}. \qquad (15.13)$$
For $\varepsilon_{21}\sim N(0,1)$ we have $\beta = 2$ and the rate
$$(\ln n)^{-\nu}, \qquad (15.14)$$
which is slower than (15.11). In the case of ''ordinary smooth'' distributions with respective parameter $\beta$ the rate is
$$n^{-\frac{2\nu}{2\nu+2\beta+1}}, \qquad (15.15)$$
much better than (15.11), but slower than (15.12). The conditions ''super smooth'' and ''ordinary smooth'' are not comparable with the condition B2: B2 corresponds to a condition on the behavior of the characteristic function of $\varepsilon_{21}$ near zero, compare (12.5), whereas the conditions in the paper by Fan (1991), [53], describe the tail behavior of the characteristic function. It can be seen that the Fourier series estimators attain the optimal consistency rates given for the nonparametric structural model by Fan (1991), [54]; the proof works, but writing it down is still an ''open problem''.
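For concreteness, the following snippet (an illustration with arbitrary values of $\nu$, $\delta$ and the smoothness parameter $\beta$, not taken from the text) evaluates the four rates side by side:

```python
import numpy as np

nu, delta, beta_ord = 2.0, 0.1, 1.0
for n in (1e4, 1e6, 1e8):
    ln = np.log(n)
    print(int(n),
          "functional (15.11):", ln ** (-(1 - delta) * 2 * nu),
          "structural super-smooth (15.14):", ln ** (-nu),
          "structural ordinary-smooth (15.15):", n ** (-2 * nu / (2 * nu + 2 * beta_ord + 1)),
          "regression (15.12):", n ** (-2 * nu / (2 * nu + 1)))
```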
Chapter 16 Appendix

We say a r.v. $x$ with $Ex = 0$ and $k$-th cumulant $\chi_k(x)$ satisfies the Statulevicius condition S with the constants $\gamma\ge 0$, $C_s$ and $H_s$ iff

S $\quad |\chi_k(x)| \le (k!)^{1+\gamma}\, C_s^{k-2}\, H_s, \qquad k = 3, 4, \dots$

Further, we say a r.v. $x$ with $Ex = 0$ satisfies the Linnik condition L with the constants $\gamma\ge 0$ and $C_L = C_L(\gamma)$ iff

L $\quad E\exp\!\left(|x|^{\frac{1}{1+2\gamma}}\right) \le C_L < \infty.$

Furthermore, we say a r.v. $x$ with $Ex = 0$ satisfies the Bernstein condition B with the constants $\gamma\ge 0$, $C_B$ and $H_B$ iff

B $\quad \left|Ex^k\right| \le (k!)^{1+\gamma}\, C_B^{k-2}\, H_B, \qquad k = 3, 4, \dots$
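The lemmas below repeatedly pass between moments and cumulants. As a small self-contained illustration (my own example, using the centered exponential distribution, whose cumulants are $\chi_k = (k-1)!$), the following sketch computes cumulants from raw moments by the standard recursion and checks a Statulevicius-type bound with $\gamma = 0$, $C_s = H_s = 1$:

```python
from math import comb, factorial

# raw moments of the centered exponential x = e - 1, e ~ Exp(1):  E e^j = j!
m = [sum(comb(k, j) * factorial(j) * (-1) ** (k - j) for j in range(k + 1)) for k in range(11)]

# cumulants from moments: chi_n = m_n - sum_{j=1}^{n-1} C(n-1, j-1) chi_j m_{n-j}
chi = [0.0, m[1]]
for n in range(2, 11):
    chi.append(m[n] - sum(comb(n - 1, j - 1) * chi[j] * m[n - j] for j in range(1, n)))

for k in range(3, 11):   # check |chi_k| <= (k!)^{1+0} * 1^{k-2} * 1
    print(k, chi[k], abs(chi[k]) <= factorial(k))
```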
Lemma 16.1 ”B =⇒ S ”: Let x r.v. with Ex = 0, V ar (x) = σ 2 > 0 satisfying the Bernstein condition B, with the constants γ ≥ 0 and CB and HB = σ 2 , that is 1+γ CBk−2 σ 2 , k = 3, 4, .... Exk ≤ (k!) Then x satisfies the Statulevicius condition S, with the same constant γ ≥ 0 and with Cs = 2 max (CB , σ) , Hs = σ 2 |χk (x)| ≤ (k!)1+γ Csk−2 σ 2 ,
k = 3, 4, ....,
2 Quoted from Saulis and Statulevicius (1991), [142], Lemma 3.1. The next result is further version of ”B =⇒ S ”. 253
254
CHAPTER 16. APPENDIX
Lemma 16.2 ”B =⇒ S ”:Let x r.v. with Ex = 0, V ar (x) = 1, satisfying the Bernstein condition B, with the constants γ ≥ 0 and CB ≥ 1 and HB ≥ 1, that is 1+γ CBk−2 HB , k = 3, 4, .... Exk ≤ (k!)
Then x satisfies the Statulevicius condition S, with the same constant γ ≥ 0 and with Cs = CB HB and Hs = ln 2 HB2 |χk (x)| ≤ (k!)1+γ Csk−2 Hs ,
k = 3, 4, ....
2 Proof. The proof goes along the line of that of Lemma 3.1 in Saulis and Statulevicius (1991), [142]. Let M (z) = E exp (z x) the moment generating function of the r. v. x and K (z) = ln M (z) the cumulant generating function. The relation between cumulants χk (x) and moments mk = Exk is given by χk (x) = "
with
∞ dk X 1 = k (−1)s+1 (M (z))s dz s=1 s
#
dk [K (z)]z=0 dz k
z=0
"
∞ 1 dk X (−1)s+1 (Mk (z))s = k dz s=1 s
k X 1
Mk (z) =
r=2
Under B for small δ < 1 and
r!
|Mr (z)| ≤
Pk
1 r=2 r!
r
|mr z | ≤ ≤
1
z2 2
+
Pk
Pk
r=2
r=2
δr ≤
z=0
mr z r .
|z| ≤ δ (k!)γ CBk−2 HBk follows
#
− 1
k
1
(r!) r
1 (k!) k
γr
−r k−2 k
CBr−2 HB δ r CB
HB−r
δ2 . 1−δ
Note (k!) k increases in k. Further we have for sufficiently small δ ! ∞ X 1 δ 2 s+1 s (−1) (Mk (z)) ≤ ln 1 − ≤ ln 2. 1−δ s=1 s
From the Cauchy formula follows "
∞ 1 dk X (−1)s+1 (Mk (z))s |χk (x)| = k dz s=1 s
2
#
z=0
≤ k! ln 2 (k!)γ CBk−2 HBk .
255 Lemma 16.3 ”S =⇒ B ”: Let x r.v. with Ex = 0, V ar (x) = 1 satisfying the Statulevicius condition S, with the constants γ ≥ 0 and Cs ≥ 1, Hs ≥ 1 |χk (x)| ≤ (k!)1+γ Hs Csk−2 ,
k = 3, 4, ....
Then x satisfies the Bernstein condition B, with constants γ ≥ 0 and CB = Cs Hs and HB = 4Hs2 , that is 1+γ Exk ≤ (k!) HB CBk−2 ,
2
k = 3, 4, ....
Proof. The proof follows the line of the proof above of Lemma 16.2. The relation between cumulants χk (x) and moments Exk Exk = k!
k X 1
s=1
s! i
X
P
1 ,..,is ,
is derived from
s Y γij (x)
"
∞ dk dk X 1 Ex = k [M (z)]z=0 = k (K (z))s dz dz s=1 s! k
with g (z) =
k X 1
r=2 r!
ij !
ij =k j=1
#
"
∞ dk X 1 = k (g (z))s dz s=1 s!
z=0
#
z=0
χr (x) z r ,
compare for instance also Bentkus and Rudskis (1980), [13], from formula (2,5) up. Under S for small δ < 1 and
|z| ≤ δ (k!)γ Csk−2 Hsk follows on the same way as above |g (z)| ≤
k X 1
r=2
r!
|γr (x) z r | ≤
− 1
k X
r=2
k
δr ≤
δ2 . 1−δ
Hence for sufficiently small δ ∞ X 1
δ2 |g (z)| ≤ exp 1−δ s=1 s! s
!
≤ 4.
From the Cauchy formula follows # "∞ k X 1 d s (g (z)) Exk = k
dz
2
s=1
s!
z=0
≤ 4k! (k!)γ Csk−2 Hsk .
256
CHAPTER 16. APPENDIX
Lemma 16.4 ”L =⇒ B ”: Let x r.v. with Ex = 0, V ar (x) = σ 2 > 0 satisfying the Linnik condition L, with constants γ ≥ 0 and CL , then x fulfills the Bernstein condition B, with the constants γ ≥ 0 and CB = 2CL e1+γ (1 + γ)3(1+γ) and HB = 1, that is 1+γ CBk−2 , k = 3, 4, .... Exk ≤ (k!)
2
Quoted from Saulis and Statulevicius (1991), [142], within the proof of Theorem 3.3. Lemma 16.5 ” S for x =⇒ S for x2 ”: Let x r.v. with Ex = 0, V ar (x) = 1 > 0 satisfying the Statulevicius condition S, with the constants γ ≥ 0 and Cs and Hs , then the x2 fulfills the Statulevicius condition S, with the constants γ2 = 2γ +1 and √ 1+γ C2s = 2πe Cs4 Hs6 28+3γ and
H2s = ln 2 (2πe)1+γ Cs4 Hs8 212+4γ ,
that is
2+2γ k−2 C2s H2s , χk (x2 ) ≤ (k!)
2 Proof.
k = 3, 4, ....
Because of Lemma 16.3, x satisfies the Bernstein condition 1+γ Exk ≤ (k!) (Cs Hs )k−2 4Hs2 .
Using (E |x|)2 ≤ Ex2 and assuming with out loss of generality E |x|2k > 1
k
∀i ∀k = 3, ... E |x2 − 1| ≤ 2k−1 E |x|2k + 1 ≤ 2k E |x|2k ≤ 2k ((2k)!)1+γ (Cs Hs )2k−2 (4Hs2 ) . We estimate with help of the Stirling formula, ln n exp n ln n − n + 2
!
≤ n! ≤
√
!
ln n , 2πe exp n ln n − n + 2
that (2k)! ≤
√
s
√ 2 1 2πe exp 2k (ln 2k) − 2k + ln 2k ≤ (k!)2 2πe22k 2 k
This means x2 − 1 fulfills the Bernstein condition
k
E x2 − 1 ≤ (k!)1+(2γ+1) (CB′ )
k−2
′
HB
257 with CB′ = 22+γ Cs2 Hs2 and HB′ =
√
2πe
1+γ
Cs2 Hs4 26+2γ .
Using once more the relation between Bernstein and Statulevicius condition, we obtain the result from Lemma16.2. 2 The next result concerns the inverse direction. Lemma 16.6 ” S for x2 =⇒ S for x ”: Let x r.v. with Ex = 0, V ar (x) = 1 > 0 and x2 satisfies the Statulevicius condition S, with the constants γ ′ ≥ 0 and C2s and H2s , that is 1+γ ′ k−2 C2s H2s , χk (x2 ) ≤ (k!)
k = 3, 4, ....
then x fulfills the Statulevicius condition S, with the constants γ = max q
γ ′ −1 ,0 2
Cs = 2H2S C2s H2s and 2 Hs = 4 ln 2 H2S ,
that is |χk (x)| ≤ (k!)1+γ Csk−2 Hs ,
k = 3, 4, ....
2 Proof. From Lemma 16.3 follows x2 satisfies the Bernstein condition B, with 2 constants γ ′ ≥ 0 and C2B = C2s H2s and H2B = 4H2s ,
We know
1+γ ′ k−2 H2B C2B , Ex2k ≤ (k!)
k = 3, 4, ....
q q k−2 1+γ ′ H2B C2B2 ≤ (k!)1+γ CBk−2 HB . Exk ≤ |Ex2k | ≤ (k!) 2
′
Thus x satisfies the Bernstein condition with γ = max γ 2−1 , 0 and HB = 2H2S , √ CB = C2s H2s . Now from Lemma 16.2 we obtain the result. 2 Lemma 16.7 ” S for x, S for y =⇒ S for xy ”: Let x, y independent r.v. with Ex = Ey = 0, V ar (x) = V ar (y) = 1 > 0 and x satisfies the Statulevicius condition S, with the constants γx′ ≥ 0 and Cs(x) and Hs(x) , and y satisfies the Statulevicius condition S, with the constants γy′ ≥ 0 and Cs(y) and Hs(y) ,
and
258
CHAPTER 16. APPENDIX
then xy fulfills the Statulevicius condition S, with the constants γ = 1+γx′ +γy′ and 3 Cs = 16Cs(x) Cs(y) Hs(x) Hs(y) and
Hs = 44 ln 2 Hs(y) Hs(x) that is
|χk (xy)| ≤ (k!)1+γ Csk−2 Hs ,
4
,
k = 3, 4, ....
2 Proof. From Lemma 16.3 follows x, y satisfy the Bernstein condition B, with (y) (x) (x) constants γx′ ≥ 0, γy′ ≥ 0 and CB = Cs(x) Hs(x) and HB = 4Hs(x)2 , CB = Cs(y) Hs(y) (y) and HB = 4Hs(y)2 .Because of the independence w have (x) (y) k−2 (y) (x) k k 2+γx′ +γy′ k E (xy) ≤ E (x) E (y) ≤ (k!) CB CB HB HB .
(y)
(x)
Thus xy satisfies the Bernstein condition with γ = 1+γx′ +γy′ and HB = HB HB , (x) (y) CB = CB CB . Now from Lemma 16.2 we obtain the result. 2 On the same way we can also give an estimation for the cumulants of the product of two independent random values x and y. Lemma 16.8 Let x r.v. with Ex = 0, V ar (x) = σ 2 > 0 satisfying the Statulevicius condition S, with the constants γ ≥ 0 and Cs and Hs = 2−(1+γ) H , then P (x > τ )
≤
2
τ exp − 4H
exp
− 14
τ Cs
1 1+γ
1
for τ ≤ (H 1+γ Cs−1 ) 1+2γ 1
for τ ≥ (H 1+γ Cs−1 ) 1+2γ
.
Quoted from Bentkus and Rudskis (1980), [13], Corollary 2.1. Compare also Saulis and Statulevicius (1991), [142], Lemma 2.4. Let us quote here one more result from the paper by Bentkus and Rudskis (1980), [13] namely their corollary 1.1. Lemma 16.9 Let x r.v. with Ex = 0, V ar (x) = σ 2 > 0 satisfying the Statulevicius condition S, with the constants γ = 0 and Cs and Hs = 2−1 H , then !
−x2 . P (x ≥ 0) ≤ exp 2 (HS + xCS ) In the nonparametric part we used the following application of Lemma 16.8.
Lemma 16.10 Let $X_m$ be random variables (not necessarily independent) with expected value zero, whose cumulants all exist and fulfill, for some constant $c > 0$,
$$1+\gamma \le c\, m\ln m, \qquad (16.1)$$
$$|\chi_k(X_m)| \le \frac{1}{n^{k-1}}\,(k!)^{1+\gamma}, \qquad (16.2)$$
and let
$$\frac{q(\ln q)^2}{\ln n} \le d, \qquad d \text{ sufficiently small}. \qquad (16.3)$$
Then there exists a positive constant $c_0$ such that for sufficiently large $n$ and arbitrary $h$
$$P\!\left(\sum_{m=0}^{q(n)} (X_m)^2 \ge q(n)^{-h}\right) \le \exp\!\left(-c_0\exp\frac1d\right). \qquad (16.4)$$
2
Proof. P
q X
We have 2
m=0
(Xm ) ≥ q
−h
!
≤
q X
m=0
P (Xm )2 ≥ q −h−1 ≤
q X
m=0
P ±Xm ≥
q
q −h−1 .
(16.5) We apply the second case in the statement of Lemma 16.8. Let us quote it: If 1+γ |Γk (ζ)| ≤ k!2 (∆)−k+2 H for all k ≥ 2, then for x1+2γ ≥ (H 1+γ ∆) it holds
1 1 P (±ζ ≥ x) ≤ exp − (x∆) 1+γ . 4
(16.6)
In our case we have CS =
1 , n
H=
1 1+γ 2 , n
x = q−
1 + γ = cq ln q,
h+1 2
(16.7)
and under (16.3) for some d0 , with d0 ln n > cq ln q and arbitrary h
H 1+γ CS−1
1 1+2γ
1
≤ 21+γ n− 3 ≤ exp cq ln q ln 2 −
h+1 1 ln n ≤ n−d0 ≤ q − 2 . (16.8) 3
Further for sufficiently large n it holds (x∆)
1 1+γ
≥ q
− h+1 2
n
1 d0 ln n
Hence q X
m=0
P ±Xm ≥ q
− h+1 2
≥ exp ln n
1−
h+1 d 2q 0
d0 ln n
≥ exp
1 2d0
1 1 ln q . ≤ exp − exp − 2d0 4 exp 1 2d0
(16.9)
(16.10)
260
CHAPTER 16. APPENDIX
Under (16.3) we have for 2d0 = (ln q)−1 d 1 ln q ≥ c0 > 0 − 4 exp 1 2d0 and we estimate (16.10) by
1 ≤ exp −c0 exp 2d0 then follows (16.4) . 2
ln q ≤ exp −c0 exp d
!!
≤ exp −c0 exp
1 d
,
Index A
Lipschitz condition L1, 63 Lipschitz condition L2, 105 Lipschitz condition QL, 191 Lipshitz condition on derivatives QDiffL, 198 Lyapunov conditionQLap, 197 moment assumption QM, 191 moment assumption QM0, 195 moment condition M0, 90 moment condition on derivatives QDiffL, 198 nonparametric model Mod, 229 normal error distribution Nor, 205 orthogonal polynomials Pol, 230 polynomial assumption P2, 249 positive definiteness Pos’, 221 positive definitness Pos, 163 QCov1, 197 QCov2, 198 regularity assumption Reg, 215 regularity condition G, 205 smoothness condition Diff, 162 smoothness condition QDiff, 197 smoothness condition Smoo, 208 Statulevicius condition S1, 71 Statulevicius condition S1’, 90 Statulevicius condition S2, 72 Statulevicius condition S2’, 90 technical assumption Tech, 165 universal boundness Bound, 167
assumption assumption interior point Int, 197 asymptotic design :Key, 229 Bernstein condition B1, 228 Bernstein condition B2, 228 Bernstein moment condition B, 233 condition of bounded variances V, 72 condition of bounded variances V’, 104 condition of vanishing variances Var, 90 condition of vanishing variances Var’, 91 consistency assumption Consist, 162 contrast assumption Con’, 186 contrast condition Con, 66 design assumption Des, 229 Diffzero, 163 ExFour, 202 Gamma error distribution Gam, 208 Interior, 162 key existence assumption Ex, 200 knowledge of exponential moments ExpMom, 204 knowledge of the moments Mom, 203 Laplace error distribution La, 205 Lipschitz assumption QL0, 195 Lipschitz condition H1, 72 Lipschitz condition H2, 72
C contrast, 43 contrast function, 43 covering number, 104 covering set, 71 261
D deconvolution equation, 187, 231 asymmetrical, 187 degree of smoothness, 229
E entropy, 71, 104 covering number, 71 estimating function, 39 information of an estimating function, 40 estimator approximate corrected L1-norm, 189 approximate corrected least squares, 189 approximate corrected minimum contrast, 189 conditional maximum likelihood estimator, 51 corrected L1-norm, 188 corrected least squares, 188 corrected least squares estimator, 201 corrected minimum contrast, 188 least squares estimator, 55 least squares projection, 57 maximum likelihood estimator, 48 nonlinear least squares estimator, 57 orthogonal series estimator, 230 exponential family ξ−exponential family , 42
L least squares least squares projection, 57 likelihood function, 47 conditional likelihood function, 51 maximum relative likelihood function, 48 profile likelihood function, 48 linear model
simple linear functional relation, 41
M measurement error, 6 model transformed regression model, 201
S score function β−score function, 50 ξi −score function, 50 conditional score function, 52 sufficient statistic, 51 sum of squares projected sum of squares, 57
W weights, 55 optimal weights, 170
Bibliography [1] K. S. Alexander. Probability inequalities for empirical processes and a law of the iterated logarithm. Ann. Probab., 12(4):1041–1067, 1984. [2] S. Amari and M. Kawanabe. Estimating functions in semiparametric statistical models. manuscript, Frontier research program, RIKEN, 1996. [3] S. Amari and M. Kawanabe. Information geometry of estimating functions in semiparametric statistical models. manuscript, Frontier research program, RIKEN, 1996. [4] S. Amari and M. Kumon. Estimation in the presence of infinitely many nuisance parameters - geometry of estimating functions. Ann. Statist., 16(3):1044–1068, 1988. [5] Y. Amemiya. Instrumental variable estimator for the nonlinear errors-invariables model. J.Econom., 28:273–289, 1985. [6] Y. Amemiya. Instrumental variable estimation of the nonlinear measurement error model. Contemp. Math., 112:147–156, 1990. [7] Y. Amemiya. The two-stage instrumental variable estimator for the nonlinear errors-in-variables model. J.Econom., 44:311–332, 1990. [8] Y. Amemiya and W. A. Fuller. Estimation for the nonlinear functional relationship. Ann. Statist., 16(1):147–160, 1988. [9] E. B. Andersen. Asymptotic properties of conditional maximum-likelihood estimators. J.R.Statist. Soc. B, 32:283–301, 1970. [10] T. W. Anderson. Estimating linear relationships. Ann. Statist., 12(1):1–45, 1984. [11] Ben Armstrong. Measurement error in the generalized linear model. Commun.Statist.-Simula.Computa., 14(3):529–544, 1985. [12] Debabrata Basu. On the elemination of nuisance parameters. J. Amer. Statist. Assoc., 72(358):355–366, 1977. 263
[13] R. Bentkus and R. Rudskis. On exponential estimates of the distribution of random variables. Litov. Mat. Sbornik, 20(1):15–30, 1980. [14] J. Bhanja and J. K. Gosh. Efficient estimation with many nuisance parameters I. Sankhya, A, 54(1):1–39, 1992. [15] J. Bhanja and J. K. Gosh. Efficient estimation with many nuisance parameters II. Sankhya, A, 54(2):135–156, 1992. [16] J. Bhanja and J. K. Gosh. Efficient estimation with many nuisance parameters III. Sankhya, A, 54(3):1297–308, 1992. [17] R. N. Bhattacharya and R. Ranga Rao. Normal Approximation and Asymptotic Expansions. Wiley, New York, 1976. [18] P. J. Bickel and Y. Ritov. Efficient estimation in the errors in variables model. Ann. Statist., 15:513–540, 1987. [19] P.J. Bickel, C. A. J. Klaassen, Y. Ritov, and J. A. Wellner. Efficient and adaptive estimation for semi-parametric models. Johns Hopkins, 1993. [20] L. Birge and P. Massart. Rates of convergence for minimum contrast estimators. Probab. Theory Relat. Fields, 77:115–150, 1993. [21] L. Birge and P. Massart. Minimum contrast estimators on sieves. Technical Report 94.34, Universitede Paris-Sud, June 1994. [22] P. T. Boggs, H. R. Byrd, and R. B. Schnabel. A stable and efficient algorithm for nonlinear orthogonal distance regression. SIAM J. Sci. Stat. Comput., 8(6):1052–1078, 1987. [23] P. T. Boggs and J. E. Rogers. Orthogonal distance estimators. Contemp. Math., 112:183–194, 1990. [24] D. R. Brillinger. A Festschrift for Erich L. Lehmann, chapter The generalized linear model with ”Gaussian2 regressor variables, pages 97–113. Wadsworth International Group, Belmont, Calfornia, 1996. [25] H. I. Britt and R. H. Luecke. The estimation of parameters in nonlinear implicite models. Technometrics, 15(2):233–247, 1973. [26] J. S. Buzas and L. A. Stefanski. A note on corrected-score mean. Statist. Prob. Lett., 28:1–8, 1996. [27] R. J. Carroll and P. Hall. Optimal rates of convergence for deconvoluting a density. J. Amer. Statist. Assoc., 83(404):1184–1186, 1988.
[28] R. J. Carroll, R. K. Knickerbocker, and C. Y. Wang. Dimension reduction in a semiparametric regression model with errors in covariates. Ann. Statist., 23(1):161–181, 1995. [29] R. J. Carroll, H. K¨ uchenhoff, A. Lombard, and L. A. Stefanski. Asymptotics for the SIMEX estimator in nonlinear measurement error models. J. Amer. Statist. Assoc., 91(433):242–250, 1996. [30] R. J. Carroll and Ker-Chau Li. Measurement error regression with unknown link: Dimension reduction and data visualization. J. Amer. Statist. Assoc., 87(420):1140–1150, 1992. [31] R. J. Carroll, D. Ruppert, and L. A. Stefanski. Measurement error in nonlinear models. Chapman and Hall, 1995. [32] R. J. Carroll and C. H. Spiegelman. Diagnostics for nonlinearity and heteroscedasticity in errors-in-variables regression. Technometrics, 34(2):186– 196, 1992. [33] R. J. Carroll, C. H. Spiegelman, G. K. K. Lan, K. T. Bailey, and R. D. Abbott. On errors-in-variables for binary regression models. Biometrika, 71(1):19–25, 1984. [34] R. J. Carroll and L. A. Stefanski. Approximate quasi-likelihood estimation in models with surrogate predictiors. J. Amer. Statist. Assoc., 85(411):562– 663, 1990. [35] R. J. Carroll and M. P. Wand. Semiparametric estimation in logistic measurement error models. J.R.Statist.Soc.B, 53(3):573–585, 1991. [36] Chi-Lun Cheng and J. W. van Ness. On estimating linear relationships when both variables are subject to errors. J.R.Statist.Soc. B, 56(1):167– 183, 1994. [37] A. Chesher. The effect of measurement error. Biometrika(3):451–462, 1991.
Biometrika,
[38] J. R. Cook and L. A. Stefanski. Simulation-extrapolation estimation in parametric measurement error models. J. Amer. Statist. Assoc., 85:652– 663, 1994. [39] D. R. Cox and N. Reid. Parameter orthogonality and approximate conditional inference. J.R.Statist. Soc. B, 49(1):1–39, 1987. [40] D. R. Cox and N. Reid. A note on the calculation of adjusted profile likelihood. J.R. Statist.Soc.B., 55(2):467–471, 1993.
266
[41] L. Devroye. Consistent deconvolution in density estimation. Canad. J. Statist., 17(2):235–239, 1989. [42] G. R. Dolby. Generalized least squares and maximum likelihood estimation of non-linear functional relationships. J. R. Statist. Soc. B, 34:393–400, 1972. [43] G. R. Dolby. The connection between methods in implicit and explicit nonlinear models. Appl. Statist., 25(2):157–162, 1976. [44] G. R. Dolby. The ultrastructural relation: A synthesis of the functional and structural relations. Biometrika, 63(1):39–50, 1976. [45] G. R. Dolby and T.G. Freeman. Functional relationships having many independent variables and errors with multivariate normal distributions. J. Multiv. Anal., 5:466–479, 1975. [46] G. R. Dolby and S. Lipton. Maximum likelihood estimation of the general nonlinear functional relationship with replicated observations and correlated errors. Biometrika, 59(1):121–129, 1972. [47] T. A. Duever, K. F. O’Driscoll, and P. M. Reilly. The use of the errorin-variables model in terpolymerization. J. Polymer. Sc., 21:2003–2010, 1983. [48] M. F. Egerton and P. J. Laycock. Maximum likelihood estimation of the multivariate non-linear functional relationships. Statistics, 10(2):273–230, 1979. ¨ [49] H. Eichhorn. Uber die Reduktion von photographischen Sternpositionen und Eigenbewegungen. Astron. Nachr., 285:233–237, 1960. [50] H. Eichhorn. Least-squares adjustment with probalistic constraints. Mon. Not. R. astr. Soc., 182:355–360, 1978. [51] H. Eichhorn. The direct use of spherical coordinates in focal plane astrometry. Astron. Astrophys., 150:251–255, 1985. [52] H. Eichhorn. A general explicit solution of the central overlap problem. The Astrophysical Journal, 334:465–469, 1988. [53] Jianqing Fan. Asymptotic normality for deconvolution kernel density estimators. Sankhya, 53:97–110, 1991. [54] Jianqing Fan. On the optimal rates of convergence for nonparametric deconvolution problems. Ann. Statist., 19:1257–1272, 1991.
[55] Jianqing Fan. Adaptively local one-dimensional subproblems with application to a deconvolution problem. Ann. Statist., 21(2):600–610, 1993.
[56] Jianqing Fan and Young K. Truong. Nonparametric regression with errors in variables. Ann. Statist., 21(4):1900–1925, 1993.
[57] I. Fazekas, A. Kukush, and S. Zwanzig. On inconsistency of the least squares estimator in nonlinear functional error-in-variables models with dependent error terms. manuscript, 1997.
[58] W. A. Fuller. Measurement error models. Wiley, 1987.
[59] W. R. Gaffey. A consistent estimator of a component of a convolution. Ann. Math. Statist., 30:198–205, 1959.
[60] A. R. Gallant. Nonlinear statistical models. Wiley, New York, 1986.
[61] R. Gatto. Saddlepoint approximations of marginal densities and confidence intervals in the logistic regression measurement error model. Biometrics, 52:1096–1102, 1996.
[62] R. Gatto and E. Ronchetti. General saddlepoint approximations of marginal densities and tail probabilities. J. Amer. Statist. Assoc., 91:666–673, 1996.
[63] S. van de Geer. Estimating a regression function. Ann. Statist., 18(2):907–924, 1990.
[64] S. van de Geer. The method of sieves and minimum contrast estimators. Math. Meth. Statist., 4(1):20–38, 1995.
[65] E. Gine. Empirical processes and applications: an overview. Bernoulli, 2:1–28, 1996.
[66] Z. Griliches and V. Ringstad. Errors-in-variables bias in nonlinear contexts. Econometrica, 38(2):368–370, 1970.
[67] L. J. Gleser. A note on G. R. Dolby's unreplicated ultrastructural model. Biometrika, 72(1):117–124, 1985.
[68] L. J. Gleser. Improvements of the naive approach to estimation in nonlinear errors-in-variables regression models. Contemp. Math., 112:99–114, 1990.
[69] B. W. Gnedenko. Einführung in die Wahrscheinlichkeitstheorie. Akademie-Verlag, Berlin, 1991.
[70] V. P. Godambe. An optimum property of regular maximum likelihood estimation. Ann. Math. Statist., 31:1208–1211, 1960.
[71] V. P. Godambe. Conditional likelihood and unconditional optimum estimating equations. Biometrika, 63(2):277–284, 1976.
[72] V. P. Godambe. On sufficiency and ancillarity in the presence of a nuisance parameter. Biometrika, 67(1):155–162, 1980.
[73] V. P. Godambe. Orthogonality of estimating functions and nuisance parameters. Biometrika, 78(1):143–151, 1991.
[74] V. P. Godambe and M. E. Thompson. Estimating equations in the presence of a nuisance parameter. Ann. Statist., 2:568–571, 1974.
[75] J. J. Hanfelt and Kung-Yee Liang. Approximate likelihoods for generalized linear errors-in-variables models. J. R. Statist. Soc. B, 59(3):627–637, 1997.
[76] J. A. Hausman, W. K. Newey, Hidehiko Ichimura, and J. L. Powell. Identification and estimation of polynomial errors-in-variables models. J. Econom., 50:273–295, 1991.
[77] J. A. Hausman, W. K. Newey, and J. L. Powell. Nonlinear errors in variables estimation of some Engel curves. J. Econom., 65:205–233, 1995.
[78] C. H. Hesse. Deconvolving a density from contaminated dependent observations. Ann. Inst. Statist. Math., 47(4):645–663, 1995.
[79] C. H. Hesse. Deconvolving a density from partially contaminated observations. J. Multiv. Anal., 55:246–260, 1995.
[80] L. T. M. E. Hillegers. The estimation in functional relationship models. Proefschrift, Technische Universiteit Eindhoven, 1986.
[81] S. Hirte. Reduction methods for the astrometric analysis of Schmidt plates. Wiss. Z. Techn. Univers. Dresden, 38(2):15–16, 1989.
[82] Cheng Hsiao. Consistent estimation for some nonlinear errors-in-variables models. J. Econom., 41:159–185, 1989.
[83] P. J. Huber. Robust Statistics. Wiley, New York, 1981.
[84] K. M. S. Humak. Statistische Methoden der Modellbildung. I. Akademie-Verlag, Berlin, 1977.
[85] K. M. S. Humak. Statistische Methoden II. Akademie-Verlag, Berlin, 1983.
[86] I. A. Ibragimov and R. Z. Hasminskii. On the efficient estimation in the presence of an infinite-dimensional nuisance parameter. Proceedings of the USSR-Japan Symposium on Probability Theory and Mathematical Statistics, Lecture Notes in Statistics, pages 12–16, 1983.
[87] I. A. Ibragimov and R. Z. Hasminskii. Statistical Estimation: Asymptotic Theory. Springer-Verlag, 1981.
[88] A. V. Ivanov and S. Zwanzig. An asymptotic expansion of the distribution of least squares estimators in the nonlinear regression model. Statistics, 14(1):7–27, 1983.
[89] W. H. Jefferys. On the method of least squares. The Astronomical Journal, 85(2):177–181, 1980.
[90] W. H. Jefferys. On the method of least squares II. The Astronomical Journal, 86(1):149–155, 1981.
[91] R. I. Jennrich. Asymptotic properties of nonlinear least squares estimators. Ann. Math. Statist., 40(2):633–643, 1969.
[92] J. D. Kalbfleisch and D. A. Sprott. Application of likelihood methods to models involving large numbers of parameters. J. R. Statist. Soc. B, 32:175–208, 1970.
[93] S. E. Keeler and P. M. Reilly. The error-in-variables model applied to parameter estimation when the error covariance matrix is unknown. Can. J. Chem. Eng., 79:27–34, 1991.
[94] S. E. Keeler and P. M. Reilly. The design of experiments when there are errors in all the variables. Can. J. Chem. Eng., 70:774–778, 1992.
[95] M. G. Kendall. Regression, structure and functional relationship. I. Biometrika, 38:11–25, 1951.
[96] M. G. Kendall. Regression, structure and functional relationship. II. Biometrika, 39:96–108, 1952.
[97] M. G. Kendall and A. Stuart. The Advanced Theory of Statistics. Griffin, London, 1979.
[98] J. Kiefer and J. Wolfowitz. Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. Ann. Math. Statist., 27:887–906, 1956.
[99] A. N. Kolmogoroff and W. M. Tichomirow. Arbeiten zur Informationstheorie III. Mathematische Forschungsberichte, Verlag der Wissenschaften, Berlin, 1960.
[100] J. Kuelbs. Some exponential moments of sums of independent random variables. Trans. Amer. Math. Soc., 240:145–162, 1978.
[101] A. Kukush and S. Zwanzig. On an alternative estimator in nonlinear functional relations. manuscript, July 1996.
[102] A. Kukush and S. Zwanzig. On inconsistency of the least squares estimator in nonlinear functional error-in-variables models. Technical Report 12, Institute of Mathematical Stochastics, University of Hamburg, July 1996.
[103] A. Kukush and S. Zwanzig. Consistency and inconsistency of the weighted least squares estimator in linear functional relations with dependent error terms. Technical Report 3, University of Hamburg, Institute of Mathematical Stochastics, March 1997.
[104] A. Kukush and S. Zwanzig. On a corrected contrast estimator in the implicit nonlinear functional relation model. Technical Report 97-11, University of Hamburg, Institute of Mathematical Stochastics, October 1997.
[105] M. Kumon and S. Amari. Estimation of a structural parameter in the presence of a large number of nuisance parameters. Biometrika, 71(3):445–459, 1984.
[106] D. Kundu. Asymptotic theory of the least squares estimator of a particular non-linear model. Statist. Prob. Lett., 18:13–17, 1993.
[107] Henning Läuter. Note on the strong consistency of the least squares estimator in nonlinear regression. Statistics, 20(2):199–210, 1989.
[108] Lung-fei Lee and J. H. Sepanski. Estimation of linear and nonlinear errors-in-variables models using validation data. J. Amer. Statist. Assoc., 90(429):130–140, 1995.
[109] Kung-Yee Liang. Estimating functions and approximate conditional likelihood. Biometrika, 74(4):695–702, 1987.
[110] F. Liese and I. Vajda. Consistency of M-estimates in general regression models. J. Multiv. Anal., 50:93–114, 1994.
[111] F. Liese and I. Vajda. Necessary and sufficient conditions for consistency of generalized M-estimates. Metrika, 42:291–324, 1995.
[112] B. G. Lindsay. Conditional score functions: Some optimality results. Biometrika, 69(3):503–512, 1982.
[113] H. N. Linssen and L. T. M. E. Hillegers. Approximative inference in multivariate nonlinear functional relationships. Statistica Neerl., 43(3):141–156, 1989.
[114] G. G. Lorentz. Approximation of functions. Holt, Rinehart and Winston, 1966.
[115] M. C. Liu and R. L. Taylor. A consistent nonparametric density estimator for the deconvolution problem. Canad. J. Statist., 17:427–438, 1989.
[116] K. Mai. Approximation von Verteilungsfunktionen unter einer Kumulantenbedingung. Ph.D. thesis, Humboldt University Berlin, 1987.
[117] Tak K. Mak. Solving non-linear estimation equations. J. R. Statist. Soc. B, 55(4):945–955, 1993.
[118] E. Malinvaud. The consistency of nonlinear regression. Ann. Math. Statist., 41(3):956–969, 1970.
[119] H. J. Mantel and V. P. Godambe. Estimating functions for conditional inference: Many nuisance parameter case. Ann. Inst. Statist. Math., 45(1):55–67, 1993.
[120] E. Masry. Asymptotic normality for deconvolution estimators of multivariate densities of stationary processes. J. Multiv. Anal., 44:47–68, 1993.
[121] E. Masry. Multivariate regression estimation with errors-in-variables for stationary processes. J. Nonparam. Statist., 3:13–36, 1993.
[122] P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman and Hall, 1994.
[123] R. van der Meer, H. N. Linssen, and A. L. German. Improved methods of estimating monomer reactivity ratios in copolymerization by considering experimental errors in both variables. J. Polymer Science, Polymer Chemistry Ed., 16:2915–2930, 1978.
[124] J. Mendelsohn and J. Rice. Deconvolution of microfluorometric histograms with B splines. J. Amer. Statist. Assoc., 77(380):748–753, 1982.
[125] S. A. Murphy and A. W. van der Vaart. Likelihood inference in the errors-in-variables model. manuscript, October 1995.
[126] Nico J. D. Nagelkerke. Maximum likelihood estimation of functional relationships. Lecture Notes in Statistics, 69, Springer, 1992.
[127] T. Nakamura. Corrected score function for errors-in-variables models: Methodology and application to generalized linear models. Biometrika, 77(1):127–137, 1990.
[128] J. Neyman and E. L. Scott. Consistent estimates based on partially consistent observations. Econometrica, 16(1):1–32, 1948.
[129] M. Nussbaum. An asymptotic minimax risk bound for estimation of a linear functional relationship. J. Multiv. Anal., 14:300–314, 1984.
[130] M. Nussbaum and S. Zwanzig. Standard and nonstandard asymptotic risk bounds in a semiparametric errors in variable model and their attainment by estimation. Technical Report P-Math-20/88, Institute of Mathematics, Academy of Science GDR, February 1988.
[131] M. Nussbaum and S. Zwanzig. A minimax result in a model with infinitely many nuisance parameters. Transactions of the Tenth Prague Conference on Information Theory, pages 215–222, 1988.
[132] Prakash Patil. A note on deconvolution density estimation. Statist. Probab. Lett., 29:79–84, 1996.
[133] J. Pfanzagl. On the measurability and consistency of minimum contrast estimates. Metrika, 14:247–276, 1969.
[134] J. Pfanzagl. Incidental versus random nuisance parameters. Ann. Statist., 21(4):1663–1691, 1993.
[135] J. Pfanzagl. On the consistency of conditional maximum likelihood estimators. Ann. Inst. Statist. Math., 45(4):703–719, 1993.
[136] D. Pollard. Convergence of Stochastic Processes. Springer Series in Statistics, 1984.
[137] P. M. Reilly and H. Patino-Leal. A Bayesian study of the error-in-variables model. Technometrics, 23(3):221–231, 1981.
[138] M. Rudemo, D. Ruppert, and J. C. Streibig. Random-effect models in nonlinear regression with applications to bioassay. Biometrics, 45:349–362, 1989.
[139] J. Sacks and D. Ylvisaker. Some model robust design in regression. Ann. Statist., 12(4):1324–1348, 1984.
[140] G. Sansone. Orthogonal functions. Interscience Publishers, New York, 1959.
[141] L. Saulis. Asymptotic expansions of distribution functions of arbitrary random variable with regular semiinvariants (in Russian). Liet. Matem. Rink., 35(3):367–380, 1995.
[142] L. Saulis and V. A. Statulevicius. Limit theorems for large deviations. Kluwer Academic Publishers, Dordrecht, 1991.
[143] D. W. Schafer. Covariate measurement error in generalized linear models. Biometrika, 74(2):385–391, 1987.
[144] D. W. Schafer. Measurement error model estimation using iteratively weighted least squares. Contemp. Math., 112:129–138, 1990.
[145] H. Schneeweiß and H. J. Mittag. Lineare Modelle mit fehlerbehafteten Daten. Physica-Verlag, Heidelberg, 1986.
[146] D. J. Schnell. A likelihood ratio test for error covariance specification in nonlinear measurement error models. Contemp. Math., 112:157–165, 1990.
[147] Karl L. Schulze and Robert S. Lipe. Relationship between substrate concentration, growth rate, and respiration rate of Escherichia coli in continuous culture. Archiv f. Mikrobiologie, 48:1–20, 1964.
[148] H. Schwetlick and V. Tiller. Numerical methods for estimating parameters in nonlinear models with errors in variables. Technometrics, 27(1):17–24, 1985.
[149] G. A. F. Seber and C. J. Wild. Nonlinear Regression. Wiley, New York, 1989.
[150] J. H. Sepanski and R. J. Carroll. Semiparametric quasilikelihood and variance function estimation in measurement error models. J. Econom., 58:223–256, 1993.
[151] L. A. Shepp. Distinguishing a sequence of random variables from a translate of itself. Ann. Math. Statist., 36:1107–1112, 1965.
[152] W. H. Southwell. Fitting data to nonlinear functions with uncertainties in all measurement variables. Comp. J., 19(1):69–73, 1976.
[153] P. Sprent. Some history of functional and structural relationships. Contemp. Math., 112:3–15, 1990.
[154] A. K. Srivastava and Shalabh. Consistent estimation for the non-normal ultrastructural model. Statist. Probab. Lett., 34:67–73, 1997.
[155] L. A. Stefanski. The effects of measurement error on parameter estimation. Biometrika, 72(3):583–592, 1985.
[156] L. A. Stefanski. Correcting data for measurement error in generalized linear models. Commun. Statist. Theory Meth., 18(5):1715–1733, 1989.
[157] L. A. Stefanski. Unbiased estimation of a nonlinear function of a normal mean with application to measurement errors. Commun. Statist. Theory Meth., 18(12):4335–4358, 1989.
[158] L. A. Stefanski. Rates of convergence of some estimators in a class of deconvolution problems. Statist. Probab. Lett., 9:229–235, 1990.
[159] L. A. Stefanski and J. S. Buzas. Instrumental variable estimation in binary regression measurement error models. J. Amer. Statist. Assoc., 90(430):541–550, 1995.
[160] L. A. Stefanski and R. J. Carroll. Structural logistic regression measurement error models. Contemp. Math., 112:115–127, 1990.
[161] L. A. Stefanski and R. J. Carroll. Covariate measurement error in logistic regression. Ann. Statist., 13(4):1335–1351, 1985.
[162] L. A. Stefanski and R. J. Carroll. Conditional scores and optimal scores for generalized linear measurement-error models. Biometrika, 74(4):703–716, 1987.
[163] L. A. Stefanski and R. J. Carroll. Deconvoluting kernel density estimators. Statistics, 21:169–184, 1990.
[164] L. A. Stefanski and R. J. Carroll. Score tests in generalized linear measurement error models. J. R. Statist. Soc. B, 52(2):345–359, 1990.
[165] L. A. Stefanski and R. J. Carroll. Deconvolution based score tests in measurement error models. Ann. Statist., 19(1):249–259, 1991.
[166] C. M. Stein. Estimation of the mean of a multivariate normal distribution. Ann. Statist., 9(6):1135–1151, 1981.
[167] C. Stone. Optimal global rates of convergence for nonparametric regression. Ann. Statist., 10:1040–1053, 1982.
[168] G. Szegö. Orthogonal polynomials. American Mathematical Society Colloquium Publications, 1959.
[169] T. D. Tosteson and A. A. Tsiatis. The asymptotic relative efficiency of score tests in a generalized linear model with surrogate covariates. Biometrika, 75(3):507–514, 1988.
[170] H. Triebel. Höhere Analysis. Deutscher Verlag der Wissenschaften, Berlin, 1972.
[171] A. W. van der Vaart. Statistical estimation in large parameter spaces. Proefschrift, Rijksuniversiteit Leiden, 1987.
[172] A. W. van der Vaart. Efficient MLE in semi-parametric mixture models. Ann. Statist., 24(2):862–878, 1996.
[173] A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer-Verlag, New York, 1996.
[174] C. Villegas. On the least squares estimation of non-linear relations. Ann. Math. Statist., 40:462–466, 1969.
[175] A. Wald. Estimation of a parameter when the number of unknown parameters increases indefinitely with the number of observations. Ann. Math. Statist., 19:220–227, 1948.
[176] A. S. Whittemore and J. B. Keller. Approximations for regression with covariate measurement error. J. Amer. Statist. Assoc., 83(404):1057–1066, 1988.
[177] P. Whittle. Bounds for the moments of linear and quadratic forms in independent variables. Theory Prob. and Appl., 5:302–305, 1960.
[178] K. M. Wolter and W. A. Fuller. Estimation of nonlinear errors-in-variables models. Ann. Statist., 10(2):539–548, 1982.
[179] K. M. Wolter and W. A. Fuller. Estimation of the quadratic errors-in-variables model. Biometrika, 69(1):175–182, 1982.
[180] Chien-Fu Wu. Asymptotic theory of nonlinear least squares estimation. Ann. Statist., 9(3):501–513, 1981.
[181] Cun-Hui Zhang. Fourier methods for estimating mixing densities and distributions. Ann. Statist., 18(2):806–831, 1990.
[182] S. Zwanzig. The choice of approximative models in nonlinear regression. Statistics, 11(1):23–47, 1980.
[183] S. Zwanzig. On an asymptotic minimax result in nonlinear errors-in-variables models. Proceedings of the Fourth Prague Symposium on Asymptotic Statistics, pages 549–558, 1989.
[184] S. Zwanzig. On consistency in nonlinear functional relations. Technical Report P-Math 10-90, Institute of Mathematics, Academy of Science GDR, 1990.
[185] S. Zwanzig. Least squares estimation in nonlinear functional relations. Proceedings Probastat 91, Bratislava, pages 171–177, 1991.
[186] S. Zwanzig. On adaptive estimation in nonlinear regression. Kybernetika, 30(3):359–367, 1994.
[187] S. Zwanzig. On an orthogonal series estimator in nonparametric functional relations. Technical Report 96-8, University of Hamburg, May 1996.
[188] S. Zwanzig. Application of Hipparcos data: A new statistical method for star reduction. Technical Report 97-10, University of Hamburg, Institute of Mathematical Stochastics, October 1997.
[189] S. Zwanzig. On L1-norm estimators in nonlinear regression and in nonlinear error-in-variables models. Lecture Notes-Monograph Series, 31:101–118, 1997.