© 2006 Society for Industrial and Applied Mathematics

THEORY PROBAB. APPL. Vol. 50, No. 2, pp. 171–186

ASYMPTOTIC PROPERTIES AND ROBUSTNESS OF MINIMUM DISSIMILARITY ESTIMATORS OF LOCATION-SCALE PARAMETERS*

F. BASSETTI† AND E. REGAZZINI†

Abstract. This paper deals with the asymptotic properties of an unexplored estimation method, for location and scale parameters, based on the minimization of the Monge–Gini–Kantorovich–Wasserstein distance. This method is rigorously defined and justified according to the general principle which directs the theory of regression. The resulting estimators—called minimum dissimilarity estimators—exist, and are measurable, consistent, and robust. Their asymptotic distribution is the same as the probability distribution of the absolute minimum point of an interesting functional of a standard Brownian bridge. This fact can be employed to obtain both explicit exact expressions and numerical approximations for the above asymptotic distribution.

Key words. argmax argument, asymptotic laws, influence function, minimum dissimilarity estimator, Monge–Gini–Kantorovich–Wasserstein metric, occupation time of a Brownian bridge, robustness

DOI. 10.1137/S0040585X97981664

1. Introduction. Consider the location-scale model

(1)    x_k = \mu + \sigma \varepsilon_k    (k = 1, 2, \dots),
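For concreteness, data from model (1) can be simulated in a few lines. This is a toy illustration of ours, not part of the paper; the standard-normal choice for the noise distribution F is our assumption:

```python
import random

random.seed(0)
mu, sigma = 1.5, 2.0
# x_k = mu + sigma * eps_k, with eps_k i.i.d. standard normal
xs = [mu + sigma * random.gauss(0.0, 1.0) for _ in range(100_000)]
print(sum(xs) / len(xs))  # close to mu = 1.5, since this F has zero mean
```

With a fixed seed the run is reproducible, and the sample mean lands within a few thousandths of μ.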

where \varepsilon_1, \varepsilon_2, \dots are independent and identically distributed (i.i.d.) real-valued random variables with common nondegenerate probability distribution (p.d.) function F, and \theta = (\mu, \sigma) is an unknown parameter in \Theta := R \times [0, +\infty). Estimating \theta is one of the most elementary and classical problems in statistics, and there is a large body of literature about it. We return to it here in order to illustrate an estimation method based on the minimization of the Monge–Gini–Kantorovich–Wasserstein distance (G-distance for brevity) between the law of each x_i and the empirical distribution of a sample x^{(n)} = (x_1, \dots, x_n). In fact, this distance has been substantially neglected by the innumerable studies which deal with minimum distance estimators, in spite of its usefulness in many fields of mathematics; see, for example, [21] for a comprehensive treatment, and [24] for applications in the theory of optimal transportation. As we will show, the method above, applied to the problem of estimating \theta in (1), turns out to be robust and, consequently, can be recommended, for example, against the occurrence of anomalous values in x^{(n)}. Moreover, it admits a theoretically interesting interpretation but, with respect to other minimum distance procedures, it is disadvantageous from a computational standpoint. In any case, we think it is worth analyzing, at least from the perspective of asymptotic statistics.

Section 2 contains the main definitions, together with an interpretation meant to motivate the use of the estimation method based on the minimization of the G-distance. Section 3 deals with existence, measurability, and consistency of estimators derived

---
* Received by the editors December 9, 2004. This work was partially supported by MURST, Programma di Ricerca "Metodi bayesiani non parametrici e loro applicazioni," and by IMATI of CNR, Pavia, Italy.
  http://www.siam.org/journals/tvp/50-2/98166.html
† Dipartimento di Matematica, Università degli Studi di Pavia, via Ferrata 1, 27100 Pavia, Italy ([email protected], [email protected]).


from such a process of minimization. A characterization of their limit distribution is obtained in section 4. Such a characterization is used, in section 5, to deduce the exact expression of the asymptotic distribution of location minimum G-distance estimators for some specific forms of F. Apropos of this, open problems about occupation times of a Brownian bridge are described. The same section includes the proof of a simple, interesting robustness property of minimum G-distance estimators of location parameters. Finally, in section 6 the characterization given in section 4 is used to derive confidence bounds for \theta by Monte Carlo methods.

2. Main definitions and interpretive aspects. Let us formalize as follows the conditions mentioned at the beginning of the previous section:

(H_0) (\varepsilon_1, x_1), \dots, (\varepsilon_n, x_n), \dots are real random vectors obeying model (1) for some \theta := (\mu, \sigma) in \Theta = R \times [0, +\infty), defined on the probability space (\Omega, \mathcal{F}, P_\theta); P_\theta makes the \varepsilon_k's i.i.d. real-valued random variables with a common nondegenerate p.d. function F, and F has finite expectation.

Thus, the x_k's turn out to be i.i.d. random variables with a common p.d. function F_\theta that, for every x in R, is defined by F_\theta(x) := F(\sigma^{-1}(x - \mu)) if \sigma > 0 and by F_\theta(x) := H(x - \mu) = I_{[\mu, +\infty)}(x) if \sigma = 0. From now on, I_A will indicate the indicator function of the set A. Define F_n to be the empirical p.d. function of x^{(n)}, i.e.,

F_n(x) = \frac{1}{n} \sum_{k=1}^{n} H(x - x_k)    (x \in R).
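The empirical p.d. function can be computed directly from a sample; a minimal sketch of ours (not from the paper), with H the unit step function I_{[0,+∞)}:

```python
def empirical_cdf(sample, x):
    """F_n(x) = (1/n) * sum_k H(x - x_k), with H the unit step I_[0, +inf)."""
    n = len(sample)
    return sum(1 for xk in sample if x >= xk) / n

# For the sample (1, 2, 3), F_n(2) is the fraction of observations <= 2
print(empirical_cdf([1.0, 2.0, 3.0], 2.0))  # -> 0.666...
```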

Put

G(F_n, F_\theta) := \int_{R} |F_n(x) - F_\theta(x)| \, dx,

that is, the G-distance between F_n and F_\theta. Under these conditions, any \hat\theta_n = \hat\theta_n(x_1, x_2, \dots) in \Theta such that

G(F_n, F_{\hat\theta_n}) = \inf_{\theta \in \Theta} G(F_n, F_\theta),

i.e., \hat\theta_n \in \arg\min G(F_n, F_\theta), is said to be a minimum dissimilarity estimator (MDE) for \theta. It should be recalled that the term dissimilarity index was introduced by Gini to designate G; see [13]. More precisely, if F_1 and F_2 are p.d. functions with finite expectation, the dissimilarity index between them is defined to be

G(F_1, F_2) = \int_{R} |F_1(x) - F_2(x)| \, dx = \int_{(0,1)} |F_1^{-1}(y) - F_2^{-1}(y)| \, dy,

where F_1^{-1} and F_2^{-1} stand for the quantile functions of F_1 and F_2, respectively, i.e.,

F_i^{-1}(y) := \sup\{ t \in R : F_i(t) \le y \}    (y \in (0,1),\ i = 1, 2).

To motivate the introduction of the above method of estimation, assume momentarily that \varepsilon_k and x_k (k = 1, \dots, n) are observed values of an explanatory variable and of a response variable, respectively. Then, as in the classical theory of regression, consider estimates of \theta as minimizers of the loss function

\frac{1}{n} \sum_{k=1}^{n} |x_k - \mu - \sigma\varepsilon_k|^r,
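The quantile-function form of the dissimilarity index lends itself to direct numerical approximation. The following sketch is our illustration, not the authors' code; the grid size and the midpoint rule are our assumptions:

```python
def g_distance(quantile1, quantile2, grid=10_000):
    """Approximate G(F1, F2) = int_0^1 |F1^{-1}(y) - F2^{-1}(y)| dy (midpoint rule)."""
    total = 0.0
    for i in range(grid):
        y = (i + 0.5) / grid
        total += abs(quantile1(y) - quantile2(y))
    return total / grid

def empirical_quantile(sample):
    """Quantile function of the empirical p.d. of `sample` (a step function)."""
    xs = sorted(sample)
    n = len(xs)
    return lambda y: xs[min(int(y * n), n - 1)]

# Degenerate sample at 0.5 versus the uniform d.f. on (0,1):
# G = int_0^1 |0.5 - y| dy = 1/4
g = g_distance(empirical_quantile([0.5]), lambda y: y)
print(g)  # -> 0.25 (up to rounding)
```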


with r being some fixed positive number. With r = 2, this criterion gives rise to the least squares estimates, whereas, with r = 1, it yields the Boscovich L^1-approximation. For the case r = 1, by which our MDEs are inspired, see the relatively recent paper [19] and the book [5]; see also the papers contained in [11]. Now, if H^{(n)} is the empirical p.d. function of the sample ((\varepsilon_1, x_1), \dots, (\varepsilon_n, x_n)), the previous loss function can be written as

(2)    \int_{R^2} |x - \mu - \varepsilon\sigma|^r \, dH^{(n)}(\varepsilon, x).
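The two familiar special cases can be checked numerically: for a location-only criterion, r = 2 is minimized by the sample mean and r = 1 by the sample median. A grid-search sketch of ours:

```python
def loss(sample, m, r):
    """(1/n) * sum_k |x_k - m|^r  (location-only version of the criterion)."""
    return sum(abs(x - m) ** r for x in sample) / len(sample)

def argmin_location(sample, r):
    """Grid search for the minimizing location value (step 0.001 on [-5, 5])."""
    grid = [i / 1000 for i in range(-5000, 5001)]
    return min(grid, key=lambda m: loss(sample, m, r))

data = [-2.0, 0.0, 0.5, 1.0, 4.0]
print(argmin_location(data, 2))  # -> 0.7, the mean
print(argmin_location(data, 1))  # -> 0.5, the median
```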

Under the usual conditions of independence and identity in distribution (with parameter \theta = (\mu, \sigma)) for the (\varepsilon_k, x_k)'s, H^{(n)} converges weakly (almost surely), as n \to +\infty, to the bivariate p.d. supported by the straight line \{(x,y) \in R^2 : y = \mu + \sigma x\}, with marginal p.d. functions F and F_\theta. So, if \Gamma(F_1, F_2) denotes the Fréchet class of all the bivariate p.d. functions with fixed marginals F_1 and F_2, we can say that the limit of H^{(n)} is the maximal element of \Gamma(F, F_\theta), i.e.,

H(x, y) = \min( F(x), F_\theta(y) )    ((x,y) \in R^2),

with respect to the so-called concordance (or positive dependence) ordering defined by Gini as follows: if H_1 and H_2 are in \Gamma(F_1, F_2) and H_1(x,y) \ge H_2(x,y) for every (x,y) in R^2, with H_1(x_0, y_0) > H_2(x_0, y_0) for some (x_0, y_0), then H_1 is said to be more concordant than H_2. Thus, in the statistical theory of regression the loss function (2) is determined by an empirical distribution which approximates H for sufficiently large values of n. To preserve this feature even when—as in our primary problem—only the response variable is observable, we can replace H^{(n)} with

H_*^{(n)}(x, y) := \min( F(x), F_n(y) )    ((x,y) \in R^2),

which, like H^{(n)}, under the distributional conditions fixed at the beginning of the paper, converges weakly (almost surely) to H. Hence, if \int_R |x|^r \, dF(x) < +\infty, one will seek estimates of \theta which minimize

(3)    \int_{R^2} |x - \mu - \varepsilon\sigma|^r \, dH_*^{(n)}(\varepsilon, x) = \int_{(0,1)} |F_n^{-1}(y) - F_\theta^{-1}(y)|^r \, dy,

i.e., MDEs when r = 1. A proof of equality (3) can be found in [21]. As for the literature, we know of a few papers devoted to the estimates described above: paper [4], which deals with estimates defined as minimizers of (3) with r = 2, and two papers, [3] and [2], which discuss consistency of minimizers of the so-called Kantorovich functionals (see [21]), which appear as a natural generalization of (3). In fact, section 3 is a new version of some of the results contained in [2], adapted to our present particular situation.

3. Existence, measurability, and consistency of MDEs. Before studying properties like consistency and robustness, it is worth tackling the problem of the existence of MDEs. Clearly, it is desirable that MDEs, when they exist, be measurable functions, in order to state, for example, meaningful forms of asymptotic consistency. According to the next proposition, MDEs exist, are measurable, and are strongly consistent.

Proposition 3.1. Let (H_0) be valid. Then, for each \theta in \Theta° (the interior of \Theta) and for every n, there is an MDE \hat\theta_n which is \mathcal{F}/\mathcal{B}(R^2)-measurable, and the sequence (\hat\theta_n)_{n \ge 1} satisfies

\lim_{n_0 \to +\infty} P_\theta\Big\{ \sup_{n \ge n_0} \|\hat\theta_n - \theta\|_2 \le \varepsilon \Big\} = 1    (\varepsilon > 0),

where \|\cdot\|_2 stands for the Euclidean norm in R^2.

Proof. Apply the dominated convergence theorem to prove that \Theta \ni \theta' \mapsto G(F_n, F_{\theta'}) is continuous. Hence, for any c, the set \{\theta' \in \Theta : G(F_n, F_{\theta'}) \le c\} \subset R^2 is closed. Moreover, it is bounded since

G(F_{\theta'}, F_\theta) = \rho \int_{(0,1)} |\sin\varphi + F^{-1}(y)\cos\varphi| \, dy

with \rho = \|\theta' - \theta\|_2, \mu' - \mu = \rho\sin\varphi, \sigma' - \sigma = \rho\cos\varphi, and

k := \inf_\varphi \int_{(0,1)} |\sin\varphi + F^{-1}(y)\cos\varphi| \, dy > 0;

thus, if G(F_{\theta'}, F_\theta) \le c, we can write \|\theta' - \theta\|_2 \le c/k, and from G(F_{\theta'}, F_\theta) \le c + s_n (where s_n := G(F_n, F_\theta)), which holds when G(F_n, F_{\theta'}) \le c, we get

\{ \theta' \in \Theta : G(F_n, F_{\theta'}) \le c \} \subset \{ \theta' \in \Theta : \|\theta' - \theta\|_2 \le (c + s_n)\,k^{-1} \}.

The fact that the sets \{\theta' \in \Theta : G(F_n, F_{\theta'}) \le c\} are compact entails

\arg\min_{\theta'} G(F_n, F_{\theta'}) \ne \emptyset,

which is tantamount to saying that MDEs \hat\theta_n exist for every n. At this stage, the measurability of \hat\theta_n is a straightforward consequence of Corollary 1 in [7]. As for the consistency of (\hat\theta_n)_{n \ge 1}, consider the triangle inequality again to obtain G(F_{\hat\theta_n}, F_\theta) \le 2s_n and, hence, \|\hat\theta_n - \theta\|_2 \le 2s_n/k. Then, apply the Glivenko–Cantelli theorem to get \sup |F_n - F_\theta| \to 0 (almost surely) and, by the Kolmogorov strong law of large numbers, \int x \, dF_n(x) \to \int x \, dF_\theta(x) (almost surely). These facts yield s_n \to 0 (almost surely); see, for example, [21, Corollary 7.5.3].

4. Characterization of the asymptotic distribution of MDEs. The main result of the present section characterizes the limiting law of the statistic \hat h_n := \sqrt{n}(\hat\theta_n - \theta), which is well defined according to Proposition 3.1.

Proposition 4.1. Let (H_0) be in force together with the following assumptions:
(i) F is absolutely continuous on R, with a bounded and continuous probability density f, and the topological support of the probability measure determined by F is an interval, whose interior will be denoted by (a, b), with -\infty \le a < b \le +\infty;
(ii) \int_R \sqrt{F(x)(1 - F(x))} \, dx < +\infty.

Then, for every \theta in \Theta°, there is a suitable Brownian bridge B on (\Omega, \mathcal{F}, P_\theta) such that

\mathcal{M}(h) := \int_{(0,1)} \Big| \frac{\sigma B(y)}{f(F^{-1}(y))} - h_1 - h_2 F^{-1}(y) \Big| \, dy = \int_{(a,b)} \Big| \frac{\sigma B(F(x))}{f(x)} - h_1 - h_2 x \Big| f(x) \, dx

is, for every h = (h_1, h_2) in R^2, a real-valued random variable, and \{\mathcal{M}(h) : h \in R^2\} is a stochastic process on (\Omega, \mathcal{F}, P_\theta) with continuous and coercive paths on R^2, which


possess one and only one absolute minimum point \hat h. It is exactly this random minimum that represents the limit, in law, of (\hat h_n)_{n \ge 1}.

Proof. The proof relies on an argmin argument which mimics Theorem 3.2.2 in [23]. Accordingly, consider

M_n(\theta') := G(F_{\theta'}, F_n),    \mathcal{M}_n(h) := \sqrt{n}\, M_n(\theta + h/\sqrt{n})

and note that, for any compact subset K of R^2, there is \bar n such that \mathcal{M}_n(h) is well defined for every h in K and for every n \ge \bar n. It remains to prove the following claims:
(a) Each path of \mathcal{M} has one and only one absolute minimum point.
(b) For every compact K \subset R^2, (\mathcal{M}_n)_{n \ge 1} converges in distribution to \mathcal{M} in the space l^\infty(K) of bounded functions from K into R.
(c) The sequence (\hat h_n)_{n \ge 1} is tight in (R^2, \|\cdot\|_2).

Proof of claim (a). Consider the probability space ((a,b), \mathcal{B}(a,b), p_F), where p_F denotes the probability measure generated by F. Such a space is nonatomic, and the conjugate space of L^1((a,b), \mathcal{B}(a,b), p_F) is L^\infty((a,b), \mathcal{B}(a,b), p_F). Then, in view of Theorem 24 in [8], it suffices to prove that 0 is the only element of \mathcal{A} := \{a : R \to R : a(x) = \alpha x + \beta, (\alpha, \beta) \in R^2\} which vanishes on an \varepsilon-set. Recall that B \subset (a,b) is termed an \varepsilon-set if B = \partial A and A is a measurable set satisfying \int_A a \, dF = \int_{A^c} a \, dF for every a in \mathcal{A}, i.e.,

(4)    p_F(A) = \frac{1}{2}    and    \frac{\int_A x \, dF(x)}{p_F(A)} = \int_{(a,b)} x \, dF(x).

At this stage, it is enough to prove that |\partial A| \ge 2. In point of fact, |\partial A| is smaller than 2 if and only if A = (a, c) or A = (c, b) for some c in (a,b) and, therefore, either \int_A x \, dF(x)/\int_A dF(x) > \int_{(a,b)} x \, dF(x) or \int_A x \, dF(x)/\int_A dF(x) < \int_{(a,b)} x \, dF(x), in contradiction with (4).

Proof of claim (b). For every compact K \subset R^2 there is \bar n such that \mathcal{M}_n(h) is well defined for every h in K and for every n \ge \bar n. Thanks to (i), one can write

F_{\theta + h/\sqrt{n}}(x) = F_\theta(x) + g(x) \cdot \frac{h}{\sqrt{n}} + \frac{\Delta_n(x,h)}{\sqrt{n}}

with

g(x) = -f\Big(\frac{x-\mu}{\sigma}\Big) \Big( \frac{1}{\sigma}, \frac{x-\mu}{\sigma^2} \Big)

for every x in R, and

(5)    \lim_{n \to +\infty} \sup_{\|h\|_2 \le C} \int_R |\Delta_n(x,h)| \, dx = 0

for every C > 0. To prove (5), first observe that F_\theta is differentiable with respect to \theta and that its differential is given by

dF\Big(\frac{x-\mu}{\sigma}\Big) = g(x) \cdot (d\mu, d\sigma).

Thus,

\int_R |\Delta_n(x,h)| \, dx = \int_R \sqrt{n}\, \Big| F_{\theta+h/\sqrt{n}}(x) - F_\theta(x) - g(x)\cdot\frac{(h_1,h_2)}{\sqrt{n}} \Big| \, dx
= \int_R \sqrt{n}\, \Big| \int_{(x-\mu)/\sigma}^{(x-\mu-h_1/\sqrt{n})/(\sigma+h_2/\sqrt{n})} f(t)\,dt + \frac{1}{\sqrt{n}}\,\frac{1}{\sigma}\Big( h_1 + h_2\,\frac{x-\mu}{\sigma} \Big) f\Big(\frac{x-\mu}{\sigma}\Big) \Big| \, dx
\le \int_R \int_{m_{n,t}}^{M_{n,t}} \Big| \sigma f(t) - \Big(\sigma + \frac{h_2}{\sqrt{n}}\Big) f(\xi) \Big| \, \sqrt{n}\, d\xi \, dt,

with m_{n,t} and M_{n,t} representing t \wedge \sigma^{-1}\{t(\sigma + h_2/\sqrt{n}) + h_1/\sqrt{n}\} and t \vee \sigma^{-1}\{t(\sigma + h_2/\sqrt{n}) + h_1/\sqrt{n}\}, respectively. Then, for any M > 0,

\int_R |\Delta_n(x,h)| \, dx \le \Big( \int_{|t|>M} + \int_{|t|\le M} \Big) \int_0^1 \frac{|h_1 + h_2 t|}{\sigma} \Big| \sigma f(t) - \Big(\sigma + \frac{h_2}{\sqrt{n}}\Big) f\Big( t + x\,\frac{|h_1 + h_2 t|}{\sigma\sqrt{n}} \Big) \Big| \, dx \, dt,

and, for every \varepsilon > 0, there are M(\varepsilon) in (a,b) and \bar n = \bar n(\varepsilon) such that, in view of (ii), the following inequalities hold for every M in (a,b) greater than M(\varepsilon) and every n \ge \bar n:

\sup_{\|h\|_2 \le C} \int_{|t|>M} \int_0^1 \frac{|h_1 + h_2 t|}{\sigma} \Big| \sigma f(t) - \Big(\sigma + \frac{h_2}{\sqrt{n}}\Big) f\Big( t + x\,\frac{|h_1 + h_2 t|}{\sigma\sqrt{n}} \Big) \Big| \, dx \, dt
\le \int_{|t|>M} C|1+t|\, f(t)\, dt + \int_0^1 \sup_{\|h\|_2 \le C} \int_{|t|>M} \frac{|h_1 + h_2 t|}{\sigma} \Big(\sigma + \frac{h_2}{\sqrt{n}}\Big) f\Big( t + x\,\frac{|h_1 + h_2 t|}{\sigma\sqrt{n}} \Big) \, dt \, dx
\le \int_{|t|>M} C|1+t|\, f(t)\, dt + C \int_0^1 \Big( \int_{M - C/(\sigma\sqrt{n})}^{+\infty} + \int_{-\infty}^{-M + C/(\sigma\sqrt{n})} \Big) (1 + |\xi|)\, f(\xi)\, d\xi \, dx \le \varepsilon.

Now, fix M as indicated above and verify—through dominated convergence—that \sup_{\|h\|_2 \le C} \int_0^1 | \sigma f(t) - (\sigma + h_2/\sqrt{n})\, f( t + x|h_1 + t h_2|/(\sigma\sqrt{n}) ) | \, dx converges to 0 as n \to +\infty for every t, and that

\int_0^1 \frac{|h_1 + t h_2|}{\sigma} \Big| \sigma f(t) - \Big(\sigma + \frac{h_2}{\sqrt{n}}\Big) f\Big( t + x\,\frac{|h_1 + t h_2|}{\sigma\sqrt{n}} \Big) \Big| \, dx

is bounded with respect to (h,t) on \{h : \|h\|_2 \le C\} \times [-M, M]. Hence, the Lebesgue dominated convergence theorem yields \int_R \sup_{\|h\|_2 \le C} |\Delta_n(x,h)| \, dx \to 0 as n \to +\infty. Setting G_n(x) := \sqrt{n}(F_n - F_\theta)(x), one can rewrite \mathcal{M}_n(\cdot) as

\mathcal{M}_n(h) = \int_R | G_n(x) - h \cdot g(x) - \Delta_n(x,h) | \, dx.

Further, in view of Theorem 2.1 in [10], the sequence (G_n)_{n \ge 1} converges in distribution, in L^1(R), to B(F_\theta), by virtue of (ii). Thus, since

f \mapsto \sum_{i=1}^{\nu} c_i \int_R | f(x) - h^{(i)} \cdot g(x) | \, dx

is a continuous function from L^1(R) into R, for every c_i in R and h^{(i)} in R^2 (i = 1, \dots, \nu), we see that \sum_{i=1}^{\nu} c_i \int_R |G_n(x) - h^{(i)} \cdot g(x)| \, dx converges in distribution to \sum_{i=1}^{\nu} c_i \mathcal{M}(h^{(i)}). In other words, the finite-dimensional distributions of

\mathcal{N}_n(h) := \int_R | G_n(x) - h \cdot g(x) | \, dx    (h \in R^2)

converge weakly to the finite-dimensional distributions of \mathcal{M}. Now, since

\Big| \sum_{i=1}^{\nu} c_i \mathcal{M}_n(h^{(i)}) - \sum_{i=1}^{\nu} c_i \mathcal{N}_n(h^{(i)}) \Big| \le \sum_{i=1}^{\nu} |c_i| \int_R |\Delta_n(h^{(i)}, x)| \, dx,

with the right-hand side converging to 0 as n \to +\infty, the Slutsky theorem can be used to show that the finite-dimensional distributions of (\mathcal{M}_n)_{n \ge 1} have the same weak limits as the finite-dimensional distributions of (\mathcal{N}_n)_{n \ge 1}. So, for each h, (\mathcal{M}_n(h))_{n \ge 1} is tight in R. Moreover,

|\mathcal{M}_n(h^{(1)}) - \mathcal{M}_n(h^{(2)})| \le \int_R |(h^{(1)} - h^{(2)}) \cdot g(x)| \, dx + \int_R \big( |\Delta_n(h^{(1)}, x)| + |\Delta_n(h^{(2)}, x)| \big) \, dx.

Then, since every compact set K \subset R^2 has a finite measurable partition—say, \{T_1, \dots, T_k\}—such that \|h^{(1)} - h^{(2)}\|_2 \le \varepsilon for every h^{(1)} and h^{(2)} in T_i, whichever i may be, we can write

|\mathcal{M}_n(h^{(1)}) - \mathcal{M}_n(h^{(2)})| \le \varepsilon \int_R \|g(x)\|_2 \, dx + 2 \sup_i \sup_{h \in T_i} \int_R |\Delta_n(x,h)| \, dx

with \sup_i \sup_{h \in T_i} \int_R |\Delta_n(x,h)| \, dx \to 0 as n \to +\infty thanks to (5). This fact, through Theorem 1.5.6 in [23], tells us that (\mathcal{M}_n)_{n \ge 1} is tight in l^\infty(K) for every compact set K \subset R^2. Then, since weak convergence in l^\infty(K) can be characterized as tightness plus convergence of marginals, we can conclude that (\mathcal{M}_n)_{n \ge 1} converges in law to \mathcal{M} in l^\infty(K) for every compact subset K of R^2.

Proof of claim (c). For any \theta' in \Theta we can write

G(F_{\theta'}, F_\theta) - G(F_n, F_\theta) \le M_n(\theta') \le G(F_{\theta'}, F_\theta) + G(F_n, F_\theta)

and, therefore,

G(F_{\hat\theta_n}, F_\theta) \le G(F_n, F_\theta) + M_n(\hat\theta_n) \le G(F_n, F_\theta) + M_n(\theta) \le 2\, G(F_n, F_\theta).

Since G(F_{\theta'}, F_\theta) \le c implies \|\theta' - \theta\|_2 \le c/k for some suitable k > 0, as shown in the proof of Proposition 3.1, then

\sqrt{n}\, \|\hat\theta_n - \theta\|_2 \le 2 \sqrt{n}\, G(F_n, F_\theta)\, k^{-1},

i.e.,

P_\theta\{ \|\hat h_n\|_2 \ge \lambda \} \le P_\theta\Big\{ \sqrt{n}\, G(F_n, F_\theta) \ge \frac{\lambda k}{2} \Big\}    (\lambda > 0).

Thus, the tightness of (\hat h_n)_{n \ge 1} is a consequence of the tightness of \sqrt{n}\, G(F_n, F_\theta) which, in turn, follows from Theorem 1.1 in [10] about the convergence in law of \sqrt{n}\, G(F_n, F_\theta). Proposition 4.1 is proved.
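Before specializing to location parameters, it may help to see the estimator itself in action. The following toy sketch is ours, not the authors' code: it minimizes G(F_n, F_θ) over a grid of location values, with F logistic purely because its quantile function is closed-form; sample size, seed, and grid are our assumptions.

```python
import math
import random

def g_loc(sample, m, grid=1000):
    """G(F_n, F_theta) ~ int_0^1 |F_n^{-1}(y) - F_theta^{-1}(y)| dy (midpoint rule);
    model quantile is the logistic location family: F_theta^{-1}(y) = m + log(y/(1-y))."""
    xs = sorted(sample)
    n = len(xs)
    total = 0.0
    for i in range(grid):
        y = (i + 0.5) / grid
        total += abs(xs[min(int(y * n), n - 1)] - m - math.log(y / (1 - y)))
    return total / grid

random.seed(7)
true_mu = 0.7
# logistic noise via inverse-c.d.f. sampling
sample = [true_mu + math.log(u / (1 - u)) for u in (random.random() for _ in range(400))]
mu_hat = min((i / 100 for i in range(-200, 301)), key=lambda m: g_loc(sample, m))
print(mu_hat)  # lands near the true location 0.7
```

With n = 400 observations the minimizer is already close to μ, in line with the strong consistency of Proposition 3.1.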


The above characterization of the limiting distribution of MDEs will be employed in section 6 to determine—via Monte Carlo methods—confidence bounds for location, scale, and location-scale parameters of some distinguished models. In the next section, the same characterization is discussed in connection with the exact form of the limiting distribution of an MDE for a location parameter.

5. MDEs for location parameters: Limiting distribution and robustness. As to the problem of determining an adequate explicit representation for the probability distribution of \hat h when assumptions (i) and (ii) of Proposition 4.1 are valid, we confine ourselves to the case of a single location parameter, or of a single scale parameter. With respect to these particular cases, the present treatment is rather incomplete but, as explained later on, we intend to return to this subject in a forthcoming paper.

Assume that the scale parameter \sigma in (1) is known (e.g., \sigma = \sigma_0 > 0). Then \hat h is a real-valued random variable satisfying

(6)    \hat h = \arg\min_{h \in R} \int_{(0,1)} \Big| \frac{\sigma_0 B(y)}{f(F^{-1}(y))} - h \Big| \, dy.

Hence, for each path B of B, \hat h can be viewed as a median of the random variable

y \mapsto \frac{\sigma_0 B(y)}{f(F^{-1}(y))}

defined on the probability space ((0,1), \mathcal{B}(0,1), \lambda), with \lambda being the uniform probability measure on (0,1). Therefore, for every \theta = (\mu, \sigma_0) we have

(7)    P_\theta\{\hat h > \xi\} = P_\theta\Big\{ \int_{(0,1)} I_{(-\infty, \gamma(s)]}(B(s)) \, ds < \frac{1}{2} \Big\}

with

(8)    \gamma(\cdot) := \frac{\xi f(F^{-1}(\cdot))}{\sigma_0}.

Thus, the problem of determining the limiting law of an MDE for a location parameter can be viewed as that of seeking the left limit at 1/2 of the probability distribution of the occupation time of B determined by (8). Analogously, for MDEs of scale parameters—when \mu = \mu_0 is known—\hat h must satisfy

(9)    \hat h = \arg\min_{h \in R} \int_{(0,1)} \Big| \frac{\sigma B(y)}{f(F^{-1}(y))} - h F^{-1}(y) \Big| \, dy

and, consequently, for each trajectory B of B, it coincides with the median of the random variable defined on ((0,1), \mathcal{B}(0,1), \lambda^*) by

y \mapsto \frac{\sigma B(y)}{f(F^{-1}(y))\, F^{-1}(y)},

where \lambda^* stands for the probability measure satisfying

\lambda^*(dy) = \frac{1}{\bar\mu} |F^{-1}(y)| \, dy

on (0,1), with \bar\mu := \int_0^1 |F^{-1}(y)| \, dy. Hence,

(10)    P_\theta\{\hat h > \xi\} = P_\theta\Big\{ \int_{(0,1)} I_{(-\infty, \gamma^*(s)]}(B(s))\, |F^{-1}(s)| \, ds < \frac{\bar\mu}{2} \Big\}
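The occupation-time probability in (7) is easy to approximate by simulation. A sketch of ours (not from the paper): it exploits the classical fact that the occupation time of a half-line for a standard Brownian bridge is uniform on (0,1), so at ξ = 0 (i.e. γ ≡ 0) the probability in (7) must be 1/2; the step count and replicate count are our tuning choices.

```python
import math
import random

def bridge_occupation_below(gamma, steps=256, rng=random):
    """One draw of int_0^1 I(B(s) <= gamma(s)) ds for a standard Brownian bridge B,
    built from a discretized Brownian motion W via B(s) = W(s) - s*W(1)."""
    cum, w = [], 0.0
    for _ in range(steps):
        w += rng.gauss(0.0, math.sqrt(1.0 / steps))
        cum.append(w)
    w1 = cum[-1]
    occ = 0
    for i, wi in enumerate(cum):
        s = (i + 1) / steps
        if wi - s * w1 <= gamma(s):
            occ += 1
    return occ / steps

random.seed(11)
draws = [bridge_occupation_below(lambda s: 0.0) for _ in range(4000)]
p = sum(1 for t in draws if t < 0.5) / len(draws)
print(p)  # near 1/2: the occupation time below 0 is Uniform(0,1) for a bridge
```

Replacing the constant barrier with γ(s) = ξ f(F⁻¹(s))/σ₀ gives a Monte Carlo approximation of the whole limiting distribution function.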


with

(11)    \gamma^*(\cdot) := \frac{\xi f(F^{-1}(\cdot))\, F^{-1}(\cdot)}{\sigma}.

Now, by recalling that B and -B have the same probability distribution, the previous statements can be used to easily prove the following proposition.

Proposition 5.1. Let assumption (H_0) and assumptions (i) and (ii) of Proposition 4.1 be in force, and let \hat\mu_n, \hat\sigma_n denote MDEs of location and scale parameters, respectively. Then the limiting probability distributions of \sqrt{n}(\hat\mu_n - \mu) and \sqrt{n}(\hat\sigma_n - \sigma) are symmetric.

Distributional results for Brownian occupation times have been discussed in recent papers and books; see [12], [22], [14], [6]. The latter reference contains useful formulae that can be used to determine (7) for the following models:

(12)    \Big\{ F_\theta : \theta = (\mu, \sigma_0),\ F_\theta(x) = F(x - \mu) = \int_{(-\infty, x]} \frac{1}{\sigma_0} e^{-(s-\mu)/\sigma_0} I_{(\mu, +\infty)}(s) \, ds\ (x \in R),\ \text{for every } \mu \text{ in } R \Big\};

(13)    \Big\{ F_\theta : \theta = (\mu, \sigma_0),\ F_\theta(x) = F(x - \mu) = \int_{(-\infty, x]} \frac{1}{\sigma_0} I_{(\mu, \mu+\sigma_0)}(s) \, ds\ (x \in R),\ \text{for every } \mu \text{ in } R \Big\}.

Indeed, in these cases, the Laplace–Stieltjes transform of the probability distribution of \int_{(0,1)} I_{(-\infty, \gamma(s)]}(B(s)) \, ds can be found in the tables of [6]. For a correct use of these tables, it should be recalled that a Brownian bridge can be identified with a Brownian motion conditioned to take the value 0 at time 1. Hence, setting

(14)    F_\varepsilon(x, t; z, \xi) := E\Big[ I_{(-\varepsilon, \varepsilon)}\big(W_1^{(x,t)}\big) \exp\Big( -z \int_t^1 I_{(-\infty, \gamma(s)]}\big(W_s^{(x,t)}\big) \, ds \Big) \Big],

where W_s^{(x,t)} stands for a Brownian motion indexed by s in [t, 1] and satisfying W_t^{(x,t)} = x, the Laplace–Stieltjes transform of the p.d. of the occupation time \int_{(0,1)} I_{(-\infty, \gamma(s)]}(B(s)) \, ds is given by

(15)    L(z; \xi) := \lim_{\varepsilon \to 0^+} \frac{F_\varepsilon(0, 0; z, \xi)}{P_\theta\{ W_1^{(0,0)} \in (-\varepsilon, \varepsilon) \}}.

Example 5.1. Consider the problem of determining (15) when F_\theta is defined by (12). In this case, we have \gamma(s) = \xi(1-s)/\sigma_0, and (14), through [6, Part II, section 2.1.5.8], leads to

L(z; \tilde\xi) = \frac{\tilde\xi}{\sqrt{2\pi}}\, e^{\tilde\xi^2/2} \int_0^1 \frac{(1 - e^{-zs})\, e^{-z(1-s)}\, e^{-\tilde\xi^2/(2(1-s))}}{z\, s^{3/2} (1-s)^{3/2}} \, ds    \Big(\tilde\xi := \frac{\xi}{\sigma_0}\Big)
= \frac{\tilde\xi}{\sqrt{2\pi}}\, e^{\tilde\xi^2/2} \int_0^1 L(t, z)\, \frac{e^{-\tilde\xi^2/(2t)}}{t^{3/2}(1-t)^{3/2}} \, dt,

where L(t, z) is the value at z of the Laplace transform of x \mapsto I_{(t,1)}(x). At this point, it is easy to see that the corresponding inverse Laplace transform—i.e., the density of the absolutely continuous component of the law of the occupation time—is

x \mapsto \frac{\tilde\xi e^{\tilde\xi^2/2}}{\sqrt{2\pi}} \int_0^x \frac{e^{-\tilde\xi^2/(2t)}}{t^{3/2}(1-t)^{3/2}} \, dt    (0 < x < 1).

Since the value of the integral of this function over (0,1) is 1, it actually is the probability density function of the occupation time, and we can use (7) to determine the limiting distribution of the centered and normalized sequence \sqrt{n}(\hat\theta_n - \theta) of MDEs, i.e., for \xi > 0,

P_\theta\{\hat h \le \xi\} = 1 - \frac{\tilde\xi e^{\tilde\xi^2/2}}{\sqrt{2\pi}} \int_0^{1/2} \Big( \int_0^x \frac{e^{-\tilde\xi^2/(2t)}}{t^{3/2}(1-t)^{3/2}} \, dt \Big) dx
= 1 - \frac{\tilde\xi e^{-\tilde\xi^2/2}}{2\sqrt{2\pi}} \int_0^{+\infty} \frac{y\, e^{-y\tilde\xi^2/2}}{(1+y)^{3/2}} \, dy
= 1 - \frac{\xi e^{-\xi^2/(2\sigma_0^2)}}{2\sigma_0\sqrt{2\pi}}\, U\Big(2, \frac{3}{2};\ \frac{\xi^2}{2\sigma_0^2}\Big)
= 1 - \frac{e^{-\xi^2/(2\sigma_0^2)}}{2\sqrt{\pi}}\, U\Big(\frac{3}{2}, \frac{1}{2};\ \frac{\xi^2}{2\sigma_0^2}\Big),

where, as usual, U stands for the confluent hypergeometric function of the second kind (also known as the Tricomi function); see, for example, [18]. Combining this fact with Proposition 5.1, we can conclude with the following proposition.

Let \hat\theta_n be the MDE of \mu in the exponential model (12). Then the limiting p.d. function (as n \to +\infty) of \sqrt{n}(\hat\theta_n - \theta) is given by the absolutely continuous function

\xi \mapsto I_{(-\infty, 0]}(\xi)\, \frac{e^{-\xi^2/(2\sigma_0^2)}}{2\sqrt{\pi}}\, U\Big(\frac{3}{2}, \frac{1}{2};\ \frac{\xi^2}{2\sigma_0^2}\Big) + I_{(0, +\infty)}(\xi)\, \Big[ 1 - \frac{e^{-\xi^2/(2\sigma_0^2)}}{2\sqrt{\pi}}\, U\Big(\frac{3}{2}, \frac{1}{2};\ \frac{\xi^2}{2\sigma_0^2}\Big) \Big].

Furthermore, such a p.d. has probability density function

\xi \mapsto \frac{2^{3/2}}{\pi\sigma_0^2}\, |\xi| \int_0^1 \frac{\sqrt{x}}{(1+x)\sqrt{(1-x)^3}}\, \exp\Big( -\frac{1+x}{2\sigma_0^2(1-x)}\, \xi^2 \Big) \, dx,

and its moment of order 2k is

\sigma_0^{2k}\, \frac{1 \cdot 3 \cdots (2k-1)}{k+1}    (k = 1, 2, \dots),    and 1 for k = 0.
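The second equality in the distribution chain of Example 5.1 amounts to the substitution t = 1/(2 + y) in the inner integral; that substitution is our reading of the step, and it can be sanity-checked numerically (with ξ̃ = 1) by comparing the two quadratures:

```python
import math

def lhs(xi=1.0, n=100_000):
    """int_0^{1/2} (1/2 - t) * exp(-xi^2/(2t)) / (t(1-t))^{3/2} dt  (midpoint rule),
    i.e. the double integral over (0,1/2) x (0,x) with the x-integration done first."""
    h = 0.5 / n
    total = 0.0
    for i in range(n):
        t = (i + 0.5) * h
        total += (0.5 - t) * math.exp(-xi * xi / (2.0 * t)) / (t * (1.0 - t)) ** 1.5
    return total * h

def rhs(xi=1.0, n=100_000, cutoff=200.0):
    """(exp(-xi^2)/2) * int_0^inf y * exp(-y*xi^2/2) / (1+y)^{3/2} dy (truncated)."""
    h = cutoff / n
    total = 0.0
    for i in range(n):
        y = (i + 0.5) * h
        total += y * math.exp(-y * xi * xi / 2.0) / (1.0 + y) ** 1.5
    return total * h * math.exp(-xi * xi) / 2.0

print(lhs(), rhs())  # the two values should agree closely
```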

Example 5.2. In order to determine (15) for the uniform model (13), first observe that in this case one has \gamma(s) = \tilde\xi := \xi/\sigma_0. Thus, by starting from 1.1.5.8 in [6, Part II], the same procedure as in Example 5.1 can be applied to determine (7). One gets

L(z, \tilde\xi) = e^{-z}\big(1 - e^{-2\tilde\xi^2}\big) + \sqrt{\frac{2}{\pi}}\, \frac{\tilde\xi}{z} \int_0^1 \frac{(1 - e^{-zt})\, e^{-z(1-t) - 2\tilde\xi^2/(1-t)}}{(t(1-t))^{3/2}} \, dt,

i.e., the Laplace–Stieltjes transform of the p.d.

A \mapsto \big(1 - e^{-2\tilde\xi^2}\big)\, \delta_1(A) + \tilde\xi \sqrt{\frac{2}{\pi}} \int_A \Big( \int_0^x \frac{e^{-2\tilde\xi^2/y}}{(y(1-y))^{3/2}} \, dy \Big) dx    (A \in \mathcal{B}(R))
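The δ₁ atom has an independent interpretation worth checking: its mass 1 − e^{−2ξ̃²} is exactly P{max_s B(s) ≤ ξ̃}, the classical distribution of the maximum of a Brownian bridge (the bridge spends the whole unit of time below the barrier precisely when it never crosses it). A quick seeded simulation of ours:

```python
import math
import random

random.seed(5)
STEPS, REPS, XI = 400, 4000, 0.8
hits = 0
for _ in range(REPS):
    cum, w = [], 0.0
    for _ in range(STEPS):
        w += random.gauss(0.0, math.sqrt(1.0 / STEPS))
        cum.append(w)
    w1 = cum[-1]
    # maximum of the bridge B(s) = W(s) - s*W(1) over the grid
    peak = max(cum[i] - ((i + 1) / STEPS) * w1 for i in range(STEPS))
    if peak <= XI:
        hits += 1
print(hits / REPS, 1.0 - math.exp(-2 * XI * XI))  # simulated vs. exact atom mass
```

The discrete-time maximum slightly undershoots the continuous one, so the simulated probability sits a little above the exact value; both agree to a few percent at this resolution.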


In other words, this is the p.d. of the occupation time of a Brownian bridge determined by the constant barrier \gamma(s) \equiv \tilde\xi. The following statement therefore holds true.

Let \hat\theta_n be the MDE of \mu in the uniform model (13). Then the absolutely continuous function

\xi \mapsto I_{(-\infty, 0]}(\xi)\, \frac{1}{2\sqrt{\pi}}\, e^{-4\xi^2/\sigma_0^2}\, U\Big(\frac{3}{2}, \frac{1}{2};\ \frac{2\xi^2}{\sigma_0^2}\Big) + I_{(0, +\infty)}(\xi)\, \Big[ 1 - \frac{1}{2\sqrt{\pi}}\, e^{-4\xi^2/\sigma_0^2}\, U\Big(\frac{3}{2}, \frac{1}{2};\ \frac{2\xi^2}{\sigma_0^2}\Big) \Big]

is the limiting p.d. function (as n \to +\infty) of \sqrt{n}(\hat\theta_n - \theta), under P_\theta. Moreover, the moment of order 2k of this p.d. is

\sigma_0^{2k}\, \frac{1 \cdot 3 \cdots (2k-1)}{8^k (k+1)}\, {}_2F_1\Big(\frac{3}{2}, k;\ k+2;\ \frac{1}{2}\Big)    (k = 1, 2, \dots),    and 1 for k = 0,

where, as usual, {}_2F_1 denotes the Gauss hypergeometric function.

In [6], the transformation (14) with \gamma(s) = \tilde\xi(1-s)—see Example 5.1—is evaluated by means of the Cameron–Martin–Girsanov transformation of measure, whereas, if \gamma(s) = \tilde\xi—as in Example 5.2—it is determined through the Feynman–Kac formula. In fact, the direct application of such a formula to the evaluation of (14) for general forms of \gamma can be problematic, owing to the violation of some regularity conditions required by the usual formulation of the Feynman–Kac theorem; see, for example, [17, subsection 5.6.B]. So, in order to obtain a systematic treatment of the p.d. of MDEs of location parameters in (1), we study a suitable extension of the Feynman–Kac theorem which yields (14) for general forms of \gamma, and we develop suitable analytical methods to produce explicit forms of the solutions of the corresponding differential problem. The results of this research will appear in a forthcoming paper.

The main objective of the rest of the present section is to study the robustness of MDEs of location parameters. First, let us specify how we intend to measure robustness to the occurrence of anomalous observations. We represent the intrusion of "bad" observations through the p.d. function x \mapsto H(x - \xi) =: H_\xi(x), which is degenerate at \xi.

can be interpreted as a measure of asymptotic robustness of an MDE of a location parameter. When F satisfies the assumptions of Proposition 4.1, these estimators are

182

F. BASSETTI AND E. REGAZZINI

resistant to the intrusion of “bad” observations according to the following proposition, in which one sets,  y ∈ (0, 1) , Zξ (y) := I(0,F (ξ−μ)) (y) Z + (y) + I[F (ξ−μ),1) (y) Z − (y)  y y−1 and Z − (y) := y ∈ (0, 1) , Z + (y) := f (F −1 (y)) f (F −1 (y))

and hξ for any element of arg minh (0,1) |h − Zξ (y)| dy for some fixed ξ in R. Proposition 5.2. Let the same assumptions as in Proposition 4.1 be in force with σ = 1. Moreover, assume there is δ > 0 such that y 1−y + sup sup −1 (y)) y∈(1−δ,1) f (F −1 (y)) y∈(0,δ) f (F −1 is finite. Then, for

any ξ = μ + F (1/2), one has that (a) arg minh (0,1) |h − Zξ (y)| dy is a singleton; (b) ξ → hξ is a bounded function; (c) με = μ + hξ ε + o(ε) (ε → 0+ ). It is worth noting that, in view of (c), ξ → hξ can be viewed as the influence function of the MDE of a location parameter. Proof of Proposition 5.2. Define Gε (μ ) := G(Fμ , (1 − ε) Fμ + εHξ ) and set Vε (h) := 1ε Gε (μ + hε). Then hε := (με − μ)/ε is an absolute minimum point of Vε . After noticing that (a) is plainly true if ξ = μ + F −1 (1/2), we continue by proving (c), i.e., με −μ = hξ ε+o(ε) as ε → 0+ , which is tantamount to showing that hε = hξ +o(1) as ε → 0+ . First, by the triangle inequality and the minimum property of με , one has  G(Fμ , Fμε )  2G (1 − ε) Fμ + εHξ , Fμ = 2 εG(Fμ , Hξ )

and, since G(Fμ , Fμε ) = |μ − με |, one obtains |μ − με |ε  2G(Fμ , Hξ ). rewrite Vε as    1 εh − Yε (y) dy, Vε (h) = ε (0,1) with Yε defined by

Next,

    y −1 − F (y) + I(αε ,αε +ε] (y) ξ − μ − F −1 (y) Yε (y) = I(0,αε ) (y) F 1−ε     −1 y − ε −1 − F (y) + I(αε +ε,1) (y) F 1−ε 

−1



with αε := (1 − ε) F (ξ − μ). Now, by combining the mean-value theorem and the dominated convergence theorem, it can be shown that Vε (h) converges to (0,1) |h − Zξ (y)| dy =: V0 (h), uniformly with respect to h on every compact interval, as ε → 0+ . Then, in view of (a), arg minh V0 (h) is a singleton and hε → hξ := arg minh V0 (h) as ε → 0+ . Finally, note that the minimum point hξ must coincide with the median of Zξ (·), when Zξ is thought of as a random variable on ((0, 1), B(0, 1), λ). Hence, since  y ∈ (0, 1) , Z − (y)  Zξ (y)  Z + (y) we can write (17)

median(Z − )  hξ  median(Z + )

MINIMUM DISSIMILARITY ESTIMATORS

183

and this last inequality implies that ξ → hξ is bounded, as asserted in (b). Proposition 5.2 is proved. The last proposition shows that, from the point of view of robustness, an MDE of a location parameter is better than the empirical mean, at least for large values of |ξ|. As a matter of fact, if R x dF (x) = 0, the influence function of the empirical mean, evaluated at Fμ , has the form of the unbounded function ξ → (ξ + μ). Regarding the comparison with more robust estimators, such as the median, one can consider the special case in which Z − and Z + are strictly monotone functions, with the proviso that Z − must be increasing if Z + is decreasing. In such a case, (Z − , Z + ) is said to be an admissible pair. In fact, when (Z − , Z + ) is an admissible pair of strictly monotone functions, it is clear that  −1   − + −1 1 median(Z ) = − median(Z ) = − 2f F 2 holds true. Moreover, for any ξ = μ + F −1 ( 12 ),    −1 1 − 2f F −1 I(−∞,F −1 ( 12 )+μ) (ξ) 2    −1 −1 1 + 2f F I(F −1 ( 12 )+μ,+∞) (ξ) 2 is the well-known expression of the influence function of the empirical median seen as an estimator of a location parameter. See, for instance, both [15, Example 3.1] and [16, Chapter III, Example 1]. Combining these last facts with (17) gives the following proposition which states the superiority—from the point of view of robustness—of MDEs with respect to the median of the empirical distribution. Proposition 5.3. If, in addition to the hypotheses assumed in Proposition 5.2, (Z − , Z + ) is an admissible pair of strictly monotone functions on (0, 1), then |hξ | 

1 . 2f (F −1 ( 12 ))

Moreover, there is some ξ for which the inequality is strict. As for the hypothesis that both Z − and Z + are monotone functions, it should be noted that it is tantamount to assuming that the functions (1 − F ) and F are log-concave or log-convex, and it is worth recalling that log-concavity is passed from densities to p.d. functions; see [20]. Since there is a large number of commonly used p.d.’s with log-concave densities (see [1]) the condition by which Z − and Z + must be monotone functions appears not to be seriously restrictive. 6. Monte Carlo approximations. Proposition 4.1 states that, under a few regularity conditions, MDEs θˆn of unknown location-scale parameters, suitably centered and rescaled, converge in distribution to the random (unique) absolute minimum point h of     σB(x)  −1   h −→  f (F −1 (x)) − h1 − h2 F (x) dx, (0,1) where B denotes a standard Brownian bridge. Now, we use this characterization to get—by Monte Carlo methods—numerical approximations of the p.d. function of h, i.e., the asymptotic distribution of θˆn .


1

1

0.9

0.9

0.8

0.8

0.7

0.7

0.6

0.6

0.5

0.5

0.4

0.4

0.3

0.3

0.2

0.2

0.1

0.1

0 −4

−3

(a)

−2

−1

−1/2 x Fθ(x)=(2π) ∫−∞

0

1

2

3

0 −4

4

−3

2

−2

−1

0

−1/2 x ∫−∞

exp(−(t−μ) /2) dt, θ=(μ,1)

(b) Fθ(x)=(2π)

1

2

2

3

4

2

exp(−t /(2 σ ))/σ dt, θ=(0,σ)

1

0.9

0.8

0.7

0.6

0.5

0.4

0.3

0.2

0.1

0 −5

−4

−3

−2

−1

0

1

2

3

(c) 1−F (x)=exp(−x/σ) (x >0), θ=(0,σ)

4

5

θ

√ √ Fig. 1. Asymptotic p.d. functions of n( μn − μ) with σ = 1 in (a) and of n( σn − σ)/σ with μ = 0 in (b) and (c), when θˆn is either the MDE (solid curve) or the MLE (dashed curve).

In particular, we employ such an approximation to determine asymptotic confidence bounds for the unknown parameters. Some examples of a limiting distribution function, for a few particular forms of F_θ, are shown as solid curves in Figure 1, compared with the asymptotic distribution of the corresponding maximum likelihood estimator (MLE) (dashed curves). Apropos of these figures, it is worth noting that (6) implies that the limiting p.d. of √n(μ̂_n − μ) does not depend on μ when μ̂_n is the MDE of the location parameter μ, and that (9) entails that the limiting p.d. of √n(σ̂_n − σ)/σ does not depend on σ when σ̂_n is the MDE of the scale parameter σ. See the notation of Proposition 5.1.

The robustness of MDEs—proved in the previous section—is attained at some sacrifice of efficiency. As an example, we have computed both the asymptotic variance of MLEs and that of MDEs for some specific models; the results are displayed in Table 1. These results, combined with the statements at the end of the previous section, seem to confirm the hypothesis that the trade-off between efficiency and robustness of MDEs is favorable.

In order to obtain confidence intervals or regions, with confidence level (1 − α), from the asymptotic distribution of some MDE, we must recall that the limiting laws

Table 1
Asymptotic variance of MDE versus MLE.

         Location in the        Scale in the           Scale in the
         normal model (σ = 1)   normal model (σ = 1)   exponential model (σ = 1)
MDE      1.08                   0.64                   1.07
MLE      1                      0.5                    1

Table 2
Confidence bounds for MDE and MLE.

Confidence   Type of      Location in the        Scale in the    Scale in the
level        estimator    normal model (σ = 1)   normal model    exponential model
0.995        MDE          2.8863                 2.2464          2.8944
             MLE          2.8070                 1.9849          2.8070
0.99         MDE          2.6606                 2.0788          2.6835
             MLE          2.5758                 1.8214          2.5758
0.95         MDE          2.0303                 1.5686          2.0452
             MLE          1.9599                 1.3859          1.9599

of √n(μ̂_n − μ) and of √n(σ̂_n − σ)/σ are independent of μ and σ, respectively. Therefore, (asymptotic) confidence intervals for the location parameter μ, with confidence level (1 − α), have the same form—for sufficiently large values of n—as the intervals based on MLEs, i.e., (μ̂_n − c/√n, μ̂_n + c/√n) with c a positive constant determined by P_(μ,σ₀){ĥ > c} = α/2, where ĥ is the random minimizer of (6) with σ = σ₀. Analogously, (asymptotic) confidence intervals for the scale parameter σ, with confidence level (1 − α), have—for sufficiently large values of n—the form

    ( σ̂_n / (1 + c/√n),  σ̂_n / (1 − c/√n) )

with 0 < c < √n determined by P_(μ,1){ĥ > c} = α/2, where ĥ is the random minimizer of (9) with σ = 1. Values of c, corresponding to a few typical confidence levels and to a few different models, are presented in Table 2.

In connection with location-scale parameters, we have determined the ellipse of inertia, with confidence level (1 − α), for the (simulated) asymptotic distribution of the MDE of the parameter θ = (μ, σ) of the Gaussian model. Writing the ellipse of inertia as

    E_α = {(x, y) ∈ R²: a x² + b y² ≤ c_α},    a = 1.0419,  b = 0.6149

(see, for example, [9]) with c_α satisfying

    P_(0,1){ ĥ ∈ E_α } = 1 − α,

the region

    { (μ′, σ′): a(μ′ − μ̂_n)² + (b − c_α/n) ( σ′ − b σ̂_n / (b − c_α/n) )² ≤ (c_α b σ̂_n²/n) / (b − c_α/n) }

includes the true value of the parameter θ with probability (1 − α) with respect to the law of ĥ under P_θ. As an example, for (1 − α) equal to 0.95 (0.99, respectively) we have c_α = 2.6231 (c_α = 3.99, respectively).
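As a usage sketch (not from the paper), the two interval formulas above can be coded directly. The default constants below are the 0.95-level MDE values of c taken from Table 2 for the normal model; the function names are illustrative.

```python
import math

def mde_location_ci(mu_hat, n, c=2.0303):
    """Asymptotic confidence interval (mu_hat - c/sqrt(n), mu_hat + c/sqrt(n))
    for the location parameter; c = 2.0303 is the 0.95-level MDE constant
    for location in the normal model with sigma = 1 (Table 2)."""
    half = c / math.sqrt(n)
    return mu_hat - half, mu_hat + half

def mde_scale_ci(sigma_hat, n, c=1.5686):
    """Asymptotic confidence interval
    (sigma_hat / (1 + c/sqrt(n)), sigma_hat / (1 - c/sqrt(n)))
    for the scale parameter; c = 1.5686 is the 0.95-level MDE constant for
    scale in the normal model (Table 2).  Requires 0 < c < sqrt(n)."""
    if not 0.0 < c < math.sqrt(n):
        raise ValueError("need 0 < c < sqrt(n)")
    r = c / math.sqrt(n)
    return sigma_hat / (1.0 + r), sigma_hat / (1.0 - r)

lo, hi = mde_location_ci(0.1, n=400)     # interval centered at mu_hat = 0.1
slo, shi = mde_scale_ci(1.0, n=400)      # interval containing sigma_hat = 1.0
```

Note that the scale interval is not symmetric about σ̂_n, reflecting the rescaling by σ in the limiting law of √n(σ̂_n − σ)/σ.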

REFERENCES

[1] M. Bagnoli and T. Bergstrom, Log-Concave Probability and Its Applications, Technical Report, Department of Economics, University of California, Santa Barbara, CA, 2004. Available online at http://repositories.cdlib.org/ucsbecon/bergstrom/1989D.
[2] F. Bassetti, A. Bodini, and E. Regazzini, Consistency of Minimum Kantorovich Distance Estimators, Technical Report 4-MI, Consiglio Nazionale delle Ricerche, Istituto di Matematica Applicata e Tecnologie Informatiche, Milano, Italy, 2004. Available online at http://www.mi.imati.cnr.it/iami/reports.html
[3] N. Belili, A. Bensaï, and H. Heinich, Estimation basée sur la fonctionnelle de Kantorovich et la distance de Lévy, C. R. Acad. Sci. Paris, 328 (1999), pp. 423–426.
[4] S. Bertino, Gli indici di dissomiglianza e la stima dei parametri, in Studi di probabilità, statistica e ricerca operativa in onore di Giuseppe Pompilj, Edizioni Oderisi, Gubbio, 1971, pp. 187–202.
[5] D. Birkes and Y. Dodge, Alternative Methods of Regression, Wiley, New York, 1993.
[6] A. N. Borodin and P. Salminen, Handbook of Brownian Motion—Facts and Formulae, Birkhäuser, Basel, 2002.
[7] L. D. Brown and R. Purves, Measurable selections of extrema, Ann. Statist., 1 (1973), pp. 902–912.
[8] E. W. Cheney and D. E. Wulbert, The existence and unicity of best approximations, Math. Scand., 24 (1969), pp. 113–140.
[9] H. Cramér, Mathematical Methods of Statistics, Princeton University Press, Princeton, NJ, 1946.
[10] E. del Barrio, E. Giné, and C. Matrán, Central limit theorems for the Wasserstein distance between the empirical and the true distributions, Ann. Probab., 27 (1999), pp. 1009–1071.
[11] Y. Dodge, ed., L1-Statistical Procedures and Related Topics, Institute of Mathematical Statistics, Hayward, CA, 1997.
[12] P. Embrechts, L. C. G. Rogers, and M. Yor, A proof of Dassios' representation of the α-quantile of Brownian motion with drift, Ann. Appl. Probab., 5 (1995), pp. 757–767.
[13] C. Gini, Di una misura della dissomiglianza fra due gruppi di quantità e applicazioni allo studio delle relazioni statistiche, Atti R. Ist. Veneto Sci. Lett. e Arti, 73 (1914), pp. 185–213.
[14] G. Hooghiemstra, On explicit occupation time distributions for Brownian processes, Statist. Probab. Lett., 56 (2002), pp. 405–417.
[15] P. J. Huber, Robust Statistics, Wiley, New York, 1981.
[16] P. J. Huber, Robust Statistical Procedures, SIAM, Philadelphia, PA, 1996.
[17] I. Karatzas and S. E. Shreve, Brownian Motion and Stochastic Calculus, Springer-Verlag, New York, 1991.
[18] W. Magnus, F. Oberhettinger, and R. P. Soni, Formulas and Theorems for the Special Functions of Mathematical Physics, Springer-Verlag, New York, 1966.
[19] S. Portnoy and R. Koenker, The Gaussian hare and the Laplacian tortoise: Computability of squared-error versus absolute-error estimators, Statist. Sci., 12 (1997), pp. 279–300.
[20] A. Prékopa, On logarithmic concave measures and functions, Acta Sci. Math. (Szeged), 34 (1973), pp. 335–343.
[21] S. T. Rachev, Probability Metrics and the Stability of Stochastic Models, Wiley, Chichester, 1991.
[22] L. Takács, The distribution of the sojourn time for the Brownian excursion, Methodol. Comput. Appl. Probab., 1 (1999), pp. 7–28.
[23] A. W. van der Vaart and J. A. Wellner, Weak Convergence and Empirical Processes, Springer-Verlag, New York, 1996.
[24] C. Villani, Topics in Optimal Transportation, AMS, Providence, RI, 2003.
