EMPIRICAL TRANSFORM ESTIMATION

Qiwei Yao and Byron J. T. Morgan
Institute of Mathematics and Statistics
University of Kent at Canterbury
Canterbury, Kent CT2 7NF, UK
Abstract

The paper starts with a brief review of methods of model-fitting using empirical transforms. We then present a method for estimating the parameters in indexed stochastic models via a least-squares approach based on empirical transforms. Asymptotic approximations are derived for the mean squared error and for the distribution of the resulting estimator. The explicit expression for the mean squared error provides a natural way of selecting the transform variable. A common finding when multi-parameter models are fitted using transforms is that optimal performance results from equating the elements of the transform vector. We provide a natural condition, and indicate why this result occurs under this condition. Numerical examples illustrate the performance of the new method.
Keywords: Empirical transform, growth model, indexed stochastic model, Laplace transform, least squares estimator, mean squared errors, reliability model.
1 Introduction

In a variety of applications for which standard methods of model-fitting can be difficult, Laplace transforms of aspects of the model may take simple forms. Examples can be found in Darling and Siegert (1953), Leslie (1969), Fienberg (1974), Kingman (1963), Basawa (1974), McDunnough and Wolfson (1980), Morgan (1982), Morgan (1992, p.203), Wise (1989), Trajstman and Tweedie (1982), and Frome and Yakatan (1980). This has led to ways of fitting stochastic models by matching empirical and theoretical transforms. Work in this area appears to have been initiated by Parzen (1962), and may be dichotomised according to whether we are dealing with independent, identically distributed random variables, or random variables indexed in some way, e.g., by time. The primary focus in this paper will be on the latter case, but both applications have much in common.
1.1 The case of independent, identically distributed random variables

Consider initially the i.i.d. case, when we have a set $\{X_1, \ldots, X_n\}$ of independent, identically distributed random variables from a distribution with parameters $\theta$. We then define the empirical characteristic function as
$$\hat\phi(s) = \frac{1}{n} \sum_{j=1}^{n} e^{isX_j},$$
a consistent estimator of the characteristic function $\phi(s; \theta) = E[e^{isX_1}]$. Similar definitions apply for empirical Laplace transforms, Mellin transforms, moment generating functions, etc. We can estimate $\theta$ by seeking in some way a best match between $\hat\phi(s)$ and $\phi(s; \theta)$. Minimum distance methods based on cumulative distribution functions were outlined by Wolfowitz (1953, 1957), and Leslie (1970) and Leslie and McGilchrist (1972) employed such methods for Laplace transforms and moment generating functions. In terms of characteristic functions, we might estimate $\theta$ by the value which minimises
$$h(\theta) = \int_{-\infty}^{\infty} |\hat\phi(s) - \phi(s; \theta)|^2 w(s)\, ds, \qquad (1.1)$$
where w(s) is a suitable weight function.
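To make the construction concrete, the following minimal Python sketch computes $\hat\phi(s)$ and minimises a numerical version of (1.1) for a normal location model. The simulated sample, the weight $w(s) = e^{-s^2}$, and the optimiser are illustrative assumptions, not part of the original development.

```python
# A minimal sketch of minimum-distance estimation via the empirical
# characteristic function.  The N(mu, 1) model, the simulated sample,
# the weight w(s) = exp(-s^2) and the optimiser are illustrative
# assumptions.
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=200)   # sample X_1, ..., X_n

def ecf(s):
    """Empirical characteristic function (1/n) sum_j exp(i s X_j)."""
    return np.mean(np.exp(1j * s * x))

def h(mu):
    """Weighted L2 distance (1.1) between empirical and model c.f."""
    integrand = lambda s: (abs(ecf(s) - np.exp(1j * s * mu - s**2 / 2)) ** 2
                           * np.exp(-s**2))
    return quad(integrand, -np.inf, np.inf)[0]

mu_hat = minimize_scalar(h, bounds=(-10.0, 10.0), method="bounded").x
print(mu_hat)   # close to the true mean 2.0
```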
Example 1.1 Suppose we have a random sample $X_1, \ldots, X_n$ from $N(\mu, \sigma^2)$, where $\sigma^2$ is known, and we want to estimate $\mu$. Then
$$h(\mu) = \int_{-\infty}^{\infty} \Big| n^{-1} \sum_{j=1}^{n} e^{isX_j} - e^{is\mu - \sigma^2 s^2/2} \Big|^2 w(s)\, ds,$$
and choosing $w(s) = e^{-s^2}$ results in
$$h(\mu) = \frac{\sqrt{\pi}}{n^2} \sum_i \sum_j e^{-\frac14 (X_i - X_j)^2} + \Big(\frac{\pi}{1+\sigma^2}\Big)^{1/2} - \frac{2}{n} \Big(\frac{\pi}{1+\sigma^2/2}\Big)^{1/2} \sum_j \exp\Big\{ -\frac{(X_j - \mu)^2}{4(1+\sigma^2/2)} \Big\}.$$
Setting $dh/d\mu = 0$ results in
$$\sum_j (X_j - \mu) \exp\Big\{ -\frac{(X_j - \mu)^2}{4(1+\sigma^2/2)} \Big\} = 0,$$
the estimating equation for an M-estimator. Because of the boundedness of the characteristic function, one might expect the estimators that result to be more robust but less efficient than maximum likelihood estimators. In this example, the asymptotic relative efficiency of $\hat\mu$ is
$$\Big\{ \big(1 + 2\sigma^2 + \tfrac34 \sigma^4\big) \big/ \big(1 + 2\sigma^2 + \sigma^4\big) \Big\}^{3/2}.$$
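The estimating equation above can be solved numerically by any one-dimensional root finder; the sketch below is one such illustration, with the simulated sample and the choice of scipy's brentq being assumptions of ours rather than part of the example.

```python
# A sketch of solving the estimating equation of Example 1.1 for mu.
# The simulated sample and the use of scipy's brentq root finder are
# illustrative choices.
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(1)
sigma2 = 1.0                                   # known variance sigma^2
x = rng.normal(3.0, np.sqrt(sigma2), size=100)

def psi(mu):
    """LHS of the M-estimating equation obtained from dh/dmu = 0."""
    w = np.exp(-(x - mu) ** 2 / (4.0 * (1.0 + sigma2 / 2.0)))
    return np.sum((x - mu) * w)

mu_hat = brentq(psi, x.min(), x.max())         # robust alternative to the mean
print(mu_hat)
```

Note that $\psi$ changes sign between $\min_j X_j$ and $\max_j X_j$, so a bracketing root finder is guaranteed to succeed.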
In the early work by Leslie and McGilchrist, numerical integration was employed in tentative examples using small samples from Normal and Gamma distributions, but there was little theoretical development. Press (1972) demonstrated strong consistency and asymptotic normality, with a view especially to use for parameter estimation in connection with the stable laws. Using the same $w(s)$ as in the above example, Paulson, Holcombe and Leitch (1975) carried out the integration for the stable laws, using a 20-point Hermitian quadrature to obtain $h(\theta)$, which was then minimised numerically. The efficiency of this procedure was shown by Heathcote (1977) to be generally less than unity, and he suggested that it would be better to replace $e^{-s^2}$ by a function with small weight at zero. He suggested also that, rather than use a weight function distributed over the entire real line, there could arise examples for which matching of $\hat\phi(s)$ and $\phi(s; \theta)$ for a small number of values of $s$ would be preferable. This corresponds to (1.1) with a weight function $w(s)$ that is zero except for a finite number of $s$-values. Quandt and Ramsey (1978) used such an approach for fitting a mixture of two normal distributions. Based on moment generating functions, their technique sought $\{\alpha, \mu_1, \mu_2, \sigma_1, \sigma_2\}$ to minimise
$$\sum_{j=1}^{k} \Big\{ \frac{1}{m} \sum_{i=1}^{m} e^{s_j X_i} - \alpha \exp(\mu_1 s_j + \sigma_1^2 s_j^2/2) - (1-\alpha) \exp(\mu_2 s_j + \sigma_2^2 s_j^2/2) \Big\}^2$$
for $k \geq 5$; see also Everitt and Hand (1981, Chapter 2). For this application a weighted least squares approach was provided by Schmidt (1982), who also considered the optimal choice of the $\{s_j\}$ with regard to maximum efficiency. He found it best to select all the $\{s_j\}$ close together, and we shall comment on this finding later in §4. Simply equating $\hat\phi(s)$ and $\phi(s; \theta)$ for a finite set $\{s_k\}$ can result in explicit estimators, as pointed out by Press (1972, 1975), and such a modified method-of-moments has been generalised and examined by Feuerverger and McDunnough (1981a, b; 1984).
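A hedged sketch of the Quandt and Ramsey construction is given below for a two-component normal mixture; the simulated data, the particular transform points $s_j$, the starting values, and the optimiser are all illustrative assumptions.

```python
# A sketch of the Quandt-Ramsey moment-generating-function approach for
# a mixture of two normals.  The simulated data, the transform points
# s_j, the starting values and the optimiser are illustrative
# assumptions.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
m = 500
z = rng.random(m) < 0.4                        # true alpha = 0.4
x = np.where(z, rng.normal(0.0, 1.0, m), rng.normal(3.0, 1.5, m))

s = np.array([0.1, 0.2, 0.3, 0.4, 0.5])        # k = 5 transform values
emp = np.array([np.mean(np.exp(sj * x)) for sj in s])   # empirical MGF

def objective(p):
    """Sum of squared differences between empirical and mixture MGF."""
    a, mu1, mu2, sg1, sg2 = p
    mod = (a * np.exp(mu1 * s + sg1**2 * s**2 / 2)
           + (1 - a) * np.exp(mu2 * s + sg2**2 * s**2 / 2))
    return np.sum((emp - mod) ** 2)

res = minimize(objective, x0=[0.5, -0.5, 2.5, 1.0, 1.0],
               bounds=[(0.01, 0.99), (-5, 5), (-5, 5), (0.1, 5), (0.1, 5)])
print(res.x)   # estimates of (alpha, mu1, mu2, sigma1, sigma2)
```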
Example 1.2 For the two-parameter Cauchy distribution with p.d.f.
$$f(x) = \frac{\beta}{\pi\{\beta^2 + (x - \alpha)^2\}},$$
simply using a single value of $s$ and equating real and imaginary parts of the characteristic functions results in the explicit estimators
$$\hat\alpha = \frac{1}{s} \arctan\Big\{ \frac{\sum_j \sin(sX_j)}{\sum_j \cos(sX_j)} \Big\}, \qquad \hat\beta = -\frac{1}{s} \log\Big| \frac{1}{n} \sum_{j=1}^{n} e^{isX_j} \Big|,$$
for any fixed $s > 0$.
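These estimators follow from matching the empirical characteristic function to $\phi(s) = \exp(is\alpha - \beta|s|)$; the following sketch illustrates the computation, with the simulated Cauchy sample and the value $s = 0.1$ being illustrative choices.

```python
# A sketch of the explicit single-s estimators for the Cauchy location
# and scale parameters.  The simulated sample and the value s = 0.1 are
# illustrative choices.
import numpy as np

rng = np.random.default_rng(3)
alpha, beta = 1.0, 2.0
x = alpha + beta * np.tan(np.pi * (rng.random(1000) - 0.5))  # Cauchy sample

s = 0.1
C = np.mean(np.cos(s * x))                 # real part of the empirical c.f.
S = np.mean(np.sin(s * x))                 # imaginary part
alpha_hat = np.arctan2(S, C) / s           # phase of exp(i s alpha)
beta_hat = -np.log(np.hypot(C, S)) / s     # modulus exp(-beta s)
print(alpha_hat, beta_hat)
```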
Example 1.4 concerns the growth model $Y_j = \alpha - \beta e^{-\gamma t_j} + \epsilon_j$, for which explicit estimators follow from equating $G(\alpha, \beta, \gamma; s_k) = G_n(s_k)$ for $s_k > 0$, $k = 1, 2, 3$, where $G_n(\cdot)$ is given as in (1.7), and
$$G(\alpha, \beta, \gamma; \omega) = \int_0^\infty (\alpha - \beta e^{-\gamma t}) e^{-\omega t}\, dt = \frac{\alpha}{\omega} - \frac{\beta}{\gamma + \omega},$$
(see Leedow and Tweedie 1983). For simplicity, we consider the following two-parameter version of the above growth model
$$Y_j = \alpha(1 - e^{-\beta t_j}) + \epsilon_j, \qquad j = 1, \ldots, n, \qquad (1.9)$$
where $\alpha$ and $\beta > 0$ are unknown parameters, and $\sigma > 0$ plays the role of a nuisance parameter. Let $\theta = (\alpha, \beta)^T$. Now,
$$G(\theta; \omega) = \int_0^\infty \alpha(1 - e^{-\beta t}) e^{-t\omega}\, dt = \frac{\alpha\beta}{\omega(\omega + \beta)}.$$
Given $s = (s_1, s_2)^T$, by equating $G(\theta; s_i) = G_n(s_i)$ for $i = 1, 2$, we have the estimators
$$\hat\alpha(s) = \frac{s_1 s_2 (s_2 - s_1) G_n(s_1) G_n(s_2)}{s_2^2 G_n(s_2) - s_1^2 G_n(s_1)}, \qquad \hat\beta(s) = \frac{s_2^2 G_n(s_2) - s_1^2 G_n(s_1)}{s_1 G_n(s_1) - s_2 G_n(s_2)}. \qquad (1.10)$$
When $s_2 \to s_1$, the above estimators converge to
$$\hat\alpha(s_1) = \frac{s_1 G_n^2(s_1)}{2 G_n(s_1) + s_1 \dot G_n(s_1)}, \qquad \hat\beta(s_1) = -s_1 - \frac{s_1 G_n(s_1)}{G_n(s_1) + s_1 \dot G_n(s_1)}, \qquad (1.11)$$
which, in fact, are the solution of the equations
$$G(\theta; s_1) = G_n(s_1) \quad \text{and} \quad \frac{\partial}{\partial s_1} G(\theta; s_1) = \dot G_n(s_1),$$
where $\dot G_n(\cdot)$ denotes the derivative of $G_n(\cdot)$.
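The sketch below evaluates the explicit estimators (1.10) on simulated data. Since (1.7) is not reproduced in this excerpt, $G_n$ is approximated here by a cell-weighted empirical Laplace transform with cell boundaries midway between observation times, an assumption chosen to be consistent with conditions (A2) and (A3) of §2; the data generation and the pair $(s_1, s_2)$ are likewise illustrative.

```python
# A sketch of the explicit estimators (1.10) for the two-parameter
# growth model (1.9).  G_n below is an assumed cell-weighted empirical
# Laplace transform, since (1.7) is not reproduced in this excerpt; the
# simulated data and the pair (s1, s2) are illustrative.
import numpy as np

rng = np.random.default_rng(4)
alpha, beta = 5.0, 0.8
t = np.linspace(0.1, 10.0, 200)
y = alpha * (1.0 - np.exp(-beta * t)) + 0.2 * rng.standard_normal(t.size)

c = np.concatenate(([0.0], (t[:-1] + t[1:]) / 2.0, [t[-1] + 5.0]))  # cell edges

def G_n(w):
    """Empirical counterpart of G(theta; w), integrating e^{-wt} cell-wise."""
    cells = (np.exp(-w * c[:-1]) - np.exp(-w * c[1:])) / w
    return np.sum(y * cells)

s1, s2 = 0.5, 1.0
g1, g2 = G_n(s1), G_n(s2)
beta_hat = (s2**2 * g2 - s1**2 * g1) / (s1 * g1 - s2 * g2)            # (1.10)
alpha_hat = s1 * s2 * (s2 - s1) * g1 * g2 / (s2**2 * g2 - s1**2 * g1)
print(alpha_hat, beta_hat)   # close to (5.0, 0.8)
```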
In this paper we shall not consider further the integration/minimisation approach of §1.1, concentrating instead upon the approach of the last three examples, which frequently results in explicit estimates. An explicit solution is much more attractive than one obtained after iteration. However, in our case there is the added complication that $s$ needs to be chosen, over $\Omega^q$.
1.3 Choosing the transform variable

This has been done in various ways: (i) by cross-validation (Laurence and Morgan, 1987); (ii) by reference to asymptotic criteria, such as minimising an asymptotic variance (Feigin et al., 1983) or a generalised variance (Schmidt, 1982); (iii) by searching along the `diagonal' in $s$-space, where $s_1 = s_2 = \cdots = s_p$ (Tweedie et al., 1995), this resulting from an empirical observation (Morgan and Tweedie, 1981; Schmidt, 1982) that such a choice was optimal in some sense; (iv) by using an external criterion such as a likelihood (Morgan and Tweedie, 1981; Laurence and Morgan, 1987), or a sum of squares, possibly simplified by a reduction in dimensionality following the diagonal optimisation feature mentioned above. For instance, for the model of (1.2), we can form the sum of squared residuals
$$\frac{1}{n} \sum_{j=1}^{n} \{Y_{n,j} - \eta(\hat\theta(s); t_{n,j})\}^2 \qquad (1.12)$$
and then select $s$ as $\hat s_\ell$, which minimises (1.12). The least-squares (LS) transform estimate is then defined as
$$\hat\theta_\ell = \hat\theta(\hat s_\ell). \qquad (1.13)$$
What has been lacking in the work to date for the time-indexed case has been a full asymptotic development to provide standard inferential procedures for the resulting estimator $\hat\theta(s)$ given by (1.6). This is done in §2.1 of this paper, where the importance of accounting for bias is clearly demonstrated. Once we have an explicit approximation to the mean squared error (MSE) of $\hat\theta(s)$, we can choose the transform variable $s$ to minimise the MSE. This is a very convenient approach which has not been previously adopted.
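A minimal sketch of this selection rule, continuing the illustrative growth-model setup and empirical transform of the previous sketch, performs a grid search for the pair $(s_1, s_2)$ minimising (1.12):

```python
# A sketch of choosing the transform pair s = (s1, s2) by minimising
# the residual sum of squares (1.12); all data and grid choices are
# illustrative assumptions carried over from the previous sketch.
import numpy as np

rng = np.random.default_rng(5)
alpha, beta = 5.0, 0.8
t = np.linspace(0.1, 10.0, 200)
y = alpha * (1.0 - np.exp(-beta * t)) + 0.2 * rng.standard_normal(t.size)
c = np.concatenate(([0.0], (t[:-1] + t[1:]) / 2.0, [t[-1] + 5.0]))

def G_n(w):
    return np.sum(y * (np.exp(-w * c[:-1]) - np.exp(-w * c[1:])) / w)

def theta_hat(s1, s2):
    """Explicit estimators (1.10) for a given transform pair."""
    g1, g2 = G_n(s1), G_n(s2)
    b = (s2**2 * g2 - s1**2 * g1) / (s1 * g1 - s2 * g2)
    a = s1 * s2 * (s2 - s1) * g1 * g2 / (s2**2 * g2 - s1**2 * g1)
    return a, b

def rss(s1, s2):
    """Mean squared residual (1.12) evaluated at theta_hat(s)."""
    a, b = theta_hat(s1, s2)
    return np.mean((y - a * (1.0 - np.exp(-b * t))) ** 2)

grid = np.linspace(0.1, 2.0, 20)
s_best = min(((u, v) for u in grid for v in grid if u < v),
             key=lambda p: rss(*p))
print(s_best, theta_hat(*s_best))   # the LS transform estimate (1.13)
```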
By contrast, Feigin et al. (1983) focused on an asymptotic variance, and obtained an $s$ which minimises this variance. However, the resulting $s$ was a function of the unknown $\theta$. They therefore required a prior estimate of $\theta$. The mechanics of our approach are described in §2.2. §3 illustrates our procedures in operation for Examples 1.3 and 1.4. The application to the two-parameter growth model of Example 1.4 provides a demonstration of the `diagonal optimisation' phenomenon mentioned above in (iii). This has been observed empirically in a range of different applications, and is both intriguing and important, because it reduces a high-dimensional optimisation problem to a simple line search; see, for example, Tweedie et al. (1995). In §4 we provide a natural condition under which diagonal optimisation occurs.

The applications of this paper are simple, to illustrate the approaches adopted. However, the potential use of the new methodology is large. The methodology can be applied to estimation based on empirical transforms for various statistical models. On the other hand, quite apart from robustness considerations (cf. Campbell 1992, 1993), model-fitting based on empirical transforms will be the only feasible approach in many cases involving complex and realistic models (cf. Ball 1995, and Tweedie et al. 1995). This is the primary motivation for our work.

There are many avenues for further research. For example, we need to evaluate the effect on the precision of estimators of using the data to select the transform variables (cf. Ball 1995); the family of models given by (1.2) can be extended to the case when the $\{\epsilon_j\}$ are serially correlated; the gain in efficiency from increasing the number of transform variables in time-indexed models needs to be quantified; and the idea of §2.2, for estimating the transform variable directly via the MSE, can be readily applied to the i.i.d. case also.
2 Asymptotic Properties

2.1 The asymptotic distribution of $\hat\theta(s)$

In order to discuss the asymptotic behaviour of the estimator $\hat\theta(s)$ defined as in (1.6), we need to impose some conditions under which the sample size $n$ tends to $\infty$. The key point of the assumptions is to ensure that all the sampling time intervals $t_{j+1} - t_j$ for $j = 1, \ldots, n-1$, and therefore also $c_{j+1} - c_j$ for $j = 0, 1, \ldots, n-1$, converge to 0 as $n \to \infty$, although $t_n$ may tend to $\infty$ (see conditions (A2) and (A6) below). This is also a reasonable assumption in practice since, for example, sampling effort may be wasted by sampling too much in the right-hand tail of a lifetime distribution in reliability trials.
To indicate the dependence of the sampling times on the sample size $n$, we write $(Y_j, t_j, c_j, \epsilon_j)$ as $(Y_{n,j}, t_{n,j}, c_{n,j}, \epsilon_{n,j})$. With this more detailed notation, model (1.2) can be expressed as
$$Y_{n,j} = \eta(\theta_0; t_{n,j}) + \sigma_{n,j} \epsilon_{n,j}, \qquad j = 1, \ldots, n, \qquad (2.1)$$
where $\epsilon_{n,1}, \ldots, \epsilon_{n,n}$ are independent random variables with first two moments 0 and 1. First, we introduce some regularity conditions.

(A1) For all $n \geq 1$, $\min_{1 \leq j \leq n} \sigma^2_{n,j}$ is bounded away from 0, and $\max_{1 \leq j \leq n} \sigma^2_{n,j}$ is bounded away from $\infty$.

(A2) As $n \to \infty$,
$$\sum_{j=1}^{n} \Big\{ \int_{c_{n,j-1}}^{c_{n,j}} g(t, \omega)\, dt \Big\}^2 \to 0, \qquad \omega \in \Omega.$$

(A3) As $n \to \infty$,
$$\sum_{j=1}^{n} \eta(\theta; t_{n,j}) \int_{c_{n,j-1}}^{c_{n,j}} g(t, \omega)\, dt \to \int_0^\infty \eta(\theta; t) g(t, \omega)\, dt = G(\theta; \omega).$$
(A4) For any $\omega \in \Omega$, $\sup_{\theta \in \Theta} \|\dot G(\theta; \omega)\| < \infty$, where $\dot G(\theta; \omega) = \frac{\partial}{\partial\theta} G(\theta; \omega)$.

(A5) $\Theta$ is a compact subset of $R^p$, and for any $\theta_1, \theta_2 \in \Theta$ with $\theta_1 \neq \theta_2$,
$$\sum_{k=1}^{q} \dot G(\theta_1; s_k) \{G(\theta_2; s_k) - G(\theta_1; s_k)\} \neq 0.$$
(A6) $g(t, \omega)$ is a non-negative function. Further, for any $\omega_1, \omega_2 \in \Omega$, as $n \to \infty$,
$$\frac{\max_{1 \leq i \leq n} \big| \int_{c_{n,i-1}}^{c_{n,i}} g(t, \omega_1)\, dt \int_{c_{n,i-1}}^{c_{n,i}} g(t, \omega_2)\, dt \big|}{\sum_{j=1}^{n} \big| \int_{c_{n,j-1}}^{c_{n,j}} g(t, \omega_1)\, dt \int_{c_{n,j-1}}^{c_{n,j}} g(t, \omega_2)\, dt \big|} \to 0.$$
Condition (A2) ensures the law of large numbers, which means that in the sums on the LHS of (A.1) in the appendix no individual term plays a dominant role. (A3) is a necessary condition for $\hat\theta(s)$ to be consistent. (A5) implies that the parameter $\theta$ is identifiable through the functions $G(\cdot; s_k)$, $k = 1, \ldots, q$. (A6) is needed to satisfy the Lindeberg-Feller condition of the central limit theorem (cf. Theorem 2 below). (A1) and (A4) are imposed for technical convenience. Condition (A6) is also not the weakest possible that could be assumed, but it is convenient. Since $G$ is a smooth function, it follows from (1.6) that $\hat\theta(s)$ is the solution of the equation
$$\Psi_n(\theta) \equiv \sum_{k=1}^{q} \dot G(\theta; s_k) \{ G_n(s_k) - G(\theta; s_k) \} = 0. \qquad (2.2)$$
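Equation (2.2) is exactly the first-order condition for minimising $\sum_k \{G_n(s_k) - G(\theta; s_k)\}^2$, so a nonlinear least-squares routine can be used to locate its root. The sketch below does this for the two-parameter growth model; the simulated data, the transform points, and the use of scipy's least_squares are illustrative assumptions.

```python
# Minimising sum_k {G_n(s_k) - G(theta; s_k)}^2 has (2.2) as its
# first-order condition.  The growth model, the simulated data, the
# transform points and the optimiser are illustrative assumptions.
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(6)
alpha, beta = 5.0, 0.8
t = np.linspace(0.1, 10.0, 200)
y = alpha * (1.0 - np.exp(-beta * t)) + 0.2 * rng.standard_normal(t.size)
c = np.concatenate(([0.0], (t[:-1] + t[1:]) / 2.0, [t[-1] + 5.0]))

def G_n(w):
    return np.sum(y * (np.exp(-w * c[:-1]) - np.exp(-w * c[1:])) / w)

def G(theta, w):
    a, b = theta
    return a * b / (w * (w + b))               # model transform for (1.9)

s = np.array([0.3, 0.6, 0.9, 1.2])             # q = 4 transform values
res = least_squares(lambda th: np.array([G_n(w) - G(th, w) for w in s]),
                    x0=[1.0, 1.0], bounds=([1e-6, 1e-6], [np.inf, np.inf]))
print(res.x)   # approximate root of (2.2)
```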
Replacing $Y_{n,j}$ by its mean value $\eta(\theta_0; t_{n,j})$ in $\Psi_n(\theta)$ for all $1 \leq j \leq n$, we obtain
$$\bar\Psi_n(\theta) \equiv \sum_{k=1}^{q} \dot G(\theta; s_k) \Big\{ \sum_{j=1}^{n} \eta(\theta_0; t_{n,j}) \int_{c_{n,j-1}}^{c_{n,j}} g(t, s_k)\, dt - G(\theta; s_k) \Big\}.$$
Therefore, $\inf_{\theta \in U_\varepsilon} \|\bar\Psi_n(\theta)\| \geq \delta_0/2 > 0$ for all sufficiently large $n$. Consequently, $\bar\theta_n \in O(\theta_0; \varepsilon)$ as $n \to \infty$, because $\bar\Psi_n(\bar\theta_n) = 0$. By Lemma 3, $\sup_{\theta \in \Theta} \|\Psi_n(\theta) - \bar\Psi_n(\theta)\| \stackrel{P}{\to} 0$, where $\Psi_n(\theta)$ is given as in (2.2), and $\Psi_n(\hat\theta(s)) = 0$. Hence
$$P\Big\{ \inf_{\theta \in U_\varepsilon} \|\Psi_n(\theta)\| > \delta_0/4 \Big\} \to 1.$$
Consequently, $P\{\|\hat\theta(s) - \theta_0\| < \varepsilon\} \to 1$, since
$$\Big\{ \inf_{\theta \in U_\varepsilon} \|\Psi_n(\theta)\| > \delta_0/4 \Big\} \subset \{\hat\theta(s) \in O(\theta_0; \varepsilon)\}.$$
Proof of Theorem 2. Let $\hat\theta = \hat\theta(s)$. It follows from Taylor's expansion that $\Psi_n(\hat\theta) - \Psi_n(\bar\theta_n) = \dot\Psi_n(\theta^*)(\hat\theta - \bar\theta_n)$, where $\theta^*$ is between $\hat\theta$ and $\bar\theta_n$. Since $\hat\theta$ is the root of (2.2),
$$-\dot\Psi_n(\theta^*) \{\hat\theta(s) - \bar\theta_n\} = \Psi_n(\bar\theta_n). \qquad (A.2)$$
It follows from Lemma 3 and (A3) that