British Journal of Mathematical and Statistical Psychology (1992) 45, 289-309
© 1992 The British Psychological Society
Printed in Great Britain
On statistical inference with parameter estimates on the boundary of the parameter space

T. K. Dijkstra†
Institute of Econometrics, P.O. Box 800, 9700 AV Groningen, The Netherlands

† Requests for reprints.
Inadmissible estimates, like negative variance estimates, frequently occur when a theoretical covariance matrix is fitted to a sample covariance matrix. When this happens, it is common practice to re-estimate subject to non-negativity constraints and to count the ensuing zeros as true zeros. We derive conditions under which this approach has a valid frequentist interpretation.

1. Introduction

As is well known, it happens frequently that unconstrained estimation of, say, the
parameters of a covariance structure leads to inadmissible estimates, like negative variance estimates. Lee (1980), for example, claims that in practice about one-third of the data sets yield at least one negative estimate of the unique variances. Lawley & Maxwell (1971, p. 32) report a study done by Jöreskog in which nine out of eleven data sets produced improper values. When this happens and one still believes in the model, one typically resorts to constrained optimization techniques in order to force the estimates to comply with theoretically desirable inequality, or non-negativity, constraints. For inferential purposes it is then common practice to count the resulting estimates on the boundary, the zeros in case of Heywood cases, as known parameters. The stability of the remaining estimates across the sample space is assessed on the basis of an (asymptotic) distribution which is relevant only under the model with the parameters fixed a priori on the boundary. From an orthodox frequentist point of view this procedure ignores the fact that on different subsets of the sample space different constraints will be selected. Some samples place the estimates well within the parameter space. Others position the estimates on boundaries different from those favoured by the actually available sample. From a strict frequentist perspective, this fact should be taken into account when frequency distributions of estimates are calculated. It is rather obvious, however, that when one insists this is the proper approach, in general insuperable problems arise. Not only will it be impossible to specify the precise partitioning of the sample space for a realistic model, the frequency distributions involved will be intractable. Now one could feel, as we do,
that data sets leading to constraints other than those favoured by the set at hand are less relevant in the assessment of the estimates' stability. In fact, it does not seem unreasonable to focus on the conditional distribution, taking only those samples into account which prefer the same constraints to be binding. This is the approach adopted in this paper. To anticipate a main result, the simple, practical approach which counts the estimated 'zeros' as true zeros and which uses the marginal asymptotic distribution of the other estimates for inferential purposes is valid asymptotically when: (i) the estimated zeros are true zeros; (ii) there are no other true zeros; (iii) the estimation method is asymptotically efficient. In sufficiently large samples, it may be reasonable to assume condition (i) to be satisfied, under the maintained model, but the other conditions appear to be more problematic. A similar result is true for the chi-square statistic used for testing the validity of the assumed covariance structure. The paper is organized as follows. In Section 2 we analyse a very simple model which allows us to highlight the inferential issues without distracting mathematical intricacies. The simplicity of the model also allows us to call attention to problems which would perhaps remain unnoticed in more complicated models. The model we chose to deal with is in fact the simplest possible one-factor model with zero degrees of freedom. This model nevertheless imposes substantial restrictions on the set of covariance matrices (correlation matrices actually). We describe the maximum likelihood estimates and chi-square statistic. Statistical inference problems are discussed in detail. We conclude Section 2 with some numerical exercises. Section 3 continues with the one-factor model, but now we impose equality constraints on the parameters, so that the degrees of freedom are positive. Here we can give some exact distributional results. Section 4, finally, is the most general section. The set-up chosen there is sufficiently rich to cover general covariance structures with equality and non-negativity constraints; it is certainly not restricted to factor analysis models. The approach adopted in this section leans heavily on the work done by Shapiro (1985, 1987) and properties of the normal distribution.
2. A one-factor model with zero degrees of freedom

2.1. The relative size of the set of admissible structures

Suppose one postulates that the correlations between three observable variables can be explained by one unobservable factor. So the correlation matrix Σ of the manifest variables satisfies

    Σ = | λ1²+σ1²   λ1λ2      λ1λ3    |
        | λ1λ2      λ2²+σ2²   λ2λ3    |
        | λ1λ3      λ2λ3      λ3²+σ3² |,   with λi² + σi² = 1,          (1)

for non-negative values of the unique variances σ1², σ2² and σ3². There are no degrees of freedom: we have six moments and six parameters; the model is 'saturated'.
[Figure 1. For a given ρ12 in (0, 1), the shaded area shows the admissible values of (ρ13, ρ23); the ellipse bounds all values of (ρ13, ρ23) compatible with a positive definite Σ.]
However, not every 3 × 3 correlation matrix can be expressed meaningfully in terms of the λs and σ²s. A little algebra yields the following proposition:

Proposition 1. The equations (1) can be solved for real non-zero values of the λs and non-negative values of the σ²s if and only if: (i) ρ12ρ13ρ23 > 0, and (ii) (ρ12ρ13/ρ23), (ρ12ρ23/ρ13) and (ρ13ρ23/ρ12) are not larger than one. (We take Σ p.d.)
If the conditions are satisfied we have λ1² = (ρ12ρ13/ρ23), λ2² = (ρ12ρ23/ρ13), λ3² = (ρ13ρ23/ρ12) and σi² = 1 − λi² for i = 1, 2, 3. To indicate how small the set of correlation matrices satisfying these conditions is relative to the set of all possible correlation matrices, consider Fig. 1. The figure shows, for an arbitrary value of ρ12 from (0, 1), the set of admissible values of (ρ13, ρ23), the shaded area. The area bounded by the ellipse contains all possible values of (ρ13, ρ23) given ρ12; this area is determined by the requirement that Σ is p.d. It is clear that if one takes an arbitrary correlation matrix there is a very good chance that the moment equations cannot be solved for admissible values. In fact it is possible to show (see Appendix) that the volume of the set of admissible correlation matrices relative to the volume of the set of all correlation matrices is a mere 2/π² ≈ 0.2! So even though there are no degrees of freedom, the one-factor model does impose non-trivial restrictions.
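To make Proposition 1 concrete, here is a minimal computational sketch (ours, assuming NumPy; the function name is hypothetical) that checks the two conditions and returns the solution of the moment equations when it is admissible:

```python
# A minimal sketch of Proposition 1; all names are ours.
import numpy as np

def solve_one_factor(p12, p13, p23):
    """Solve the moment equations (1) for the loadings and unique variances,
    returning None when no admissible (real, non-negative) solution exists."""
    # Condition (i): the product of the correlations must be positive,
    # otherwise the loadings would be complex.
    if p12 * p13 * p23 <= 0:
        return None
    lam2 = np.array([p12 * p13 / p23,    # lambda_1^2
                     p12 * p23 / p13,    # lambda_2^2
                     p13 * p23 / p12])   # lambda_3^2
    # Condition (ii): no squared loading may exceed one, otherwise some
    # unique variance 1 - lambda_i^2 would be negative.
    if np.any(lam2 > 1):
        return None
    return np.sqrt(lam2), 1 - lam2       # loadings and unique variances

print(solve_one_factor(0.5, 0.25, 0.5))   # admissible boundary case: sigma_2^2 = 0
print(solve_one_factor(0.5, -0.25, 0.5))  # negative product: no admissible solution
```

Flipping the sign of a single correlation destroys admissibility, which is exactly the mechanism behind the small relative volume 2/π².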
2.2. Maximum likelihood estimators and chi-square statistic

Let R > 0 be a 3 × 3 correlation matrix based on a random sample of size n from a normal population. So

    R = | 1     r12   r13 |
        | r12   1     r23 |
        | r13   r23   1   |.                                            (2)
Define θ = [λ1, λ2, λ3, σ1², σ2², σ3²]′ and define the matrix-valued function Σ(·) of θ by

    Σ(θ) = | λ1²+σ1²   λ1λ2      λ1λ3    |
           | λ1λ2      λ2²+σ2²   λ2λ3    |
           | λ1λ3      λ2λ3      λ3²+σ3² |.                             (3)

Suppose we fit Σ(θ) to R by minimization of

    F(R, Σ(θ)) = n[tr(RΣ⁻¹) + log|R⁻¹Σ| − 3]                             (4)
with respect to θ. Typically, standard computer programs like LISREL allow θ to vary in principle over R⁶, whereas the search for a minimizer ought to be restricted to, say, Θ with

    Θ = {θ ∈ R⁶ : σ1² ≥ 0, σ2² ≥ 0, σ3² ≥ 0}.
We will return to this aspect in Section 2.4. It may happen that the moment equations R = Σ(θ) can be solved for admissible values, in which case the fit is perfect and the associated value of F is zero. As the previous section made clear, it is perfectly possible that the moment equations cannot be solved for admissible values: it can easily happen that either all λs are complex or that precisely one of the σ²s is negative. This follows easily from Proposition 1 of Section 2.1 and the fact that only one of the ratios (r12r13/r23), (r12r23/r13), (r13r23/r12) can be larger than one. So when the conditions of Proposition 1 are not satisfied we cannot obtain a perfect fit for admissible values. Now it can be shown that in this case F(R, Σ(θ)) does not have a stationary point within Θ. Therefore we will have to search on the boundary of Θ for a minimizer. Observe that Σ(θ) with two or more σ²s equal to zero has a zero determinant. So it suffices to substitute a zero for each of the σ²s, one at a time, and minimize F(R, Σ(θ)) with respect to the remaining parameters. Suppose we set σ2² = 0. Let us denote the (standardized) indicators by x1, x2 and x3. Then under the one-factor model we have that x1 and x3 differ from λ1x2 and λ3x2 respectively by uncorrelated measurement errors. In other words, the conditional correlation between x1 and x3 given x2 equals zero, or equivalently, ρ13 = ρ12ρ23. Now if we replace r13 by r12r23 and solve the ensuing moment equations we obtain:
    θ̂ = [λ̂1, λ̂2, λ̂3, σ̂1², σ̂2², σ̂3²]′ = [r12, 1, r23, 1 − r12², 0, 1 − r23²]′.    (5)

Moreover

    F(R, Σ(θ̂)) = n[log(1 − r12²)(1 − r23²) − log|R|]                     (6)
               = n log(1 − r13.2²)⁻¹,                                    (7)

where r13.2 denotes the sample partial correlation of x1 and x3 given x2 (since |R| = (1 − r12²)(1 − r23²)(1 − r13.2²)),
and

    Σ(θ̂) = | 1        r12   r12r23 |
           | r12      1     r23    |
           | r12r23   r23   1      |.                                   (8)
We can treat σ1² = 0 and σ3² = 0 similarly. With this approach, our estimate of σ2² equals zero if and only if r13² is less than both r12² and r23² (the required changes in this statement for σ̂1² = 0 or σ̂3² = 0 will be obvious). This approach can indeed be shown to lead to the correct results in case of ML estimation. In fact, a detailed analysis yields the following proposition (where events with zero probability are ignored).

Proposition 2. The ML estimator θ̂ and associated 'chi-square' statistic F(R, Σ(θ̂)) equal respectively:

I:       [λ̂1, λ̂2, λ̂3, 1 − λ̂1², 1 − λ̂2², 1 − λ̂3²], with λ̂1² = (r12r13/r23), λ̂2² = (r12r23/r13) and λ̂3² = (r13r23/r12), and zero, if and only if r12r13r23 > 0 and none of the three ratios exceeds one;

II(i):   [1, r12, r13, 0, 1 − r12², 1 − r13²] and n log(1 − r23.1²)⁻¹, if and only if {(r12r13/r23) > 1} or [{r12r13r23 < 0} and {r23² < r12², r13²}];

II(ii):  [r12, 1, r23, 1 − r12², 0, 1 − r23²] and n log(1 − r13.2²)⁻¹, if and only if {(r12r23/r13) > 1} or [{r12r13r23 < 0} and {r13² < r12², r23²}];

II(iii): [r13, r23, 1, 1 − r13², 1 − r23², 0] and n log(1 − r12.3²)⁻¹, if and only if {(r13r23/r12) > 1} or [{r12r13r23 < 0} and {r12² < r13², r23²}].

The enumeration is complete.
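The enumeration in Proposition 2 translates directly into a small program. The following sketch (ours, assuming NumPy) solves the moment equations when possible and otherwise selects the boundary case with the smallest value of F, which is what constrained ML amounts to here:

```python
# A sketch of the case enumeration of Proposition 2; names are ours.
import numpy as np

def ml_one_factor(R, n):
    """ML estimate theta = (l1, l2, l3, s1, s2, s3) and the chi-square
    statistic F(R, Sigma(theta)) for the saturated one-factor model."""
    r12, r13, r23 = R[0, 1], R[0, 2], R[1, 2]
    # Case I: the moment equations have an admissible solution; perfect fit.
    if r12 * r13 * r23 > 0:
        lam2 = np.array([r12 * r13 / r23, r12 * r23 / r13, r13 * r23 / r12])
        if np.all(lam2 <= 1):
            return np.concatenate([np.sqrt(lam2), 1 - lam2]), 0.0
    # Case II: set sigma_k^2 = 0 and lambda_k = 1; the lack of fit is confined
    # to the correlation of the remaining pair (i, j), and constrained ML picks
    # the boundary case with the smallest chi-square value.
    logdetR = np.log(np.linalg.det(R))
    best = None
    for i, j, k in [(1, 2, 0), (0, 2, 1), (0, 1, 2)]:
        theta = np.zeros(6)
        theta[k] = 1.0
        theta[i], theta[j] = R[i, k], R[j, k]
        theta[3 + i], theta[3 + j] = 1 - R[i, k]**2, 1 - R[j, k]**2
        F = n * (np.log((1 - R[i, k]**2) * (1 - R[j, k]**2)) - logdetR)  # cf. (6)
        if best is None or F < best[1]:
            best = (theta, F)
    return best

R = np.array([[1, .5, -.25], [.5, 1, .5], [-.25, .5, 1]])
print(ml_one_factor(R, n=10))   # boundary case II(ii): F = 10 log(1.8) = 5.88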
2.3. On statistical inference
An orthodox frequentist's report typically contains a vector of estimates plus a description of the frequency distribution of these estimates as induced by repeated sampling from a fixed probability space. He or she will also report goodness-of-fit statistics and associated 'observed significance levels' or p values, calculated on the same or on a related probability space. It will be clear from Proposition 2 that the exact frequency distribution of θ̂ and F(R, Σ(θ̂)) will not be easy to derive, even when we assume random sampling from a normal population. Asymptotically, when the number of observations, n, tends to infinity, the derivations are a lot easier. Suppose, for example, that Σ is a one-factor structure with positive error variances. Then the condition of Proposition 2.I receives asymptotically unit probability. So θ̂ and the vector of estimates defined in Proposition 2.I have the same limiting distribution. Though a tedious job, this distribution is not difficult to derive. Now suppose that Σ is not a one-factor structure. Then all of the probability mass will tend to be assigned in general to precisely one of the conditions in Proposition 2.II, say II(i). Therefore θ̂ and the estimates defined in II(i) have identical, easily derived limiting distributions. And of course F(R, Σ(θ̂)) tends in probability to infinity. The situation is more complicated when Σ is a boundary point of the set of one-factor structures; i.e. precisely one of the error variances, say σ2², equals zero. So
    Σ = | 1       ρ12   ρ12ρ23 |
        | ρ12     1     ρ23    |
        | ρ12ρ23  ρ23   1      |,                                        (9)

    θ = [ρ12, 1, ρ23, 1 − ρ12², 0, 1 − ρ23²]                             (10)

and

    ρ13.2 = 0.                                                           (11)
It is clear that the first condition in Proposition 2.II(ii), namely {(r12r23/r13) > 1}, will tend to get half of the probability mass. The remaining 50 per cent will be assigned to the condition in Proposition 2.I, asymptotically. So the limiting distribution of θ̂ is the average of two conditional distributions. A similar result is true for the chi-square statistic F(R, Σ(θ̂)): if n is sufficiently large, F(R, Σ(θ̂)) will be zero for one half of the samples and equal to n log(1 − r13.2²)⁻¹ for the other half. Now observe that since ρ13.2 = 0,

    n^(1/2) r13.2 tends in distribution to z                             (12)

with z ~ N(0, 1). Also, knowing that (r12r23/r13) is larger than one is asymptotically equivalent to knowing the sign of r13.2. Consequently F(R, Σ(θ̂)) given the condition in Proposition 2.II(ii) tends in distribution to z² given the sign of z. But this is the same as the unconditional distribution of z²! So

    lim(n→∞) Pr{F(R, Σ(θ̂)) ≤ c} = ½Pr{χ²(0) ≤ c} + ½Pr{χ²(1) ≤ c},       (13)
where χ²(0) equals zero with probability one. The approach followed in practice is a lot simpler: one simply reports the estimates as defined in Proposition 2 and describes their marginal distribution, thereby pretending that one uses the same formula for the estimates and chi-square statistic irrespective of where the sample comes from. In the remainder of this section we will discuss the logic of this approach, assuming that Σ is indeed a one-factor structure, or is well approximated by such a structure. The easiest situation to discuss is the one where condition I of Proposition 2 prevails. This is the case where the moment equations can be solved for admissible values of the variances. So F(R, Σ(θ̂)) = 0 and θ̂ equals

    [λ̂1, λ̂2, λ̂3, 1 − λ̂1², 1 − λ̂2², 1 − λ̂3²], with λ̂1² = (r12r13/r23), λ̂2² = (r12r23/r13), λ̂3² = (r13r23/r12).    (14)
The practical approach requires a description of the asymptotic distribution of (14), which is easy enough to get. It is clear that if Σ is a one-factor model with positive error variances then this approach is asymptotically correct. When, however, one of the error variances is zero (or is close to zero) an orthodox frequentist might deem a mixture of two conditional distributions more relevant. An intermediate position is adopted by those who accept the principle of conditional inference, which could be phrased as follows:
If inference proceeds by relating a given experimental outcome to a series of hypothetical repetitions, then the series of repetitions should be as relevant as possible to the data at hand.

Some of the names associated with this principle are D. R. Cox, J. Kiefer and R. A. Fisher. See, for example, Hinkley (1980), Cox & Hinkley (1974) and Lehmann (1986). So this would call for an evaluation of the conditional distribution of (14), given that the condition of Proposition 2.I is satisfied and a specific error variance is equal to zero. This is by no means trivial. A possible way to proceed is to estimate θ subject to the zero constraint believed to be active, and to simulate the conditional distribution via the bootstrap, using {Σ̂^(1/2)R^(−1/2)xᵢ}ⁿᵢ₌₁ as the set of vectors from which to sample. This suggestion remains to be tried. Next, let θ̂ be a point on the boundary with, say, σ̂2² = 0, so
    θ̂ = [r12, 1, r23, 1 − r12², 0, 1 − r23²]                             (15)

and

    F(R, Σ(θ̂)) = n log(1 − r13.2²)⁻¹.                                    (16)
This happens if and only if the sample satisfies {(r12r23/r13) > 1} or [{r12r13r23 < 0} and {r13² < r12², r23²}]. The approach followed in practice ignores this, and one describes the limiting unconditional or marginal distribution of (15) and (16). In fact, one assumes that σ2² = 0, so that in particular the limiting marginal distribution of F(R, Σ(θ̂)) is a χ²(1)-distribution. In sufficiently large samples the assumption σ2² = 0 may be a reasonable one to make, since lim(n→∞) Pr{σ̂2² = 0 | σ2² > 0} = 0 and lim(n→∞) Pr{σ̂2² = 0 | σ2² = 0} = ½. Consequently, an orthodox frequentist might be willing to settle for the average of the conditional distributions of (14) and (15), and again use (13). However, in this situation it appears to be particularly reasonable to work conditionally. Samples leading to estimates within the admissible area do not seem to be relevant to the data at hand. Now, the conditional distribution of (15) and (16) given σ̂2² = 0 is asymptotically very simple, if σ2² = 0: as stated above, the condition σ̂2² = 0 is asymptotically equivalent to knowing the sign of r13.2; since it can be shown that r13.2 and (r12, r23) are independent, see Dawid (1985) for example, we have the following proposition.

Proposition 3. If σ2² = 0 then the limiting conditional distribution of

    [n^(1/2)(θ̂ − θ)′, F(R, Σ(θ̂))]′,

conditional upon σ̂2² = 0, is the same as its asymptotic marginal distribution. In particular, n log(1 − r13.2²)⁻¹ tends conditionally in distribution to a χ²(1)-distribution and is asymptotically independent of n^(1/2)(θ̂ − θ).
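A quick way to see Proposition 3 and the mixture (13) at work is simulation. The sketch below (ours, assuming NumPy and SciPy; it ignores the asymptotically negligible complex-loadings event) draws samples from a boundary structure with σ2² = 0 and tabulates the chi-square statistic:

```python
# A simulation sketch of the mixture (13); the setup (n, loadings, seed)
# is our own illustration, not the paper's experiment.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
lam = np.array([0.5, 1.0, 0.5])               # boundary structure: sigma_2^2 = 0
Sigma = np.outer(lam, lam) + np.diag(1 - lam**2)
n, reps = 500, 5000
F = np.empty(reps)
for b in range(reps):
    x = rng.multivariate_normal(np.zeros(3), Sigma, size=n)
    R = np.corrcoef(x, rowvar=False)
    r12, r13, r23 = R[0, 1], R[0, 2], R[1, 2]
    r13_2 = (r13 - r12 * r23) / np.sqrt((1 - r12**2) * (1 - r23**2))
    # sigma_2^2 is estimated as zero when (r12 r23 / r13) > 1, giving
    # F = n log(1 - r13.2^2)^-1; otherwise the fit is (essentially) perfect.
    F[b] = -n * np.log(1 - r13_2**2) if r12 * r23 / r13 > 1 else 0.0
print("Pr{F = 0}            :", np.mean(F == 0))     # approaches 1/2
print("Pr{F > 3.84}         :", np.mean(F > 3.84))   # approx. 0.5 * 0.05
print("0.5 Pr{chi2(1) > 3.84}:", 0.5 * stats.chi2.sf(3.84, 1))
```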
What Proposition 3 amounts to is that the practical approach is justifiable from a frequentist point of view, if the sample is sufficiently large and σ2² is zero (or is close to zero). Clearly σ̂1² = 0 and σ̂3² = 0 can be handled in a very similar way. The remainder of this section will be devoted to a special case, whose existence can be recognized only because of the simplicity of the model used here. We mean the following. A straightforward analysis yielded that {σ̂2² = 0} is the union of two disjoint events, {(r12r23/r13) > 1} and [{r12r13r23 < 0} and {r13² < r12², r23²}]. Under the one-factor model, the former event receives half of the probability mass when n → ∞ and σ2² = 0. The latter event's probability tends to zero. But imagine that this event occurs. Then a conditional asymptotic analysis is out of the question. A full-blown unconditional analysis is also problematic, to put it mildly. To get some insight into this problem we conducted a Monte Carlo experiment. The Σ used is
    Σ = | 1     .5    .25 |
        | .5    1     .5  |
        | .25   .5    1   |,                                             (17)
so θ = [.5, 1, .5, .75, 0, .75]. We put n = 10 and generated just enough samples of size 10 to get 1000 with the property that [{r12r13r23 < 0} and {r13² < r12², r23²}]. We needed about 9000 samples. The estimated probability of the event is thus 11 per cent. Since ρ12 = ρ23, it is clear that r12 and r23 have identical distributions, conditionally as well as unconditionally. The 1000 values of r12 had mean and standard deviation respectively

    0.41 and 0.23.                                                       (18)
We also generated 1000 values of r12 unconditionally. The mean and standard deviation were respectively

    0.47 and 0.27.                                                       (19)
A comparison of histograms strongly suggested that r12 is conditionally stochastically smaller than unconditionally. For example, about 65 per cent of the r12s from the samples satisfying the condition were less than ρ12 = .5, as opposed to 48 per cent of the other set of r12s. We also compared the chi-square statistics n log(1 − r13.2²)⁻¹. The conditional mean and standard deviation were respectively

    1.64 and 1.78.                                                       (20)

Unconditionally, the same statistics were respectively

    1.47 and 2.04.                                                       (21)
Critical values at 5 per cent were estimated at 5.2 for the conditional distribution and at 5.6 for the unconditional distribution. The latter has a lighter tail to the right but a heavier tail to the left than the conditional distribution. So the approach followed in practice would in this case be less than appropriate. More research may be needed, especially perhaps into the usefulness of the bootstrap. On the other hand, in more complex situations we will not be able to split events like {σ̂2² = 0} into subevents, some with zero limiting probabilities. So in general all that appears feasible is just conditioning on the fact that particular non-negativity constraints are binding. We will deal with that in some generality in Section 4.
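The experiment is easy to repeat. A sketch of the conditional sampling scheme (ours, assuming NumPy; the seed is arbitrary):

```python
# A sketch of the Monte Carlo experiment of this subsection.
import numpy as np

rng = np.random.default_rng(1)
lam = np.array([0.5, 1.0, 0.5])
Sigma = np.outer(lam, lam) + np.diag(1 - lam**2)   # the matrix in (17)
n, wanted = 10, 1000
cond_r12, cond_chi2, tried = [], [], 0
while len(cond_r12) < wanted:
    tried += 1
    x = rng.multivariate_normal(np.zeros(3), Sigma, size=n)
    R = np.corrcoef(x, rowvar=False)
    r12, r13, r23 = R[0, 1], R[0, 2], R[1, 2]
    # keep only samples with complex loadings and r13^2 the smallest square
    if r12 * r13 * r23 < 0 and r13**2 < min(r12**2, r23**2):
        r13_2 = (r13 - r12 * r23) / np.sqrt((1 - r12**2) * (1 - r23**2))
        cond_r12.append(r12)
        cond_chi2.append(-n * np.log(1 - r13_2**2))
print("estimated event probability :", wanted / tried)   # approx. 0.11
print("mean, sd of conditional r12 :", np.mean(cond_r12), np.std(cond_r12))
print("mean, sd of conditional chi2:", np.mean(cond_chi2), np.std(cond_chi2))
```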
2.4. Some numerical exercises

Suppose the observed correlation matrix R equals

    R = | 1     .5    −.25 |
        | .5    1     .5   |
        | −.25  .5    1    |.
Since the product of the correlations is negative we cannot solve the moment equations for admissible values, so a perfect fit is impossible. The LISREL algorithm, in an attempt to minimize F(R, Σ(θ)) on R⁶, thus without constraints, reported that the iterations did not converge. In addition, at the point where it stopped iterating, LISREL produced a chi-square statistic with zero degrees of freedom equal to minus 16.5, the estimated λs were approximately [−121, +6943, −2793], other estimates were also way out of line, and it warned that λ3 may not be identified (the initial estimates were selected by LISREL)! Based on some numerical work the author suspects that the graph of F(R, Σ(θ)) has a valley of 'minimum values' on the set of θs of the form (λ1, λ2, λ1, 1 − λ1², 1 − λ2², 1 − λ1²) with λ1λ2 = .5869 and λ1 tending to zero. However, the analysis of the unconstrained minimization problem is not yet complete. It has been noted by Rindskopf (1983), who attributes the basic idea to Bentler (1976), that LISREL may be forced to minimize subject to non-negativity constraints, by means of a reparameterization. More specifically, one fixes the variances of the measurement errors at 1.0 and estimates their coefficients. So we write for the (standardized) indicators x1, x2 and x3:
    xᵢ = λᵢξ₀ + λ₃₊ᵢξᵢ,   i = 1, 2, 3,

where the ξs are zero mean, unit variance, and uncorrelated variables. Σ(·) is now a function of the six λs. The residual variances are estimated by the squares of λ̂4, λ̂5 and λ̂6. One problem with this appealingly simple suggestion is that if the minimum is located on the boundary, so that, say, λ̂5 = 0, the Jacobian (∂σₖₗ/∂λⱼ) has less than maximal rank at λ̂. Optimization methods which require a full rank Jacobian at the optimum may be expected to get into trouble. Indeed, when we ran LISREL on this problem we had again non-convergence, absurd estimates at the last iteration, and an incorrect identification warning (the initial estimates were also selected by LISREL). When we used a gradient-free method, the Nelder-Mead algorithm, the results obtained were correct, namely θ̂ = [.5, 1, .5, .75, 0, .75]. Other methods for constrained optimization have been suggested by Lee (1980) and Jennrich & Sampson (1968). However, the simplest approach for relatively small covariance matrices may still be a systematic search on the boundary. We have also fitted the reparameterized Σ(·), Σ(λ), to R using the following functions:
(1) GLS    : ½ tr(I − R⁻¹Σ)²
(2) IML    : tr(R⁻¹Σ) + log|Σ⁻¹R| − 3
(3) GD     : ½ Σᵢ (log γᵢ)², with the ith eigenvalue of R⁻¹Σ denoted by γᵢ
(4) IML(m) : (1/m²)[tr((R⁻¹Σ)ᵐ) − m log|R⁻¹Σ| − 3], with m = −2
(5) IGLS   : ½ tr(I − Σ⁻¹R)²
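Since each of these functions, like ML, is of the form Σᵢ f(γᵢ) in the eigenvalues γᵢ of R⁻¹Σ, the whole exercise can be scripted. A sketch (ours, assuming NumPy/SciPy; starting values and tolerances are arbitrary):

```python
# A sketch of fitting the reparameterized model by Nelder-Mead.
import numpy as np
from scipy.linalg import eigvals
from scipy.optimize import minimize

R = np.array([[1, 0.5, -0.25], [0.5, 1, 0.5], [-0.25, 0.5, 1]])

def sigma_of(lam):
    """Rindskopf reparameterization: error variances fixed at one, so the
    implied matrix is lam[:3] lam[:3]' + diag(lam[3:]**2)."""
    return np.outer(lam[:3], lam[:3]) + np.diag(lam[3:] ** 2)

F = {  # each function as sum_i f(gamma_i), gamma_i eigenvalues of R^-1 Sigma
    'GLS':     lambda g: 0.5 * np.sum((1 - g) ** 2),
    'IML':     lambda g: np.sum(g - np.log(g) - 1),
    'GD':      lambda g: 0.5 * np.sum(np.log(g) ** 2),
    'ML':      lambda g: np.sum(1 / g + np.log(g) - 1),
    'IML(-2)': lambda g: np.sum(g ** -2 + 2 * np.log(g) - 1) / 4,
    'IGLS':    lambda g: 0.5 * np.sum((1 - 1 / g) ** 2),
}
for name, f in F.items():
    obj = lambda lam, f=f: f(np.real(eigvals(np.linalg.solve(R, sigma_of(lam)))))
    res = minimize(obj, x0=[0.5, 0.9, 0.5, 0.7, 0.1, 0.7], method='Nelder-Mead',
                   options={'xatol': 1e-9, 'fatol': 1e-12,
                            'maxiter': 50000, 'maxfev': 50000})
    print(f'{name:8s} minimum = {res.fun:.4f}   sigma1^2 = {res.x[3] ** 2:.4f}')
```

Each f satisfies f(1) = 0 and f′(1) = 0, so a perfect fit gives a zero minimum; the functions differ in their third derivative at γ = 1, which is what drives the spread in Table 1 below.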
The functions were minimized by means of the Nelder-Mead algorithm. All of these fitting functions, including ML, can be written in the form Σᵢ f(γᵢ) for suitable f(·). They belong to the class of fitting functions suggested and analysed by Swain (1975); see also Dijkstra (1990), in particular for IML, IML(m) and IGLS. They yield asymptotically equivalent results, but their estimates may differ appreciably in 'small' samples. Since the diagonal elements of R as well as r12 and r23 are equal here, the resulting vector of estimates is of the same form for all of Swain's fitting functions, i.e. θ̂ = [λ̂1, λ̂2, λ̂1, σ̂1², σ̂2², σ̂1²]. Moreover one may expect σ̂2² = 0. It turned out that the five functions listed above and ML produced identical estimates except for σ̂1²; θ̂ is of the form [.5, 1, .5, σ̂1², 0, σ̂1²]. In other words, the estimated Σs differ only in the first and third diagonal element. Table 1 collects the estimates of σ̂1² as well as the minimum values of the fitting functions.

Table 1. Estimates of σ̂1² and minimum values

Fitting function    σ̂1²       Minimum value
GLS                 .2885      .3077
IML                 .4167      .5878
GD                  .5591      .6476
ML*                 .7500      .5878
IML(−2)             .9014      .4778
IGLS                1.0833     .3077

* For ML the minimum value reported is F(R, Σ(θ̂))/n.

The functions are ordered w.r.t. the third derivative of their f(·) at γ = 1; GLS has the largest third derivative, IGLS the smallest. As predicted by Swain (1975), the residual variance estimates increase with decreasing values of the third derivative. Finally, we close this section by noting the following facts: (1) Only one of the estimated Σs is a correlation matrix, namely in case of ML. However, if we transform the Σ̂s to correlation matrices we always have ρ̂13.2 = 0. (2) It is not generally true that ML and IML or GLS and IGLS yield the same minimum value. (3) It is not generally true that the fitting functions locate the minimum on the same boundary (it is here due to r12 = r23).
3. A one-factor model with positive degrees of freedom

In this section we assume we have a random sample x1, x2, ..., xn from a normal population with zero mean and

    Σ = λ²11′ + σ²Iₘ,

where 1 denotes the m-vector of ones.
So there are m indicators, m > 1, with identical loadings and identical variances. Let

    S = (1/n) Σᵢ₌₁ⁿ xᵢxᵢ′.
The following proposition is fairly straightforward to verify.

Proposition 4. Unconstrained maximization of the likelihood function with respect to θ = (λ², σ²) yields:

    λ̂² = (1′S1 − tr S) / (m(m − 1))

and

    σ̂² = (tr S − 1′S1/m) / (m − 1).
Note that σ̂² is always positive but λ̂² may be negative. In fact we can show

Proposition 5. The vector (λ̂², σ̂²)′ is distributed as

    ( [(mλ² + σ²)χ²(n)/n − σ²χ²(n(m − 1))/(n(m − 1))] / m ,  σ²χ²(n(m − 1))/(n(m − 1)) )′,

where χ²(n) and χ²(n(m − 1)) are independently distributed random variables with chi-square distributions with n and n(m − 1) degrees of freedom, respectively. The probability that λ̂² is negative is

    Pr{F(n, n(m − 1)) ≤ σ²/(mλ² + σ²)},

where F(n, n(m − 1)) denotes an F-distributed random variable with n and n(m − 1) degrees of freedom. The constrained ML estimator equals (λ̂², σ̂²) when λ̂² ≥ 0, and

    λ̃² = 0,   σ̃² = tr S/m = λ̂² + σ̂²

otherwise.
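A simulation sketch of Propositions 4 and 5 (ours, assuming NumPy/SciPy; the parameter values are arbitrary illustrations), comparing the simulated frequency of a negative λ̂² with the F-probability of Proposition 5:

```python
import numpy as np
from scipy import stats

def fit_equal_loadings(X):
    """Unconstrained ML for Sigma = lambda^2 11' + sigma^2 I (Proposition 4),
    plus the constrained version that maps a negative lambda^2 to zero."""
    n, m = X.shape
    S = X.T @ X / n                                    # zero-mean sample moments
    lam2 = (np.sum(S) - np.trace(S)) / (m * (m - 1))   # 1'S1 = S.sum(); may be < 0
    sig2 = (np.trace(S) - np.sum(S) / m) / (m - 1)     # always positive
    con = (lam2, sig2) if lam2 >= 0 else (0.0, np.trace(S) / m)
    return (lam2, sig2), con

rng = np.random.default_rng(2)
m, n, lam2, sig2 = 4, 50, 0.05, 1.0
Sigma = lam2 * np.ones((m, m)) + sig2 * np.eye(m)
neg = sum(fit_equal_loadings(rng.multivariate_normal(np.zeros(m), Sigma, n))[0][0] < 0
          for _ in range(5000))
# Proposition 5: Pr{lambda^2_hat < 0} = Pr{F(n, n(m-1)) <= sigma^2/(m lambda^2 + sigma^2)}
print(neg / 5000, stats.f.cdf(sig2 / (m * lam2 + sig2), n, n * (m - 1)))
```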
Moreover, the chi-square statistic in case λ̂² ≥ 0 …

… with φ_(n−1) ∈ (0, 2π] and the remaining φs restricted to (0, π]. The Jacobian is the product of a function of the φs and of τ^(n−1). Since y′y = τ², the joint density of (τ, φ1, ..., φ_(n−1)) factors into a function of τ and a function of (φ1, ..., φ_(n−1)). In other words, τ and (φ1, ..., φ_(n−1)) are independent. Since the conditioning event {Ay ≤ 0} involves the φs only, the lemma is proved. QED.

Proof of Proposition 9. The minimum of (z − Aμ)′V⁻¹(z − Aμ) on K is of the form (z − Aν̂)′V⁻¹(z − Aν̂) + ν̂1′Ω11⁻¹ν̂1 + λ̂1′Ω̃11⁻¹λ̂1. This is a sum of three independent quadratic forms with χ²(p − q)-, χ²(r)- and χ²(s1)-distributions respectively, since (z − Aν̂), ν̂1 and λ̂1 are independent. It is clear that when we condition on (39), only the first condition is relevant. So we must find
    Pr{(z − Aν̂)′V⁻¹(z − Aν̂) + ν̂1′Ω11⁻¹ν̂1 + λ̂1′Ω̃11⁻¹λ̂1 ≤ c | Ω̃11⁻¹λ̂1 ≤ 0}.
Now condition on the values of the first two quadratic forms as well. The lemma just proved implies that {Ω̃11⁻¹λ̂1 ≤ 0} is then irrelevant. Integration with respect to the conditioning quadratic forms gives the desired result. QED.

An immediate corollary is

Proposition 10. The probability that the minimum of (z − Aμ)′V⁻¹(z − Aμ) on K is at least equal to c is given by
    Σᵢ₌₀ˢ wᵢ,ₛ Pr{χ²(p − q + r + i) ≥ c},

where wᵢ,ₛ is the chance of getting i zeros when minimizing (μ2 − λ̂)′Ω̃⁻¹(μ2 − λ̂) on {μ2 ≥ 0}.
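The weights can always be estimated by brute force: generate normal vectors, project them onto the non-negative orthant in the Ω̃-metric, and count zeros. A sketch (ours, assuming NumPy/SciPy; the Ω̃ below is an arbitrary illustration):

```python
# A sketch of estimating the weights w_{i,s} of Proposition 10 by simulation.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(3)
s = 3
Omega = np.array([[1.0, 0.5, 0.2],
                  [0.5, 1.0, 0.4],
                  [0.2, 0.4, 1.0]])              # illustrative Omega-tilde
L = np.linalg.cholesky(np.linalg.inv(Omega))     # whitening for the metric
counts = np.zeros(s + 1)
for _ in range(20000):
    lam = rng.multivariate_normal(np.zeros(s), Omega)
    # minimize (mu - lam)' Omega^-1 (mu - lam) over mu >= 0, rewritten as the
    # equivalent non-negative least-squares problem ||L'(mu - lam)||^2
    mu, _ = nnls(L.T, L.T @ lam)
    counts[np.sum(np.isclose(mu, 0.0))] += 1
print("estimated weights w_{i,s}:", counts / counts.sum())
```

As the text notes, such calculations stay manageable for small s but become burdensome quickly.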
So Proposition 10 gives in effect the limiting unconditional distribution of the chi-square statistic. Shapiro (1985) discusses the calculation of the weights wᵢ,ₛ in some detail. It appears that for s ≤ 4 the weights can be calculated or estimated reasonably easily, but larger values of s can create something of a problem. The asymptotic distribution of n^(1/2)(θ̂ − θ0) is a mixture of conditional distributions, with the conditions specifying which of the non-negativity constraints are binding; the weights of the mixture will be difficult to determine in general. However, a particular conditional distribution is very easily obtained. This concerns the limiting conditional distribution given that all s non-negativity constraints are active. For then (38) and (39) imply that this is just the marginal asymptotic distribution of n^(1/2)(θ̂ − θ0) as obtained by minimization of the fitting function with the first r + s elements fixed. So we have
Proposition 11. n^(1/2)(θ̂ − θ0), given that all s non-negativity constraints are binding, is asymptotically distributed as (0′, μ3′)′ with
μ3 ~ N(0, Ω33.12); μ3 and the chi-square statistic are independent. The situation is a lot less favourable when s1 < s zeros are obtained: the conditional and marginal distributions do not agree; conditionally, we get a truncated normal distribution with all non-zero elements of μ̂ biased, in general. See Judge & Yancey (1986) for a collection of properties of truncated normal distributions. Finally, we will briefly indicate what happens when a non-optimal or inefficient fitting function is used; i.e. suppose in the postulate of Section 4.1 we replace V⁻¹ by W with W ≠ V. If we substitute W for V in (36), the expression for Ω, then the same algebra as before can be used to get a new μ̂ with a definition completely analogous to the old μ̂. However, chi-square statistics are in general no longer χ²-distributed, not even conditionally. Also, there is no corresponding Proposition 11 for non-optimal estimators when all inequality constraints happen to be active. In fact it can be shown that now μ3 is non-normal, it has a non-zero mean, and its conditional covariance matrix differs from its marginal counterpart, the latter being typically the larger. However, the author knows of one special situation where the errors cancel: if there is only one non-negativity constraint and this constraint happens to be binding, then it is easy to show that symmetrical confidence intervals based on the marginal distribution of μ3 have the same confidence level conditionally! It is not known whether this extends to more general situations.
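The one-constraint claim is easy to illustrate in a stylized bivariate-normal setting (ours, not the paper's derivation): condition on the unconstrained estimate of the constrained parameter being non-positive, and check the level of the symmetric interval for the free parameter:

```python
# An illustrative sketch of the final claim: with a single binding
# non-negativity constraint, symmetric intervals keep their conditional level.
import numpy as np

rng = np.random.default_rng(4)
rho, z95 = 0.7, 1.96      # corr(constrained, free estimate); 95 per cent point
draws = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=200000)
x, y = draws[:, 0], draws[:, 1]   # x: estimate hitting the boundary; y: free
binding = x <= 0                  # constraint active <=> unconstrained x <= 0
# by central symmetry, Pr{|y| <= z | x <= 0} = Pr{|y| <= z}, even though the
# conditional distribution of y is shifted and asymmetric
print("marginal    Pr{|y| <= 1.96}:", np.mean(np.abs(y) <= z95))
print("conditional Pr{|y| <= 1.96}:", np.mean(np.abs(y[binding]) <= z95))
print("conditional mean of y      :", np.mean(y[binding]))   # biased, same level
```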
4.3. Some implications with respect to statistical inference

In applied covariance structure analysis it happens frequently that unconstrained optimization produces inadmissible estimates. If the model is still maintained, one often decides to fit the model as well as possible to the data subject to non-negativity constraints. It is common practice to treat the ensuing zeros as 'true' zeros, thereby increasing the degrees of freedom. The stability of the non-zero estimates is assessed according to the estimated marginal asymptotic distribution. An orthodox frequentist, however, would insist, at least in principle, upon a partitioning of the sample space, each subset specified by a unique set of zero estimates. And in theory he would demand the frequency distribution of the estimates as induced by varying across the sample space, including those parts of the sample space which lead to zero estimates, if any, different from those actually obtained. If one adopts the principle of conditional inference, however, then one could limit the variation to that part of the sample space producing the same zeros as the data at hand. This appears to be quite a natural thing to do: one tends to consider the possibility of boundary points only when the sample yields zeros under constrained optimization. The previous sections gave conditions which ensure that the conditional approach agrees with the approach followed in practice. Basically, the fitting function should be
correctly specified, so that asymptotically efficient estimates are produced. Moreover, the zeros found ought to be 'true' zeros indeed. For the chi-square statistic there may be more true but unidentified zeros, but not for the vector of estimates. In sufficiently large samples it may not be unreasonable to assume that we have identified at least a subset of binding constraints for the true parameters. After all, the asymptotic probability of obtaining boundary estimates is positive only when the true values are on or close to the boundary. Nevertheless, the conditions taken together are rather strong. In fact one may feel, as does the author, that statistical inference in a frequentist mode is almost impossible to perform true to principle. To look at the same results in a more positive vein: there are situations where the approach typically followed in practice makes sense and is justifiable from a frequentist perspective. For a strongly related discussion concerning the problems associated with model specification and evaluation on the same data set, we refer to Dijkstra (1988) and Dijkstra (in preparation).
References

Barlow, R. E., Bartholomew, D. J., Bremner, J. M. & Brunk, H. D. (1972). Statistical Inference under Order Restrictions. New York: Wiley.
Bentler, P. M. (1976). Multistructure statistical model applied to factor analysis. Multivariate Behavioral Research, 11, 3-25.
Cox, D. R. & Hinkley, D. V. (1974). Theoretical Statistics. London: Chapman & Hall.
Dawid, A. P. (1985). Invariance and independence in multivariate distribution theory. Journal of Multivariate Analysis, 17, 304-315.
Dijkstra, T. K. (Ed.) (1988). On Model Uncertainty and its Statistical Implications. Berlin: Springer-Verlag.
Dijkstra, T. K. (1990). Some properties of estimated scale invariant covariance structures. Psychometrika, 55, 327-336.
Dijkstra, T. K. (in preparation). On statistical inference for data-instigated models.
Hinkley, D. V. (1980). Fisher's development of conditional inference. In S. Fienberg & D. V. Hinkley (Eds.), R. A. Fisher, an Appreciation, pp. 101-108. Berlin: Springer-Verlag.
Jennrich, R. I. & Sampson, P. F. (1968). Application of stepwise regression to non-linear estimation. Technometrics, 10, 63-72.
Judge, G. G. & Yancey, T. A. (1986). Improved Methods of Inference in Econometrics. Amsterdam: North-Holland.
Lawley, D. N. & Maxwell, A. E. (1971). Factor Analysis as a Statistical Method. London: Butterworth.
Lee, S.-Y. (1980). Estimation of covariance structure models with parameters subject to functional restraints. Psychometrika, 45, 309-324.
Lehmann, E. L. (1986). Testing Statistical Hypotheses, 2nd ed. New York: Wiley.
Muirhead, R. J. (1982). Aspects of Multivariate Statistical Theory. New York: Wiley.
Rindskopf, D. (1983). Parameterizing inequality constraints on unique variances in linear structural models. Psychometrika, 48, 73-83.
Shapiro, A. (1985). Asymptotic distribution of test statistics in the analysis of moment structures under inequality constraints. Biometrika, 72, 133-144.
Shapiro, A. (1987). Second order sensitivity analysis and asymptotic theory of parameterized nonlinear programs. Mathematical Programming, 33, 280-299.
Swain, A. J. (1975). A class of factor analysis estimation procedures with common asymptotic sampling properties. Psychometrika, 40, 315-335.

Received 27 July 1990; revised version received 6 January 1992
Appendix

Write α = ρ12, β = ρ13 and γ = ρ23. The set of all correlation matrices is

    C = {(α, β, γ) ∈ (−1, 1)³ : 1 − α² − β² − γ² + 2αβγ > 0}.
The set of admissible correlation matrices, AC say, is the intersection of C with

    {(α, β, γ) : αβγ > 0 and αβ/γ ≤ 1, αγ/β ≤ 1, βγ/α ≤ 1}.
We will first determine the volume of C. Define g(α) as the area of the set of (β, γ) compatible with a positive definite Σ for given α; this set is the interior of an ellipse, and g(α) = π(1 − α²)^(1/2). Then

    volume(C) = ∫₋₁¹ g(α) dα = 2∫₀¹ π(1 − α²)^(1/2) dα = π²/2.
Now volume(AC) is just, see Figure 1, the integral over α of the area of the shaded region, which equals 1.
So volume(AC)/volume(C) equals 2/π². Incidentally, if we define

    BC = C ∩ {(α, β, γ) : αβγ > 0},

then we have volume(AC)/volume(BC) ≈ .31. Note that BC contains the correlation matrices which fit a one-factor model with real loadings but not necessarily admissible values of the variances. So even in case the moment equations can be solved for real λs, the odds are in favour of a negative unique variance.
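These volume ratios are simple to confirm by Monte Carlo (a sketch of ours, assuming NumPy):

```python
# A Monte Carlo sketch of the Appendix volume ratios.
import numpy as np

rng = np.random.default_rng(5)
a, b, g = rng.uniform(-1, 1, (3, 2_000_000))
pd = 1 - a**2 - b**2 - g**2 + 2 * a * b * g > 0   # Sigma positive definite: C
bc = pd & (a * b * g > 0)                         # real loadings: BC
ac = bc & (a * b / g <= 1) & (a * g / b <= 1) & (b * g / a <= 1)  # admissible: AC
print('vol(AC)/vol(C) :', ac.sum() / pd.sum(), ' vs 2/pi^2 =', 2 / np.pi**2)
print('vol(AC)/vol(BC):', ac.sum() / bc.sum(), ' vs approx. .31')
```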
then we have volume (AC)/volume (BC)z.31. Note the BC contains the correlation matrices which fit a one-factor model with real loadings but not necessarily admissible values of the variances. So even in case the moment equations can be solved for real As, the odds are in favour of a negative unique variance.