Methods of Psychological Research Online 1997, Vol.2, No.2 Internet: http://www.pabst-publishers.de/mpr/

© 1997 Pabst Science Publishers

Bootstrapping Goodness-of-Fit Statistics for Sparse Categorical Data – Results of a Monte Carlo Study –

Matthias von Davier

Abstract

Analysis of categorical data in educational measurement or psychometrics often faces the problem of how to test model fit for a questionnaire with many items. In this situation, most observed response patterns will be unique and many of the possible response patterns will not be observed at all. The "classical" approach of evaluating goodness-of-fit statistics like the Pearson X² or the likelihood-ratio G² by means of a χ² distribution is not appropriate in these cases, as the data are sparse and the expected frequencies are very low (Fienberg, 1979; Koehler & Larntz, 1980). Performing the bootstrap (Efron, 1979) instead was suggested by many authors (Aitkin, Anderson, & Hinde, 1981; Collins, Fidler, Wugalter, & Long, 1993; Langeheine, Pannekoek, & van de Pol, 1996) as an approach to solving the problem. The bootstrap is supposed to produce a rough approximation of the goodness-of-fit statistic's unknown distribution. Langeheine et al. (1996) have shown that the bootstrap works fine for small contingency tables, but for sparse tables, different conclusions regarding the fit of a model can arise if more than one statistic is tested in the bootstrap procedure. Results of a Monte Carlo study focusing on bootstrapping different goodness-of-fit statistics are presented in this paper. The four statistics examined are the Pearson X², the Cressie-Read CR(2/3) (Cressie & Read, 1984), the likelihood-ratio G², and the Freeman-Tukey FT statistic (see Read & Cressie, 1988). The results presented here imply that the parametric bootstrap can be used for analyzing goodness-of-fit, even if the data are very sparse, at least with some of the examined statistics. An explanation based on the behavior of the power-divergence (Cressie & Read, 1988) statistics as well as practical recommendations for using the bootstrap are given in this paper.

* Postdoctoral Fellow 1997/1998, Educational Testing Service, Princeton, NJ. This research was supported by the DFG (Deutsche Forschungsgemeinschaft) in the project "Mixture Distribution Models" supervised by Prof. Dr. Jürgen Rost.

1 Introduction

Investigating complicated and interesting subject matter often requires large numbers of questions and response categories. In contrast, many textbook examples of categorical data analysis contain no more than three or four dichotomous variables. The number of items (denoted by k in this paper) as well as the number of response categories (denoted by m) determine the number of possible response patterns J = m^k, which is the size of the (multivariate) contingency table analyzed if models for categorical data are used.¹ The following examples may illustrate the huge size of the contingency table arising from moderately long questionnaires:

ABI-UMW/P (Krohne, Rösch, & Kürsten, 1989): A German coping questionnaire consisting of eight scales with k = 18 dichotomous (m = 2) items each. When analyzing each scale separately, there are 2^18 = 262,144 possible response vectors in the contingency table.

NEO-FFI (McCrae & Costa, 1987): A personality inventory with 5 scales, each consisting of k = 12 items with m = 5 response categories. This yields J = 5^12 = 244,140,625 cells for each of the 5 scales.

SM scale (Snyder, 1974): The self-monitoring scale originally consists of k = 21 items with dichotomous responses, i.e., J = 2^21 = 2,097,152 possible response patterns.

It can easily be verified that even for the smallest scale, with J = 262,144 possible response patterns, the sample size N will rarely outnumber J. Even when the sample size is very large, most of the possible responses will not be observed, so that the observed frequencies of the contingency table will mostly be zero, i.e., the data are very sparse. This also means that expected frequencies will be extremely small in such a sparse contingency table. Applying classical χ² goodness-of-fit tests in this situation is not appropriate, as certain conditions have to be met when the χ² reference distribution is used (see Bishop, Fienberg, & Holland, 1975). The rule of thumb of a minimum expected frequency of min_i E(x_i) ≥ 5 will, of course, not be met in the sparse data case either. In order to use the χ² distribution, different rules of thumb for the minimum

¹ For models analyzing sum scores rather than response patterns, the contingency table of possible responses boils down to the number of distinct scores, so that sparse data will occur only in small samples. Nevertheless, when testing the model as a whole, i.e., including the assumption of losing no information by using sum scores, the multidimensional contingency table of all possible responses has to be used.


expected frequency have been proposed (Read & Cressie, 1988), some as small as min_i E(x_i) ≥ 1. However, even this boundary is by far too high in many applications. This can be verified by a rough estimate: min_i E(x_i) ≤ N/m^k, i.e., the minimum expected frequency is bounded by the ratio of the number of observations to the number of cells in the contingency table. The upper bound N/m^k is reached only in the trivial equal-probability case.

The problems of small expectations and sparse data were addressed in a number of simulation studies investigating the distribution of various goodness-of-fit statistics (see, e.g., Koehler & Larntz, 1980; Collins et al., 1993). Collins et al. (1993) found that even with a fairly limited maximum scale length of k = 8 items, the distributional moments of G², X², and the Cressie-Read statistic CR(2/3) differ strongly from the expected χ² distribution. Besides these empirical results, there are also a number of papers deriving asymptotic results when both the number of cells J and the sample size N increase (see Holst, 1972; Read & Cressie, 1988; Osius & Rojek, 1992). Under this increasing-cell assumption, asymptotic normality can be derived for most of the well-known statistics like the Pearson X² and the likelihood-ratio statistic G². When analyzing real data, any proof of an asymptotic normal distribution is of limited use, as the number of observations N as well as the number of cells in the contingency table J = m^k are fixed. Therefore, the distribution of the goodness-of-fit statistics will be unknown in most applications. Efron (1979, 1982) proposed the bootstrap, a very general computing-intensive method to produce a (more or less rough) approximation to the unknown distribution of a statistic T (see section 3). There are two versions of the bootstrap, the (naive) bootstrap and the parametric bootstrap, of which only the parametric bootstrap can be used for goodness-of-fit testing (see Bollen & Stine, 1993; Langeheine et al., 1996; von Davier, 1997).
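The table sizes quoted above and the rough sparseness bound N/m^k can be checked directly; a small illustrative sketch (the sample size N = 1000 is an assumption for illustration, not a value from this study):

```python
# Illustrative check of the table sizes quoted above and of the rough
# sparseness bound: the minimum expected frequency cannot exceed the
# average cell frequency N / m**k (equality only in the trivial
# equal-probability case).
scales = {
    "ABI-UMW/P (one scale)": (2, 18),  # (m response categories, k items)
    "NEO-FFI (one scale)":   (5, 12),
    "SM scale":              (2, 21),
}

def cells(m, k):
    """Number of possible response patterns J = m**k."""
    return m ** k

def min_expected_bound(n, m, k):
    """Upper bound on the minimum expected frequency: N / m**k."""
    return n / cells(m, k)

for name, (m, k) in scales.items():
    print(f"{name}: J = {cells(m, k):,} cells, "
          f"min expected frequency <= {min_expected_bound(1000, m, k):.2e}")
```

Even for the smallest of these tables, the bound is far below any rule of thumb for minimum expected frequencies.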
Researchers in educational measurement proposed using simulation techniques like the parametric bootstrap for testing the results of a latent class analysis as early as the beginning of the eighties (Aitkin et al., 1981). In that application, a 38-item scale was analyzed using the parametric bootstrap test for the likelihood-ratio statistic G². Collins et al. (1993) also observed that the χ² distribution is not appropriate when analyzing sparse data and recommended Monte Carlo sampling, which is equivalent to the parametric version of the bootstrap. Langeheine et al. (1996) investigated the use of the parametric bootstrap by examining the results of bootstrapping goodness-of-fit statistics for five real datasets with increasing sparseness. Their conclusion was that the results of the bootstrap test can vary for different statistics when the data are sparse. It was proposed to reject a model if at least one statistic is rejected by the bootstrap procedure.

The Monte Carlo study presented here fills the gap between the applications of the parametric bootstrap to goodness-of-fit statistics in real data problems and simulation studies focused on the distribution of those statistics. The main aim of the present study is to examine the performance of the bootstrap test for

i

k

k

MPR{online 1997, Vol.2, No.2

c 1997 Pabst Science Publishers

Bootstrapping Goodness-of-Fit Statistics for Sparse Categorical Data

32

various goodness-of-fit statistics under different conditions of sparseness, sample size, data structure, and also for different psychometric models. The results apply to a large variety of models for categorical data, as any model assigning a probability P(X = x_1, …, x_k | θ) to each response pattern x_1, …, x_k can be tested by means of the method presented.

2 Goodness-of-fit statistics

Goodness-of-fit statistics like the Pearson X² are measures of divergence between the observed data and the expectation under a certain model. Models for categorical data providing a probability p(x_i) for each possible response pattern x_i = (x_i1, …, x_ik) can be tested by means of goodness-of-fit statistics. The pattern probabilities p(x_i) relate to the expected frequencies, E(x_i) = n p(x_i), which can be regarded as the prediction of the model under consideration. Most commonly used goodness-of-fit statistics compare these expected frequencies E(x_i) with the observed frequencies O(x_i) by means of a distance measure aggregated over all possible response patterns x_i. Parametric IRT models as well as latent class analysis (for a recent overview of models for categorical data and applications in the social sciences, see Rost & Langeheine, 1997) can be tested in this way, as they provide a probability p(x_i) for each response vector x_i.

The well-known Pearson X² as well as the likelihood-ratio G² goodness-of-fit statistic are both members of the power-divergence family CR(λ) (Cressie & Read, 1984), which is defined as

    CR(λ) = 2 / (λ(λ+1)) · Σ_{i=1}^{m^k} O(x_i) [ (O(x_i)/E(x_i))^λ - 1 ],

where O(x_i) is the observed frequency of response pattern x_i and E(x_i) is the corresponding expected frequency. Many other goodness-of-fit statistics are members of that family; Table 1 gives some examples, including the Pearson and the likelihood-ratio statistic. Read & Cressie (1988) have shown that all members of the CR(λ) family have the same limiting χ² distribution as the Pearson X² under certain regularity conditions (for a discussion of these conditions, see Bishop et al., 1975). They also outlined the effect of simultaneously increasing the number of observations and the number of cells in the contingency table: a limiting normal distribution can be derived for the whole CR(λ) family (Read & Cressie, 1988), similar to the results derived for the X² and the G² statistic (see, e.g., Holst, 1972; Osius & Rojek, 1992).


    Pearson            X²   = Σ (X_i - E_i)² / E_i             = CR(1)
    Cressie-Read       CR   = 1.8 Σ X_i [(X_i/E_i)^(2/3) - 1]  = CR(2/3)
    Likelihood-ratio   G²   = 2 Σ X_i ln(X_i/E_i)              = CR(0)
    Freeman-Tukey      FT²  = 4 Σ (√X_i - √E_i)²               = CR(-1/2)
    Modified G²        G²_m = 2 Σ E_i ln(E_i/X_i)              = CR(-1)
    Modified Pearson   X²_m = Σ (X_i - E_i)² / X_i             = CR(-2)

Table 1: Some classical goodness-of-fit statistics and their computation by means of the CR(λ) power-divergence family. All sums run over the m^k cells of the contingency table; X_i denotes the observed and E_i the expected frequency of cell i. (Note: CR(λ) with λ ≤ -1 cannot be used for tables with zero counts.)
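All entries of Table 1 can be obtained from one routine; the following is a minimal sketch of the CR(λ) family (the function name and the zero-count handling are illustrative, not part of the software used in this study):

```python
import math

def power_divergence(observed, expected, lam):
    """Power-divergence statistic CR(lam) of Cressie & Read (1984).

    lam = 1 gives the Pearson X^2, lam = 2/3 the Cressie-Read CR,
    lam -> 0 the likelihood-ratio G^2, lam = -1/2 the Freeman-Tukey FT^2.
    """
    if lam <= -1 and any(o == 0 for o in observed):
        raise ValueError("CR(lam) with lam <= -1 needs strictly positive counts")
    if lam == 0:    # limiting case: likelihood-ratio G^2
        return 2.0 * sum(o * math.log(o / e)
                         for o, e in zip(observed, expected) if o > 0)
    if lam == -1:   # limiting case: modified likelihood-ratio G^2_m
        return 2.0 * sum(e * math.log(e / o) for o, e in zip(observed, expected))
    c = 2.0 / (lam * (lam + 1.0))
    # terms with O_i = 0 contribute 0 for lam > -1 and are skipped
    return c * sum(o * ((o / e) ** lam - 1.0)
                   for o, e in zip(observed, expected) if o > 0)
```

The algebraic identities of Table 1 (e.g., CR(1) equal to Σ(X_i - E_i)²/E_i) hold whenever the observed and expected frequencies have the same total.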

However, when using goodness-of-fit statistics for analyzing sparse real data, there are no asymptotics to rely on, as both the number of observations (the sample size N) and the number of cells in the table (m^k) are constant. In these cases, alternative methods have to be found in order to obtain a reference distribution for the goodness-of-fit test.

3 Exploring the bootstrap for goodness-of-fit statistics

Generally speaking, the bootstrap can be considered a means to determine the unknown distribution F_T of a statistic T, when T is a function of the observed data (x_1, …, x_N) and a vector θ of parameters, i.e., t = T(x_1, …, x_N; θ). This method can be applied in many cases, as Efron and Tibshirani (1993) demonstrate in a recent textbook on the variety of currently available bootstrap methods. In the case considered here, assume that T is a goodness-of-fit statistic and is computed by means of the raw data (x_1, …, x_N) as well as the estimated model parameters θ̂. Then, the distribution of T is approximated by generating a sample of independent outcomes t_j for j = 1, …, b and constructing the empirical distribution F̂_t. But how is the sample (t_j)_{j=1,…,b} to be generated? Clearly, the outcomes t_j have to be drawn from the distribution of T in order to approximate F_T by F̂_t.


This can be achieved by simulating additional data (for each t_j a simulated dataset (x_1j, …, x_Nj)) according to the model equation and the estimated parameter vector θ̂. Then, starting from θ̂ (Langeheine et al., 1996), parameters are estimated for the simulated dataset and an outcome t_j of the statistic is computed. This procedure of simulating data and re-estimating parameters is repeated b times, or until an accuracy criterion for the approximation is met. If the original data can be fitted by the model, the value t = T(x_1, …, x_N; θ̂) of the original data will not be significantly different from the bootstrap sample (t_j)_{j=1,…,b}. If the original data are not appropriately fitted by the model, however, then t (being a goodness-of-fit measure of the original data) should be significantly different from the t values of the bootstrap samples.

Performing the bootstrap for goodness-of-fit statistics is highly computing-intensive, as the computation of the t values not only includes the necessary simulation of data (Bollen & Stine, 1993) but also the repeated estimation of model parameters for the simulated data; these estimations play a key role in both computation time and the outcome t (Langeheine et al., 1996; von Davier, 1997). The number b of bootstrap samples has to be chosen with respect to the aim of the bootstrap. If the bootstrap is used for estimating the distribution of T, b has to be very large (Efron & Tibshirani, 1993). If the bootstrap is used for testing, the number of bootstrap samples can be quite small, as the nominal level of testing the value t by means of the empirical distribution of the t_j is met quite accurately with sample sizes of b = 40 or even b = 20 (see Aitkin et al., 1981; Marriott, 1979).
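The simulate/re-estimate/recompute loop just described can be sketched as follows, using a deliberately simple one-class independence model for k dichotomous items and the Pearson X² statistic in place of the LCA/MRM estimation used in the study; all function names are hypothetical and the code is unrelated to WINMIRA:

```python
import random
from itertools import product

def fit_independence(data, k):
    """ML estimate for the one-class independence model:
    the marginal probability of a 1-response for each item."""
    n = len(data)
    return [sum(row[i] for row in data) / n for i in range(k)]

def pattern_probs(p):
    """Model-implied probability of every response pattern."""
    probs = {}
    for pattern in product((0, 1), repeat=len(p)):
        q = 1.0
        for x, pi in zip(pattern, p):
            q *= pi if x == 1 else 1.0 - pi
        probs[pattern] = q
    return probs

def pearson_x2(data, probs):
    """Pearson X^2 over the full contingency table of patterns."""
    n = len(data)
    observed = {}
    for row in data:
        key = tuple(row)
        observed[key] = observed.get(key, 0) + 1
    return sum((observed.get(pat, 0) - n * q) ** 2 / (n * q)
               for pat, q in probs.items())

def simulate(p, n, rng):
    """Draw n response vectors from the fitted model."""
    return [[int(rng.random() < pi) for pi in p] for _ in range(n)]

def parametric_bootstrap_p(data, k, b=40, seed=1):
    """Share of bootstrap statistics t_j that reach or exceed the
    observed t; parameters are re-estimated for every bootstrap sample."""
    rng = random.Random(seed)
    p_hat = fit_independence(data, k)
    t_obs = pearson_x2(data, pattern_probs(p_hat))
    count = 0
    for _ in range(b):
        boot = simulate(p_hat, len(data), rng)
        p_boot = fit_independence(boot, k)  # re-estimation step
        if pearson_x2(boot, pattern_probs(p_boot)) >= t_obs:
            count += 1
    return count / b
```

The re-estimation inside the loop is the step that dominates computation time for realistic models, which is why the number of bootstrap samples matters so much in practice.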

4 Design of the study and method

The main goal of the Monte Carlo study presented here is to investigate whether the bootstrap can be used for goodness-of-fit statistics under various sparseness conditions. For this purpose, four frequently used goodness-of-fit statistics were chosen, namely the Pearson X², the Cressie-Read CR(2/3), the likelihood-ratio G², and the Freeman-Tukey FT statistic (compare Table 2). The study was also aimed at testing the bootstrap procedure for different models (independent variable V1) for categorical data as well as different response formats (V2), so that both latent class analysis (LCA; Lazarsfeld, 1950) and the mixed Rasch model (MRM; Rost, 1990) were used in their dichotomous and polytomous versions.

Latent class analysis assumes that the observed data are a sample drawn from an unknown mixture of distributions (= latent classes) rather than from a single distribution. All dependencies between the observed variables are assumed to be due to this mixture, i.e., within each latent class, multinomial independence of the response variables is assumed to hold and the observations are independent and identically distributed. The Rasch model (Rasch, 1960) is a probabilistic model for categorical data with a number of favorable mathematical properties (see Fischer & Molenaar, 1995). A major assumption of the Rasch model is that the raw score (= sum over response variables) is a sufficient statistic. Therefore, the Rasch model (or developments based on this model) can be used to test whether aggregating data by means of summation is appropriate. Rost (1990) developed the mixed Rasch model, an integration of the Rasch model and latent class analysis. This model assumes that the Rasch model holds in each component distribution of a mixture of distributions. Both the Rasch model and latent class analysis are special cases of the mixed Rasch model.

Sparseness, the main variable expected to influence the distribution of X², G², FT, and CR(2/3), was varied indirectly by the sample size (V3) and directly by the response format as mentioned above (V2) and the scale length (V4), i.e., the number of items k. Finally, the number of classes (V5) in the estimated model was varied, as both the LCA and the MRM are mixture distribution models allowing for more than one population (= latent class) with a distinct set of parameters in each latent class. Datasets were generated according to a two-class model for each of the independent variable conditions. The number of classes (V5) was varied in order to estimate parameters for an underfitting (1-class) model, a fitting (2-class) model, and an overfitting (3-class) model. Table 2 gives an overview of the independent variables used for the Monte Carlo study. All computations were carried out with WINMIRA (von Davier, 1994), which was slightly modified in order to allow for a more compact output of the bootstrap results.
A maximum of 999 iterations was used for the initial estimation, and the bootstrap estimations had a maximum of 100 iterations, as these estimations started from the final parameter estimates of the initial sample. For each independent variable condition, 40 datasets were used, and for each of these datasets, 40 bootstrap samples were generated. This yields 40 + 40 · 40 = 1,640 parameter estimations in each of the 48 independent variable conditions.²

5 Results

The following section, in order to be comparable to the findings of Koehler and Larntz (1980) and Collins et al. (1993), summarizes the distributional moments of all four goodness-of-fit statistics in the Monte Carlo study.

² The seemingly low number of 40 datasets in each cell was chosen because the bootstrap procedure requires an additional parameter estimation for each bootstrap sample. The estimated computation time for this study was approx. 55 days on a Pentium 90 MHz machine.

    V1  Model                       LCA | MRM
    V2  Response format             dichotomous | polytomous
        N. of categories            2 | 4
    V3  Sample size                 N = 500 | N = 4000
    V4  N. of items (dichotomous)   k = 6 | k = 24
        N. of items (polytomous)    k = 6 | k = 12
    V5  N. of latent classes        1 ('underfit') | 2 ('true model') | 3 ('overfit')

Table 2: Summary of independent variables in the Monte Carlo study.

Section 5.2 summarizes the performance of the bootstrap test for goodness-of-fit statistics by means of a simultaneous graphical presentation of the relative frequencies of model choice. Section 5.3 gives an overview of the error rates for the bootstrap test of the 2-class models. For reasons of clarity and economy of presentation, only the overall well-performing statistics are considered in that section.

5.1 Distributional moments

Table 3 shows the means and standard deviations of all four statistics when the model holds, i.e., if a 2-class model is fitted to the simulated data. The table summarizes the values for both the dichotomous LCA and the dichotomous MRM.

                          Mean                                     Standard deviation
    Statistic   N    LCA k=6   LCA k=24   MRM k=6   MRM k=24   LCA k=6   LCA k=24   MRM k=6   MRM k=24
    X²        500      49.63   16943602     43.44   15786072      8.37    1741571     10.33    2788693
    X²       4000      48.37   16775983     42.24   16909512     10.01     538760     11.26     884229
    CR        500      49.96     618724     43.69     504323      8.41      33723     10.20      44003
    CR       4000      48.37    1242602     42.32    1049812     10.06      20427     11.29      25886
    G²        500      52.81       8218     47.58       6824      9.70         96     11.74         93
    G²       4000      48.54      49575     42.73      39842     10.25        225     11.52        204
    FT        500      59.74       3883     59.53       3508     14.70          8     18.02         21
    FT       4000      48.81      29484     43.47      25152     10.47         51     11.84         75

Table 3: Means and standard deviations of all four goodness-of-fit statistics for the 2-class solution of the dichotomous LCA and the dichotomous MRM.

Clearly, all four statistics (except FT with a sample size of N = 500) behave similarly when the number of items is k = 6. In this case, the expectation of all four statistics under the χ² approach would be df_LCA = 50 for the LCA


and df_MRM = 42 for the MRM. The standard deviations are also close to the expected value √(2df), though there seems to be a small but systematic bias in the MRM conditions. In contrast to the k = 6 condition, there are large differences between the four statistics when k = 24, i.e., when the data are very sparse: X² has the highest means, followed by the CR = CR(2/3) statistic, whereas G² and FT have comparably small means. Table 3 also shows that the means of CR, G², and FT are affected by sample size much more than the mean of X² is. Standard deviations are also affected by sample size: for X² and CR, standard deviations decrease with sample size, and for FT and G² they increase. The standard deviations of G² and FT seem to be extremely small anyway (see section 6 for an explanation of this finding).

Table 4 similarly shows the means and standard deviations of the 2-class models for the polytomous conditions. Note that the maximum number of items is k = 12 and the number of cells in the contingency table is 4^k in these conditions.

                          Mean                                     Standard deviation
    Statistic   N    LCA k=6   LCA k=12   MRM k=6   MRM k=12   LCA k=6   LCA k=12   MRM k=6   MRM k=12
    X²        500    4013.41   16176715   3973.93   15924332    176.16    1758859    182.46    1631434
    X²       4000    4067.05   16739411   4039.20   16877342     88.65     630769    115.08     638203
    CR        500    2606.05     683894   2575.74     672092     70.25      31445     67.16      23715
    CR       4000    3710.15    1405757   3679.98    1410092     63.53      20464     81.78      17968
    G²        500    1747.62       9079   1733.26       9053     31.05         79     32.33         63
    G²       4000    3864.54      56541   3830.88      56546     58.43        219     69.56        184
    FT        500    2160.38       3942   2147.76       3941     35.40          3     33.69          2
    FT       4000    5640.29      30735   5594.62      30737     88.25         24    101.33         21

Table 4: Means and standard deviations of all four goodness-of-fit statistics for the 2-class solution of the polytomous LCA and the polytomous MRM.

The four statistics differ strongly in the conditions with k = 12. The means and standard deviations in the polytomous k = 12 conditions are very close to those in the corresponding dichotomous conditions with k = 24 items. As the number of cells is 2^24 = 4^12 in both contingency tables, this similarity can be explained by the table size. The standard deviations of G² and FT are extremely small, as already noted for the dichotomous case.

In contrast to the dichotomous case, the polytomous "small scale" conditions with k = 6 items can be regarded as slightly sparse, as these conditions have 4^6 = 4,096 cells, so that even if the sample size is N = 4,000 there will be some patterns with zero counts. This implies that the χ² distribution should not be used in these conditions and some differences can be expected between the four statistics. As can be seen from the table, the means differ considerably


depending on sample size for FT, G², and CR, but less so for X². Again, standard deviations increase with sample size for FT and G², are of similar size for CR in both sample size conditions, and decrease for X².

The classical χ² approach would lead to wrong conclusions in most sparseness conditions of this Monte Carlo study, as the means and standard deviations presented in the tables above differ strongly from the expected values of the χ² distribution. These findings correspond to results of other simulation studies on the behavior of goodness-of-fit statistics when expectations are small (for an overview of those studies, see Read & Cressie, 1988).

5.2 Performance of the bootstrap test

In this subsection, the relative frequencies of model choice resulting from the bootstrap test for goodness-of-fit statistics are presented. The relative frequencies of model choice represent the rate of correct decisions of the bootstrap test if the model can fit the data. Accordingly, if the model cannot fit the data, this relative frequency represents the rate of wrong decisions. For this analysis, a graphical representation of the results was chosen for two reasons: First, exact expectations can be stated only for correctly specified models (von Davier, 1997), as the rates of correctly rejecting underfitting models depend on the unknown β error, and the rate of choosing overfitting models cannot be predicted exactly. Second, an "ideal" graph can be defined as a pattern-matching template in order to compare the performance of the bootstrap test simultaneously for several goodness-of-fit statistics. The rationale for expecting such a graph is based on the following considerations:

• If an (underfitting) 1-class model is estimated to fit a 2-class dataset, the bootstrap test is expected to reject the model in almost all cases (this is analogous to expecting the unknown β error to be close to zero).

• The (appropriately fitting) 2-class model should be rejected according to the chosen α error level. The expected rates of model choice for the 2-class models are 1 - α for the three α levels chosen, as α is the expected rate of false rejection in this case.

• The overfitting 3-class model cannot be identified as such. This model fits too many parameters to the 2-class dataset, but nevertheless, a 2-class dataset can be fitted in this way. Therefore, the same rate of model choice is also assumed for the 3-class model. (By applying "Occam's razor", the researcher would decide in favor of some other fitting model with a smaller number of parameters and/or latent classes.)


Figure 1: Ideal relative frequencies of model choice for all four statistics. (Relative frequency plotted against the number of classes, 1-3, with one ideal profile per error level: α = 0.025, 0.100, 0.250.)

For the decisions based on the bootstrap distribution of the statistics, error levels of α₁ = 0.025, α₂ = 0.1, and α₃ = 0.25 were chosen in this analysis. In order to choose a model, a minimum number of 1 (= 0.025 · 40), 4 (= 0.1 · 40), or 10 (= 0.25 · 40) bootstrap samples (out of a total of b = 40) having a higher goodness-of-fit value than the original data has to be observed. If fewer than this minimum number of bootstrap samples have a higher value of the statistic, the model is rejected at the corresponding level.

Figure 2 shows the graphs for all four statistics in the conditions of the dichotomous LCA; Figure 3 shows the respective graphs for the dichotomous MRM conditions. In both figures, the graphs of the Pearson X² and the CR statistic are close to the ideal shown in Figure 1. The bootstrap appears to work well for both statistics, even in the conditions with k = 24 items, where the data are very sparse. The graphs of FT and G² also seem to fit the ideal in the small-scale conditions with k = 6 items, i.e., when the data are not sparse. In contrast, FT and G² differ strongly from the ideal in the sparse data conditions: the non-fitting 1-class models are accepted too often when the bootstrap is carried out for FT and G². There is also some indication that the 2-class model is rejected too often; the rate of model choice is as low as 0.5 in some conditions.
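The decision rule just described can be stated compactly; the following sketch reproduces the retain/reject rule for b = 40 bootstrap samples (function names are illustrative, and a small epsilon guards against floating-point error in α · b):

```python
import math

B = 40  # bootstrap samples per dataset, as in this study

def min_exceedances(alpha, b=B):
    """Smallest number of bootstrap statistics above the observed t that
    still lets the model be retained at level alpha (alpha * b in the
    study; the epsilon guards against floating-point rounding)."""
    return math.ceil(alpha * b - 1e-9)

def retain_model(t_observed, t_boot, alpha):
    """Retain the model iff enough bootstrap values exceed t_observed."""
    exceed = sum(1 for tj in t_boot if tj > t_observed)
    return exceed >= min_exceedances(alpha, len(t_boot))

for alpha in (0.025, 0.10, 0.25):
    print(alpha, min_exceedances(alpha))  # -> 1, 4, 10
```

For the three levels used here this yields exactly the critical counts 1, 4, and 10 quoted above.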

Figure 2: Relative frequencies of model choice for all four statistics in the dichotomous LCA conditions. (Panels: X², CR, G², and FT, each for k = 6 and k = 24 crossed with N = 500 and N = 4000; relative frequency plotted against the number of classes, 1-3.)

On the following pages, the respective graphs for the polytomous conditions are presented: Figure 4 shows the graphs for the polytomous LCA, and the graphs for the polytomous MRM can be found in Figure 5. Note that the data are already moderately sparse in the polytomous conditions with k = 6 items, as the number of possible response patterns is 4^6 = 4,096 in this case. Therefore, effects of sparseness can be expected even when k = 6, at least within the small sample conditions where N = 500. X² and CR also perform well in the polytomous conditions of this study. Minor deviations can be found for the CR statistic in the small sample (N = 500) conditions, as the graphs in these conditions indicate a slightly higher rate of rejection than expected.

Figure 3: Relative frequencies of model choice for all four statistics in the dichotomous MRM conditions. (Same panel layout as Figure 2.)

Clearly, the bootstrap for FT and G² does not work in most of the polytomous conditions. It seems that the bootstrap test for these statistics fails when the data are sparse, even though a certain recovery can be observed when the sample size increases (compare the N = 4,000 conditions with k = 6 items). For practical purposes, of course, it cannot be known beforehand whether the sample size is big enough to use the bootstrap for G² and FT.

These results show that bootstrapping goodness-of-fit statistics can lead to different conclusions depending on which statistic is used. Bootstrapping the Pearson X² and the CR statistic performs well even when the data are very sparse. In contrast, the bootstrap should not be used for the FT and the G² statistic, as under many different conditions of sparseness

c 1997 Pabst Science Publishers



[Figure 4: Relative frequencies of model choice for all four statistics (X², CR, G², FT) in the polytomous LCA conditions, for k = 6 and k = 24 items and sample sizes N = 500 and N = 4,000. Each panel shows the rate of choosing the 1-, 2-, and 3-class model.]

this procedure fails to work as expected and may lead to wrong decisions. Especially the underfitting 1-class models have been accepted too frequently when bootstrapping FT or G². These findings explain the observation of Langeheine et al. (1996) that the bootstrap can lead to different decisions for different goodness-of-fit statistics when the data are sparse. Based on the results of this Monte Carlo study, the recommendation of Langeheine et al. (1996) can be modified in the following way: a model for sparse categorical data should be tested by means of the bootstrapped X² and CR statistics and accepted if neither X² nor CR rejects the model. The failure of FT and G² may be due to a lack of power for these statistics.
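The recommended decision rule can be sketched as follows. This is only a minimal illustration with a plain multinomial model: the cell probabilities, sample size, and observed counts below are made up, and the model is not re-estimated on each bootstrap sample, whereas in the study the LCA or MRM parameters would be refitted before the statistics are computed. The sketch therefore only approximates the full parametric bootstrap.

```python
import random
from collections import Counter

def pearson_x2(obs, exp):
    """Pearson X^2 over all cells with positive expectation."""
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))

def cressie_read(obs, exp, lam=2.0 / 3.0):
    """Power-divergence statistic CR(lambda); cells with O = 0 contribute 0."""
    s = sum(o * ((o / e) ** lam - 1.0) for o, e in zip(obs, exp) if o > 0)
    return 2.0 / (lam * (lam + 1.0)) * s

def bootstrap_p_value(obs, probs, n, stat, b=200, rng=None):
    """Parametric bootstrap: draw b samples of size n from the fitted cell
    probabilities and return the proportion of bootstrap statistics that
    reach or exceed the observed statistic."""
    rng = rng or random.Random(0)
    exp = [n * p for p in probs]
    t_obs = stat(obs, exp)
    cells = range(len(probs))
    count = 0
    for _ in range(b):
        draws = Counter(rng.choices(cells, weights=probs, k=n))
        boot = [draws.get(i, 0) for i in cells]
        count += stat(boot, exp) >= t_obs
    return count / b

# Hypothetical fitted model over 8 response patterns (illustrative numbers only).
probs = [0.30, 0.20, 0.15, 0.10, 0.10, 0.05, 0.05, 0.05]
n = 100
obs = [28, 22, 14, 11, 9, 6, 5, 5]

p_x2 = bootstrap_p_value(obs, probs, n, pearson_x2)
p_cr = bootstrap_p_value(obs, probs, n, cressie_read)
# Following the recommendation above: retain the model only if neither rejects.
accept = p_x2 > 0.05 and p_cr > 0.05
```

The decision rule mirrors the modified recommendation: the model is retained only if neither the bootstrapped X² nor the bootstrapped CR statistic rejects it.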




[Figure 5: Relative frequencies of model choice for all four statistics (X², CR, G², FT) in the polytomous MRM conditions, for k = 6 and k = 24 items and sample sizes N = 500 and N = 4,000. Each panel shows the rate of choosing the 1-, 2-, and 3-class model.]

Recovery of performance with increasing sample size may be expected, as indicated by the results for FT and G² in the polytomous conditions with sample size N = 4,000 and k = 6 items. For practical purposes, neither statistic should be used with the bootstrap test, as both are clearly inferior to X² and CR. In Section 6, an explanation of the differences in performance of the bootstrap test between the four goodness-of-fit statistics is outlined.





5.3 Error rates in the Monte Carlo study

A different way of analyzing the performance of the bootstrap test is to examine the frequencies of model choice by means of confidence bands. These confidence bands can be defined based on the binomial B(b, 1 − α) distribution for the correctly specified 2-class models (see von Davier, 1997). Table 5 shows the results of this analysis based on the B(40, 1 − α) binomial distribution with confidence bands of approx. 95%.³

alpha-level:                 0.025  0.100  0.250    0.025  0.100  0.250
Expectation:                 0.975  0.900  0.750    0.975  0.900  0.750
Band limits  lower:          0.925  0.800  0.625    0.925  0.800  0.625
             upper:          1.000  0.975  0.875    1.000  0.975  0.875

                               Dichotomous Models
meas.  N     model               6 items                24 items
X²     500   LCA             1.000  0.925  0.775    1.000  0.950  0.775
       500   MRM             0.975  0.825  0.700    1.000  0.950  0.850
       4000  LCA             0.975  0.925  0.800    1.000  0.950  0.800
       4000  MRM            *0.900  0.875  0.800    1.000  0.925  0.700
CR     500   LCA             1.000  0.900  0.775    0.950  0.900  0.700
       500   MRM             0.975  0.850  0.750    1.000  0.925  0.850
       4000  LCA             0.975  0.925  0.800    1.000  0.950  0.825
       4000  MRM             0.925  0.875  0.800    0.975  0.925  0.800

                               Polytomous Models
                                 6 items                12 items
X²     500   LCA             1.000  0.950  0.750    0.975  0.950  0.825
       500   MRM             0.925  0.875  0.775    1.000  0.950  0.775
       4000  LCA             0.975  0.950  0.625    1.000  0.925  0.725
       4000  MRM             0.950  0.800  0.625    0.950  0.825  0.625
CR     500   LCA             0.975  0.900  0.650    0.950  0.900  0.650
       500   MRM             0.950  0.825 *0.525    0.950  0.950 *0.575
       4000  LCA             0.975  0.950  0.700    1.000  0.900  0.725
       4000  MRM             0.975  0.875  0.650    0.950 *0.750 *0.550

Table 5: Relative frequencies of model choice for the X² and the CR statistics in the 2-class conditions. Significant deviations from the confidence band are marked with an asterisk.

³ As the binomial is a discrete distribution, it was not possible to define exact 95% bands. For α₁ = 0.025, the band was 98.2%; for α₂ = 0.1, it was 96.9%; and for α₃ = 0.25, it was 95.7%.

Table 5 shows only one significant deviation of the model choice rates for the X² statistic. There are four significant deviations for the CR statistic, located in the polytomous conditions. All significant deviations are directed in the same way, namely that the actual rate of model choice is lower than the lower bound





of the confidence band, i.e., the models are rejected more often than expected by the error rate. Nevertheless, these deviations should not be overly stressed, as some of them might be due to capitalization on chance. As can be expected from the graphs in Section 5.2, FT and G² show unsystematic deviations, both lower than the lower bound and higher than the upper bound of the confidence bands. The number of significant deviations for these two statistics, not shown here, is 15 for G² and 12 for FT (see von Davier, 1997). The conclusions that can be drawn from Table 5 fit the interpretation of the graphical analysis: bootstrapping X² as well as CR performs well even when the data are very sparse.
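The confidence bands in Table 5 can be reproduced from the binomial distribution. The sketch below assumes the bands were built by accumulating the most probable counts of B(40, 1 − α) until at least 95% coverage is reached; the exact construction used in the study is not spelled out here, but under this assumption the band limits and the approximate coverages reported in the footnote are recovered.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def confidence_band(n, p, target=0.95):
    """Accumulate the most probable counts of B(n, p) until the target
    coverage is reached; return (lower, upper) counts and actual coverage."""
    order = sorted(range(n + 1), key=lambda k: binom_pmf(k, n, p), reverse=True)
    chosen, coverage = [], 0.0
    for k in order:
        chosen.append(k)
        coverage += binom_pmf(k, n, p)
        if coverage >= target:
            break
    return min(chosen), max(chosen), coverage

# b = 40 replications per condition; alpha-levels as in Table 5.
for alpha in (0.025, 0.100, 0.250):
    lo, hi, cov = confidence_band(40, 1 - alpha)
    print(f"alpha={alpha}: band [{lo/40:.3f}, {hi/40:.3f}], coverage {cov:.3f}")
```

For α = 0.025 this yields the band [0.925, 1.000], for α = 0.1 the band [0.800, 0.975], and for α = 0.25 the band [0.625, 0.875], matching the limits in Table 5.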

6 Discussion

In this section, an explanation for the differential behavior of the bootstrap test for the four statistics in the Monte Carlo study is outlined. Consider the sparse-data case, when the number of possible response patterns is by far bigger than the number of observations (denote this case by m^k >> N). In this situation, most patterns x_i = (x_{i1}, ..., x_{ik}) have not been observed, i.e., O(x_i) = 0. Even when a certain pattern x_i is observed, it will most probably be unique in the data, so that the corresponding observed frequency will be O(x_i) = 1. This means that in the power-divergence statistic

    CR(λ) = 2/(λ(λ+1)) · Σ_{i=1}^{m^k} O(x_i) [ (O(x_i)/E(x_i))^λ − 1 ],

the majority of summation terms will vanish when m^k >> N; namely, all terms with O(x_i) = 0 do not contribute to the statistic. In order to study the behavior of the remaining sum of CR(λ), the Cressie-Read term

    O(x_j) [ (O(x_j)/E(x_j))^λ − 1 ]

with O(x_j) = 1 can be plotted against E(x_j) in a certain interval. A plot for the X², the CR(2/3), the G², and the FT statistic is shown in Figure 6. Clearly, the main differences between the graphs can be observed in the region of expected frequencies smaller than 0.4, which is the most relevant region in the sparse-data case. X² and CR are very steep in that region, whereas FT and G² show comparably flat graphs. For expected frequencies above 0.4, the steepness of the graphs of all four statistics does not vary much. This partially




[Figure 6: Slope of the Cressie-Read term with observed frequency O(x) = 1 and expected frequencies varied from 0.05 to 2.0, for the X², CR(2/3), G² (LR), and FT statistics.]

explains why all four statistics work similarly when expectations are not small. Steepness is related to the value of λ the four statistics have in the power-divergence family (compare Table 2). With decreasing λ, the graph of CR(λ) flattens in the region of low expected values. Consider two observations, both with O(x_1) = O(x_2) = 1, but with E(x_1) = 0.05 being small and E(x_2) = 0.20 being comparably large. Then X² will penalize the observation with E(x_1) = 0.05 much more than the one with E(x_2) = 0.20. In comparison, FT will penalize both cases "almost" equally. This differential penalization of small expected frequencies might be the reason both for the small standard deviations of G² and FT presented in Section 5.1 and for the failure of the bootstrap test for these statistics in the Monte Carlo study. Nevertheless, the graphs for G² and FT are not totally flat, so an increasing sample size might enable appropriate model choice with these statistics as well (as indicated by the polytomous conditions with sample size N = 4,000 and k = 6 items). As the failure of bootstrapping FT and G² is not restricted to falsely accepting the underfitting models, there is no reason to assume that FT and G² would work fine if the latent classes differed more with respect




to the class-specific parameters. The parameters, however, have been defined to allow for a clear separation of the two latent classes (see von Davier, 1997). Moreover, bootstrapping FT and G² even fails to work as expected for the correctly specified 2-class models, which should be accepted as fitting the data even when the classes are very similar (following the same reasoning by which a 3-class model cannot be detected as overfitting when fitted to a 2-class dataset). Consequently, bootstrapping FT and G² cannot be recommended at the present stage of knowledge; but as long as there is no indication of systematic failure of X² and CR in this study, there is no need to apply any other statistic.
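The differential penalization can be made concrete by evaluating the per-cell power-divergence contribution for an observed frequency of 1 at the λ values that correspond to the four statistics (X²: λ = 1, CR: λ = 2/3, G²: λ → 0, FT: λ = −1/2). The sketch below is an illustration of this argument, not code from the study.

```python
from math import log

def pd_cell(o, e, lam):
    """Per-cell contribution 2/(lam*(lam+1)) * o * [(o/e)**lam - 1] of the
    power-divergence family, with the limit 2*o*log(o/e) at lam = 0."""
    if lam == 0:
        return 2.0 * o * log(o / e)
    return 2.0 / (lam * (lam + 1.0)) * o * ((o / e) ** lam - 1.0)

lams = {"X2": 1.0, "CR": 2.0 / 3.0, "G2": 0.0, "FT": -0.5}

# Penalty for a unique pattern (O = 1) at a small vs. a larger expectation.
# The ratio small/larger shrinks as lambda decreases: X2 > CR > G2 > FT.
for name, lam in lams.items():
    small, larger = pd_cell(1, 0.05, lam), pd_cell(1, 0.20, lam)
    print(f"{name}: E=0.05 -> {small:.2f}, E=0.20 -> {larger:.2f}, "
          f"ratio {small / larger:.2f}")
```

With O = 1, the penalty at E = 0.05 relative to E = 0.20 drops from a factor of about 4.8 for X² to about 1.4 for FT, which is exactly the steepness difference visible in Figure 6.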

References

[1] Aitkin, M., Anderson, D., & Hinde, J. (1981). Statistical modelling of data on teaching styles. Journal of the Royal Statistical Society, Series A, 144, 419-461.
[2] Bishop, Y. M. M., Fienberg, S. E., & Holland, P. W. (1975). Discrete multivariate analysis: Theory and practice. Cambridge, MA: MIT Press.
[3] Bollen, K. A., & Stine, R. A. (1993). Bootstrapping goodness-of-fit measures in structural equation models. In K. Bollen & J. Long (Eds.), Testing structural equation models (pp. 111-135). Newbury Park, CA: Sage.
[4] Collins, L. M., Fidler, P. L., Wugalter, S. E., & Long, J. D. (1993). Goodness-of-fit testing for latent class models. Multivariate Behavioral Research, 28(3), 375-389.
[5] Cressie, N., & Read, T. R. C. (1984). Multinomial goodness-of-fit tests. Journal of the Royal Statistical Society, Series B, 46(3), 440-464.
[6] Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7, 1-26.
[7] Efron, B. (1982). The jackknife, the bootstrap and other resampling plans. CBMS-NSF Regional Conference Series in Applied Mathematics. Philadelphia, PA: SIAM.
[8] Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman & Hall.
[9] Fienberg, S. E. (1979). The use of chi-squared statistics for categorical data problems. Journal of the Royal Statistical Society, Series B, 41(1), 54-64.
[10] Fischer, G., & Molenaar, I. (1995). Rasch models: Foundations, recent developments and applications. New York: Springer.
[11] Holst, L. (1972). Asymptotic normality and efficiency for certain goodness of fit tests. Biometrika, 59, 137-145.




[12] Koehler, K. J., & Larntz, K. (1980). An empirical investigation of goodness-of-fit statistics for sparse multinomials. Journal of the American Statistical Association, 75, 336-344.
[13] Krohne, H.-W., Rösch, W., & Kürsten, F. (1989). Die Erfassung von Angstbewältigung in physisch bedrohlichen Situationen. Zeitschrift für Klinische Psychologie, 18(3), 230-242.
[14] Langeheine, R., Pannekoek, J., & van de Pol, F. (1996). Bootstrapping goodness-of-fit measures in categorical data analysis. Sociological Methods and Research, 24(4), 492-516.
[15] Lazarsfeld, P. F. (1950). The logical and mathematical foundations of latent structure analysis. In S. A. Stouffer, L. Guttman, E. A. Suchman, P. F. Lazarsfeld, S. A. Star, & J. A. Clausen, Studies in social psychology in World War II, Vol. IV: Measurement and prediction. Princeton, NJ: Princeton University Press.
[16] Marriot, F. (1979). Barnard's Monte Carlo tests: How many simulations? Applied Statistics, 28, 75-77.
[17] McCrae, R. R., & Costa, P. T. (1987). Validation of the five-factor model of personality across instruments and observers. Journal of Personality and Social Psychology, 52, 81-90.
[18] Osius, G., & Rojek, D. (1992). Normal goodness-of-fit tests for multinomial models with large degrees of freedom. Journal of the American Statistical Association, 87, 1145-1152.
[19] Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: The Danish Institute of Educational Research.
[20] Read, T., & Cressie, N. (1988). Goodness-of-fit statistics for discrete multivariate data. Springer Series in Statistics. New York: Springer.
[21] Rost, J., & Langeheine, R. (1997). Applications of latent trait and latent class models in the social sciences. Münster: Waxmann.
[22] Rost, J. (1990). Rasch models in latent classes: An integration of two approaches to item analysis. Applied Psychological Measurement, 14, 271-282.
[23] Snyder, M. (1974). Self-monitoring of expressive behavior. Journal of Personality and Social Psychology, 30, 526-537.
[24] von Davier, M. (1994). WINMIRA: A Windows program for analyses with the Rasch model, with the latent class analysis and with the mixed Rasch model. IPN Software. Kiel, Germany: Institute for Science Education.
[25] von Davier, M. (1997). Methoden zur Prüfung probabilistischer Testmodelle (Methods for testing probabilistic test models; in German). IPN Schriftenreihe, Vol. 158. Kiel: IPN.

