Modelling Association between Two or More Categorical Variables that Allow for Multiple Category Choices

Christopher R. Bilder (1) and Thomas M. Loughin (2)

(1) Department of Statistics, University of Nebraska-Lincoln, Lincoln, NE, USA; http://www.chrisbilder.com
(2) Department of Statistics, Kansas State University, Manhattan, KS, USA

Abstract: Multiple-response (or pick any/c) categorical variables summarize multivariate binary responses, such as responses to survey questions which ask "pick any" or "choose all that apply" from a set of item responses. The purpose of this paper is to introduce extensions to loglinear modelling in order to model the associations between different multiple-response categorical variables simultaneously across all their items. Because individual item responses to a multiple-response categorical variable are likely to be correlated, the usual chi-square approximations to loglinear model goodness-of-fit statistics are not appropriate. A new bootstrap procedure is proposed to approximate the distribution of these statistics. Asymptotic chi-square distributional approximations are also developed. Simulations show the bootstrap procedure leads to tests that hold the correct size and perform as well as previously proposed non-model-based procedures for the important test of simultaneous pairwise marginal independence. The new models proposed here are the first to be able to model the association structure between MRCVs while providing tests that hold the correct size.

Key words: Bootstrap; correlated binary data; generalized loglinear model; marginal model; multiple-response categorical variable; pick any/c

1 Introduction

The loglinear modelling of single-response categorical variables has been extensively studied
and is explained well in categorical data analysis textbooks such as Agresti (2002) and Christensen (1997). Until recently, research had not focused as much on multiple-response
categorical variables (MRCVs). These types of variables arise from survey questions which ask respondents to “pick any” or “choose all that apply” from a set of items. MRCVs arise in surveys from a variety of settings including swine management (Agresti and Liu, 2001; Bilder and Loughin, 2004), contraceptive use (Foxman et al., 1997; Bilder and Loughin, 2002), and understanding patient symptoms (Bilder and Loughin, 2001). More generally, these variables arise from any multivariate binary response measurement. Responses to a MRCV can be all negative, one positive, any combination of positives, or all positive.

With regard to developing statistical methods to examine MRCVs, Umesh (1995) and Loughin and Scherer (1998) examine how to work with one MRCV and provide a test for marginal independence between a single-response and a multiple-response categorical variable. Agresti and Liu (1999, 2001) and Bilder et al. (2000) further this research by presenting refinements and new approaches to the problem. Bilder and Loughin (2002) examine the problem of testing for conditional independence between a single-response and a multiple-response categorical variable while conditioning on a third single-response categorical variable. Thomas and Decady (2000) and Bilder and Loughin (2004) propose extensions of a Pearson chi-square test for independence to test marginal independence between two MRCVs.

Most of this past research has involved developing extensions to the Pearson chi-square test for independence. Model-based testing approaches proposed in Agresti and Liu (1999, 2001) are found by Bilder et al. (2000) and Bilder and Loughin (2003) to have problems holding the correct size and obtaining parameter estimate convergence. The purpose of this paper is to develop new loglinear modelling approaches to describe associations between two or more MRCVs without the same problems that past modelling proposals have encountered.
The new models are shown to be flexible enough to handle different types of association structures between MRCVs. Distributional approximations to goodness-of-fit statistics are developed using the bootstrap and modifications to procedures developed by Rao and Scott (1984). These testing
procedures are found to hold the correct size for a variety of situations.

The Kansas farmer data first described in Richert et al. (1993) and later in papers such as Agresti and Liu (1999, 2001) and Bilder and Loughin (2004) give a situation where more than two MRCVs arise. For illustrative purposes, questions involving (a) the swine waste storage methods and (b) what swine waste is tested for, are of focus here. The farmers were told to pick all waste storage methods they use from among lagoon, pit, natural drainage, and holding tank. The farmers were also told to pick all items they test waste for from among nitrogen, phosphorus, and salt. Table 1 gives a tabular representation of the observed items picked for both MRCVs. For example, 27 farmers use lagoon as their waste storage method and also test for nitrogen. Since farmers can pick more than one item for each MRCV, they can appear multiple times in the table, which makes this problem different from the familiar problem involving single-response categorical variables (subjects would appear only once in a similar tabular format).

Research questions of interest that arise with these data are: 1) Is waste storage independent of what the waste is tested for? 2) If they are dependent, what is the association structure? Modelling the association structure can help researchers better understand the current waste management practices of Kansas farmers and then promote particular types of changes (if needed). In the single-response categorical variable case, these research questions can be answered using simple loglinear models of the type described in Agresti (2002, Chapter 8). Because of the multiple-response structure here, these types of models cannot be used directly. The proposed modelling procedure in this paper is the first to be able to model the association structure between the MRCVs and provide goodness-of-fit tests that hold the correct size.

The order of this paper is as follows.
Section 2 describes marginal loglinear models which are proposed to fit contingency tables with correlated counts between tables. This section specifically focuses on two MRCVs. Bootstrap procedures and chi-square distributional approximations are developed to approximate distributions of goodness-of-fit statistics. Section 3
describes extensions for three or more MRCVs. Section 4 applies these models to example data sets and describes how to choose a “best” model. Section 5 examines how well the proposed testing procedures hold the correct size. Section 6 summarizes and gives concluding recommendations. Models for association between one MRCV and single-response categorical variables are addressed in this section.

2 Loglinear models fit to counts in the item response table

Consider the case of two MRCVs generically denoted as W, representing I categories or
“items”, and Y, representing J items. For the waste management data, W corresponds to the test waste with I = 3 items, and Y corresponds to the waste storage with J = 4 items. Survey respondents contribute a vector of binary responses for both MRCVs indicating the items to which a positive response is given. For a randomly selected subject, s, let $\mathbf{W}_s = (W_{s1}, \ldots, W_{sI})'$ denote the binary responses for W and $\mathbf{Y}_s = (Y_{s1}, \ldots, Y_{sJ})'$ denote the responses for Y. Let the number of subjects be denoted by n.

Data arising from MRCVs can be summarized pairwise in what is called an item response table. Table 2 shows the waste management data in this format. This data format is easier to work with than the marginal table format commonly shown in MRCV research, i.e., Table 1, because both positive and negative responses to MRCV items are shown. It also helps to avoid an invariance problem associated with test statistics formed directly on marginal tables like Table 1 (see Bilder and Loughin, 2001 and 2004, for more information and examples).

Let $m_{ab(ij)}$ denote the number of ($W_i = a$, $Y_j = b$) responses where a = 0 or 1 and b = 0 or 1. For example, there are $m_{11(11)} = 27$ farmers who test waste for nitrogen and use lagoon as their waste storage method. The expectation of $m_{ab(ij)}$ is denoted by $\mu_{ab(ij)}$. The corresponding marginal probability is $P(W_i = a, Y_j = b) = \pi_{ab(ij)}$. Note that $\pi_{ab(ij)} = \mu_{ab(ij)}/\mu_{\bullet\bullet(ij)}$, and $\mu_{\bullet\bullet(ij)} = \mu_{00(ij)} + \mu_{01(ij)} + \mu_{10(ij)} + \mu_{11(ij)}$ is equal to n if there are no missing observations. Although it is unnecessary to assume no missing observations, this assumption is made throughout the paper to simplify the exposition.
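The construction of an item response table from the subject-level binary vectors can be sketched as follows. This is a minimal illustration rather than the authors' software: the function name `item_response_table` and the toy data are hypothetical, and the code indexes items from 0 where the text indexes them from 1.

```python
from itertools import product

def item_response_table(W, Y):
    """Count the (W_i = a, Y_j = b) responses for every item pair (i, j).

    W and Y are lists of equal length n; each element is a tuple of 0/1
    responses for one subject. The result maps (i, j) to a dict of the
    four counts keyed by (a, b), i.e. one 2x2 sub-table per item pair.
    """
    I, J = len(W[0]), len(Y[0])
    table = {}
    for i, j in product(range(I), range(J)):
        counts = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
        for w, y in zip(W, Y):
            counts[(w[i], y[j])] += 1
        table[(i, j)] = counts
    return table

# Hypothetical toy data (not the Kansas farmer data): n = 5, I = 2, J = 3.
W = [(1, 0), (1, 1), (0, 0), (0, 1), (1, 1)]
Y = [(1, 0, 0), (1, 1, 0), (0, 0, 1), (0, 1, 0), (1, 0, 1)]

table = item_response_table(W, Y)
for (i, j), counts in table.items():
    print(f"W{i + 1} x Y{j + 1}:", counts)
```

Each subject contributes exactly once to every one of the IJ sub-tables, so every sub-table sums to n when there are no missing observations. This is also why counts in different sub-tables are correlated.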
Modelling the marginal association between W and Y means examining patterns among the associations computed on the IJ 2×2 sub-tables within an item response table. These patterns are then attributed to features of W and Y. Loglinear models are used for this purpose because they lead naturally to odds ratio interpretations of these associations. To model the associations between $W_i$ and $Y_j$, separate loglinear models could be fit to each sub-table. However, it is more desirable to fit one model for all sub-tables to facilitate common inferences about the effects that the levels of two MRCVs have on the association between them. One model allows for the unified treatment of the data as coming from one sample instead of IJ different samples. This unified treatment is similar to Agresti and Liu (2001, p. 408) in how they use one marginal logit model to simultaneously model all items of a MRCV.

Agresti and Liu (1999) introduce the hypothesis of simultaneous pairwise marginal independence (SPMI), which is the extension of independence in the single-response categorical variable setting to the MRCV setting (Bilder and Loughin, 2004). SPMI denotes the simultaneous pairwise independence of two groups of binary random variables. Specifically, SPMI exists if $\mu_{ab(ij)} = \mu_{a\bullet(ij)}\mu_{\bullet b(ij)}/n$ holds true for all a = 0,1, b = 0,1, i = 1,…,I, and j = 1,…,J. Note that $\mu_{a\bullet(i1)} = \cdots = \mu_{a\bullet(iJ)}$ and $\mu_{\bullet b(1j)} = \cdots = \mu_{\bullet b(Ij)}$ when there are no missing observations. The SPMI hypotheses can also be written in terms of odds ratios. SPMI exists if
\[ \frac{\mu_{11(ij)}\,\mu_{00(ij)}}{\mu_{10(ij)}\,\mu_{01(ij)}} = 1 \]

for i = 1,…,I and j = 1,…,J. In the context of Table 2 then, SPMI represents simultaneous independence in each of the IJ sub-tables. The marginal loglinear model for SPMI is

\[ \log(\mu_{ab(ij)}) = \gamma_{ij} + \eta^{W}_{a(ij)} + \eta^{Y}_{b(ij)}, \quad i = 1,\ldots,I,\ j = 1,\ldots,J,\ a = 0,1,\ b = 0,1 \tag{2.1} \]

where the $\gamma_{ij}$ terms control the sample size to be n in each sub-table, the $\eta^{W}_{a(ij)}$ terms control the row marginal counts in each sub-table, and the $\eta^{Y}_{b(ij)}$ terms control the column marginal counts in each sub-table. Appropriate restrictions are made on the model parameters to ensure identifiability. Fitting this model to a data set represented like the one in Table 2 creates
predicted sub-table counts whose margins match those for the observed sub-tables. Furthermore, the predicted sub-table counts all have odds ratios of 1.

When the SPMI model is not appropriate, additional parameters measuring the association between W and Y are introduced into the model. These models include:

\[ \log(\mu_{ab(ij)}) = \gamma_{ij} + \eta^{W}_{a(ij)} + \eta^{Y}_{b(ij)} + \lambda_{ab}, \tag{2.2} \]

\[ \log(\mu_{ab(ij)}) = \gamma_{ij} + \eta^{W}_{a(ij)} + \eta^{Y}_{b(ij)} + \lambda_{ab} + \lambda^{Y}_{ab(j)}, \tag{2.3} \]

\[ \log(\mu_{ab(ij)}) = \gamma_{ij} + \eta^{W}_{a(ij)} + \eta^{Y}_{b(ij)} + \lambda_{ab} + \lambda^{W}_{ab(i)}, \tag{2.4} \]

\[ \log(\mu_{ab(ij)}) = \gamma_{ij} + \eta^{W}_{a(ij)} + \eta^{Y}_{b(ij)} + \lambda_{ab} + \lambda^{W}_{ab(i)} + \lambda^{Y}_{ab(j)}, \text{ and} \tag{2.5} \]

\[ \log(\mu_{ab(ij)}) = \gamma_{ij} + \eta^{W}_{a(ij)} + \eta^{Y}_{b(ij)} + \lambda_{ab} + \lambda^{W}_{ab(i)} + \lambda^{Y}_{ab(j)} + \lambda^{WY}_{ab(ij)}. \tag{2.6} \]
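As a small numerical illustration of model (2.1), the sketch below computes SPMI fitted counts for a single 2×2 sub-table and checks the two properties noted above: the fitted margins match the observed margins, and the fitted odds ratio is 1. It assumes the usual independence closed form, row total times column total divided by n, applied within a sub-table. The function names and the counts are hypothetical. Models (2.2)-(2.5) share parameters across sub-tables and generally need an iterative fit, which is not attempted here.

```python
def spmi_fit(sub):
    """Fitted counts for one 2x2 sub-table under independence.

    `sub` maps (a, b) to an observed count; the fitted count for cell
    (a, b) is (row total for a) * (column total for b) / n.
    """
    n = sum(sub.values())
    row = {a: sub[(a, 0)] + sub[(a, 1)] for a in (0, 1)}
    col = {b: sub[(0, b)] + sub[(1, b)] for b in (0, 1)}
    return {(a, b): row[a] * col[b] / n for a in (0, 1) for b in (0, 1)}

def odds_ratio(sub):
    """Odds ratio (cell 11 * cell 00) / (cell 10 * cell 01) of a sub-table."""
    return (sub[(1, 1)] * sub[(0, 0)]) / (sub[(1, 0)] * sub[(0, 1)])

# Hypothetical observed sub-table with n = 100 (not taken from Table 2).
sub = {(0, 0): 40, (0, 1): 10, (1, 0): 20, (1, 1): 30}
fit = spmi_fit(sub)

print("observed odds ratio:", odds_ratio(sub))
print("fitted counts:", fit)
print("fitted odds ratio:", odds_ratio(fit))
```

The gap between the observed odds ratio and 1 in each sub-table is what the additional association parameters in models (2.2)-(2.6) are meant to describe.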
Model (2.2) allows for homogeneous association across all the sub-tables. Thus, each sub-table’s odds ratio will be equal, but not necessarily equal to one. This model is appropriate when associations between the $W_i$ and $Y_j$ items are at approximately the same level. Model (2.3) effectively adds a Y main effect to the odds ratios. It allows the sub-tables’ odds ratios to vary across the $Y_j$ items, but within $Y_j$ they are constant. Thus, we call this the W-homogeneous association model. Model (2.4) is similar to model (2.3) except the roles of $Y_j$ and $W_i$ are reversed. These types of models are appropriate when only one of the MRCVs affects the association with all items of the other MRCV. Model (2.5) allows for main effects of both W and Y on the odds ratios, imposing that the differences between log odds ratios for any two levels of Y are constant across W and vice versa. Finally, model (2.6) is the saturated model. It puts no constraints on the odds ratios among the W and Y item combinations.

Items from the same MRCV often have the same type of dependency structure with items of the other MRCV due to the fact that item responses within each MRCV are typically correlated. Models (2.1)-(2.5) take advantage of this dependence structure in order to provide a set of models less complicated than a saturated model. The examples in Section 4 illustrate where this dependence structure occurs with actual data. Furthermore, note that if an individual loglinear
model is fit to each sub-table, the modelling choices for a given sub-table would only be between a model under independence and a saturated model. Thus, the models proposed here provide a greater range of options for non-saturated models for the data.

2.1 Fitting the models

There are a few different choices for fitting the models in (2.1)-(2.6). First, the model can be
incorporated within the generalized loglinear model framework described in Lang and Agresti (1994). These models are fit via maximum likelihood to a cross-classification of the multinomial counts from all possible $\mathbf{W}_s$ and $\mathbf{Y}_s$. Bilder and Loughin (2004) refer to this cross-classification as a joint table. With two MRCVs, the number of multinomial counts is $2^{I+J}$ and each of the corresponding multinomial probabilities needs to be estimated under a set of model constraints. When I and/or J are not small, this can result in a large number of parameters that need to be estimated. For example, there are $2^{4+3} = 128$ parameters that need to be estimated for the waste management data example. Joint tables can also be very sparse. For example, the waste management data has 101 of its 128 observed multinomial counts equal to 0. To further complicate matters, several different multinomial probability distributions can actually satisfy the model of interest. This is because the $\pi_{ab(ij)}$ are linear combinations of the multinomial probabilities. As suggested by Agresti and Liu (2001) and shown in Sections 4 and 5 here, these factors can cause parameter estimate convergence problems. When convergence is reached, the goodness-of-fit tests produced from the converged parameter estimates can be extremely conservative, as is also illustrated in Sections 4 and 5.

To avoid fitting a model to the sparse multinomial counts, the generalized loglinear model can be fit using a new marginal modelling approach. The model is fit directly to the data as displayed in Table 2. By doing this, the counts are temporarily treated as if they arose from a multinomial distribution, without regard to the fact that counts from different sub-tables are actually sums based on some of the same multinomial counts. The estimated expected
frequencies, $\hat{\mu}_{ab(ij)}$, from the proposed models here are found through solving the estimating equations $\mathbf{X}'\hat{\boldsymbol{\mu}} = \mathbf{X}'\mathbf{m}$, where $\hat{\boldsymbol{\mu}}$ and $\mathbf{m}$ are 4IJ×1 vectors of the corresponding $\hat{\mu}_{ab(ij)}$ and $m_{ab(ij)}$ quantities and $\mathbf{X}$ is a matrix of 0’s and 1’s relating the expected to the observed counts for a model in (2.1)-(2.6). Because the usual likelihood equations are assumed for the multinomial working model, fitting can be performed with software such as PROC GENMOD in SAS or the glm function in R. Programs are available at chrisbilder.com/bilder_loughin to fit these models.

The model fitting procedure here is similar to other procedures proposed for different problems. Rao and Scott (1984) fit a loglinear model to one contingency table that is the result of complex survey sampling, but initially treat the table counts as arising through random sampling. Similar to the present case, their model misstates the correlations among table counts. They call their parameter estimates “pseudo” maximum likelihood estimates since the true likelihood equations are not used. As will be shown, their statistical methodology can be adjusted to correspond to the problem of MRCV data arising from simple random sampling. Thus, our parameter estimates are still consistent because they are functions of $\hat{\pi}_{ab(ij)} = m_{ab(ij)}/n$, which is a consistent estimator for $\pi_{ab(ij)}$. Haber (1985, p. 2852-3) fits a loglinear model to a group of contingency tables with counts correlated among the tables using the same type of methodology as done here. However, Haber does not consider all possible sub-tables since he was not working with MRCV data.

2.2 Goodness-of-fit statistics

Because the individual counts from the item response table are not multinomial counts, the
usual loglinear model goodness-of-fit statistics (Pearson and likelihood ratio) do not have asymptotic chi-square distributions for the marginal modelling approach. Instead they are asymptotically distributed as linear combinations of independent $\chi^2_1$ random variables. Rao and Scott (1984) specifically discuss the asymptotic distributions for these types of goodness-of-fit statistics. In the particular case of testing for SPMI, the Pearson statistic is

\[ \sum_{a,b,i,j} \frac{\left( m_{ab(ij)} - \hat{\mu}_{ab(ij)} \right)^2}{\hat{\mu}_{ab(ij)}} \]

where $\hat{\mu}_{ab(ij)}$ results from fitting model (2.1). This statistic is the
same as the modified Pearson statistic derived in Bilder and Loughin (2004) because $\hat{\mu}_{ab(ij)} = m_{a\bullet(ij)}m_{\bullet b(ij)}/n$ and hence it has the same asymptotic distribution. First and second-order Rao-Scott (1984) adjustments to the Pearson test statistic and likelihood ratio test (LRT) statistic create new statistics whose asymptotic first and/or second moments are the same as a chi-square random variable. Bilder and Loughin (2004) find that the first-order adjusted Pearson statistic does not hold the correct size for the SPMI test when there is strong pairwise association among the items of the same MRCV. They also find the second-order adjusted Pearson statistic performs satisfactorily most of the time, but not quite as well as their bootstrap procedures.

Because of these findings, we propose a new bootstrap procedure to estimate the sampling distribution of the goodness-of-fit statistics. The resampling involves generating new correlated binary vectors of data under a null hypothesis model using the data generation algorithm of Gange (1995). Suppose $M_0$ is the null model and that it is nested within $M_1$, the alternative model. Let $X^2_M$ be a Pearson or likelihood ratio goodness-of-fit statistic comparing the two models. For example,

\[ X^2_M = \sum_{a,b,i,j} \frac{\left( \hat{\mu}^{(1)}_{ab(ij)} - \hat{\mu}^{(0)}_{ab(ij)} \right)^2}{\hat{\mu}^{(0)}_{ab(ij)}} \]

for the Pearson statistic, where $\hat{\mu}^{(0)}_{ab(ij)}$ and $\hat{\mu}^{(1)}_{ab(ij)}$ denote predicted counts for the $M_0$ and $M_1$ models, respectively. The bootstrap estimate of the sampling distribution for $X^2_M$ under $M_0$ requires generating new vectors of data, say, $\mathbf{W}_s^*$ and $\mathbf{Y}_s^*$, that satisfy the conditions imposed by $M_0$. Simply resampling the observed $\mathbf{W}_s$ and $\mathbf{Y}_s$ together does not guarantee this. Instead, the Gange algorithm provides the precise means to produce a resample under $M_0$. Gange (1995) specifies the configurations, as described in Section 3.1.3 of Bishop et al. (1975), of constraints on the marginal associations between categorical variables in a contingency table. Using these constraints, the iterative proportional fitting method can then be used to find a multinomial distribution which satisfies the constraints. For the problem under investigation here, the predicted counts from $M_0$ provide the two-way configurations (predicted sub-table counts)
between items for the different MRCVs. Using these model predicted configurations with the observed two-way configurations between items within the same MRCV provides for a natural way for the Gange algorithm to estimate the multinomial probabilities under $M_0$. The $\mathbf{W}_s^*$ and $\mathbf{Y}_s^*$ for s = 1, …, n are then found through standard methods of generating observations from a multinomial distribution (see Gange (1995) for a description) which satisfies $M_0$.

In summary, the following bootstrap procedure is used to estimate the distribution of $X^2_M$:

1. Find $\hat{\boldsymbol{\mu}}^{(0)}$ and $\hat{\boldsymbol{\mu}}^{(1)}$ through solving the estimating equations for the $M_0$ and $M_1$ models, respectively, and calculate $X^2_M$.
2. Find the observed 2×2 tables for each ($W_i$, $W_{i'}$) (i