Using Mixed-Model Item Response Theory to Analyze Organizational Survey Responses: An Illustration Using the Job Descriptive Index

Organizational Research Methods 14(1) 116-146
© The Author(s) 2011
Reprints and permission: sagepub.com/journalsPermissions.nav
DOI: 10.1177/1094428110363309
http://orm.sagepub.com

Nathan T. Carter¹, Dev K. Dalal¹, Christopher J. Lake¹, Bing C. Lin¹,², and Michael J. Zickar¹

¹ Bowling Green State University, OH, USA
² Department of Psychology, College of Arts and Sciences, Portland State University, OR, USA

Corresponding Author: Nathan T. Carter, Department of Psychology, Bowling Green State University, 214 Psychology Building, Bowling Green, OH 43402, USA. Email: [email protected]

Abstract

In this article, the authors illustrate the use of mixed-model item response theory (MM-IRT) and explain its usefulness for analyzing organizational surveys. The authors begin by giving an overview of MM-IRT, focusing on both technical aspects and previous organizational applications. Guidance is provided on how researchers can use MM-IRT to check scoring assumptions, identify the influence of systematic responding that is unrelated to item content (i.e., response sets), and evaluate individual and group difference variables as predictors of class membership. After summarizing the current body of research using MM-IRT to address problems relevant to organizational researchers, the authors present an illustration of the use of MM-IRT with the Job Descriptive Index (JDI), focusing on the use of the "?" response option. Three classes emerged: one most likely to respond in the positive direction, one most likely to respond in the negative direction, and another most likely to use the "?" response. Trust in management, job tenure, age, race, and sex were considered as correlates of class membership. Results are discussed in terms of the applicability of MM-IRT and future research endeavors.

Keywords: item response theory, latent class analysis, invariance testing

Item response theory (IRT) models have played a large role in organizational researchers' understanding of measures of a variety of domains, including job attitudes (e.g., Collins, Raju, & Edwards, 2000; Donovan, Drasgow, & Probst, 2000; Wang & Russell, 2005), personality


(e.g., Chernyshenko, Stark, Chan, Drasgow, & Williams, 2001; Zickar, 2001; Zickar & Robie, 1999), general mental ability (e.g., Chan, Drasgow, & Sawin, 1999), performance ratings (e.g., Maurer, Raju, & Collins, 1998), vocational interests (e.g., Tay, Drasgow, Rounds, & Williams, 2009), and employee opinions (e.g., Ryan, Horvath, Ployhart, Schmitt, & Slade, 2000). Recent extensions of IRT models provide increased flexibility in the questions researchers and practitioners may ask about the response–trait relationship in organizational surveys.

In this article, we explain how organizational survey data can be analyzed using mixed- (or mixture-) model IRT (MM-IRT). MM-IRT combines features of traditional IRT models (e.g., the partial credit model [PCM]) and latent class analysis (LCA; see Lazarsfeld & Henry, 1968) and identifies subgroups of respondents for whom the item response–latent variable relationships indicated by the item response functions (IRFs) are considerably different. Here, we explain the technical features of MM-IRT and discuss some of its potential uses, including checking scoring assumptions, identifying systematic responding, and evaluating potential correlates of class membership. We then illustrate the use of MM-IRT by applying the framework to answer some fundamental questions about the measurement properties of the Job Descriptive Index (JDI; Balzer et al., 1997; Smith, Kendall, & Hulin, 1969), a commercially available survey that measures facets of job satisfaction with five separate scales assessing persons' satisfaction with their work, coworkers, supervision, pay, and opportunities for promotion. Our discussion and results are presented in concert with Table 1, which outlines, from beginning to end, the process a researcher would use when conducting an MM-IRT analysis. We first show how these steps are accomplished in general and then how we address each of them in our analysis of the JDI.

The JDI was chosen for this illustration for several reasons. First, it has consistently been found to be one of the most frequently used measures of job satisfaction (see Bowling, Hendricks, & Wagner, 2008; Connolly & Viswesvaran, 2000; Cooper-Hakim & Viswesvaran, 2005; Judge, Heller, & Mount, 2002). In fact, the JDI Office at Bowling Green State University continues to acquire around 150 data-sharing agreements per year (J. Z. Gillespie, personal communication, October 28, 2009). In addition, the JDI's "Yes," "No," "?" response format has been examined with conventional IRT analyses (e.g., Hanisch, 1992), and measures using similar response scales (i.e., "Yes," "No," "?") have been investigated in past MM-IRT research (e.g., Hernandez, Drasgow, & Gonzalez-Roma, 2004). In sum, the JDI provides a well-known measure with properties that have been of interest in past and current research and application, making it an accessible and substantively interesting vehicle for the illustrative MM-IRT analyses.

An Introduction to Mixed-Model IRT

IRT is essentially a collection of formulated models that attempt to describe the relationship between observed item responses and latent variables (e.g., attitudes, personality, and interests). The IRFs of IRT are logistic regressions that use the conditional probabilities of responding to an item in a particular way (e.g., strongly agree) to find an appropriate transformation of the sum score to represent the underlying, or latent, variable. Additionally, the models parameterize different properties of items (depending on the model) on a scale common to the estimates of persons' standing on the latent variable. Although any number of item properties could be included, IRT models are generally concerned with the location and discrimination of items. An item's location reflects its difficulty (in an ability context) or extremity (in attitude or personality contexts). In situations where items have more than two options, the item's location is often quantified by threshold parameters. Thresholds represent the point on the latent variable continuum at which the probability of responding with one option becomes greater than that of choosing another; thus, there will be one fewer threshold parameter than there are options. Generally, the average of these thresholds can be considered an estimate of the item's location. An item's discrimination is a reflection of


its sensitivity to variability in the latent variable. Discrimination parameters are typically quantified as the slope of the regression line at the point of the item's location on the trait continuum.

IRT models are advantageous for several reasons. Most pertinent to this article is that they place persons and items on a common scale, providing measurement researchers an appropriate framework for determining whether group differences in observed sum scores are likely to be due to differences on the latent variable or to other, extraneous factors, such as membership in a protected class, response sets, or individual differences other than what the researcher is attempting to measure.¹

Conventional IRT models assume that item responses are drawn from one homogeneous subpopulation. This implies that one set of IRFs can be used to describe the relationship between item responses and the latent trait. However, it is plausible that there are subgroups of respondents with different response–trait relationships; in other words, more than one set of item parameters may be needed to model item responding. In fact, it has been noted that such heterogeneity can be expected in situations where the studied population is complex (von Davier, Rost, & Carstensen, 2007), as organizational researchers are likely to encounter. Typically, researchers examine the viability of the homogeneous-subpopulation assumption by conducting differential item functioning (DIF) analyses based on important manifest group variables. For organizational researchers, these groups are typically based on legally, ethically, or practically important manifest variables such as respondent sex and race. However, it is possible that differences in the response–trait relationship are better described by latent groupings that may or may not correspond to the previously mentioned observed variables. These unobserved groups may exist for a variety of reasons, including differential endorsement strategies and comparison processes that may result from different sociocultural experiences (Rost, 1997). MM-IRT identifies unobservable groups of respondents with different response–trait relationships, providing in effect an exploratory method of DIF detection wherein subgroups are identified a posteriori (Mislevy & Huang, 2007).

In the following sections, we provide a general definition of MM-IRT and discuss the estimation of item and person parameters under the Rasch family of models. We focus here on Rasch-based IRT models because these are the models available in the WINMIRA program (von Davier, 2001), a user-friendly software program that researchers can use to conduct such analyses without the more intensive programming involved in estimating other mixed IRT models.
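To make the location and discrimination concepts above concrete, the following minimal Python sketch (ours, not part of the original analyses) evaluates a two-parameter logistic IRF; the parameter values are hypothetical, and the illustration below instead uses the Rasch-based PCM, which fixes discrimination to be equal across items.

import numpy as np

def irf_2pl(theta, a, b):
    # Probability of endorsing a dichotomous item at trait level theta,
    # with discrimination a (slope) and location b (difficulty/extremity).
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)
print(irf_2pl(theta, a=2.0, b=1.0))   # steep (highly discriminating), extreme item
print(irf_2pl(theta, a=0.5, b=0.0))   # flat (weakly discriminating), moderate item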

MM-IRT Model Definition

As noted above, MM-IRT is a hybrid of IRT and LCA that uncovers unobservable groups whose item parameters, and therein IRFs, differ substantially. The most general form of the LCA model can be written:

$$P(\mathbf{u}) = \sum_{g=1}^{G} \pi_g \, P(\mathbf{u} \mid g), \qquad (1)$$

where $P(\mathbf{u})$ denotes the probability of a vector of observed responses, $\mathbf{u} = x_1, x_2, \ldots, x_i$. The term $P(\mathbf{u} \mid g)$ denotes the probability of the response vector within an unobservable group, $g$, of size $\pi_g$. The $\pi$ parameter, also called the mixing proportions parameter, represents the proportion of respondents belonging to the $g$th group and carries the assumption:

$$\sum_{g=1}^{G} \pi_g = 1, \qquad (2)$$

that is, the proportion parameter estimates must sum to 1.


The conditional probability of observed responses within a group, $P(\mathbf{u} \mid g)$, can be replaced by any number of conventional dichotomous or polytomous IRT models, assuming that the basic form of the model holds across groups with a different set of values for the IRT model's parameters. Here, we focus on the use of the PCM (Masters, 1982), which is stated:

$$P(U_{ij} = h) = \frac{\exp(h\theta_j - \sigma_{ih})}{\sum_{s=0}^{M} \exp(s\theta_j - \sigma_{is})}, \quad \text{for } h = 0, 1, 2, \ldots, M, \qquad (3)$$

or that the probability of person $j$ responding $h$ to an item, $i$, is determined by the distance between person $j$'s standing on the latent trait, $\theta$, and the sum of the $h$ option thresholds for the $s = h + 1$ possible observed response options:

$$\sigma_{ih} = \sum_{s=1}^{h} \tau_{is}, \quad \text{with } \sigma_{i0} = 0. \qquad (4)$$

The option location parameter, $\tau_{is}$, is the location of threshold $s$ on the scale of $\theta$ for the $i$th item. This IRT model assumes that there is one latent population of respondents who use the same response process (i.e., who use scales similarly). Note that the $s$ term here is simply a counting variable for purposes of summation that corresponds to the levels of $h$. The PCM is a polytomous Rasch model and therefore assumes equal discriminations across items, whereas other polytomous IRT models (e.g., the graded response model) do not. Although Rasch models often fit worse than models that allow item discriminations to vary (see Maij-de Meij, Kelderman, & van der Flier, 2008), the WINMIRA program uses only Rasch-based measurement models. Estimation is much easier using the simple Rasch models because of the exclusion of multiplicative parameter terms (Rost, 1997). Although we hope that commercially available user-friendly software will someday incorporate more complex models, we proceeded cautiously with the Rasch-based PCM, paying careful attention to item-level fit to test whether the model was tenable for our data. Other researchers using attitudinal (Eid & Rauber, 2000) and personality data (Hernandez et al., 2004; Zickar, Gibby, & Robie, 2004) have found acceptable fit using similar models. Although we focus on the Rasch models available in WINMIRA, it should be noted that our discussion of MM-IRT generalizes easily to IRT models that allow discrimination to vary. However, doing so requires more extensive programming experience with a more complex program such as LatentGOLD 4.5 (Vermunt & Magidson, 2005).

The mixed PCM (MPCM) is obtained by substituting the IRF (Equation 3) in place of the $P(\mathbf{u} \mid g)$ term² in Equation 1, or

$$P(U_{ij} = h) = \sum_{g=1}^{G} \pi_g \, \frac{\exp(h\theta_{jg} - \sigma_{ihg})}{\sum_{s=0}^{M} \exp(s\theta_{jg} - \sigma_{isg})}, \quad \text{with } \sigma_{ihg} = \sum_{s=1}^{h} \tau_{isg}. \qquad (5)$$
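A minimal Python sketch of Equations 3 through 5 may help fix ideas; the threshold values and mixing proportions below are hypothetical, not estimates from the JDI data.

import numpy as np

def pcm_probs(theta, tau):
    # Equation 3: PCM category probabilities for one item, where
    # sigma_ih is the cumulative sum of thresholds (Equation 4).
    sigma = np.concatenate(([0.0], np.cumsum(tau)))   # sigma_i0 = 0
    h = np.arange(len(sigma))                          # h = 0, ..., M
    p = np.exp(h * theta - sigma)
    return p / p.sum()

def mpcm_probs(theta_g, taus_g, pi_g):
    # Equation 5: class-weighted mixture of class-specific PCMs.
    return sum(pi * pcm_probs(theta, tau)
               for pi, theta, tau in zip(pi_g, theta_g, taus_g))

# One hypothetical 3-option item ("No" = 0, "?" = 1, "Yes" = 2) in two classes.
taus_g = [np.array([-1.0, 1.0]),   # class 1: ordered thresholds
          np.array([0.8, -0.8])]   # class 2: disordered; "?" is never modal
print(mpcm_probs(theta_g=[0.3, 0.3], taus_g=taus_g, pi_g=[0.6, 0.4]))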

According to this model, each person and item have as many sets of relevant parameters as there are groups; in a 3-class solution, each person will have three estimates of θ, and each item will have three times the number of item parameters in the single-group version of the PCM. The item and group-proportion parameters are estimated simultaneously using an extended expectation-maximization (EM) algorithm, with conditional maximum likelihood in the maximization step. Because the sum score can be considered a sufficient statistic for θ in Rasch models, the trait estimate is not involved in the EM procedure. The θ estimates are obtained using the item and group-proportion parameters established in the EM procedure: the observed total score is set equal to the right-hand side of the model in Equation 5, and the equation is solved for θ iteratively. This is an important point, because it means that θ for person j in one group will not differ greatly from that person's estimate in another group; estimation of θ is based on the observed sum score, which is the same for a given person, j, across groups (Rost, 1997).

The researcher using MM-IRT need only specify the number of subpopulations believed to underlie what would typically be considered data coming from one homogeneous subpopulation. First, a 1-class model is fit, then a 2-class model, a 3-class model, and so on. Once an increase in the number of classes no longer shows an increase in model-data fit, the models are compared and the best fitting is retained for further analysis. By identifying the latent class structure, and thereby unmixing groups with different response–trait relationships nested within the 1-class model, the researcher can ask both practically and theoretically interesting questions. In the following section, we review some past uses of MM-IRT in organizational research before presenting our illustration using the JDI.
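Because the sum score is sufficient for θ under the Rasch-based PCM, the iterative solution just described can be sketched with simple one-dimensional root finding. The snippet below is our illustration of that logic under hypothetical class-specific thresholds; WINMIRA's exact routine may differ, and bisection is used here only because the expected score is strictly increasing in θ.

import numpy as np

def expected_score(theta, item_taus):
    # Expected total score under the PCM given one class's item thresholds.
    total = 0.0
    for tau in item_taus:
        sigma = np.concatenate(([0.0], np.cumsum(tau)))
        h = np.arange(len(sigma))
        p = np.exp(h * theta - sigma)
        p /= p.sum()
        total += (h * p).sum()
    return total

def theta_from_sum(score, item_taus, lo=-6.0, hi=6.0, tol=1e-6):
    # Solve E[score | theta] = observed score; valid for sums strictly
    # between 0 and the maximum possible score.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if expected_score(mid, item_taus) < score else (lo, mid)
    return (lo + hi) / 2.0

# Hypothetical thresholds for three 3-option items in one latent class.
item_taus = [np.array([-1.0, 0.5]), np.array([0.0, 1.0]), np.array([-0.5, 0.2])]
print(theta_from_sum(4, item_taus))   # theta consistent with an observed sum of 4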

Applications of MM-IRT in Organizational Research

In the first application of MM-IRT to organizational research, Eid and Rauber (2000) found two distinct response styles in a measure of satisfaction with organizational leadership: one class that made use of the whole response scale and one that preferred extreme responses. These authors found that length of service and level within the organization explained group membership, suggesting that use of the entire response scale may be too complex or too time-consuming for some respondents and that certain employees may be more prone to use global judgments such as "good" and "bad." Additionally, a larger percentage of females than males belonged to the second class.

Zickar et al. (2004) examined faking in two personality inventories. In their first analysis, Zickar and colleagues applied MM-IRT to an organizational personality inventory (Personal Preferences Inventory; Personnel Decisions International, 1997) with subscales mapping onto the Big 5 and found three classes of respondents for each of the five dimensions, with the exception of the Neuroticism subscale, for which four classes were extracted. In general, results suggested three extents of faking: no faking, slight faking, and extreme faking. In their second analysis, the authors applied MM-IRT to a personality inventory (Assessment of Background and Life Events; White, Nord, Mael, & Young, 1993) in a sample of military personnel. Prior to survey administration, the authors placed respondents in honest, ad lib faking, and trained faking conditions. Results suggested two response classes for each of these conditions: a faking class and an honest class. Interestingly, 7.2% to 22.9% of respondents in the honest condition were placed in the faking class, and 27.2% to 41.6% of participants in the coached faking condition were placed in the honest class, reflecting the inherent complexity and high variability of personality test faking.

Finally, in a study of measures of extraversion and neuroticism in the Amsterdam Biographical Questionnaire (ABQ; Wilde, 1970), Maij-de Meij, Kelderman, and van der Flier (2008) found that a 3-class solution best fit the data for each scale. The classes were differentiated by the probability of using each of the "Yes," "No," and "?" responses. Participants also completed a measure of social desirability as part of the ABQ, and scores on social desirability and ethnic background had significant effects on class membership. Results from this study suggest that personality measure scores from ethnically diverse and high-stakes contexts must be interpreted with caution, as there were strong effects of ethnicity and social desirability on response class membership.

Although the analyses discussed above provide valuable insight into the use of MM-IRT in organizational research, a straightforward illustrative application of MM-IRT involving attitudinal surveys is not available in the current literature. Hernandez et al. (2004) provided a thorough investigation of personality questionnaires, and Eid and Rauber (2000) investigated the use of a satisfaction questionnaire. However, their discussions considered the case of a 2-class solution in regard to predicting class membership, which has recently been noted as possibly too restrictive (Maij-de Meij


et al., 2008), and use of solutions with more than two classes requires different analytic techniques and considerations. Additionally, some unique issues related to the use of Rasch models have not seen coverage in the literature on MM-IRT. In this illustration, we address these as-yet unattended issues, providing a fairly comprehensive illustration of the use of the method in organizational research.

Background to the Illustration: The JDI and Similar Scales

Before presenting our own illustration of MM-IRT analysis using the JDI, we begin by briefly reviewing the background of the measure and its typical scoring assumptions to inform the interpretation of results. In addition, we review MM-IRT research analyzing measures that use a "Yes," "No," and "?" response format. We then proceed with our analyses, discussing our use of the technique to (a) identify and assess the fit of the appropriate model, (b) qualify the latent classes, (c) check scoring assumptions, (d) examine the possibility of systematic response styles, and (e) examine relevant individual difference and group membership variables as reasons for the latent class structure.

Step 1: Background Review of the JDI and Similar Scales

Originally, Smith et al. (1969) scored the items in the five JDI scales as follows: "Yes" = 3, "?" = 2, and "No" = 1. Taking data from 236 persons responding to the Work scale, they split the sample into "satisfied" and "dissatisfied" groups and found that dissatisfied persons tended to give significantly more "?" responses than those in the satisfied group. This led them to recommend the currently used asymmetric scoring scheme of 3 (Yes), 1 (?), and 0 (No), because the data suggested the "?" response was more likely to be associated with dissatisfied than with satisfied persons (Balzer et al., 1997).

Hanisch (1992) evaluated the viability of the Smith et al. (1969) scoring procedure using Bock's (1972) nominal IRT model. In observing option response functions (ORFs), it was clear that the intervals between options were not equidistant: the "Yes" option was well above "?" and "No" on the trait continuum, and those moderately low in θ were more likely to endorse the "?" response, thus supporting the original scoring by Smith et al. One limitation of the nominal IRT model used by Hanisch is that it assumes all respondents use the same type of response process when answering items. It may be possible, however, that even though a majority of individuals interpret the "?" response option as a neutral response, others interpret this ambiguous option in different ways. MM-IRT allows us to further probe how different groups of respondents interpret and use the "?" option.

Expectations concerning class structure. Based on our review of measures using response anchors similar to the JDI's, we were particularly concerned with the use of the "?" option. Although research concerning the "?" response of the JDI has been mostly supportive of the common conceptualization of "?" as falling between the "Yes" and "No" responses (e.g., Hanisch, 1992; Smith et al., 1969), researchers have shown less confidence in this assumption in analyses of other scales (e.g., Bock & Jones, 1968; Dubois & Burns, 1975; Goldberg, 1971; Hernandez et al., 2004; Kaplan, 1972; Worthy, 1969). To date, the research using MM-IRT to investigate the use of "?" has agreed with the latter group of researchers (see Hernandez et al., 2004; Maij-de Meij et al., 2008; Smit, Kelderman, & van der Flier, 2003), finding that the vast majority of respondents avoid the "?" response. Thus, we hypothesize (Hypothesis 1) that a class will be identified that has a higher probability of using the "?" and that the remaining classes will avoid the "?" response (Hernandez et al., 2004; Maij-de Meij et al., 2008; Smit et al., 2003).
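The asymmetric scoring scheme reviewed at the start of this step is easy to state in code; the short sketch below is ours and, for simplicity, ignores the reverse-keyed JDI items.

JDI_WEIGHTS = {"Yes": 3, "?": 1, "No": 0}   # Smith et al. (1969) asymmetric scheme

def score_scale(responses):
    # Sum the asymmetric option weights across a scale's items.
    return sum(JDI_WEIGHTS[r] for r in responses)

print(score_scale(["Yes", "?", "No", "Yes"]))   # 3 + 1 + 0 + 3 = 7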


Eid and Zickar (2007) noted that latent classes can also uncover groups of "yea-sayers" and "nay-sayers," respondents who use one end of the scale regardless of item content (Hernandez et al., 2004; Maij-de Meij et al., 2008; Reise & Gomel, 1995; Smit et al., 2003). Thus, we expected that the persons who avoid the "?" response would constitute extreme responders, manifested either in one class using only the extremes or in two classes, each preferring one of the two extremes (i.e., "Yes" or "No") over the other options. Because there was some inconsistency in past research concerning the division of extreme respondents, we do not formulate formal hypotheses concerning the expected number of classes.

Expectations concerning systematic responding. Concerning consistency of classification across scales, we found only one study, by Hernandez et al. (2004), that directly addressed this issue. These researchers found only moderate consistency in class membership across the 16PF, as indicated by the average size of the phi correlations between class memberships across scales (M = .39, SD = .04), most of which was due to the largest class. That is, even if a respondent belonged to the class that used the "?" frequently on one scale, the same person was not necessarily placed in the corresponding class for the other scales. We postulated (Hypothesis 2) that respondents would be classified into a particular class across the five JDI scales with moderate consistency. Support for this hypothesis would suggest that although latent class membership is to some extent determined by a specific response style, response style is by no means the only possible consideration.

Checking scoring assumptions. Although there is some research using MM-IRT to evaluate the scoring assumptions of scales using the "?" option, we believe this literature is not so established that its findings can be generalized across scales. In this investigation, we use the MPCM to examine the viability of the assumption that "?" falls between the "Yes" and "No" response options, and therein whether summing across options is an appropriate scoring algorithm. The Rasch models are especially useful for examining the viability of this assumption due to their additivity property (Rost, 1991). This property implies that if the total score does not hold as a meaningful additive trait representation, one potential consequence is the disordering of item threshold estimates (Andrich, 2005; Rost, 1991). Take the example of measuring perceived length and relating it to the physical length of rods: if additivity holds, a 10-foot rod will be judged as longer than a 5-foot rod; if additivity does not hold, there will be a considerable number of persons judging the 5-foot rod as longer than the 10-foot rod.

Correlates of class membership. In addition to the above hypotheses, we were also interested in identifying variables that could explain respondents' classification into the subpopulation that does not avoid the "?" response. Hernandez et al. (2004) noted that "Other factors have been suggested (Cruickshank, 1984; Dubois & Burns, 1975; Worthy, 1969) . . ." for respondents' choice of "?" other than its representing a point on the trait continuum between "Yes" and "No," observing that it is possible that respondents " . . . (a) have a specific response style, (b) do not understand the statement, (c) do not feel competent enough or sufficiently informed to take a position, or (d) do not want to reveal their personal feelings about the question asked" (p. 688). Although Hypothesis 2 above addresses whether response styles are the reason for classification, the remaining points should also be addressed. We identified variables in the available data set to address points (c) and (d): job tenure (JT) and trust in management (TIM), respectively. Unfortunately, as will be discussed later, we did not have data available to evaluate point (b), that comprehension drives the decision to use or not use the "?" response.

JT was used as an approximation of the idea of Hernandez and colleagues (2004) that those using the "?" response " . . . do not feel competent enough or sufficiently informed to take a position" (p. 688). Although knowledge concerning the job varies among new employees, knowledge regarding its domains can be expected to increase over an employee's tenure (Ostroff & Kozlowski, 1992). JT can thus be expected to affect the extent to which an employee feels informed regarding the five dimensions of the JDI, which map onto the four aspects of organizational characteristics that employees must learn (Work: job-related tasks; Supervision and Coworkers: group processes and work roles; Pay and Promotion: organizational attributes).

Table 1. Common Steps for Conducting Mixed-Model Item Response Theory (MM-IRT) Analyses and Explaining Class Membership

Step 1: Background/Review
  Description: Conduct a review of the measure being studied and of measures using similar response scales; formulate hypotheses and expectations concerning class structure and correlates of class membership.
  Useful tools/statistics: Typical search engines such as PsycINFO.

Step 2: Assess Relative Fit/Model Choice
  Description: Determine the appropriate number of classes: fit first the 1-class model, then the 2-class model, and so on, comparing their fit; may need to defer to absolute fit when relative differences are small.
  Useful tools/statistics: Information theory statistics such as the CAIC; bootstrapped Pearson's χ².
  Key citations: Bozdogan (1987); WINMIRA user manual (von Davier, 2001).

Step 3: Assess Absolute Fit
  Description: Determine whether the model fits the data well without reference to other models and whether there is sufficient item-level fit.
  Useful tools/statistics: Item-level Q index for item fit; bootstrapped Pearson's χ²; model parameter values.
  Key citations: Rost & von Davier (1994); WINMIRA user manual (von Davier, 2001).

Step 4: Name the Classes
  Description: Name the latent classes (LCs) in a way that is behaviorally meaningful according to response behavior.
  Useful tools/statistics: Category probability histograms.
  Key citations: Eid & Zickar (2007).

Steps 5/6: Check Scoring Assumptions
  Description: Determine that thresholds are ordered appropriately; if not, determine whether freeing discrimination parameters alleviates disordering and whether categories are appropriately ordered; ensure that observed scores and trait estimates are commensurate via correlation coefficients.
  Useful tools/statistics: Item threshold parameter plots; option response function plots; alternate Bilog, Multilog, or Parscale parameterizations that free discrimination parameters across items.
  Key citations: Rost (1991); Andrich (2005); Multilog user manual (Thissen, Chen, & Bock, 2003); Thissen & Steinberg (1986) for the PCM.

Steps 5/6: Evaluate the Influence of Response Sets
  Description: Determine whether persons are consistently classified into similar latent groups across scales (only applicable to multiple-scale measures).
  Useful tools/statistics: Phi correlations for the 2-class case; contingency statistics, for example, chi-square and related effect sizes and Cohen's kappa.
  Key citations: Hernandez, Drasgow, & Gonzalez-Roma (2004) for the 2-class case; this article for more than two classes.

Step 7: Assess Correlates of Class Membership
  Description: Using past research and organizational initiatives/concerns, identify variables that may be useful for explaining class membership; conduct analyses to determine whether these variables explain class membership.
  Useful tools/statistics: Logistic regression (bi- or multinomial); covariate integration; descriptive statistics across classes.
  Key citations: Hernandez et al. (2004) for the 2-class case; this article for more than two classes; Kutner, Nachtsheim, Neter, & Li (2005) for logistic regression; Maij-de Meij, Kelderman, & van der Flier (2008) for covariate integration.

Note: PCM = partial credit model.

TIM was used to approximate the idea of Hernandez et al. (2004) that the "?" response may indicate a lack of willingness to divulge feelings. In work settings, it has long been thought that trust moderates the level of honest and accurate upward communication that takes place (see Jablin, 1979; Mellinger, 1956). Evidence of the link between trust and open communication comes from recent empirical studies. First, Detert and Burris (2007) found that employee voice, the act of openly giving feedback to superiors, was related to a feeling of safety and trust. Levin, Whitener, and Cross (2006) found that subordinates' perceived trust in supervisors related to levels of subordinate–supervisor communication. Finally, Abrams, Cross, Lesser, and Levin (2003), in a series of employee interviews, found that trust facilitated the sharing of knowledge with others. In sum, giving honest, open feedback about perceived shortcomings of an organization is likely to be seen as risky (Rousseau, Sitkin, Burt, & Camerer, 1998). Based on this link, we expect those with low TIM to use the "?" response more than those with high TIM, as it allows responding in a nonrisky way to sensitive scale items.

We also examined sex and race, explanatory variables that have been suggested as important considerations in MM-IRT studies (Hernandez et al., 2004). Past research has found that sex is not a significant predictor of class membership in scales with the "Yes"–"No"–"?" response scheme (e.g., Hernandez et al., 2004), whereas ethnic background has been found to have a significant effect on class membership in one study (see Maij-de Meij et al., 2008). Here, we consider minority status (i.e., Caucasian vs. non-Caucasian), a typical concern among researchers investigating bias in organizational contexts in the United States. No research we are aware of has examined the influence of age on class membership. These variables were investigated in an exploratory fashion; we do not postulate formal hypotheses.

Illustrative Analyses

Data Set

The data were obtained from the JDI offices at Bowling Green State University and included responses from 1,669 respondents to the five facet scales: Pay (9 items), Work (18 items), Opportunities for Promotion (9 items), Supervision (18 items), and Coworkers (18 items). Sample sizes for each scale after deletion of respondents showing no response variance are included in Table 2. The sample consisted mostly of full-time workers (87.8%) and had roughly equal numbers of supervisors (42.7%) and nonsupervisors. The mean age was 44, and 45.2% were female. Race demographics were 77% Caucasian/White, 17% Black/African American, 2% Hispanic/Latino, 1% Asian, and 0.5% Native American; the remainder used the "Other" category. From the spring to the summer of 1996, data were collected approximately uniformly from the north, midwest, west, and south of the United States as part of norming efforts for the JDI (Balzer et al., 1997).

JT was measured by asking respondents, "How many years have you worked on your current job?" Responses ranged from 0 to 50 (M = 9.84, SD = 9.2) and were positively skewed. We transformed JT by the method of Hart and Hart (2002) to approach normality. TIM was measured with the JDI TIM scale, a 24-item scale with a format similar to the JDI's, using the same response options (M = 35.34, SD = 23.11, α = .96 in this sample). For more information on these data and the TIM measure, see Balzer et al. (1997).

Results

Step 2: Assessing relative fit. The process of identifying the number of latent classes involves a sequential search. First, the single-group PCM is estimated by setting g to 1, that is, 100% of respondents; this model represents the case in which it is assumed that all items and persons can be represented by one set of parameters, and it reduces the model in Equation 5 to Equation 3.

Table 2. Mixed-Model Item Response Theory (MM-IRT) Model Fit Statistics for Job Descriptive Index (JDI) Scales by the Number of Latent Classes in the Model

                         Number of classes in model (CAIC fit statistic)
JDI Scale       N        1           2           3            4           5
Work          1,563   34,549.18   33,927.93   33,529.71*   33,505.90   33,621.13
Coworker      1,529   41,415.55   39,326.18   39,104.40*   39,102.27   39,112.48
Supervisor    1,562   42,472.62   41,020.70   40,745.23*   40,770.56       —
Promotions    1,586   20,406.90   19,532.70   19,026.70*   18,868.63   18,968.41
Pay           1,564   22,081.51   20,133.49   19,556.70*   19,687.08       —

CAIC = Consistent Akaike's Information Criterion. Asterisked CAIC statistics denote the most appropriate latent class solution; dashes indicate solutions for which no CAIC value was reported.

The number of classes is then specified as one greater, until a decrease in fit is observed. The most appropriate model is chosen with reference to one of several relative fit statistics. Here, we use the Consistent Akaike's Information Criterion (CAIC; Bozdogan, 1987). CAIC is considered useful because it penalizes less parsimonious models based on the number of person and item parameters estimated:

$$\mathrm{CAIC} = -2\ln(L) + p\,[\ln(N) + 1], \qquad (6)$$

where −2 ln(L) is −2 times the log-likelihood function (common to conventional IRT model fit assessment) taken from the maximization step of the EM procedure. This statistic is increased by a penalty for the number of parameters estimated, p, multiplied by ln(N) + 1, the log of the sample size with an additive constant of 1. The related Akaike's Information Criterion (AIC; Akaike, 1973) corrects for neither of these, and the Bayesian Information Criterion (BIC; Schwarz, 1978) corrects only for p. CAIC is used here because it promotes parsimony more than the alternatives.

According to CAIC, all JDI subscales fit a 3-class model well relative to the other models estimated (see Table 2). Initially, the 4- and 5-class models for the Promotions scale would not converge within 9,999 iterations to meet the accuracy criterion of .0005 (the default of the WINMIRA program). We solved this problem by trying several different starting values to avoid local maxima. Nonconvergence is not uncommon and is often due to low sample size, as this can lead to near-zero within-class response frequencies; such a finding should spur the researcher to consider the bootstrapped fit statistics in addition to model choice statistics (M. von Davier, personal communication, October 23, 2009).

For the Work, Coworker, and Promotions scales, the 4-class model showed slightly better fit than the 3-class model. As noted earlier, model choice in MM-IRT analyses can become difficult when incremental increases in fit are small, as is the case here (see Table 2). It is important to understand that the probability of choosing an overparameterized model increases sharply as the difference between information criteria and the sample size become smaller, and that model choice should be largely motivated by parsimony (Bozdogan, 1987). In fact, Smit et al. (2003) constrained their number of groups to 2, underscoring the importance of this issue. However, it has been suggested more recently that this approach is likely too restrictive (Maij-de Meij et al., 2008). Following the strategy of Hernandez et al. (2004), we analyzed both solutions further to determine which was more appropriate by examining the absolute (as opposed to relative) fit of the competing models via item-fit and bootstrapped-fit statistics and by checking for unreasonable parameter estimates.
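The information criteria used for model choice in this step can be sketched as follows; the formulas are the standard textbook forms (with CAIC following Equation 6), and the log-likelihoods and parameter counts in the example are hypothetical rather than values underlying Table 2.

import numpy as np

def information_criteria(log_lik, n_params, n_obs):
    # Model-choice statistics discussed above; smaller is better.
    return {"AIC": -2.0 * log_lik + 2.0 * n_params,
            "BIC": -2.0 * log_lik + n_params * np.log(n_obs),
            "CAIC": -2.0 * log_lik + n_params * (np.log(n_obs) + 1.0)}

# Hypothetical fits for 1- to 3-class models of an 18-item scale; each added
# class contributes a full set of item parameters plus a mixing proportion.
for g, (ll, p) in enumerate([(-17500, 36), (-17100, 73), (-16950, 110)], start=1):
    print(g, "classes:", information_criteria(ll, p, n_obs=1563))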


Table 3. Mixed-Model Item Response Theory (MM-IRT) Item Fit and Latent Class Size Estimates, π, by Class Type (i.e., "Y," "N," and "?")

                                  Latent class size estimate, π
JDI Scale      Misfit item rate      AC      DC     MLQC
Work                5/54            .52     .27     .21
Coworker            3/54            .40     .27     .33
Supervisor          8/54            .48     .29     .22
Promotions          2/27            .32     .57     .11
Pay                 0/27            .34     .51     .15

AC = acquiescent class; DC = demurring class; JDI = Job Descriptive Index; MLQC = most likely to use the question mark class; misfit item rate = number of misfitting items/total number of item fit tests. "Y" corresponds to the class most likely to respond "Yes" (the AC), "N" to the class most likely to respond "No" (the DC), and "?" to the class most likely to respond "?" (the MLQC).

As noted previously, we attempted to proceed cautiously in our use of the Rasch-based PCM; we were therefore concerned with the extent to which there would be misfit because the PCM does not take discrimination into account, and we thought it important to investigate the relative fit of the PCM and the generalized PCM (GPCM), which allows discrimination to vary across items. We estimated the equivalent parameterizations of the 1-class PCM and GPCM given by Thissen and Steinberg (1986) in Multilog 7.03 (Thissen, Chen, & Bock, 2003) for the Work scale. The GPCM did show somewhat better fit (CAIC = 12,034) than the PCM (CAIC = 12,458); however, the two models appeared to fit similarly at the absolute level. Additionally, the locations of items under the two models were correlated .93. These results suggested that although the GPCM showed somewhat better fit, the PCM fit reasonably well. Below, we proceed with our analysis using the PCM.

Step 3: Assessing absolute fit. Item-level fit provides important information on the interpretability of item parameters and is also an indicator of absolute fit; up to now the focus has been on relative fit (i.e., model choice statistics). The significance test of the z-transformed Q statistic (see Rost & von Davier, 1994) was used to test for item misfit. For scales showing competitive solutions (i.e., lack of clarity in relative fit), we calculated and compared item misfit rates for the 3- and 4-class solutions. Item misfit rates for the 3-class solutions of all scales are provided in Table 3. The 3-class solution for the Work scale showed 5 misfitting items, above the 5% chance level (i.e., .05[i × g], or .05[18 × 3] = 2.7), whereas the 4-class solution showed 2 misfitting items, below the expected value (i.e., .05[18 × 4] = 3.6). For the Coworker scale, the 3-class solution showed 3 items with significant misfit, approximately the number expected by chance (i.e., 2.7); for the 4-class solution, 2 items were misfitting, just below chance levels (i.e., 3.6). For the Promotions scale, the 3-class solution showed a smaller proportion of misfitting items (2/27 = .07) than the 4-class solution (3/36 = .08). The empirical p values of bootstrapped Pearson χ² fit statistics showed better fit for the 3-class solution than for the 4-class solution in the Coworker scale (p = .03 vs. p < .001), both solutions showed acceptable fit for the Promotions scale (p = .08 vs. p = .10), and both showed misfit for the Work scale (both p < .001), suggesting absolute fit was either similar or better for the 3-class model in each of these scales. More importantly, inspection of item threshold parameter estimates for the 4-class solutions of the Work, Coworker, and Promotions scales showed that 6 of the 72, 4 of the 72, and 5 of the 36 item-by-class parameter sets, respectively, had threshold values beyond ±4, whereas for the 3-class solutions only 2 of the 27 for the Promotions scale exceeded this value; these are unreasonable values for item threshold parameters. These findings indicate that the 4-class solution cannot be "trusted" and that the more parsimonious model is more appropriate (M. von Davier, personal communication, October 26, 2009).
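The logic of the bootstrapped Pearson χ² used above can be illustrated with a parametric bootstrap sketch. This is our simplified analogue: it assumes a standard normal trait distribution within each class, uses hypothetical parameter values, and computes the discrepancy on item-by-category margins rather than on full response patterns as WINMIRA does.

import numpy as np

rng = np.random.default_rng(1)

def pcm_probs(theta, tau):
    # PCM category probabilities (Equation 3) for one item.
    sigma = np.concatenate(([0.0], np.cumsum(tau)))
    p = np.exp(np.arange(len(sigma)) * theta - sigma)
    return p / p.sum()

def simulate_mpcm(n, pi_g, taus_g):
    # Draw n response vectors from a fitted mixed PCM (normal-theta assumption).
    data = np.empty((n, len(taus_g[0])), dtype=int)
    for j in range(n):
        g = rng.choice(len(pi_g), p=pi_g)
        theta = rng.normal()
        for i, tau in enumerate(taus_g[g]):
            pr = pcm_probs(theta, tau)
            data[j, i] = rng.choice(len(pr), p=pr)
    return data

def counts(data, n_cat):
    return np.stack([np.bincount(data[:, i], minlength=n_cat)
                     for i in range(data.shape[1])]).astype(float)

def bootstrap_p(data, pi_g, taus_g, n_boot=50):
    # Empirical p value: share of bootstrap statistics at least as large as
    # the observed statistic, with model-implied counts approximated by one
    # large Monte Carlo sample from the fitted model.
    n, n_cat = data.shape[0], len(taus_g[0][0]) + 1
    expected = counts(simulate_mpcm(20 * n, pi_g, taus_g), n_cat) / 20.0
    x2 = lambda d: (((counts(d, n_cat) - expected) ** 2) / expected).sum()
    boot = [x2(simulate_mpcm(n, pi_g, taus_g)) for _ in range(n_boot)]
    return float(np.mean(np.array(boot) >= x2(data)))

taus_g = [[np.array([-1.0, 1.0])] * 6, [np.array([1.0, -1.0])] * 6]  # hypothetical
data = simulate_mpcm(300, [0.6, 0.4], taus_g)
print(bootstrap_p(data, [0.6, 0.4], taus_g))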


Figure 1. Within-class category probability chart for the acquiescent class (AC) in the Supervisor scale (probabilities of response Categories 0-2 plotted by item; class size = .48).

Step 4: Naming the latent classes (LCs). After the most appropriate LC structure has been identified, one can qualify, or find a way to refer to, the LCs that is behaviorally meaningful. This can be accomplished by identifying important differences in response behavior within and between classes. Qualifying the LCs can help both researchers and readers avoid confusion in interpreting the results of MM-IRT studies (Eid & Zickar, 2007). An overall inspection and comparison of within-class item-category probability histograms showed a clear trend in category use. One class, the largest in all scales with the exception of Pay and Promotions, was more likely than any other class to respond in the positive (i.e., satisfied) direction; we named it the Acquiescent Class (AC; see Figure 1). Another class emerged that was more likely than any other class to respond negatively and was the largest class in the Pay and Promotions scales; we named this LC the Demurring Class (DC; see Figure 2). Finally, for all scales there was one class, the smallest for all but the Coworkers scale, that was more likely than the others to use the "?" response; we named it the Most Likely to use the Question mark Class (MLQC; see Figure 3). Those belonging to the AC and DC avoided the "?" response, and only the MLQC used the "?" with any regularity, as indicated by its nonzero mode for frequency of "?" usage (see Table 4). These results offer support for Hypothesis 1, in that the majority of respondents (the AC and DC) avoided this option. The sizes of the AC and DC also confirmed our expectation that most respondents would prefer the extremes. Table 3 shows the size of each class labeled by the names given above.

Steps 5/6: Identification of systematic responding. One possible reason for latent class membership is the manifestation of particular systematic response styles. This question can be addressed by comparing the consistency of class assignment across several scales measuring different attitudes or other latent variables. We could find only one instance of this type of analysis, in Hernandez et al. (2004), where classification consistency was found to be low to moderate across scales of the 16 Personality Factors questionnaire. Conducting 10 χ² analyses based on 3 × 3 contingency tables for each of the possible scale-by-scale combinations (e.g., the concordance between the three latent classes for the Pay and Work scales), we found significant χ² statistics (p < .001) and small to medium effect sizes³ for all 10 tests (see Table 5). These findings support Hypothesis 2: although class membership is not purely a function of response style, response style is still a considerable factor, in agreement with Hernandez et al. (2004; see above).
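The scale-by-scale consistency analysis just described reduces to contingency statistics on paired class assignments. Below is a hedged sketch with simulated stand-in assignments (0 = AC, 1 = DC, 2 = MLQC), not the JDI data; Cohen's w and kappa are computed as in Table 5.

import numpy as np
from scipy.stats import chi2_contingency

def consistency(class_a, class_b, k=3):
    # Chi-square, Cohen's w, and Cohen's kappa for class assignments
    # from two scales (a 3 x 3 contingency table, df = 4).
    table = np.zeros((k, k))
    for a, b in zip(class_a, class_b):
        table[a, b] += 1
    chi2, p, _, _ = chi2_contingency(table)
    n = table.sum()
    w = np.sqrt(chi2 / n)                              # Cohen's w
    po = np.trace(table) / n                           # observed agreement
    pe = (table.sum(0) * table.sum(1)).sum() / n ** 2  # chance agreement
    return {"chi2": chi2, "p": p, "w": w, "kappa": (po - pe) / (1 - pe)}

rng = np.random.default_rng(2)
pay = rng.integers(0, 3, 500)                          # hypothetical assignments
work = np.where(rng.random(500) < .4, pay, rng.integers(0, 3, 500))
print(consistency(pay, work))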


Figure 2. Within-class category probability chart for the demurring class (DC) in the Supervisor scale (probabilities of response Categories 0-2 plotted by item; class size = .29).

Figure 3. Within-class category probability chart for the most likely to use the question mark class (MLQC) in the Supervisor scale (probabilities of response Categories 0-2 plotted by item; class size = .22).



Table 4. Central Tendency Estimates for the Number of Times "?" Was Used by Members of Each Class

Scale (no. of items)            Latent Class      M      SD    Mode
Work scale (18 items)           AC              0.41    0.70     0
                                DC              0.95    1.13     0
                                MLQC            4.43    2.11     3
Coworker scale (18 items)       AC              0.63    0.88     0
                                DC              0.92    1.26     0
                                MLQC            6.10    3.11     5
Supervision scale (18 items)    AC              0.70    0.98     0
                                DC              1.15    1.35     0
                                MLQC            6.09    3.10     5
Promotion scale (9 items)       AC              0.34    0.72     0
                                DC              0.49    0.85     0
                                MLQC            2.99    2.15     3
Pay scale (9 items)             AC              0.26    0.55     0
                                DC              0.46    0.74     0
                                MLQC            2.71    1.18     2

AC = acquiescent class; DC = demurring class; MLQC = most likely to use the question mark class.

Steps 5/6: Checking scoring assumptions. Inspecting item parameters revealed that the threshold locations of the PCM were disordered for the majority of classes, suggesting a potential violation of the property of additivity (discussed above). Across all scales of the JDI, the AC and DC showed disordering with large distances between thresholds (e.g., see Figures 4 and 5). For the MLQC, the thresholds were ordered as would be expected (e.g., see Figures 6-8), suggesting the sum score is an appropriate representation of this class's satisfaction levels. For the Pay and Work scales, thresholds were nearly identical (see Figures 9 and 10).

When thresholds are disordered, as was the case for the AC and DC, it is possible that ordered integer scoring (e.g., 0, 1, 2 or 0, 1, 3) may not be appropriate. However, it is also possible that the model has been properly estimated in such cases (Borsboom, 2005; Rost, 1991). Thus, we looked closer at these classes to determine whether the typical ordered sum-scoring of the JDI is viable for representing the latent trait. First, we considered whether the observed score distributions are consistent with the type discussed by Rost (1991), in which a measurement model may be estimated properly despite disordered thresholds. Additionally, we considered the influence of excluding a discrimination parameter from the measurement model (see Borsboom, 2005, chap. 4). The observed score distributions for the AC and DC were either highly skewed (e.g., see Figure 11) or U-shaped (e.g., see Figure 12), whereas the MLQC showed a quasi-normal distribution with low kurtosis (e.g., see Figure 13). This is consistent with Rost's (1991) guidance that disordered thresholds can be properly estimated under such distributional conditions.

Although this threshold disordering may seem serious at first glance, it should be noted that it is the intersections of categories, and not the categories themselves, that are disordered. Thresholds represent the point on the trait continuum at which the probability of endorsing one option becomes greater than that of endorsing another (e.g., the level of the trait at which endorsing "Yes" becomes more likely than endorsing "?"), that is, the intersection of the category curves. Thus, it is possible for thresholds to be disordered while category curves are not. Consulting response curve plots for the 1-class PCM for each of the scales (which also showed disordered thresholds), we noted that the disordering of thresholds was a result of the low probability of using the "?" option and not of disordered categories (e.g., see Figure 14). Additionally, θ estimates were highly positively correlated with the sum score across classes for the Work (r = .97), Coworker (r = .96), Supervisor (r = .97), Pay (r = .99), and Promotions (r = .98) scales.
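Checking for threshold disordering, as done above, amounts to testing whether each item's estimated thresholds increase monotonically; under the PCM, disordered thresholds mean the middle option is never the single most probable response. A minimal sketch with hypothetical estimates:

import numpy as np

def disordered_items(thresholds):
    # Flag items whose threshold estimates are not strictly increasing.
    return [i for i, tau in enumerate(thresholds) if np.any(np.diff(tau) <= 0)]

thresholds = [np.array([-1.2, 0.8]),   # ordered
              np.array([0.9, -0.4]),   # disordered: "?" nowhere modal
              np.array([-0.3, 1.5]),   # ordered
              np.array([0.2, 0.1])]    # disordered
print(disordered_items(thresholds))    # [1, 3]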


Table 5. χ² Statistics and Cohen's w (Below the Diagonal) and Cohen's Kappa (Above the Diagonal) by Scale Pairs

χ² (Cohen's w based on Cramér's φ) below the diagonal; Cohen's κ above the diagonal.

JDI Scale     Work            Coworker        Supervisor      Promotions     Pay
Work            —               .01             .23             .03          .01
Coworker     162.14 (.34)        —              .03             .02          .07
Supervisor   176.56 (.35)    222.84 (.38)        —              .07          .01
Promotions    51.31 (.18)     90 (.24)        80.38 (.22)        —           .06
Pay          113.95 (.28)     73.37 (.22)     92.17 (.25)     110.9 (.27)     —

All tests were performed on 3 (latent class membership, scale 1) × 3 (class membership, scale 2) contingency tables with df = 4. Negative Cohen's kappa suggests no agreement and can effectively be set to 0.

Figure 4. Within-class item threshold parameter plot for the acquiescent class (AC) in the Supervisor scale (Thresholds 1 and 2 plotted by item; class size = .48).

As noted above, it is also important to consider the possibility that the threshold disordering is due to the Rasch-based model being too restrictive in not taking discrimination into account (M. von Davier, personal communication, October 23, 2009). We determined that this was not the case. As noted above, we estimated the equivalent parameterizations of the 1-class PCM and GPCM given by Thissen and Steinberg (1986) for the Work scale. Although the GPCM fit the data somewhat better than the PCM, the inclusion of varying discriminations did not alleviate disordering for the large majority of items. These results suggest the disordering in the AC and DC is due to these groups' low probability of choosing the "?" option, as can be seen by examining the intersections of the ORFs in Figure 14. Were the probability of using "?" higher in this plot, the thresholds would be ordered as expected. Furthermore, the observed score distributions were consistent with those discussed by Rost (1991), and the disordering was not due to the exclusion of a discrimination parameter. Thus, the threshold disordering did not appear because of problems with the data or the model; the model appeared to have been properly specified, gave trait estimates consistent with sum scores, and showed appropriate category ordering in spite of the disordered parameters.


Figure 5. Within-class item threshold parameter plot for the demurring class (DC) in the Supervisor scale (Thresholds 1 and 2 plotted by item; class size = .29).

Figure 6. Within-class item threshold parameter plot for the most likely to use the question mark class (MLQC) in the Supervisor scale (Thresholds 1 and 2 plotted by item; class size = .22).



Figure 7. Within-class item threshold parameter plot for the most likely to use the question mark class (MLQC) in the Coworker scale (Thresholds 1 and 2 plotted by item; class size = .32).

Figure 8. Within-class item threshold parameter plot for the most likely to use the question mark class (MLQC) in the Promotions scale (Thresholds 1 and 2 plotted by item; class size = .11).


Figure 9. Within-class item threshold parameter plot for the most likely to use the question mark class (MLQC) in the Pay scale (Thresholds 1 and 2 plotted by item; class size = .15).

Figure 10. Within-class item threshold parameter plot for the most likely to use the question mark class (MLQC) in the Work scale (Thresholds 1 and 2 plotted by item; class size = .21).


Figure 11. Observed score distribution for the Coworker scale in the acquiescent class (AC) (raw-score frequencies with WLE and MLE person parameter estimates; class size = .40).

Figure 12. Observed score frequency distribution for the Coworker scale in the demurring class (DC) (raw-score frequencies with WLE and MLE person parameter estimates; class size = .27).

Step 7: Correlates of class membership. We found two basic ways of investigating this question in the available MM-IRT literature: (a) use of regression techniques and (b) integration of covariates into measurement models. In our illustration, we focus on the former because more organizations and researchers are likely to possess the statistical and/or programming expertise needed for the regression-based approach. In this method, the latent class structure is used as a multinomial dependent variable, and multinomial logistic regression (MLR) is therefore an appropriate mode of analysis for determining predictors of class membership. This is an important extension of previous research, which has (appropriately for its purposes) used only binary logistic models for 2-class solutions (e.g., Hernandez et al., 2004) or complex covariate-measurement model integration when more than two classes are retained (e.g., Maij-de Meij et al., 2008).
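The MLR step can be sketched with statsmodels' MNLogit; the data below are simulated stand-ins (the variable names tim, jt, and klass are hypothetical, not the JDI data set), with class 0 (the AC) serving as the reference category, as in our analyses.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 1500
tim = rng.normal(35, 23, n)            # trust in management (continuous)
jt = rng.gamma(2.0, 5.0, n)            # job tenure (positively skewed)

# Class membership (0 = AC, 1 = DC, 2 = MLQC) loosely driven by TIM.
logits = np.column_stack([np.zeros(n),
                          -0.02 * (tim - 35),
                          -0.03 * (tim - 35) - 0.5])
probs = np.exp(logits) / np.exp(logits).sum(1, keepdims=True)
klass = np.array([rng.choice(3, p=row) for row in probs])

X = sm.add_constant(np.column_stack([tim, jt]))
fit = sm.MNLogit(klass, X).fit(disp=False)   # lowest class value (AC) is the base
print(fit.summary())
print(np.exp(fit.params))                    # exp(B): odds multipliers vs. the AC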


Figure 13. Observed score frequency distribution for the Coworker scale in the most likely to use the question mark class (MLQC) (raw-score frequencies with WLE and MLE person parameter estimates; class size = .32).

Figure 14. Example option response curves for an item in the Work scale under the 1-class partial credit model (PCM), plotting response probability against ability. Note: 1 = No, 2 = ?, 3 = Yes.



In all, we used five variables as possible predictors of class membership: (a) JT, the transformed number of years on the job; (b) TIM, a self-report measure of trust in management; (c) age; (d) sex; and (e) race. We performed five multinomial logistic regression analyses, one for each JDI scale, with latent class membership (i.e., AC, DC, and MLQC) as the outcome variable. The correlations between these variables are shown in Table 6. The AC was used as the reference group in these analyses because it was consistently large relative to the DC and MLQC. First, likelihood ratio statistics were calculated to select variables explaining class membership (see Table 7). Variables that were nonsignificant at this step were excluded from the final MLR analyses. TIM was the only variable selected for all scales, and race was the only variable that did not contribute to any of the five models. JT was included in the MLR analyses only for the Promotions and Pay scales; sex was included for the Work, Promotions, and Pay scales; and age for the Work and Promotions scales. Next, the final MLR analyses were conducted using only the variables selected in the first step. Logit regression coefficients, B, can be tested with the Wald statistic, which is distributed approximately χ² with one degree of freedom. The exponentiated coefficient, exp(B), can be interpreted as the multiplicative change in the odds of belonging to a given class rather than the reference class (here, the AC) per one-unit increase in the predictor. However, two special considerations are necessary for dichotomous predictor variables. First, B, and therefore exp(B), will likely be much larger than for continuous predictors, because a one-unit change in a continuous variable such as TIM represents a much smaller difference than a one-unit change in sex (i.e., moving from male to female). Second, the standard errors used to compute the Wald statistic are often unduly inflated and thus can lead to elevated Type II error rates (Menard, 2002). Thus, for dichotomous variables, the likelihood ratio statistics described above should be used for variable selection, and exp(B) should be trusted over the Wald statistic. The final MLR models showed that the selected variables explained low to moderate amounts of variance in class membership (Nagelkerke R² from .04 to .20; see model information in Table 8). For all JDI scales, TIM was a significant predictor of belonging to the MLQC and the DC, such that a higher TIM score decreased the odds of belonging to either group relative to the AC, as indicated by exp(B) coefficients between .97 and .99.4 The only exception was the Promotions scale, where higher TIM increased the odds of belonging to the MLQC, though by a negligible factor (i.e., 3%). JT was a significant predictor of belonging to the MLQC and the DC for the Promotions and Pay scales; longer JT increased the likelihood of belonging to these classes for the Promotions scale and decreased it for the Pay scale. Higher age decreased the likelihood of belonging to the DC and MLQC for the Work scale, whereas older respondents showed increased chances of belonging to the DC and MLQC for the Promotions scale. Females had greater odds than males of being in the MLQC for the Work and Pay scales and in the DC for the Pay scale. Females showed lower odds than males of being in the DC for the Work scale and in the MLQC for the Promotions scale. Results of the logistic regression analyses for the MLQC classes are briefly summarized in Table 9.
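The two-step procedure just described can be sketched as follows. The likelihood ratio statistic for a candidate predictor is 2(LL_full − LL_reduced), distributed approximately χ² with degrees of freedom equal to the number of coefficients dropped (two per predictor here, one for each non-reference class). The file and column names remain the hypothetical placeholders used above, so this is an illustration of the logic rather than a reproduction of our analysis.

```python
# Hedged sketch of likelihood ratio variable selection for one predictor ("age"),
# reusing the hypothetical data layout from the previous example.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.stats import chi2

df = pd.read_csv("jdi_classes.csv")  # hypothetical modal class assignments
y = pd.Categorical(df["jdi_class"], categories=["AC", "DC", "MLQC"]).codes

predictors = ["tim", "jt", "age", "sex"]
X_full = sm.add_constant(df[predictors])
X_reduced = sm.add_constant(df[[p for p in predictors if p != "age"]])

ll_full = sm.MNLogit(y, X_full).fit(disp=False).llf
ll_reduced = sm.MNLogit(y, X_reduced).fit(disp=False).llf

# Dropping "age" removes one coefficient from each of the two non-reference
# logits (DC vs. AC and MLQC vs. AC), so the test has 2 degrees of freedom.
lr = 2 * (ll_full - ll_reduced)
p_value = chi2.sf(lr, df=2)
print(f"LR = {lr:.2f}, p = {p_value:.4f}")  # retain "age" if p < .01 (.05/5)

# For retained predictors, exp(B) is the multiplicative change in the odds of
# DC or MLQC membership (relative to the AC) per one-unit increase.
final = sm.MNLogit(y, X_full).fit(disp=False)
print(np.exp(final.params))
```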

Discussion

In this article, we provided a comprehensive overview of MM-IRT and past research using the method, in addition to related techniques that can be used to understand latent class structures in item response data. We have also included a table (Table 1) outlining the major steps commonly undertaken in conducting MM-IRT analyses. Given that MM-IRT is an underutilized tool in organizational research, this study used MM-IRT to investigate some interesting questions about a popular work-related satisfaction measure, the JDI. The data at hand afforded us a way to show some


Table 6. Predictor Variables and Dummy Variables by Scale

[Correlation matrix among 1. JT, 2. TIM, 3. Age, 4. Race, 5. Sex, and the dummy-coded class membership variables for each JDI scale: Work (6a. AC, 6b. DC, 6c. MLQC), Coworker (7a. AC, 7b. DC, 7c. MLQC), Supervisor (8a. AC, 8b. DC, 8c. MLQC), Pay (9a. AC, 9b. DC, 9c. MLQC), and Promotions (10a. AC, 10b. DC, 10c. MLQC).]

Note: AC = acquiescent class; DC = demurring class; JT = job tenure; MLQC = most likely to use the question mark class; TIM = trust in management. Dummy coding: Sex (1 = Female, 0 = Male); Race (1 = Not Caucasian, 0 = Caucasian).
*p < .05. **p < .01.


Table 7. Likelihood Ratio Statistics for Model Predictors by Scale

Predictor   Work      Coworker   Supervisor   Promotions   Pay
TIM         68.09a    51.91a     143.63a      193.01a      88.49a
JT           0.58      2.59        1.53        15.03a      20.31a
Age         13.96a     6.29        7.32        39.21a       2.96
Sex         15.01a     4.56        5.75        14.83a      31.12a
Race         3.85      0.75        1.28         3.64        7.22

Note: JDI = Job Descriptive Index; JT = job tenure; TIM = trust in management. Dummy coding: Sex (1 = Female, 0 = Male); Race (1 = Not Caucasian, 0 = Caucasian).
a The predictor significantly improved the model and was selected to remain in the model at p < .01 (i.e., .05 corrected for the number of JDI scales analyzed, or .05/5).

Table 8. Multinomial Logistic Regression Results Using the AC as the Reference Category for All Scales

Scale (Class)        Variable
Work (DC)            TIM*, Age*, Sex
Work (MLQC)          TIM*, Age*, Sex
Coworker (DC)        TIM*
Coworker (MLQC)      TIM*
Supervisor (DC)      TIM*
Supervisor (MLQC)    TIM*
Promotions (DC)      TIM*, JT*, Age*, Sex
Promotions (MLQC)    TIM*, JT, Age*, Sex*
Pay (DC)             TIM*, JT*, Sex*
Pay (MLQC)           TIM*, JT*, Sex*

[The B, Wald, significance, and exp(B) columns for these rows continue in the remainder of the table; the recovered portion of the B column reads: 0.02, 0.02, 0.30, 0.01, 0.02, 0.31, 0.02, 0.01, 0.03, 0.02, 0.02, 0.66, 0.03.]