DEVELOPING AN INTERPRETATION OF ITEM PARAMETERS FOR PERSONALITY ITEMS: CONTENT CORRELATES OF PARAMETER ESTIMATES

MICHAEL J. ZICKAR AND KAREN L. URY
Bowling Green State University
The research in this article attempted to relate content features of personality items to item parameter estimates from Muraki’s partial credit model. Goldberg’s Adjective Checklist was administered to 329 participants to calibrate item parameters. These parameters were then related to item ratings of social desirability, item subtlety, frequency of misunderstanding, and word usage frequency indexes. Subtle items tended to have lower discrimination parameter estimates. There was also an interaction between subtlety and social desirability such that items low in subtlety and high in social desirability had lower location parameters. Implications for future psychometric research in the personality domain are discussed.
More than 25 years ago, Ralph Heine (1971) lamented, “I would like to echo a wish of the author [Donald Fiske] that a much larger area of collaboration be established between measurement specialists and personality theorists” (p. vi). The neglect of personality in psychometric theory has been evident in the development of item response theory (IRT). Thissen, Steinberg, Pyszczynski, and Greenberg (1983) argued that “analysis of personality and attitude scales using IRT has lagged behind applications in ability measurement” (p. 212). Some research has applied IRT models to personality data since Thissen et al.’s complaint (e.g., Fraley, Waller, & Brennan, 2000; Reise & Waller, 1990, 1993; Zickar & Drasgow, 1996; Zickar & Robie, 1999); however, IRT remains a test theory largely wedded to the cognitive ability domain. For example, labels that are used to describe item parameters, such as “item difficulty,” have relevance in the context of ability and educational measurement, but they are misleading in a personality measurement context. This article aims to develop an interpretation of item parameters for personality items by understanding the relation of item format and content to IRT parameter estimates.

Correspondence concerning this article should be addressed to Michael Zickar, Department of Psychology, Bowling Green State University, Bowling Green, OH 43403; e-mail: mzickar@bgnet.bgsu.edu. Educational and Psychological Measurement, Vol. 62 No. 1, February 2002, 19-31. © 2002 Sage Publications.

In IRT, an item response function links the probability of affirming an item to specific values of a latent trait, commonly denoted θ. An option response function (ORF) is constructed similarly for models with more than two response options (e.g., the response scales most often used for personality items); the ORF links the probability of selecting a particular option to θ. The shapes of the ORFs are determined by the choice of a particular IRT model and the estimated parameters specified by that model.

Research on the Interpretation of Item Parameters

As stated before, IRT item parameters have intuitive meaning in the ability domain; these intuitive meanings (i.e., discrimination and difficulty) can be used to understand the relation between item content and the shapes of item response functions. In the cognitive ability domain, Embretson and her colleagues have attempted to move beyond this intuitive understanding of item parameters by researching empirical relations between item content and item parameter estimates. As an example of this research, Embretson and Wetzel (1987) modeled characteristics of items that covaried with item parameter estimates in a paragraph comprehension test. They entered dummy-coded features of items (e.g., number of propositions in a paragraph, average reading level of words in a paragraph) into a regression equation, treating the item parameter estimates as the dependent variables. Embretson and Wetzel found that successful prediction of item difficulty was obtained when models that incorporated wide representation of both text and decision processing were used.
Embretson has applied similar methodology to other types of cognitive ability tests, such as abstract reasoning (Embretson, 1998) and mathematical reasoning (Embretson, 1995). Similar research needs to be conducted in the personality domain to help develop a theoretical and empirical understanding of item parameters in personality research. Roskam (1985) stated that “with some exceptions, there appears to be no theory which offers a psychological interpretation of the parameters of item response models” (p. 8). Roskam conjectured that for personality items, the discrimination parameter should be related to the concreteness of the item: concretely worded items should have high discrimination parameters, whereas items written in abstract terms should have low discrimination parameters. Zumbo, Pope, Watson, and Hubley (1997) tested Roskam’s hypothesis using the two- and three-parameter logistic IRT models fit to the Extraversion and Neuroticism scales of the Eysenck Personality Questionnaire (Eysenck & Eysenck,
1975). For both scales, they failed to find statistically significant negative correlations between the concreteness of an item and the a (discrimination) parameter estimate for that item; hence, Roskam’s conjecture was not supported. In another study that aimed to understand how individuals respond to and comprehend personality items, Graziano, Jensen-Campbell, Steele, and Hair (1998) found that many first-year college students did not understand the meaning of some of the words in Goldberg’s (1992) Adjective Checklist (ACL). That study, however, did not relate item comprehension rates to item-level statistics, such as item-total correlations or item discrimination parameter estimates.

Although there has been little research linking IRT or classical test theory analyses to the content of personality items, there has been considerable research using factor analytic methodologies. That research has shown that factor analysis of personality data often identifies factors that are artifacts of item characteristics. For example, factor analyses of personality items have often extracted separate factors representing positively worded items and negatively worded items (e.g., Levin & Montag, 1989; Schmitt & Stults, 1985; Spector, Van Katwyk, Brannick, & Chen, 1997). Although this factor analytic research has improved understanding of the statistical functioning of personality items, there is still more research to be done. In the present research, we attempt to relate parameter estimates of personality items to features and characteristics of items on Goldberg’s (1992) ACL. We used Muraki’s (1990) partial credit model (PCM), which can be used to model polytomous data with ordered options.

PCM

The PCM can be used to fit items composed of m ordered response options. A latent trait, commonly denoted θ, is estimated for each individual; θ is generally assumed to be distributed standard normal [i.e., N(0, 1)].
Two item parameters are estimated that determine the shapes of the m ORFs for each item. The discrimination parameter, a, is related to the slope of an item’s ORFs. Item discrimination parameters are constrained to be equal for each of the ORFs within an item; a parameters can vary, however, between items. Larger a parameters denote more discriminating items; hence, it is generally desirable to have items with large a parameters. The PCM also assumes that the distances between adjacent responses on a Likert-type scale are identical across all items within a scale; these distances, denoted category parameters, are estimated for a set of items. For each item, a location parameter, b, is estimated. The location parameter is related inversely to the endorsement rate of an item. Large, positive locations (e.g., b > +1.0) indicate that only individuals with a high value on the
latent trait θ will tend to select the most positive options. Items with large, negative locations (e.g., b < –1.0) will have the highest option chosen by almost all individuals. For example, suppose there are two items that both have a 7-point response format ranging from strongly disagree to strongly agree. An item that asks respondents if they are “sometimes sad” should have a higher mean score than an item that asks respondents if they “sometimes consider suicide” because the prevalence of sad thoughts should be higher than that of suicidal thoughts. In this example, the item related to sad thoughts should have a lower location parameter than the suicidal thoughts item.
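The location example above can be made concrete with a small numerical sketch. The code below computes option response probabilities in a partial-credit-style form (cumulative logits with a set of category parameters shared across items). The category parameter values, discriminations, and locations are hypothetical, and exact parameterizations of Muraki's model differ across programs, so this is an illustration of the location parameter's effect rather than a reproduction of the study's estimation.

```python
import numpy as np

def pcm_probs(theta, a, b, d):
    """Option response probabilities for one polytomous item under a
    partial-credit-style model: cumulative logits a*(theta - b + d_k),
    with category parameters d (d[0] = 0) shared across items."""
    z = np.cumsum(a * (theta - b + np.asarray(d, dtype=float)))
    z -= z.max()                      # guard against numerical overflow
    p = np.exp(z)
    return p / p.sum()

def expected_option(theta, a, b, d):
    """Expected option index (0 .. m-1) at trait level theta."""
    p = pcm_probs(theta, a, b, d)
    return float(np.arange(len(p)) @ p)

# Hypothetical category parameters for a 7-option agree-disagree scale.
d = [0.0, 1.0, 0.6, 0.2, -0.2, -0.6, -1.0]
sad = expected_option(0.0, a=1.0, b=-1.0, d=d)       # low-location item
suicidal = expected_option(0.0, a=1.0, b=1.5, d=d)   # high-location item
# At the same theta, the lower-location item yields the higher expected response.
```

Because the location b enters only through (theta - b), shifting b upward moves probability mass toward lower options, which is why the "sometimes sad" item above earns the higher expected score.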
Item Features That Influence Item Parameter Estimates

Miscomprehension

In the present research, we collected various perceptual ratings of items, as well as a standard word frequency count, and related those indexes to PCM item parameter estimates. We hypothesized that adjectives that were most frequently misunderstood by respondents would have lower a parameters than items that were more often understood correctly. Individuals who misunderstand the connotations of a word cannot respond with the appropriate frame of reference for that item; their responses may be based on false knowledge and, hence, appear to be random. We assessed the miscomprehension of words in two ways. First, a standardized measure of the frequency of word use in literature and the press was used (Zeno, Ivens, Millard, & Duvvuri, 1995); words that appear infrequently in common English usage should be misunderstood frequently by respondents. Second, we asked respondents to identify words that they had misunderstood after completing the personality instrument. We hypothesized that discrimination (a) parameter estimates would correlate positively with the word usage frequency measure and negatively with the frequency of misunderstanding reported by respondents.

Subtlety and Ambiguity

Previous research has also suggested that items that are subtly related to the underlying construct measured by the scale may be less valid indicators of that construct than items that are more directly or obviously related (e.g., Burkhart, Gynther, & Fromuth, 1980; Duff, 1965; Wiener, 1948). Although there is now some consensus among personality test developers that subtle items should be avoided, some test developers may prefer
subtle items to obvious items because subtle items should be more resistant to motivated faking and socially desirable responding. In related research, the relationship between item ambiguity and item discrimination has also been investigated. Harris and Baxter (1965) defined an ambiguous item as one that produces problems in generating a response because of statement wording, vagueness, or the number of possible meanings of the item. Much of this research on item ambiguity has used the Minnesota Multiphasic Personality Inventory (MMPI). For example, Baxter and Morris (1968) investigated the relationship between item ambiguity and item discrimination for the MMPI. They described two types of ambiguity (see Broen, 1960; Goldberg, 1963): interpretive ambiguity (IA) and response ambiguity (RA). IA refers to interindividual variability in the interpretation of a specific item; IA is not always a negative aspect of an item and may in fact be a desirable property. RA refers to the inconsistency of an individual’s responses to an item over time. Baxter and Morris asserted that items high in IA and low in RA would be most useful for personality measurement. Accordingly, they hypothesized that items high in IA and low in RA would have greater discriminating power (defined as the number of MMPI scales on which the item is keyed). In the present research, we measured subtlety (IA) by assessing participants’ ability to assign items to the correct underlying construct: items that are more subtle should be incorrectly classified more often than obvious items. Accordingly, based on the previous research, we hypothesized that items judged more subtle (ambiguous) by this procedure would tend to have lower discrimination parameter estimates. We did not hypothesize a relation between the degree of item subtlety and location parameter estimates.
Social Desirability

There has been a long history of research demonstrating that items presenting attributes respondents view as desirable are endorsed at higher rates than items presenting attributes respondents view as negative (e.g., Edwards, 1953; Rogers, 1971; Trimble, 1997). Because the location parameter is related inversely to the percentage of individuals endorsing an item, we hypothesized a negative relation between location parameter estimates and social desirability ratings. Moreover, Rorer (1965) hypothesized that socially desirable responding operates mainly on unambiguous items: “the more explicit the content of an item [i.e., less subtle], the greater the extent to which a set [e.g., social desirability] could operate” (p. 134). We therefore also predicted an interaction of social desirability with item subtlety in that social desirability
should have more of an impact with obvious items because respondents will be more likely to judge correctly which option is most desirable.
Method

Participants

A total of 121 undergraduate students in a large introductory psychology class at a medium-sized midwestern university completed Goldberg’s (1992) ACL as a prefatory exercise for a lecture on personality traits and the Big Five personality theory. Two thirds (n = 82) of the sample were female. Sixty-one percent (n = 71) of the sample were in their first year of college; the mean age was 19.2 years. Thirty percent (n = 36) of the sample reported that they had never completed a personality inventory before; 44% (n = 54) reported that they had completed “a few” personality inventories previously. To improve the accuracy of item parameter estimates, responses to the ACL from 208 participants who had completed the ACL for an experiment on group processes were combined with the previous data set. These participants were recruited from an introductory psychology class at the same university during the previous semester. The two combined data sets resulted in an item parameter estimation sample of n = 329. The second data set was not included in any of the other analyses reported in this article.

Measures

The five scales from Goldberg’s (1992) ACL were administered to participants. Four items (energetic, moody, timid, and warm) were excluded from the original scale because of a clerical error, leaving 96 items. The five scales in the ACL are Agreeableness, Conscientiousness, Emotional Stability, Surgency, and Intellect, which correspond roughly to the Big Five typology (McCrae & Costa, 1987).

Criteria Measures

Word frequency was assessed with the standard frequency index (SFI), which tabulates the frequency of words in a sample of textbooks and works of fiction and nonfiction (Zeno et al., 1995). The index ranged from 3.5 to 88.3, with higher values representing greater frequency. Eight words were not listed in the frequency guide and, hence, were given the lowest possible frequency rating, 3.5.
After completing the ACL, participants were asked to circle adjective items that they did not fully understand. The percentage of respondents who
indicated that they misunderstood a word (MUD) was used as an index of the miscomprehension of each word. Next, an independent sample of 28 participants from an undergraduate class on testing and measurement was presented with definitions of the five Goldberg (1992) constructs and asked to classify each item according to the construct that they decided the item measured. Our measure of item subtlety was the percentage of individuals who misclassified each item. Finally, social desirability ratings for each item were gathered from another independent sample of 78 students from a different introductory psychology course. A procedure developed by Edwards (1957) was used to gather these ratings: participants judged each trait in terms of whether that trait would be considered desirable or undesirable in others. Each word was rated on a 9-point scale, which ranged from extremely undesirable (1) to extremely desirable (9). The average social desirability rating across all 78 participants was calculated for each item and used in further analyses.

Parameter Estimation

PARSCALE 3.0 (Muraki & Bock, 1996) was used to estimate PCM item parameters for each of the five scales (i.e., there were five separate parameter estimation runs). The PCM did not fit three items (“distrustful,” “unexcitable,” and “unsophisticated”); standard errors for the location parameter estimates of these items could not be estimated. Subsequent analyses were conducted with and without those three items; because results did not differ across the two analyses, results are reported with the three items included.
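The rating-based indexes described in the Criteria Measures section (subtlety as the percentage of raters misclassifying an item, MUD as the percentage of respondents circling a word, and the mean Edwards desirability rating) are simple aggregations. The sketch below computes them from hypothetical raw matrices; the item pool, answer key, and all simulated responses are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_items = 5                                   # toy item pool

# Subtlety: 28 raters assign each item to one of several constructs;
# the index is the proportion of raters who misclassify it.
true_construct = np.array([0, 0, 1, 2, 2])    # invented answer key
assigned = rng.integers(0, 3, size=(28, n_items))
subtlety = (assigned != true_construct).mean(axis=0)

# MUD: percentage of 121 respondents who circled the word as not understood.
circled = rng.random((121, n_items)) < 0.05
mud = 100.0 * circled.mean(axis=0)

# Social desirability: mean of 78 raters' 9-point Edwards ratings.
ratings = rng.integers(1, 10, size=(78, n_items))   # values 1..9
desirability = ratings.mean(axis=0)
```

Each index is one value per item, so the three vectors can then be correlated with item parameter estimates.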
Results

The five personality scales appeared to function well, as internal consistency estimates were similar to previous results (e.g., Goldberg, 1992). Table 1 presents the descriptive statistics for the five personality scales. The Surgency scale had the highest mean discrimination parameter (M = 0.80), whereas Emotional Stability had the lowest mean discrimination (M = 0.61). However, a one-way ANOVA suggested that there was not statistically significant variation in the mean discrimination parameter estimates between scales, F(4, 92) = 1.52, p > .05, η2 = .06. A similar one-way ANOVA suggested that there was statistically significant variation in the b location, F(4, 92) = 14.93, p < .01, η2 = .39. Post hoc tests using Bonferroni corrections indicated that the Emotional Stability scale (M = 0.15) had a significantly higher mean b location than all other scales except Surgency (M = –0.54). The mean Surgency b location was also significantly higher than those of the Conscientiousness (M = –1.48) and Agreeableness (M = –1.98) scales. There were no other statistically significant differences for the b location estimates.

Table 1
Descriptive Statistics of Study Variables

Scale                            M     SD    α     a      b
Agreeableness (19 items)        7.18  0.79  0.87  .68  –1.98
Conscientiousness (20 items)    6.38  0.97  0.89  .64  –1.48
Emotional Stability (19 items)  4.73  0.95  0.90  .61   0.15
Surgency (18 items)             5.78  1.14  0.90  .80  –0.54
Intellect (20 items)            6.55  0.89  0.87  .65  –1.33

Note. Scales ranged from 1 to 9. a refers to the mean discrimination parameter estimate in the partial credit model (PCM); b is the mean location parameter estimate in the PCM.

Table 2 presents the mean item content indexes for all five ACL scales. The mean SFI was 41.67 (SD = 15.09). The adjectives with the highest SFI (i.e., most frequently used) were kind (M = 66.20) and cold (M = 64.30), whereas eight words (e.g., imperceptive) were given the lowest SFI rating of 3.50 because they were not listed in the index. A one-way ANOVA suggested that there were no statistically significant differences between scales in the mean SFI, F(4, 91) = 1.79, p > .05, η2 = .07.

The mean MUD index (the frequency of words misunderstood by respondents) was 4.47 (SD = 10.09), indicating that, on average, between 4 and 5 people (out of 121) indicated that they did not understand each word. Forty-four (46%) of the words were understood by all participants (e.g., helpful and organized). Table 3 presents the 11 items that respondents most often did not understand. A one-way ANOVA indicated that there were no statistically significant differences in the number of words misunderstood between scales, F(4, 91) = 1.07, p > .05, η2 = .05.

The mean subtlety rating for all items was 0.39 (SD = 0.25), indicating that, on average, participants incorrectly identified the appropriate scale for each item about 40% of the time. The words most frequently classified correctly were active, anxious, extraverted, imaginative, insecure, and sympathetic, which were each correctly classified by 96% of the participants (only 1 person misclassified each). Cold and demanding were the most frequently misclassified, with only 11% of the participants correctly classifying each of those adjectives. A one-way ANOVA indicated that there was no statistically significant variation in mean subtlety between scales, F(4, 91) = 0.78, p > .05, η2 = .03. The mean social desirability rating was 4.67 (SD = 2.04).
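The between-scale comparisons reported above are ordinary one-way ANOVAs over item-level estimates. A hand-rolled sketch follows, run on simulated per-item discrimination estimates loosely centered on the Table 1 scale means; the 0.15 spread and the simulated values are invented, so the resulting F is illustrative only.

```python
import numpy as np

def one_way_anova(groups):
    """One-way ANOVA across groups of item parameter estimates.
    Returns (F, df_between, df_within, eta_squared)."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_x = np.concatenate(groups)
    grand = all_x.mean()
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ss_total = ((all_x - grand) ** 2).sum()
    ss_within = ss_total - ss_between
    df_b, df_w = len(groups) - 1, len(all_x) - len(groups)
    F = (ss_between / df_b) / (ss_within / df_w)
    return F, df_b, df_w, ss_between / ss_total

# Simulated per-item a estimates for the five scales (19, 20, 19, 18, 20 items),
# centered on the Table 1 means; spread is hypothetical.
rng = np.random.default_rng(0)
scale_specs = [(0.68, 19), (0.64, 20), (0.61, 19), (0.80, 18), (0.65, 20)]
groups = [rng.normal(mean, 0.15, size=n) for mean, n in scale_specs]
F, df_b, df_w, eta2 = one_way_anova(groups)
```

With 96 items in five groups this yields df = (4, 91), matching the degrees of freedom reported for the content-index ANOVAs above.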
The most socially desirable adjective was trustful (M = 8.58), whereas the least desirable was distrustful (M = 1.21). A one-way ANOVA indicated that there were
Table 2
Mean Indexes by Scale

Index                      Agreeableness  Conscientiousness  Emotional Stability  Surgency  Intellect
Standard frequency index   45.51          46.48              37.52                41.18     36.63
Subtlety                   0.40           0.33               0.36                 0.39      0.46
Misunderstood frequency    0.42           4.35               5.32                 6.50      5.80
Social desirability        4.71           4.76               3.86                 5.00      4.79
Table 3
Most Frequently Misunderstood Adjectives

Adjective        Frequency Misunderstood (%)
Imperturbable    58
Introspective    30
Introverted      28
Extraverted      26
Haphazard        23
Imperceptive     20
Uninquisitive    15
Inhibited        12
Negligent        12
Systematic       12
Vigorous         12
no statistically significant differences in social desirability between scales, F(4, 91) = 0.90, p > .05, η2 = .03. Table 4 presents the correlations between variables in the study. The correlation between the discrimination parameter estimate and the frequency of words not understood by respondents (MUD) was not statistically significant (r = .10, p > .05). In addition, the word usage index (SFI) was unrelated to the a parameter (r = .13, p > .05). Both of these results were inconsistent with the first hypothesis. As predicted, item subtlety was correlated negatively with discrimination parameter estimates (r = –.32, p < .01). There was not a statistically significant relation between item subtlety and the location estimate (r = .13, p > .05). Finally, correlations of social desirability ratings with parameter estimates were not statistically significant for both the discrimination parameter
Table 4
Correlation Matrix of Item-Level Variables

                            a       b      Standard Frequency Index  Subtlety  Misunderstood
b                          –.16
Standard frequency index    .13     .00
Subtlety                   –.32**   .13   –.26*
Misunderstood               .10     .18   –.29**                      .15
Social desirability         .14    –.04    .38**                     –.48**    –.00

*p < .05. **p < .01.
estimate (r = .14, p > .05) and the location estimates (r = –.04, p > .05). Our hypothesis regarding social desirability and discrimination was not supported. To test whether social desirability interacted with subtlety as predicted, a series of moderated regressions was conducted. In the first step, both social desirability and subtlety were entered. In the second step, the product term of those two variables was entered. If the R2 at the second stage was significantly larger than the R2 from the first stage, then the interaction between the two variables was deemed significant (Aiken & West, 1991). As predicted, there was a statistically significant interaction between social desirability ratings and subtlety in predicting the location parameter estimates. The main effect terms of subtlety and social desirability resulted in an R2 of .02; the addition of the interaction term increased the R2 from .02 to .13, test of increment F(1, 92) = 11.61, p < .01. For items near the mean of item subtlety, there was an essentially flat relationship between social desirability and b location parameter estimates. For items below the mean in item subtlety (i.e., transparent items), there was a negative relation between social desirability and the b location parameter estimate, as illustrated in Figure 1. For items above the mean in subtlety, there was a positive relation between social desirability and the b location parameter estimate. Finally, the same interaction between social desirability and subtlety was not statistically significant in predicting the discrimination parameter estimates: test of increment in R2, F(1, 92) = 0.64, p > .05, ΔR2 = .01.
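The moderated regression procedure above (Aiken & West, 1991) can be sketched directly: fit the main-effects model, add the product term, and test the R2 increment with an F statistic. The data below are simulated with a built-in interaction, so the variable names and resulting numbers are illustrative, not the study's.

```python
import numpy as np

def delta_r2_test(y, x1, x2):
    """Moderated regression sketch: compare R^2 of the main-effects model
    with R^2 after adding the x1*x2 product term, and test the increment
    with an F statistic."""
    n = len(y)

    def r_squared(cols):
        X = np.column_stack([np.ones(n)] + cols)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

    r2_main = r_squared([x1, x2])
    r2_full = r_squared([x1, x2, x1 * x2])
    df1, df2 = 1, n - 4              # one added term; 4 parameters in full model
    F = ((r2_full - r2_main) / df1) / ((1.0 - r2_full) / df2)
    return r2_main, r2_full, F, df2

# Simulated item-level data with a built-in interaction (values invented).
rng = np.random.default_rng(42)
n = 96                               # matches the item count, by analogy
desirability = rng.normal(size=n)
subtlety = rng.normal(size=n)
location = 0.5 * desirability * subtlety + rng.normal(scale=0.3, size=n)
r2_main, r2_full, F, df2 = delta_r2_test(location, desirability, subtlety)
```

With 96 items and one added term, the increment test has (1, 92) degrees of freedom, the same df reported for the study's increment tests.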
Discussion

The results of this study supported one of our two hypotheses regarding correlates of the discrimination parameter. As predicted, the discrimination parameter was related to the item subtlety ratings of personality items. Contrary to expectations, word usage frequency and miscomprehension rates were unrelated to discrimination. It could be that
Figure 1. Interaction between social desirability and subtlety in predicting the location parameter estimate. [Line graph of the b location parameter estimate (vertical axis, approximately 0 to –1.4) against social desirability (low, average, high), with separate lines for low, average, and high subtlety.]
there is a nonlinear relation between comprehension and discrimination in that once words are above a certain minimal level of standard comprehension, respondents should have few problems. Exploratory nonlinear regressions, however, did not support this speculation. Clearly, there are other variables that affect item discrimination that were not assessed in this study. However, these results suggest that items that are clearly linked to the underlying trait dimension tend to have higher discrimination parameters than subtle items. None of the variables that we assessed in this study were correlated directly with the location parameter estimates. However, as predicted, social desirability and subtlety interacted to influence the location parameters. When subtlety was low, social desirability was related negatively to the location parameter estimates. Thus, when respondents could correctly discern the underlying construct being measured by the item, the probability of choosing a more positive option was positively related to social desirability. That there was no similar interaction for the discrimination parameter suggests that social desirability does not destroy the relationship between the underlying personality trait, θ, and item endorsement. If this were the case, transparent items high in social desirability would have had ORFs that were relatively flat. An additional comment about the data presented in this study needs to be made. The high frequency of self-admitted misunderstood adjectives was consistent with Graziano et al.’s (1998) results using the ACL. Eleven of the
96 adjectives used in this study were misunderstood by at least 10% of this sample. Given that participants rarely skipped items, it is clear that many respondents answered items even though they did not clearly understand them. Even though this level of “garbage” data did not appear to hinder item parameter estimation, developers of psychological tests are urged to use simple, precise language in item construction.

In conclusion, this study helped develop psychological interpretations of item parameter estimates for personality items by establishing correlates of such parameter estimates. Future research should be conducted on personality scales other than adjective checklists to see whether the same pattern of results holds. Also, other IRT models should be applied to personality scales to determine whether analogous parameters have the same correlates for different models.
References

Aiken, L. S., & West, S. G. (1991). Multiple regression: Testing and interpreting interactions. Thousand Oaks, CA: Sage.
Baxter, J. C., & Morris, K. L. (1968). Item ambiguity and item discrimination in the MMPI. Journal of Consulting and Clinical Psychology, 32, 309-313.
Broen, W. E. (1960). Ambiguity and discriminating power in personality inventories. Journal of Consulting Psychology, 24, 174-179.
Burkhart, B. R., Gynther, M. D., & Fromuth, M. E. (1980). The relative predictive validity of subtle versus obvious items on the MMPI depression scale. Journal of Clinical Psychology, 36, 748-751.
Duff, F. L. (1965). Item subtlety in personality inventories. Journal of Consulting Psychology, 29, 565-570.
Edwards, A. L. (1953). The relationship between the judged desirability of a trait and the probability that the trait will be endorsed. Journal of Applied Psychology, 37, 90-93.
Edwards, A. L. (1957). The social desirability variable in personality assessment and research. New York: Dryden.
Embretson, S. E. (1995). A measurement model for linking individual learning to processes and knowledge: Application to mathematical reasoning. Journal of Educational Measurement, 32, 277-294.
Embretson, S. E. (1998). A cognitive design system approach to generating valid tests: Application to abstract reasoning. Psychological Methods, 3, 380-396.
Embretson, S. E., & Wetzel, C. D. (1987). Component latent trait models for paragraph comprehension tests. Applied Psychological Measurement, 11, 175-193.
Eysenck, H. J., & Eysenck, S.B.G. (1975). Eysenck Personality Questionnaire. San Diego, CA: EdITS/Educational and Industrial Testing Service.
Fraley, R. C., Waller, N. G., & Brennan, K. A. (2000). An item response theory analysis of self-report measures of adult attachment. Journal of Personality and Social Psychology, 78, 350-365.
Goldberg, L. R. (1963). A model of item ambiguity in personality assessment. Educational and Psychological Measurement, 23, 467-492.
Goldberg, L. R. (1992). The development of markers for the Big-Five factor structure. Psychological Assessment, 4, 26-42.
Graziano, W. G., Jensen-Campbell, L. A., Steele, R. G., & Hair, E. C. (1998). Unknown words in self-reported personality: Lethargic and provincial in Texas. Personality and Social Psychology Bulletin, 24, 893-905.
Harris, J. G., & Baxter, J. C. (1965). Ambiguity in the MMPI. Journal of Consulting Psychology, 29, 112-118.
Heine, R. W. (1971). Preface. In D. W. Fiske (Ed.), Measuring the concepts of personality (pp. v-vii). Chicago: Aldine.
Levin, J., & Montag, I. (1989). The bipolarity of the Comrey Personality Scales: A confirmatory factor analysis. Personality and Individual Differences, 10, 1115-1120.
McCrae, R. R., & Costa, P. T. (1987). Validation of the five-factor model of personality across instruments and observers. Journal of Personality and Social Psychology, 52, 81-90.
Muraki, E. (1990). Fitting a polytomous item response model to Likert-type data. Applied Psychological Measurement, 14, 59-71.
Muraki, E., & Bock, R. D. (1996). PARSCALE 3.0 [Computer software]. Chicago: Scientific Software International.
Reise, S. P., & Waller, N. G. (1990). Fitting the two-parameter model to personality data. Applied Psychological Measurement, 14, 45-58.
Reise, S. P., & Waller, N. G. (1993). Traitedness and the assessment of response pattern scalability. Journal of Personality and Social Psychology, 54, 143-151.
Rogers, T. B. (1971). The process of responding to personality items: Some issues, a theory, some research. Multivariate Behavioral Research Monographs, 6(No. 2).
Rorer, L. G. (1965). The great response-style myth. Psychological Bulletin, 63, 129-156.
Roskam, E. E. (1985). Current issues in item response theory. In E. E. Roskam (Ed.), Measurement and personality assessment (pp. 3-19). Amsterdam, the Netherlands: Elsevier Science.
Schmitt, N., & Stults, D. M. (1985). Factors defined by negatively keyed items: The result of careless respondents? Applied Psychological Measurement, 9, 367-373.
Spector, P. E., Van Katwyk, P. T., Brannick, M. T., & Chen, P. Y. (1997). When two factors don’t reflect two constructs: How item characteristics can produce artifactual factors. Journal of Management, 23, 659-677.
Thissen, D., Steinberg, L., Pyszczynski, T., & Greenberg, J. (1983). An item response theory for personality and attitude scales: Item analysis using restricted factor analysis. Applied Psychological Measurement, 7, 211-226.
Trimble, D. E. (1997). The Religious Orientation Scale: Review and meta-analysis of social desirability effects. Educational and Psychological Measurement, 57, 970-986.
Wiener, D. N. (1948). Subtle and obvious keys for the Minnesota Multiphasic Personality Inventory. Journal of Consulting Psychology, 12, 164-170.
Zeno, S. M., Ivens, S. H., Millard, R. T., & Duvvuri, R. (1995). The educator’s word frequency guide. Brewster, NY: Touchstone Applied Science.
Zickar, M. J., & Drasgow, F. (1996). Detecting faking on a personality instrument using appropriateness measurement. Applied Psychological Measurement, 20, 71-87.
Zickar, M. J., & Robie, C. (1999). Modeling faking good on personality items: An item-level analysis. Journal of Applied Psychology, 84, 551-563.
Zumbo, B. D., Pope, G. A., Watson, J. E., & Hubley, A. M. (1997). An empirical test of Roskam’s conjecture about the interpretation of an ICC parameter in personality inventories. Educational and Psychological Measurement, 57, 963-969.