Original Article
Pitfalls and Challenges in Constructing Short Forms of Cognitive Ability Measures

Stefan Schipolowski,1 Ulrich Schroeders,2 and Oliver Wilhelm3

1 Institute for Educational Quality Improvement, Humboldt-Universität zu Berlin, Germany
2 Department of Educational Science, University of Bamberg, Germany
3 Department of Psychology and Education, Ulm University, Germany

Journal of Individual Differences, 2014, Vol. 35(4), 190–200. DOI: 10.1027/1614-0001/a000134
Abstract. Especially in survey research and large-scale assessment there is a growing interest in short scales for the cost-efficient measurement of psychological constructs. However, only relatively few standardized short forms are available for the measurement of cognitive abilities. In this article we point out pitfalls and challenges typically encountered in the construction of cognitive short forms. First, we discuss item selection strategies, the analysis of binary response data, the problem of floor and ceiling effects, and issues related to measurement precision and validity. We subsequently illustrate these challenges, and how to deal with them, based on an empirical example, the development of short forms for the measurement of crystallized intelligence. Scale shortening had only small effects on associations with covariates. Even for an ultra-short six-item scale, a unidimensional measurement model showed excellent fit and yielded acceptable reliability. However, measurement precision on the individual level was very low and the short forms were more likely to produce skewed score distributions in ability-restricted subpopulations. We conclude that short scales may serve as proxies for cognitive abilities in typical research settings, but their use for decisions on the individual level should be discouraged in most cases.

Keywords: short forms, cognitive abilities, item selection, binary response data, measurement precision
In psychological assessment there is currently a trend toward developing and using short scales, both for academic research and for applied purposes (e.g., Kemper, Brähler, & Zenger, 2013). This trend is primarily driven by efficiency considerations and an increased interest in psychological constructs in survey research. An example is the German Socio-Economic Panel (SOEP; e.g., Wagner, Frick, & Schupp, 2007), a large-scale longitudinal study based on representative samples of the general population in Germany that mainly provides socioeconomic indicators, but also includes a limited number of psychological scales. In large-scale assessments, administration time is very costly, underlining the need for time-efficient and nevertheless valid measurement instruments. Rising cost pressure is also a pressing concern in health care, personnel management, and educational monitoring. In all of these contexts, cognitive abilities are highly relevant as explanatory variables (Grabner & Stern, 2010) and as predictors of important outcomes such as health (Gottfredson & Deary, 2004), job success (Ones, Viswesvaran, & Dilchert, 2004; Strenze, 2007), and educational achievement (Deary, Strand, Smith, & Fernandes, 2007). Despite the high predictive power of cognitive variables, the development of short scales has so far mainly focused on personality assessment and clinical psychology. By comparison, well-validated short scales for the assessment of cognitive abilities are rare (Kruyen, Emons, & Sijtsma, 2013). One possible explanation is that
the development of cognitive short scales poses specific challenges that are associated with measuring maximal performance instead of typical behavior (Cronbach, 1949). In the first part of this article we point out some of the pitfalls and challenges in developing short scales for cognitive ability assessment. Specifically, we discuss item selection strategies, methods for the analysis of binary item responses, measurement precision, floor and ceiling effects, and ways to demonstrate validity. In the second part, we illustrate these challenges with an example – the development of short scales for the assessment of crystallized intelligence. Throughout this article, we assume that short form development is conducted by selecting items from an existing scale.
Item Selection Strategies

Informed item selection strategies rely on at least one of two sources of information: test or item statistics, and experts' judgments of item content (Kruyen et al., 2013). Relevant statistical information on the item level includes item difficulties, item discriminations, and bivariate relations with criteria. On the test level, important statistics are the distribution of the scale score in the target population, estimates of measurement precision (reliability), analyses reflecting the internal structure of the test
(dimensionality), and correlations of the measured construct with other constructs in the light of theoretical expectations (validity). By contrast, the correspondence of item content with the construct that is to be measured has to be judged by experts, producing more or less subjective ratings of the items' construct representation, that is, the degree to which each item reflects the construct of interest (Embretson, 1998).

Most authors have argued that in short scale development, one should consider both statistical information and content ratings (Smith, Combs, & Pearson, 2012; Smith, McCarthy, & Anderson, 2000; Widaman, Little, Preacher, & Sawalani, 2011). Specifically, short form developers should strive to preserve the content breadth of the target construct. However, for ability assessment an approach that relies solely on statistical information may be appropriate if the cognitive ability in question is narrow (e.g., figural reasoning) and item selection is based on a large pool of items generated with a prespecified algorithm (e.g., Freund, Hofer, & Holling, 2008). Content considerations are still relevant in the sense that the rationale for item generation has to be grounded in substantive theory.

Scale development will usually focus on more complex, multidimensional constructs such as memory or crystallized intelligence. Consensual theories of intelligence structure describe cognitive abilities at different levels of generality or breadth. For instance, the factor crystallized intelligence (gc) is situated at the second stratum in Carroll's (1993) three-stratum model, with several more narrow factors representing more specific crystallized abilities on the first stratum underneath gc. Hierarchical structures of cognitive abilities can be preserved by constructing a short scale that allows reporting scores for both the narrow abilities and the superordinate broad ability factor. This entails that each lower-order factor has to be considered separately in the item selection process and in the evaluation of crucial scale features such as construct representation and measurement precision (Smith et al., 2000). Afterwards, a multidimensional measurement model including all subscales and preserving the internal structure of the original measure should be established using adequate analytical techniques (Brunner, Nagy, & Wilhelm, 2012).

Alternatively, given a close correlation between the narrow ability factors, it is also possible to develop a short scale based on a unidimensional measurement model that represents only the broad ability factor. The advantage of this approach is that it results in shorter measures. However, items representing different narrow factors should not simply be pooled for item selection. Instead, the contents of the narrow factors should be represented proportionately to their contribution to the broad ability factor in the original scale to minimize the risk of a shift in the meaning of the construct (Smith et al., 2000).
Analyses of Binary Response Data

In most cases ability test items are scored dichotomously, resulting in binary item responses that reflect individual
differences on a continuous latent trait. Since binary response data do not follow a multivariate normal distribution, the commonly used maximum likelihood (ML) estimator can lead to biased estimates and misleading results (Stucky, Gottfredson, & Panter, 2012). Specific estimators have been developed for factor analysis with binary variables, most importantly weighted least squares (WLS) estimators (Flora & Curran, 2004). Beauducel and Herzberg (2006) showed that the weighted least squares means and variance adjusted estimator (WLSMV; Muthén, du Toit, & Spisic, 1997) provided – under realistic conditions, that is, with moderate sample sizes (N = 200) and moderate factor loadings (about .50) – more accurate estimates of model fit and item loadings for variables with two or three categories than a classic ML estimator. Alternatively, item response theory (IRT; e.g., Reckase, 2009) can be used to analyze binary response data. An advantage of IRT is the possibility to handle complex test designs that result in an incomplete variance-covariance matrix (Frey, Hartig, & Rupp, 2009). Also, measurement error is estimated for each person parameter as a function of item information. A disadvantage of IRT in comparison to confirmatory factor analysis (CFA) can be seen in the lack of good omnibus tests of model fit.
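To make the link between the factor-analytic and the IRT treatment of binary items explicit, the underlying-variable formulation and its normal-ogive reparameterization can be written as follows (a standard identity added here for orientation; the notation is ours):

```latex
% Underlying-variable formulation of a unidimensional factor model for binary items
% (standardized latent responses), and its normal-ogive (2PL) reparameterization.
\[
x^{*}_{ij} = \lambda_j \theta_i + \varepsilon_{ij}, \qquad
x_{ij} = \begin{cases} 1 & \text{if } x^{*}_{ij} \ge \tau_j \\ 0 & \text{otherwise} \end{cases}
\qquad\Longrightarrow\qquad
P(x_{ij}=1 \mid \theta_i) = \Phi\!\left(\frac{\lambda_j \theta_i - \tau_j}{\sqrt{1-\lambda_j^{2}}}\right)
\]
% Discrimination and difficulty of the normal-ogive IRT model:
% a_j = lambda_j / sqrt(1 - lambda_j^2),  b_j = tau_j / lambda_j.
```

Thus, standardized loadings and thresholds from an item factor analysis translate directly into IRT discrimination and difficulty parameters, which is why both frameworks are discussed interchangeably here.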
Measurement Precision

Measurement precision can be defined on the group level and on the individual level. The former is referred to as reliability and conceptualized as the ratio of trait variance to observed score variance; as such, it is a group characteristic (Thompson & Vacha-Haase, 2000). This concept of reliability is related to, but not equivalent to, measurement precision on the individual level (Mellenbergh, 1996), which is in turn crucial for the validity of inferences about an individual on the basis of his or her scale score (e.g., the conclusion that an applicant is not suited for a specific job).

Short-scale developers typically place strong emphasis on Cronbach's (1951) coefficient α. However, α is inadequate for the characterization of measurement precision and its informative value is often vastly overestimated (Cortina, 1993; Kruyen, Emons, & Sijtsma, 2012; Schmitt, 1996; Sijtsma, 2009a). First, it underestimates reliability in most cases, as it is based on the assumption that all items measure the underlying trait with equal precision (essentially τ-equivalent measurement model). Second, despite its common designation as "internal consistency," α is not an index of homogeneity or unidimensionality (Schmitt, 1996); item sets reflecting one or multiple latent dimensions can yield the same α value (Sijtsma, 2009a). Therefore, the question of dimensionality should be addressed using factor analytic techniques or IRT methods. Third, high α values are not synonymous with high measurement precision on the individual level. Even though many short scales exhibit satisfactory reliability in terms of α, simulation studies showed that decision quality in applied settings is alarmingly low when using short scales, even if they consist of high-quality items (Emons, Sijtsma, & Meijer, 2007; Kruyen et al., 2012).
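For reference, coefficient α for a k-item scale is

```latex
\[
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{j=1}^{k}\sigma^{2}_{j}}{\sigma^{2}_{X}}\right)
\]
% k = number of items, sigma^2_j = variance of item j, sigma^2_X = variance of the sum score X.
```

where σ²_j is the variance of item j and σ²_X the variance of the sum score X; for dichotomous items this expression reduces to KR-20. α equals the reliability of X only under (essentially) τ-equivalent items; for congeneric items with uncorrelated errors it is a lower bound.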
To characterize reliability on the group level, researchers should consider alternative coefficients such as ω (McDonald, 1999), which more realistically assumes a τ-congeneric measurement model, that is, different items measure the same latent trait with different precision. It can be computed directly from the standardized factor loadings of the measurement model. Given a psychometrically homogeneous set of items, ω is the square of the correlation between the test score X and the common factor F (McDonald, 1999, p. 89). For categorical data, α and ω can differ substantially depending on the type of correlation matrix used in their estimation (Gadermann, Guhn, & Zumbo, 2012). Polychoric and tetrachoric correlation coefficients are based on the assumption of unobserved continuous variables underlying the observed item responses. If the reliability of binary items is estimated with tetrachorics (or a factor model based on tetrachorics), it refers to these hypothetical continuous variables, ignoring the item difficulties of the manifest dichotomous items. If the reliability of the manifest test score is of interest, for instance, for the calculation of confidence intervals for individual scores, a Pearson product-moment correlation matrix should be used instead to take item difficulties into account. Alternatively, the scale reliability ρ (Dimitrov, 2003; Raykov, Dimitrov, & Asparouhov, 2010) was specifically developed for binary response data and is also suitable for the estimation of confidence intervals.

The most common way to estimate measurement precision on the individual level, that is, for a given score, is to calculate a confidence interval (CI) for the true score on the basis of the standard error of measurement. Based on simulated data, Sijtsma (2009b) showed that as items are dropped from an essentially τ-equivalent scale, CIs become narrower, but the scale length (i.e., the difference between the maximum and the minimum possible score) shrinks even more quickly than the CI. As a consequence, "CIs for the shortened test may encompass a larger proportion of the scale length than those for longer tests, thus leaving less room for test scores to differ significantly" (Kruyen et al., 2013, p. 227). The recommendation is therefore to consider CIs in relation to the scale length when characterizing measurement precision on the individual level (Kruyen et al., 2013; Sijtsma, 2009b).
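For orientation, ω under a unidimensional congeneric model with standardized loadings λ_j (and uncorrelated residuals), together with the confidence interval referred to above, can be written as

```latex
\[
\omega = \frac{\bigl(\sum_{j=1}^{k}\lambda_j\bigr)^{2}}
              {\bigl(\sum_{j=1}^{k}\lambda_j\bigr)^{2} + \sum_{j=1}^{k}\bigl(1-\lambda_j^{2}\bigr)},
\qquad
SE = S_X\,\sqrt{1-\rho_{XX'}}, \qquad
\mathrm{CI}_{95\%} = X_i \pm 1.96 \cdot SE
\]
% S_X = standard deviation of the observed score distribution, rho_XX' = score reliability.
```

Note that when the λ_j are estimated from tetrachoric correlations, ω refers to the latent response variables rather than to the manifest sum score, as discussed above.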
Floor and Ceiling Effects

As short scales are comprised of relatively few items, they are especially prone to floor and ceiling effects and skewed score distributions. That is, shortening a scale entails the risk that a considerable percentage of the relevant population receives the highest or lowest score on the scale, making it impossible to differentiate between those individuals receiving the same extreme score. Skewed or truncated
score distributions can lead to substantially biased estimates of statistical parameters such as means, variances, and covariances (Wang, Zang, McArdle, & Salthouse, 2008; Yanagihara & Yuan, 2005). Although special models are available to handle censored data, they are associated with additional assumptions that may be violated in a given data set (Wang et al., 2008).

For three reasons, cognitive short scales require special attention with regard to floor and ceiling effects: First, in contrast to rating-scale items that are often used for personality assessment, cognitive items are typically scored dichotomously (correct vs. incorrect). Compared to items with three or more response categories, dichotomous items allow, ceteris paribus, fewer discriminations. Second, cognitive abilities have a large variance in the general population, and subpopulations can differ substantially. For instance, Wilhelm, Schroeders, and Schipolowski (in press) found a mean difference of two standard deviations on a test of crystallized intelligence between representative samples of ninth-graders enrolled in the German academic-track Gymnasium versus students in the vocational-track Hauptschule. Third, if items with a closed response format are used, the score that constitutes the floor depends on the guessing probability of the items.

Hence, the score distribution should be explicitly examined during scale construction. Furthermore, scale developers should consider the range restriction of the short scale in relevant subpopulations. For instance, even if the score of an ultra-short scale is approximately normally distributed in the general population, it may be substantially skewed in subpopulations with higher education, low socioeconomic status, or higher age. As researchers are frequently interested in investigating specific subpopulations, the use of an ultra-short ability scale can severely complicate or even preclude secondary analyses. If the use of an ultra-short scale is indispensable, researchers may consider computerized adaptive testing (CAT) to ensure that item difficulties optimally match the ability of the individual test-taker. However, besides technical equipment, CAT also requires a large item pool to select items of appropriate difficulty on the fly and more complex statistical models to estimate person parameters (see Van der Linden & Glas, 2013, for a detailed discussion).
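As a rough check of the third point above, the floor implied by guessing can be approximated with a binomial model; a minimal sketch, assuming independent guessing at chance level on four-option items (the function name is ours):

```python
from scipy.stats import binom

def chance_floor(n_items: int, n_options: int = 4, quantile: float = 0.95) -> int:
    """Score that a pure guesser stays at or below with the given probability;
    scores in this range carry little information about ability."""
    return int(binom.ppf(quantile, n_items, 1.0 / n_options))

# Shorter scales leave proportionally less room above the guessing floor.
for k in (6, 9, 16, 64):
    print(k, chance_floor(k), f"{chance_floor(k) / k:.0%} of the maximum score")
```

With six four-option items, for example, a guesser reaches up to half of the maximum score with high probability, so the effective measurement range of an ultra-short multiple-choice scale is considerably smaller than its nominal score range.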
Validity Issues

Although short scales can be derived from existing, longer scales, validity evidence obtained for the original scale cannot be generalized to a shorter version of the same scale (Smith et al., 2000, 2012). Researchers often interpret a close correlation between the scores of the short scale and the original scale as validity evidence. This practice is flawed as the scores do not represent independent random variables. Likewise, the correlation between the short scale and a score based on the remaining items of the original scale is not informative about the quality of the short scale because a weak correlation may simply indicate that the
short-scale developer was successful in removing items of poor psychometric quality. To demonstrate that the short scale measures the intended construct, a latent variable approach should be used to estimate the associations of the short form with independent measures of the same construct (e.g., a parallel form of the original scale). Additionally, associations with other constructs and criteria should be consistent with expectations derived from substantive theory and previous empirical studies of the respective constructs. In all cases, the set of criteria should be heterogeneous, that is, it should include criteria supposedly strongly related to the short scale and criteria for which one expects a weak or even negative association with the short scale (Widaman et al., 2011).
Empirical Demonstration of Scale Shortening

Subsequently, we illustrate the above-mentioned issues in the construction of short scales for cognitive measures with an example, the assessment of crystallized intelligence in a student population. To demonstrate the effects of test shortening and to give advice on how to check for detrimental effects, we compare short scales of different lengths to one another and to the original scale.
Measurement Intention

Crystallized intelligence (gc) is a prominent cognitive ability factor in consensual theories of intelligence structure (e.g., Carroll, 1993; Horn & Noll, 1997). Compared to fluid intelligence (gf), which represents individual differences in decontextualized reasoning, gc reflects the influences of learning and acculturation. In accordance with this definition, Horn and Noll (1997, p. 69) described gc as "acculturated knowledge" and suggested measuring it with tasks "indicating breadth and depth of the knowledge of the dominant culture." Consequently, a comprehensive gc assessment should administer declarative knowledge items capturing a broad variety of knowledge domains (Ackerman, 2000). In practice, however, gc is mainly measured with language indicators such as vocabulary tests. Although there is considerable overlap between verbal ability and declarative knowledge, it seems worthwhile to separate both constructs (Schipolowski, Wilhelm, & Schroeders, in press).
Sample, Design, and Measures

The following analyses are based on a sample of N = 2,068 students (47.8% female) in upper secondary education attending grade 11 (n = 1,124) and grade 12 (n = 944). All
students were enrolled in academic-track schools (58.6% Gymnasium, 33.8% Berufliches Gymnasium, 7.6% Integrierte Gesamtschule). Mean age was 17.8 years (SD = 0.97; n = 1,952 students). A multiple-matrix booklet design (Frey et al., 2009) was used to measure crystallized intelligence and – for validation purposes in the present context – fluid intelligence and vocabulary. The booklet design was implemented to keep the individual workload within acceptable limits, that is, each student was administered only a subset of the test items. The complete test session took 90 min, including task instructions and a short questionnaire asking for the students' age, gender, and selected grades from the most recent school certificate.

Fluid and crystallized intelligence were assessed with the Berlin Test of Fluid and Crystallized Intelligence for Grades 11 and Above (BEFKI 11+; Schipolowski, Wilhelm, & Schroeders, 2014). The original gc scale consisted of 64 multiple-choice items assessing declarative knowledge in 16 content domains (i.e., four items per domain). The contents cover the natural sciences (physics, chemistry, biology, geography, medicine, technology), the humanities (literature, art, music, religion, philosophy), and the social sciences (politics, law, economy, finance, history). Verbal, numerical, and figural fluid intelligence was assessed with three 16-item reasoning scales. The study also contained an independent parallel form of the gc test with the same number of items; a subsample completed both test forms, and the order of administration was balanced. Vocabulary was measured with 42 items of the WST (Wortschatztest [vocabulary test]; Schmidt & Metzler, 1992).
Data Preparation and Descriptive Statistics

Ideally, item selection and evaluation of measurement instruments should rely on independent samples. Therefore, the best way to evaluate the psychometric properties of a newly developed short form is to conduct an additional study with an independent sample of the target population (Smith et al., 2000, 2012). Since this was not the case for the present data, we divided the sample described above into two independent subsamples. More precisely, each case in the data set was randomly assigned to one of two subsamples. Subsample 1 was subsequently used for item selection, whereas all information on the properties of the scales given in Tables 2–4 and Figure 1 is based on subsample 2.

Responses on the ability measures were recoded as either "correctly answered" or "not correctly answered." To avoid a skewed score distribution, 15 vocabulary items answered correctly by more than 95% of the sample were excluded from further analyses. For grades, a German rating system was used ranging from 0 points (worst grade) to 15 points (best grade). To account for reference frame effects, grades were z-standardized within classes.
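A minimal sketch of these preparation steps, assuming a data frame with one row per student, item responses already scored 0/1 (NaN where an item was not administered in the booklet design), a class identifier column class_id, and grade columns; all column names are hypothetical:

```python
import pandas as pd

def prepare(df: pd.DataFrame, item_cols: list, grade_cols: list, max_p: float = 0.95) -> pd.DataFrame:
    """Return the retained item columns plus within-class z-standardized grades."""
    # Keep only items that were not answered correctly by more than 95% of the sample.
    keep = [c for c in item_cols if df[c].mean() <= max_p]
    # z-standardize grades within classes to account for reference-frame effects.
    grades_z = df.groupby("class_id")[grade_cols].transform(
        lambda s: (s - s.mean()) / s.std(ddof=1)
    )
    return pd.concat([df[keep], grades_z.add_suffix("_z")], axis=1)
```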
Table 1. Descriptive statistics for the ability scales and school grades

                        Subsample 1 (nS1 = 1,036)          Subsample 2 (nS2 = 1,032)
Scale/covariate         n      M^a     SD^a    Miss        n      M^a     SD^a    Miss
gc                      766    38.88   8.40    .26         763    39.14   8.24    .26
gc (parallel form)      411    38.57   8.32    .60         410    38.94   8.28    .60
Vocabulary^b            341    16.64   3.13    .67         339    16.58   3.22    .67
gf verbal               849     9.24   2.48    .18         845     9.19   2.59    .18
gf numerical            849     8.67   2.76    .18         846     8.67   2.90    .18
gf figural              850     7.80   2.60    .18         846     7.87   2.71    .18
School grades^c
  German                973     8.91   2.39    .06         968     8.88   2.49    .06
  Mathematics           979     8.48   3.11    .06         969     8.34   3.11    .06
  History               908     9.05   2.62    .12         888     9.24   2.55    .14

Notes. Miss = relative frequency of missing values. ^a For the ability scales, M and SD are based on sum scores. ^b 27 items were used after removing extremely easy items (p > .95). ^c Grades were z-standardized within classes; M and SD given here are parameters obtained prior to standardization.
Table 2. Item statistics for all items of the three short scales and selected items of the original scale

                                    Original scale    16-item scale   9-item scale   6-item scale
Item   Content domain   Facet       p       λ         λ               λ              λ
Bio1   Biology          N           .76     .38       .36             –              –
Che1   Chemistry        N           .60     .30       .25             –              –
Geo1   Geography        N           .77     .58       .65             –              –
Med1   Medicine         N           .91     .39       .33             .26            –
Med2   Medicine         N           .67     .47       –               –              .54
Phy1   Physics          N           .79     .36       –               –              .24
Phy2   Physics          N           .44     .42       .41             .39            –
Tec1   Technology       N           .43     .34       .35             .35            –
Art1   Art              H           .43     .47       .49             .46            .40
Lit1   Literature       H           .56     .33       .28             –              –
Mus1   Music            H           .44     .21       .21             –              –
Phi1   Philosophy       H           .68     .43       .43             .45            –
Rel1   Religion         H           .74     .39       –               .40            .40
Rel2   Religion         H           .51     .43       .45             –              –
Fin1   Finance          S           .64     .51       .46             –              –
His1   History          S           .37     .45       .45             .46            .46
Law1   Law              S           .56     .44       .44             .45            .51
Pol1   Politics         S           .78     .40       –               .40            –
Pol2   Politics         S           .66     .42       .38             –              –
Eco1   Economy          S           .84     .52       .54             –              –

Notes. Based on subsample 2 (nS2 = 763). Empty cells indicate that the item was not part of the respective scale. Parameters for the original scale are given only for those items that were selected for one or more of the short scales. N = natural sciences, H = humanities, S = social sciences; p = relative frequency of correct responses; λ = standardized item loading in the unidimensional measurement model (see Table 3, "1F" models).
Descriptive statistics for the ability scales and grades based on subsamples 1 and 2, respectively, are given in Table 1. For the ability tests, missing values were design-related and therefore missing completely at random (MCAR; Rubin, 1976). The WLSMV estimator yields consistent estimates under the MCAR assumption (Asparouhov & Muthén, 2010).
Short-Scale Development

Based on the 64 knowledge items of the original scale in subsample 1, we developed three short scales with 16, 9, and 6 items. For all short scales, we intended to establish a unidimensional measurement model, that is, the scales should be valid indicators of the second-stratum factor gc without providing subscores on more narrow factors.
Table 3. Fit statistics for competing measurement models

Scale            Model   χ²        df      p      CFI    RMSEA   WRMR
Original scale   1F      2,079.2   1,952   .022   .904   .009    1.06
                 3F      2,065.2   1,949   .033   .913   .009    1.05
16-item scale    1F        133.5     104   .027   .940   .019    0.97
                 3F        128.0     101   .036   .945   .019    0.94
9-item scale     1F         31.5      27   .252   .976   .015    0.75
                 3F         29.8      24   .190   .968   .018    0.74
6-item scale     1F         12.4       9   .191   .974   .022    0.73
                 3F         10.1       6   .119   .969   .030    0.66

Notes. Based on subsample 2 (nS2 = 763). 1F = unidimensional measurement model, 3F = three-dimensional measurement model; CFI = Comparative Fit Index; RMSEA = Root Mean Square Error of Approximation; WRMR = Weighted Root Mean Square Residual.
Table 4. Correlations of the gc scales with criteria

                      Original scale          16-item scale           9-item scale            6-item scale
Criterion             ρ (SE)      r (SE)      ρ (SE)      r (SE)      ρ (SE)      r (SE)      ρ (SE)       r (SE)
gc (parallel form)    .97 (.03)   .79 (.02)   .98 (.05)   .73 (.03)   1.00^a      .65 (.03)   1.00^a       .59 (.03)
Vocabulary            .93 (.08)   .65 (.04)   .93 (.09)   .56 (.05)   .90 (.09)   .49 (.05)   .82 (.10)    .45 (.06)
gf verbal             .53 (.05)   .36 (.03)   .56 (.06)   .34 (.04)   .48 (.06)   .26 (.04)   .40 (.07)    .21 (.04)
gf numerical          .58 (.05)   .41 (.04)   .66 (.05)   .41 (.04)   .63 (.06)   .34 (.04)   .57 (.07)    .30 (.04)
gf figural            .45 (.06)   .33 (.05)   .45 (.06)   .29 (.04)   .42 (.07)   .23 (.05)   .40 (.06)    .19 (.04)
School grades
  German              .20 (.04)   .19 (.03)   .17 (.04)   .14 (.03)   .13 (.05)   .08 (.04)   .12 (.06)    .08 (.04)
  Mathematics         .15 (.04)   .14 (.04)   .12 (.05)   .11 (.04)   .13 (.06)   .09 (.04)   .04^b (.06)  .04^b (.04)
  History             .22 (.05)   .22 (.04)   .18 (.05)   .16 (.04)   .19 (.05)   .13 (.04)   .17 (.06)    .11 (.04)

Notes. Based on subsample 2 (nS2 = 1,032); the number of cases varies depending on the criterion (see Table 1). ρ = correlation between latent variables (gc, gf, vocabulary) or between a latent variable and a manifest variable (grades); r = correlation between two manifest variables (gc, gf, vocabulary: sum scores); SE = standard error. ^a Correlation > 1 when freely estimated. ^b Not significant (p > .05).
All scales were developed independently from each other1 under consideration of the following criteria: item content, item difficulty, item discrimination, and normality of the resulting score distribution. First, to retain the content breadth of the original measure, we included items from as many knowledge domains as possible in each short form and ensured that the proportion of items representing each of the superordinate facets – natural sciences, the humanities, and social sciences – was approximately maintained in the short scales. These aims were implemented as follows: For the 16-item scale, one item from each of the 16 domains was selected. For the 9-item scale, three items representing three different content domains were selected from each of the superordinate facets. Likewise, the 6-item scale tapped six different content domains with two items per facet.
Second, items were chosen to cover a broad range of item difficulties to optimize the scales for measurement at different points of the latent ability dimension and to avoid or minimize floor and ceiling effects. In addition, items had to be solved above the level of guessing probability (p > .25). Third, when possible we retained items with high discriminatory power as indicated by their loadings in a unidimensional measurement model of the original scale. Finally, once a complete set of items was selected according to these inclusion criteria, we inspected the distribution of the scale score and revised the scale, if necessary, to minimize skewness.

Item difficulties and discriminations (here: loadings in a unidimensional measurement model) for all items of the short scales are given in Table 2. While scale development was based on subsample 1, all item parameters given in Table 2 (including parameters given for the original scale for comparison purposes) were calculated using the independent validation subsample 2.
1 This entails that while all item sets are subsets of the original item set, smaller item sets are not necessarily subsets of the longer short forms. The reason for this is that the selection of a given item depends not only on the respective item itself, but also on the properties of the item set as a whole (e.g., the resulting score distribution).
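Read as a sketch, the selection logic just described might look as follows; this is our own illustration under simplified assumptions (an item table with columns item, facet, domain, p for the proportion correct, and loading for the standardized loading), not the authors' exact procedure:

```python
import numpy as np
import pandas as pd

def pick_items(items: pd.DataFrame, per_facet: dict, min_p: float = 0.25) -> pd.DataFrame:
    """Select items per facet: above guessing level, best loading per content domain,
    spread across the difficulty range."""
    pool = items[items["p"] > min_p]                      # solvable above guessing probability
    chosen = []
    for facet, k in per_facet.items():                    # e.g., {"N": 2, "H": 2, "S": 2}
        sub = pool[pool["facet"] == facet]
        # keep the best-discriminating item within each content domain
        sub = sub.sort_values("loading", ascending=False).drop_duplicates("domain")
        # then take k items spread across the difficulty range
        sub = sub.sort_values("p").reset_index(drop=True)
        idx = np.linspace(0, len(sub) - 1, num=min(k, len(sub))).round().astype(int)
        chosen.append(sub.iloc[idx])
    return pd.concat(chosen, ignore_index=True)
```

As described above, the resulting score distribution would then be inspected and the selection revised by hand to minimize skewness.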
Figure 1. Skewness of the scale scores depending on gf. Based on subsample 2. The correlation between the original scale and the gf composite was r = .48 in the unrestricted sample.
Factorial Structure of the Scales

To reiterate, our aim was to develop short scales to assess a single construct (gc). The assumption of unidimensionality of the measures has to be tested empirically with a latent variable modeling approach for each scale. A more comprehensive test of the internal structure also includes a comparison of the intended model with other, competing measurement models. In our example, we compared the single-factor model to a multidimensional model based on the rationale for developing the original 64-item scale. Specifically, in the more complex model three correlated factors were specified representing each of the superordinate facets. Models were estimated with Mplus 7.1 (Muthén & Muthén, 1998–2012) using the WLSMV estimator. Fit statistics for all models are given in Table 3; statistics for the original scale were included for comparative purposes. Aside from the chi-square statistic, degrees of freedom, and corresponding probability value, we also provide the Comparative Fit Index (CFI), the root mean square error of approximation (RMSEA), and the weighted root mean square residual (WRMR). According to Yu (2002), the following values indicate good fit for models based on binary indicators and sample sizes of N ≥ 250: CFI ≥ .96, RMSEA ≤ .05, and WRMR ≤ .95. The RMSEA values should be interpreted with caution since the RMSEA has been shown to overestimate model fit when factor loadings are low (i.e., between .30 and .50; Heene, Hilbert, Draxler, Ziegler, & Bühner, 2011).

All measurement models for the short scales showed satisfactory fit; only the CFI values of the 16-item model were slightly below the recommended cut-off values. The multidimensional models provided marginally, but not significantly, better fit than the unidimensional representations (model comparison for the 16-item scale: Δχ²(nS2 = 763, df = 3) = 7.5, p = .06; for the 9-item scale: Δχ²(nS2 = 763, df = 3) = 1.9, p = .59; for the 6-item scale: Δχ²(nS2 = 763, df = 3) = 2.7, p = .44). In line with the marginal differences in fit, the correlations between the latent factors in the three-dimensional models were very high (e.g., for the 16-item model: ρ(N, H) = .88, ρ(N, S) = .94, ρ(S, H) = .77; for abbreviations see Table 2). A comparison to the models of the original scale revealed that item selection led to better fit values in terms of CFI and WRMR due to the exclusion of problematic items. Factor correlations in the multidimensional model were equally strong across all scales. In sum, the results supported the assumption of unidimensionality for the short scales. For the present example, developing unidimensional short scales was less problematic because the difference in fit between the one-factor and the three-factor model was very small for the original scale (although significant; Δχ²(nS2 = 763, df = 3) = 36.1, p < .001). In other cases – especially in personality assessment – many items of the original scale may violate the unidimensionality assumption. Researchers should then consider model fit as an explicit criterion for item selection to ensure that items constituting a scale are psychometrically homogeneous.
Measurement Precision on Group Level and Individual Level

The effects of scale shortening on reliability were investigated with McDonald's ω estimated from the respective measurement model (i.e., based on tetrachorics) and by means of ρ as an indicator of scale score reliability. Reliability for the 16-item model was satisfactory, ω = .76. In the shorter versions, however, reliability dropped noticeably to ω = .63 (9-item model) and ω = .57 (6-item model), respectively. The reliabilities of the manifest scale scores were as follows: ρ = .62 (16-item scale), ρ = .50 (9-item scale), and ρ = .43 (6-item scale). In all cases, reliabilities of the short scales were substantially lower than those obtained for the original 64-item scale (ω = .89, ρ = .82 in the same subsample). Note that test shortening does not necessarily lead to decreases in reliability under realistic conditions (i.e., assuming that items are not strictly parallel; Raykov et al., 2010); however, such decreases are typically observed in practice (Kruyen et al., 2013).

To characterize measurement precision on the individual level, we calculated the 95% CI of an individual score Xi based on the standard error of measurement SE for the respective short scale in subsample 2, where SE = SX · √(1 − ρ) and the 95% CI is calculated as Xi ± 1.96 · SE. SX is the standard deviation of the respective score distribution and ρ is the score reliability. For the 16-item scale, the 95% CI was 6.94 points wide (Xi ± 3.47),
which equals 43% of the total scale length. The 95% CIs were smaller for the 9-item scale (Xi ± 2.53) and the 6-item scale (Xi ± 2.10). However, these scales were also shorter: The 95% CI covered 56% of the complete 9-item scale length and even 70% of the 6-item scale length. By comparison, the 95% CI was relatively large for the original 64-item scale (Xi ± 6.85), but covered only 21% of its length. This means that the original scale offers far more precision for discriminating between individuals than the short scales considered here.
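Assuming ω was obtained with the usual congeneric formula from the standardized loadings, the reported figures can be reproduced approximately from the values given above; a minimal sketch:

```python
import numpy as np

def omega(loadings):
    """McDonald's omega from standardized loadings of a unidimensional model."""
    lam = np.asarray(loadings, dtype=float)
    return lam.sum() ** 2 / (lam.sum() ** 2 + (1.0 - lam ** 2).sum())

# Standardized loadings of the 6-item scale (Table 2) give omega of about .57.
print(round(omega([.54, .24, .40, .40, .46, .51]), 2))

# 95% CI width as a share of the scale length (0 to number of items),
# using the CI half-widths reported above for the 16-, 9-, 6-, and 64-item scales.
for half_width, n_items in [(3.47, 16), (2.53, 9), (2.10, 6), (6.85, 64)]:
    print(n_items, f"{2 * half_width / n_items:.0%}")
```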
Score Distribution

Reducing the number of items on a scale leads to an increased risk of truncating the score distribution. The risk that the respective scale score is not normally distributed is even higher in ability-restricted subpopulations such as gifted children. This effect is illustrated in Figure 1 for the skewness of the sum scores, which is sensitive to floor and ceiling effects. In the graph, the unrestricted sample is compared with subpopulations that have increasingly higher fluid intelligence scores. Specifically, we calculated Z scores (M = 100, SD = 10) based on a gf composite (sum score of all 48 gf items) and defined subpopulations that achieved a minimum gf score, resulting in 13 different subpopulations. The least restricted subpopulation comprised all students with at least average fluid intelligence (Z score of 100 or above) and the most restricted subpopulation was constituted by those students who achieved a Z score of 112 or above. Skewness of the score distribution was then calculated for each of the short scales in each subpopulation.

In the unrestricted sample, skewness of the scale score was minimal for all short scales. This is not surprising, as minimal skewness in the unrestricted population was an explicit criterion for item selection in subsample 1. With increasing ability restriction in terms of gf, the score distributions became more skewed. However, this effect was more pronounced for the shorter scales: Skewness of the 16-item scale was consistently lower than for the 6-item scale. A thorough investigation of score distributions should also consider other distribution parameters (i.e., mean, standard deviation, kurtosis) and frequencies in the target population and relevant subpopulations (Schipolowski et al., 2013). The normality assumption should consequently be tested, especially for ultra-short scales (i.e., < 10 items).
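A minimal sketch of this restriction analysis, assuming two aligned arrays per person: gf_z, the gf composite rescaled to M = 100 and SD = 10, and gc_score, the sum score on one of the short scales (both names are ours):

```python
import numpy as np
from scipy.stats import skew

def skewness_by_gf_cutoff(gf_z: np.ndarray, gc_score: np.ndarray, cutoffs=range(100, 113)) -> dict:
    """Skewness of the gc score in increasingly ability-restricted subpopulations."""
    return {c: skew(gc_score[gf_z >= c], nan_policy="omit") for c in cutoffs}
```

Plotting these values against the cutoff for each scale length reproduces the kind of comparison shown in Figure 1.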
Correlations With Other Constructs and Criteria

The psychometric properties investigated so far are necessary, but not sufficient, conditions for the validity of a
measure. Characterizing validity involves the examination of the measure's relationships to other constructs and criteria. The results should be evaluated according to expectations derived from substantive theory and previous empirical findings for the respective construct. For gc, we expect a very close correlation between declarative knowledge and vocabulary (Schipolowski et al., in press), a substantial positive relationship with gf (Horn & Noll, 1997), and significant positive relations to school achievement indicated by grades (Gustafsson & Balke, 1993). As we intended to preserve the psychometric qualities of the original scale, we also expected similar results for the short scales and the original scale.

Correlations of the original scale and the short scales with a parallel version of the original gc scale, vocabulary, reasoning, and grades are provided in Table 4. Correlations are based on manifest variables (r) and latent variables (ρ), respectively.2 Relationships between latent variables were established in item-level CFAs. To relate a school grade to a latent construct, the grade was added as a manifest variable. The correlations obtained for the original scale were consistent with expectations and were accurately reproduced with the 16-item short scale. Based on latent variables, all short scales showed a correlation close to or equal to 1 with the independent parallel form of the original scale. However, correlation estimates for the 9-item and 6-item scales were slightly lower for the majority of the criteria, even when controlling for the increase in unsystematic measurement error with a latent variable approach. For the latent estimates, reducing the number of items also led to a small increase in the standard errors. For instance, the correlation between gc and the mathematics grade dropped from .15 (original scale) to .04 (6-item scale) while the respective standard error increased from .04 to .06. Consequently, using the 6-item scale would lead to the conclusion that there is no statistically significant relationship between the variables, whereas a small positive correlation would be found on the basis of the original scale. In sum, scale shortening had only small effects on correlations with covariates, but an extreme reduction of the number of items compromised statistical precision.
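The gap between the latent (ρ) and manifest (r) estimates in Table 4 follows, to a first approximation, the classical attenuation relation (added here for orientation):

```latex
\[
r_{XY} \approx \rho_{\text{latent}}\,\sqrt{\rho_{XX'}\,\rho_{YY'}}
\]
% rho_XX' and rho_YY' are the score reliabilities of the two manifest measures.
```

where ρ_XX' and ρ_YY' denote the score reliabilities of the two measures; as the gc scale is shortened, its reliability drops and the manifest correlations shrink accordingly, even where the latent association is largely preserved.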
2 The use of the Greek letter ρ in this context indicates a correlation between latent variables and is not to be confused with the scale reliability ρ.

Discussion and Final Remarks

The purpose of this article was to elaborate on the challenges associated with the construction of short scales for the assessment of cognitive abilities and to illustrate these challenges with an example. Our findings support the conclusion that short scales can be valid, efficient indicators of cognitive abilities in typical research settings (i.e., for analyses on the group or population level). In the following we
want to point out some observations that are of particular interest from a psychological assessment perspective.

First, measurement precision on the individual level was very low for all three short scales. Even with 16 items, the 95% confidence interval was very large, covering almost half of the scale's length. Therefore, the short scales provided insufficient precision to differentiate between individuals. This result supports Kruyen et al.'s (2012) conclusion that the use of short scales for decisions on the individual level is rarely appropriate and should be discouraged in most cases.

Second, the use of short scales seems to be less critical for analyses on the group level, for instance, if one is interested in covariances (Sijtsma, 2009b). In our example, the reliability estimate derived from a unidimensional measurement model was substantially affected by scale shortening, but even with only six items reliability was still about .60. At the same time, reliability of the manifest scale score was as low as .43. Extreme shortening also affected the relationships between latent variables: The correlations of the original scale with criteria could be accurately reproduced with the 16-item short scale, but correlation estimates for the 6-item scale were somewhat lower and less reliable. This illustrates that statistical precision is not only dependent on the person sample size, but is also a function of the item sample.

Third, when considering the use of a short scale, one should take all available information into account instead of relying on a single psychometric property such as a reliability coefficient or a close correlation between the short scale and the original scale (Sijtsma, 2009b; Smith et al., 2000). In our illustration, reliability estimates derived from the measurement models were not informative with respect to measurement precision on the individual level, and the fit statistics of the measurement models did not reflect the decrease in reliability observed for shorter scales. We also showed that skewness of the score distribution in restricted subpopulations was especially problematic for shorter scales.

In some respects, our empirical illustration deviated from the recommendations given in the Introduction. Most importantly, to evaluate its psychometric properties a new scale should be administered in its final (abbreviated) form in an independent validation sample (Smith et al., 2012). Furthermore, several recent developments are worth mentioning but could not be pursued in the present article due to space limitations. One such development is the use of sophisticated algorithms to facilitate short scale construction. For instance, Leite, Huang, and Marcoulides (2008) developed a highly flexible ant colony optimization (ACO) algorithm that can be tailored "to account for any number of prespecified qualities simultaneously, such as content balancing, reliability, test information, and relationships with several variables" (Leite et al., 2008, p. 428). Moreover, complex research designs such as the "two-method design" (Graham, Taylor, Olchowski, & Cumsille, 2006) are available to combine the advantages of short scales with the advantages of longer measures. This approach is especially promising to assess a broader range of constructs in survey research.
Acknowledgments

During the preparation of this manuscript, Stefan Schipolowski was a fellow of the International Max Planck Research School "The Life Course: Evolutionary and Ontogenetic Dynamics (LIFE)."
References

Ackerman, P. L. (2000). Domain-specific knowledge as the "dark matter" of adult intelligence: gf/gc, personality and interest correlates. Journal of Gerontology: Psychological Sciences, 55B, 69–84. doi: 10.1093/geronb/55.2.P69
Asparouhov, T., & Muthén, B. (2010). Weighted least squares estimation with missing data (Technical Report). Retrieved from http://www.statmodel.com/download/GstrucMissingRevision.pdf
Beauducel, A., & Herzberg, P. Y. (2006). On the performance of maximum likelihood versus means and variance adjusted weighted least squares estimation in CFA. Structural Equation Modeling: A Multidisciplinary Journal, 13, 186–203. doi: 10.1207/s15328007sem1302_2
Brunner, M., Nagy, G., & Wilhelm, O. (2012). A tutorial on hierarchically structured constructs. Journal of Personality, 80, 796–846. doi: 10.1111/j.1467-6494.2011.00749.x
Carroll, J. B. (1993). Human cognitive abilities: A survey of factor-analytic studies. Cambridge University Press.
Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78, 98–104. doi: 10.1037/0021-9010.78.1.98
Cronbach, L. J. (1949). Essentials of psychological testing. New York, NY: Harper.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297–334. doi: 10.1007/BF02310555
Deary, I. J., Strand, S., Smith, P., & Fernandes, C. (2007). Intelligence and educational achievement. Intelligence, 35, 13–21. doi: 10.1016/j.intell.2006.02.001
Dimitrov, D. M. (2003). Marginal true-score measures and reliability for binary items as a function of their IRT parameters. Applied Psychological Measurement, 27, 440–458. doi: 10.1177/0146621603258786
Embretson, S. (1998). A cognitive design system approach to generating valid tests: Application to abstract reasoning. Psychological Methods, 3, 380–396. doi: 10.1037/1082-989X.3.3.380
Emons, W. H. M., Sijtsma, K., & Meijer, R. R. (2007). On the consistency of individual classification using short scales. Psychological Methods, 12, 105–120. doi: 10.1037/1082-989X.12.1.105
Flora, D. B., & Curran, P. J. (2004). An empirical evaluation of alternative methods of estimation for confirmatory factor analysis with ordinal data. Psychological Methods, 9, 466–491. doi: 10.1037/1082-989X.9.4.466
Freund, P. A., Hofer, S., & Holling, H. (2008). Explaining and controlling for the psychometric properties of computer-generated figural matrix items. Applied Psychological Measurement, 32, 195–210. doi: 10.1177/0146621607306972
Frey, A., Hartig, J., & Rupp, A. A. (2009). An NCME instructional module on booklet designs in large-scale assessments of student achievement: Theory and practice. Educational Measurement: Issues and Practice, 28, 39–53. doi: 10.1111/j.1745-3992.2009.00154.x
Gadermann, A. M., Guhn, M., & Zumbo, B. D. (2012). Estimating ordinal reliability for Likert-type and ordinal item response data: A conceptual, empirical, and practical guide. Practical Assessment, Research & Evaluation, 17, 1–13.
Gottfredson, L. S., & Deary, I. J. (2004). Intelligence predicts health and longevity, but why? Current Directions in Psychological Science, 13, 1–4. doi: 10.1111/j.0963-7214.2004.01301001.x
Grabner, R. H., & Stern, E. (2010). Measuring cognitive ability. In Rat für Sozial- und Wirtschaftsdaten (Eds.), Building on progress: Expanding the research infrastructure for the social, economic, and behavioral sciences (pp. 753–768). Opladen, Germany: Budrich UniPress.
Graham, J. W., Taylor, B. J., Olchowski, A. E., & Cumsille, P. E. (2006). Planned missing data designs in psychological research. Psychological Methods, 11, 323–343. doi: 10.1037/1082-989X.11.4.323
Gustafsson, J. E., & Balke, G. (1993). General and specific abilities as predictors of school achievement. Multivariate Behavioral Research, 28, 407–434. doi: 10.1207/s15327906mbr2804_2
Heene, M., Hilbert, S., Draxler, C., Ziegler, M., & Bühner, M. (2011). Masking misfit in confirmatory factor analysis by increasing unique variances: A cautionary note on the usefulness of cutoff values of fit indices. Psychological Methods, 16, 319–336. doi: 10.1037/a0024917
Horn, J. L., & Noll, J. (1997). Human cognitive capabilities: Gf-Gc theory. In D. P. Flanagan, J. L. Genshaft, & P. L. Harrison (Eds.), Contemporary intellectual assessment: Theories, tests and issues (pp. 53–91). New York, NY: Guilford.
Kemper, C. J., Brähler, E., & Zenger, M. (2013). Psychologische und sozialwissenschaftliche Kurzskalen [Short scales for psychology and the social sciences]. Berlin, Germany: Medizinisch Wissenschaftliche Verlagsgesellschaft.
Kruyen, P. M., Emons, W. H. M., & Sijtsma, K. (2012). Test length and decision quality in personnel selection: When is short too short? International Journal of Testing, 12, 321–344. doi: 10.1080/15305058.2011.643517
Kruyen, P. M., Emons, W. H. M., & Sijtsma, K. (2013). On the shortcomings of shortened tests: A literature review. International Journal of Testing, 13, 223–248. doi: 10.1080/15305058.2012.703734
Leite, W. L., Huang, I.-C., & Marcoulides, G. A. (2008). Item selection for the development of short forms of scales using an ant colony optimization algorithm. Multivariate Behavioral Research, 43, 411–434. doi: 10.1080/00273170802285743
McDonald, R. P. (1999). Test theory: A unified treatment. Mahwah, NJ: Erlbaum.
Mellenbergh, G. J. (1996). Measurement precision in test score and item response models. Psychological Methods, 1, 293–299. doi: 10.1037/1082-989X.1.3.293
Muthén, B., du Toit, S. H. C., & Spisic, D. (1997). Robust inference using weighted least squares and quadratic estimating equations in latent variable modeling with categorical and continuous outcomes (Technical Report). Retrieved from http://www.gseis.ucla.edu/faculty/muthen/articles/Article_075.pdf
Muthén, L. K., & Muthén, B. O. (1998–2012). Mplus user's guide (7th ed.). Los Angeles, CA: Muthén & Muthén.
Ones, D. S., Viswesvaran, C., & Dilchert, S. (2004). Cognitive ability in selection decisions. In O. Wilhelm & R. W. Engle (Eds.), Handbook of understanding and measuring intelligence (pp. 373–392). London, UK: Sage.
Raykov, T., Dimitrov, D. M., & Asparouhov, T. (2010). Evaluation of scale reliability with binary measures using latent variable modeling. Structural Equation Modeling: A Multidisciplinary Journal, 17, 265–279. doi: 10.1080/10705511003659417
Reckase, M. (2009). Multidimensional item response theory. New York, NY: Springer.
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63, 581–592. doi: 10.1093/biomet/63.3.581
Schipolowski, S., Wilhelm, O., & Schroeders, U. (in press). On the nature of crystallized intelligence: The relationship between verbal ability and factual knowledge. Intelligence.
Schipolowski, S., Wilhelm, O., & Schroeders, U. (2014). Berliner Test zur Erfassung fluider und kristalliner Intelligenz ab der 11. Jahrgangsstufe (BEFKI 11+) [Berlin Test of Fluid and Crystallized Intelligence for Grades 11 and Above]. Manuscript in preparation.
Schipolowski, S., Wilhelm, O., Schroeders, U., Kovaleva, A., Kemper, C. J., & Rammstedt, B. (2013). BEFKI GC-K: A short scale for the measurement of crystallized intelligence. Methoden, Daten, Analysen, 7, 153–181. doi: 10.12758/mda.2013.010
Schmidt, K.-H., & Metzler, P. (1992). WST – Wortschatztest. Göttingen, Germany: Hogrefe.
Schmitt, N. (1996). Uses and abuses of coefficient alpha. Psychological Assessment, 8, 350–353. doi: 10.1037/1040-3590.8.4.350
Sijtsma, K. (2009a). On the use, the misuse, and the very limited usefulness of Cronbach's alpha. Psychometrika, 74, 107–120. doi: 10.1007/s11336-008-9101-0
Sijtsma, K. (2009b). Correcting fallacies in validity, reliability, and classification. International Journal of Testing, 9, 167–194. doi: 10.1080/15305050903106883
Smith, G. T., Combs, J. L., & Pearson, C. M. (2012). Brief instruments and short forms. In H. Cooper, P. M. Camic, D. L. Long, A. T. Panter, D. Rindskopf, & K. J. Sher (Eds.), APA handbook of research methods in psychology (Vol. 1, pp. 395–409). Washington, DC: American Psychological Association. doi: 10.1037/13619-021
Smith, G. T., McCarthy, D. M., & Anderson, K. G. (2000). On the sins of short-form development. Psychological Assessment, 12, 102–111. doi: 10.1037/1040-3590.12.1.102
Strenze, T. (2007). Intelligence and socioeconomic success: A meta-analytic review of longitudinal research. Intelligence, 35, 401–426. doi: 10.1016/j.intell.2006.09.004
Stucky, B. D., Gottfredson, N. C., & Panter, A. T. (2012). Item-level factor analysis. In H. Cooper, P. M. Camic, D. L. Long, A. T. Panter, D. Rindskopf, & K. J. Sher (Eds.), APA handbook of research methods in psychology (Vol. 1, pp. 683–697). Washington, DC: American Psychological Association. doi: 10.1037/13619-036
Thompson, B., & Vacha-Haase, T. (2000). Psychometrics is datametrics: The test is not reliable. Educational and Psychological Measurement, 60, 174–195. doi: 10.1177/0013164400602002
Van der Linden, W. J., & Glas, C. A. W. (Eds.). (2013). Elements of adaptive testing. New York, NY: Springer.
Wagner, G. G., Frick, J. R., & Schupp, J. (2007). The German Socio-Economic Panel Study (SOEP) – scope, evolution and enhancements. Schmollers Jahrbuch, 127, 139–169.
Wang, L., Zang, Z., McArdle, J. J., & Salthouse, T. A. (2008). Investigating ceiling effects in longitudinal data analysis. Multivariate Behavioral Research, 43, 476–496. doi: 10.1080/00273170802285941
Widaman, K. F., Little, T. D., Preacher, K. J., & Sawalani, G. M. (2011). On creating and using short forms of scales in secondary research. In K. H. Trzesniewski, M. B. Donnellan, & R. E. Lucas (Eds.), Secondary data analysis: An introduction for psychologists (pp. 39–61). Washington, DC: American Psychological Association. doi: 10.1037/12350-003
Wilhelm, O., Schroeders, U., & Schipolowski, S. (in press). Berliner Test zur Erfassung fluider und kristalliner Intelligenz für die 8. bis 10. Jahrgangsstufe (BEFKI 8–10) [Berlin Test of Fluid and Crystallized Intelligence for Grades 8–10]. Göttingen, Germany: Hogrefe.
Yanagihara, H., & Yuan, K.-H. (2005). Four improved statistics for contrasting means by correcting skewness and kurtosis. British Journal of Mathematical and Statistical Psychology, 58, 209–237. doi: 10.1348/000711005X64060
Yu, C.-Y. (2002). Evaluating cutoff criteria of model fit indices for latent variable models with binary and continuous outcomes (Doctoral dissertation). University of California, Los Angeles, CA. Retrieved from http://statmodel2.com/download/Yudissertation.pdf
Stefan Schipolowski
Humboldt-Universität zu Berlin
Institut zur Qualitätsentwicklung im Bildungswesen (IQB)
10099 Berlin
Germany
E-mail
[email protected]
Date of acceptance: April 29, 2014 Published online: November 21, 2014