
The Latent Trait Modeling of Passage-Based Reading Comprehension Test: Testlet-Based MIRT Approach¹

Jung-Hee Byun ⋅ Yong-Won Lee (Seoul National University)

Byun, Jung-Hee, & Lee, Yong-Won. (2016). The latent trait modeling of passage-based reading comprehension test: Testlet-based MIRT approach. English Language Assessment, 11, 25-46.

Numerous IRT studies have regarded reading comprehension tests as a typical example of testlet-based tests, which are organized in units referred to as testlets, groups of items sharing a common stimulus. Testlet-based tests, however, can cause item responses within the same testlet to be more highly correlated than responses across testlets, and these residual correlations create a testlet effect that is specifically associated with the testlet itself (e.g., prior knowledge about reading passages). This study aims to 1) demonstrate that such an effect exists and results in the multidimensionality of a reading comprehension test administered to 830 Korean high school students and 2) specify the model that can properly address the testlet effect. For this purpose, testlet-based multidimensional IRT models - the bifactor and Testlet Response Theory models - are compared in reference to a unidimensional model. The results confirm the multidimensionality of the test and identify the bifactor model as the preferred one in terms of the discriminating power of the general trait measured by the test and the capacity to accommodate a testlet effect that varies within and between passages.

I. INTRODUCTION

1 The current study is part of a larger research project (J.-H. Byun & Y.-W. Lee, 2016) entitled "Investigating topic familiarity as source of testlet effect in reading tests: Bifactor analysis". In Phase 1 of that project, the focus of investigation was the relationship between the testlet effect and topic familiarity, a potential source of the testlet effect, using the bifactor model, a multidimensional IRT model. In the current study, the major goals are to compare several IRT models, to specify the best-fitting one that properly addresses the testlet effect, and to suggest an advanced psychometric approach to dealing with testlet effects in passage-based reading assessment.


In the literature on Item Response Theory (IRT), a plethora of studies have treated reading comprehension tests as a typical example of testlet-based tests. A testlet, defined as a group of items sharing a common passage (Wainer & Kiely, 1987), has been widely recognized as the unit of the standard testing format for assessing reading comprehension because of its several advantages: it offers test takers a practical way to save the time and energy of processing a long passage only to answer a single item, and it offers test developers a flexible way to assess test takers' comprehension of the passage from various aspects while avoiding the cost of a one-question-per-passage design (Min & He, 2014; Wainer et al., 2007).

Despite its appeal in educational settings, the testlet-based test raises the psychometric issue of violating two fundamental IRT assumptions, local independence and unidimensionality, which stipulate that the responses to two items should be independent of each other for test takers of the same ability and that a single ability should be measured (DeMars, 2006; Jang & Roussos, 2007; Y.-W. Lee, 1998, 2004; Thissen et al., 1989; Rijimen, 2010; Zhang, 2010, as cited in Min & He, 2014; Sireci, Thissen, & Wainer, 1991). When the local independence assumption is violated among some items in a test, local item dependence (LID) arises, indicating that significant residual correlations are likely to remain among items within the same testlet even after the impact of the measurement construct is partialed out (or controlled for) from the item scores (Yen, 1993, as cited in Min & He, 2014). A violation of local independence due to the within-testlet effect not only raises the possibility that the test measures something other than the intended ability but also leads to serious consequences, such as biased item parameter estimates and overestimated reliability.

As a feasible solution to the testlet effect, the IRT framework proposes multidimensional IRT (MIRT) models, under which the test structure is assumed to contain more than a single trait. The framework thus considers multiple dimensions, including dimensions (or factors) uniquely associated with the testlets themselves, in addition to a primary dimension measured by the test as a whole. This makes it necessary to tease the testlet-specific dimensions apart from the general dimension of the total test, to estimate the magnitude of the effect, and to model such testlet effects appropriately when estimating item and person parameters.

In the current study, the following tasks are addressed in particular: identifying local dependence among items sharing a common passage, or a so-called 'testlet', in an EFL (English as a foreign language) reading comprehension test, and specifying the MIRT model that can properly address the testlet effect. For this purpose, two testlet-based MIRT models - the bifactor model (S. Choi, 2012b; DeMars, 2006; Gibbons et al., 2007; Gibbons & Hedeker, 1992; C. Park, 2010; Y. So, 2010) and its nested, more parsimonious Testlet Response Theory (MTRT) model (Li, Bolt, & Fu, 2006; Rijimen, 2010) - are applied. By doing so, this study ultimately investigates an advanced psychometric approach that allows conceptually relevant testlet-specific factors to be taken into account and non-relevant testlet factors to be kept under control in assessing L2 reading, whose construct is innately complex and multidimensional. Applying multidimensional IRT models to testlet-based reading tests will thus contribute to improving the measurement precision of L2 reading comprehension and test authenticity (McNamara, 1996) in measuring language performance.

In brief, the main focus of studies demonstrating the multidimensionality of reading comprehension tests is on fitting several competing factor structures to the data to model test takers' responses to testlet items and on examining the consequences for estimating item and ability parameters when poorly fitting models are applied. Under the MIRT framework, the present study delves into how the testlet effect, a characteristic feature of reading comprehension tests, interacts with the intended ability of reading comprehension. The research questions addressed in this study are as follows:

RQ1. To what extent does the passage-based reading comprehension test used in this study exhibit multidimensionality?
RQ2. To what extent do the MIRT models - the bifactor and Testlet Response Theory (MTRT) models - fit the passage-based reading comprehension test in reference to the unidimensional 2PL model?

II. LITERATURE REVIEW

1. Local Dependence in Testlet-Based Tests

The test format of multiple items based on a single passage has long been recognized as standard practice in reading assessment because of its educational efficiency. It is not only cost-effective for test developers and test takers but also provides a viable assessment context for measuring different aspects of an examinee's comprehension of the passage (Y.-W. Lee, 2004; Min & He, 2014). In the IRT framework, however, this strength of testlet-based tests poses a serious challenge to the two fundamental assumptions of unidimensional IRT - local independence (Embretson & Reise, 2000; Stout, Nandakumar, & Habing, 1996) and unidimensionality; violating them causes inaccurate estimation of examinee abilities and test reliability (Yen, 1993; Wainer & Thissen, 1996) and, as a consequence, may affect the interpretation of test results. According to previous studies (Henning, 1989; Weiss & Yoes, 1991), local independence is defined as no correlation between two items for individuals at the same ability level. That is, two items should have a zero correlation after the impact of the measurement construct is partialed out from the item scores (Yen, 1993, as cited in Y.-W. Lee, 2004). When a significant correlation remains between the residuals of two items, the items are considered locally dependent and to involve a secondary dimension that is not accounted for by the IRT theta (θ) representing the construct measured by the test.

The IRT literature (Breland, Muraki, & Lee, 2001; Yen, 1993) attributes one major source of local dependence to the sharing of a common passage. Min and He (2014) argue that local dependence is almost innate in passage-based reading assessment, and thus test takers' performance on two items sharing the same passage can be highly correlated. In a similar vein, DeMars (2006) claims that, when several items are based on a common passage, test takers with diverse background knowledge of the passage, differential skills specific to the passage, or various other motivational factors concerning the passage may have a differential understanding of the passage and hence differential item response behavior.
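To make the notion of a residual correlation concrete, the short simulation below (an illustrative Python sketch, not part of the original study; the item parameters and the size of the testlet term are made up) generates two items that share a passage-specific trait and shows that their correlation does not vanish once only the general trait is partialed out.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

theta = rng.normal(size=n)   # general reading ability
gamma = rng.normal(size=n)   # passage-specific trait shared by items 1 and 2

def p_correct(theta, a, b, extra=0.0):
    """2PL response probability with an optional testlet term."""
    return 1.0 / (1.0 + np.exp(-(a * (theta - b) + extra)))

# Two items based on the same passage: both are pushed around by gamma
u1 = rng.binomial(1, p_correct(theta, a=1.2, b=0.0, extra=0.8 * gamma))
u2 = rng.binomial(1, p_correct(theta, a=1.0, b=0.3, extra=0.8 * gamma))

# Raw correlation vs. correlation of residuals after removing the theta-predicted part only
r_raw = np.corrcoef(u1, u2)[0, 1]
res1 = u1 - p_correct(theta, 1.2, 0.0)
res2 = u2 - p_correct(theta, 1.0, 0.3)
r_resid = np.corrcoef(res1, res2)[0, 1]
print(f"raw r = {r_raw:.3f}, residual r = {r_resid:.3f}")
```

Under local independence the residual correlation would be close to zero; the positive value left here by the shared gamma term is exactly the kind of within-testlet dependence that the Q3 and LD-χ2 indices used later in this paper are designed to detect.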

2. IRT Models for Testlets: Bifactor and Testlet-Response Theory (TRT) Models

To overcome local dependence within the standard unidimensional IRT (UIRT) framework, the literature presents a multidimensional IRT (MIRT) approach: applying the bifactor model (Gibbons & Hedeker, 1992) and the Testlet Response Theory (TRT) model (Bradlow, Wainer, & Wang, 1999). The multidimensional bifactor and TRT models are known to be useful in educational contexts (e.g., testlet-based tests) when it is theoretically justifiable to model test takers' responses with a large number of group factors, owing to their unique constraints on the factor structure. They permit the modeling of responses to testlet items in which the items are conditionally dependent because they share a common text.

1) General Model: Bifactor Model

The bifactor model is an MIRT model under which each testlet has its own specific dimension, represented as θS1 to θS3, along with the general dimension, illustrated as θG in Figure 1. The bifactor model in Figure 1 illustrates a test dataset with three testlets, where there is one general factor and three specific factors. Each item loads on the general factor and on only one of the specific factors. The general factor, or θG in Equation 1 (DeMars, 2013), represents the primary ability underlying the entire test, while the specific factors represent more narrowly defined, independent characteristics such as item group effects or subdomain effects (S. Choi, 2012a).


FIGURE 1. Bifactor Model (Modified from DeMars, 2013)

For binary data, the bifactor model can be formulated as follows.

g(πj) = αjGθG + αjSθS + βj ······························· (Equation 1)

βj : the intercept parameter for item j
αjG : the slope (loading) of item j on the general latent variable (G)
αjS : the slope (loading) of item j on the specific latent variable (S)

What Equation 1 conveys about the relationship between the two kinds of dimensions assumed in the bifactor model is that the αjS are estimated independently of the αjG; the equation thus gives an account of local dependence within a testlet that is of potential use in identifying the items most influenced by the testlet effect (θS). Y. So's (2010) study investigating the dimensionality of items nested within a passage on the reading paper of the Certificate in Advanced English (CAE) found the bifactor model to best represent inter-item relationships on the CAE reading paper. This finding suggests that two sources of item inter-correlations - reading proficiency and the specific reading passage that a group of items share - should be addressed properly when modeling the item responses on the test, and that failing to take into account the conditional dependence among items nested within a reading passage can lead to bias in estimating item discrimination parameters. The strength of the bifactor model for testlet-based reading tests was also highlighted by C. Park (2010) because it allows substantial variability in the loadings on the second dimension. These studies demonstrate the model's great potential for language tests with testlet items.
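To make Equation 1 concrete, the following minimal sketch (illustrative Python with a logistic link and made-up parameter values; the study itself fits the model in IRTPRO rather than with hand-written code) computes the correct-response probability for one item that loads on the general dimension and on exactly one testlet-specific dimension.

```python
import numpy as np

def bifactor_prob(theta_g, theta_s, a_g, a_s, beta):
    """Bifactor 2PL item of Equation 1: g(pi_j) = a_jG*theta_G + a_jS*theta_S + beta_j,
    with a logistic link for the probability of a correct response."""
    z = a_g * theta_g + a_s * theta_s + beta
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical item: strong general loading, moderate testlet loading
print(bifactor_prob(theta_g=0.5, theta_s=1.0, a_g=1.3, a_s=0.6, beta=-0.2))
```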


2) Constrained Model: Multidimensional Testlet-Response Theory (MTRT) Model

According to Li et al. (2006) and C. Park (2010), to illustrate how examinees' responses to testlet items can be modeled, Bradlow et al. (1999) and follow-up studies established a two-parameter normal ogive (2PNO) testlet model in which a random-effect parameter is added to model the local dependence among items within the same testlet (Bradlow, Wainer, & Wang, 1999; Wainer, Bradlow, & Du, 2000; Wang, Bradlow, & Wainer, 2002). The 2PNO testlet model can be written as

P(yij = 1) = Φ[ai(θj − bi + γd(i)j)] ························· (Equation 2)

P(yij = 1) : the probability that examinee j answers item i correctly
Φ : the cumulative distribution function (cdf) of the standard normal distribution
θj : the ability of examinee j
bi : the difficulty of item i
ai : the discrimination parameter of item i
γd(i)j : a random effect representing the interaction of person j with testlet d(i) (i.e., the testlet d that contains item i)

The testlet model can also be defined as a special version of the bifactor model, obtained by constraining the loadings on the specific dimension to be proportional to the loadings on the general dimension within each testlet (Li et al., 2006; Rijimen, 2009, as cited in Rijimen, 2010; G. Lee et al., 2009). Bradlow et al. (1999) and Wainer et al. (2007) formulate the testlet model in a Bayesian framework as follows.

g(πj) = αjG(θG + CsθS) + βj ··················· (Equation 3)

The testlet model in Equation 3 is equivalent to the bifactor model in Equation 1 if αjGCs = αjS is imposed in Equation 3. The testlet-specific proportionality constants Cs, s = 1, …, S, stem from the fact that, unlike in the bifactor model, the scales of the specific dimensions do not have to be fixed for reasons of model identification (G. Lee et al., 2009). G. Lee et al. (2009) added that the TRT model presented in Equation 3 is an extension of the "original" two-parameter normal ogive (2PNO) testlet model of Bradlow et al. (1999) in Equation 2, in that the latter model incorporates the additional restriction that all Cs are equal across testlets (i.e., the variances of the specific dimensions are assumed to be the same). That the TRT model is nested within the bifactor model also means that an item is expected to have the same discrimination parameter ai for γd and θ. In a reading comprehension test, this would imply that items that are good indicators of reading comprehension are also more strongly affected by the secondary dimension associated with the passage. For model specification, therefore, all secondary-dimension slopes are set equal to each item's general-dimension slope, the variances of the secondary dimensions across testlets are freely estimated, the means of the general and secondary dimensions are set to 0, and the variance of the general dimension is set to 1.
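The nesting of the MTRT model in the bifactor model can also be expressed directly in code: under the testlet constraint the specific-dimension slope is not free but tied to the general slope through the proportionality constant Cs. The sketch below is purely illustrative (same logistic link as in the bifactor sketch above; all parameter values are hypothetical).

```python
from math import exp

def trt_prob(theta_g, gamma_s, a_g, c_s, beta):
    """Testlet (MTRT) model of Equation 3: g(pi_j) = a_jG*(theta_G + C_s*gamma_s) + beta_j.
    Expanding the bracket gives a_jS = a_jG * c_s, i.e., the bifactor model of Equation 1
    with the specific slope constrained to be proportional to the general slope."""
    z = a_g * (theta_g + c_s * gamma_s) + beta
    return 1.0 / (1.0 + exp(-z))

# Every item in testlet s shares the ratio a_jS / a_jG = c_s; the bifactor model instead
# estimates a_jS freely per item, which is what lets it absorb a variable testlet effect.
print(trt_prob(theta_g=0.5, gamma_s=-0.3, a_g=1.1, c_s=0.7, beta=0.2))
```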

III. METHOD

1. Data

The participants are 830 first- and second-year high school students at two schools located in two cities in South Kyungsang Province. The testing instrument is a multiple-choice reading comprehension test with 8 passages and 32 items, 4 items per passage (see Appendix). It originates from the test developed by Y.-W. Lee (1998), which includes 10 passages and 40 items, with 4 items per passage. Among the 10 passages, the 2 passages with the highest Q3 values (Yen, 1984) - an index of local dependence defined as the average inter-item correlation of residuals formed by subtracting the theta-predicted scores from the raw scores - were eliminated, so that the seriousness of local dependence could be examined even in the absence of the two passages that would pose the greatest threat to test reliability. The Q3 indices were computed automatically by the computer program IRTNEW (Chen, 1996) after item parameter estimation with MULTILOG 6.2 (Thissen, 1991).
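Because the Q3 index drives the passage-screening step just described, a minimal sketch of its computation is given below (illustrative Python; `responses` is an examinees-by-items 0/1 matrix, `p_hat` holds the theta-predicted probabilities from an already fitted unidimensional model, and `passage_items` maps passages to item columns - all hypothetical inputs, not the study's actual IRTNEW/MULTILOG output).

```python
import numpy as np

def q3_matrix(responses, p_hat):
    """Yen's Q3: correlations of the residuals d_ij = u_ij - P_i(theta_j) for all item pairs."""
    residuals = responses - p_hat                 # persons x items
    return np.corrcoef(residuals, rowvar=False)   # items x items

def mean_within_passage_q3(q3, passage_items):
    """Average Q3 over item pairs that share a passage.
    passage_items maps a passage id to the list of its item column indices."""
    vals = []
    for items in passage_items.values():
        for k, i in enumerate(items):
            for j in items[k + 1:]:
                vals.append(q3[i, j])
    return float(np.mean(vals))
```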

2. Procedure

Three sets of the same reading comprehension test are used for counterbalancing, a technique that controls for the ordering effect of text arrangement, and the sets are randomly distributed to participants in nearly equal proportions (Set A: 277, Set B: 277, Set C: 276). Thus, Forms A, B, and C differ only in the arrangement of the 8 passages, as shown in Table 1. The test lasts 50 minutes, including time for giving instructions.

For Research Question 1, local dependence and unidimensionality are checked under both Classical Test Theory and Item Response Theory approaches, using the inter-item correlations within passages, factor analysis, and the local dependence (LD) χ2 statistic (Chen & Thissen, 1997) obtained from unidimensional IRT modeling via IRTPRO 3 (Cai, Thissen, & du Toit, 2011; SSI, 2015). Min and He (2014) propose a guideline for interpreting LD-χ2 values: values exceeding 4 signal clear local dependence between items, and values exceeding 10 indicate extreme local dependence.


TABLE 1
Counterbalanced Design of the Reading Comprehension Test in Three Sets³ (Modified from J.-H. Byun & Y.-W. Lee, 2016)

Passage topic        No. of items   Form A               Form B               Form C
1. Composition       4              1. Composition       8. Elevator          4. Dream
2. Roller skate      4              2. Roller skate      7. Polygraph         3. Means of trade
3. Means of trade    4              3. Means of trade    6. Insects           2. Roller skate
4. Dream             4              4. Dream             5. Yogurt            1. Composition
5. Yogurt            4              5. Yogurt            4. Dream             8. Elevator
6. Insects           4              6. Insects           3. Means of trade    7. Polygraph
7. Polygraph         4              7. Polygraph         2. Roller skate      6. Insects
8. Elevator          4              8. Elevator          1. Composition       5. Yogurt

³ The mean length of the passages is 185 words.

Research Question 2 intends to identify the preferred model by comparing the model fits of the two testlet-based models and evaluating the accuracy of parameter estimation in three models - the unidimensional 2PL, the multidimensional TRT (MTRT), and the bifactor model. Rijimen (2010) illustrated the nested relationship of the two MIRT models as extended forms of the two-parameter logistic (2PL) or two-parameter normal ogive (2PNO) models. The present study fits well into Rijimen's (2010) 2PL framework of model comparison because of the medium-sized dataset, the computational efficiency of the approach, and the relatively low concern about guessing. The model selection tool used to compare the three nested models is -2 log likelihood, a deviance statistic indicating model misfit; in general, the model with the lowest value of this statistic is the preferred one. For all models, the item parameters are estimated with the Bock-Aitkin marginal maximum likelihood estimation algorithm (Bock & Aitkin, 1981), following Min and He's (2014) study, and the ability estimates are obtained with the expected a posteriori (EAP) method (Bock & Mislevy, 1982, as cited in Min & He, 2014), which requires less computation and yields smaller standard errors of the ability estimates (Wang & Vispoel, 1998, as cited in Min & He, 2014).
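For readers less familiar with the EAP method mentioned here, the following compact sketch (illustrative Python using simple quadrature over a standard normal prior and hypothetical 2PL item parameters; the actual estimation in this study was done in IRTPRO) shows how an EAP ability estimate is obtained as a posterior mean.

```python
import numpy as np
from scipy.stats import norm

def eap_theta(responses, a, b, nodes=61):
    """Expected a posteriori ability estimate under a 2PL model and a N(0,1) prior."""
    theta = np.linspace(-4, 4, nodes)                                          # quadrature nodes
    p = 1.0 / (1.0 + np.exp(-a[:, None] * (theta[None, :] - b[:, None])))      # items x nodes
    like = np.prod(np.where(responses[:, None] == 1, p, 1 - p), axis=0)        # likelihood per node
    post = like * norm.pdf(theta)                                              # unnormalized posterior
    return float(np.sum(theta * post) / np.sum(post))                          # posterior mean

# Hypothetical example: 4 items, examinee answers the first three correctly
a = np.array([1.2, 0.9, 1.1, 0.7]); b = np.array([-0.5, 0.0, 0.4, 1.0])
print(eap_theta(np.array([1, 1, 1, 0]), a, b))
```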

IV. RESULTS

1. Dimensionality Test

1) Classical Test Theory Approach



Investigating the presence of local dependence due to the passage effect involves comparing inter-item correlations and reliabilities estimated under two scoring units, item and passage. Table 2 shows a discrepancy in reliability between item-based and passage-based scoring: the item-based reliability coefficient (.864) is larger than the passage-based one (.833). This implies that the gap of .031 is the portion of the reliability coefficient inflated by the passage effect, as argued by Wainer and Thissen (1996). More importantly, the 32 × 32 inter-item correlation matrix summarized in Table 3 shows that the mean of the 48 within-passage correlations (.183) is greater than the grand mean of the 496 correlation coefficients over all item pairs (.166), whereas the mean of the 448 between-passage correlations (.148) is lower than the grand mean. In sum, the .035 excess of within-passage over between-passage correlations indicates the presence of local dependence under the classical test theory approach. Table 4 summarizes the inter-item correlations within each of the 8 passages used in the test. All passages except Passage 8 show above-average inter-item correlations.

TABLE 2
Reliability by Two Scoring Units, Item and Passage

Scoring Unit   Scoring Scheme   No. of Items   No. of Cases   Cronbach-α   Mean    SD
Item           Dichotomous      32             830            .864         12.15   6.5
Passage        Polytomous       8              830            .833

TABLE 3
Average Inter-item Correlations Within and Between Passages^a

Type                  Within-passage     Between-passage     Grand Mean   No. of Cases
Pearson correlation   .183 (-0.032)^b    .148 (-0.142)^b     .166         830

Note. ^a The correlations above were computed via the Fisher-Z and Fisher-Z inverse transformations. ^b Values in brackets are the expected correlations without local dependence, derived from Yen's (1984) formula.

TABLE 4
Average Inter-item Correlations Within Passage (p = .001)

Passage^c       P1     P2     P3     P4     P5     P6     P7     P8     Mean   Grand Mean
Pearson corr.   0.17   0.16   0.21   0.20   0.18   0.21   0.20   0.15   0.18   0.15

Note. ^c The passage topics are Composition, Roller skate, Means of trade, Dream, Yogurt, Insects, Polygraph, and Elevator, in the order of passage number.
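The classical-test-theory evidence reported in Tables 2-4 can be reproduced in outline with the sketch below (illustrative Python; `X` stands for a hypothetical 830 × 32 matrix of dichotomous item scores and `passage_of` assigns each item to its passage, so the resulting numbers would match the tables only if the original data were supplied).

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an examinees x parts score matrix."""
    k = scores.shape[1]
    part_var = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - part_var / total_var)

def fisher_mean(rs):
    """Average correlations via Fisher-Z and inverse Fisher-Z, as done in the study."""
    return float(np.tanh(np.mean(np.arctanh(rs))))

def ctt_ld_summary(X, passage_of):
    """Item- vs. passage-based alpha and within- vs. between-passage mean correlations."""
    alpha_item = cronbach_alpha(X)                                     # dichotomous, 32 items
    passages = sorted(set(passage_of))
    P = np.column_stack([X[:, [i for i, p in enumerate(passage_of) if p == g]].sum(axis=1)
                         for g in passages])
    alpha_passage = cronbach_alpha(P)                                  # polytomous, 8 passage scores
    r = np.corrcoef(X, rowvar=False)
    within, between = [], []
    for i in range(X.shape[1]):
        for j in range(i + 1, X.shape[1]):
            (within if passage_of[i] == passage_of[j] else between).append(r[i, j])
    return alpha_item, alpha_passage, fisher_mean(within), fisher_mean(between)
```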


2) Factor Analysis Approach

In the next step, an exploratory factor analysis based on item scoring was conducted to investigate the number of factors accounting for the total score variance of the reading comprehension test and the topic familiarity questionnaire. Principal component analysis (PCA) without rotation was used as the extraction method, retaining factors with eigenvalues above 1.0. The results in Table 5 show that the reading test yields a total of 8 factors with eigenvalues over 1.0. Component 1, which explains 19.27% of the total variance, does not even meet the condition of essential unidimensionality (Reckase, 1979), which requires a dominant dimension to account for at least 20% of the total variance.

TABLE 5
Result of Exploratory Factor Analysis for 32 RC Test Items

             Initial Eigenvalues                      Extraction Sums of Squared Loadings
Component    Total   % of Variance   Cumulative %     Total   % of Variance   Cumulative %
1            6.17    19.27           19.27            6.17    19.27           19.27
2            1.35    4.23            23.50            1.35    4.23            23.50
3            1.33    4.17            27.67            1.33    4.17            27.67
4            1.25    3.89            31.56            1.25    3.89            31.56
5            1.16    3.61            35.17            1.16    3.61            35.17
6            1.08    3.39            38.56            1.08    3.38            38.56
7            1.04    3.25            41.81            1.04    3.25            41.81
8            1.02    3.17            44.98            1.02    3.17            44.98
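A bare-bones version of the eigenvalue screen behind Table 5 can be written as follows (illustrative Python based on a principal component analysis of the Pearson inter-item correlation matrix; `X` is again a hypothetical score matrix, and the number of retained components depends entirely on the data supplied). The study does not state whether Pearson or tetrachoric correlations were used for the dichotomous items; the Pearson version is shown only for simplicity.

```python
import numpy as np

def pca_eigen_screen(X, threshold=1.0):
    """Eigenvalues of the inter-item correlation matrix, percentage of variance explained,
    and the number of components retained under the eigenvalue > 1 rule."""
    R = np.corrcoef(X, rowvar=False)
    eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]
    pct = 100.0 * eigvals / eigvals.sum()
    return eigvals, pct, int((eigvals > threshold).sum())
```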

3) Item Response Theory Approach

To examine local dependence, it is first necessary to choose the unidimensional model that best fits the dataset; this model will later serve as the baseline for the model comparison in Research Question 2. Although the 3PL unidimensional model had the lowest -2 log likelihood value (30832.70) compared to the 1PL (31561.48) and 2PL (31171.51) models, the marginal reliability of the 2PL model was the highest at 0.85, compared with 0.83 for the 1PL and 0.71 for the 3PL model. The RMSEA index of the 2PL and 3PL models (0.03) is slightly lower than that of the 1PL model (0.04). In addition, when estimated by the 3PL model, only four items had guessing parameters (g-parameters) exceeding 0.25 (the cutoff set for 4-option items), and none exceeded 0.3, which means no extra effort is warranted to carry the computational burden of the guessing parameter. Furthermore, previous research (Baker, 1987; Swaminathan & Gifford, 1985, as cited in Li et al., 2010) also supports this decision, arguing that choosing the 3PL model for dichotomous items creates considerable computational difficulty related to the c parameters (pseudo-guessing parameters), which may affect the estimation of other parameters. All considered, the 2PL model is chosen in this study as the base model for addressing the research questions that follow.

The estimation of LD-χ2 statistics under the 2PL unidimensional model detected local dependence in 25 of the 48 within-passage item pairs. Among these 25 item pairs, 7 pairs exhibit clear local dependence, spread across all passages except Passage 3. Table 6 illustrates the result. Generally, the degree of local dependence among items within a testlet does not appear very strong in this reading comprehension test. This is somewhat expected, given that the research design purposefully eliminated the two passages with the strongest local dependence from the original test developed by Y.-W. Lee (1998). Instead, attention should be paid to the result in Table 6, which clearly shows that the testlet effect varies within and between passages. This points to an important condition for selecting the best model: it must accommodate the variable testlet effect in the given data.

Turning to the Q3 statistic, a local item dependence (LID) index based on theta-predicted scores, Tables 7 and 8 show a tendency for the Q3 indices similar to that of the raw-score-based LID indices in Tables 3 and 4. That is, the mean of the within-passage correlations of residuals (.064) is greater than the grand mean (.010), and the .055 gap between the within- and between-passage residual correlations clearly indicates within-passage local item dependence.

TABLE 6

Standardized LD-χ2 Statistics of Item Pairs Within Passage (J.-H. Byun & Y.-W. Lee, 2016)

Passage number^d     P1     P2     P3     P4      P5      P6      P7      P8
Item pair with LD    1↔2    5↔6    NA^e   13↔14   19↔20   22↔24   26↔27   28↔31
LD-χ2                5.2    4.2    NA^e   5.4     3.5     3.8     5.2     3.1

Note. ^d indicates the passage number. ^e indicates that no local dependence was found.

TABLE 7
Average Inter-item Correlations of Residuals Within and Between Passages (Q3)

Type   Within-passage     Between-passage    Grand Mean
Q3     .064 (-0.032)^f    .009 (-0.142)^f    .010

Note. The correlations above were computed via the Fisher-Z and Fisher-Z inverse transformations. ^f Values in brackets are the expected correlations without local item dependence derived from Yen's (1984, 1993) formula, -1/(n-1), where n is the number of items.


TABLE 8
Average Inter-item Correlations of Residuals Within Passage^g (Q3)

Passage   P1     P2     P3     P4     P5     P6     P7     P8     Mean   Grand Mean
Q3        .064   .088   .074   .049   .073   .071   .062   .031   .064   .010

Note. ^g The correlations above were computed via the Fisher-Z and Fisher-Z inverse transformations.

2. Model Comparison: The Unidimensional, Multidimensional TRT, and Bifactor Models

1) Overall Model Fit

Because the Uni 2PL model is nested in the MTRT 2PL model, and the MTRT 2PL model is nested in the bifactor 2PL model, the significance of the differences in -2 log likelihood can be tested with a series of χ2 difference tests (du Toit, 2003, as cited in Min & He, 2014) to decide which is the best-fitting model. The χ2 difference tests of the -2 log likelihood values were statistically significant for all three pairs (Uni↔MTRT, MTRT↔Bi, Uni↔Bi), as shown in Table 9, and the model with the lowest -2 log likelihood value is considered to have the best fit. Accordingly, the bifactor model was chosen as the preferred model, followed by the MTRT model and the unidimensional 2PL model.

TABLE 9
Comparison of the Unidimensional and Multidimensional Models (J.-H. Byun & Y.-W. Lee, 2016)

Models (2PL)   NP^h   Df    -2log Likelihood   χ2-difference tests of the three models
Uni            64     464   31171.51           χ2(8) = 67.36**  (Uni↔MTRT)
MTRT           72     456   31104.15           χ2(24) = 41.28*  (MTRT↔Bi)
Bi             96     432   31062.87           χ2(32) = 108.64**  (Uni↔Bi)

Note. ^h indicates the number of parameters. *p < .05  **p < .005
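The χ2 difference tests in Table 9 follow directly from the reported deviances and parameter counts; the sketch below (illustrative Python with scipy) reproduces the Uni versus MTRT comparison from those published values.

```python
from scipy.stats import chi2

def lr_test(deviance_reduced, deviance_full, df_diff):
    """Likelihood-ratio (chi-square difference) test for nested IRT models."""
    stat = deviance_reduced - deviance_full          # difference in -2 log likelihood
    p = chi2.sf(stat, df_diff)
    return stat, p

# Values reported in Table 9: Uni (-2LL = 31171.51, NP = 64) vs. MTRT (-2LL = 31104.15, NP = 72)
stat, p = lr_test(31171.51, 31104.15, df_diff=72 - 64)
print(f"chi2(8) = {stat:.2f}, p = {p:.4g}")          # 67.36, statistically significant
```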

2) Estimation of Item Parameters, Factor Loadings, and Standard Errors

Discrimination estimates correspond to the slope parameters (a-parameters); the higher the slope, the higher the discrimination (Min & He, 2014). The means of the discrimination estimates in Table 10 indicate that the bifactor model can best discriminate the intended ability of reading comprehension. Also, as shown in Table 11, the correlation of a-trait parameters between the bifactor and Uni models is noticeably lower than the corresponding correlations between the MTRT and Uni models and between the two MIRT models, suggesting that choosing a poorly fitting model affects the estimation of the discrimination parameters.


TABLE 10
Summary of a-Trait Parameters from Three Models (N = 32)

Model   Min    Max    Mean
Uni     0.23   1.76   0.98
MTRT    0.22   2.20   1.03
Bi      0.21   4.22   1.20

TABLE 11
Correlation of a-Trait Parameters from Three Models (N = 32)

           Uni_a1   MTRT_a1   Bi_a1
Uni_a1     1
MTRT_a1    .99**    1
Bi_a1      .76**    .79**     1

Note. **p = .01

Table 12 summarizes the two discrimination parameters - for the general and testlet (or specific) dimensions - of all items as estimated by the three models. The bifactor model can be regarded as the preferred one among the three models, not only because of the discriminating power of its general dimension, reflected in the highest a1 estimates across the three models, but also because of its substantial capacity to accommodate the variability of the testlet effect within and across passages. More significantly, given that the slope parameters can be directly transformed into factor loadings (a conversion sketched at the end of this section), as illustrated in Figure 2 for the general dimension (.72 to .12) and the testlet dimension (.75 to -.13), most items in the test share the general dimension, which is reading comprehension, with a secondary trait presumably related to other factors embedded in the passages.

Figure 3 indicates that most of the standard errors (SEs) of the slope (discrimination) parameters from the three models lie along the reference SE band of 0-0.2, despite some outliers in the bifactor (Items 13, 19, 26) and MTRT (Item 26) models. In degree of correspondence, the Uni-MTRT pair shows the highest agreement, followed by the MTRT-Bi and the Uni-Bi pairs; Figure 3 thus also shows the largest disagreement in the estimation of slope parameters between the Bi and Uni models. A further observation, that the Uni model's SE estimates are nearly constant compared with those of the Bi and MTRT models, confirms the Uni model's inability to take the within-passage testlet effect into account; it is hard to regard the stably low SEs from the Uni model as an indicator of accurate estimation when compared with the other two multidimensional models.


TABLE 12
Discrimination Parameters of General (a1) and Specific (a2) Dimensions from Three Models (Extended Version of J.-H. Byun & Y.-W. Lee, 2016)

          Uni    MTRT^i    Bi      Bi                Uni    MTRT^i    Bi      Bi
Item      a1     a1/a2     a1      a2       Item     a1     a1/a2     a1      a2
1         1.24   1.39      1.34    0.79     17       1.58   1.64      1.57    0.15
2         1.09   1.18      1.23    0.94     18       1.11   1.11      1.13   -0.05
3         0.68   0.69      0.68    0.46     19       1.48   1.53      3.96    4.12
4         0.46   0.42      0.45    0.07     20       0.30   0.30      0.29    0.31
Mean      0.87   0.92      0.93    0.57     Mean     1.12   1.15      1.74    1.13
5         1.07   1.18      1.19    0.90     21       1.16   1.19      1.20    0.28
6         1.14   1.27      1.27    0.91     22       1.72   1.86      1.77    0.02
7         0.84   0.85      0.85    0.51     23       0.41   0.41      0.69    2.15
8         0.23   0.23      0.21    0.26     24       1.08   1.11      1.11   -0.20
Mean      0.82   0.88      0.88    0.65     Mean     1.09   1.14      1.19    0.56
9         0.85   0.83      0.85    0.08     25       0.93   0.91      0.93    0.24
10        1.50   1.65      1.59    0.65     26       1.76   2.20      4.22    3.69
11        0.83   0.85      0.90    0.86     27       0.90   0.91      0.90    0.39
12        0.91   0.92      0.93    0.55     28       0.54   0.55      0.53    0.24
Mean      1.02   1.06      1.07    0.54     Mean     1.03   1.14      1.65    1.14
13        0.91   0.94      1.75    2.74     29       1.11   1.15      1.13    0.07
14        1.13   1.20      1.14    0.43     30       0.71   0.73      0.73    0.19
15        1.19   1.23      1.18    0.29     31       0.99   1.00      1.02   -0.23
16        0.85   0.81      0.91   -0.36     32       0.61   0.61      0.73    0.95
Mean      1.02   1.05      1.25    0.78     Mean     0.86   0.87      0.90    0.25

Note. ^i The MTRT model was constrained to set the two a-parameters for the general and testlet dimensions equal, so each item is expected to have the same discrimination parameter for the general and testlet dimensions.

Considering the overall model fit, the slope parameters and their SEs, and the variability of local dependence detectable under both Classical Test Theory and Item Response Theory through indices such as Q3 and LD-χ2, the bifactor model can be regarded as the preferred one for the data under study.
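The slope-to-loading conversion referred to in connection with Figure 2 can be made explicit. A commonly used form rescales normal-ogive-metric slopes as λ_jk = a_jk / sqrt(1 + Σ_k a_jk²); the sketch below (illustrative Python) applies it to a hypothetical pair of bifactor slopes. This is a standard textbook conversion and is not necessarily the exact transformation implemented in IRTPRO for the loadings reported here.

```python
import numpy as np

def slopes_to_loadings(a, logistic=True, D=1.702):
    """Convert MIRT slopes to standardized factor loadings.
    a: items x dimensions matrix of slopes (e.g., columns a1 and a2 from the bifactor model).
    Logistic-metric slopes are first divided by D = 1.702; then
    lambda_jk = a_jk / sqrt(1 + sum_k a_jk**2)."""
    a = np.asarray(a, dtype=float)
    if logistic:
        a = a / D
    return a / np.sqrt(1.0 + (a ** 2).sum(axis=1, keepdims=True))

# Hypothetical item with bifactor slopes a1 = 1.3 (general) and a2 = 0.6 (testlet)
print(slopes_to_loadings([[1.3, 0.6]]))
```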

V. DISCUSSION AND CONCLUSION

By investigating the dimensional structure of a reading comprehension test for Korean EFL high school students, this study attempted to demonstrate the multidimensionality of the reading test, in which local dependence among items within a passage brings about a testlet effect. The study further attempted to specify the latent trait model that best addresses such an effect, so as to improve the accuracy of parameter estimation and the reliability with which the latent trait intended by the test is measured, as discussed in previous studies (S. Choi, 2012b; C. Park, 2010; Rijimen, 2009, 2010; Y. So, 2010).


FIGURE 2 Factor Loadings of 32 Items on General and Specific Dimensions in Bifactor Model

FIGURE 3
Scatter Plots of the SEs of a1 (General) Parameters from Three Models (N = 32)
(Panels: Uni(x)↔MTRT(y), MTRT(x)↔Bi(y), Uni(x)↔Bi(y))

It turns out that the testlet effect is substantially present in the reading comprehension test used here, and the bifactor model is confirmed to be the preferred one among the three candidate models. Overall, what can be firmly drawn from the current study is that the estimation of the discrimination parameters is still affected by the testlet effect among items within a passage, even though the passages with the strongest local dependence in the original test (Y.-W. Lee, 1998) were eliminated.


The analysis for Research Question 1 exhibits clear local dependencies among items within the testlets (passages) of the reading comprehension test, based on several indicators from both Classical Test Theory and Item Response Theory: the inter-item correlations of residuals within passages are greater than the grand mean of the inter-item correlations, as indicated by well-known local dependence indices such as the Q3 and LD-χ2 statistics. Moreover, the variability of the local dependencies within and across passages became apparent, although extreme local dependence did not appear. The substantial variability of the testlet effects within passages supports the application of the bifactor model to the dataset in Research Question 2, based on the deviance statistics of the model fits. Supported by the accuracy of the estimated item slope parameters, the choice of the bifactor model as the best-fitting one improves discriminating power on the general dimension. This finding agrees with previous studies that investigated the consequences of applying a poorly fitting model to the data when estimating item parameters (Min & He, 2014; Y. So, 2010). More importantly, this result allows the item slope parameters on the secondary dimension (a2 parameters) to be freely estimated, so that the relationship between the two dimensions can be examined and a range of testlet-specific factors (e.g., item characteristics, passage topics, and characteristics of the test-taker group) can be investigated.

As illustrated in Table 12, individual items show consistently higher slope parameters on the general dimension than on the testlet dimension, which means that, in general, the test measures what it purports to measure - reading comprehension. Although this consistency between the two slopes holds in the first half of the test, it breaks down for items in the second half (Items 21, 22, 23, 24, 25, 29, 30, 31, and 32), which deal with topics such as insects and elevators. This may indicate that, as the test continued, the two slopes became more independent of one another, with the testlet influence growing stronger. Among the bifactor model estimates in Table 12, the reverse relationship in some items (Items 8, 13, 19, 20, 23, 32) deserves attention: their discrimination estimates on the testlet dimension are higher than those on the general dimension. The strong influence of the passage effect on these items suggests that their score variance might be accounted for more by the testlet factor than by the general factor, that is, test takers' reading comprehension.

What the item analysis suggests is that the utility of modeling passage-based reading comprehension test results within the multidimensional IRT bifactor framework lies in its capacity to allow for the variability of the testlet slope parameters (a2) that represent the testlet effect in different passages, as opposed to the unidimensional IRT framework, which puts greater effort than necessary into minimizing local dependence, the source of testlet effects.

In consequence, estimating reading comprehension scores with the bifactor measurement model will help not only to promote test practicality, thanks to the testlet structure in which a group of items shares the same passage, but also to validate passage-based reading comprehension tests, which conceptually involve multiple factors derived from test takers, passages, and items.

It is now reasonable to ask what causes the testlet effect, whether it occurs within or across passages. Many previous studies (DeMars, 2006; Li et al., 2006; Min & He, 2014; Rijimen, 2010) have assumed readers' background knowledge about passage topics to be a highly attributable source, and an empirical study has examined whether the degree of readers' topic familiarity or content knowledge is associated with the traits that account for the testlet dimension of a reading test, in order to identify topic familiarity as a source of the testlet effect (see J.-H. Byun & Y.-W. Lee, 2016, for more details). As another potential source of the testlet effect, item type deserves to be addressed. Five of the six items showing the reverse relationship deal with making predictions or inferences (see test items 3 and 4 in the Appendix). As a factor operating across passages, S. Ji and H. Kim's (2014) report of a significant effect of item type on the reading comprehension of Korean EFL high school students is noteworthy. Based on their study, items sharing a common item type can be considered another source of local dependence among items across testlets. In fact, item type has long been studied as a source of local item dependence (Y.-W. Lee, 1998). As Bachman (1990) notes, when similar item types are used for multiple passages, they can create an additional local dependence factor through a test method effect that can potentially overlap with the influence of a shared text.

To conclude, no potential source of testlet effects embedded among within-passage items should be left unexamined in future research, such as the text organization of the reading passages (Y.-W. Lee, 1998; Kobayashi, 2002). Hopefully, the current study opens an avenue for future research examining the relationship between reading comprehension and the testlet effect and identifying the causes of a larger influence of the secondary testlet dimension.

REFERENCES

Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.
Baker, F. B. (1987). Methodology review: Item parameter estimation under the one-, two-, and three-parameter logistic models. Applied Psychological Measurement, 12, 111-141.
Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: An application of the EM algorithm. Psychometrika, 46(4), 443-459.
Bock, R. D., & Mislevy, R. J. (1982). Adaptive EAP estimation of ability in a microcomputer environment. Applied Psychological Measurement, 6(4), 431-444.
Bradlow, E. T., Wainer, H., & Wang, X. (1999). A Bayesian random effects model for testlets. Psychometrika, 64, 153-168.
Breland, H., Muraki, E., & Lee, Y.-W. (2001, April). Comparability of TOEFL CBT writing prompts for different response modes. Paper presented at the annual meeting of the National Council on Measurement in Education (NCME), Seattle, WA.
Byun, J.-H., & Lee, Y.-W. (2016). Investigating topic familiarity as source of testlet effect in reading tests: Bifactor analysis. Foreign Languages Education, 23(3), 79-109.
Cai, L., Thissen, D., & du Toit, S. (2011). IRTPRO user's guide. Lincolnwood, IL: Scientific Software International, Inc.
Chen, W.-H., & Thissen, D. (1997). Local dependence indices for item pairs using item response theory. Journal of Educational and Behavioral Statistics, 22, 265-289.
Choi, S. (2012a). Examining the internal structure of a TOEIC practice test: Bifactor and two-tier model approaches. Journal of Language Sciences, 19(2), 141-165.
Choi, S. (2012b). The validity of subscores for a reading comprehension test: A bifactor approach. English 21, 25(1), 269-291.
DeMars, C. E. (2006). Application of the bi-factor multidimensional item response theory model to testlet-based tests. Journal of Educational Measurement, 43(2), 145-168.
DeMars, C. E. (2013). A tutorial on interpreting bifactor model scores. International Journal of Testing, 13(4), 354-378.
du Toit, M. (2003). IRT from SSI: BILOG-MG, MULTILOG, PARSCALE, TESTFACT. Lincolnwood, IL: Scientific Software International, Inc.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum.
Gibbons, R. D., Bock, R. D., Hedeker, D., Weiss, D. J., Segawa, E., Bhaumik, D. K., Kupfer, D. J., Frank, E., Grochocinski, V. J., & Stover, A. (2007). Full-information item bifactor analysis of graded response data. Applied Psychological Measurement, 31(1), 4-19.
Gibbons, R. D., & Hedeker, D. (1992). Full-information item bi-factor analysis. Psychometrika, 57(3), 423-436.
Henning, G. (1989). Meaning and implications of the principle of local independence. Language Testing, 6(1), 95-108.
Jang, E. E., & Roussos, L. (2007). An investigation into the dimensionality of TOEFL using conditional covariance-based nonparametric approach. Journal of Educational Measurement, 44, 1-21.
Ji, S., & Kim, H. (2014). A comparison of the effects of test-item type and text familiarity on results of an English reading test. Foreign Languages Education, 21(1), 215-239.
Kobayashi, M. (2002). Method effects on reading comprehension test performance: Text organization and response format. Language Testing, 19(2), 193-220.
Lee, G., Park, I.-Y., & Jeon, M.-J. (2009). Testlet response model for IRT true score equating. Journal of Educational Evaluation, 871-887.
Lee, Y.-W. (1998). Examining the suitability of an IRT-based testlet approach to the construction and analysis of passage-based items in an EFL reading comprehension test in the Korean high school context. Unpublished doctoral dissertation. University Park, PA: The Pennsylvania State University.
Lee, Y.-W. (2004). Examining passage-related local item dependence (LID) and measurement construct using Q3 statistics in an EFL reading comprehension test. Language Testing, 21(1), 74-100.
Li, Y., Bolt, D. M., & Fu, J. (2006). A comparison of alternative models for testlets. Applied Psychological Measurement, 30(1), 3-21.
Li, Y., Li, S., & Wang, L. (2010). Application of a general polytomous testlet model to the reading section of a large-scale English language assessment (ETS Research Report No. RR-10-21). Princeton, NJ: ETS.
McNamara, T. F. (1996). Measuring second language performance. Harlow, UK: Addison Wesley Longman.
Min, S., & He, L. (2014). Applying unidimensional and multidimensional item response theory models in testlet-based reading assessment. Language Testing, 31(4), 453-477.
Park, C. (2010). A comparative study of IRT models for locally dependent reading test items by ESL learners. Journal of Educational Evaluation, 23(2), 529-546.
Reckase, M. D. (1979). Unifactor latent trait models applied to multifactor tests: Results and implications. Journal of Educational Statistics, 4, 207-230.
Rijimen, F. (2009). Three multidimensional models for testlet-based tests: Formal relations and an empirical comparison (ETS Research Report No. RR-09-37). Princeton, NJ: ETS.
Rijimen, F. (2010). Formal relations and an empirical comparison among the bi-factor, the testlet, and a second-order multidimensional IRT model. Journal of Educational Measurement, 47, 361-372.
Sireci, S., Thissen, D., & Wainer, H. (1991). On the reliability of testlet-based tests. Journal of Educational Measurement, 28, 237-247.
So, Y. (2010). Dimensionality of responses to a reading comprehension assessment and its implications to scoring test takers on their reading proficiency. Unpublished doctoral dissertation. Los Angeles: University of California.
SSI. (2015). IRTPRO: User's guide. Skokie, IL: Scientific Software International, Inc.
Stout, W., Nandakumar, R., & Habing, B. (1996). Analysis of latent dimensionality of dichotomously and polytomously scored test data. Behaviormetrika, 23, 37-65.
Swaminathan, H., & Gifford, J. A. (1985). Bayesian estimation in the two-parameter logistic model. Psychometrika, 50, 349-364.
Thissen, D., Steinberg, L., & Mooney, J. A. (1989). Trace lines for testlets: A use of multiple categorical response models. Journal of Educational Measurement, 26(3), 247-260.
Thissen, D. (1991). MULTILOG user's guide. Chicago, IL: Scientific Software.
Wainer, H., & Kiely, G. L. (1987). Item clusters and computerized adaptive testing: A case for testlets. Journal of Educational Measurement, 24, 185-201.
Wainer, H., & Thissen, D. (1996). How is reliability related to the quality of test scores? What is the effect of local dependence on reliability? Educational Measurement: Issues and Practice, 15(1), 22-29.
Wainer, H., Bradlow, E. T., & Du, Z. (2000). Testlet response theory: An analog for the 3PL model useful in testlet-based adaptive testing. In Computerized adaptive testing: Theory and practice (pp. 245-269). Netherlands: Springer.
Wainer, H., Bradlow, E. T., & Wang, X. (2007). Testlet response theory and its applications. Cambridge, UK: Cambridge University Press.
Wang, T., & Vispoel, W. P. (1998). Properties of ability estimation methods in computerized adaptive testing. Journal of Educational Measurement, 35(2), 109-135.
Wang, X., Bradlow, E. T., & Wainer, H. (2002). A general Bayesian model for testlets: Theory and applications. ETS Research Report Series, 2002(1), 1-37.
Weiss, D. J., & Yoes, M. E. (1991). Item response theory. In R. K. Hambleton & J. N. Zaal (Eds.), Advances in educational and psychological testing: Theory and applications (pp. 69-95). Boston: Kluwer Academic.
Yen, W. M. (1984). Effects of local item dependence on the fit and equating performance of the three-parameter logistic model. Applied Psychological Measurement, 8(2), 125-145.
Yen, W. M. (1993). Scaling performance assessments: Strategies for managing local item dependence. Journal of Educational Measurement, 30(3), 187-213.
Zhang, B. (2010). Assessing the accuracy and consistency of language proficiency classification under competing measurement models. Language Testing, 27(1), 119-140.

APPENDIX

Note: Due to space limitations, only the first reading passage with its four items is presented here. The test instructions were given in Korean; an English rendering is: "High School English Reading Comprehension Test (Form A). (Questions 1-4) Read the following passage and choose the best answer to each question."

When you write a composition, you will use many of the same skills you learned for writing paragraphs. You will choose a subject, gather details, organize your information, write a first draft, and revise. Choosing a subject is the first step. Look in your journal. Do some reading. Page through a book of photographs. Try brainstorming. Choose a subject that is interesting to you and that you know something about. Work with ideas that are too broad to be covered well in a single paragraph. However, don't choose a subject that is too general. For example, an explanation of how to eat with chopsticks is just right for a single paragraph. A description of a Japanese dinner, however, is much too broad to be covered well in only a few sentences. At the same time, the topics "Japan" or "Japanese customs" are too general even for a composition.

1. What is the main topic of this passage?
(1) The first step in writing a composition
(2) The similarities between paragraphs and compositions
(3) The importance of reading as a search strategy
(4) The uniqueness of Japanese culture and customs

2. Which of the following was not mentioned in the passage as a way to choose a subject?
(1) Reading

(2) Looking in a journal

(3) Brainstorming

(4) Taking photographs

3. According to the passage, "how to eat with chopsticks" may not be a right topic for a composition because the subject _____________________________________________.
(1) is not interesting enough

(2) is not familiar enough

(3) is not broad enough

(4) is not technical enough

4. What kind of content would naturally follow this passage next in order?
(1) How to gather materials

(2) How to organize information

(3) How to write up the final draft

(4) How to write paragraphs

Applicable levels: Secondary, Tertiary
Key words: testlet-based reading comprehension test, testlet-based MIRT models, bifactor model

Byun, Jung-Hee (1st author)
Seoul National University
E-mail: [email protected]

Lee, Yong-Won (Corresponding author)
Seoul National University
E-mail: [email protected]

Received: October 1, 2016 Reviewed: November 4, 2016 Confirmed: December 5, 2016
