Scaling Methodology and Procedures for the Mathematics ... - CiteSeerX

7

Scaling Methodology and Procedures for the Mathematics and Science Scales Raymond J. Adams Margaret L. Wu Greg Macaskill Australian Council for Educational Research

The principal method by which student achievement is reported in TIMSS is through scale scores derived using Item Response Theory (IRT) scaling. With this approach, the performance of a sample of students in a subject area can be summarized on a common scale or series of scales even when different students have been administered different items. The common scale makes it possible to report on relationships between students’ characteristics (based on their responses to the background questionnaires) and their overall performance in mathematics and science.

Because of the need to achieve broad coverage of both mathematics and science within a limited amount of student testing time, each student was administered relatively few items within each content area of each subject. In order to achieve reliable indices of student proficiency in this situation, it was necessary to make use of multiple imputation or “plausible values” methodology. Further information on plausible value methods may be found in Mislevy (1991), and in Mislevy, Johnson, and Muraki (1992). The proficiency scale scores or plausible values assigned to each student are actually random draws from the estimated ability distribution of students with similar item response patterns and background characteristics. The plausible values are intermediate values that may be used in statistical analyses to provide good estimates of parameters of student populations. Although intended for use in place of student scores in analyses, plausible values are designed primarily to estimate population parameters, and are not optimal estimates of individual student proficiency. This chapter provides details of the IRT model used in TIMSS to scale the achievement data. For those interested in the technical background of the scaling, the chapter describes the model itself and the method of estimating the parameters of the model. 7.1

THE TIMSS SCALING MODEL

The scaling model used in TIMSS was the multidimensional random coefficients logit model as described by Adams, Wilson, and Wang (1997), with the addition of a multivariate linear model imposed on the population distribution. The scaling was done with the ConQuest software (Wu, Adams, and Wilson, 1997) that was developed in part to meet the needs of the TIMSS study. The multidimensional random coefficients model is a generalization of the more basic unidimensional model.

111

CHAPTER 7

7.1.1

The Unidimensional Random Coefficients Model

Assume that I items are indexed i=1,...,I with each item admitting Ki + 1 response alternatives k = 0,1,...,Ki. Use the vector valued random variable, X i = ( X i1, X i2, ¼, X iK ) , i

ì1 where X ij = í î0

if response to item i is in category j otherwise

(1)

to indicate the Ki + 1 possible responses to item i. A response in category zero is denoted by a vector of zeroes. This effectively makes the zero category a reference category and is necessary for model identification. The choice of this as the reference category is arbitrary and does not affect the generality of the model. We can also collect the Xi together into the single vector X¢ = (X1¢, X2¢,...,Xi¢), which we call the response vector (or pattern). Particular instances of each of these random variables are indicated by their lower-case equivalents: x, xi and xik. T

The items are described through a vector x = ( x 1, x 2, ¼, x p ) of p parameters. Linear combinations of these are used in the response probability model to describe the empirical characteristics of the response categories of each item. These linear combinations are defined by design vectors ajk, (j = 1,…,I; k = 1,…Ki) each of length p, which can be collected to form a design matrix A¢ = ( a 11 ,a 12 ,¼ ,a 1K ,a 21, ¼ ,a 2K ,¼ ,a iK ) . Adopting a very general approach to the definition of items, in conjunction with the imposition of a linear model on the item parameters, allows us to write a general model that includes the wide class of existing Rasch models, for example, the item bundles models of Wilson and Adams (1995). 1

2

i

An additional feature of the model is the introduction of a scoring function, which allows the specification of the score or "performance level" that is assigned to each possible response to each item. To do this we introduce the notion of a response score bij, which gives the performance level of an observed response in category j of item i. The T bij can be collected in a vector as b = ( b 11 ,b 12 ,¼ ,b 1K ,b 21, b 22 ,¼ , b 2K ,¼ ,b iK ) . (By definition, the score for a response in the zero category is zero, but other responses may also be scored zero.) 1

2

i

In the majority of Rasch model formulations there has been a one-to-one match between the category to which a response belongs and the score that is allocated to the response. In the simple logistic model, for example, it has been standard practice to use the labels 0 and 1 to indicate both the categories of performance and the scores. A similar practice has been followed with the rating scale and partial credit models, where each different possible response is seen as indicating a different level of performance, so that the category indicators 0, 1, 2, etc. that are used serve as both scores and labels. The use of b as a scoring function allows a more flexible relationship between the qualitative aspects of a response and the level of performance that it reflects. Examples of where this is applicable are given in Kelderman and Rijkes (1994) and Wilson (1992). A primary reason for implementing this feature in the model was to facilitate the analysis of the two-digit coding scheme that was used in the TIMSS short-answer and ex112

CHAPTER 7

tended-response items. In the final analyses, however, only the first digit of the coding was used in the scaling, so this facility in the model and scaling software was not used in TIMSS. Letting q be the latent variable, the item response probability model is written as: T

exp ( b ij q + a ij x ) Pr ( X ij = 1 ;A ,b ,x q ) = --------------------------------------------------K i

å exp ( bik q +

T a ij

(2)

x)

k=1

and a response vector probability model as T

f ( x ;x q ) = Y ( q ,x ) exp [ x ( b q + Ax ) ]

(3)

with –1

ì ü T Y ( q ,x ) = í å exp [ z ( bq + Ax ) ] ý îz Î W þ

(4)

where W is the set of all possible response vectors. 7.1.2

The Multidimensional Random Coefficients Multinomial Logit Model

The multidimensional form of the model is a straightforward extension of the model that assumes that a set of D traits underlie the individuals’ responses. The D latent traits define a D-dimensional latent space, and the individuals’ positions in the D-dimensional latent space are represented by the vector q = ( q 1 ,q 2 ,¼ ,q D ) . The scoring function of response category k in item i now corresponds to a D by 1 column vector rather than a scalar as in the unidimensional model. A response in category k in dimension d of item i is scored bikd. The scores across D dimensions can be collected into a colT umn vector b ik = ( b ik1 ,b ik2 ,¼ ,b ikD ) , again be collected into the scoring sub-matrix for T T T T T item i, B i = ( b i1 ,b i2 ,¼ ,b i ) , and then be collected into a scoring matrix B = ( B 1 ,B 2 ,¼B I ) for the whole test. If the item parameter vector, x, and the design matrix, A, are defined as they were in the unidimensional model, the probability of a response in category k of item i is modeled as ki

T

exp ( b ij q + a ij x ) Pr ( X ij = 1 ;A ,B ,x q ) = ----------------------------------------------------K i

å exp ( bik q + a

T

(5)

ik x )

k=1

And for a response vector we have:

f ( x ;x q ) = Y ( q ,x ) exp [ x¢ ( Bq + Ax ) ]

(6)

113

CHAPTER 7

with –1

ì ü T Y ( q ,x ) = í å exp [ z ( Bq + Ax ) ] ý îz Î W þ

(7)

The difference between the unidimensional model and the multidimensional model is that the ability parameter is a scalar, q, in the former, and a D by 1 column vector, q, in the latter. Likewise, the scoring function of response k to item i is a scalar, bik, in the former, whereas it is a D by 1 column vector, bik, in the latter. 7.2

THE POPULATION MODEL

The item response model is a conditional model in the sense that it describes the process of generating item responses conditional on the latent variable, q. The complete definition of the TIMSS model, therefore, requires the specification of a density, f q ( q ;a ) , for the latent variable, q. We use a to symbolize a set of parameters that characterize the distribution of q. The most common practice when specifying unidimensional marginal item response models is to assume that the students have been sampled from a normal population with mean m and variance s2. That is:

2

1 (q – m) f q ( q ;a ) º f q ( q ;m, s 2 ) = ---------------- exp – -----------------2 2 2s 2ps

(8)

q = m+E

(9)

or equivalently

where E ~ N(0,s2). Adams, Wilson, and Wu (1997) discuss how a natural extension of (8) is to replace the T mean, m, with the regression model, Y n b , where Yn is a vector of u fixed and known values for student n, and b is the corresponding vector of regression coefficients. For example, Yn could be constituted of student variables such as gender, socio-economic status, or major. Then the population model for student n becomes T

qn = Yn b + En

(10)

where we assume that the En are independently and identically normally distributed with mean zero and variance s2 so that (10) is equivalent to 1 – --2

T 1 T T f q ( q n ;Yn , b, s ) = ( 2ps ) exp – ---------2 ( q n – Y n b ) ( q n – Y n b ) 2s 2

2

T

(11)

a normal distribution with mean Y n b and variance s2. If (11) is used as the population model then the parameters to be estimated are b, s2, and x. 114

CHAPTER 7

The TIMSS scaling model takes the generalization one step further by applying it to the vector valued q rather than the scalar valued q, resulting in the multivariate population model

f q ( q n ;Wn , g , S ) = ( 2p )

D – ---2

S

1 – --2

1 T –1 exp – --- ( q n – gW n ) S ( q n – gW n ) 2

(12)

where g is a u ´ D matrix of regression coefficients, S is a D ´ D variance-covariance matrix and Wn is a u ´ 1 vector of fixed variables. If (12) is used as the population model then the parameters to be estimated are g, S, and x. In TIMSS we refer to the Wn variables as conditioning variables. 7.3

ESTIMATION

The ConQuest software uses maximum likelihood methods to provide estimates of g, S, and x. Combining the conditional item response model (6) and the population model (12) we obtain the unconditional or marginal response model

f x ( x ;x, g , S ) =

ò f x ( x ;x q ) f q ( q ;g , S ) dq

(13)

q

and it follows that the likelihood is N

L =

Õ

f x ( x n ;x, g , S )

(14)

n=1

where N is the total number of sampled students. Differentiating with respect to each of the parameters and defining the marginal posterior as

f x ( x n ;x q n ) f q ( q n ;W n, g , S ) h q ( q n ;W n x, g , S x n ) = ------------------------------------------------------------------f x ( x n ;W n x, g , S )

(15)

provides the following system of likelihood equations: N

A¢ å x n – ò E z ( z q n )h q ( q n ;Y n, x, g , S x n ) dq n = 0 n=1

(16)

qn

N æ N Tö æ Tö gˆ = ç å q n W n÷ ç å W n W n÷ èn = 1 ø èn = 1 ø

–1

(17)

and

1 Sˆ = ---N

N

å ò ( qn – g Wn ) ( qn – g Wn )

T

h q ( q n ;Y n, x, g , S x n ) dq n

(18)

n = 1 qn

115

CHAPTER 7

where

E z ( z q n ) = Y ( q n, x )

å z exp [ z¢ ( bqn + Ax ) ]

(19)

zÎW

and

qn =

ò qn hq ( qn ;Yn, x, g , S xn ) dqn

(20)

qn

The system of equations defined by (16), (17), and (18) is solved using an EM algorithm (Dempster, Laird, and Rubin, 1977) following the approach of Bock and Aitken (1981). 7.3.1

Quadrature and Monte Carlo Approximations

The integrals in equations (16), (17) and (18) are approximated numerically using either quadrature or Monte Carlo methods. In each case we define, Qp, p=1,...,P a set of P D-dimensional vectors (which we call nodes), and for each node we define a corresponding weight Wp (g,S). The marginal item response probability (13) is then approximated using P

f x ( x ;x, g , S ) =

å

f x ( x ;x Q p )W p ( g , S )

(21)

p=1

and the marginal posterior (15) is approximated using

f x ( x n ;x Q q )W q ( g , S ) h Q ( Q q ;W n, x, g , S x n ) = ------------------------------------------------------------P

å

(22)

f x ( x n ;x Q p )W p ( g , S )

p=1

for q=1,...,P. The EM algorithm then proceeds as follows: Step 1. Prepare a set of nodes and weights depending upon g(t) and S(t) the estimates of g and S at iteration t. Step 2. Calculate the discrete approximation of the marginal posterior density of qn given xn at iteration t using (t) (t)

h Q ( Q p ;W n, x g , S

(t)

(t)

(t)

(t)

f x ( x n ;x Q p )W p ( g , S ) x n ) = --------------------------------------------------------------------------P

å

f x ( x n ;x

(t)

(t)

(t)

Q p )W p ( g , S )

p=1

where x(t), g(t), S(t) and are estimates of x(t), g(t), and S(t) at iteration t.

116

(23)

CHAPTER 7

Step 3. Use a Newton-Raphson method to solve the following to produce estimates of xˆ ( t + 1 ) . N

P

A¢ å x n –

å Ez ( z Qr )hQ ( Qr ;Wn, x

n=1

(t)

(t)

,g ,S

(t)

xn ) = 0

(24)

r=1

Step 4. Estimate g(t+1) and S(t+1) using

gˆ

(t + 1)

æ N n Tö æ N Tö = ç å Q W n÷ ç å W n W n÷ èn = 1 ø èn = 1 ø

–1

(25)

and (t + 1) 1 Sˆ = ---N

N

P

å å ( Qr – g

(t + 1)

Wn ) ( Qr – g

(t + 1)

(t)

T

(t)

W n ) h Q ( Q r ;Y n, x , g , S

(t)

x n ) (26)

n = 1r = 1

where P

n

Q =

å Qr hQ ( Qr ;Wn, x

(t)

(t)

,g ,S

(t)

xn )

(27)

r=1

Step 5. Return to Step 1. The difference between the quadrature and Monte Carlo methods lies in the way the nodes and weights are prepared. For the quadrature case we begin by choosing a fixed set of Q points, (Qd1, Qd2,…,QdQ) for each latent dimension and then define a set of QD nodes that are indexed r = 1,…,QD, and are given by the Cartesian coordinates Q r = ( Q 1 j , Q 2 j , ¼, Q d j ) with j1 = 1,…Q; j2 = 1,…,Q; …;jd = 1,…,Q . 1

2

d

The weights are then chosen to approximate the continuous latent population density (12). That is,

W p = K ( 2p )

d – --2

S

1 – --2

1 T –1 exp – ---( Q p – g W n ) S ( Q p – g W n ) 2

(28)

where K is a scaling factor to ensure that the sum of the weights is 1. In the Monte Carlo case the nodes are drawn at random from the standard multivariate normal distribution and at each iteration the nodes are rotated using standard methods so that they become random draws from a multivariate normal distribution with mean gWn and variance S. In the Monte Carlo case the weight for all nodes is 1/P.

117

CHAPTER 7

For further information on the quadrature approach to estimating the model see Adams, Wilson, and Wang (1997), and for further information on the Monte Carlo method see Volodin and Adams (1997). In the TIMSS scaling the Bock-Aitken quadrature approach was used for unidimensional models and the Volodin Monte Carlo methods was used when scaling in high dimensions. 7.3.2

Latent Estimation and Prediction

The marginal item response (13) does not include parameters for the latent values qn and hence the estimation algorithm does not result in estimates of the latent values. For TIMSS, expected a posteriori estimates (EAP) of each student’s latent achievement was produced. The EAP prediction of the latent achievement for case n is P

EAP

qn

=

å Qr hQ ( Qr ;Wn, xˆ , gˆ , Sˆ xn )

(29)

r=1

Variance estimates for these predictions were estimated using P

EAP

var ( q n

) =

å ( Qr – qn

EAP

EAP

) ( Qr – qn

T

) h Q ( Q r ;W n xˆ , gˆ , Sˆ x n )

(30)

r=1

7.3.3

Drawing Plausible Values

Plausible values are random draws from the marginal posterior of the latent distribution, (15), for each student. For details on the use of plausible values the reader is referred to Mislevy (1991) and Mislevy et al. (1992). Unlike previously described methods for drawing plausible values (Beaton, 1987; Mislevy et al., 1992) ConQuest does not assume normality of the marginal posterior distributions. Recall from (15) that the marginal posterior is given by

f x ( x n ;x q n ) f q ( q n ;W n, g , S ) h q ( q n ;W n, x, g , S x n ) = ------------------------------------------------------------------ò f x ( x ;x q ) f q ( q, g , S ) dq

(31)

q

M

The ConQuest procedure begins drawing M vector valued random deviates, { j nm } m = 1 from the multivariate normal distribution fq(qn, Wng,S) for each case n. These vectors are used to approximate the integral in the denominator of (31) using the Monte Carlo integration

1 ò f x ( x ;x q ) f q ( q, g , S ) dq » ---M q

118

M

å

m=1

f x ( x ;x j mn ) º Á

(32)

CHAPTER 7

At the same time the values

p mn = f x ( x n ;x j mn ) f q ( j mn ;W n, g , S )

(33)

p mnö M are calculated, so that we obtain the set of pairs æ j nm, ------which can be used as an è Á øm=1 approximation to the posterior density (31). The probability that jnj could be drawn from this density is given by

p mn q nj = -----------------M

å

(34)

p mn

m=1

At this point, L uniformly distributed random numbers, { h i } i = 1 , are generated and for each random draw the vector j ni that satisfies the condition L

0

i0 – 1

i0

s=1

s=1

å qsn < hi £ å qsn

(35)

is selected as a plausible vector. 7.4

SCALING STEPS

The item response model described above was fit to the data in two steps. In the first step the items were calibrated using a subsample of students drawn from the samples of the participating countries. These samples were called the international calibration samples. In a second step the model was fit separately for each country with the item parameters fixed at values estimated in the first step. There were three principal reasons for using an international calibration sample for estimating international item parameters. First, it seemed unnecessary to estimate parameters using the complete data set; second, drawing equal-sized subsamples from each country for inclusion in the international calibration sample ensured that each country was given equal weight in the estimation of the international parameters; and third, the drawing of appropriately weighted samples meant that weighting would not be necessary in the international scaling runs. 7.4.1

Drawing the International Calibration Sample

At the time when the international scaling of the data commenced the TIMSS database of item response data contained information from 25 Population 1 countries and 39 Population 2 countries. Those countries are listed in Table 7.1. For each target population, samples of 600 tested students were selected from the database for each participating country. This generally lead to roughly equal samples from each target grade. For Israel, where only the upper grade was tested, the sample size was reduced to 300 tested students. The sampled students were selected using a probability-proportional-to-size systematic selection method. The overall sampling 119

CHAPTER 7

weights were used as measures of size for this purpose. This resulted in equal selection probabilities, within national samples, for the students in the calibration samples. The Population 1 and 2 international calibration samples contained 14,700 and 23,100 students, respectively. Table 7.1 Countries Included in the International Item Calibration

Population 11

Population 2 2

Australia Austria Canada Cyprus

Australia Austria Belgium (Flemish) Belgium (French)

Czech Republic England Greece Hong Kong Hungary Iceland Iran Ireland Israel* Japan Korea Latvia Mexico Netherlands New Zealand Norway Portugal Scotland Singapore Slovenia United States

Bulgaria Canada Colombia Cyprus Czech Republic Denmark England France Germany Greece Hong Kong Hungary Iceland Iran Ireland Israel* Japan Korea Latvia Lithuania Mexico

Netherlands New Zealand Norway Portugal Romania Russian Federation Scotland Singapore Slovak Republic Slovenia Spain Sweden Switzerland United States

*A sample of 600 students was drawn from each country, excepting for Israel where only 300 students where drawn

because Israel sampled students from only the higher of the two grade levels. Note: Mexico's data was used to estimate the international item parameters, although Mexico subsequently withdrew its results from the international reports. Although results for Kuwait, the Philippines, and South Africa were reported in the international reports, their data were not used to estimate the international parameters. 1

Third and fourth grades in most countries.

2

Seventh and eighth grades in most countries.

7.4.2

International Scaling Results

Tables 7.2, 7.3, 7.4, and 7.5 show the basic statistics that resulted from international scaling for mathematics and science at Populations 1 and 2. The number of respondents shown for each item is the number of cases that were considered valid for calibration purposes. There are two reasons why this value is not equal to the total number of students in the calibration samples. First, the test rotation design was such that only items in cluster A were administered to all students (see Adams and Gonzalez, 1996), 120

CHAPTER 7

and second, items to which students did not respond because there were deemed to be “not reached” were treated as missing data in the calibration phase of the analysis. The percent correct figures that are reported were computed by summing the total of the scores achieved by all students who provided valid responses and dividing that by the number of students multiplied by the maximum score that could be achieved for that item; for most but not all items, the maximum possible scores was one. The difficulty estimate and asymptotic errors are in the logit metric, which is the natural metric for the ConQuest scaling software. The mean square fit statistic is an index of the fit of the data to the assumed scaling model; the statistic used here was derived by Wu (1997). Under the null hypothesis that the data and model are consistent, the expected value of these statistics is one. Values that are less than one are usually associated with items that have greater than average discrimination, while values that are greater than one may result from lower than average discrimination, guessing, or some other deviation from the model. 7.4.3

Fit of the Scaling Model

Tables 7.1 and 7.2 show the international item statistics and parameter estimates for Population 1 mathematics and science, respectively. Table 7.3 and 7.4 show the corresponding information for Population 2. The mean square fit statistics reported in Tables 7.2, 7.3, 7.4, and 7.5 show that the vast majority of items fit the Rasch model very well. Items with mean squares greater than one in Population 1 mathematics were B06, H05, I08, K07, M07, U01, and V04a. The reasons for the misfit of these items vary. For item B06, misfit is caused by the fact that the item does not discriminate as well as the other items. This may be seen in Figure 7.1 showing the modeled and empirical item characteristic curves for this item. For item H05, the modeled and empirical item characteristics curves are shown in Figure 7.2. There appear to be two reasons for this misfit of this item: first, it is slightly less discriminating than was assumed by the model; but second, interestingly, some students in the middle of the latent ability distribution did not perform as well as was expected, and these students would receive considerable weight in the estimation of the weighted mean square. Item K07 (Figure 7.3) was amongst the most difficult items, and it was multiple choice, so it is not surprising that some students are likely to have attempted to guess the correct response. A closer review of the item shows that one of the distracters proved to be attractive to some higher-achieving students – in fact the point-biserial for the distracter is positive for quite a few countries. This item survived the review process because of a policy decision to retain as many items as possible. Items M07, U01, and V04a all misfit because they had a slightly lower than modeled discrimination. The items that had mean square statistics less than one were all found to be more discriminating than was modeled. Misfit of this form is not usually deemed to be of concern. Interestingly, however, the majority of the most discriminating items are shortanswer or extended-response type. This may well be due to the fact that it is unlikely that students would have guessed the answers to these questions.

121

CHAPTER 7 Table 7.2 Population 1 Mathematics: Item Statistics and Parameter Estimates for the International Calibration Sample

122

Item Label

Number of Respondents in International Calibration Sample

Percentage of Correct Responses

Difficulty Estimate in Logit Metric

Asymptotic Standard Error in Logit Metric

Mean Square Fit Statistic

ASMMA01 ASMMA02 ASMMA03 ASMMA04 ASMMA05 ASMMB05 ASMMB06 ASMMB07 ASMMB08 ASMMB09 ASMMC01 ASMMC02 ASMMC03 ASMMC04 ASMMD05 ASMMD06 ASMMD07 ASMMD08 ASMMD09 ASMME01 ASMME02 ASMME03 ASMME04 ASMMF05 ASMMF06 ASMMF07 ASMMF08 ASMMF09 ASMMG01 ASMMG02 ASMMG03 ASMMG04 ASMMH05 ASMMH06 ASMMH07 ASMMH08 ASMMH09 ASMMI01 ASMMI02 ASMMI03 ASMMI04 ASMMI05 ASMMI06 ASMMI07 ASMMI08 ASMMI09 ASMMJ01

14637 14627 14618 14603 14571 7283 7268 7259 7247 7240 5460 5447 5441 5437 5463 5451 5438 5428 5411 5439 5420 5425 5413 5471 5459 5451 5445 5437 5181 5170 5152 5365 5434 5422 5407 5394 5383 1864 1859 1859 1859 1859 1858 1858 1854 1853 1814

79.78 51.47 58.61 78.33 79.29 60.66 48.98 40.52 82.03 56.66 80.44 64.79 48.87 75.19 74.81 51.73 83.41 55.56 47.92 63.28 71.25 59.54 31.98 66.02 59.11 59.70 60.39 68.81 36.33 69.83 87.36 47.83 45.99 48.51 66.08 64.29 65.11 49.79 31.47 53.68 81.33 46.26 69.48 54.36 49.30 61.09 84.45

-1.345 0.218 -0.131 -1.242 -1.307 -0.233 0.343 0.761 -1.511 -0.029 -1.401 -0.448 0.350 -1.040 -1.007 0.213 -1.609 0.031 0.405 -0.373 -0.807 -0.177 1.216 -0.515 -0.161 -0.189 -0.223 -0.662 0.987 -0.707 -1.990 0.414 0.464 0.345 -0.518 -0.422 -0.464 0.306 1.239 0.118 -1.461 0.480 -0.697 0.086 0.334 -0.247 -1.768

0.022 0.018 0.018 0.022 0.022 0.026 0.026 0.026 0.033 0.026 0.037 0.031 0.030 0.034 0.034 0.030 0.039 0.030 0.030 0.031 0.033 0.031 0.032 0.031 0.030 0.030 0.030 0.032 0.032 0.033 0.044 0.030 0.030 0.030 0.031 0.031 0.031 0.051 0.055 0.051 0.064 0.051 0.055 0.051 0.051 0.052 0.069

1.03 1.08 1.01 0.95 1.06 0.99 1.13 1.06 0.94 0.99 0.89 0.96 0.99 0.90 0.93 1.09 0.94 0.99 1.01 0.98 0.90 0.96 1.10 0.97 1.09 0.99 1.04 1.03 1.07 1.06 0.93 1.01 1.15 1.06 1.01 0.99 1.00 1.01 1.07 1.03 0.90 0.98 0.97 0.89 1.17 1.00 0.94

CHAPTER 7

Table 7.2 Population 1 Mathematics: Item Statistics and Parameter Estimates for the International Calibration Sample (Continued 1) Item Label


ASMMJ02 ASMMJ03 ASMMJ04 ASMMJ05 ASMMJ06 ASMMJ07 ASMMJ08 ASMMJ09 ASMMK01 ASMMK02 ASMMK03 ASMMK04 ASMMK05 ASMMK06 ASMMK07 ASMMK08 ASMMK09 ASSML01 ASMML02 ASMML03 ASMML04 ASMML05 ASMML06 ASMML07 ASMML08 ASMML09 ASMMM01 ASSMM02 ASMMM03 ASSMM04 ASMMM05 ASMMM06 ASMMM07 ASMMM08 ASMMM09 ASEMS01 ASSMS02 ASEMS03 ASSMS04 ASSMS05 ASEMT01 ASEMT01 ASSMT02 ASSMT03 ASEMT04a ASEMT04b ASSMT05

1811 1728 1804 1724 1799 1796 1793 1783 1803 1797 1796 1718 1787 1784 1779 1771 1764 1791 1789 1714 1784 1782 1705 1772 1767 1761 1829 1826 1819 1811 1805 1798 1788 1779 1778 3501 3356 3297 3096 3016 3336 3266 3328 3257 3082 2984 3033

Percentage of Correct Responses 60.41 68.34 39.36 36.31 66.54 53.73 44.28 71.34 62.17 77.13 45.38 44.88 73.81 58.18 22.60 68.83 35.43 42.16 45.95 83.84 66.70 38.61 47.39 41.25 28.13 59.68 73.32 47.21 58.71 36.33 36.95 65.46 36.86 84.54 65.75 43.32 59.06 28.45 48.87 52.06 70.92 40.55 41.17 45.96 18.23 12.40 62.45

Difficulty Estimate in Logit Metric -0.237 -0.698 0.831 0.966 -0.555 0.112 0.588 -0.817 -0.287 -1.147 0.560 0.628 -0.928 -0.069 1.848 -0.629 1.088 0.699 0.515 -1.638 -0.512 0.886 0.423 0.755 1.459 -0.144 -0.905 0.415 -0.133 0.953 0.923 -0.463 0.934 -1.664 -0.472 0.600 -0.068 1.280 0.496 0.360 -0.734 0.760 0.827 0.601 2.231 2.771 -0.179



0.053 0.057 0.054 0.056 0.055 0.053 0.053 0.058 0.054 0.061 0.053 0.054 0.059 0.053 0.062 0.056 0.055 0.053 0.052 0.070 0.055 0.053 0.053 0.053 0.058 0.053 0.057 0.051 0.052 0.053 0.053 0.054 0.053 0.069 0.054 0.024 0.039 0.026 0.040 0.040 0.042 0.025 0.039 0.039 0.050 0.059 0.041

1.07 0.85 0.97 1.05 1.08 0.97 0.99 0.99 1.07 0.98 0.98 0.93 1.07 0.99 1.18 0.91 1.04 0.85 1.01 0.97 0.94 1.07 1.06 0.89 0.96 1.02 1.04 0.91 1.03 1.01 1.13 0.95 1.15 0.92 0.93 1.01 0.86 0.95 0.91 1.08 0.85 0.94 0.94 0.92 0.91 0.89 0.99

123

CHAPTER 7

Table 7.2

Population 1 Mathematics: Item Statistics and Parameter Estimates for the International Calibration Sample (Continued 2)

Item Label






ASEMU01 ASSMU02 ASEMU03a ASEMU03b ASEMU03c ASSMU04 ASSMU05 ASEMV01 ASSMV02 ASSMV03 ASEMV04a ASEMV04b ASSMV05

3483 3418 3323 3274 3250 3237 3152 3486 3438 3347 3305 3232 3104

53.37 51.20 58.47 40.16 75.75 54.71 80.43 37.74 41.97 60.86 49.26 45.39 47.29

0.209 0.300 -0.035 0.877 -0.975 0.167 -1.277 0.872 0.711 -0.188 0.390 0.578 0.515

0.023 0.038 0.039 0.039 0.044 0.039 0.048 0.026 0.038 0.039 0.030 0.039 0.039

1.19 1.10 0.84 0.83 0.98 0.94 0.95 1.04 0.87 0.88 1.16 0.90 0.99

Probability

Figure 7.1 Empirical and Modelled Item Characteristic Curves for Mathematics Population 1 Item: ASMMB06. Fit MNSQ=1.13 1.00+---------- --------------------------------------------------+ | | | | | .| 0.90| | ...|.. | Legend: | |... X | | ...| | X = empirical 0.80| |.. | | • = modelled | ..| | | | .. X | 0.70| |. X | | ..| | | |.. X | 0.60| | | | ..X | | | .. | 0.50| X. | | .| | | .. | 0.40| X. | | X .| | | .. | | 0.30| |.. | | X .| | | |... | | 0.20| .| | | X .... | | | .|.. | Item is slightly less discriminating 0.10|. | | than was modelled | | | | 0.00| | +--+-----+-----+-----+-----+-----+-----+-----+-----+-----+---+ -1.96 -1.47 -0.99 -0.50 -0.02 0.47 0.95 1.44 1.92 2.41

Theta

124

CHAPTER 7

Probability

Figure 7.2

Empirical and Modelled Item Characteristic Curves for Mathematics Population 1 Item: ASEMH05. Fit MNSQ=1.15

1.00+------------------------------------------------------------+ | | | | Legend: 0.90| | | | | X = empirical | |...| • = modelled 0.80| ..| | | | .. X | | |.. | | 0.70| .| | | | .. X | | |.. | 0.60| .| | | .. X | | | . | 0.50| |. | | ..| | | | . X | 0.40| |. | | ..| | | | .. | 0.30| X. X | | X ..| | | |.. | 0.20| X ..| | | X |... | | Item is slightly less discriminating than | | ...| | was modelled and there is a strange 0.10| X ...|. | dip in the middle of the observed ICC. |..|.... | | | | | 0.00| | +--+-----+-----+-----+-----+-----+-----+-----+-----+-----+---+ -1.96 -1.47 -0.99 -0.50 -0.02 0.47 0.95 1.44 1.92 2.41

Theta

Probability

Figure 7.3

Empirical and Modelled Item Characteristic Curves for Mathematics Population 1 Item: ASMMK07. Fit MNSQ=1.18

1.00+------------------------------------------------------------+ | | | | Legend: 0.90| | | | X = empirical | | • = modelled 0.80| | | .| | .. | 0.70| .. | | .. | | .. | 0.60| . | | .. | | | .. | 0.50| | . | | |. | | ..| | 0.40| | . | | | |.. | | .| X | 0.30| | .. | | | |. | | X ...X | 0.20| X X |.. | X | | X | ...| | | | .|. | | Item was quite difficult and there is 0.10| X | .|.... | | evidence of guessing by lower | | .....|... | | scoring students. |..|.. | | 0.00| | | +--+-----+-----+-----+-----+-----+-----+-----+-----+-----+---+ -1.96 -1.47 -0.99 -0.50 -0.02 0.47 0.95 1.44 1.92 2.41

Theta

125

CHAPTER 7

For Population 1 science there are fewer misfitting items than for mathematics. The worst-fitting items are G08, H04, O05, and R03. Item G08 (Figure 7.4) was a difficult item and its misfit may be caused by guessing, while the misfit of H04 is caused by lower than modeled discrimination. An examination of the country-level data for these items shows that both have distracters that have positive point-biserials in a number of countries. The misfit of items O05 and R03, which is illustrated in Figures 7.5 and 7.6, is more difficult to explain – both of these items performed quite well in each of the participating countries. Item O05 does show some evidence of lower than expected discrimination, but there is also a large “blip” in the observed percentage correct for student in the second-lowest ability grouping. Item R03 has fewer than expected students at the upper achievement levels. An examination of the item-by-country interactions shows that students in the countries that had high average scores found this item more difficult than expected. Notably, this was the case in Japan, Korea, Singapore and Hong Kong, while the item was easier than expected in Slovenia, Hungary, and the Czech Republic.

126

CHAPTER 7

Table 7.3

Item Label

ASMSA06 ASMSA07 ASMSA08 ASMSA09 ASMSA10 ASMSB01 ASMSB02 ASMSB03 ASMSB04 ASMSC05 ASMSC06 ASMSC07 ASMSC08 ASMSC09 ASMSD01 ASMSD02 ASMSD03 ASMSD04 ASMSE05 ASMSE06 ASMSE07 ASSSE08 ASMSE09 ASMSF01 ASMSF02 ASMSF03 ASMSF04 ASMSG05 ASMSG06 ASMSG07 ASMSG08 ASMSG09 ASMSH01 ASMSH02 ASSSH03 ASMSH04 ASMSN01 ASMSN02 ASMSN03 ASMSN04 ASMSN05 ASMSN06 ASMSN07 ASMSN08 ASMSN09 ASMSO01 ASMSO02

Population 1 Science: Item Statistics and Parameter Estimates for the International Calibration Sample Number of Respondents in International Calibration





14554 14516 14501 14478 14454 7312 7298 6987 6975 5429 5420 5410 5387 5380 5282 5483 5479 5474 5411 5401 5399 5364 5349 5507 5495 5491 5481 5356 5354 5346 5338 5323 5460 5449 5438 5434 1862 1860 1858 1855 1850 1849 1842 1838 1832 1829 1825

84.15 82.86 62.02 66.13 74.32 70.69 72.06 68.33 57.49 68.06 91.09 43.51 81.60 48.83 58.54 68.19 88.37 72.84 61.50 92.33 57.47 73.60 43.30 84.60 65.13 61.28 68.87 79.41 86.33 70.61 26.60 86.02 77.34 83.56 39.48 42.69 38.56 69.68 40.90 32.40 70.81 40.02 51.09 58.60 59.55 42.81 38.03

-1.546 -1.439 -0.197 -0.404 -0.856 -0.658 -0.734 -0.524 0.029 -0.494 -2.248 0.694 -1.322 0.448 -0.026 -0.512 -1.944 -0.770 -0.193 -2.467 0.004 -0.832 0.678 -1.610 -0.375 -0.181 -0.569 -1.191 -1.743 -0.649 1.548 -1.709 -1.059 -1.506 0.870 0.717 0.933 -0.581 0.820 1.246 -0.642 0.866 0.346 -0.008 -0.052 0.676 0.912

0.024 0.023 0.018 0.019 0.020 0.028 0.028 0.028 0.026 0.031 0.049 0.030 0.037 0.029 0.030 0.031 0.044 0.033 0.030 0.053 0.030 0.033 0.030 0.039 0.031 0.030 0.031 0.036 0.042 0.032 0.033 0.042 0.034 0.039 0.030 0.030 0.051 0.054 0.051 0.053 0.055 0.051 0.050 0.051 0.051 0.051 0.052

0.97 0.96 1.02 1.00 0.96 0.99 0.96 0.98 1.08 1.01 0.97 1.05 0.97 1.06 0.96 1.01 0.95 1.02 0.99 0.98 0.96 0.98 1.07 0.99 1.06 0.97 0.93 0.98 1.03 0.99 1.14 0.94 1.03 0.93 0.96 1.21 1.03 0.94 1.10 1.10 1.05 1.01 1.03 1.05 0.99 1.03 1.07

127

CHAPTER 7

Table 7.3

Item Label

ASMSO03 ASMSO04 ASMSO05 ASSSO06 ASMSO07 ASMSO08 ASSSO09 ASMSP01 ASMSP02 ASMSP03 ASSSP04 ASMSP05 ASMSP06 ASMSP07 ASMSP08 ASMSP09 ASMSQ01 ASMSQ02 ASMSQ03 ASSSQ04 ASMSQ05 ASMSQ06 ASMSQ07 ASSSQ08 ASMSQ09 ASSSR01 ASMSR02 ASMSR03 ASMSR04 ASMSR05 ASMSR06 ASMSR07 ASMSR08 ASMSR09 ASESW01 ASSSW02 ASSSW03 ASSSW04 ASESW05 ASESW05 ASESX01 ASSSX02 ASESX03 ASSSX04 ASMSX05 ASESY01 ASESY02

128

Population 1 Science: Item Statistics and Parameter Estimates for the International Calibration Sample (Continued 1) Number of Respondents in International Calibration





1823 1820 1815 1808 1808 1804 1783 1780 1780 1777 1773 1768 1698 1765 1763 1759 1850 1845 1841 1834 1824 1822 1819 1808 1795 1830 1814 1805 1797 1790 1776 1765 1670 1734 3512 3431 3218 3143 2900 2747 3581 3557 3471 3397 3323 3399 3258

62.97 66.76 54.16 54.31 71.35 51.39 38.70 83.99 84.16 32.86 31.64 45.76 33.69 39.60 53.66 42.30 60.65 63.58 49.48 58.34 69.19 41.93 40.74 46.13 52.70 18.80 39.14 55.62 73.46 56.09 39.92 55.30 53.77 44.87 55.25 38.41 26.51 50.81 52.14 36.77 66.23 73.35 42.64 82.31 59.43 27.83 66.73

-0.281 -0.473 0.154 0.153 -0.708 0.297 0.909 -1.472 -1.480 1.254 1.323 0.630 1.214 0.929 0.269 0.807 -0.071 -0.211 0.456 0.044 -0.494 0.812 0.870 0.617 0.318 2.106 0.944 0.173 -0.735 0.149 0.908 0.190 0.242 0.678 0.178 0.989 1.658 0.458 0.440 1.188 -0.721 -0.787 0.736 -1.344 -0.009 1.519 -0.353

0.052 0.054 0.051 0.051 0.056 0.051 0.052 0.068 0.068 0.054 0.055 0.052 0.055 0.052 0.052 0.052 0.051 0.052 0.050 0.051 0.054 0.051 0.051 0.051 0.051 0.063 0.052 0.051 0.057 0.051 0.052 0.051 0.053 0.052 0.026 0.038 0.043 0.038 0.040 0.042 0.031 0.041 0.025 0.048 0.038 0.041 0.040

0.96 1.04 1.15 0.94 1.08 1.04 0.93 0.92 0.89 1.07 1.01 1.00 1.01 0.99 0.95 1.02 0.95 1.07 1.00 0.92 1.02 1.06 1.03 0.97 1.00 1.03 1.04 1.15 0.89 0.99 1.07 0.96 1.09 1.07 1.09 0.95 1.00 0.88 0.95 0.95 0.90 0.92 0.97 0.93 1.07 0.91 0.96

CHAPTER 7

Table 7.3

Item Label

ASESY02 ASESY03 ASESY03 ASESZ01 ASESZ01 ASESZ02 ASESZ03

Probability

Figure 7.4





3126 3021 2884 3479 3406 3390 3361

39.38 65.54 45.67 58.47 20.35 63.54 49.02

0.985 -0.231 0.725 0.026 1.996 -0.203 0.485

0.039 0.041 0.040 0.037 0.045 0.038 0.024

Mean Square Fit Statistic 1.01 0.94 0.95 0.94 1.02 0.94 0.99

Empirical and Modelled Item Characteristic Curves for Science Population 1 Item: ASMSG08. Fit MNSQ=1.14

1.00+------------------------------------------------------------+ | | | | The shape of this ICC suggests the 0.90| | presence of some guessing. While this | | is certainly one of the reasons for the | | misfit of this item, an examination of 0.80| X | the data country by country shows that | ..| one of the distractors is consistently | .. | attracting some more able students. 0.70| | ... | | | .. | | |. | 0.60| ..| | | | .. | | | |. | | 0.50| .| | | | .. | | | |. X | 0.40| ..| | | |.. | | .| | 0.30| .. | | | |.. X X | | X X ...| | 0.20| X |.. X | Legend: | X | ...| | | | ...|. | X = empirical 0.10| | ....|. | | • = modelled |..|.... | | | | | 0.00| | +--+-----+-----+-----+-----+-----+-----+-----+-----+-----+---+ -2.26 -1.74 -1.23 -0.71 -0.19 0.32 0.84 1.36 1.87 2.39

Theta

129

CHAPTER 7

Probability

Figure 7.5

Empirical and Modelled Item Characteristic Curves for Science Population 1 Item: ASMSO05. Fit MNSQ=1.15

1.00+------------------------------------------------------------+ | | | | | Legend: 0.90| | .| | | ..|.. | X = empirical | | ... | | • = modelled 0.80| | ..|. | | | |... | X | | | .| | | 0.70| |... | X | | .| X | | | .. X | 0.60| |. | | | | ..X | | |... | | 0.50| .X | | X .. | | | X |. | | 0.40| X ..| | | | .. | | | | |. | 0.30| | ...| | | X |. | | | | ...| | 0.20| | .... | | | .|. | | Some evidence of lack of |. | | discrimination and an unusual "blip" 0.10| | | at the lower end. | | | | 0.00| | +--+-----+-----+-----+-----+-----+-----+-----+-----+-----+---+ -2.13 -1.68 -1.22 -0.76 -0.30 0.16 0.62 1.07 1.53 1.99

Theta

Probability

Figure 7.6

Empirical and Modelled Item Characteristic Curves for Mathematics Population 1 Item: ASMSR03. Fit MNSQ=1.15

1.00+------------------------------------------------------------+ | | | | | 0.90| | .| Legend: | | ..|.. | | |.... | | X = empirical 0.80| | ..| | | • = modelled | |... | | | | ..| | | 0.70| |.. | X X | | .| | | |... | X | 0.60| X .X X | | | .. | | | |.. | | 0.50| X .| | | | .. | | | X |. | | 0.40| | ..| | | X | .. | | | | .|. | 0.30| | .. | | | .|. | | | X ... | | 0.20| | ... | | Students at the upper achievement |..|. | | levels are not scoring as well as | | | expected. 0.10| | | | | | | 0.00| | +--+-----+-----+-----+-----+-----+-----+-----+-----+-----+---+ -2.13 -1.68 -1.22 -0.76 -0.30 0.16 0.62 1.07 1.53 1.99

Theta

130

CHAPTER 7

The fit of the items for Population 2 mathematics is quite acceptable, although it is the least favorable of the four data sets. There are eight items with fit that is less than or equal to 0.85 – six of them short-answer or extended-response – and ten items with a fit greater than 1.15 – nine of them multiple-choice. For the items that have weighted fit mean squares greater than 1.15 the reason for that misfit is quite varied. Items I03, J18, L11, and N17 are all relatively difficult multiple-choice questions and exhibit evidence of guessing. As with the questions that showed elements of guessing characteristics in the Population 1 data sets, each of these items has a distracter that had a positive point-biserial in a large number of countries. Items L11 and N17 also showed bad fit in a number of countries. Item N16 is the only item with fit above 1.15, where it is reasonably clear that the misfit is due to the item having lower than modeled discrimination. The misfit for items B07, D10, and N15, which cannot be easily characterized, is illustrated in Figure 7.8, which shows the observed and expected item characteristic curves for item D10. Examining this item at the country level we note that in a number of countries it has a distracter with a positive point-biserial. This distracter has probably attracted some of the more able students, resulting in the empirical item characteristic curve being lower than the modeled curve for students toward the upper end of the achievement distribution. Plots for items B07, N15, and P09 show a similar pattern, but in examining the data we have not been able to find an explanation for the unusual shape of the observed item characteristic curve.

131

CHAPTER 7

Table 7.4

Item Label

BSMMA01 BSMMA02 BSMMA03 BSMMA04 BSMMA05 BSMMA06 BSMMB07 BSMMB08 BSMMB09 BSMMB10 BSMMB11 BSMMB12 BSMMC01 BSMMC02 BSMMC03 BSMMC04 BSMMC05 BSMMC06 BSMMD07 BSMMD08 BSMMD09 BSMMD10 BSMMD11 BSMMD12 BSMME01 BSMME02 BSMME03 BSMME04 BSMME05 BSMME06 BSMMF07 BSMMF08 BSMMF09 BSMMF10 BSMMF11 BSMMF12 BSMMG01 BSMMG02 BSMMG03 BSMMG04 BSMMG05 BSMMG06 BSMMH07 BSMMH08 BSMMH09 BSMMH10 BSMMH11

132

Population 2 Mathematics: Item Statistics and Parameter Estimates for the International Calibration Sample Number of Respondents in International Calibration 23039 23036 22437 23033 23032 23026 11400 11389 11383 11377 11372 11366 8671 8670 8667 8661 8660 8654 8773 8761 8756 8753 8743 8740 8715 8710 8705 8698 8687 8679 8615 8606 8602 8596 8398 8583 8638 8633 8631 8629 8622 8621 8581 8575 8570 8564 8549




-0.127 -1.120 -0.359 0.062 -0.217 -1.225 -0.421 -0.754 -0.148 0.258 -0.451 -0.496 -0.158 -1.196 -0.313 0.009 0.082 -0.883 -0.345 -0.610 -0.649 0.499 -1.906 -0.957 -0.791 0.492 -0.096 -0.590 0.056 0.693 1.243 -0.253 -0.546 0.052 0.444 0.007 0.072 -1.192 0.234 -0.682 -0.202 0.686 -0.613 -1.046 -1.837 0.559 -0.297

0.015 0.017 0.015 0.015 0.015 0.017 0.022 0.022 0.021 0.021 0.022 0.022 0.024 0.027 0.024 0.024 0.024 0.026 0.024 0.025 0.025 0.024 0.032 0.026 0.026 0.024 0.024 0.025 0.024 0.024 0.026 0.024 0.025 0.024 0.024 0.024 0.024 0.027 0.024 0.025 0.024 0.025 0.025 0.027 0.032 0.024 0.025

Mean Square Fit Statistic 0.87 1.02 0.97 1.03 1.07 1.05 1.01 1.16 1.08 0.85 0.95 0.91 0.96 0.95 0.94 1.14 1.02 1.03 1.02 1.02 0.86 1.16 1.02 1.04 1.00 0.99 1.00 0.95 0.97 0.95 1.03 1.15 0.99 1.01 0.89 1.07 1.15 0.98 1.02 0.95 0.94 1.01 1.04 0.92 0.96 0.97 0.97

CHAPTER 7

Table 7.4


Item Label

Number of Respondents in International Calibration

BSMMH12 BSMMI01 BSMMI02 BSMMI03 BSSMI04 BSMMI05 BSSMI06 BSMMI07 BSMMI08 BSMMI09 BSMMJ10 BSMMJ11 BSSMJ12 BSSMJ13 BSMMJ14 BSMMJ15 BSMMJ16 BSMMJ17 BSMMJ18 BSMMK01 BSSMK02 BSMMK03 BSMMK04 BSSMK05 BSMMK06 BSMMK07 BSMMK08 BSMMK09 BSMML08 BSMML09 BSMML10 BSMML11 BSMML12 BSMML13 BSMML14 BSMML15 BSSML16 BSMML17 BSMMM01 BSMMM02 BSMMM03 BSMMM04 BSMMM05 BSSMM06 BSMMM07 BSSMM08 BSMMN11

8536 2884 2883 2882 2880 2877 2876 2877 2871 2872 2937 2865 2929 2928 2930 2926 2924 2916 2913 2958 2958 2956 2957 2956 2956 2955 2955 2952 2857 2857 2855 2854 2855 2854 2852 2845 2844 2745 2832 2831 2830 2830 2768 2828 2827 2827 2831




-0.996 1.008 -0.070 0.738 0.597 -0.981 -1.217 -0.464 0.666 -0.516 0.661 0.405 0.624 -1.576 0.536 -0.486 0.126 -0.585 0.631 -0.804 -0.549 -0.644 0.772 0.851 0.741 0.147 1.134 0.297 -0.168 -1.805 -2.024 1.146 -0.976 -2.332 1.735 0.959 0.868 0.379 -1.887 -0.410 -1.142 0.868 0.316 1.084 -0.878 0.427 -1.600

0.027 0.044 0.042 0.043 0.042 0.046 0.048 0.043 0.043 0.043 0.042 0.042 0.042 0.051 0.041 0.042 0.041 0.043 0.042 0.044 0.042 0.043 0.042 0.043 0.042 0.041 0.044 0.041 0.042 0.055 0.059 0.044 0.046 0.065 0.049 0.044 0.043 0.043 0.057 0.043 0.048 0.043 0.042 0.044 0.046 0.042 0.053


133

CHAPTER 7

Table 7.4

134


Item Label

Number of Respondents in International Calibration

BSMMN12 BSSMN13 BSMMN14 BSMMN15 BSMMN16 BSMMN17 BSMMN18 BSSMN19 BSMMO01 BSMMO02 BSMMO03 BSMMO04 BSMMO05 BSSMO06 BSMMO07 BSMMO08 BSSMO09 BSMMP08 BSMMP09 BSMMP10 BSMMP11 BSMMP12 BSMMP13 BSMMP14 BSMMP15 BSSMP16 BSMMP17 BSMMQ01 BSMMQ02 BSMMQ03 BSMMQ04 BSMMQ05 BSMMQ06 BSMMQ07 BSMMQ08 BSMMQ09 BSSMQ10 BSMMR06 BSMMR07 BSMMR08 BSMMR09 BSMMR10 BSMMR11 BSMMR12 BSSMR13 BSSMR14 BSEMS01

2829 2825 2821 2822 2746 2728 2799 2782 2881 2879 2881 2808 2881 2879 2879 2877 2876 2765 2764 2757 2754 2752 2740 2736 2730 2720 2674 2784 2778 2776 2774 2770 2762 2752 2746 2683 2728 2786 2785 2784 2783 2783 2783 2781 2779 2778 2829


Difficulty Estimate in Logit Metric -0.522 0.439 -0.593 -0.492 0.482 0.819 0.048 0.253 -0.015 1.589 0.489 0.555 0.562 -0.793 -0.694 -0.595 0.383 -0.005 0.895 0.037 0.044 -0.809 -0.636 -1.262 -0.401 1.110 -1.798 0.651 0.353 1.104 -1.777 -0.549 0.810 -0.199 0.556 0.238 0.560 -1.098 0.516 0.312 0.734 0.155 0.529 -1.954 1.167 0.892 -1.273



0.044 0.042 0.044 0.044 0.043 0.044 0.042 0.042 0.042 0.047 0.042 0.043 0.042 0.045 0.044 0.044 0.042 0.043 0.044 0.043 0.043 0.046 0.045 0.050 0.044 0.045 0.057 0.043 0.043 0.045 0.056 0.044 0.044 0.043 0.043 0.043 0.043 0.047 0.043 0.042 0.043 0.042 0.043 0.058 0.045 0.044 0.049

1.10 0.90 0.96 1.19 1.19 1.22 0.99 0.79 0.98 0.92 0.99 1.08 0.85 0.95 1.01 1.02 0.84 0.98 1.16 0.96 1.15 1.00 0.95 0.99 0.98 0.90 1.06 1.07 1.09 1.05 1.01 1.03 1.01 0.96 0.86 0.99 1.00 1.01 0.99 1.10 0.94 1.02 1.04 0.95 0.91 0.82 1.05

CHAPTER 7

Table 7.4

Item Label

BSEMS01 BSEMS02 BSEMS02 BSEMS02 BSEMT01 BSEMT01 BSEMT02 BSEMT02 BSEMU01 BSEMU01 BSEMU02 BSEMU02 BSSMV01 BSEMV02 BSMMV03 BSSMV04

Probability

Figure 7.7

Population 2 Mathematics: Item Statistics and Parameter Estimates for the International Calibration Sample (Continued 3) Number of Respondents in International Calibration





2739 2534 2382 2171 5661 5053 4998 4221 5585 5330 5009 4671 5477 5582 5538 5512

23.80 65.04 30.31 28.33 31.90 35.84 23.41 9.90 34.45 33.66 37.10 20.65 52.67 27.11 40.75 39.26

1.722 -0.421 1.420 1.590 0.915 1.047 1.799 3.076 1.045 1.132 0.778 1.745 0.110 1.113 0.732 0.813

0.050 0.046 0.050 0.053 0.019 0.033 0.037 0.055 0.031 0.032 0.020 0.026 0.030 0.017 0.031 0.031

0.97 0.92 0.82 0.90 1.01 0.79 0.88 1.00 0.96 0.99 1.13 0.95 0.91 1.17 0.95 0.90

Empirical and Modelled Item Characteristic Curves for Mathematics Population 2 Item: BSMMD10. Fit MNSQ=1.16

1.00+------------------------------------------------------------+ | | | | | Legend: 0.90| .X...| | | .... | | X = empirical | |.. | • = modelled 0.80| | ...| | | |. X | | ..| | 0.70| | .. | | |. | | ..| X | 0.60| . | | |.. X | | .| | 0.50| . | | |.. | | .X | 0.40| . | | | X .. | | X |. | 0.30| ..| | | X | .. | | |. | 0.20| | ...| | | X .|. | Unusual "S" shape in the observed | | .... | | item characteristic curve. 0.10|..|. | | | | | | 0.00| | +--+-----+-----+-----+-----+-----+-----+-----+-----+-----+---+ -1.82 -1.36 -0.90 -0.44 0.02 0.48 0.94 1.40 1.86 2.32

Theta

135

CHAPTER 7

The compatibility of the model and data for Population 2 science is better than for any of the other three data sets. There is just one item, item Q18, with a weighted mean square greater than 1.15, and there are no items with weighted mean squares as low as 0.85. Figure 7.8 is a plot of the observed and expected item characteristic curves for Q18. The plot shows evidence of guessing. Examining the behavior at the country level again reveals that there is a distracter that is positive in many countries. As a set, the data appear to be quite compatible with the assumed Rasch scaling model. Certainly the extent of deviation from the model will have had no influence on the substantive outcomes of the study. A few isolated items that were retained in the scaling did not fit the model. The source of this misfit can generally be traced to multiple-choice item distracters that were attractive to some more able students.

136

CHAPTER 7

Table 7.5

Item Label

BSMSA07 BSMSA08 BSMSA09 BSMSA10 BSMSA11 BSMSA12 BSMSB01 BSMSB02 BSMSB03 BSMSB04 BSMSB05 BSMSB06 BSMSC07 BSMSC08 BSMSC09 BSMSC10 BSMSC11 BSMSC12 BSMSD01 BSMSD02 BSMSD03 BSMSD04 BSMSD05 BSMSD06 BSMSE07 BSMSE08 BSMSE09 BSMSE10 BSMSE11 BSMSE12 BSMSF01 BSMSF02 BSMSF03 BSMSF04 BSMSF05 BSMSF06 BSMSG07 BSMSG08 BSMSG09 BSMSG10 BSMSG11 BSMSG12 BSMSH01 BSMSH02 BSMSH03 BSMSH04 BSMSH05

Population 2 Science: Item Statistics and Parameter Estimates for the International Calibration Sample Number of Respondents in International Calibration



23016 23024 23015 22998 22982 22968 11410 11406 11407 11404 11362 11402 8642 8637 8633 8636 8624 8618 8794 8787 8787 8779 8769 8770 8669 8666 8662 8651 8642 8396 8637 8634 8392 8399 8631 8630 8619 8619 8614 8612 8605 8596 8264 8496 8494 8587 8586

67.11 65.54 76.66 68.38 57.92 58.51 86.92 53.40 26.47 88.48 48.88 83.44 37.75 71.63 72.95 77.08 45.26 52.70 40.22 73.39 36.95 54.99 66.27 72.75 41.38 79.17 77.80 53.59 57.30 53.93 66.61 63.19 66.17 68.33 80.44 68.81 86.99 59.38 74.32 51.64 49.56 51.14 69.26 79.26 79.15 50.73 22.99

-0.562 -0.484 -1.089 -0.626 -0.119 -0.146 -1.845 0.095 1.394 -1.998 0.299 -1.547 0.816 -0.786 -0.859 -1.102 0.468 0.131 0.674 -0.914 0.830 -0.001 -0.535 -0.877 0.634 -1.247 -1.158 0.080 -0.089 0.074 -0.549 -0.380 -0.515 -0.632 -1.349 -0.661 -1.856 -0.189 -0.948 0.166 0.260 0.189 -0.648 -1.240 -1.233 0.216 1.603

Asymptotic Standard Error in Logit Metric 0.015 0.015 0.016 0.015 0.014 0.014 0.029 0.020 0.022 0.030 0.020 0.026 0.024 0.025 0.025 0.027 0.023 0.023 0.023 0.025 0.023 0.023 0.024 0.025 0.023 0.028 0.027 0.023 0.023 0.023 0.024 0.024 0.024 0.025 0.028 0.025 0.033 0.023 0.026 0.023 0.023 0.023 0.025 0.028 0.028 0.023 0.027


137

CHAPTER 7

Table 7.5

Item Label

BSMSH06 BSMSI10 BSMSI11 BSMSI12 BSMSI13 BSMSI14 BSMSI15 BSMSI16 BSMSI17 BSSSI18 BSMSI19 BSMSJ01 BSMSJ02 BSSSJ03 BSMSJ04 BSMSJ05 BSMSJ06 BSMSJ07 BSMSJ08 BSSSJ09 BSSSK10 BSMSK11 BSMSK12 BSMSK13 BSMSK14 BSMSK15 BSMSK16 BSMSK17 BSMSK18 BSSSK19 BSMSL01 BSMSL02 BSMSL03 BSESL04 BSMSL05 BSMSL06 BSMSL07 BSMSM10 BSESM11 BSSSM12 BSMSM13 BSSSM14 BSMSN01 BSMSN02 BSMSN03 BSMSN04 BSMSN05

138




8583 2871 2869 2866 2795 2862 2859 2854 2851 2847 2759 2784 2942 2941 2941 2940 2940 2873 2936 2935 2876 2946 2945 2944 2942 2940 2938 2936 2925 2907 2859 2858 2859 2858 2857 2856 2857 2822 2821 2817 2813 2810 2839 2837 2837 2834 2688

51.40 74.54 45.59 35.35 60.86 54.72 49.11 85.00 40.72 35.30 51.40 39.22 62.71 27.13 41.11 64.35 22.69 47.16 45.16 76.12 34.14 55.02 51.85 74.01 79.74 59.35 35.94 51.63 54.22 72.38 47.22 51.68 68.42 32.96 64.30 52.21 68.15 46.03 65.65 52.36 46.82 71.14 43.68 39.76 60.35 50.53 31.29

0.186 -0.921 0.477 0.957 -0.217 0.066 0.320 -1.632 0.704 0.964 0.223 0.752 -0.369 1.337 0.629 -0.447 1.603 0.362 0.442 -1.079 0.947 -0.012 0.132 -0.953 -1.306 -0.210 0.867 0.143 0.029 -0.852 0.365 0.163 -0.631 1.041 -0.425 0.139 -0.618 0.425 -0.483 0.141 0.391 -0.764 0.550 0.732 -0.207 0.241 1.161

Asymptotic Standard Error in Logit Metric 0.023 0.045 0.040 0.041 0.041 0.040 0.040 0.054 0.040 0.041 0.040 0.041 0.040 0.044 0.040 0.041 0.046 0.040 0.040 0.045 0.042 0.039 0.039 0.044 0.048 0.040 0.041 0.039 0.039 0.044 0.040 0.040 0.043 0.042 0.041 0.040 0.042 0.040 0.042 0.040 0.040 0.044 0.040 0.041 0.041 0.040 0.044


CHAPTER 7

Table 7.5

Item Label

BSMSN06 BSSSN07 BSMSN08 BSMSN09 BSSSN10 BSSSO10 BSMSO11 BSMSO12 BSMSO13 BSESO14 BSMSO15 BSSSO16 BSSSO17 BSMSP01 BSSSP02 BSSSP03 BSMSP04 BSSSP05 BSSSP06 BSMSP07 BSMSQ11 BSSSQ12 BSMSQ13 BSMSQ14 BSMSQ15 BSMSQ16 BSSSQ17 BSSSQ18 BSMSR01 BSMSR02 BSESR03 BSSSR04 BSSSR05 BSESW01 BSESW01 BSESW02 BSESX01 BSESX02 BSESX02 BSESY01 BSESY02 BSESZ01 BSESZ01 BSESZ01 BSESZ02 BSESZ02






2833 2833 2834 2774 2830 2874 2797 2812 2872 2870 2862 2862 2803 2780 2776 2777 2774 2770 2764 2766 2713 2709 2703 2678 2612 2597 2567 2433 2786 2786 2784 2784 2783 5652 5408 5176 5690 5123 4923 5562 5291 2725 2449 2335 2341 2260

64.03 88.92 70.82 47.08 53.11 53.58 36.11 24.64 58.67 54.36 38.29 57.97 55.26 82.77 21.65 79.40 54.25 55.96 48.20 52.39 42.87 45.99 59.67 45.29 32.50 25.53 65.41 36.33 68.88 38.30 29.76 50.11 49.08 80.08 42.77 43.51 22.12 68.79 34.51 6.53 28.34 62.35 42.63 28.09 75.22 50.18

-0.382 -2.017 -0.727 0.399 0.123 0.087 0.905 1.516 -0.147 0.052 0.791 -0.112 0.025 -1.501 1.669 -1.264 0.047 -0.031 0.304 0.132 0.569 0.427 -0.193 0.463 1.080 1.444 -0.460 0.728 -0.654 0.771 1.160 0.231 0.277 -1.306 0.607 0.517 1.398 -0.609 1.019 3.162 1.208 -0.287 0.642 1.379 -0.895 0.344

0.041 0.061 0.043 0.040 0.040 0.040 0.042 0.046 0.040 0.040 0.041 0.040 0.040 0.052 0.048 0.049 0.040 0.041 0.025 0.040 0.041 0.041 0.042 0.041 0.044 0.047 0.044 0.026 0.043 0.041 0.030 0.040 0.040 0.035 0.029 0.018 0.022 0.032 0.032 0.055 0.021 0.042 0.043 0.048 0.050 0.045

0.98 0.91 1.00 0.95 0.96 0.96 1.02 0.98 1.01 0.91 1.01 0.95 1.04 0.92 0.95 0.91 0.98 0.99 0.98 1.06 1.03 0.99 0.98 1.07 1.03 1.05 0.99 1.17 1.03 1.05 0.94 0.91 0.88 1.00 1.05 1.05 0.98 1.03 0.99 0.94 1.13 0.99 0.90 0.91 1.02 1.00

139

CHAPTER 7

Probability

Figure 7.8

Empirical and Modelled Item Characteristic Curves for Science Population 2 Item: BSSQ18. Fit MNSQ=1.17

1.00+------------------------------------------------------------+ | | | | 0.90| | | | | | A difficult multiple-choice item with 0.80| | evidence of guessing. | | | | 0.70| | | X | | | .| 0.60| |.. | | ..| | | | .. | | 0.50| |... | | | | .| | | | ... X | 0.40| |. | | | | ...| | | |. X | 0.30| | ...X | | |.. | | | |....X | 0.20| X X | ...X | Legend: | | ..|. | | | | | ..|... X | X = empirical 0.10| X ....|.. | | • = modelled |..|... | | | | | 0.00| | +--+-----+-----+-----+-----+-----+-----+-----+-----+-----+---+ -2.02 -1.57 -1.12 -0.67 -0.21 0.24 0.69 1.14 1.60 2.05

Theta

7.4.4

Reliability

Table 7.6 reports a variety of reliability indices for the four tests. The median Cronbach Alpha coefficients were computed by calculating the Cronbach Alpha coefficient for each test booklet within each country. The median of these values was then used as a reliability index for each country. The median of those country medians is reported in Table 7.6. The separation reliability for the international calibration sample was computed by fitting the scaling model without the use of any conditioning variables, drawing five plausible values for each student, and then computing the median of the ten correlations between pairs of plausible values. In general, these statistics show that the science tests are slightly less reliable than the mathematics tests and that the Population 1 tests are slightly less reliable than the Population 2 tests. Table 7.6

Unconditional Reliabilities Median of Lower Grade National Cronbach Alpha Coefficients

Median of Upper Grade National Cronbach Alpha Coefficients

Separation Reliability in the International Calibration Sample

Mathematics Population 1

0.82

0.84

0.83

Science Population 1

0.78

0.77

0.77

Mathematics Population 2

0.86

0.89

0.89

Science Population 2

0.77

0.78

0.80

TIMSS Test

140

CHAPTER 7

7.4.5

The Population Model for Population 2

For Population 2 it was considered expedient to proceed with a scaling that did not make extensive use of conditioning. There were two reasons for this. First, the reliability of the Population 2 data was relatively high so that the possible effect of conditioning would be ignorable. Second, the background data were not fully cleaned and checked at the time of processing, and extensive conditioning would have delayed publication of the international reports. For each participating country the scaling was undertaken with all item parameters set at the values obtained from fitting the model to the international calibration sample. In the population model sampling weights were used, and student grade was used as a conditioning variable. Five plausible values were drawn and an EAP estimate of achievement was obtained for each student. As illustrated in Table 7.7, conditioning on grade led to little improvement in the person separation reliability. This conditioning was, however, necessary to ensure that consistent results were obtained when plausible values were used to estimate characteristics of the achievement distributions for the upper and lower grades separately.

Table 7.7

Population 2 Reliabilities For Three Countries With and Without Conditioning on Grade Mathematics

Science

Country

Conditioning on Grade

No Conditioning

Conditioning on Grade

Australia

0.88

0.88

0.80

0.80

Cyprus

0.86

0.86

0.78

0.77

Hong Kong

0.87

0.87

0.76

0.76

7.4.6

No Conditioning

The Population Model for Population 1

In Population 1 conditioning was used much more extensively. For both mathematics and science the variables sex, grade, and the interaction between sex and grade were used as conditioning variables.1 Additionally, for mathematics the mean of the mathematics score variable ASMRAWST was computed for each class, assigned to each student in that class, and then used as a conditioning variable. This variable was called ASMRAWAV. Similarly, for science the mean of the mathematics score variable ASSRAWST was computed for each class, assigned to each student in that class, and then used as a conditioning variable. This variable was called ASSRAWAV. This conditioning was undertaken so as to improve the estimation of between-class and betweenschool variance components that would be obtained from secondary analyses using plausible values. Each individual student’s science score ASSRAWST was also used as a conditioning variable for mathematics, and in the case of science, each individual student’s mathematics score ASMRAWST was used.

1

The gender variable ASBGSEX is trichotomous (male, female, missing). When used in conditioning, this variable was replaced with two dummy coded variables.

141

CHAPTER 7

Table 7.8

Number of Principal Components Retained In Conditioning - Population 1

Country Australia Austria Canada Cyprus Czech Republic England Greece Hong Kong Hungary Iceland Iran Ireland Israel Japan Korea Kuwait Latvia Mexico Netherlands New Zealand Norway Portugal Scotland Singapore Slovenia Thailand United States

Number of Retained Principal Components 62 85 84 69 84 51 78 81 86 69 83 83 62 59 78 88 74 103 83 87 75 91 46 106 77 73 72

For both mathematics and science, the pool of over 100 student-level background variables was also represented in the conditioning. For each student-level variable a set of dummy variables was constructed from the original variables (see Appendix D). This new set of dummy variables retained all of the information in the original set of variables but made them appropriate for use in a principal components analysis. A principal components analysis of the set of dummy variables was then undertaken for each country and as many components retained as explained 90% of the variance. Scores on each of the retained components were then computed for each student. The number of retained components for each country is shown in Table 7.8.

These components, and the products of these components and ASMRAWAV (in the case of mathematics) and ASSRAWAV (in the case of science), were used as conditioning variables. Table 7.9 shows the conditioning variables that were used for mathematics and science. For some countries the total was in excess of 200. Table 7.10 illustrates for three selected countries the increase in reliability that was attained by conditioning, first by grade and then with the full set of conditioning variables.

142

CHAPTER 7

Table 7.9 Variables Used in Conditioning - Population 1 Variables

Mathematics

Science

Grade

✓

✓

Gender

✓

✓

Gender by grade interaction

✓

✓

Mathematics score

✗

✓

Science score

✓

✗

Class mean mathematics score

✓

✗

Class mean science score

✗

✓

Principal components

✓

✓

Principal component by class mean mathematics score

✓

✗

Principal component by class mean science score

✗

✓

Table 7.10 Variables Used in Conditioning - Population 1 Mathematics Country

Science

Conditioning No Full Conditioning on Grade Conditioning

Conditioning No Full Conditioning on Grade Conditioning

Australia

0.83

0.84

0.87

0.77

0.78

0.83

Cyprus

0.82

0.83

0.86

0.74

0.75

0.81

Hong Kong

0.78

0.79

0.84

0.73

0.74

0.81

143

CHAPTER 7

REFERENCES Adams, R.J. and Gonzalez, E.J. (1996). TIMSS test design. In M.O. Martin and D.L. Kelly (Eds.), TIMSS technical report, volume I: Design and development. Chestnut Hill, MA: Boston College. Adams, R.J., Wilson, M.R., and Wang, W.C. (1997). The multidimensional random coefficients multinomial logit. Applied Psychological Measurement, 21, 1-24. Adams, R.J., Wilson, M.R., and Wu, M.L. (1997). Multilevel item responses models: An approach to errors in variables regression. Journal of Educational and Behavioral Statistics, 22 (1), 46-75. Beaton, A.E. (1987). Implementing the new design: The NAEP 1983-84 technical report. Report No. 15-TR-20. Princeton, NJ: Educational Testing Service. Bock, R.D. and Aitken, M. (1981). Marginal maximum likelihood estimation of item parameters: An application of the EM algorithm. Psychometrika, 46, 443-459. Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-38. Kelderman, H. and Rijkes, C.P.M. (1994). Loglinear multidimensional IRT models for polytomously scored items. Psychometrika, 59, 149-176. Mislevy, R.J. (1991). Randomization-based inference about latent variables from complex samples. Psychometrika, 56, 177-196. Mislevy, R.J., Beaton, A.E., Kaplan, B., and Sheehan, K.M. (1992). Estimating population characteristics from sparse matrix samples of item responses. Journal of Educational Measurement, 29 (2), 133-161. Mislevy, R.J., Johnson, E.G., and Muraki, E. (1992). Scaling procedures in NAEP. Journal of Educational Statistics, 17, 131-154. Mislevy, R.J. and Sheehan, K.M. (1987). Marginal estimation procedures. In A.E. Beaton (Ed.), Implementing the new design: The NAEP 1983-84 technical report. Report No. 15-TR-20. Princeton, NJ: Educational Testing Service. Mislevy, R.J. and Sheehan, K.M. (1989). The role of collateral information about examinees in item parameter estimation. Psychometrika, 54 (4), 661-679. Rubin, D.B. (1987). Multiple imputation for non-response in surveys. New York: John Wiley & Sons. Volodin, N. and Adams, R.J. (1997). The estimation of polytomous item response models with many dimensions. Paper presented at the Annual Meeting of the Psychometric Society, Gatlinburg, TN. 144

CHAPTER 7

Wilson, M.R. (1992). The ordered partition model: An extension of the partial credit model. Applied Psychological Measurement, 16 (3). Wu, M.L., Adams, R.J., and Wilson, M. (1997). Conquest: Generalized item response modelling software – Manual. Melbourne: Australian Council for Educational Research. Wu, M.L. (1997). The development and application of a fit test for use with marginal maximum estimation and generalized item response models. Unpublished Master’s dissertation, University of Melbourne.

145

Scaling Methodology and Procedures for the Mathematics ... - CiteSeerX

Scaling Methodology and Procedures for the Mathematics ... - CiteSeerX

Suggest Documents

Scaling Methodology and Procedures for the ... - TIMSS and PIRLS

Visualization Methodology for Multidimensional Scaling - CiteSeerX

research methodology and critical mathematics education - CiteSeerX

Alternative linear-scaling methodology for the second

Data Processing Procedures and Methodology for ...

THE METHODOLOGY OF INDIAN MATHEMATICS

Space-Limited Procedures: A Methodology for

Field methodology, recording and analytical procedures

Instrument Development Procedures for Mathematics Measures - Eric

APPLIED MATHEMATICS THESIS Scaling and Multiscaling in ...

Multidimensional Scaling Analysis of Haptic Exploratory Procedures

'Hard' Systems Methodology - Computing Science and Mathematics

Bayesian Predictive Procedures for Designing and ... - CiteSeerX

Inventory Procedures for Smallholder and Community ... - CiteSeerX

A Methodology for Compensating Errors - Mathematics, Statistics and ...

Procedures for Teaching Variational Formulation and ... - CiteSeerX

Rasch scaling procedures for informing development of a valid Fetal

Computational Procedures for the Multi-Disciplinary ... - CiteSeerX

Simplified gas-chromatographic procedures for the ... - CiteSeerX

Data and methodology - CiteSeerX

Investigation of Scaling Methodology for Strained Si n ... - Google Sites

Investigation of Scaling Methodology for Strained Si n ... - Google Sites

Novel Methodology for Functional Modeling and ... - CiteSeerX

Design Methodology for Aerodynamically Scaling of a ...