PSYCHOMETRIKA -- VOL. 55, NO. 4, 577-601, DECEMBER 1990
ON THE SAMPLING THEORY FOUNDATIONS OF ITEM RESPONSE THEORY MODELS

PAUL W. HOLLAND

EDUCATIONAL TESTING SERVICE

Item response theory (IRT) models are now in common use for the analysis of dichotomous item responses. This paper examines the sampling theory foundations for statistical inference in these models. The discussion includes: some history on the "stochastic subject" versus the random sampling interpretations of the probability in IRT models; the relationship between three versions of maximum likelihood estimation for IRT models; estimating θ versus estimating θ-predictors; IRT models and loglinear models; the identifiability of IRT models; and the role of robustness and Bayesian statistics from the sampling theory perspective.

Key words: stochastic subjects, marginal maximum likelihood (MML), conditional maximum likelihood (CML), unconditional maximum likelihood (UML), joint maximum likelihood (JML), probability simplex, loglinear models, robustness.
1. Introduction
This paper is concerned with the statistical foundations of the probability models, based on item response theory (IRT), that are used to analyze educational and psychological test data. By "statistical foundations", I mean those aspects of IRT models used to draw statistical inferences from test data; that is, parameter estimates, their standard errors, and goodness-of-fit tests. The primary emphasis of this paper is on a formulation of these foundations that uses the random sampling of examinees from a population as its only source of probability. Many of the ideas in this paper have been discussed in one form or another by other writers so that its main contribution is the consistency with which the sampling theory approach is pursued. The remainder of this paper is organized as follows. Section 2 sets up the basic notation and discusses some relevant history. Section 3 discusses loglinear models for test data. In Section 4, I consider the geometric structure of IRT models. Section 5 shows the relation between three types of IRT likelihood functions and Section 6 introduces the idea of an ability predictor to replace ability estimates. Section 7 briefly sketches a few related topics.

2. Basic Notation and the Underlying Ideas
A presidential address can serve many different functions. This one is a report of investigations I started at least ten years ago to understand what IRT was all about. It is a decidedly one-sided view, but I hope it stimulates controversy and further research. I have profited from discussions of this material with many people including: Brian Junker, Charles Lewis, Nicholas Longford, Robert Mislevy, Ivo Molenaar, Donald Rock, Donald Rubin, Lynne Steinberg, Martha Stocking, William Stout, Dorothy Thayer, David Thissen, Wim van der Linden, Howard Wainer, and Marilyn Wingersky. Of course, none of them is responsible for any errors or misstatements in this paper. The research was supported in part by the Cognitive Science Program, Office of Naval Research under Contract No. N00014-87-K-0730 and by the Program Statistics Research Project of Educational Testing Service. Requests for reprints should be sent to Paul W. Holland, Educational Testing Service, Rosedale Road 21-T, Princeton, NJ 08541. © 1990 The Psychometric Society

I will assume that our attention is focused on a specific test, T. Throughout this discussion, T will be considered as given and fixed. By a "test" I mean a specific set of questions with specific directions, given under standardized conditions of timing,
item presentation, and so forth. If any of these elements change, the resulting test is different from T. For my purposes here, this rough description of a test should be sufficient. The test, T, is made up of J test questions or items which I will index by the letter j (with or without primes or subscripts, as needed). Each item is assumed to have a "correct" answer and we use the indicator variable x_j to denote right or wrong answers to item j of T,

x_j = { 1, if item j is answered correctly; 0, if item j is answered incorrectly. (1)
Let me make my first simplifying assumption right here. I will not consider the possibility of unanswered questions on T in this paper, except briefly in Section 7. Omitted or "not reached" responses are important considerations in real tests and are a crucial feature of computerized adaptive tests, but they are beyond the scope of this paper. The only values of x_j considered here are 0 and 1, although little effort is required to extend most of my comments to the case where x_j is polytomous. The pattern of correct and incorrect responses to the J items that an examinee might produce upon taking the test T will be denoted by the response vector or response pattern x, where
x = (x_1, ..., x_J). (2)
There are 2^J possible values of x. Up to this point, there is little that is different from other developments of IRT except for the fact that the response vector x does not have a subscript on it to indicate the examinee who produced it. This is a characteristic of the notation I will use -- x merely indexes all of the possible 2^J response patterns. We now come to an important difference between this and other developments of IRT. Let C denote a given population of potential examinees. If T were administered to everyone in C, a certain proportion of them, p(x), would produce the response vector x. The values of the p(x) are proportions and, as such, they satisfy these two conditions:
p(x) ≥ 0, (3)

and

Σ_x p(x) = 1, (4)

where Σ_x denotes a summation over all of the 2^J values of x. Let p denote the 2^J-dimensional vector with coordinates p(x) in some (e.g., lexicographic) order:
p = (p(x)). (5)
If a person is sampled at random from C and tested with T, the probability that this randomly selected examinee will produce response pattern x is exactly p(x). This probability is simply a consequence of what we mean by random sampling from C. It is convenient to let X denote the response pattern of a randomly sampled examinee from C. Then p(x) = Prob(X = x). When N people are sampled without replacement from C and tested with T, let n(x) denote the number who produce response vector x. Then let n denote the 2^J-dimensional vector with coordinates n(x):
n = (n(x)). (6)
Hence, Σ_x n(x) = N.
The vector n is a 2^J-dimensional contingency table representation of the item response data from the N examinees and it has an approximate multinomial distribution with parameters N and p, provided that N is small relative to the size of C. (The exact distribution of n is multivariate hypergeometric.) I now make a second simplifying assumption: C is very large relative to N so that we may ignore the fact that n does not have an exact multinomial distribution. In general, N will be known, and p will be unknown. The likelihood function (i.e., the probability of the observed data) is based on the multinomial distribution and is

∏_x p(x)^{n(x)}, (7)

and its logarithm, the log likelihood function, is

L = Σ_x n(x) log p(x). (8)
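As a concrete illustration of (8), the log likelihood can be computed directly from the pattern counts n(x). The sketch below is not from the paper: the five response vectors and the (uniform) choice of p are invented purely for illustration.

```python
from collections import Counter
from itertools import product
import math

J = 3
patterns = list(product([0, 1], repeat=J))  # all 2^J possible response patterns

# A hypothetical p(x): uniform over the 2^J patterns, for illustration only.
p = {x: 1.0 / len(patterns) for x in patterns}

# Hypothetical response vectors produced by N = 5 sampled examinees.
data = [(1, 1, 0), (1, 0, 0), (1, 1, 0), (0, 0, 0), (1, 1, 1)]
n = Counter(data)  # n(x): the number of examinees producing pattern x

# Equation (8): L = sum over x of n(x) * log p(x)
L = sum(n[x] * math.log(p[x]) for x in n)
```

Patterns with n(x) = 0 contribute nothing to L, so the sum may be taken over the observed patterns only.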
This log likelihood function and its relationship to other IRT likelihood functions are discussed extensively in Section 5. At this point, I want to emphasize that the only source of probability that I will use in this paper is random sampling from C. There are no stochastic mechanisms operating except the sampling process by which examinees are selected from C and tested with T. I shall make no probabilistic assumptions about how an examinee arrives at his or her particular response vector. I only assume that an examinee, if tested with T, will produce some response vector. This assumption, along with the random sampling from C, will be the sole probability basis for statistical inferences. This is the random sampling perspective and it should be distinguished from the stochastic subject perspective discussed later in this section. It is important to remember that performance on a test can depend on many factors in addition to the actual test questions -- for example, the conditions under which the test was given, the conditions of motivation or pressure impinging on the examinee at the time of testing, and his or her previous experience with tests like T. In this paper, all of these factors are intended to be absorbed into either the precise definition of T or of the examinee population, C. It should be emphasized that I do not make the deterministic assumption that if an examinee were retested with T then the same response pattern would be produced by that examinee. Even if retesting were done under identical conditions, the examinee is no longer the same person in the sense that relevant experiences (i.e., prior exposure to T) have occurred that had not occurred prior to the first testing. In short, the tested sample of examinees has changed from a sample from C to a sample from a new population, C'. Thus, we explicitly do not include individual variability in test performance over time as a source of probability in defining p(x).
Rather, p(x) is merely the proportion of C who would produce response vector x if tested with T. Individual variability over time is absorbed into the definition of the population C. As the population C "ages" in some important ways, it changes and so do the values of p(x). This approach runs the risk of defining the population, C, so narrowly that it is of little general interest -- all of the high school juniors who took the SAT for the first time on Saturday, June 2, 1990. However, I believe that such a level of specificity is necessary for a precise statistical foundation for IRT models. So far, most of the features of IRT models have not made their appearance -- for example, item response functions, latent traits, item parameters, and so on. It is now time to remedy this. We have defined the basic data vector, n, to which the data collection process (i.e., random sampling) gives a multinomial distribution with a known value for N but an unknown value for p. The next step is to build a model for p. From the random sampling perspective, this is the purpose of IRT. In general, a model for p is a restriction on the possible values of p. Let Ω_J denote the set of all possible 2^J-probability vectors; that is, Ω_J is the probability simplex defined by:
Ω_J = {all vectors q = (q(x)) such that q(x) ≥ 0 and Σ_x q(x) = 1}.

The vector p is a point in Ω_J but, aside from this, p is not restricted, as yet, in any way. The vector of observed proportions of examinees producing each response pattern is n/N, a point in Ω_J. For this reason Ω_J may be called the "data space" to distinguish it from the "parameter space" introduced in Sections 3 and 4. Unfortunately, this distinction may cause confusion because p is also a point in Ω_J and p is a parameter of the multinomial distribution that governs the statistical properties of n/N. A model for p is a subset, M, of the data space (M ⊂ Ω_J). In this framework, all IRT models correspond to various subsets of Ω_J. These IRT-subsets are all specified by special cases of (9), below:

p(x) = ∫ ∏_{j=1}^{J} P_j(θ)^{x_j} Q_j(θ)^{1-x_j} dF(θ). (9)
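Once the IRFs and F are specified, (9) can be evaluated numerically. The sketch below is illustrative only: it assumes two-parameter logistic IRFs with invented item parameters and a standard normal F, and approximates the integral with a simple Riemann sum over a fine grid.

```python
import math
from itertools import product

def lgt(z):
    # logistic distribution function
    return 1.0 / (1.0 + math.exp(-z))

def p_x_given_theta(x, theta, a, b):
    # the integrand of (9): a product over the J items (local independence)
    prob = 1.0
    for xj, aj, bj in zip(x, a, b):
        P = lgt(aj * (theta - bj))
        prob *= P if xj == 1 else 1.0 - P
    return prob

def p_x(x, a, b, lo=-6.0, hi=6.0, n_pts=2001):
    # approximate the integral in (9) with F = standard normal,
    # so dF(theta) is the normal density times d(theta)
    h = (hi - lo) / (n_pts - 1)
    total = 0.0
    for i in range(n_pts):
        theta = lo + i * h
        phi = math.exp(-0.5 * theta * theta) / math.sqrt(2.0 * math.pi)
        total += p_x_given_theta(x, theta, a, b) * phi * h
    return total

# Invented item parameters for a hypothetical 3-item test.
a = [1.0, 1.2, 0.8]
b = [-0.5, 0.0, 0.5]

# Summing the approximated p(x) over all 2^J patterns should recover 1,
# as conditions (3) and (4) require.
total = sum(p_x(x, a, b) for x in product([0, 1], repeat=3))
```

Any quadrature rule would do here; the point is only that each p(x) in (9) is an ordinary proportion obtained by averaging the integrand over the ability distribution.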
In (9), P_j(θ) = 1 - Q_j(θ) is the item response function (IRF, or, in earlier usage, "item characteristic curve", ICC), θ is the latent trait or "ability", and F is the cumulative distribution function of θ over the population C. In (9), I have used the Stieltjes form of the integral (i.e., the dF(θ) notation). This allows for complete mathematical generality. If F(θ) is a differentiable distribution function, then its derivative f(θ) = F'(θ) is the density function of θ and dF(θ) may be replaced by f(θ)dθ in (9). Finally, in (9), θ may be a scalar or a vector. For clarity, let θ be a scalar unless I state otherwise. Unidimensionality is not an important restriction on much of my discussion. The usual IRT models correspond to specific parametric choices of the IRFs and/or of F. For example, Lawley (1943), Tucker (1946), Lord (1952), Lord and Novick (1968), and Bock and Lieberman (1970) all study an explicit form of (9) with P_j(θ) given by the "normal ogive" model,

P_j(θ) = Φ(a_j(θ - b_j)),
where Φ(z) is the standard normal distribution function, b_j and a_j are location and scale parameters, and F(θ) is given by the standard normal distribution, F(θ) = Φ(θ). Birnbaum (1968) examines another version of (9) in which P_j(θ) is given by the logistic model

P_j(θ) = LGT(a_j(θ - b_j)), (10)

where LGT(z) is the logistic distribution function,

LGT(z) = e^z / (1 + e^z), (11)
and F(θ) is also given by the logistic distribution, F(θ) = LGT(θ). Thissen (1982) considers the case of (9) in which P_j(θ) is given by the one-parameter logistic (or Rasch) model,

P_j(θ) = LGT(θ - b_j), (12)
and F(θ) is the normal distribution function with mean zero but unknown variance, F(θ) = Φ(θ/σ). Tjur (1982) and Cressie and Holland (1983) consider the case of (9) in which P_j(θ) is given by the one-parameter logistic model (12) and F(θ) ranges over all possible distribution functions. Holland and Rosenbaum (1986) and Stout (1987) consider the case of (9) in which P_j(θ) is only required to be monotone increasing in θ and F(θ) is any distribution function. One may view (9) in at least two ways. On the one hand, it is a formula that gives legitimate values for p(x) (i.e., if P_j(θ) is restricted to lie in the interval [0, 1] and F(θ) is a distribution function, then p(x) satisfies (3) and (4)), and as such it defines a model, M, for p. The model M will depend on the restrictions that are put on the set of IRFs and on F. From this first point of view the origin of (9) does not really matter. It is satisfactory in so far as it gives rise to models that fit data. On the other hand, it is important to be able to give reasons why p(x) might satisfy (9). Furthermore, if (9) fails to fit the data for a particular family of IRFs and choice of F, it is important to be able to expand the model in reasonable ways so as to improve the fit. A rationale for (9) may aid in choosing new parameters to add to the model that are sensible and improve the fit, and therefore the utility, of the model, M. The most common rationales for (9) may be divided into two types which I shall call, respectively, the "random sampling" and the "stochastic subject" rationales. Both focus on the integrand of (9),

p(x|θ) = ∏_{j=1}^{J} P_j(θ)^{x_j} Q_j(θ)^{1-x_j}. (13)
Both rationales strive to give meaning to the notion that (13) gives the conditional "probability" that an examinee with ability θ will produce response vector x when tested with T. Both rationales interpret P_j(θ) as the conditional "probability" that an examinee with ability θ will answer item j correctly; and they both assume that local independence (i.e., (13)) is the proper way to combine these individual item "probabilities" to obtain the "probability" for the entire response vector. They differ in the way that P_j(θ) is interpreted as a probability.

The Random Sampling Rationale
The " r a n d o m sampling" rationale is explicitly stated in Birnbaum's chapter in Lord and Novick (1968). Birnbaum (1967) says Item scores . . . are related to an ability 0 by functions that give the probability of each possible score on an item for a randomly selected examinee of given ability. These functions are . . . the item characteristic curve(s) . . . Pg(O) . . . (p. 397) In the " r a n d o m sampling" rationale, 0 defines strata, or subpopulations of the population C, with the same ability. The meaning of Pj(O) is the proportion of people in the
582
PSYCHOMETRIKA
0-th stratum of C who will answer itemj correctly if tested by T. This is what is meant by "the probability that a randomly sampled subject with ability 0 will correctly answer itemj". The random sampling rationale also appears in other applications of latent variable models. In his description of latent structure analysis applied to a measure of "ethnocentricity", Lazarsfeld (1950) describes the trace line (which is the latent-structure equivalent of the IRF) in terms very close to my definition of p(x) except that it is conditional on 0: the proportion of people with a given degree of ethnocentricity who make a positive response to an item. (p. 370) The rationale for local independence within the random sampling point of view is associated with the general way that latent traits are viewed as "explaining" a set of data. Lazarsfeld (1950) describes this notion of explanation in detail: It is possible to formulate mathematically what is meant if we say that an underlying continuum accounts for the interrelationship of two test i t e m s . . . Such a formulation reduces to this idea: If people have the same position on the underlying x-continuum, then their answers to the two questions will be unrelated; the probability they will answer two questions positively is then the product of the probabilities that they will answer each questions alone positively. (p. 369) In general then, a latent trait "explains" the intercorrelations among a set of variables in a population if, conditionally given the value of the latent trait, all of the variables are mutually independent (i.e., if local independence holds). This notion of explanation and some of its limitations are discussed in Holland and Rosenbaum (1986). The random sampling rationale fits in very naturally with the definition I have given for p(x). I f we randomly sample an examinee from C then, given this examinee's 0-stratum in C, we automatically randomly sample the examinee from that 0-stratum of C. 
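The random sampling rationale can be sketched numerically: if we sample many examinees from a single θ-stratum in which the proportion answering item j correctly is P_j(θ), the observed joint proportions exhibit local independence. In the sketch below the item parameters, the stratum, and the device of generating each stratum member's responses by independent draws are all invented for illustration.

```python
import math
import random

random.seed(0)

def lgt(z):
    # logistic distribution function, used as a hypothetical IRF kernel
    return 1.0 / (1.0 + math.exp(-z))

# Two hypothetical items; every examinee below belongs to one theta-stratum.
a = [1.2, 0.9]
b = [-0.3, 0.4]
theta = 0.8  # the stratum

# Simulate N examinees sampled from this stratum: within the stratum, a
# proportion P_j(theta) answers item j correctly, independently across items.
N = 200_000
resp = [
    [1 if random.random() < lgt(aj * (theta - bj)) else 0 for aj, bj in zip(a, b)]
    for _ in range(N)
]

p1 = sum(r[0] for r in resp) / N          # observed proportion correct, item 1
p2 = sum(r[1] for r in resp) / N          # observed proportion correct, item 2
p12 = sum(r[0] * r[1] for r in resp) / N  # observed proportion correct on both

# Local independence within the stratum: P(both correct) is close to p1 * p2.
gap = abs(p12 - p1 * p2)
```

The gap shrinks toward zero as N grows, which is exactly Lazarsfeld's statement that, at a fixed position on the latent continuum, the two answers are unrelated.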
One serious limitation with the random sampling rationale is that it does not suggest any choice of the form of the IRFs. Without a more substantive interpretation of P_j(θ), it is difficult to see how one might be led to any of the item parameters that are in common usage. The important exception is the item difficulty parameter. These parameters allow p(x) to fit all of the one-dimensional marginal proportions exactly, and from general considerations of models for 2^J-dimensional contingency tables, one might argue that this is an absolute necessity for any model that aspires to fit real data. We return to this point again in Sections 3 and 4. In any event, while the random sampling rationale for P_j(θ) is neatly consistent with the assumption that examinees are randomly sampled from C, it does not lead to specific choices of the form of P_j(θ).
The Stochastic Subject Rationale

The "stochastic subject" rationale for (9) views the performance of an individual examinee on each item in T as inherently unpredictable for various reasons. As a mathematical model for this unpredictability, the responses of the examinees are assumed to be governed by a stochastic mechanism that is characterized by an ability parameter θ and by parameters that describe the characteristics of the items. In the stochastic subject rationale, θ is a person parameter and varies from examinee to examinee in C. The discussions in the literature as to the meaning of the stochastic mechanisms that produce responses are generally tautological: subjects are stochastic because human behavior is uncertain. For example, Rasch (1960) justifies his use of a probability model as follows:

we return to ... the description of certain human acts by a model of chance ... Even if we know a person to be very capable, we cannot be sure that he will solve a certain difficult problem, nor even a much easier one. There is always a possibility that he fails -- he may be tired or his attention is led astray, or some other excuse may be given. And a person of slight ability may hit upon the correct solution of a difficult problem. Furthermore, if the problem is neither "too easy" nor "too difficult" for a certain person, the outcome is quite unpredictable. But we may in any case attempt to describe the situation by ascribing to every person a probability of solving each problem correctly, and this probability will be our indicator of "how easily" the problem is solved. The probability that a very able person solves a very easy problem is near unity, but not necessarily equal to 1, and the probability that a person of small ability solves a difficult problem is very near to 0. (p. 73)
Along these same lines, Lord and Novick (1968) refer to the propensity distribution of a single examinee's responses to a test in the following terms:

Most students taking college entrance examinations are convinced that how they do on a particular day depends to some extent on "how they feel that day". A student who receives scores which he considers surprisingly low often attributes this unfortunate circumstance to a physical or psychological indisposition or to some more serious temporary state of affairs not related to the fact that he is taking the test that day. To provide a mathematical model for such cyclic variations, we conceive initially of a sequence of independent observations ... and consider some effects, such as the subject's ability, to be constant, and others, such as the transient state of the person, to be random. We then consider the distribution that might be obtained over a sequence of such statistically independent measurements if each were governed by the propensity distribution ... The propensity distribution is a hypothetical one because ... it is not usually possible in psychology to obtain more than a few independent observations. Even though this distribution cannot in general be determined, the concept will prove useful. (p. 30)

The stochastic subject interpretation of P_j(θ) is related to the probabilistic learning models described in Bush and Mosteller (1955). In describing the development of mathematical models in psychology, these authors argue that the stochastic subject view is an inescapable fact of life:

These advances indicated a growing awareness that performance is an unpredictable thing, that choices and decisions are an ineradicable feature of intelligent behavior, and that the data we gather from psychological experiments are inescapably statistical in character. Given these basic facts of the theoretical psychologist's life, statistical theories in psychology would seem to have come to stay. (p. 336)

A similar view is expressed in Samejima (1983):

There may be an enormous number of factors eliciting his or her specific overt reactions to a stimulus, and, therefore, it is suitable, even necessary, to handle the situation in terms of the probabilistic relationship between the two. (p. 159)
I believe that no completely satisfactory justification of the "stochastic subject" is possible, but I also believe that most users think intuitively about IRT models in terms of stochastic subjects. It has great heuristic value even though no one readily admits to the belief that there is some sort of random mechanism within each subject generating his or her response vector by (mental) flips of biased coins. A simple example of this heuristic value arises in a commonly given rationale for local independence. If it is plausible to believe that T is such that the response of an examinee to one question will not influence his or her answer to some other question, then it is not difficult to accept the view that the responses from such a stochastic subject will exhibit local independence. "Guessing parameters" provide another example of the heuristic value of stochastic subjects. It is difficult to imagine how guessing parameters would have been conceived without the aid of a stochastic subject interpretation of P_j(θ). Finally, if θ represents an "ability to solve a problem" then it is not difficult to suppose, for stochastic subjects and tests with "correct answers", that P_j(θ) ought to increase in θ -- as it does for most of the common parametric models. I view both the random sampling and the stochastic subject rationales for (9) as useful tools for understanding IRT models. The stochastic subject is a powerful metaphor that aids our intuition in the construction of models (i.e., choosing M ⊆ Ω_J). The random sampling rationale gives a firm logical basis for statistical inference in these models. Neither of these two important roles should be ignored. Lord (1974) discusses an additional interpretation of P_j(θ). In his words:

P_ia is most simply interpreted as the probability that the examinee a will give the right answer to a randomly chosen item having the same ICC as item i. (p. 249)

While this interpretation of P_j(θ) has a clear sampling theory basis, I did not include it in my discussion because it does not apply to the case assumed here in which the test T is a fixed set of questions. It would apply to a testing situation in which every question presented to every examinee was sampled afresh from one of J item pools. Such applications can arise with computerized testing. Lord also mentions briefly the difference between the random sampling and stochastic subject interpretations of P_j(θ). Continuing the above quotation he says:

An alternative interpretation is that P_i(θ_a) is the probability that item i will be answered correctly by a randomly chosen examinee of ability level θ = θ_a. (p. 250)

This is an explicit statement of the random sampling rationale. Then he goes on to say that:

These interpretations tell us nothing about the probability that a specified examinee will answer a specific item correctly. (p. 250)

Here, Lord clearly allows for the possibility that a specific examinee might behave as a stochastic subject in certain circumstances -- which, in that paper, concern responses to previously omitted items. Lazarsfeld (1959) gives the following, rather graphic, description of a stochastic subject -- quoted at length in Lord and Novick (1968, pp. 29-30) -- only nine years after his equally clear description of the random sampling rationale mentioned above:
Suppose we ask an individual, Mr. Brown, repeatedly whether he is in favor of the United Nations; suppose further that after each question we "wash his brains" and ask him the same question again. Because Mr. Brown is not certain as to how he feels about the United Nations, he will sometimes give a favorable and sometimes an unfavorable answer. Having gone through this procedure many times, we then compute the proportion of times Mr. Brown was in favor of the United Nations. (pp. 493-494)

Thus writers on IRT models have been eclectic in their interpretation of the meaning of the "probability" that the IRF is supposed to represent. More often, they are silent and make no effort to interpret it at all; for example, Lawley (1943), Tucker (1946), Lord (1952), Samejima (1969, 1972), Bock and Lieberman (1970), Wright and Douglas (1977), Wright and Stone (1979), and Andersen (1980). To discuss the statistical foundations of IRT models without ambiguity or confusion, we can afford to be neither silent nor eclectic. In this paper I adopt the view that (9) and assumptions on the IRFs and F(θ) define locally independent IRT models for p(x). I will use the random sampling rationale for the meaning of P_j(θ) whenever that is important, and will only make use of particular choices of the IRFs to define subsets M ⊆ Ω_J without regard to the substantive interpretation of these choices of IRFs in terms of stochastic subjects. Lewis (1985, 1990) gives a Bayesian analysis of dichotomous item responses that presumes neither the random sampling nor the stochastic subject rationale. His approach provides an alternative perspective on the statistical foundations of IRT models to the one developed here.
But We Don't Sample People at Random!

Let me briefly address this potential objection to the position I have adopted in this paper. In some practical situations, we are unable to specify a population of potential examinees, let alone randomly sample from such a population. Nonetheless, in building a foundation for statistical inference it is important to begin with a simple situation in which the statistical issues are relatively clear-cut. Once the problems are understood in such a context, we may then move on to more complex situations in which the sampling may be biased or for which the idea of a population of examinees may be meaningless. That is the methodological path that I will follow here. The theory of estimation and testing for the multinomial distribution is thoroughly understood in a variety of situations -- a basic paper is Birch (1964) -- and for this reason I believe it is an appropriate place to begin building the statistical foundations for item response theory models. I hope that the rest of this paper proves my point. Making the population of potential examinees an explicit part of the model runs counter to most developments of IRT models, which start with an individual (stochastic?) subject and build a probability model for his or her response vector. While the approach of starting with an individual subject may appear to obviate the need for ever mentioning an examinee population, it is my opinion that this is an illusion. Item parameters and subject abilities are always estimated relative to a population, even if this fact may be obscured by the mathematical properties of the models used. Hence, I view as unattainable the goal of many psychometric researchers to remove the effect of the examinee population from the analysis of test data (i.e., "sample-free item analysis", Wright and Douglas, 1977). The effect of the population will always be there; the best that we can hope for is that it is small enough to be ignored.
3. Loglinear Models for p(x)
It is useful to consider, briefly, some alternatives to the IRT models for p(x) defined, in general, by (9). The class of loglinear models is such an alternative. These models are specified by equations of the form

log p(x) = β_0 + Σ_{j=1}^{R} β_j b_j(x), (14)

where β_0 is a normalizing constant to insure that Σ_x p(x) = 1; the {β_j} are the loglinear model parameters; and the {b_j(x)} are known functions of x. Examples of b_j(x) that arise are

b(x) = x_i, b(x) = x_i x_j,
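A small numerical sketch of (14), using only main-effect terms b_j(x) = x_j and invented coefficients; β_0 is computed so that the probabilities sum to 1.

```python
import math
from itertools import product

J = 3
patterns = list(product([0, 1], repeat=J))

# Invented coefficients for a main-effects-only loglinear model, b_j(x) = x_j.
beta = [0.4, -0.2, 0.7]

def unnormalized(x):
    # exp of the linear predictor in (14), without the constant beta_0
    return math.exp(sum(bj * xj for bj, xj in zip(beta, x)))

# beta_0 is the normalizing constant in (14): it forces sum over x of p(x) = 1.
beta0 = -math.log(sum(unnormalized(x) for x in patterns))

def p(x):
    return math.exp(beta0) * unnormalized(x)

total = sum(p(x) for x in patterns)
```

With main effects only, this p factors into independent Bernoulli item margins; adding cross-product terms b(x) = x_i x_j introduces pairwise dependence between items.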
... ≥ p_{g_J} > 0. Next, for j = 0, 1, ..., J, let y^{(j)} be the 0/1 vector (y_1^{(j)}, ..., y_J^{(j)}) such that

y_i^{(j)} = 1 if i ≤ j, and y_i^{(j)} = 0 if i > j. (27)

If (24) holds, then all response vectors, x, except for the J + 1 vectors, {y^{(j)}}, must have p(x) = 0. The values of p(y^{(j)}) are given by

p(y^{(0)}) = 1 - p_{g_1},

p(y^{(j)}) = p_{g_j} - p_{g_{j+1}}, j = 1, ..., J - 1, (28)

and

p(y^{(J)}) = p_{g_J}.

This probability distribution puts positive probability on at most J + 1 response vectors and is called a Guttman scale after Guttman (1941, 1950). Like M_IND, the Guttman scales, p ∈ M_GUT, are completely determined by their marginal proportions correct, p_1, ..., p_J. As in M_IND, the parameter space for M_GUT is the J-dimensional unit cube K_J, defined earlier. As the p_j vary over the interval [0, 1], p = (p(x)), given by (28), traces out a J-dimensional boundary of Ω_J. It is a boundary because at most J + 1 coordinates of p are nonzero for p ∈ M_GUT. All IRT models may be thought of as being "in between" M_IND (which has no dependence) and M_GUT (which has "perfect" dependence). Both M_IND and M_GUT are characterized by their one-dimensional marginal proportions correct (p_1, p_2, ..., p_J); that is, they are J-dimensional manifolds in Ω_J. M_IND is a smoothly curved manifold while M_GUT is a piecewise linear manifold because it is part of the J-dimensional boundary of Ω_J.
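The construction in (28) is easy to verify numerically; the ordered marginal proportions below are invented for illustration.

```python
# Invented ordered marginal proportions correct, p_1 >= p_2 >= ... >= p_J > 0
# (the items are assumed already sorted from easiest to hardest).
p_marg = [0.9, 0.6, 0.3]
J = len(p_marg)

def guttman_pattern(j, J):
    # y^(j): the first j items answered correctly, the remaining J - j wrong
    return tuple(1 if i < j else 0 for i in range(J))

# Equation (28): probabilities of the J + 1 admissible Guttman patterns.
probs = {guttman_pattern(0, J): 1.0 - p_marg[0]}
for j in range(1, J):
    probs[guttman_pattern(j, J)] = p_marg[j - 1] - p_marg[j]
probs[guttman_pattern(J, J)] = p_marg[J - 1]

total = sum(probs.values())  # the sum telescopes to 1
```

The telescoping sum shows directly why the Guttman distribution is determined by the J marginals alone: every other response pattern carries probability zero.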
Parametric IRT Models

Sometimes it is easy to guess the dimension of an IRT model from a count of the number of item and ability distribution parameters minus the number of constraints on them. The usual parametric forms for P_j(θ) and F(θ) may be defined by

P_j(θ) = P_0(a_j(θ − b_j); c_j),   (29)

and

F(θ) = F_0((θ − μ)/σ; ν),   (30)

where P_0 and F_0 are specified functions, c_j denotes any additional item parameters beyond the usual location and scale parameters, b_j and a_j, and ν represents any additional ability distribution parameters beyond the location and scale, μ and σ. When a_j, b_j, μ and σ vary freely, it is easy to see that θ in (9) may be transformed to φ = (θ − μ)/σ, thereby eliminating μ and σ as free parameters. As an example, we consider the model used by Thissen (1982), in which P_j(θ) has the one-parameter logistic form (12) and F(θ) = Φ(θ/σ). By a change of scale in F, this is equivalent to the model in which F(θ) = Φ(θ) and P_j(θ) has the 2-parameter logistic form (10) with all of the a_j parameters equal, a_j = a. By a suitable choice of the sequence π^(t) = (a^(t), b_1^(t), ..., b_J^(t)), we can arrange for p(π^(t)) in this model to converge to either a point in M_IND (let
a^(t) → 0) or to a point in M_GUT (let a^(t) → ∞). In addition, by choosing the b_j^(t) correctly, we can force equality of the marginal distributions for all t,

p_j = ∫ LGT(a^(t)(θ − b_j^(t))) dΦ(θ),   for all π^(t).

Thus, for any choice of marginal proportions correct (p_1, ..., p_J) in K_J there is a curve, indexed by the a-parameter, that moves continuously from a point on M_IND with these marginal proportions to the corresponding point on M_GUT with the same marginal proportions. This curve traces out all of the points in Thissen's model that have the specified marginal proportions correct. Hence, Thissen's model is a (J + 1)-dimensional manifold in Ω_J. If F is changed to a different, but fixed, distribution (with nonzero variance), we get a different (J + 1)-dimensional manifold. At present, I do not know of any relationship among these various (J + 1)-dimensional manifolds that correspond to versions of the Rasch model with essentially different ability distributions. If there are K item parameters for each item (including location and scale) and L parameters for F, including location and scale, then the total number of parameters appearing in the formula for p(x) is KJ + L − 2. Is the dimension of the corresponding subset M of Ω_J also KJ + L − 2? In some cases the answer is clearly yes, and in others it is less clear. Birch (1964) gives an important condition that provides insight into this question. Let all of the item and ability distribution parameters be combined into a single vector parameter, π, that ranges over a parameter space of D dimensions. Let the transformation that maps π into the corresponding point p ∈ Ω_J be denoted by

p(π) = (p(x; π)),
(31)
where

p(x; π) = ∫ ∏_j P_0(a_j(θ − b_j); c_j)^{x_j} Q_0(a_j(θ − b_j); c_j)^{1−x_j} dF_0(θ; ν).   (32)
I let (32) define what I mean by a parametric IRT model. I will assume that F_0(θ; ν) is a family of distribution functions whose location and scale have been fixed. Examples are F_0(θ) = Φ(θ), or F_0(0; ν) = 1/2 and F_0(1; ν) = 3/4 for all ν. Birch's condition, given in the context of proving the consistency and asymptotic normality of maximum likelihood estimates for the multinomial distribution, is that for any ε > 0 there exists a δ > 0 such that if

||π − π*|| > ε,   then   ||p(π) − p(π*)|| > δ,   (33)

where ||·|| denotes Euclidean distance. Birch's condition prevents two values π, π*, that are not near each other in the parameter space, from giving rise to two points in the model M that are close to each other. Birch's condition (33) gives some useful insight into the structure of models in which P_j(θ) has the standard 3-parameter form,

P_j(θ) = P(θ; π_j) = c_j + (1 − c_j)P_0(a_j(θ − b_j)),   (34)

where P_0 is a continuous cumulative distribution function and F(θ) = F_0(θ) is a fixed distribution function. We envision moving the various parameters smoothly over the parameter space and seeing what happens to p(π) in Ω_J. Suppose {π^(t)} is a sequence
of parameter values that converges to a parameter value π. Suppose further that in this limit the IRFs are flat (i.e., they do not depend on θ),

lim_{t→∞} P_j(θ; π^(t)) = p_j.   (35)

In the case of (34), (35) implies that we must have

lim_{t→∞} a_j^(t) = 0,   (36)

lim_{t→∞} c_j^(t) = c_j,

and

lim_{t→∞} a_j^(t) b_j^(t) = A_j,

where

c_j + (1 − c_j)P_0(−A_j) = p_j.   (37)

Equations (36) and (37) show that many sequences of parameter values {π^(t)} that are not near each other in the parameter space give rise to points in M that are all close to the same point in M_IND, because

∏_j p_j^{x_j}(1 − p_j)^{1−x_j}   (38)
is always a point in M_IND. This is a violation of Birch's condition (33). It is easy to show that if, in (34), we restrict the parameter c_j to be a fixed value, such as 0 or 1/5, for all j, then A_j in (37) is uniquely determined by the marginal proportion p_j. This will prevent the violation of Birch's condition that can occur if the c_j are allowed to vary freely. Furthermore, if we look at sequences of parameter values that approach the Guttman scale (i.e., for which a_j is large), then the phenomenon we have just described cannot occur. This type of analysis suggests that the well-known problem of the identifiability of the c-parameter in the 3-parameter logistic model,
P_j(θ) = c_j + (1 − c_j)LGT(a_j(θ − b_j)),   (39)
occurs primarily when a_j is small, and is eliminated by choosing an a priori fixed value for all the c_j. Both of these facts are used in practice, and this analysis shows why they are true.
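A small numerical experiment illustrates the violation of Birch's condition described above. The sketch below is a hedged illustration, not a computation from the paper: it assumes LGT(z) = 1/(1 + e^{−z}), a standard normal F, and three identical items, all chosen for convenience. It evaluates p(x) by quadrature on (32) for two parameter values that are far apart in (a, b, c) space but both satisfy c + (1 − c)LGT(−ab) = 0.6, the limiting relation (37). Because a is small, the two resulting points of M are nearly indistinguishable:

```python
import numpy as np
from itertools import product

def LGT(z):
    return 1.0 / (1.0 + np.exp(-z))            # the logistic function in (39)

def pattern_probs(a, b, c, J=3, n=61):
    """p(x) for J identical 3PL items (34), F = N(0,1), by quadrature on (32)."""
    theta, w = np.polynomial.hermite_e.hermegauss(n)
    w = w / w.sum()                             # quadrature weights for N(0, 1)
    P = c + (1 - c) * LGT(a * (theta - b))      # the IRF at the quadrature nodes
    probs = {}
    for x in product([0, 1], repeat=J):
        like = np.prod([P if xi else 1 - P for xi in x], axis=0)
        probs[x] = float(np.dot(w, like))
    return probs

# Two parameter values far apart in (a, b, c) space, both chosen via (37)
# so that c + (1 - c) * LGT(-a * b) = 0.6 (all values hypothetical):
a = 0.01                                        # a small slope, as in (36)
pA = pattern_probs(a, b=-np.log(1.25) / a, c=0.1)
pB = pattern_probs(a, b=np.log(4.0) / a, c=0.5)

# Distant parameters, yet nearly the same point of M:
gap = max(abs(pA[x] - pB[x]) for x in pA)
```

The gap between the two pattern distributions is negligible even though the b and c values differ wildly, which is exactly the indeterminacy that fixing the c_j removes.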
Semiparametric IRT Models

When the P_j(θ) have a parametric form like (29), but F(θ) is allowed to vary over all distribution functions, the resulting models may be called semiparametric, after Oakes (1988). Tjur (1982) and Cressie and Holland (1983) examine the structure of the semiparametric Rasch model in which P_j(θ) has the 1-parameter logistic (Rasch) form specified in (12). They show that for this case, p(x) has the form of a loglinear model
log p(x) = α_0 + Σ_{j=1}^{J} α_j x_j + Σ_{k=2}^{J} γ_k δ_k(x),   (40)
where δ_k(x) is the function in (15) and {α_j}, {γ_k} are parameters; they also show that the γ_k are subject to a system of inequality constraints but that the {α_j} are not. Equation (40) shows that the dimension of M_RASCH for the semiparametric Rasch model is J + J − 2, because the inequality constraints do not lower the dimensionality of the {γ_k}. However, Holland (1990) gives results suggesting that, for this model, the inequality constraints mentioned above become very tight as J → ∞ and that M_RASCH approximates a (J + 1)-dimensional manifold. We showed earlier that for Thissen's model, in which P_j has the Rasch form and F(θ) = Φ(θ/σ), the dimension of M is J + 1. These two facts suggest that, at least when J is large, the dimensionality of the semiparametric models may not be easy to guess from intuitive analyses. The development of a better method for reliably computing these dimensions is a useful line of future research. The work of de Leeuw and Verhelst (1986), Follman (1988), Levine (1989), and Lindsay, Clogg, and Grego (in press), which all emphasize discrete versions of F, may prove useful here.
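One way to see numerically that F enters (40) only through the score-dependent γ_k terms is to marginalize a Rasch model over two quite different discrete ability distributions and compare the conditional distributions of x given the total score: they agree exactly, which is the basis of conditional maximum likelihood. The following sketch uses hypothetical item difficulties and ability distributions chosen only for illustration:

```python
import numpy as np
from itertools import product

def rasch_pattern_probs(b, support, weights):
    """p(x) for a Rasch model marginalized over a discrete ability
    distribution F with the given support points and weights."""
    P = 1.0 / (1.0 + np.exp(-(support[:, None] - b[None, :])))   # IRFs (12)
    probs = {}
    for x in product([0, 1], repeat=len(b)):
        like = np.prod(np.where(np.array(x) == 1, P, 1 - P), axis=1)
        probs[x] = float(np.dot(weights, like))
    return probs

b = np.array([-1.0, 0.0, 0.5, 1.5])    # hypothetical item difficulties

# Two quite different discrete ability distributions F:
pF = rasch_pattern_probs(b, np.array([-2.0, 0.0, 2.0]), np.array([0.25, 0.5, 0.25]))
pG = rasch_pattern_probs(b, np.array([-1.0, 1.0]), np.array([0.7, 0.3]))

# Conditional on the total score, the two models agree exactly: F enters
# (40) only through the score terms, so p(x | sum(x) = k) does not involve F.
for k in range(len(b) + 1):
    xs = [x for x in pF if sum(x) == k]
    condF = np.array([pF[x] for x in xs])
    condG = np.array([pG[x] for x in xs])
    assert np.allclose(condF / condF.sum(), condG / condG.sum())
```

The discrete F used here is in the spirit of the finite-support analyses of de Leeuw and Verhelst (1986) and Lindsay, Clogg, and Grego (in press) cited above.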
The Nonparametric IRT Models

When P_j(θ) and F(θ) are both allowed to vary over nonparametric classes of functions, the resulting IRT models may be called nonparametric. Examples of analyses of these models are Levine (1989), Holland and Rosenbaum (1986), Stout (1987, 1990), and Junker (1988, 1989, in press). I know of no descriptions of M for any nonparametric IRT models beyond the partial conditions given in Holland (1981), Rosenbaum (1984), and Holland and Rosenbaum (1986). The results given there do not suggest that the dimension of M is reduced. However, the conjecture of Holland (1990) is that when J is large, F and the {P_j} are sufficiently well-behaved, and θ is of dimension D, then M is approximately a (D + 1)J-dimensional manifold represented by a second-order exponential model of the form

log p(x) = α_0 + Σ_j α_j x_j + Σ_{i<j} x_i x_j λ_ij,