Automatic Generation of Quantitative Reasoning Items: A Pilot Study


Martin Arendasy¹, Markus Sommer², Georg Gittler¹, and Andreas Hergovich³

¹Differential and Personality Research Group, Faculty of Psychology, University of Vienna; ²Dr. Schuhfried GmbH, Mödling; ³Social Psychology Research Group, Faculty of Psychology, University of Vienna; all Austria

Journal of Individual Differences, 2006, Vol. 27(1), 2–14. DOI 10.1027/1614-0001.27.1.2

Abstract. This paper deals with three studies on the computer-based, automatic generation of algebra word problems. The cognitive-psychology-based generative and quality control frameworks of the item generator are presented. In Study I the quality control framework is empirically tested using a first set of automatically generated items. Study II replicates the findings of Study I using a larger set of automatically generated algebra word problems. Study III deals with the generative framework of the item generator by testing construct validity aspects of the items it produces. Using nine Rasch-homogeneous subscales of the new intelligence structure battery (INSBAT; Hornke et al., 2004), a hierarchical confirmatory factor analysis is reported, which provides first evidence of the convergent as well as divergent validity of the automatically generated items. The paper closes by discussing possible advantages of automatic item generation in general, ranging from test security issues and the possibility of more precise psychological assessment to mass testing and economic aspects of test construction.

Keywords: algebra word problems, automatic item generation, quantitative reasoning, Rasch model

Introduction

It is becoming increasingly common for psychological tests to be used in the selection of suitable job applicants (Schuler, 2000) or in career counseling. However, this also comes with an increased exposure of test content to the public. Aspects of test security have therefore become more and more important in the past several decades, leading to methodological developments that aim to increase test security. Examples of such methods are (1) the use of computerized adaptive tests (CAT; Van der Linden & Glas, 2000; Wainer, 2000), (2) the implementation of item exposure control algorithms (e.g., Harris & Jackson, 2002; Stocking & Lewis, 2000; Sympson & Hetter, 1985), (3) the use of special software packages that prevent cheating in online assessments (Edelblut, Elliot, Mikulas, & Bosley, 2002), and (4) the development of various methods of automatic item generation (AIG; for an introductory overview see Irvine & Kyllonen, 2002).

According to Bejar (2002), the various approaches to AIG can be distinguished with regard to their level of generativity, which subsumes (1) the degree of theoretical foundation of the item generator and (2) the degree of automation of the item generation process. Functional item generation is characterized by the lack of a cognitive model, while model-based item generation is based on a cognitive analysis of the latent trait to be measured. One of the best-known examples of model-based item generation is the work of Hornke and associates (e.g., Hornke, 2002), which was concerned with the model-based construction of short-term memory and figural matrices items. At the highest level of generativity, any psychometric characteristic of an item can be described using a rule-based framework. Within the latter two approaches each item can be characterized using a set of radicals and incidentals. Radicals or controlling elements refer to item features that systematically affect item parameters, such as item difficulty or dispersion, while incidentals or noncontrolling elements can be regarded as surface features, which do not exert an influence on the item parameters and can thus be used interchangeably to generate psychometrically identical but superficially different items.

Within the model-based and rule-based approaches to automatic item generation, Dennis, Handley, Bradon, Evans, and Newstead (2002) distinguish Approach 1, which is characterized by the use of a generative framework that includes a predefined set of radicals and incidentals that can be used to generate item variants (Bejar, 2002) with different psychometric characteristics (e.g., item difficulty), and Approach 2, which starts from a set of preexisting templates for the construction of isomorphs with item characteristics identical to their “mother-items.” The latter approach has been used by the Educational Testing Service (ETS) to generate quantitative reasoning items for the Graduate Record Examination (GRE) in order to extend existing item pools.

Even though automatic item generation is a comparably new field in applied psychometrics, its theoretical foundations are quite advanced. However, a question that has received little attention thus far is whether the generative framework has to be extended by an (automated) quality control framework in order to be able to generate items at a high psychometric level (Arendasy, 2004, 2005, in press). The generative framework basically includes a set of radicals derived from cognitive-psychology-based models of the latent trait to be measured (cf. Arendasy & Sommer, 2005; Arendasy, Sommer, & Ponocny, 2005) and serves as a specification of the construct-related solution strategies. In line with Greeno, Smith, and Moore (1993) and Siegler (1996), Arendasy (2005, in press) assumes that a test person's choice of solution strategy is not merely a characteristic of the test person himself, but arises from an interaction between characteristics of the respondent and characteristics of the item material. Different design features of the item material may suggest different solution strategies that can be more or less construct-related. Hence, the item material should be designed in a manner that provides an “affordance” (Greeno et al., 1993) for solution strategies related to the latent trait to be measured. Technically, the affordance of item material can be defined by the set of radicals implemented in the generative framework of an item generator. At the same time, one needs to take care that alternative solution strategies leading to differential item functioning (Holland & Wainer, 1993) are not supported by the test material. This latter goal can be achieved by defining a set of constraints (Greeno et al., 1993) that subsequently has to be implemented in the quality control framework of an item generator. The constraints can be derived from prior studies on possible causes of differential item functioning (for examples see Arendasy & Gittler, 2003) or from psychometric experiments specifically designed to investigate the impact of constraints implemented in a quality control framework (for recent examples of this approach in the ability domains “figural reasoning” and “spatial ability” see Arendasy & Gittler, 2003; Arendasy & Sommer, 2005). With regard to psychometric quality, the notion of Rasch homogeneity has to be mentioned, which leads directly to the methodological background of this paper.
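To make the radical/incidental distinction concrete, consider the following minimal sketch (our illustration; the class layout and feature names are invented and do not stem from any published generator). It represents two generated items that share all radicals and differ only in incidentals; by the logic outlined above, such items should be psychometrically equivalent variants of one another.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Item:
    # Radicals (controlling elements): systematically drive item parameters.
    n_partial_equations: int
    n_unknowns: int
    cover_story_typicality: str  # "typical" | "neutral" | "untypical"
    # Incidentals (noncontrolling elements): interchangeable surface features.
    protagonist: str
    numbers: tuple

a = Item(4, 2, "typical", protagonist="craftsman", numbers=(60, 75, 255))
b = Item(4, 2, "typical", protagonist="plumber", numbers=(40, 55, 205))

# Items sharing all radicals should be psychometrically identical,
# however different they look on the surface.
radicals = lambda x: (x.n_partial_equations, x.n_unknowns, x.cover_story_typicality)
print(radicals(a) == radicals(b))  # True
```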

Methodological Background

Item response theory (IRT; cf. Van der Linden & Hambleton, 1997) offers a measurement model that – apart from other advantages – enables one to investigate the dimensionality of test items in a theoretically sound way. The homogeneity problem of dichotomous items can be analyzed within the framework of IRT with the assistance of, e.g., the Rasch model (RM; Rasch, 1960/1980). If the RM holds true for a defined set of items and for particular groups of persons, the test is said to be “Rasch-homogeneous.” Rasch-homogeneous tests have a number of favorable psychometric characteristics (for details see Embretson & Reise, 2000; Fischer & Molenaar, 1995) that set them apart from scale constructions based on classical test theory (CTT). Two of the most important practical advantages are (1) the equality of item parameter estimates across different subgroups of respondents (commonly referred to as the “person homogeneity” assumption of the RM; see Rost, 2004) and (2) the one-dimensionality of Rasch-calibrated scales (sometimes referred to as the “item homogeneity” assumption of the RM; Rost, 2004).

The model fit of the RM can be assessed using different approaches, e.g., by well-known classical likelihood-ratio tests, which relate the likelihoods of variously restricted model estimates to each other and transfer these estimates into an asymptotically χ²-distributed statistic. However, such statistics rely on some form of parameter estimation (Glas & Verhelst, 1995). This means that they are only accurate given a sufficient sample size, which is not always available in practical research (Glas & Verhelst, 1995; Ponocny, 2001). One way to circumvent this difficulty is to use nonparametric goodness-of-fit statistics, which do not rely on parameter estimation. Ponocny (2001) recently introduced a family of nonparametric goodness-of-fit statistics that are conditional on the item and subject marginals of the data matrix. In essence, these goodness-of-fit statistics resort to a Monte Carlo shuffling algorithm that exchanges the dichotomous entries in the tetrads of a data matrix while retaining the item and subject marginals. This idea is based on the assumption that all data matrices with given item and subject marginals are equally likely if the RM fits. Ponocny (2001) developed several T-statistics, which essentially determine the proportion of certain data matrices among the matrices that are in accordance with the RM (see also Rost, 2004). The fit of a given data matrix to the Rasch model is determined by means of a p-value; when the p-value is above .05, one can assume that the investigated model assumption can be retained. The benefit of these statistics over a parametric approach resides in the fact that they constitute powerful tests of a given model assumption (e.g., person homogeneity, item homogeneity) whose precision does not depend on the sample size but on the number of simulations carried out with the Monte Carlo shuffling algorithm (cf. Ponocny, 2001, p. 445). Thus, one may speak of exact tests, because the approximation can be made arbitrarily precise in theory (Ponocny, 2001). In order to test the assumption of global person homogeneity of the RM, the statistic T10 can be used; this nonparametric approach to the investigation of person homogeneity enables the formulation of specific hypotheses regarding the items responsible for an assumed misfit, which increases the power of the test statistic (Ponocny, 2001). The assumption of item homogeneity can be checked using the statistic T2. Both statistics essentially relate the number of simulated matrices that are in accordance with the RM assumption under investigation to the total number of simulated matrices.
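To illustrate the simulation logic behind these T-statistics, the sketch below implements a simplified version of the margin-preserving shuffling step (our simplification of the general idea; T-Rasch's actual sampling scheme is more refined, see Ponocny, 2001): entries of the observed 0/1 data matrix are exchanged within 2 × 2 “tetrads” so that all row and column sums – the subject and item marginals – stay fixed, and the p-value is estimated as the proportion of hits among the simulated matrices, as reported in the Sim/Hits columns of Table 2 below.

```python
import numpy as np

rng = np.random.default_rng(1)

def tetrad_swap(X, n_swaps=1000):
    """Shuffle a dichotomous data matrix while preserving row and column sums
    by flipping 2x2 'tetrads' of the form [[1,0],[0,1]] <-> [[0,1],[1,0]]."""
    X = X.copy()
    n_subjects, n_items = X.shape
    for _ in range(n_swaps):
        r = rng.choice(n_subjects, size=2, replace=False)
        c = rng.choice(n_items, size=2, replace=False)
        sub = X[np.ix_(r, c)]
        # Only diagonal patterns can be flipped without changing the marginals.
        if sub[0, 0] == sub[1, 1] and sub[0, 1] == sub[1, 0] and sub[0, 0] != sub[0, 1]:
            X[np.ix_(r, c)] = 1 - sub
    return X

def mc_p_value(X, stat, n_sim=6000):
    """Estimate p as the proportion of margin-preserving simulated matrices whose
    statistic is at least as extreme as the observed one ('Hits' / 'Sim')."""
    observed = stat(X)
    hits = sum(stat(tetrad_swap(X)) >= observed for _ in range(n_sim))
    return hits / n_sim
```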


An additional and elegant way of testing RM fit is provided by the fit indices associated with the mixed Rasch model (MRM; Rost, 2004). The MRM is less restrictive and can be seen as an extension of the RM. From a mathematical point of view, the MRM constitutes a generic model that integrates latent class analysis and the RM and thus includes a qualitative and a quantitative component. The model assumes that one-dimensionality does not hold for the sample as a whole but only for qualitatively different classes within the sample. These qualitative differences are reflected either in the order of, or in the differences between, the item parameters. The model equation of the MRM is the RM equation extended by the class membership g. The probability that a subject s belonging to class g solves item i is specified in Equation 1.1:

P(X_{is} = 1 \mid \theta_s, \beta_i, g) = \frac{\exp(\theta_{sg} - \beta_{ig})}{1 + \exp(\theta_{sg} - \beta_{ig})} \qquad (1.1)

As can be seen in Equation 1.1, the probability of a correct answer depends on a class-specific latent trait parameter (e.g., ability) as well as class-specific item parameters. The MRM can easily be transformed into the RM by assuming that the number of latent classes is 1. This means that a one-class solution is equivalent to a fit of the RM (Rost, 2004). The main advantage of this approach to RM fit testing is that, if a one-class solution fits the empirical data, no partitioning criterion exists for which the RM does not hold true. For practical aspects of the application of the MRM see Rost (2004).
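Equation 1.1 can be transcribed directly into a few lines of code (purely illustrative); with a single latent class, the class index g becomes redundant and the expression reduces to the ordinary Rasch model:

```python
import math

def p_correct(theta_sg, beta_ig):
    """Equation 1.1: probability that subject s belonging to latent class g
    solves item i, given class-specific person and item parameters."""
    return math.exp(theta_sg - beta_ig) / (1 + math.exp(theta_sg - beta_ig))

# With one latent class (g = 1) this is simply the Rasch model:
print(round(p_correct(theta_sg=0.5, beta_ig=-0.25), 3))  # 0.679
```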

Application to the Measurement of “Quantitative Reasoning”

The theoretical considerations about the construction of item generators outlined above are applied to the automatic generation of quantitative reasoning items using an isomorph approach. Quantitative reasoning constitutes a second-order factor within the modified Gf-Gc theory proposed by Horn (1989; Horn & Noll, 1997) as well as the latest version of the three-stratum theory of intelligence (Bickley, Keith, & Wolfle, 1995; Carroll, 1993). This second-order factor has been consistently identified in several studies investigating the structure of intelligence (e.g., Bickley et al., 1995; Carroll, 1993; Gustafsson, 1984; Horn & Noll, 1997; McGrew & Woodcock, 2001; Roid, 2003; Undheim & Gustafsson, 1987) and has proved to be a significant predictor of academic success (Ackerman, 1989). According to Horn and Noll (1997), quantitative reasoning comprises respondents' number sense as well as their understanding of basic arithmetic and algebraic principles and their ability to apply them in a problem-solving context.

In the past decades several researchers have investigated the cognitive processes used to solve algebra word problems (e.g., Hall, Kibler, Wenger, & Truxaw, 1989; Hinsley, Hayes, & Simon, 1977; Kintsch, 1998; Koedinger & Nathan, 2004; Lane, 1991; Mayer, Larkin, & Kadane, 1984; Nathan, Kintsch, & Young, 1992; Reusser, 1989; Tirre & Pena, 1993) as well as the item features influencing their difficulty (e.g., Enright, Morley, & Sheehan, 2002; Koedinger & Nathan, 2004; Lane, 1991; Lepik, 1990; Reed, 1987; Sebrecht, Enright, Bennett, & Martin, 1996; Shalin & Bee, 1985; Yerushalmy & Gilead, 1999). The results of these studies provide valuable insights for the automatic generation of algebra word problems. According to Sebrecht et al. (1996) there are two primary sources of item difficulty in algebra word problems: working memory (Baddeley, 1992) and mathematical as well as real-world knowledge. During the solution of algebra word problems, the mathematical and real-world knowledge retrieved from long-term memory has to be reconciled with the information from the problem text in working memory to derive a solution plan (Koedinger & Nathan, 2004). Based on the literature on algebra word-problem solving, the solution process can be divided into five phases (Kintsch, 1998; Koedinger & Nathan, 2004; Mayer et al., 1984).

In the first phase, respondents resort to their linguistic knowledge to interpret what is being stated in the given problem text. This step is often referred to as problem translation (Mayer et al., 1984). The result is an internal representation of the problem (“text base”; Kintsch, 1998), the construction of which is most likely associated with crystallized intelligence rather than quantitative reasoning (Horn, 1989; Horn & Noll, 1997). As a consequence, grammatical and linguistic complexity has to be restricted by a corresponding constraint in the quality control framework to ensure that uncommon phrases and grammatically complex formulations are not used by the item generator.

In the second phase the respondents start to elaborate the text base by retrieving mathematical and real-world knowledge using problem schemata from long-term memory. According to Mayer (1981) there are different schemata for various kinds of algebra word problems. In order to avoid that individual differences in respondents' general word-problem schema (e.g., Reusser & Stebler, 1997) exert a differential effect on the psychometric item properties, crucial information, such as the assumption of constant rates, has to be made explicit in the word-problem text. This feature can be regarded as an essential constraint that ensures the unambiguousness of the problem text. According to Mayer (1981), each algebra word-problem family can be characterized by a unique set of basic schemata (e.g., speed = distance/time). Based on this theoretical consideration we assumed that the use of a more varied set of algebra word-problem families and their various subtypes could result in a violation of the assumption of item homogeneity. Thus, a constraint regarding the algebra word-problem types was implemented, which restricted the generation process to various types of “rate problems,” excluding in advance other algebra word-problem types, such as geometry and probability word problems. According to Bernardo (1994), the algebra word-problem schemata retrieved from long-term memory also contain information about their typicality, which can be used to facilitate their retrieval. Bernardo (1994), Blessing and Ross (1996), and Enright et al. (2002) have been able to show that the typicality of the cover story – measured as the relative frequency with which a certain cover story is used for a certain algebra word-problem type (Mayer, 1981) – influences the item difficulty of a given algebra word problem. This item feature can therefore be used as a radical by integrating it into the generative framework of the item generator.

The new information derived from a problem schema needs to be reconciled with the information from the text base. The result of this so-called problem integration phase (Kintsch, 1998) is a situation model (Kintsch, 1998; Nathan et al., 1992; Reusser, 1989) that represents the respondents' understanding of the situation described in the word-problem text. Theoretically, the construction of the situation model could be complicated by altering the “presentational structure” (e.g., Koedinger & Nathan, 2004; Staub & Reusser, 1995) of the information presented in the word-problem text. However, this alteration would introduce a source of variance in the item difficulty parameters that is associated with respondents' text comprehension ability instead of their quantitative reasoning ability. Thus, a constraint has to be implemented in the quality control framework that fixes the presentational structure in a manner ensuring that all word problems are presented in an “order naturalis” (Staub & Reusser, 1995), representing respondents' preferred information presentation order (which is in line with the actual time frame of the events described in the problem text).

The third phase refers to the actual mathematization process. The situation model serves as a starting point for the construction of the mathematical problem model (Reusser, 1989) by providing the functional constraints that need to be kept in mind during the mathematization process (Kintsch, 1998; Nathan et al., 1992). According to Hall et al. (1989; Nathan et al., 1992; Koedinger & Nathan, 2004), this phase constitutes the core of “quantitative reasoning” (Horn, 1989; Horn & Noll, 1997). Erroneous mathematical problem models mainly result from ignoring central affordances and constraints in the situation model (Reusser, 1989). The number of functional constraints and affordances in a given algebra word problem can be quantified by a detailed analysis of the episodic (Shalin & Bee, 1985; Hall et al., 1989) or algebraic (Enright et al., 2002; Reed, 1987) structure of the given algebra word problem. Both procedures result in a number of functional relations between the elements of an algebra word problem (Hall et al., 1989; Shalin & Bee, 1985) corresponding to the number of partial equations that can be set up for these relations (Enright et al., 2002; Reed, 1987). Sebrecht et al. (1996), Lane (1991), and Enright et al. (2002) have been able to demonstrate that the number of partial equations significantly affects the difficulty of an algebra word problem. This item feature is integrated as a radical in the generative framework.

In the fourth phase the respondents derive a solution plan from the mathematical problem model (Reusser, 1989) using their strategic knowledge. This phase is commonly referred to as solution planning and monitoring (Mayer et al., 1984). Respondents may resort to various solution strategies in order to solve a given algebra word problem. These solution strategies can roughly be divided into formal-algebraic solution strategies, which are characterized by a translation of the mathematical problem model into a solution equation (e.g., Bobrow, 1968; Nhouyvanisong, 1999; Paige & Simon, 1966), and informal solution strategies, which resort primarily to model-based reasoning grounded in the respondents' situational understanding of the algebra word problem (e.g., Hall et al., 1989; Katz, Friedman, Bennett, & Berger, 1996; Koedinger & Nathan, 2004; Nathan et al., 1992; Nhouyvanisong, 1999). In contrast to formal-algebraic solution strategies, informal strategies do not require such a mathematical formalization. These two types of solution strategies are associated with different demands on respondents' understanding of the principles of algebra. Research conducted by Hall et al. (1989), Katz et al. (1996), Koedinger and Nathan (2004), and Tabachneck et al. (1994) indicates that formal-algebraic solution strategies are seldom used; most respondents predominantly resort to informal solution strategies when solving algebra word problems. Only when faced with more complex algebra word problems are formal-algebraic solution strategies commonly used. The algebraic complexity of a given algebra word problem thus constitutes another radical for the automatic generation of algebra word problems. According to Enright et al. (2002), algebraic complexity can be measured by the number of unknown elements and the number of embedded unknown elements in a solution-relevant equation. These authors have also been able to demonstrate that such a measure of complexity is significantly related to the item difficulty parameter of a three-parameter Birnbaum model. The total number of unknown elements in the solution-relevant equation is integrated as a further radical in the generative framework.

In the final phase the respondents calculate the solution to a given algebra word problem using their knowledge of arithmetic procedures. This solution step is commonly referred to as solution execution (Mayer et al., 1984).

Based on the theoretical considerations regarding automatic item generation and algebra word-problem solving outlined above, the aim of the present study was the construction of an item generator for algebra word problems using a schema-based approach and the empirical evaluation of its products. In line with these considerations, the item generator consists of a combination of quality control and generative frameworks to ensure the construction of algebra word-problem items at a well-defined psychometric level (see Method below).
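Taken together, these radicals suggest an additive decomposition of item difficulty in the spirit of a linear logistic test model. The sketch below only illustrates this idea; the weights are invented placeholders, not parameters estimated in the studies reported here.

```python
# Illustrative only: an LLTM-style decomposition of item difficulty into the
# radicals named above. All weights are invented placeholders.
WEIGHTS = {
    "partial_equations": 0.40,    # more partial equations -> harder
    "inferred_relations": 0.55,   # relations not stated in the text -> harder
    "unknowns": 0.35,             # more unknown elements -> harder
    "atypical_cover_story": 0.30  # untypical cover stories hamper schema retrieval
}

def predicted_difficulty(partial_eqs, inferred, unknowns, atypical, intercept=-1.0):
    return (intercept
            + WEIGHTS["partial_equations"] * partial_eqs
            + WEIGHTS["inferred_relations"] * inferred
            + WEIGHTS["unknowns"] * unknowns
            + WEIGHTS["atypical_cover_story"] * atypical)

# E.g., an item with 4 partial equations, 1 inferred relation, 2 unknowns,
# and a typical cover story:
print(predicted_difficulty(4, 1, 2, 0))  # 1.85 on the latent difficulty scale
```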

Method

General Description of the Item Generator and Its Products

Based on the theoretical model outlined above, a set of constraints and radicals was defined and implemented into the quality control and generative frameworks of the item generator. Unlike item generators using an item-variants approach, the constraints and radicals are embedded into the templates of the rate word-problem types described by Mayer (1981). The technical construction of the items was done using the item generator AGen (Arendasy, 2004). Each template in the item generator consists of a set of mandatory and optional sentences presented in an “order naturalis” (Kintsch, 1998; Staub & Reusser, 1995), whereby the optional sentences are used to increase the number of partial equations, the number of unknown elements in the solution equation, or both. Technically speaking, AGen can be considered a combination of an item-variant approach and an item-isomorph approach (Dennis et al., 2002). Each template was checked for ambiguity by an expert in German philology and an experienced test instruction designer; only when both experts agreed that an item template was unambiguous was the template included in the item generator. Furthermore, the uniqueness of the solution for each algebra word problem generated was ensured by means of a rule scanner, which checks that any item generated can only be accurately solved by a single equation or set of equations (see Arendasy, in press). The rule scanner is thus an additional constraint implemented in the item generator. Table 1 provides an overview of the constraints and radicals implemented in the item generator. The actual numerical values as well as the variable parts of the cover story, such as the precise objects involved, served as incidentals. AGen generates algebra word problems using the radicals and constraints outlined in Table 1. Furthermore, the item generator produces an item-specific protocol that includes all relevant item characteristics as well as the parameters derived from them.
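Since AGen's internals are not documented beyond this description, the following sketch is a hypothetical reconstruction of the schema-based generation logic: a template whose incidental slots are filled at random, radicals that are fixed by construction, and a stand-in for the rule-scanner constraint ensuring a unique solution. All names, value ranges, and the backward construction of the target value are our own assumptions.

```python
import random

# Hypothetical rate-problem template; slots in braces are incidentals.
TEMPLATE = ("A {worker} gets paid EUR {base} during the first {h0} hours of work. "
            "For each additional hour, he demands EUR {rate}. How many hours will "
            "he have to work in order to earn EUR {total}?")

WORKERS = ["craftsman", "plumber", "electrician"]  # incidental cover-story element

def unique_solution(base, rate, h0, total):
    # Stand-in for the rule scanner: the linear equation
    # base + rate * (h - h0) = total has exactly one solution whenever rate != 0.
    return rate != 0

def generate_item(rng):
    worker = rng.choice(WORKERS)
    base, rate, h0 = rng.choice([40, 60, 80]), rng.choice([55, 75, 95]), 3
    extra = rng.randint(1, 4)
    total = base + rate * extra  # constructed backwards -> integer answer
    assert unique_solution(base, rate, h0, total), "rule-scanner constraint violated"
    item = TEMPLATE.format(worker=worker, base=base, h0=h0, rate=rate, total=total)
    return item, h0 + extra      # item text plus its keyed answer

print(generate_item(random.Random(7)))
```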

General Procedure

In order to investigate the products of the new item generator AGen, a series of studies investigating dimensionality, reliability, and construct validity was carried out. For the first two studies, the test was administered using TestWeb (Arendasy, 2002), which enables web-based psychological assessment. The first two studies were carried out during university seminars held by the first author in small groups of no more than three persons. A graduate student test administrator was personally present during each administration of the test. The respondents were instructed that they would see several algebra word problems and should find the numerical value of the unknown element. The task used a free-response item format to ensure that guessing would not affect the estimation of the item parameters in the way a multiple-choice format could. All algebra word-problem items were presented in a power setting without any time limitations. The respondents were able to correct their answer until they pressed the “next” button to reach the next item of the test. They were not allowed to go back to an already solved item to correct their answer.

The third study was conducted using the Vienna Test System (VTS). In this study the items used in Study I were administered as a subscale of the new intelligence structure battery (INSBAT; Hornke et al., 2004), which is based on the modified Gf-Gc theory proposed by Horn (1989; Horn & Noll, 1997). In addition, nine other subscales serving as indicators for the stratum-two factors Gq (quantitative reasoning), Gc (crystallized intelligence), Gf (fluid intelligence), Gltm (long-term memory), and Gstm (short-term memory) were administered to investigate the nomothetic span (Embretson, 1983) of the new algebra word-problem test described in this paper.

Study I

Study I was designed to examine the necessity of a constraint restricting the algebra word-problem families implemented in the quality control framework of the item generator. According to Mayer (1981), different types of algebra word-problem families require different basic schemata to be solved. We thus assumed that the necessity to resort to different schemata could result in a violation of the assumption of item homogeneity. In order to investigate this assumption, a total of k = 8 rate word problems was generated using the set of four radicals outlined in Table 1. Furthermore, k = 8 probability word problems were generated using the same set of radicals with slight adjustments for this specific problem type (see also Enright et al., 2002). These k = 16 algebra word problems were presented together in a predefined, fixed item order that was determined randomly. The instruction given to the respondents in this study was identical to the one outlined in the description of the general procedure.

Sample

The sample consisted of 46 (46.5%) male and 53 (53.5%) female respondents aged 18 to 45 years (M = 28.51; SD = 6.66). A total of 11 (11.1%) respondents had completed 9 or 10 years of schooling without completing professional training (ISCED education level 2), 32 (32.3%) respondents had completed professional training or technical school (ISCED 3), 39 (39.4%) respondents had graduated from high school (ISCED 4), and 17 (17.2%) respondents had graduated from college or university (ISCED 5).

Table 1. Generative and quality control framework of the item generator AGen (above) and an example item (below)

Quality control framework (constraint – description – source):
– Linguistic and grammatical complexity: uncommon phrases and grammatically complex sentences are not used (Horn, 1989; Horn & Noll, 1997; Reusser, 1989).
– General word-problem schema: real-world knowledge that is commonly but not necessarily assumed in algebra word-problem solving is made explicit (Reusser & Stebler, 1997).
– Presentational order: all word problems are presented in an “order naturalis” (Kintsch, 1998; Staub & Reusser, 1995).
– Constructed response format: reduces the guessing probability (Enright et al., 2002; Sebrecht et al., 1996).

Generative framework (radical – description – source):
– Typicality of the cover story: the typicality of each cover story for all rate-problem types is categorized as typical, neutral, or untypical (Bernardo, 1994; Enright et al., 2002; Mayer, 1981).
– Number of partial equations: the total number of partial equations (Enright et al., 2002; Hall et al., 1989; Koedinger & Nathan, 2004; Lane, 1991; Reed, 1987; Shalin & Bee, 1985; Sebrecht et al., 1996).
– Number of relations between elements to be inferred: the relative number of partial equations not explicitly mentioned in the problem text (Koedinger & Nathan, 2004).
– Total number of unknown elements: the total number of unknown elements in the solution equation (Enright et al., 2002; Hall et al., 1989; Katz et al., 1996; Koedinger & Nathan, 2004; Tabachneck et al., 1994).

Example item: “A craftsman gets paid €60 during the first 3 hours of work. For each additional hour, he demands €75. How many hours will he have to work in order to earn €255?” Radical categorization: typicality of the cover story: typical; number of partial equations: 4; number of relations between elements to be inferred: 1; total number of unknown elements: 2.
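For the example item in Table 1, the final solution equation can be set up as follows (one plausible formalization, assuming that the €60 covers the first three hours in total; the sympy transcription is ours and not part of the test materials). Note that the radical counts in Table 1 refer to the decomposed representation with its partial equations and unknown elements, not to this single collapsed equation.

```python
from sympy import Eq, solve, symbols

h = symbols("h")  # total hours worked (the queried unknown element)

# Assumed reading: EUR 60 for the first 3 hours in total, EUR 75 per extra hour.
equation = Eq(60 + 75 * (h - 3), 255)
print(solve(equation, h))  # [28/5] -> 5.6 hours under this reading
```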

Results

In order to investigate whether rate-type algebra word problems and probability-type algebra word problems violate the assumption of person homogeneity, we calculated various T10 statistics for the combined scale. The calculations were done with the software program T-Rasch (Ponocny & Ponocny-Seliger, 1999). Person homogeneity was assessed using the partitioning criteria raw score, sex, age, and educational level. In this analysis the α-level was set to .01 in advance because of the number of model tests. The results of the T10 statistics are presented in Table 2. The empirical p-value for the partitioning criterion educational level is below α = .01, while the p-values for the remaining partitioning criteria are nonsignificant. This indicates that the items show different difficulties for respondents of different educational levels. An analysis of the p-values for the individual items reveals that, relative to the other items, the probability word problems are much more difficult for respondents with an educational ISCED level up to three. In addition to the investigation of the assumption of person homogeneity, we carried out an analysis with T2 statistics to check for the possible presence of item heterogeneity in the item set. The items were divided into two subgroups consisting of (1) rate-type algebra word problems and (2) probability-type algebra word problems.


Table 2. Study I: Means and standard deviations of the raw scores and Rasch model goodness-of-fit tests (T10 statistics) for the combined rate word-problem and probability word-problem scale

Model test criterion    Sim     Hits    p-value
Ability level           6000    414     .069
Gender                  6000    2609    .435
Age                     6000    1212    .202
Educational level       6000    54      .009

Note. Sim denotes the number of simulations; Hits denotes the number of matrices that are in accordance with the Rasch model.

Since the T10 statistic for the partitioning criterion “educational level” reached the level of significance, we calculated the T2 statistics separately for respondents with an educational ISCED level up to three and respondents with an educational ISCED level of four or five. In the case of respondents with an educational level up to three, the T2 statistic results in a p-value of .0028 (Sim: 6000; Hits: 17), while the T2 statistic for respondents with an educational level of four or five results in a p-value of .048 (Sim: 6000; Hits: 288). Even though the deviation from the assumption of item homogeneity is more pronounced in the case of respondents with an educational level up to three, both T2 statistics are statistically significant at α = .05. Based on this result it seems reasonable to assume that algebra word problems classified as rate word problems and algebra word problems classified as probability word problems measure qualitatively different abilities, as these two algebra word-problem types belong to two different families of algebra word problems (Mayer, 1981). Taken together, the results indicate that person homogeneity cannot be assumed for the combined rate and probability word-problem scale with regard to respondents' educational level. Furthermore, the presence of item heterogeneity between rate and probability word problems points to qualitative differences in the processing of these two families of algebra word problems. Because of this result, the implementation of a constraint regarding the algebra word-problem type seems justified. Based on these results, a decision was made to retain rate word-problem types, since this item type is one of the most commonly used word-problem types in current studies on cognitive processes in algebra word-problem solving.

Study II

Study II was conducted to examine the dimensionality of a set of rate algebra word problems generated with AGen. A total of k = 15 rate word problems was generated using the set of four radicals outlined in Table 1. The number of partial equations varied from two to four (M = 2.8; SD = 0.68), while the number of partial equations not explicitly mentioned in the problem text ranged from zero to two (M = 0.89; SD = 0.27). Furthermore, the total number of unknown elements in the equation varied from one to five (M = 2.47; SD = 1.25). Three different kinds of cover stories (typical, neutral, and atypical) were used to vary the typicality of the cover stories for the selected rate problems. The typicality of the cover story was determined individually for each template using the frequency norms reported by Mayer (1981). The main purpose of Study II was to generate a set of items covering a broad difficulty range and to investigate whether the built-in quality control framework of AGen suffices to generate algebra word-problem items at a high psychometric level. The instruction given to the respondents was identical to the one described in the general procedure of this paper.

Sample

The sample consisted of 147 (52.9%) male and 131 (47.1%) female respondents with ages ranging from 2 to 70 years (M = 34.34; SD = 12.55). A total of 18 (6.5%) respondents had completed less than 9 years of schooling (ISCED 1), 41 (14.7%) respondents had completed 9 or 10 years of schooling without completing professional training (ISCED 2), 119 (42.8%) respondents had completed professional training or technical school (ISCED 3), 59 (21.2%) respondents had graduated from high school (ISCED 4), and 41 (14.8%) respondents had graduated from college or university (ISCED 5).

Results

The k = 15 automatically generated rate algebra items yielded a Cronbach's α of .87, indicating a sufficient degree of measurement precision from the viewpoint of classical test theory. We further investigated the psychometric properties of this new item set by testing its one-dimensionality with the Rasch model. The assumption of person homogeneity was assessed through the calculation and comparison of three MRMs with one to three latent classes, respectively. Because of the low frequency of some answer vectors, a parametric bootstrap (von Davier, 1996) was used to estimate the model parameters. The results of the MRM analysis can be seen in Table 3.

Table 3. Study II: Goodness-of-fit indices and information criteria for MRMs with different numbers of latent classes for the scale “Algebraic Reasoning”

Number of classes    p (Pearson χ²)*    BIC        CAIC
1                    .062               1793.36    1809.36
2                    .080               1815.57    1848.57
3                    .020               1868.97    1918.97

*Comparison with the saturated model.


As can be seen in Table 3, neither the one-class solution nor the two-class solution deviates significantly from the saturated model. Using the information criteria CAIC (Consistent Akaike Information Criterion; Rost, 2004) and BIC (Bayesian Information Criterion; Rost, 2004), the one-class solution appears to be the more parsimonious model. Based on this result it can be assumed that there is no partitioning of the sample that would result in a violation of the assumption of person homogeneity of the RM. In addition, the assumption of item homogeneity was assessed by dividing the item set into two subgroups using the criteria (1) typicality of the cover story, (2) number of partial equations, and (3) total number of unknown elements in the equations. These three partitioning criteria were assumed to be the main radicals based on the theoretical model of algebra word-problem solving outlined above. With regard to the partitioning criterion “typicality of the cover story,” the Martin-Löf statistic resulted in a χ²-value of 51.298 with df = 55 and p = .617. For the partitioning criterion “number of partial equations” the corresponding χ²-value amounts to 15.903 with df = 35 and p = .997, while the Martin-Löf statistic for the partitioning criterion “number of unknown elements in the equation” resulted in a χ²-value of 44.751 with df = 49 and p = .646. Thus, all three Martin-Löf statistics resulted in nonsignificant χ²-values; the corresponding effect size estimates suggested by Müller-Philipp and Tarnai (1989) all lie well below .30 (min = .098, max = .196). These results indicate that the assumption of item homogeneity can be retained. In sum, these results indicate that the 15 rate word problems used in this study indeed measure a one-dimensional latent trait.
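For readers who want to retrace the model selection in Table 3: BIC and CAIC penalize the log-likelihood by the number of estimated parameters (Rost, 2004). The sketch below uses the standard formulas with made-up inputs, since the log-likelihoods and parameter counts of the Study II models are not reported above.

```python
import math

def bic(log_lik, n_params, n_obs):
    """Bayesian Information Criterion: -2 lnL + k * ln(N)."""
    return -2 * log_lik + n_params * math.log(n_obs)

def caic(log_lik, n_params, n_obs):
    """Consistent AIC: -2 lnL + k * (ln(N) + 1)."""
    return -2 * log_lik + n_params * (math.log(n_obs) + 1)

# Hypothetical inputs; the model with the smaller value is preferred, which is
# how the one-class (Rasch) solution wins out in Table 3.
print(round(bic(log_lik=-880.7, n_params=16, n_obs=278), 2))   # 1851.44
print(round(caic(log_lik=-880.7, n_params=16, n_obs=278), 2))  # 1867.44
```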

Study III

This study was carried out to investigate the nomothetic span (Embretson, 1983) of the automatically generated algebra word problems using confirmatory factor analysis. Based on the Gf-Gc theory (Horn, 1989; Horn & Noll, 1997) and the three-stratum theory (Carroll, 1993), we assumed that the new items constitute a measure of the second-stratum factor Gq. In order to test this assumption, we administered a test battery consisting of 10 subtests measuring six of the second-stratum factors of the Gf-Gc theory and the three-stratum theory, respectively. The tests were administered in a predetermined fixed order in two sessions lasting 1 to 1.5 h each, including a 10-min break. The respondents were tested in groups of one to eight. All tests were computerized and administered using the Vienna Test System.

Measures

The test battery used in this study comprised two measures of quantitative reasoning (Gq), two measures of fluid intelligence (Gf), two measures of crystallized intelligence (Gc), two measures of short-term memory (Gstm), and one measure each of long-term memory (Gltm) and visual processing (Gv). All subtests were taken from the INSBAT (Hornke et al., 2004), which has been shown to conform to the RM in previous studies (for a summary of the subtests used in the test battery as well as example items see Hornke et al., 2004; the test manual can be ordered free of charge at [email protected]). Thus, person parameters based on the RM were used for each subtest in the analysis.

Tested Models

Based on the Gf-Gc theory (Horn, 1989; Horn & Noll, 1997) and the three-stratum theory (Bickley et al., 1995; Carroll, 1993), we assumed that the observed correlations between these 10 subtests can be explained by the Cattell-Horn-Carroll model (CHC model), which assumes six factors and a higher-order g-factor. According to this theoretical model, the subtests Algebraic Reasoning and Computational Estimation should load on Gq, while the subtests Figural-Inductive Reasoning and Verbal-Deductive Reasoning should load on Gf. Verbal Comprehension and Verbal Production were assumed to constitute the factor Gc, while the subtests Verbal Short-Term Memory and Visual Short-Term Memory should load on the second-stratum factor Gstm. The subtests Spatial Comprehension and Long-Term Memory were assumed to load on Gv and Gltm, respectively. Since the latter two factors are each marked by only a single subtest, these two subtests were split into halves using an odd-even split procedure. Furthermore, the correlations between the latent factors should be explained by a higher-order g-factor. This model was contrasted with a model that assumes that the correlations between the 10 subtests can be explained by a single latent factor, referred to as the g-factor model.

Sample

The sample consisted of 101 (51.5%) male and 95 (48.5%) female respondents aged 17 to 65 years (M = 37.74; SD = 11.94). A total of 17 (8.5%) respondents had completed 9 years of school but no vocational training, while 75 (38.3%) respondents had also completed vocational school (ISCED 1 to 2). In addition, 84 (42.9%) respondents had a high school leaving certificate with university entrance permission (ISCED 3) and 21 (10.7%) respondents had graduated from university or college (ISCED 5 to 6).

Results

Confirmatory factor analyses were conducted with AMOS 5.0 (Arbuckle, 2003) using maximum likelihood estimation. The means, standard deviations, and reliability estimates of all measures are presented in Table 4.


Table 4. Study III: Descriptive statistics for all measures (above) and fit statistics for the hierarchical confirmatory factor analysis (below)

Measure    Mean     SD      Skew    Kurtosis    rtt
AD         –.14     0.72    .41     .39         .88a
CE         –.52     1.66    –.60    1.10        .71a
FID        –1.04    1.02    .18     .44         .70b
VD         –.10     1.02    .82     1.63        .77a
VC         .74      1.07    –.84    1.04        .75b
VP         .02      1.31    –.25    .20         .70a
VI         .00      1.92    –.37    .91         .70b
VE         –.39     1.05    .13     –.50        .70b
RV         –.33     1.89    .28     –.18        .84a
LM         –.78     1.02    .21     –.21        .72a

Model       χ²        df    p       χ²/df    CFI    RMSEA    SRMR
CHC         64.64     49    .064    1.32     .98    .04      .049
g-factor    245.77    54    –       –        –      –        –

Based on the skewness and kurtosis values, the assumption of normality can be retained. The global fit of the models was assessed using the following cut-off values for the global fit indices in order to decide whether a model fits the data: a nonsignificant χ²-test (Marsh, Hau, & Wen, 2004), χ²/df < 2 (Byrne, 2001), RMSEA ≤ .06 (Browne & Cudeck, 1993; Hu & Bentler, 1999; Marsh et al., 2004), SRMR ≤ .06 (Byrne, 2001; Hu & Bentler, 1999; Marsh et al., 2004), and CFI ≥ .95 (Byrne, 2001; Hu & Bentler, 1999; MacCallum et al., 1996; Marsh et al., 2004). The fit statistics of the CHC model and the g-factor model are presented in Table 4 (below). The CHC model provides a rather good fit to the data, as indicated by all five global fit indices. None of the standardized residuals was statistically significant (all < 1.96) at α = .05. In contrast, none of the fit indices meets the standard criteria for an adequate model fit in the case of the g-factor model. It is thus hardly surprising that the CHC model fits the data significantly better than the g-factor model (Δχ²[5] = 180.93; p < .001). The statistical significance of the factor loadings of the CHC model was also examined. The factor loadings of the individual subtests on the corresponding stratum-two factors were all moderate to high (λ > .60) and reached statistical significance at α = .01. The standardized factor loadings are depicted in Figure 1. As can be seen in Figure 1, the subtest Algebraic Reasoning has the highest factor loading on Gq, followed by Computational Estimation. The factor Gf is primarily defined by the subtest Figural-Inductive Reasoning, followed by Verbal-Deductive Reasoning, while Gc is primarily marked by the subtest Verbal Comprehension, followed by the subtest Verbal Production. The factor Gstm is marked by the subtest Verbal Short-Term Memory, followed by Visual Short-Term Memory […] .05). In order to test whether Gf and the g-factor are indeed indistinguishable, the model was recalculated with the additional restriction that the standardized factor loading from the g-factor to Gf equals 1. The restricted model yielded χ²[48] = 63.58 and p = .065. A Δχ² test was used to investigate whether the restricted CHC model fits the data less well than the unrestricted model. The result (Δχ²[1] = 1.03; p = .311) indicates that the restricted CHC model fits no worse than the unrestricted CHC model. It can thus be concluded that Gf and the g-factor are indeed indistinguishable. This result is similar to the one obtained by Gustafsson (1984; Undheim & Gustafsson, 1987). Taken together, the results of this study confirm the hypothesis that the subtest Algebraic Reasoning represents a marker of the second-stratum factor Gq and thus indicate that the nomothetic span of the new item material can be assumed.
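The two nested-model comparisons reported above can be retraced from the printed χ² values alone; a short check using scipy (the Δχ² values are taken from the text, with the Δdf values implied there):

```python
from scipy.stats import chi2

# CHC model vs. g-factor model (delta-df = 54 - 49 = 5):
print(chi2.sf(180.93, df=5))  # on the order of 1e-37 -> p < .001

# Unrestricted CHC model vs. CHC model with the g -> Gf loading fixed at 1:
print(chi2.sf(1.03, df=1))    # ~= 0.310 -> the restriction fits no worse
```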

Discussion

The present paper investigated whether algebra word problems can be automatically generated using an extended schema-based approach to automatic item generation, which includes a quality control framework to enhance the likelihood that the automatically generated algebra word problems are in accordance with the RM (Rasch, 1980) and thus have a high psychometric quality. Based on a review of current approaches to automated item generation, the schema-based approach utilized in this paper was presented as an instance of Approach 2 (Dennis et al., 2002), which starts from a set of preexisting templates for the construction of item isomorphs featuring the same item characteristics (radicals) as their “mother-items.” Following Arendasy (in press; Arendasy & Sommer, 2005), this schema-based approach was extended by incorporating a quality control framework in the item generator. The quality control framework essentially consists of a set of item features/rules that are assumed to (1) be associated with cognitive processes not related to the latent trait to be measured, or (2) possibly result in differential item functioning. In the schema-based approach to automatic item generation these constraints were directly embedded into the templates. The present series of studies was thus designed to investigate (1) the necessity of constraints regarding the inclusion or exclusion of different algebra word-problem families in the templates, (2) the efficiency of the constraints built into the quality control framework, and (3) the nomothetic span of the automatically generated algebra word problems.

In Study I a set of eight rate word problems was automatically generated and presented together with eight probability word problems in order to investigate the necessity of a constraint regarding the restriction to a single algebra word-problem family (Mayer, 1981). The results of the nonparametric goodness-of-fit statistics for the RM indicated that person homogeneity cannot be assumed for the combined rate and probability word-problem scale with regard to the partitioning criterion “educational level.” A further analysis of the p-values for the individual items revealed that the violation of the assumption of person homogeneity regarding the partitioning criterion “educational level” can be attributed to differences in the difficulty of the probability word problems compared to the other algebra word problems. This means that probability word problems do not provide an “education-fair” measure of respondents' algebraic reasoning ability. Furthermore, the presence of item heterogeneity between rate and probability word problems within each of the two educational-level groups indicated qualitative differences in the processing of such items. This result underlines the importance of implementing a quality control framework in automated item generation, as first suggested by Arendasy (2004, 2005, in press). The results of Study I can thus be seen as a validation of one of the constraints of the quality control framework built into AGen.

In Study II a set of k = 15 algebra word problems, which can be classified as rate word problems, was automatically generated using a set of radicals and a set of constraints derived from the current literature on algebra word-problem solving. Both the results of the MRM-based analyses and the Martin-Löf statistics indicated that the quality control framework seems to be rather effective in terms of automatically generating algebra word problems at a satisfactory psychometric level.

In order to investigate whether the implementation of the quality control framework affects the construct validity of the automatically generated items by restricting the possible range of algebra word-problem types to the rate word-problem family, a third study was conducted. The goal was to investigate the nomothetic span of the products of AGen according to the modified Gf-Gc theory and the latest version of the three-stratum theory of intelligence. Using a hierarchical confirmatory factor analysis, the results of Study III support the assumption that the automatically generated algebra word problems load on Gq when administered together with another Gq marker. The markers of the stratum-two factors Gf, Gc, Gv, Gltm, and Gstm provide evidence for the discriminant validity of the items. It was thus concluded that the implementation of a quality control framework does not necessarily reduce the construct validity of the automatically generated algebra word problems.

Taken together, the three pilot studies reported in this article highlight the importance of implementing a quality control framework in automated item generation, as first suggested by Arendasy (2004, 2005, in press).


However, the authors acknowledge that further studies need to be carried out in order to (1) examine further aspects of the construct representation of the automatically generated algebra word problems, and (2) provide further empirical support for the necessity of other possible constraints in AGen. From a practical point of view, a schema-based approach to automated item generation based on a theoretically sound and empirically validated quality control framework would have several advantages: (1) The use of templates incorporating a unique set of radicals and constraints for the automatic generation of items helps to avoid misinterpretations in item writing in cases where a human item writer has to produce items above her or his ability level, even when applying a set of generative rules. (2) The theory-based definition of radicals and constraints enables a theoretically more sound interpretation of test scores, since the latent trait to be measured by the items is well defined by a generative framework. (3) The ability to generate new items on demand helps to increase test security, which is especially important for mass testing (e.g., college and university entrance exams).

References

Ackerman, P.L. (1989). Within-task intercorrelations of skill performance: Implications for predicting individual differences? Journal of Applied Psychology, 74, 360–364.
Andersen, E.B. (1973). A goodness of fit test for the Rasch model. Psychometrika, 38, 123–140.
Arbuckle, J.L. (2003). Amos 5.0 update to the Amos user's guide. Chicago: Smallwaters Corporation.
Arendasy, M. (2002). TestWeb – Ein webbasiertes Testsystem [TestWeb – A web-based system for psychological assessment]. Online: http://131.130.64.42/diffpsylabor/start.aspx (retrieved Feb. 23, 2003).
Arendasy, M. (2004). AGen: Ein Itemgenerator zur Herstellung algebraischer Textaufgaben [AGen – An item generator for the construction of algebra word problems]. Wien: Eigenverlag.
Arendasy, M. (2005). Automatic generation of Rasch-calibrated items: Figural Matrices Test GEOM and Endless Loops Test EC. International Journal of Testing, 5, 197–224.
Arendasy, M. (in press). Automatisierte Itemgenerierung und psychometrische Qualitätssicherung am Beispiel des Matrizentests GEOM [Automatic item generation and psychometric quality control demonstrated using the figural matrices test GEOM]. Frankfurt: Peter Lang Academic Publishers.
Arendasy, M., & Gittler, G. (2003). IRT-basierter Vergleich zweier Varianten automationsgestützt erstellter Matrizentestaufgaben [IRT-based comparison of two automatically generated figural matrices variants]. Zeitschrift für Differentielle und Diagnostische Psychologie, 24, 261–274.
Arendasy, M., & Sommer, M. (2005). The effect of different types of perceptual manipulations on the dimensionality of automatically generated figural matrices. Intelligence, 33, 307–324.
Arendasy, M., Sommer, M., & Ponocny, I. (2005). Psychometric approaches help resolve competing cognitive models: When less is more than it seems. Cognition and Instruction, 23, 503–521.
Baddeley, A.D. (1992). Is working memory working? Quarterly Journal of Experimental Psychology, 44A(1), 1–31.
Bejar, I.I. (2002). Generative testing: From conception to implementation. In S.H. Irvine & P.C. Kyllonen (Eds.), Item generation for test development (pp. 199–218). Hillsdale, NJ: Erlbaum.
Bernardo, A.B.I. (1994). Problem-specific information and the development of problem-type schemata. Journal of Experimental Psychology: Learning, Memory, and Cognition, 20, 379–395.
Bickley, P.G., Keith, T.Z., & Wolfle, L.M. (1995). The three-stratum theory of cognitive abilities: Test of the structure of intelligence across the life span. Intelligence, 20, 309–328.
Blessing, S.B., & Ross, B.H. (1996). Content effects in problem categorization and problem solving. Journal of Experimental Psychology: Learning, Memory, and Cognition, 22, 792–810.
Bobrow, D.G. (1968). Natural language input for a computer problem-solving system. In M. Minsky (Ed.), Semantic information processing (pp. 133–215). Cambridge, MA: MIT Press.
Browne, M.W., & Cudeck, R. (1993). Alternative ways of assessing model fit. In K.A. Bollen & J.S. Long (Eds.), Testing structural equation models (pp. 445–455). Newbury Park, CA: Sage.
Byrne, B.M. (2001). Structural equation modeling with AMOS: Basic concepts, applications, and programming. London: Erlbaum.
Carroll, J.B. (1993). Human cognitive abilities: A survey of factor-analytic studies. Cambridge, UK: Cambridge University Press.
Dennis, I., Handley, S., Bradon, P., Evans, J., & Newstead, S. (2002). Approaches to modeling item-generative tests. In S.H. Irvine & P.C. Kyllonen (Eds.), Item generation for test development (pp. 53–72). Hillsdale, NJ: Erlbaum.
Edelblut, P., Elliot, S., Mikulas, C., & Bosley, L. (2002, June). Internet test security lessons learned. Paper presented at the International Conference on Computer-Based Testing and the Internet: Building Guidelines for Best Practice, Winchester, UK.
Embretson, S.E. (1983). Construct validity: Construct representation versus nomothetic span. Psychological Bulletin, 93, 179–197.
Embretson, S.E., & Reise, S.P. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum.
Enright, M.K., Morley, M., & Sheehan, K.M. (2002). Items by design: The impact of systematic feature variation on item statistical characteristics. Applied Measurement in Education, 15(1), 49–74.
Fischer, G.H., & Molenaar, I.W. (1995). Rasch models: Foundations, recent developments, and applications. New York: Springer.
Glas, C.A.W., & Verhelst, N.D. (1995). Tests of fit for polytomous Rasch models. In G.H. Fischer & I.W. Molenaar (Eds.), Rasch models: Foundations, recent developments, and applications (pp. 325–352). New York: Springer.
Greeno, J.G., Smith, D.R., & Moore, J.L. (1993). Transfer of situated learning. In D.K. Detterman & R.J. Sternberg (Eds.), Transfer on trial: Intelligence, cognition, and instruction (pp. 99–167). Norwood, NJ: Ablex.
Gustafsson, J.E. (1984). A unifying model of the structure of intellectual abilities. Intelligence, 8, 179–203.
Hall, R., Kibler, D., Wenger, E., & Truxaw, C. (1989). Exploring the episodic structure of algebra story problem solving. Cognition and Instruction, 6, 223–283.
Harris, W.G., & Jackson, D.N. (2002, June). Security concerns for e-testing. Paper presented at the International Conference on Computer-Based Testing and the Internet: Building Guidelines for Best Practice, Winchester, UK.
Hinsley, D.A., Hayes, J.R., & Simon, H.A. (1977). From words to equations: Meaning and representation in algebra word problems. In M.A. Just & P.A. Carpenter (Eds.), Cognitive processes in comprehension (pp. 89–105). Hillsdale, NJ: Erlbaum.
Holland, P.W., & Wainer, H. (1993). Differential item functioning. Hillsdale, NJ: Erlbaum.
Horn, J.L. (1989). Models for intelligence. In R. Linn (Ed.), Intelligence: Measurement, theory, and public policy (pp. 29–73). Urbana, IL: University of Illinois Press.
Horn, J.L., & Noll, J. (1997). Human cognitive capabilities: Gf-Gc theory. In D.P. Flanagan, J.L. Genshaft, & P.L. Harrison (Eds.), Contemporary intellectual assessment: Theories, tests, and issues (pp. 49–91). New York: Guilford.
Hornke, L.F. (2002). Item-generative models for higher order cognitive functions. In S.H. Irvine & P.C. Kyllonen (Eds.), Item generation for test development (pp. 159–178). Hillsdale, NJ: Erlbaum.
Hornke, L.F., Arendasy, M., Sommer, M., Häusler, J., Wagner-Menghin, M., Gittler, G., Bognar, B., & Wenzl, M. (2004). Intelligenz-Struktur-Batterie (INSBAT): Eine Testbatterie zur Messung von Intelligenz [Intelligence Structure Battery (INSBAT): A test battery for the measurement of intelligence]. Mödling: Dr. Gernot Schuhfried GmbH.
Hu, L.T., & Bentler, P.M. (1999). Cutoff criteria for fit indices in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling: A Multidisciplinary Journal, 6, 1–55.
Irvine, S.H. (2002). The foundations of item generation for mass testing. In S.H. Irvine & P.C. Kyllonen (Eds.), Item generation for test development (pp. 3–34). Mahwah, NJ: Erlbaum.
Irvine, S.H., & Kyllonen, P.C. (Eds.). (2002). Item generation for test development. Mahwah, NJ: Erlbaum.
Katz, I.R., Friedman, D.E., Bennett, R.E., & Berger, A.E. (1996). Differences in strategies used to solve stem-equivalent constructed-response and multiple-choice SAT-Mathematics items (College Board Report No. 95-3). New York: College Entrance Examination Board.
Kintsch, W. (1998). Comprehension: A paradigm for cognition. Cambridge, UK: Cambridge University Press.
Kline, R.B. (1998). Principles and practice of structural equation modeling. New York: Guilford.
Koedinger, K.R., & Nathan, M.J. (2004). The real story behind story problems: Effects of representations on quantitative reasoning. Journal of the Learning Sciences, 13, 129–164.
Lane, S. (1991). Use of restricted item response models for examining item difficulty ordering and slope uniformity. Journal of Educational Measurement, 28, 295–309.
Lepik, M. (1990). Algebraic word problems: Role of linguistic and structural variables. Educational Studies in Mathematics, 21, 83–90.
MacCallum, R.C., Browne, M.W., & Sugawara, H.M. (1996). Power analysis and determination of sample size for covariance structure modeling. Psychological Methods, 1, 130–149.
Marsh, H.W., Hau, K.T., & Wen, Z. (2004). In search of golden rules: Comment on hypothesis-testing approaches to setting cutoff values for fit indexes and dangers in overgeneralizing Hu and Bentler's (1999) findings. Structural Equation Modeling, 11, 320–341.
Martin-Löf, P. (1973). Statistiska modeller: Anteckningar från seminarier läsåret 1969–1970 utarbetade av Rolf Sundberg. Obetydligt ändrat nytryck, oktober 1973 [Statistical models: Notes from seminars held 1969–1970 by Rolf Sundberg. Slightly revised reprint, October 1973]. Stockholm: Institutet för försäkringsmatematik och matematisk statistik vid Stockholms universitet.
Mayer, R.E. (1981). Frequency norms and structural analysis of algebra story problems into families, categories, and templates. Instructional Science, 10, 135–175.
Mayer, R.E. (1982). Memory for algebra story problems. Journal of Educational Psychology, 74, 199–216.
Mayer, R.E., Larkin, J.H., & Kadane, J.B. (1984). A cognitive analysis of mathematical problem solving ability. In R.J. Sternberg (Ed.), Advances in the psychology of human intelligence (pp. 231–273). Hillsdale, NJ: Erlbaum.
McGrew, K.S., & Woodcock, R.W. (2001). Technical manual: Woodcock-Johnson III. Itasca, IL: Riverside Publishing.
Müller-Philipp, S., & Tarnai, C. (1989). Signifikanz und Relevanz von Modellabweichungen beim Rasch-Modell [Significance and relevance of model deviations in the Rasch model]. In K.D. Kubinger (Ed.), Moderne Testtheorie [Modern test theory] (pp. 239–258). Weinheim: Psychologie Verlags Union.
Nathan, M.J., Kintsch, W., & Young, E. (1992). A theory of algebra-word-problem comprehension and its implications for the design of learning environments. Cognition and Instruction, 9, 329–389.
Nhouyvanisong, A. (1999). Enhancing mathematical competence and understanding: Using open-ended problems and informal strategies. Doctoral dissertation, Psychology Department, Carnegie Mellon University.
Paige, S.J., & Simon, H.A. (1966). Cognitive processes in solving algebra word problems. In B. Kleinmuntz (Ed.), Problem solving (pp. 51–119). New York: Wiley.
Phelps, L., McGrew, K., Knopik, S., & Ford, L. (2005). The general (g), broad, and narrow CHC stratum characteristics of the WJ-III and WISC-III tests: A confirmatory cross-battery investigation. School Psychology Quarterly, 20, 58–66.
Ponocny, I. (2001). Nonparametric goodness-of-fit tests for the Rasch model. Psychometrika, 66, 437–460.
Ponocny, I., & Ponocny-Seliger, E. (1999). T-RASCH 1.0. Groningen: ProGAMMA.
Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago: The University of Chicago Press. (Original work published 1960)
Reed, S.K. (1987). A structure-mapping model for word problems. Journal of Experimental Psychology: Learning, Memory, and Cognition, 13, 124–139.
Reusser, K. (1989). Vom Text zur Situation zur Gleichung: Kognitive Simulation von Sprachverständnis und Mathematisierung beim Lösen von Textaufgaben [From text to situation to equation: Cognitive simulation of language comprehension and mathematization in solving word problems]. Habilitationsschrift, Universität Bern, Switzerland.
MacCallum, R.C., Browne, M.W., & Sugawara, H.M. (1996). Power analysis and determination of sample size for covariance structure modeling. Psychological Methods, 1, 130–149.
Marsh, H.W., Hau, K.T., & Wen, Z. (2004). In search of golden rules: Comment on hypothesis-testing approaches to setting cutoff values for fit indexes and dangers in overgeneralizing Hu and Bentler’s (1999) findings. Structural Equation Modeling, 11, 320–341.
Martin-Löf, P. (1973). Statistiska Modeller. Anteckningar från seminarier läsåret 1969–1970 utarbetade av Rolf Sundberg. Obetydligt ändrat nytryck, oktober 1973 [Statistical models: Notes from the seminars 1969–1970, prepared by Rolf Sundberg; slightly altered reprint, October 1973]. Stockholm: Institutet för försäkringsmatematik och matematisk statistik vid Stockholms universitet.
Mayer, R.E. (1981). Frequency norms and structural analysis of algebra story problems into families, categories, and templates. Instructional Science, 10, 135–175.
Mayer, R.E. (1982). Memory for algebra story problems. Journal of Educational Psychology, 74, 199–216.
Mayer, R.E., Larkin, J.H., & Kadane, J.B. (1984). A cognitive analysis of mathematical problem solving ability. In R.J. Sternberg (Ed.), Advances in the psychology of human intelligence (pp. 231–273). Hillsdale, NJ: Erlbaum.
McGrew, K.S., & Woodcock, R.W. (2001). Technical manual. Woodcock-Johnson III. Itasca, IL: Riverside Publishing.
Müller-Philipp, S., & Tarnai, C. (1989). Signifikanz und Relevanz von Modellabweichungen beim Rasch-Modell [Significance and relevance of model deviations in the Rasch model]. In K.D. Kubinger (Ed.), Moderne Testtheorie [Modern test theory] (pp. 239–258). Weinheim: Psychologie Verlags Union.
Nathan, M.J., Kintsch, W., & Young, E. (1992). A theory of algebra-word-problem comprehension and its implications for the design of learning environments. Cognition and Instruction, 9, 329–389.
Nhouyvanisvong, A. (1999). Enhancing mathematical competence and understanding: Using open-ended problems and informal strategies. Unpublished doctoral dissertation, Department of Psychology, Carnegie Mellon University.
Paige, J.M., & Simon, H.A. (1966). Cognitive processes in solving algebra word problems. In B. Kleinmuntz (Ed.), Problem solving (pp. 51–119). New York: Wiley.
Phelps, L., McGrew, K., Knopik, S., & Ford, L. (2005). The general (g), broad, and narrow CHC stratum characteristics of the WJ-III and WISC-III tests: A confirmatory cross-battery investigation. School Psychology Quarterly, 20, 58–66.
Ponocny, I. (2001). Nonparametric goodness-of-fit tests for the Rasch model. Psychometrika, 66, 437–460.
Ponocny, I., & Ponocny-Seliger, E. (1999). T-RASCH 1.0. Groningen: ProGAMMA.
Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago: The University of Chicago Press. (Original work published 1960)
Reed, S.K. (1987). A structure-mapping model for word problems. Journal of Experimental Psychology: Learning, Memory, and Cognition, 13, 124–139.
Reusser, K. (1989). Vom Text zur Situation zur Gleichung. Kognitive Simulation von Sprachverständnis und Mathematisierung beim Lösen von Textaufgaben [From text to situation to equation: Cognitive simulation of language comprehension and mathematization in solving word problems]. Habilitationsschrift, Universität Bern, Switzerland.
Reusser, K., & Stebler, R. (1997). Every word problem has a solution – The social rationality of mathematical modelling in schools. Learning and Instruction, 7, 309–327.
Roid, G.H. (2003). Stanford-Binet intelligence scales (5th ed.). Itasca, IL: Riverside.
Rost, J. (2004). Lehrbuch Testtheorie – Testkonstruktion [Textbook of test theory and test construction] (2nd ed.). Bern: Huber.
Schuler, H. (2000). Personalauswahl im europäischen Vergleich [Personnel selection: A European comparison]. In E. Regnet & M.L. Hoffmann (Eds.), Personalmanagement in Europa [Personnel management in Europe]. Göttingen: Hogrefe.
Sebrechts, M.M., Enright, M., Bennett, R.E., & Martin, K. (1996). Using algebra word problems to assess quantitative ability: Attributes, strategies, and errors. Cognition and Instruction, 14, 285–343.
Shalin, V., & Bee, N.V. (1985). Analysis of the semantic structure of a domain of word problems (Tech. Rep. No. APS-20). Pittsburgh: University of Pittsburgh, Learning Research and Development Center.
Siegler, R.S. (1996). Emerging minds: The process of change in children’s thinking. New York: Oxford University Press.
Spada, H., & Scheiblechner, H. (1973). Stichprobenunabhängige Denkmodelle. Eine Analyse von Denkfehlern beim syllogistischen Schlussfolgern [Sample-independent models of thinking: An analysis of errors in syllogistic reasoning]. In G. Reinert (Ed.), Bericht über den 29. Kongress der Deutschen Gesellschaft für Psychologie, Kiel 1970 [Report of the 29th Convention of the German Society of Psychology]. Göttingen: Hogrefe.
Srp, G. (1993). Syllogismen als Aufgaben für einen computerisierten adaptiven Test [Syllogism tasks for a computerized adaptive test]. Unpublished diploma thesis, University of Vienna.
Staub, F.C., & Reusser, K. (1995). The role of presentational structure in understanding and solving mathematical word problems. In C.A. Weaver, S. Mannes, & C.R. Fletcher (Eds.), Discourse comprehension: Essays in honor of Walter Kintsch (pp. 285–305). Hillsdale, NJ: Erlbaum.
Stocking, M.L., & Lewis, C. (2000). Methods of controlling the exposure of items in CAT. In W.J. Van der Linden & C.A.W. Glas (Eds.), Computerized adaptive testing: Theory and practice. Boston: Kluwer.
Sympson, J.B., & Hetter, R.D. (1985, October). Controlling item exposure rates in computerized adaptive testing. Paper presented at the annual meeting of the Military Testing Association. San Diego, CA: Navy Personnel Research and Development Center.
Tabachneck, H.J.M., Koedinger, K.R., & Nathan, M.J. (1994). Toward a theoretical account of strategy use and sense making in mathematics problem solving. In Proceedings of the 16th Annual Conference of the Cognitive Science Society (pp. 836–841). Hillsdale, NJ: Erlbaum.
Tirre, W.C., & Field, K.A. (2002). Structural models of abilities measured by the Ball Aptitude Battery. Educational and Psychological Measurement, 62, 830–856.
Tirre, W.C., & Pena, C.M. (1993). Components of quantitative reasoning: General and group ability factors. Intelligence, 17, 501–521.
Undheim, J.O., & Gustafsson, J.E. (1987). The hierarchical organization of cognitive abilities: Restoring general intelligence through the use of linear structural relations (LISREL). Multivariate Behavioral Research, 22, 149–171.
Van der Linden, W.J., & Glas, C.A.W. (Eds.). (2000). Computerized adaptive testing: Theory and practice. Boston: Kluwer.
Van der Linden, W.J., & Hambleton, R.K. (Eds.). (1997). Handbook of modern item response theory. New York: Springer.
van Dijk, T.A., & Kintsch, W. (1983). Strategies of discourse comprehension. New York: Academic Press.
von Davier, M. (1996). Methoden zur Prüfung probabilistischer Testmodelle [Methods for testing probabilistic test models]. Kiel: IPN.
Wagner-Menghin, M. (2004). Manual Lexikon-Wissen Test (LEWITE) [Manual Lexical Knowledge Test (LEWITE)]. Mödling: Dr. G. Schuhfried GmbH.
Wainer, H. (2000). Computerized adaptive testing: A primer (2nd ed.). Hillsdale, NJ: Erlbaum.
Yerushalmy, M., & Gilead, S. (1999). Structures of constant rate word problems: A functional approach analysis. Educational Studies in Mathematics, 39(1–3), 185–203.

Martin Arendasy
Faculty of Psychology
University of Vienna
Liebiggasse 5
A-1010 Vienna
Austria
Tel. +43 1 4277 47848
E-mail [email protected]
