Chapter 12
Computer Based Tests: Alternatives for Test and Item Design
Joachim Wirth

Computer-based tests are becoming increasingly important in various fields of competence assessment, primarily for the following three reasons: (1) New test and item designs which use multiple media and interactive simulations (Drasgow, 2002) are considered by many researchers to be more authentic and therefore more valid than conventional test and item designs. Furthermore, new tests and items can be constructed to collect kinds of data (such as auditory or process information) that cannot be collected with conventional test formats. These data can be collected unobtrusively and with a high degree of accuracy, and they can provide the basis for new competence measures. Moreover, the opportunity to create new measures may open the gates to new areas of research and lead to new definitions or re-characterizations of conventional constructs (Hadwin, Winne, & Nesbit, 2005). (2) Adaptive testing can be carried out by using the computer to gather and analyze data online. Such analyses can lead to the selection and presentation of items that are of an appropriate difficulty level and that maximize information about the ability level of an examinee (Lord, 1980; see also Eggen, 2008, Chapter 10 in this book). As a result, adaptive tests can be psychometrically more efficient and, therefore, less time-consuming than conventional tests (Folk & Smith, 2002). (3) Economic testing is especially important when examining large samples. The internet can be a very economical tool for delivering tests or for reporting test results to a vast number of people, independent of the location of the examinees or the time of day at which the test is administered (e.g., Educational Testing Service, 2005; Fleischer, Pallack, Wirth, & Leutner, 2005; Groot, de Sonneville, & Stins, 2004). Furthermore, internet-based item banks can be shared by many researchers and test deliverers to create, manage and deliver electronic competence assessments without having to develop each item individually (e.g., Plichart, Jadoul, Vandenabeele, & Latour, 2004).

The focus of this chapter is on new test and item designs; adaptive and economic testing are discussed by Eggen (2008, Chapter 10 in this book). New test and item designs will be considered from four different perspectives: (1) the definition and selection of an examinee’s competencies to be assessed; (2) the format of the items and tests presented to an examinee; (3) the data collected from an examinee; and (4) the measures and procedures used to analyze the data collected from an examinee. From each perspective, advantages of new test and item designs will be discussed as well as problems that emerge from using them.

Competencies

Computer technology and its use affect educational research in two ways. First, computers provide the opportunity to develop new types of learning environments and new assessment methods that can be used in research on traditional concepts or on new aspects of those concepts. Second, computer use and learning with new media have established new and independent fields of educational research, leading to the definition of new constructs and the development of new tests assessing these new constructs. Thus, computer technology provides the opportunity to operationalize conventional constructs in new ways or to develop new constructs, raising the prospect of being able to re-characterize conventional constructs or to define new constructs and new areas of research (Hadwin et al., 2005).

One of the most prominent examples of re-characterizing a conventional construct is the work on complex problem solving undertaken by Dörner and his group in Germany (Dörner, Kreuzig, Reither, & Stäudel, 1983; Dörner & Preußler, 1990; Dörner, Schaub, & Strohschneider, 1999). Dörner created a computer-simulated town called “Lohhausen”. Subjects were appointed mayor of Lohhausen and instructed to govern the town. The simulation included approximately 2000 variables, each of which was somehow connected to the others. Variables changed their values either as an effect of a mayor’s intervention and/or as a function of time. Dörner and his colleagues were the first to use the computer to simulate such a highly complex and dynamic system. Their (and related) work had a strong impact on research on problem solving. Complex problem solving as the competence required to learn how to control a complex and dynamic system became a new construct in cognitive psychology (Frensch & Funke, 1995), and the distinction between knowledge acquisition and knowledge application became prominent in definitions of problem solving (Funke, 1985). Because low or even negative correlations were found between complex problem-solving performance and intelligence (e.g., Putz-Osterloh, 1981), even research on definitions and measures of intelligence was highly influenced by this and related work (e.g., Kröner, 2001; Leutner, 2002; Süß, 1996, 1999).

Nowadays, computer-based tests are indispensable tools for the assessment of problem-solving competencies, even in large-scale assessments (Baker & O’Neil, 2002; Klieme, Leutner, & Wirth, 2005; Wirth & Klieme, 2003). They are also used in the assessment of tacit knowledge about procedures and strategies that cannot easily be verbalized and therefore is very difficult to assess using conventional paper-based tests (Berry & Broadbent, 1995; Buchner, Funke, & Berry, 1995; Krauss et al., 2004).

Complex and dynamic simulations are not the only computer-based tests that have contributed to (re-)definitions of constructs. The use of multiple media to present information has led to growing research on a new construct called “multimedia learning” (Leutner & Brünken, 2000; Mayer, 2001; Niegemann, Leutner, & Brünken, 2004; Wittrock, 1989), and various tests have been developed to assess if and how students use different media to select, organize and integrate information from different modes into a coherent mental model (e.g., Brünken, Seufert, & Zander, 2005; Leutner & Plass, 1998; Plass, Chun, Mayer, & Leutner, 1998). Furthermore, the opportunity to easily include videos and animations in newly developed computer-based learning environments has strongly supported current research on the question of whether and under which conditions animations facilitate learning (Höffler & Leutner, 2007; Tversky, Morrison, & Betrancourt, 2002). The use of animation as part of multimedia learning has become an important new field of research in educational psychology, fostered by new item and test designs in computer-based learning environments and tests.

Reading competence is another example where computer-based tests provide the opportunity to examine and define new aspects of a construct. As learning with hypertexts has become a growing new area of research on reading (cf. Bannert, 2003; Brünken, Müller-Kalthoff, & Möller, 2005), comparing the competencies and strategies used when reading linear versus non-linear texts has drawn particular attention from researchers (e.g., Richter, Naumann, Brunner, & Christmann, 2005). The ability to efficiently navigate through non-linear texts has been established as a new aspect of reading competence, and several researchers have developed logfile-based measures to assess this new aspect (e.g., Barab, Bowdish, & Lawless, 1997; Flender & Naumann, 2002; Lawless & Brown, 1997; Richter, Naumann, & Noller, 2003).

Computer technology not only leads to the redefining of constructs, it also makes feasible the assessment of traditionally described competencies in settings where these competencies were previously not assessable economically. For example, although speaking is at the heart of language competence, the Test of English as a Foreign Language (TOEFL) formerly assessed only reading, writing, and listening (Educational Testing Service, 2005). When digital recording could be administered at ETS’s test centers, a speaking component was added to the new Internet-based test version, the TOEFL iBT. Examinees’ responses are digitally recorded and transmitted via the Internet to ETS’s “Online Scoring Network”, where trained raters score them. Using this Internet-based technique enables ETS to collect data from a vast number of people and score them objectively and economically.

Although new item and test designs of computer-based tests offer great opportunities to advance educational research, there are some pitfalls when using computer-based tests in addition to, or as alternatives to, conventional tests. Performance on computer-based tests can differ from performance on paper-based tests even if the same items are presented, because other or additional abilities are required and therefore tested (Alderson, 2000; Jurecka, 2008, Chapter 9 in this book; Sawaki, 2001). Merely augmenting an item with information from different modes and presenting it on a computer screen does not automatically make it more valid, or make it cover more aspects of the competence to be assessed, than a paper-based item. For example, adding context visuals (photos of a speaker or a setting) to listening comprehension items on the computer-based version of the TOEFL had almost no effect on examinees’ performances. However, adding content visuals (photos, diagrams and/or drawings related to the content of the audio portion of the item) did contribute to better comprehension (Ginther, 2002), which is in line with the theory behind multimedia learning (Mayer, 2001).

In summary, new designs in computer-based tests and items have the potential either to enhance educational research on traditional constructs by assessing additional components of the construct, or to provide new measures for areas that emerge as a result of the increasing impact of computers on learning. However, using new computer-based item and test designs instead of traditional tests does not automatically lead to an improved measure (Mislevy, 1996) that covers more or new aspects of a competence. As is true for all new items and tests, the quality of computer-based items and tests has to be evaluated carefully. The more complex and dynamic a test is, and the more different media are included, the more difficult it is to evaluate which aspects of a competence are covered and how reliable the test is.
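
To give a concrete, if deliberately simplified, impression of what such logfile-based navigation measures can look like, the following Python sketch derives two indicators from a hypothetical visit log of a non-linear text: the proportion of page calls that are revisits and the mean reading time per node. The node names, the log format and the indicators are invented for illustration only; they are not the measures used in the studies cited above.

    # Hypothetical visit log of a non-linear text: (seconds since start, node id).
    visits = [
        (0.0, "intro"), (14.2, "causes"), (35.8, "intro"),
        (41.0, "effects"), (70.3, "causes"), (92.6, "summary"),
    ]

    def navigation_indicators(visits):
        nodes = [node for _, node in visits]
        total = len(nodes)
        distinct = len(set(nodes))
        revisit_ratio = (total - distinct) / total          # share of page calls that are revisits
        times = [t for t, _ in visits]
        dwell = [b - a for a, b in zip(times, times[1:])]   # reading time until the next page call
        mean_dwell = sum(dwell) / len(dwell) if dwell else 0.0
        return {"page_calls": total, "distinct_nodes": distinct,
                "revisit_ratio": round(revisit_ratio, 2),
                "mean_dwell_seconds": round(mean_dwell, 1)}

    print(navigation_indicators(visits))

Indicators of this kind would, of course, still have to meet the criteria of reliability and validity discussed later in this chapter.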

Formats

Probably the most significant difference between conventional and computer-based items and tests is the use of multiple media and dynamic stimuli such as audio, video, or animations. The idea of using audio and video for testing is neither new nor restricted to computer-based tests. For example, the US Army Air Force Aviation Psychology Program used film-based assessments during World War II for measuring aptitudes involved in motion or distance perception (cf. Siebert & Snow, 1965), and audio has long been part of the paper-based TOEFL, which assesses listening comprehension (Educational Testing Service, 2005). Recent advances in computer technology make the development and delivery of high-quality audio- and video-based tests inexpensive for the first time. Thus, audio- and video-based tests can nowadays be found in nearly all fields of competence assessment including, for example, medicine, music, history, physical education, personnel psychology and teacher education (e.g., Ackermann, Evans, Park, Tamassia, & Turner, 1999; Bennett et al., 1999; Krauss et al., 2004; Olson-Buchanan et al., 1998; Vispoel, 1999).

Computer-based multimedia tests differ from conventional tests in which test administrators use VCRs or CD players to provide multimedia stimuli (often even in group test sessions, e.g., Educational Testing Service, 1997) in terms of test-fairness and individualization. Computer-based tests assure test-fairness insofar as each examinee sits in front of his or her own computer screen and uses his or her own headphones; thus, all examinees watch and listen to the stimuli under comparable conditions. External test conditions, such as sitting in the last row, do not affect the quality of stimuli perception. Many computer-based tests also allow the stimuli presentation to be individualized. For example, each examinee is able to set the volume of the audio presentation to his or her individual optimal level and can also stop or repeat the presentation of an audio or video or move forward or backward or in slow motion through it (e.g., Bennett et al., 1999; Krauss et al., 2004). Some tests presenting pictures allow examinees to zoom in and out or rotate the objects presented (e.g., Ackermann et al., 1999) and thus allow for a much more detailed exploration of the stimuli. People may differ in their preferred mode of information processing. For example, some better understand second-language texts when the definitions of unknown words are available verbally. Others perform better when unknown words are explained by visual definitions like pictures or video clips (Plass et al., 1998). Therefore, test-fairness could be further improved if examinees were able to choose from among different modes of information presentation.

Multiple media and dynamic stimuli are used in computer-based tests not only to improve test-fairness and individualization but also to improve the authenticity of items, bringing test situations more in line with real situations. Improving authenticity can be one way to take into account that competencies are always related to the context and the domain in which a competence is shown. Simulating, as authentically as possible, the key features of the real situation in which a competence is usually shown improves the construct validity of the test.

Complexity and interactivity are other means of developing more realistic items and tests. The computer simulation “Lohhausen” previously mentioned in this chapter (Dörner et al., 1983) was developed as a result of this desire to create a realistic test situation in which real-life cognition can be assessed. Dörner criticized conventional tasks used in experimental research on problem solving or in traditional intelligence tests as being artificial because they do not assess the cognition and competences required in real-life problem solving. Of course, Dörner’s highly complex, dynamic and interactive simulation of a town holds much more ecological validity than tasks such as “Please continue: 1, 2, 4, 7, 11, …”. Complex, dynamic and interactive simulations are particularly popular in vocational assessments (Strauß & Kleinmann, 1995; Streufert, Pogash, & Piasecki, 1988; Wagener, 2001). For example, the “Primum Computer Case Simulation” (Clyman, Melnick, & Clauser, 1995; cf. Luecht & Clauser, 2002) presents a patient’s case to a physician, and the physician has to decide on diagnostic tests, treatments, and therapies (see also Gräsel, 1997). There are various user interfaces for running tests, entering orders and initiating actions. Some events, such as the patient suddenly having trouble breathing, occur as ordinary components of the case, and the examinee has to react appropriately. In addition, time simulation exerts pressure on the examinee. The test’s purpose is to assess clinical patient-management proficiency. Therefore, highly complicated cases are presented in an interactive problem-solving situation in an attempt to reflect real-life situations as authentically as possible.

Furthermore, computer-based items have the potential to go beyond being realistic. Animations can allow examinees to visualize processes that are difficult to perceive or not observable at all in real life, for example, metabolic processes (Nerdel, 2003). Animations can also simulate experiments and reactions that are too dangerous to conduct in reality (Mikelskis, 1997; Prenzel, von Davier, Bleschke, Senkbeil, & Urhahne, 2000). The latest technology even allows subjects to augment their real-life observations with further information while they walk around (Knight et al., 2005). Until now, such technology has been used for instruction and training in learning environments, but it is just a question of time until it is also used for assessment.

In summary, new designs of computer-based tests often use multiple media to present information in different complex, dynamic and interactive modes. Multimedia, complexity, dynamism and interactivity are considered good features of more authentic and realistic, and therefore more valid, tests, although differences in performance between real and computer-simulated test situations may remain (Shavelson, Baxter, & Gao, 1993). Increased authenticity leads to increased ecological validity of a test, which in turn may lead to higher acceptance of the test by both test deliverers and examinees. However, authenticity does not automatically lead to higher construct validity. “Adding more realism to test items does not automatically lead to valid measures. […] Any new feature added to a test that is not essential to the variable the test is intended to measure is a potential threat to [construct] validity” (van der Linden, 2002, p. 93). This means that the more complex and dynamic a test is, and the more different media are used to present the information, the more difficult it is to ensure that the test situation reflects only key aspects of the context and domain that are part of the definition of the competence, and that test performance reflects only the level of the competence the test is intended to measure. Thus, when designing new computer-based items and tests there is always a trade-off between ecological validity (and acceptance by test deliverers and examinees) and construct validity.

The trade-off between task complexity and scoring simplicity is another issue to be considered when designing computer-based items and tests (Luecht & Clauser, 2002). Complex tasks lead to complex structures of the collected data, and scoring becomes a complex and laborious task. Conversely, restricting task complexity to ensure economic scoring can lead to artificial and oversimplified tasks. It is important to evaluate task complexity and authenticity within the context of the test’s purpose to ensure that the task provides appropriate and valid information for that purpose.
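
To convey, in a deliberately tiny form, the logic of such complex, dynamic and interactive item formats, the following sketch simulates a system whose state variables change at every time step as a function of one another and of the examinee’s intervention, while every intervention is logged. The variable names, coefficients and interventions are invented for illustration; real simulations such as Lohhausen or the Primum case simulation are far richer.

    # A deliberately tiny dynamic, interactive "item": two coupled state
    # variables change per time step as a function of each other and of the
    # examinee's intervention. All names and numbers are invented.
    def step(state, intervention):
        water, algae = state["water_quality"], state["algae"]
        new_algae = max(algae + 0.3 * algae - 0.5 * intervention, 0.0)   # grows unless treated
        new_water = water - 0.2 * algae + 0.1 * intervention             # suffers from the algae
        return {"water_quality": round(new_water, 2), "algae": round(new_algae, 2)}

    state = {"water_quality": 50.0, "algae": 10.0}
    log = []                                     # every intervention is recorded per time step
    for t, intervention in enumerate([0.0, 2.0, 4.0, 4.0, 1.0]):   # the examinee's inputs
        state = step(state, intervention)
        log.append((t, intervention, dict(state)))

    for entry in log:
        print(entry)

Even in this toy form, the logged sequence of interventions and resulting states is exactly the kind of process data discussed in the following sections.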

Data

Most conventional test formats (except performance tests) collect their data using paper and pencil. In contrast, computer-based items and tests provide examinees with a variety of different devices with which to enter their responses into the computer system, which records these data in a logfile or a database system. The most common devices are the mouse and the keyboard. Other tests, for example, flight simulators, also use joysticks, pedals or even real devices to collect performance data from the examinees (Harris & Khan, 2003). Rosenblum, Parush, and Weiss (2003a, 2003b) used touchscreen technology to examine the timing and nature of handwriting pauses. They were able to distinguish between proficient and poor handwriters by assessing whether a person, when handwriting, lifts the pen at strategic points or transitions in letters or words, or whether he or she lifts the pen more frequently and erratically. The microphone is another possible input device; for example, it may be used to collect verbal data for the assessment of competence in speaking a second language (Educational Testing Service, 2005). Another example is provided by Normann, Debus, Dörre, and Leutner (2004): They used a microphone to record metronome-based speaking of simple syllables in order to assess workload as a measure of competence for the performance of a procedural task (tram driving).

The appropriateness of an input device to collect data, of course, depends on the construct to be assessed. For example, handwriting or speaking competence cannot be expressed using a keyboard or a computer mouse. Furthermore, the kind of input device can also affect the validity of the data collected. For example, examinees with minimal experience using computers will have more difficulty using a keyboard or a computer mouse than examinees with extensive computer experience. Therefore, computer experience can have an impact on the data collected. Using another kind of device, for example a touchscreen, could perhaps inhibit this threat to validity.

A great advantage of using computer-based items and tests is that the computer captures all of the data it is programmed to capture, without exception. The resulting data structure is complete and includes no missing values. Another advantage is that the computer can record data unobtrusively and without interfering with specific cognitive processes (Hadwin & Winne, 2001; Winne, Jamieson-Noel, & Muis, 2002). Examinees dealing with, for example, hypertexts or computer simulations often are not aware that the computer records every single mouse click or keystroke into a logfile (e.g., Wirth, 2004; Wirth & Funke, 2005). Thus, these data are not biased, for example, by an examinee’s tendency to act in a socially desirable way.

Most computer-based tests collect data on the behavioral level rather than the reflective (self-report) level (Lompscher, 1994), and it can be argued that behavioral data are more valid and reliable than reflective data (provided the test is not intended to measure reflection). For example, Leutner and Plass (1998) compared the VV-BOS (a computer-based instrument used for direct observation of students’ preference for visual or verbal learning material in an authentic learning situation) with conventional questionnaire measures of the visual-verbal learning preference. While the VV-BOS proved to be a reliable predictor of learning outcome, conventional questionnaires correlated neither with the VV-BOS nor with learning outcome measures. Thus, direct computer-based observational data of the learning behavior turned out to be more reliable and more valid than reflective questionnaire data. Winne and Jamieson-Noel (2002) drew the same conclusion when comparing frequencies of study tactics that were carried out during learning and observed and recorded in a computer-based logfile with the frequencies of study tactics that students reported directly after studying. The correlation of observed and reported frequencies of the same tactics turned out to be low, and learners varied unpredictably in the degree to which they overestimated or underestimated their use of study tactics.

Another advantage of using computers for collecting data is that the data can be collected online. When examining learning processes and learning strategies, the computer can record traces of dynamic cognitive events at the time they occur during learning (Winne, 1982). Compared to other commonly used methods, such as assessing learning strategy use by collecting data after the learning process (e.g., MSLQ; Pintrich, Smith, Garcia, & McKeachie, 1991), these data are not biased by the effects of reminding, beliefs or social desirability (cf. Artelt, 2000; Jamieson-Noel & Winne, 2003; Wirth, 2004).

The recording of time can easily be included in computer-based items and tests, offering great avenues for innovative measures. For example, Brünken, Steinbacher, Plass, and Leutner (2002) developed a direct measure for cognitive load during learning that is based on the dual-task paradigm. In addition to learning a task that was presented on a computer screen, examinees had to observe a character and react as quickly as possible every time the character changed color. Response time proved to be a valid direct measure of the cognitive load induced by the learning task, a measure that prevailed over many of the shortcomings of other indirect and subjective assessment methods. Time recording can also be used to gather information about order and change. For example, Jamieson-Noel and Winne (2003; Winne & Jamieson-Noel, 2002) recorded studying events such as highlighting or scrolling through a text while learning in a computer-based learning environment. They identified patterns of mouse clicks which, when performed in a specific order, served as indicators of strategic learning behavior. Wirth (2004; Wirth & Funke, 2005) recorded logfiles while students explored a complex and dynamic system. He analyzed the number of operations students performed either for the first time or repeatedly. Based on this information he developed a process measure that proved to be a good predictor of learning outcome.

In summary, there are three main advantages to computer-based items and tests: (1) They are not restricted to paper-and-pencil formats but can use a great variety of different input devices to collect data from the examinees. (2) They can collect data unobtrusively online, without any missing values, and the data collected can be considered to be unbiased. (3) Probably the greatest advantage of computer-based data collection is the opportunity to capture the point in time when an examinee performs an action. This information about “what is done when” provides the opportunity to create innovative measures, for example, for cognitive load or for assessing behavioral process parameters. Computer-collected data are behavioral data; thus, they are not direct data about cognition or metacognition. Therefore, it is possible to collect unbiased data about “what is done when” but not about “what is thought when”. This gap between cognition and behavior can be more or less wide. As is true not only for computer-based data, this gap has to be kept in mind when using behavioral data to create measures for cognitive or metacognitive competencies.
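
The following sketch illustrates, under simplifying assumptions, the kind of timestamped, unobtrusive data collection described in this section: every event is written to an in-memory log together with the time of its occurrence, and a dual-task style response time can be derived from the very same log. The event names, the log structure and the short pauses that stand in for real interaction times are assumptions made for illustration; the sketch is not the instrumentation used in the studies cited above.

    import time

    class EventLog:
        """Minimal unobtrusive logger: records what is done and when."""
        def __init__(self):
            self.records = []                        # (seconds since start, event, detail)
            self._t0 = time.monotonic()

        def record(self, event, detail=""):
            self.records.append((round(time.monotonic() - self._t0, 3), event, detail))

    log = EventLog()
    log.record("item_shown", "item_17")
    time.sleep(0.2)                                  # stands in for real interaction time
    log.record("secondary_task_signal")              # e.g., an observed character changes color
    time.sleep(0.4)
    log.record("secondary_task_response")            # examinee reacts to the signal
    time.sleep(0.1)
    log.record("mouse_click", "option_B")

    # A dual-task style indicator of cognitive load: the time between the
    # secondary-task signal and the examinee's response, read off the log.
    times = {event: t for t, event, _ in log.records}
    response_time = times["secondary_task_response"] - times["secondary_task_signal"]
    print(log.records)
    print("secondary-task response time:", round(response_time, 3), "seconds")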

Measures and Procedures

Most computer-based (non-adaptive) tests record data while examinees work on the test and write the data into a logfile. Gathering all kinds of potentially valuable information and storing it in logfiles is easy. However, as a result, the logfile is often voluminous, and its structure is complex (except for logfile data of simple multiple-choice items or the like). As Luecht and Clauser (2002) point out: “It is difficult to codify complicated and possibly interdependent data. The signal in the data can be difficult to filter from the noise” (p. 77). The transformation of behavioral data into an objective, reliable and valid measure of cognitive or metacognitive competence is often the most challenging part of designing new computer-based items and tests.

Objectivity is the criterion which is probably least difficult to achieve. The more the computer controls the administration of a test, the less effect the person who delivers the test can exert on the data collected. Most data can and should be analyzed on a computer using command files that specify all computations that are to be performed on the data. These command files provide a complete documentation of the data analyses and of all computations that define a measure. The computations are applied to the data of each examinee equally, and there is neither any random error nor any subjective interpretation of the data that is not documented (which is not the case when the coding is done by individuals). Nowadays, even open questions, short essays or concept maps can be coded automatically by computers. This leads to increased objectivity of the scores compared with conventional human ratings (Breland & Lytle, 1990; Burstein, Kaplan, Wolff, & Lu, 1997; Chung, Baker, Brill, Sinha, & Saadat, 2003; Chung, O’Neil, Bewley, & Baker, 2008, Chapter 12 in this book; Franzke, Kintsch, & Kintsch, 2005; Page & Petersen, 1995). Furthermore, as well as providing increased objectivity, these computer-based ratings are faster and less expensive than human-based ratings.

The reliability of a measure depends on its ability to select all relevant data from a logfile and to ignore all irrelevant data, that is, on its sensitivity in filtering the signal from the noise (Luecht & Clauser, 2002). Every single keystroke, mouse click or other manner of input has to be categorized, and the operationalization of these categories is at the heart of defining a more or less reliable measure. Doing so involves categorizing unique instances as being sufficiently similar so that they can be considered equivalent, while justifiably disregarding some features of each instance (Winne et al., 2002). Grain size of information is one key aspect of this operational problem (cf. Howard-Rose & Winne, 1993; Pintrich, Wolters, & Baxter, 2000). For example, Winne and Hadwin (1998) distinguish tactics from strategies as components of self-regulated, strategic learning. A tactic is a single learning operation triggered by a single condition (IF-THEN rule), while a strategy is seen as an array of multiple tactics (IF-THEN-ELSE rule) that are potentially useful toward reaching the same goal. Students can prefer one of the different tactics associated with a specific strategy. Thus, strategy measures often aggregate different tactics into one scale. However, aggregating these variables may cause the loss of the predictive power that individual tactics had for other variables such as learning outcome (Jamieson-Noel & Winne, 2003).
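
The coding and grain-size problem can be made concrete with a small sketch: raw logfile events are mapped onto tactic categories by simple IF-THEN rules, and the tactic counts are then aggregated into one coarse strategy score. The event names, the rules and the aggregation are assumptions made up for this illustration; they are not the coding schemes of the studies cited here.

    from collections import Counter

    raw_events = ["highlight", "scroll", "note", "highlight", "search", "scroll", "note"]

    def code_tactic(event):
        # IF-THEN rules mapping single events onto tactic categories (invented).
        if event in ("highlight", "note"):
            return "rehearsal_tactic"
        if event in ("search", "scroll"):
            return "monitoring_tactic"
        return "other"

    tactic_counts = Counter(code_tactic(e) for e in raw_events)
    print(tactic_counts)                 # fine-grained level: per-tactic frequencies

    # Aggregating tactics into one "strategy" scale discards the information of
    # which individual tactic carried the predictive power.
    strategy_score = tactic_counts["rehearsal_tactic"] + tactic_counts["monitoring_tactic"]
    print("aggregated strategy score:", strategy_score)
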
Computer-based assessment has the potential to allow the creation of process measures which evaluate the quality of order and change. Until now it has not been clear how to estimate the reliability of process measures that are, by definition, not stable over time (cf. Leutner, 1992, 1993; Willet, 1989). For example, research on self-regulated learning emphasizes the process of small unit changes and adaptations. Until now there have been few measures that capture those small, dynamic events with satisfactory power and precision (Pintrich et al., 2000). Winne, Gupta, and Nesbit (1994; Winne & Nesbit, 1995) proposed such a measure. They converted their data about tactics and learning events into transition matrices and used graph-theoretic statistics to describe process characteristics of learning traces. For example, they calculated linearity (also called density) as an estimate of the probability that a specific learning event is followed by another specific learning event (for an overview of different measures see also Winne et al., 2002). However, as Winne et al. (2002) noted, because these measures await careful evaluation, it is not clear how reliably and validly they describe learning processes.

Wirth (2004; Wirth & Funke, 2005) developed a logfile-based process measure for evaluating the quality of self-regulated learning when exploring a complex and dynamic system. He distinguished between two goals a learner has to pursue when learning how to direct such a system: on the one hand, the learner has to identify and generate new information by interacting with the system; on the other hand, he or she has to integrate information, once it has been identified, into his or her knowledge base. The two goals require different strategies and tactics. Thus, a learner has to decide which goal to pursue at every single point in time during the learning process. Based on logfile data for several succeeding time intervals, Wirth computed a so-called log-odds-ratio measure to estimate whether a learner tried to identify new information or whether he or she tried to integrate information. Using latent growth curve models (McArdle & Bell, 2000), Wirth was able to identify process characteristics that differ between successful and unsuccessful learners. Like all learners, successful learners started by identifying new information, but, unlike unsuccessful learners, they changed very early and sharply to integrating information once it had been identified. Furthermore, successful learners changed their learning behavior in that they were increasingly able to regard only relevant information instead of identifying and integrating all kinds of information (Wirth, in press). Like Winne and his colleagues (2002), Wirth (2004) did not report any reliability estimates of the log-odds-ratio measure, but he did identify several factors affecting reliability, for example, the length of the time intervals used, and he proposed methods to decrease their impact on the psychometric quality of the measure.
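
A minimal sketch of a log-odds style process indicator of this kind is given below: for each time interval, the operations coded as identifying new information are set against those coded as integrating information. The interval counts and the constant of 0.5 added to avoid division by zero are assumptions made for illustration, not Wirth’s published scoring procedure.

    import math

    # Per time interval: counts of operations coded as "identifying new
    # information" vs. "integrating identified information" (invented data).
    intervals = [
        {"identify": 12, "integrate": 1},
        {"identify": 8,  "integrate": 5},
        {"identify": 3,  "integrate": 9},
        {"identify": 1,  "integrate": 11},
    ]

    def log_odds(interval):
        a = interval["identify"] + 0.5     # 0.5 avoids division by zero (an assumption)
        b = interval["integrate"] + 0.5
        return round(math.log(a / b), 2)   # > 0: mainly identifying, < 0: mainly integrating

    trajectory = [log_odds(iv) for iv in intervals]
    print(trajectory)

Plotted over succeeding intervals, such a trajectory would show the early and sharp switch from identification to integration that characterized the successful learners described above.
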
Validity is the third issue in connection with the new computer-based item and test designs. The gap between the behavioral data recorded by computers and the competencies being assessed can be wide, especially when cognitive or metacognitive competencies are to be measured. Not all cognition or metacognition is necessarily expressed overtly, and behavioral data may be a sample of the (meta-)cognitive events to be assessed rather than a complete representation (cf. Jamieson-Noel & Winne, 2003). The definition of measures computed on these data is an attempt to build a bridge over this gap. Careful evaluation of construct validity using multi-method designs is needed to estimate whether this bridge is adequate.

Validity is especially an issue when designing computer-based process measures. Some researchers simply add up the number of equivalent activities, for example, searches performed with a search engine in a simulated World Wide Web environment. This measure is then treated as an estimate of the quality of the search process (e.g., Schacter, Herl, Chung, Dennis, & O’Neil, 1999). Evidence of the validity of using this kind of frequency measure as an indicator of process quality has to be evaluated carefully for at least two reasons: (1) The frequency of a specific kind of activity depends on the number of all kinds of activities an examinee exhibits. Whether or not an examinee interacts extensively with the computer system can be explained by differences in motivation or exploration style (Klahr & Dunbar, 1988). Thus, pure frequency measures are probably affected by a number of variables that are not intended to be assessed. (2) As Jamieson-Noel and Winne (2003) point out, it is questionable whether the same activities are commensurate over time. For example, pressing a button on an unknown computer for the first time can have a totally different meaning from pressing the button repeatedly later on (Wirth, 2004).

The ability to generalize is another issue with computer-based assessment. Most activities an examinee performs are test-specific. For example, they depend on the input device, the context and content of the computer-based environment, and so on. Data collected may be very specific and might not generalize beyond the particular environment.

In summary, innovative items and tests can be designed to evaluate, for example, the order and changes of processes beyond frequency measures or self-reports. Computer-based measures have the potential to be highly objective and inexpensive, especially when complex data structures like concept maps are to be analyzed. However, there are issues regarding the classical criteria of test quality that are often overlooked. Reliability of computer-based measures is difficult to achieve, especially when a measure is based on a highly complex data structure. Indices to estimate the reliability of measures of the process of change are still to be developed. Evaluation of the construct validity and generalizability of new computer-based items and tests is often insufficient, while test developers often seem content with face validity or ecological validity.
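
The two caveats about frequency measures raised above can be made concrete with a small sketch: two hypothetical examinees produce the same raw number of search actions, but normalizing by overall activity level and separating first-time from repeated actions lead to quite different pictures. All event streams and indicators are invented for illustration.

    # Two invented event streams with an identical raw count of "search" actions.
    examinee_a = ["search", "open", "search", "open", "search"]
    examinee_b = ["search", "open", "open", "scroll", "open", "scroll", "search",
                  "scroll", "open", "search"]

    def indicators(events):
        raw_searches = events.count("search")
        relative = raw_searches / len(events)            # caveat 1: normalize by activity level
        seen, first_time = set(), 0
        for e in events:
            if e not in seen:                            # caveat 2: first occurrence vs. repetition
                first_time += 1
                seen.add(e)
        return {"raw_searches": raw_searches,
                "relative_searches": round(relative, 2),
                "first_time_actions": first_time,
                "repeated_actions": len(events) - first_time}

    print(indicators(examinee_a))
    print(indicators(examinee_b))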

Summary and Conclusion

Computer-based competence testing and assessment is a valuable addition to conventional formats. Computer-based items and tests lead to the definition of new constructs or the re-characterization of traditional constructs. They also make it possible to assess competencies in ways they were never able to be assessed before. The use of multimedia can enhance the test-fairness and authenticity of the test situation, which in turn can lead to higher acceptance of the test by test deliverers and examinees alike and, if carefully designed, also to improved construct validity. Collecting data on computers means no data values are missed. Data, including time recordings, can be collected unobtrusively online. These data provide the groundwork for the definition of innovative measures such as process measures and can be coded objectively and inexpensively. Because of these advantages, computer-based tests can provide new objective, valid and reliable measures for traditional or new competencies.

However, there are at least three issues that have to be carefully considered when designing new computer-based items and tests: (1) Ecological validity and construct validity are not the same, and developing highly authentic test environments does not automatically lead to the provision of valid construct measures (van der Linden, 2002). On the contrary, increasing the authenticity and complexity of a test environment can even impair construct validity because features are added to the test situation that are not an essential part of the definition of the competence. Thus, there is always a trade-off between ecological validity and construct validity. (2) The more complex the structure of the collected data, the more difficult it is to filter the signal from the noise and to develop a reliable and valid measure. Thus, the second trade-off is between task and data complexity on the one hand and scoring simplicity on the other (Luecht & Clauser, 2002). (3) As is true for all new measures, computer-based items and tests have to be evaluated carefully. Multi-method designs for the evaluation of validity as well as the development of new estimates of reliability are highly desirable. However, as long as the same criteria for the evaluation of test quality are as carefully applied to the development of computer-based items and tests as they are to conventional items and tests, computer-based assessments provide great alternatives for new item and test designs.

References

Ackermann, T. A., Evans, J., Park, K.-S., Tamassia, C., & Turner, R. (1999). Computer assessment using visual stimuli: A test of dermatological skin disorders. In F. Drasgow & J. B. Olson-Buchanan (Eds.), Innovations in computerized assessment (pp. 137–150). Mahwah, NJ: Lawrence Erlbaum.
Alderson, J. C. (2000). Assessing reading. Cambridge: Cambridge University Press.
Artelt, C. (2000). Strategisches Lernen [Strategic learning]. Münster: Waxmann.
Baker, E., & O’Neil, H. F. (2002). Measuring problem solving in computer environments: Current and future states. Computers in Human Behavior, 18, 609–622.
Bannert, M. (2003). Effekte metakognitiver Lernhilfen auf den Wissenserwerb in vernetzten Lernumgebungen [Effects of metacognitive support on learning in web-based learning environments]. Zeitschrift für Pädagogische Psychologie, 17, 13–25.
Barab, S. A., Bowdish, B. E., & Lawless, K. A. (1997). Hypermedia navigation: Profiles of hypermedia users. Educational Technology Research and Development, 45, 23–41.
Bennett, R. E., Goodman, M., Hessinger, J., Kahn, H., Ligget, J., Marshall, G., & Zack, J. (1999). Using multimedia in large-scale computer-based testing programs. Computers in Human Behavior, 15, 283–294.
Berry, D. C., & Broadbent, D. E. (1995). Implicit learning in the control of complex systems. In P. A. Frensch & J. Funke (Eds.), Complex problem solving: The European perspective (pp. 131–150). Hillsdale, NJ: Lawrence Erlbaum.
Breland, H. M., & Lytle, E. G. (1990, April). Computer-assisted writing skill assessment using WordMap. Paper presented at the Annual Meeting of the American Educational Research Association, Boston.
Brünken, R., Müller-Kalthoff, T., & Möller, J. (2005). Lernen mit Hypertext und Multimedia: Aktuelle Trends und Stand der Entwicklung [Learning with hypertext and multimedia: Trends in research and state of the art]. Zeitschrift für Pädagogische Psychologie, 19, 1–3.

Brünken, R., Seufert, T., & Zander, S. (2005). Förderung der Kohärenzbildung beim Lernen mit multiplen Repräsentationen [Fostering coherence formation in learning with multiple representations]. Zeitschrift für Pädagogische Psychologie, 19, 61–75.
Brünken, R., Steinbacher, S., Plass, J. L., & Leutner, D. (2002). Assessment of cognitive load in multimedia learning using dual-task methodology. Experimental Psychology, 49, 109–119.
Buchner, A., Funke, J., & Berry, D. C. (1995). Negative correlations between control performance and verbalizable knowledge: Indicators for implicit learning in process control tasks? The Quarterly Journal of Experimental Psychology, 48A, 166–187.
Burstein, J., Kaplan, R., Wolff, S., & Lu, C. (1997). Automatic scoring of advanced placement biology essays. Princeton, NJ: Educational Testing Service.
Chung, G. K. W. K., Baker, E., Brill, D. G., Sinha, R., & Saadat, F. (2003, November). Automated assessment of domain knowledge with online knowledge mapping. Paper presented at the Interservice/Industry Training, Simulation, and Education Conference, Orlando, FL.
Chung, G. K. W. K., O’Neil, H. F., Bewley, W. L., & Baker, E. L. (2008). Computer-based assessments to support distance learning. In J. Hartig, E. Klieme, & D. Leutner (Eds.), Assessment of competencies in educational contexts: State of the art and future prospects (pp. 237–261). Göttingen: Hogrefe & Huber.
Clyman, S. G., Melnick, D. E., & Clauser, B. E. (1995). Computer-based case simulations. In E. L. Mancall & P. G. Bashook (Eds.), Assessing clinical reasoning: The oral examination and alternative methods (pp. 139–149). Evanston, IL: American Board of Medical Specialties.
Dörner, D., Kreuzig, H. W., Reither, F., & Stäudel, T. (1983). Lohhausen. Vom Umgang mit Unbestimmtheit und Komplexität [Lohhausen. Dealing with uncertainty and complexity]. Bern: Huber.
Dörner, D., & Preußler, W. (1990). Die Kontrolle eines einfachen ökologischen Systems [Controlling a simple ecological system]. Sprache & Kognition, 9, 205–217.
Dörner, D., Schaub, H., & Strohschneider, S. (1999). Komplexes Problemlösen – Königsweg der Theoretischen Psychologie? [Complex problem solving – Via Regia of Theoretical Psychology?] Psychologische Rundschau, 50, 198–205.
Drasgow, F. (2002). The work ahead: A psychometric infrastructure for computerized adaptive tests. In C. N. Mills, M. T. Potenza, J. J. Fremer, & W. C. Ward (Eds.), Computer-based testing. Building the foundation for future assessments (pp. 1–35). Mahwah, NJ: Lawrence Erlbaum.
Eggen, T. (2008). Adaptive testing and item banking. In J. Hartig, E. Klieme, & D. Leutner (Eds.), Assessment of competencies in educational contexts: State of the art and future prospects (pp. 199–217). Göttingen: Hogrefe & Huber.
Educational Testing Service. (1997). The praxis series: Tests at a glance. Princeton, NJ: Educational Testing Service.
Educational Testing Service. (2005). TOEFL iBT at a glance. Retrieved September 26, 2005 from http://www.ets.org/Media/Tests/TOEFL/pdf/TOEFL_at_a_Glance.pdf
Fleischer, J., Pallack, A., Wirth, J., & Leutner, D. (2005, September). Vergleichende Schulrückmeldungen im Rahmen der Lernstandserhebungen in Nordrhein-Westfalen [Comparative school reports within the Lernstandserhebungen in Northrhine-Westfalia]. Poster presented at the 67. Tagung der Arbeitsgruppe Empirisch-Pädagogische Forschung, Salzburg, Österreich.
Flender, J., & Naumann, J. (2002). Empirisches Beispiel: Erfassung allgemeiner Lesefähigkeit und Rezeption nicht-linearer Texte: „PL-Lesen“ und Logfile-Analysen [Empirical example: Reading assessment and reception of non-linear texts: “PL-Lesen” and logfile analyses]. In N. Groeben & B. Hurrelmann (Eds.), Lesekompetenz: Bedingungen, Dimensionen, Funktionen (pp. 59–79). Weinheim: Juventa.
Folk, V. G., & Smith, R. L. (2002). Models for delivery of CBTs. In C. N. Mills, M. T. Potenza, J. J. Fremer, & W. C. Ward (Eds.), Computer-based testing. Building the foundation for future assessments (pp. 41–66). Mahwah, NJ: Lawrence Erlbaum.

Franzke, M., Kintsch, E., & Kintsch, W. (2005, August). Using summary street to improve reading and writing instruction. Paper presented at the 11th Biennial Conference of the European Association for Research on Learning and Instruction, Nicosia.
Frensch, P. A., & Funke, J. (Eds.). (1995). Complex problem solving: The European perspective. Hillsdale, NJ: Lawrence Erlbaum.
Funke, J. (1985). Steuerung dynamischer Systeme durch Aufbau und Anwendung subjektiver Kausalmodelle [Directing dynamic systems by constructing and using subjective causal models]. Zeitschrift für Psychologie, 193, 443–465.
Ginther, A. (2002). Context and content visuals and performance on listening comprehension stimuli. Language Testing, 19, 133–167.
Gräsel, C. (1997). Problemorientiertes Lernen [Problem-oriented learning]. Göttingen: Hogrefe.
Groot, A. S., de Sonneville, M. J., & Stins, J. F. (2004). Familial influences on sustained attention and inhibition in preschoolers. Journal of Child Psychology and Psychiatry and Allied Disciplines, 45, 306–314.
Hadwin, A. F., & Winne, P. H. (2001). CoNoteS2: A software tool for promoting self-regulation and collaboration. Educational Research and Evaluation, 7, 313–334.
Hadwin, A. F., Winne, P. H., & Nesbit, J. C. (2005). Roles for software technologies in advancing research and theory in educational psychology. British Journal of Educational Psychology, 75, 1–24.
Harris, D., & Khan, H. (2003). Response time to reject a takeoff. Human Factors and Aerospace Safety, 3, 165–175.
Höffler, T., & Leutner, D. (2007). Instructional animations versus static pictures: A meta-analysis. Learning and Instruction, 17, 722–738.
Howard-Rose, D., & Winne, P. H. (1993). Measuring component and sets of cognitive processes in self-regulated learning. Journal of Educational Psychology, 85, 591–604.
Jamieson-Noel, D. L., & Winne, P. H. (2003). Comparing self-reports to traces of studying behavior as representations of students’ studying and achievement. Zeitschrift für Pädagogische Psychologie, 17, 159–171.
Jurecka, A. (2008). Introduction to computer based assessment – a review of relevant issues and current approaches in Europe. In J. Hartig, E. Klieme, & D. Leutner (Eds.), Assessment of competencies in educational contexts: State of the art and future prospects (pp. 177–197). Göttingen: Hogrefe & Huber.
Klahr, D., & Dunbar, K. (1988). Dual space search during scientific reasoning. Cognitive Science, 12(1), 1–48.
Klieme, E., Leutner, D., & Wirth, J. (Eds.). (2005). Problemlösekompetenz von Schülerinnen und Schülern. Diagnostische Ansätze, theoretische Grundlagen und empirische Befunde der deutschen PISA-2000-Studie [Problem solving competence of students. Diagnostic approaches, theoretical background and empirical results of the PISA 2000 study]. Wiesbaden: VS Verlag für Sozialwissenschaften.
Knight, J. F., Williams, D. D., Arvanitis, T. N., Chris, B., Wichmann, A., Wittkaemper, M., Herbst, I., & Sotiriou, S. (2005, October). Wearability assessment of a mobile augmented reality system. Paper presented at the 11th International Conference on Virtual Systems and MultiMedia (VSMM), Ghent.
Krauss, S., Kunter, M., Brunner, M., Baumert, J., Blum, W., Neubrand, M., Jordan, A., & Löwen, K. (2004). COACTIV: Professionswissen von Lehrkräften, kognitiv aktivierender Mathematikunterricht und die Entwicklung von mathematischer Kompetenz [COACTIV: Professional knowledge of teachers, cognitive activating teaching in mathematics, and the development of mathematical competence]. In J. Doll & M. Prenzel (Eds.), Bildungsqualität von Schule: Lehrerprofessionalisierung, Unterrichtsentwicklung und Schülerförderung als Strategien der Qualitätsverbesserung (pp. 31–53). Münster: Waxmann.

Kröner, S. (2001). Intelligenzdiagnostik per Computersimulation [Intelligence testing with computer simulations]. Münster: Waxmann.
Lawless, K. A., & Brown, S. (1997). Multimedia learning environments: Issues of learner control and navigation. Instructional Science, 25, 117–131.
Leutner, D. (1992). Das Testlängendilemma in der lernprozessbegleitenden Wissensdiagnostik [The dilemma of test length in process-based knowledge testing]. Zeitschrift für Pädagogische Psychologie, 6, 233–238.
Leutner, D. (1993). Das gleitende Testfenster als Lösung des Testlängendilemmas: Eine Robustheitsstudie [Moving test windows as solution to the dilemma of test length: A robustness study]. Zeitschrift für Pädagogische Psychologie, 7, 33–45.
Leutner, D. (2002). The fuzzy relationship of intelligence and problem solving in computer simulations. Computers in Human Behavior, 18, 685–697.
Leutner, D., & Brünken, R. (Eds.). (2000). Neue Medien in Unterricht, Aus- und Weiterbildung. Aktuelle Ergebnisse empirischer pädagogischer Forschung [New media in schools and further education. Recent results of empirical educational research]. Münster: Waxmann.
Leutner, D., & Plass, J. L. (1998). Measuring learning styles with questionnaires versus direct observation of preferential choice behavior in authentic learning situations: The visualizer/verbalizer behavior observation scale (VV-BOS). Computers in Human Behavior, 14, 543–557.
Lompscher, J. (1994). Lernstrategien: Zugänge auf der Reflexions- und Handlungsebene [Learning strategies: Accesses on the reflective and behavioral level], LLF-Berichte (Vol. 9, pp. 114–129). Potsdam: Universität Potsdam.
Lord, F. M. (1980). Applications of item response theory to practical problems. Hillsdale, NJ: Lawrence Erlbaum.
Luecht, R. M., & Clauser, B. E. (2002). Test models for complex CBT. In C. N. Mills, M. T. Potenza, J. J. Fremer, & W. C. Ward (Eds.), Computer-based testing. Building the foundation for future assessments (pp. 67–88). Mahwah, NJ: Lawrence Erlbaum.
Mayer, R. E. (2001). Multimedia learning. Cambridge: Cambridge University Press.
McArdle, J. J., & Bell, R. Q. (2000). An introduction to latent growth models for developmental data analysis. In T. D. Little, K. U. Schnabel, & J. Baumert (Eds.), Modeling longitudinal and multilevel data. Practical issues, applied approaches and specific examples (pp. 69–107). Mahwah, NJ: Lawrence Erlbaum.
Mikelskis, H. F. (1997). Der Computer – ein multimediales Werkzeug zum Lernen von Physik [The computer – a multimedia tool for learning physics]. Physik in der Schule, 35, 394–398.
Mislevy, R. J. (1996). Test theory reconceived. Journal of Educational Measurement, 33, 379–416.
Nerdel, C. (2003). Die Wirkung von Animation und Simulation auf das Verständnis von stoffwechselphysiologischen Prozessen [Effects of animations and simulations on understanding metabolic processes]. Unpublished dissertation, University of Kiel.
Niegemann, H. M., Leutner, D., & Brünken, R. (Eds.). (2004). Instructional design for multimedia learning. Münster: Waxmann.
Normann, M., Debus, G., Dörre, P., & Leutner, D. (2004). Training of tram drivers in workload management – workload assessment in real life and in a driving/traffic simulator. In T. Rothengatter & R. D. Huguenin (Eds.), Traffic and transport psychology – theory and application (Proceedings of the ICTTP 2000, pp. 113–121). Amsterdam: Elsevier.
Olson-Buchanan, J. B., Drasgow, F., Moberg, P. J., Mead, A. D., Keenan, P. A., & Donovan, M. (1998). Conflict resolution skills assessment: A model-based, multi-media approach. Personnel Psychology, 51, 1–24.
Page, E. B., & Petersen, N. S. (1995). The computer moves into essay grading. Phi Delta Kappan, 76, 561–565.
Pintrich, P. R., Smith, D. A. F., Garcia, T., & McKeachie, W. J. (1991). The motivated strategies for learning questionnaire (MSLQ). Ann Arbor, MI: NCRIPTAL, The University of Michigan.

Pintrich, P. R., Wolters, C. A., & Baxter, G. P. (2000). Assessing metacognition and self-regulated learning. In G. Schraw & J. C. Impara (Eds.), Issues in the measurement of metacognition (pp. 43–97). Lincoln, NE: Buros Institute of Mental Measurement.
Plass, J. L., Chun, D. M., Mayer, R. E., & Leutner, D. (1998). Supporting visual and verbal learning preferences in a second-language multimedia learning environment. Journal of Educational Psychology, 90, 25–36.
Plichart, P., Jadoul, R., Vandenabeele, L., & Latour, T. (2004, November). TAO, a collaborative distributed computer-based assessment framework built on semantic web standards. Paper presented at the International Conference on Advances in Intelligent Systems – Theory and Applications AISTA, Luxembourg.
Prenzel, M., von Davier, M., Bleschke, M. G., Senkbeil, M., & Urhahne, D. (2000). Didaktisch optimierter Einsatz Neuer Medien: Entwicklung von computergestützten Unterrichtskonzepten für die naturwissenschaftlichen Fächer [Didactically optimized use of new media: development of computer-based teaching conceptions in science teaching]. In D. Leutner & R. Brünken (Eds.), Neue Medien in Unterricht, Aus- und Weiterbildung. Aktuelle Ergebnisse empirischer pädagogischer Forschung (pp. 113–121). Münster: Waxmann.
Putz-Osterloh, W. (1981). Über die Beziehung zwischen Testintelligenz und Problemlöseerfolg [About the relationship between test intelligence and problem solving success]. Zeitschrift für Psychologie, 189, 79–100.
Richter, T., Naumann, J., Brunner, M., & Christmann, U. (2005). Strategische Verarbeitung beim Lernen mit Text und Hypertext [Strategic processing during learning with text and hypertext]. Zeitschrift für Pädagogische Psychologie, 19, 5–22.
Richter, T., Naumann, J., & Noller, S. (2003). LOGPAT: A semi-automatic way to analyze hypertext navigation behavior. Swiss Journal of Psychology, 62, 113–120.
Rosenblum, S., Parush, S., & Weiss, P. L. (2003a). Computerized temporal handwriting characteristics of proficient and non-proficient handwriters. American Journal of Occupational Therapy, 57, 139–138.
Rosenblum, S., Parush, S., & Weiss, P. L. (2003b). The in air phenomenon: Temporal and spatial correlates of the handwriting process. Perceptual and Motor Skills, 96, 933–954.
Sawaki, Y. (2001). Comparability of conventional and computerized tests of reading in a second language. Language Learning and Technology, 5, 38–59.
Schacter, J., Herl, H. E., Chung, G. K. W. K., Dennis, R. A., & O’Neil, H. F. (1999). Computer-based performance assessments: A solution to the narrow measurement and reporting of problem-solving. Computers in Human Behavior, 15, 403–418.
Shavelson, R. J., Baxter, G. P., & Gao, X. (1993). Sampling variability of performance assessment. Journal of Educational Measurement, 30, 215–232.
Siebert, W. F., & Snow, R. E. (1965). Cine-psychometry. AV Communication Review, 13, 140–158.
Strauß, B., & Kleinmann, M. (Eds.). (1995). Computersimulierte Szenarien in der Personalarbeit [Computer-simulated scenarios in personnel work]. Göttingen: Verlag für Angewandte Psychologie.
Streufert, S., Pogash, R., & Piasecki, M. (1988). Simulation-based assessment of managerial competence: Reliability and validity. Personnel Psychology, 41, 537–557.
Süß, H.-M. (1996). Intelligenz, Wissen und Problemlösen [Intelligence, knowledge, and problem solving]. Göttingen: Hogrefe.
Süß, H.-M. (1999). Intelligenz und komplexes Problemlösen: Perspektiven für eine Kooperation zwischen differentiell-psychometrischer und kognitionspsychologischer Forschung [Intelligence and problem solving: Perspectives for cooperation of differential-psychometric and cognitive-psychological research]. Psychologische Rundschau, 50, 220–228.
Tversky, B., Morrison, J.-B., & Betrancourt, M. (2002). Animation: Can it facilitate? International Journal of Human-Computer Studies, 57, 247–262.

van der Linden, W. J. (2002). On complexity in CBT. In C. N. Mills, M. T. Potenza, J. J. Fremer, & W. C. Ward (Eds.), Computer-based testing. Building the foundation for future assessments (pp. 89–102). Mahwah, NJ: Lawrence Erlbaum.
Vispoel, W. P. (1999). Creating computerized adaptive tests of music aptitude: Problems, solutions and future directions. In F. Drasgow & J. B. Olson-Buchanan (Eds.), Innovations in computerized assessment (pp. 151–176). Mahwah, NJ: Lawrence Erlbaum.
Wagener, D. (2001). Psychologische Diagnostik mit komplexen Szenarios. Taxonomie, Entwicklung, Evaluation [Psychological diagnostics with complex scenarios. Taxonomy, development, evaluation]. Lengerich: Pabst Science Publishers.
Willet, J. B. (1989). Some results on reliability for the longitudinal measurement of change: Implications for the design of studies of individual growth. Educational and Psychological Measurement, 49, 587–602.
Winne, P. H. (1982). Minimizing the black box problem to enhance the validity of theories about instructional effects. Instructional Science, 11, 13–28.
Winne, P. H., Gupta, L., & Nesbit, J. C. (1994). Exploring individual differences in studying strategies using graph theoretic statistics. Alberta Journal of Educational Research, 40, 177–193.
Winne, P. H., & Hadwin, A. F. (1998). Studying as self-regulated learning. In D. J. Hacker, J. Dunlosky, & A. C. Graesser (Eds.), Metacognition in educational theory and practice (pp. 277–304). Hillsdale, NJ: Lawrence Erlbaum.
Winne, P. H., & Jamieson-Noel, D. L. (2002). Exploring students’ calibration of self reports about study tactics and achievement. Contemporary Educational Psychology, 27, 551–572.
Winne, P. H., Jamieson-Noel, D. L., & Muis, K. (2002). Methodological issues and advances in researching tactics, strategies, and self-regulated learning. In P. R. Pintrich & M. L. Maehr (Eds.), New directions in measures and methods (pp. 121–155). Greenwich, CT: JAI Press.
Winne, P. H., & Nesbit, J. C. (1995, April). Graph theoretic techniques for examining patterns and strategies in learners’ studying: An application of LogMill. Paper presented at the Annual Meeting of the American Educational Research Association, San Francisco.
Wirth, J. (2004). Selbstregulation von Lernprozessen [Self-regulation of learning processes]. Münster: Waxmann.
Wirth, J. (in press). Selbstreguliertes Lernen in komplexen und dynamischen Situationen. Die Nutzung von Handlungsdaten zur Erfassung verschiedener Aspekte der Lernprozessregulation [Self-regulated learning in complex and dynamic situations. Using behavioral data for the assessment of different aspects of learning regulation]. In C. Artelt & B. Moschner (Eds.), Lernstrategien und Metakognition: Implikationen für Forschung und Praxis. Münster: Waxmann.
Wirth, J., & Funke, J. (2005). Dynamisches Problemlösen: Entwicklung und Evaluation eines neuen Messverfahrens zum Steuern komplexer Systeme [Dynamic problem solving: Development and evaluation of a new measurement of directing complex systems]. In E. Klieme, D. Leutner, & J. Wirth (Eds.), Problemlösekompetenz von Schülerinnen und Schülern. Diagnostische Ansätze, theoretische Grundlagen und empirische Befunde der deutschen PISA-2000-Studie (pp. 55–72). Wiesbaden: VS Verlag für Sozialwissenschaften.
Wirth, J., & Klieme, E. (2003). Computer-based assessment of problem solving competence. Assessment in Education. Principles, Policy, & Practice, 10, 329–345.
Wittrock, M. C. (1989). Generative processes of comprehension. Educational Psychologist, 24, 345–376.
