Marking Strategies in Metacognition-Evaluated Computer-Based Testing

Chen, L.-J., Ho, R.-G., & Yen, Y.-C. (2010). Marking Strategies in Metacognition-Evaluated Computer-Based Testing. Educational Technology & Society, 13 (1), 246–259.

Li-Ju Chen, Rong-Guey Ho* and Yung-Chin Yen
Graduate Institute of Information and Computer Education, National Taiwan Normal University, Taipei, Taiwan // [email protected] // [email protected] // [email protected]
*Contact author

ABSTRACT
This study aimed to explore the effects of marking and metacognition-evaluated feedback (MEF) in computer-based testing (CBT) on student performance and review behavior. Marking is a strategy in which students place a question mark next to a test item to indicate an uncertain answer. The MEF provided students with feedback on test results classified as correct answers with and without marking or incorrect answers with and without marking. The study analyzed 454 ninth graders randomly assigned to three groups: Gmm (marking + MEF), Gmu (marking), and Guu (none). Each group was further categorized into three subgroups based on English ability. Results showed that marking improved medium-ability examinees' test scores. This was a promising finding because the medium-ability students were the very target group with the most potential for improvement. Additionally, MEF was found to be beneficial as well, in that it encouraged students to use marking skills more frequently and to review the answer-explanations (AEs) of the test items. The follow-up interviews indicated that providing adaptive and detailed AEs for low-ability students was necessary. The present study reveals the potential of integrating marking and adaptive feedback into the design of learning functions that are worth implementing in CBT systems.

Keywords Computer-based testing (CBT), Test-taking behavior, Marking behavior, Metacognition evaluation, Confidence rating technique

Introduction

Computer-based testing (CBT) has been widely used since information technology became popular. Such tests are easily administered by computer or an equivalent electronic device, and students can immediately access their test results. Many researchers claimed that CBT systems were valuable self-evaluation tools for self-managed learning (Croft, Danson, Dawson, & Ward, 2001; Peat & Franklin, 2002). However, studies indicated that, for effective and efficient use as self-managed learning tools, CBT systems must provide adaptive feedback for future learning (Souvignier & Mokhlesgerami, 2006; Thelwall, 2000; Wong, Wong, & Yeung, 2001). They must also provide information that enables students to control their own pace during the test (Parshall, Kalhn, & Davey, 2002, p.41).

Adaptive feedback enabled students to learn according to the instructional strategies provided (Collis & Messing, 2000; Collis & Nijhuis, 2000). According to Collis, De Boer and Slotman (2001), giving adaptive feedback after a test was one strategy for helping students learn effectively; it could help underachievers extend their learning. For example, giving answer-explanations (AEs) related to the key knowledge concepts of test items after a CBT could help students understand what they had learned and identify their mistakes (Wang, Wang, Huang, & Chen, 2004); that is, AEs were a metacognitive strategy (Rasekh & Ranjbary, 2003). Answer-explanations offered via automatic evaluation tools could correct student mistakes, reinforce their memories, and support their learning, as well as reduce teacher workload so that individual students could receive adaptive compensatory instruction in a forty-student class. Therefore, if CBT systems only displayed scores without feedback, the "teachable moment", or the moment of educational opportunity when students were disposed to learn, might not be used effectively (Collis et al., 2001; Ram, Cox, & Narayanan, 1995).

To help students control their own pace, CBT systems could provide the information needed to navigate a test, such as reminders of unanswered items. Gibson, Brewer, Dholakia, Vouk and Bitzer (1995) showed that such information could help students complete the CBT efficiently and reduce their frustration and anxiety. Another mechanism for controlling the testing process within the CBT environment was the marking function. Marking was a skill used to increase the efficiency and effectiveness of self-managed learning (Parshall et al., 2002, p.34). In the present study, marking referred to a test-taking behavior in which the student placed a question mark next to a test item to indicate an uncertain answer; the mark also served as a reminder to review, check, or revise the answer. According to Higgins and Hoffmann (2005), students rarely marked test items when they were sure of their answers. Therefore, marking could be considered one alternative to the confidence rating technique conventionally used to measure the metacognition monitoring ability of students.

Students applying the confidence rating technique were required to check the confidence degree of their answers. Their metacognition monitoring ability was then evaluated by matching the confidence degree with the test results (Baker & Brown, 1984; Desoete & Roeyers, 2006; Vu, Hanley, Strybel, & Proctor, 2000). For example, choosing a correct answer and marking it high on confidence level suggested good metacognition monitoring ability, whereas choosing a wrong answer and marking it high on confidence level indicated poor metacognition monitoring ability.

This study proposed metacognition-evaluated feedback (MEF), a new feedback mode for CBT systems displaying AEs that integrate student answer responses and marking records. The study had two purposes. First, it explored whether marking could improve the test scores of examinees. Second, it investigated how MEF affected the review behavior of students after completing a CBT. To achieve these two purposes, an experiment was designed to address the following questions:
1. Does marking improve student scores?
2. Does MEF increase use of marking skills and review behavior?

Related research

Test-taking behavior and marking

Test-taking behavior varies among students. Researchers generally classified test-taking behaviors into nine types: (1) browsing items, (2) clarifying meanings of the item body and options, (3) knowing the answer, (4) not knowing the answer and guessing, (5) omitting, (6) abandoning, (7) not reaching, (8) having partial knowledge that might be right or wrong, and (9) changing answers (Brown, 1980; Lazarte, 1999; Lord, 1975; McMorris & Leonard, 1976). Examinees usually used marking skills under type (8) and (9) conditions (Burton, 2002; Parshall et al., 2002, p.34) because marking was a helpful test-taking technique for checking answers. However, most CBT systems described in the literature did not incorporate the marking function (Gibson et al., 1995; Parshall et al., 2002, p.34).

Marking was a direct test-taking strategy used by students. It helped examinees remember the test items they had skipped or wanted to recheck. The marked test items could then be changed according to the partial knowledge or the test-taking skills of the examinees (Burton, 2002). Rogers and Bateson (1991) concluded that good test-taking skills and knowledge of a certain subject could help examinees improve their scores by identifying clues embedded in the test items. Therefore, marking was likely to enhance student performance because it could make them focus on specific items. However, current CBT systems such as Mklesson, Tutorial Gateway, Eval and Open Learning Agency of Australia (Gibson et al., 1995), LON-CAPA (http://www.lon-capa.org), and TopClass (http://www.websystems.com) (Bonham, Beichnen, Titus, & Martin, 2000; Wang et al., 2004) did not analyze marking behavior.

Briefly, a noticeable problem of current CBT systems was that they did not incorporate the marking function. In CBT systems without the marking function, examinees might not focus on the items they needed to reconsider. Therefore, this study attempted to overcome this problem by designing a CBT system with a marking function.

Confidence rating technique

Marking indicated student confidence and also served as a reminder to recheck test items (Parshall et al., 2002, p.34; Higgins et al., 2005). For example, students might put a check mark beside a test item to indicate that they were not sure of the answer. Restated, marking was an alternative approach for judging the confidence level of examinees, which was traditionally measured by using the confidence rating technique to estimate metacognition monitoring ability (Baker & Brown, 1984). Other measurement methods, such as interviews, observation, thinking aloud, self-reporting, and questionnaire surveys, have also been used in past studies (Desoete & Roeyers, 2006; Elshout-Mohr, Meijer, van Daalen-Kapteijns, & Meeus, 2003; Garner, 1988, p.61). However, each had drawbacks. The analytic results of interviews, observation, and thinking aloud were accurate but time-consuming. Moreover, coherent results were difficult to obtain because these measurement methods often involved subjective evaluations (Veenman, 2003). Also, the results of self-reporting and questionnaire surveys might induce 'response set' problems such as careless answering, acquiescence, and social expectations (Garner, 1988, p.61; Linn & Gronlund, 2000, p.182). Therefore, this study employed marking as a confidence rating technique for the benefits of its stability, efficiency, and practicality.

The confidence rating technique was performed as follows. Examinees estimated their confidence in their answers by ticking one of three levels: 'sure correct', 'not sure', or 'sure incorrect'. Their metacognition monitoring ability was then measured by matching their confidence degree ('sure correct' or 'sure incorrect') with their test results ('correct' or 'incorrect'). For example, students who chose a correct answer but marked 'sure incorrect' on the confidence level showed poor metacognition monitoring ability. Conversely, students who chose a wrong answer and marked 'sure incorrect' on the confidence level showed good metacognition monitoring ability. Note, however, that students who marked 'not sure' were excluded from the analysis of metacognition monitoring ability regardless of whether their answers were correct.

This approach provided simple and quick measures, which were expected in computer-based adaptive learning environments (Kalyuga, 2006). However, the problem with this confidence rating technique was that low-ability examinees were most likely to choose 'sure incorrect' in tests, and most indeed ended up with incorrect answers. Therefore, they were mistakenly interpreted as students with high metacognition monitoring abilities. The method applied in this study avoided this problem, since MEF can clearly identify this particular group.

In short, using marking as an alternative confidence rating technique was not only a good way to measure the metacognition monitoring abilities of examinees; it was also rather easy to incorporate into CBT systems (Parshall et al., 2002, p.34). Therefore, if a CBT was designed to employ marking, the confidence rating technique could be applied, and data on metacognition monitoring abilities could be obtained.
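To make the matching step concrete, the following Python sketch scores metacognition monitoring from confidence ratings and answer correctness, excluding 'not sure' responses as described above. It is an illustration only; the field names and data format are assumptions, not part of the systems examined in this study.

# Illustrative sketch of the confidence-rating matching step
# (hypothetical data format; not taken from the authors' system).

def monitoring_accuracy(responses):
    """responses: list of dicts with keys 'confidence'
    ('sure_correct', 'not_sure', 'sure_incorrect') and 'correct' (bool)."""
    matches, judged = 0, 0
    for r in responses:
        if r["confidence"] == "not_sure":
            continue  # 'not sure' items are excluded from the analysis
        judged += 1
        predicted_correct = (r["confidence"] == "sure_correct")
        if predicted_correct == r["correct"]:
            matches += 1  # confidence judgement matched the test result
    return matches / judged if judged else None

# Example: a wrong answer rated 'sure incorrect' counts as good monitoring.
print(monitoring_accuracy([
    {"confidence": "sure_correct", "correct": True},
    {"confidence": "sure_incorrect", "correct": False},
    {"confidence": "not_sure", "correct": True},
]))  # -> 1.0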

Design of metacognition-evaluated feedback

This study proposed metacognition-evaluated feedback (MEF), a new feedback mode integrating student marking records and answer responses. Before the CBT started, students were instructed to place a mark on any test item for which they were unsure of the answer. As soon as students completed the CBT, they obtained the MEF. The marking and correctness of their answers were the criteria used to classify their test results into four categories: Category I, II, III, and IV (Chen, Ho, & Yen, 2006). Category I represented correct answers with marking, while Category II denoted correct answers without marking. Category III included incorrect answers with marking, whereas Category IV comprised incorrect answers without marking. Since the presence of marking indicated whether or not students were sure of their answers, Categories I and III could be defined as unsure-correct and unsure-incorrect, respectively. Further, Categories II and IV were defined as expected-correct and unexpected-incorrect according to failure-driven learning theory (Pacifici & Garrison, 2004; Schank, 1995). This learning theory claimed that mistakes, including unsuccessful results and unmet expectations, were failures that could promote advanced learning. For instance, students made predictions about their test results and then observed what happened to check their predictions. If their predictions failed, they tried to determine how these mistakes had occurred and then solved their problems. In MEF, the further classification of incorrect responses as either unsure or unexpected might motivate students to practice further.

In this study, MEF adopted marking as an indicator of student confidence level. Compared with the traditional confidence rating technique, marking was more straightforward, and it reduced interference because it did not require students to rate their confidence level on each test item during the test (Jacobs & Chase, 1992; Wise & Plake, 1989). Also, by excluding Category III (incorrect answers with marking) from the score for metacognition monitoring ability, MEF avoided a common problem of the traditional confidence rating technique: misinterpreting low-ability students as having high metacognition monitoring abilities. As Wang et al. (2004) indicated, CBT systems that collect and analyze student responses and answering processes can identify student learning outcomes and subject-matter misconceptions. Therefore, the AEs in the MEF were designed to incorporate the above information to provide useful adaptive feedback so that students could understand their performance, clarify their mistakes, and increase their learning motivation.
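The four-way MEF classification can be summarized in a few lines of code. The following Python sketch is a minimal illustration of the Category I-IV rule described above; the function and field names are hypothetical and are not taken from the authors' implementation.

# Minimal sketch of the MEF four-way classification (Category I-IV) based on
# marking and answer correctness; names here are illustrative only.

def mef_category(marked, correct):
    if correct:
        return "I (unsure-correct)" if marked else "II (expected-correct)"
    return "III (unsure-incorrect)" if marked else "IV (unexpected-incorrect)"

answers = [
    {"item": 2, "marked": True,  "correct": True},   # Category I
    {"item": 3, "marked": False, "correct": True},   # Category II
    {"item": 7, "marked": True,  "correct": False},  # Category III
    {"item": 1, "marked": False, "correct": False},  # Category IV
]
for a in answers:
    print(a["item"], mef_category(a["marked"], a["correct"]))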

Methodology

Participants

A total of 454 ninth graders participated in this experiment. All participants had received over five years of formal computer literacy instruction (more than 180 hours), which confirmed that they had the basic computer skills required to take a CBT.

One reason they volunteered to take part in the experiment was that they wanted to prepare for the English Basic Competence Test (EBCT), given by the Committee of the Basic Competence Test for Junior High School Students three months later. The participants were randomly assigned to Guu (four classes, 145 students), Gmu (four classes, 139 students), and Gmm (five classes, 170 students).

Three versions of the CBT system

Three versions of the CBT system, labeled Gmm, Gmu, and Guu, were designed based on two factors, marking and MEF. Gmm (see Figure 1) adopted both marking and MEF; Gmu (see Figure 2) adopted only marking, and Guu (see Figure 3) adopted neither. Figure 1 shows an example of MEF test results in Gmm: examinee X had twenty correct responses and ten incorrect responses, i.e., 67% of the responses by X were correct. The test results were then categorized as follows: the 2nd, 5th, 12th, 15th, 21st, and 25th test items were marked and correct (Category I); the 3rd, 6th, 8th, 9th, 11th, 14th, 17th, 18th, 22nd, 24th, 26th, 28th, 29th, and 30th test items were unmarked and correct (Category II); the 7th and 27th test items were marked and incorrect (Category III), and the 1st, 4th, 10th, 13th, 16th, 19th, 20th, and 23rd were unmarked and incorrect (Category IV). In contrast, Figure 2 shows an example of feedback for the test results of examinee X in Gmu. The displayed information was identical to that in Gmm but was not sorted into four categories. Figure 3 shows an example of feedback for the test results of examinee X in Guu. The displayed information was similar to that in Gmu, except that the summary did not include marking records.

All three versions of the CBT system recorded examinee scores, answer responses, time consumed, and review records for each action (an illustrative record structure is sketched after the list below). The examinee test results and responses were examined for the effects of marking on student scores and test-taking time, and of MEF on marking skills and review behavior. Briefly, the three versions of the CBT system were as follows:
1. Gmm: Examinees could place or remove a question mark on any item to indicate they were 'unsure'. The examinee responses, results, marks, and scores for each item were shown on the screen and sorted into four categories after the test was completed.
2. Gmu: The marking method was the same as that in Gmm, and so was the displayed information. However, the information was not sorted into four categories in this version.
3. Guu: Examinees could not mark any items. Except for marking, the displayed information was the same as that in Gmu.
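As a rough illustration of the per-item data that such a system might log, the following Python sketch shows one possible record structure. The paper does not specify the actual storage format, so all names and fields here are assumptions.

# Illustrative per-item log record for the three CBT versions
# (assumed structure; not the authors' actual data format).
from dataclasses import dataclass
from typing import Optional

@dataclass
class ItemLog:
    item_id: int
    chosen_option: int
    correct: bool
    marked: Optional[bool]      # None in Guu, where marking is disabled
    seconds_spent: float
    reviewed_ae: bool = False   # whether the answer-explanation was opened

# Gmm/Gmu record marking; Guu leaves it as None.
log_gmm = ItemLog(item_id=5, chosen_option=3, correct=True, marked=True, seconds_spent=42.0)
log_guu = ItemLog(item_id=5, chosen_option=3, correct=True, marked=None, seconds_spent=42.0)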

Figure 1. Example of MEF screen in Gmm


Figure 2. Example of feedback screen in Gmu

Figure 3. Example of feedback screen in Guu

Test items

The three versions of the CBT system were used with a test comprising thirty multiple-choice items selected from the vocabulary and reading comprehension sections of the EBCT in Taiwan. Because fewer than 500 participants were sampled, analyzing the parameters of the test items with the three-parameter model of item response theory was unsuitable (Hambleton & Swaminathan, 1985, p.227, p.308; Mason, Patry, & Bernstein, 2001). Therefore, classical test theory was used; the item difficulty index and item discrimination index of the test items were calculated. As Table 1 shows, both indices had means above .5, and all item discrimination indices were above .4. The reliability of internal consistency (KR-20) was .926. These figures indicated that the quality of the test items was acceptable (Ahmanan & Glock, 1981, p.163; Ebel & Frisbie, 1991, p.231-232; Noll, Scannell, & Craig, 1979, p.109). The following is an example item from the reading comprehension section:

In 1999, there were about 2,482 traffic accidents in Taiwan. Most of the accidents happened because _________1_________. For example, some drivers drove too fast. Some drivers drank too much wine or beer before they got into the car. And some drivers never tried to stop when the traffic lights went from yellow to red.

Most of the accidents happened because _________1_________.
(1) motorcyclists were riding too fast
(2) the MRT system was not built yet
(3) drivers didn't follow the traffic rules
(4) there were too many traffic lights on the road

Table 1. Statistical properties of the test items (number of items = 30, number of examinees = 454)

Parameter                     Mean    Std Dev    Minimum    Maximum
Item difficulty index         .56     .098       .37        .71
Item discrimination index     .70     .095       .44        .89
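For readers who wish to reproduce statistics of the kind reported in Table 1, the following Python sketch computes item difficulty, a discrimination index, and KR-20 from a 0/1 score matrix. The exact discrimination formula used by the authors is not stated, so the upper/lower 27% index assumed here may differ from theirs; the demo data are random and purely illustrative.

# Sketch of classical-test-theory statistics of the kind in Table 1
# (difficulty, discrimination, KR-20). The upper/lower 27% discrimination
# index is an assumption, not necessarily the authors' formula.
import numpy as np

def item_statistics(scores):                 # scores: (examinees x items) 0/1 matrix
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    difficulty = scores.mean(axis=0)         # proportion answering each item correctly

    totals = scores.sum(axis=1)
    order = np.argsort(totals)
    cut = max(1, int(round(0.27 * n)))       # upper/lower 27% groups by total score
    lower, upper = scores[order[:cut]], scores[order[-cut:]]
    discrimination = upper.mean(axis=0) - lower.mean(axis=0)

    # KR-20 internal-consistency reliability (sample variance of total scores)
    var_total = totals.var(ddof=1)
    kr20 = (k / (k - 1)) * (1 - (difficulty * (1 - difficulty)).sum() / var_total)
    return difficulty, discrimination, kr20

# Demo with fabricated 0/1 responses, for illustration only.
rng = np.random.default_rng(0)
demo = (rng.random((454, 30)) < 0.56).astype(int)
diff, disc, rel = item_statistics(demo)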

Figure 4. Testing procedure of CBT system

Procedure

Designing the three versions of the CBT system

As Figure 4 shows, the three versions of the CBT system first displayed instructions and demo clips page by page in the Instruction session. During the Practice session, examinees could answer the four sample items repeatedly to familiarize themselves with the CBT system interface. In the Test session, examinees could change their answer(s) by selecting items from a pop-up menu. The participants in Gmm and Gmu were also instructed to put a question mark next to items for which they were unsure of the answer; they could also remove question marks if they later felt certain that their answers were correct. The stopping conditions of the test were activated when the test time expired or when all the items had been completed and the 'Finish' button was clicked. The scores were calculated as soon as the test ended. In the Review session, the test results were shown and examinees could review the AEs, including supplementary materials related to major knowledge concepts written by several junior high school English teachers. Also, following suggestions from these teachers, the important concepts, words, and keys in the CBT system were highlighted in bright colors. Figure 5 illustrates an example of the AEs for test items in the Review session.

Figure 5. Screenshot of an AE for a test item

Tryout

Six ninth graders and three eighth graders from another school in the same district were recruited for system testing and tryout. The student with the lowest English ability (EA) in the tryout group was able to complete a thirty-item test in 30 minutes. Therefore, the test-taking time in the main study was set to 30 minutes to ensure that all participants could finish the test within the time limit. Problems such as unclear instructions, blurred pictures, and misspelled items were corrected after the tryout.

Main study

All participants were volunteers and were randomly assigned to the Gmm, Gmu, or Guu group. They were coached briefly on the testing procedure, the answering method, and how to obtain AE feedback in each CBT system before taking the test. Students in Gmm and Gmu were instructed to place a question mark next to any item when they were not sure of the answer.

The participants in Gmm were told that the Review session would display their test results and sorted AEs according to their marking records and answer responses, while those in Gmu and Guu were told that their test results and AEs given in the Review session would not be sorted. The participants took the CBT in the computer classroom at their school to control for the anxiety associated with testing in an unfamiliar environment. The three versions of the CBT system recorded the responses and the administration time. After the experiment, the participants received their test results and reviewed their record reports.

Results and Discussion

To investigate the effects of marking and MEF on examinees with different levels of English ability, the examinees in Gmm, Gmu, and Guu were further classified by their test scores. The top 25% of scorers were labeled the high English ability (H-EA) group; the bottom 25% were labeled the low English ability (L-EA) group, and the scorers from the 37.5th to the 62.5th percentile were labeled the middle English ability (M-EA) group. The sampling procedure skipped the scorers between the 25th and 37.5th percentiles and between the 62.5th and 75th percentiles (totaling 25% of the whole sample, i.e., 112 participants). This was done to reduce possible influence between two successive EA groups on test scores and review behavior. Restated, the number of sampled participants was 342, with 114 examinees each in H-EA, M-EA, and L-EA. For H-EA examinees, thirty-two, forty-four, and thirty-eight were in Gmm, Gmu, and Guu, respectively. For M-EA examinees, forty-four, thirty-two, and thirty-eight were in Gmm, Gmu, and Guu, respectively. For L-EA examinees, forty-five, thirty-nine, and thirty were in Gmm, Gmu, and Guu, respectively.

Compared with the total number of participants, the sample size in each subgroup was relatively small, which would decrease the statistical reliability of the analysis. However, the sample size in each subgroup was still more than 30, which satisfied the minimum-size criterion (15 subjects per subgroup) for an experimental study recommended by Gall, Borg, and Gall (1996, p.229). Therefore, the level of reliability for exploring the effects of marking and MEF on test scores and review behavior for examinees with different EA levels was acceptable.

Table 2 shows the descriptive statistics for test scores in each subgroup. The average test scores of H-EA examinees in the three groups were, from highest to lowest, Gmm, Guu, and Gmu. However, the average test scores of M-EA and L-EA examinees in the three groups were, from highest to lowest, Gmu, Gmm, and Guu. Additionally, the overall average test scores of the three groups were, from highest to lowest, Gmu, Guu, and Gmm.

Table 2. Descriptive statistics of test scores in each subgroup (N=342)

                  H-EA                       M-EA                       L-EA                       Total
Groups      n     Mean    Std Dev      n     Mean    Std Dev      n     Mean    Std Dev      n     Mean    Std Dev
Gmm         32    27.97   1.58         44    16.52   3.35         45    7.11    1.42         121   16.05   8.56
Gmu         44    27.16   4.14         32    17.25   5.13         39    7.62    3.99         115   17.77   9.40
Guu         38    27.53   1.59         38    14.95   3.42         30    6.27    1.39         106   17.00   8.95
Total       114   27.97   1.61         114   15.79   3.31         114   6.81    1.62         342   16.92   8.97
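The percentile-based grouping rule described at the start of this section can be sketched as follows. The cut-offs come from the text, but the implementation details (for example, the handling of tied scores and exact boundary cases) are assumptions.

# Sketch of the H-EA / M-EA / L-EA grouping rule (cut-offs from the text;
# tie-breaking and boundary handling are assumed).
import numpy as np

def ea_group(scores):
    """Return a label per examinee: 'H-EA', 'M-EA', 'L-EA', or None (skipped)."""
    scores = np.asarray(scores)
    pct = scores.argsort().argsort() / (len(scores) - 1)   # percentile rank in [0, 1]
    labels = []
    for p in pct:
        if p >= 0.75:
            labels.append("H-EA")            # top 25%
        elif 0.375 <= p < 0.625:
            labels.append("M-EA")            # middle band
        elif p < 0.25:
            labels.append("L-EA")            # bottom 25%
        else:
            labels.append(None)              # 25-37.5% and 62.5-75% bands are excluded
    return labels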

The following sections present the results and related discussion for the two research questions: the effect of marking on the performance of examinees at the three EA levels in Gmm, Gmu, and Guu, and the effects of MEF on their marking frequency and review behavior.

Effects of marking on examinee test scores

To investigate the effects of marking on examinee test scores, t-tests were conducted for the three EA levels. As Table 3 shows, marking significantly affected the test scores of M-EA examinees (t.05(112) = 2.4, p < .05), but it did not significantly affect the test scores of H-EA (p > .05) or L-EA (t.05(112) = 1.95, p > .05) examinees. The test results indicated that middle-EA examinees benefited significantly from the marking function, which suggested that CBT systems incorporating a marking function could improve the performance of average students. This finding was rather promising because classroom intervention was typically aimed at average-level students, since this target group had the most potential for improvement compared with their high-ability and low-ability counterparts. In the present case, the high-ability students were confident of their own answers or had already understood the important concepts prior to the test; therefore, marking skill was not an immediate need for them. Similarly, marking did not improve the performance of low-ability students, who lacked the knowledge needed to answer correctly even when they had good marking skills. However, marking might have encouraged average-ability students to seek clues among the test items and assisted them in answering correctly, which would thus have improved their performance. Therefore, marking should not be neglected in the design of CBT systems.

Table 3. Descriptive statistics of examinee test scores and results of three t-tests (N=342)

                                    With marking                 Without marking
Examinees' English ability      n      Mean    Std Dev       n      Mean    Std Dev       df
H-EA                            76     27.50   3.32          38     27.53   1.59          112
M-EA                            76     16.83   4.18          38     14.95   3.42          112
L-EA                            84     7.35    2.90          30     6.27    1.39          112
* p < .05
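A comparison of the kind summarized in Table 3 can be reproduced, for one ability level, with an independent-samples t-test such as the Python sketch below; the scores shown are placeholders, not the study's raw data.

# Sketch of the Table 3 comparison for one ability level: an independent-samples
# t-test of scores with marking (Gmm + Gmu) vs. without marking (Guu).
# The data below are placeholders for illustration only.
from scipy import stats

scores_with_marking = [18, 15, 21, 14, 17, 19, 16, 20]       # hypothetical scores
scores_without_marking = [14, 13, 16, 15, 12, 17, 13, 15]

t, p = stats.ttest_ind(scores_with_marking, scores_without_marking)
print(f"t = {t:.2f}, p = {p:.3f}")   # in the study, only the M-EA comparison reached p < .05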