Running head: RELATION BETWEEN POSTDICTION CONFIDENCE SCORES
Technische Universität München
TUM School of Education
Master of Education Research on Teaching and Learning
The Relation Between Foreign Language Students and Teachers’ Postdiction Confidence Scores as a Measure of Metacognitive Monitoring
Author:
Rivera, Dennis A.
Supervisor:
Prof. Dr. Maria Bannert
Advisor:
Dipl. Psych. Elisabeth Pieger
Submission Date:
17.01.2018
Declaration of Authorship
I confirm that this master's thesis is my own work and I have documented all sources and material used.
This thesis was not previously presented to another examination board and has not been published.
_________________
____________________
Place and date
Signature
Acknowledgements
I would first like to thank my thesis advisor Dipl. Psych. Elisabeth Pieger of the School of Education at the Technische Universität München. The door of the Chair of Teaching and Learning with Digital Media was always open whenever I needed advice or had a question about any part of my research or writing. Dipl. Psych. Pieger consistently supported and guided the realisation of this paper, steering me in the right direction through opportune feedback. I would also like to thank Dipl.-Ing. Agr. Denise Lichtig and Ms. Christina Thunstedt, heads of the TUM Sprachenzentrum, together with all the teachers and students who were involved in this research project. Without their passionate participation and input, this study could not have been successfully conducted. Finally, I must express my profound gratitude to my parents, teachers, classmates, and to my husband, for providing me with unfailing support and continuous encouragement throughout my years of study and through the process of researching and writing this thesis. This accomplishment would not have been possible without them. Thank you.
Author Dennis A. Rivera
Table of Contents
Declaration of Authorship .......... ii
Acknowledgements .......... iii
Table of Contents .......... iv
List of Tables .......... vi
List of Figures .......... vii
Appendices .......... viii
Abstract .......... 2
1. Introduction .......... 3
2. Theoretical Background .......... 8
   2.1. Metacognitive Judgements .......... 8
   2.2. Students' Judgements .......... 11
   2.3. Teachers' Judgements .......... 16
   2.4. Confidence Scores .......... 21
3. Research Question .......... 23
   3.1. Hypothesis .......... 23
4. Methods .......... 25
   4.1. Sample .......... 25
   4.2. Measure Instruments .......... 28
      4.2.1. Teachers' Language Exam .......... 28
      4.2.2. Teachers' Ethnographic Survey .......... 30
      4.2.3. Students' Ethnographic Survey .......... 31
      4.2.4. Confidence Score Paper .......... 32
   4.3. Design and Procedure .......... 34
5. Results .......... 38
   5.1. Absolute Accuracy .......... 40
   5.2. Exam Difficulty .......... 42
   5.3. Calibration Curves .......... 44
   5.4. Relative Accuracy .......... 47
   5.5. Correlations .......... 49
6. Discussion .......... 51
   6.1. Absolute Accuracy .......... 51
   6.2. Relative Accuracy .......... 54
   6.3. Factors Correlated to Postdiction Accuracy .......... 56
      6.3.1. Teachers' Factors .......... 57
      6.3.2. Students' Factors .......... 58
      6.3.3. Exam Factors .......... 60
7. Limitations .......... 63
8. Conclusion .......... 66
References
Appendices
List of Tables
Table 1. Schraw (2009) types of metacognitive judgement outcome measures, together with the constructs they measure and their score interpretation .......... 9
Table 2. Foreign language students' characteristics .......... 17
Table 3. Mean, minimum, and maximum values of the foreign language teachers' characteristics .......... 19
Table 4. Teachers' and students' performance confidence scores (postdictions), together with the worth values of the questions that tested the listening, the reading, and the writing skills .......... 25
Table 5. Descriptive statistics of foreign language teachers' and students' postdictions, together with the actual exam scores .......... 26
Table 6. Absolute accuracy of foreign language teachers' and students' postdictions .......... 27
Table 7. Descriptive statistics for absolute accuracy of foreign language teachers' and students' postdictions .......... 28
Table 8. Students' actual score and normalised score values .......... 28
Table 9. Descriptive statistics of the difficulty of the exam and the questions that tested the listening, the reading, and the writing skills .......... 29
Table 10. Gamma correlations of foreign language teachers' and students' postdictions .......... 32
Table 11. Spearman correlations of foreign language teachers' and students' postdictions .......... 33
Table 12. Pearson's r and Spearman's rs correlations between teacher, student, and exam characteristics and postdiction judgements .......... 35
List of Figures
Figure 1. Nelson and Narens (1990) main stages in the theoretical memory framework .......... 6
Figure 2. Südkamp, Kaiser, and Möller (2012) model of teacher-based judgements of students' academic achievement .......... 14
Figure 3. Calibration curve for teachers' postdictions on the overall exam .......... 30
Figure 4. Calibration curve for students' postdictions on the overall exam .......... 30
Figure 5. Calibration curve for teachers' postdictions on the exam skills .......... 31
Figure 6. Calibration curve for students' postdictions on the exam skills .......... 31
Figure 7. A model of the factors that affect foreign language teachers' judgement accuracy of students' academic achievement .......... 43
Appendices
Appendix 1. Teacher Survey (English Version) .......... 60
Appendix 2. Teacher Survey (German Version) .......... 63
Appendix 3. Teacher Survey (Spanish Version) .......... 64
Appendix 4. Teacher Survey (French Version) .......... 65
Appendix 5. Student Survey .......... 61
Appendix 6. Master research proposal to the authorities of the TUM Sprachenzentrum (Language Center) .......... 56
Appendix 7. Master research proposal to the heads of the different language departments of the TUM Sprachenzentrum (Language Center) .......... 58
Appendix 8. Students' Consent Form .......... 59
The Relation Between Foreign Language Students and Teachers’ Postdiction Confidence Scores as a Measure of Metacognitive Monitoring
Abstract

The accuracy of teachers’ and students’ metacognitive judgements is of considerable importance because of the impact that these judgements have on students’ learning experiences and educational trajectories. This paper investigated the metacognitive accuracy of teachers’ and students’ postdiction judgements through confidence scores to determine the degree to which they were accurate and to pinpoint the factors that might affect accuracy in language learning. The study included foreign language teachers (N = 17) and higher education students (N = 135) from the Language Centre of the Technische Universität München (TUM). The results obtained from calibration data showed that students’ postdiction judgements were better calibrated than teachers’ postdiction judgements, both on the overall exam and on the language skills tested (listening, reading, and writing). Resolution data revealed significant correlations for the listening, reading, and writing skills tested in the exam. An analysis of teacher, student, and exam characteristics helped to identify several factors that might affect judgement accuracy in language learning. This study contributes to our knowledge of metacognitive judgement accuracy in the area of language learning.
Keywords: Metacognition, Confidence Scores, Foreign Language Acquisition, Retrospective Confidence Judgements, Teacher Judgements, Metacognitive Monitoring
Chapter 1 Introduction
Metacognition, defined as “thoughts about one’s own thoughts and cognitions” (Flavell, 1979, p. 906), involves our ability to think about our own thoughts and understand the limits of our memory. We use this ability to analyse our own memory introspectively and to make decisions concerning our lives (Dunlosky & Metcalfe, 2009). This “awareness and management of one’s own thoughts” (Kuhn & Dean, 2004) comprises three facets, namely metacognitive knowledge, metacognitive monitoring, and metacognitive control (Dunlosky & Metcalfe, 2009). These facets work together to give us greater control over cognitive processes, especially those engaged in learning. These processes can, for example, assist us in planning how to approach certain tasks, monitoring how well we have understood them, and evaluating our progress towards accomplishing those tasks (Borkowski, Carr, & Pressley, 1987). This ability to self-monitor our actions has been extensively investigated because of its significant implications for how people learn, use their memory to retrieve information, and determine the accuracy of that information (Lai, 2011). Classically, the core of these investigations involved three main metacognitive judgements: Feeling-of-Knowing judgements (FOKs), Tip-of-the-Tongue judgements (TOTs), and Judgements of Learning (JOLs). Later, other metacognitive judgements were classified into different categories (confidence, source, and recognition judgements) depending on their function in the learning process (Dunlosky & Metcalfe, 2009). These judgements became an important focus of metamemory research due to their reflective quality (Dunlosky & Lipko, 2007; Metcalfe & Kornell, 2005; Nelson & Narens, 1990). The results from these investigations have not only given us a wide corpus of knowledge on how these metacognitive judgements and learning
control processes work but have also shown us the effects that these judgements and processes can have on learning. Those results suggest that “appraising the products and regulatory processes of one’s learning”, in other words evaluating metacognition, is done through metacognitive judgements (Schraw, Crippen, & Hartley, 2006). Metacognitive judgements have been classified, according to the Theoretical Memory Framework provided by Nelson and Narens (1990), depending on the stage of learning in which they take place. Thus, we have Ease-of-Learning judgements (EOLs) and Judgements of Learning (JOLs) in the acquisition stage, Feeling-of-Knowing judgements (FOKs) and Source-Monitoring judgements in the retention stage, and Retrospective Confidence judgements (RCs) in the retrieval stage. The nature of these judgements can be prospective (to predict future performance) or retrospective (to judge the accuracy of past responses). These metacognitive judgements, either prospective (predictions) or retrospective (postdictions), have been studied under two main views: the direct-access view and the cue-utilization view (Koriat, 2007). Under the first view, Hart (1965) argues that we can, directly or partially, survey the contents of our memory and determine whether we know something or not. One merit of the direct-access view is that it seems to capture metacognitive feelings, such as the Tip-of-the-Tongue state (TOT). Nonetheless, under this view, if people were in direct contact with their memories, their introspections should be inherently accurate (Koriat, 2007). This is, however, not always the case. Under the second view, researchers believe that we use information-based and experience-based metacognitive judgements as cues and heuristics to access our knowledge (Benjamin & Bjork, 1996).
While information-based judgements rely on one’s a priori preconceptions and knowledge retrieved from memory, experience-based judgements rely on sheer subjective experience (mnemonic cues). These experience-based judgements are affected by the quality of such cues and by the way in which information is accessed (Koriat,
2007; Nelson & Narens, 1990). Under this view, the accuracy of metacognitive judgements depends on the validity of such cues. In both views, within an educational context, metacognitive judgements are important because they help students regulate their learning at an object level (cognition) and at a meta level (metacognition) (Nelson & Narens, 1990). At the object level, these judgements assist us in processing information better (determining how well we have learnt something). At the meta level, on the other hand, some researchers believe that metacognitive judgements monitor the course and success of the object level (Metcalfe & Kornell, 2005). Using these judgements, students survey their memory to reach decisions: to stop studying, to change the focus of their study, or to continue studying to achieve personal learning goals (Ariel, Dunlosky, & Bailey, 2009; Metcalfe & Kornell, 2005). Previous research has shown that inaccurate metacognitive judgements, both at the object and at the meta level, can give students a false sense of confidence in the things they think they know (Dunlosky & Metcalfe, 2009). This false sense of confidence can cause them to stop studying when they have not yet learnt the things they need to know (achieved an adequate level of understanding). It can also lead students to focus their study on things they already know or to believe that an answer is correct when it is not. This inaccuracy in self-monitoring one’s memory can make students neglect their learning and affect their outcomes and career paths (Nelson & Dunlosky, 1991). Metacognitive judgements and their accuracy are not only important for students but also for teachers, since they can help them develop their diagnostic competence. This diagnostic competence is the ability that allows teachers to judge students’ performance level and to estimate the difficulty of tasks and materials correctly (McElvany et al., 2012).
Teachers’ diagnostic competence also works at the object and at the meta level and allows them to promote content learning, choose appropriate classroom activities, define the difficulty of
tasks, solve problems, and assess students’ academic achievement (Alvidrez & Weinstein, 1999; McElvany et al., 2012; Ohle & McElvany, 2015; Voss, Kunter, & Baumert, 2011). Teachers’ judgements (TJs) are at the core of teachers’ diagnostic competence and play a central role in helping teachers assess the academic skills of their students. Teachers use these judgements to make educational decisions that affect students’ learning (Martinez, Stecher, & Borko, 2009; McNair, 1978; Sharpley & Edgar, 1986). For example, teachers can decide to group students based on their skill level, refer them to other classes (remediation or acceleration), and design and implement a curriculum (Sharpley & Edgar, 1986). Similarly to students’ inaccurate metacognitive judgements, erroneous TJs can lead to an incorrect assessment of students’ skills and can have a negative impact on the students’ learning experiences and educational trajectories. For example, teachers could refer students to groups in which they either do not learn well or encounter tasks that are too easy or too difficult for them; this, far from contributing to their learning, undermines the students’ learning confidence. Thus, successful learning depends on the ability of students and teachers to accurately monitor and control learning to reach educational outcomes. Although there is research exploring how metacognitive judgement accuracy affects performance (Koriat, Sheffer, & Ma'ayan, 2002; Kornell & Metcalfe, 2006; Meeter & Nelson, 2003), it has mainly investigated core curriculum subjects such as mathematics, language arts, and reading (Südkamp, Kaiser, & Möller, 2012). Foreign language is a subject in which retrospective metacognitive confidence judgements have rarely been investigated (Leucht, Triffin-Richards, Vock, Pant, & Koeller, 2012; Mingjing & Detlef, 2015).
Furthermore, heterogeneous results from previous research urge us to pinpoint the sources of discrepancy in these judgements in order to improve foreign language education and students’ learning outcomes. Under the cue-utilization view (Koriat, 2007), the current study aims to discover the degree to which the metacognitive postdiction judgements of foreign language teachers and
students offer a similar picture of students’ learning when compared with each other and with actual performance on test scores. Likewise, this study aims to identify the possible factors that might explain the variation in the relationship between these measures.
Chapter 2 Theoretical Background
2.1 Metacognitive Judgements

Metacognitive judgements comprise assessments people make about how well they have learnt something, how likely they are to remember something, and how sure they are that their memories are correct (Dunlosky & Metcalfe, 2009; Nelson & Narens, 1990; Son & Metcalfe, 2005). These assessments about one’s learning and memories are critical not only for double-checking the information available in our heads but also for regulating the effectiveness of a study strategy (Frank & Kuhlmann, 2016). The study of these judgements has attempted to answer five main questions about how people learn. First, Koriat and Levy-Sadot (1999) investigated the basis of metacognitive judgements to understand how people know that they know something. Then, Schwartz and Metcalfe (1994) focused on how accurate metacognitive judgements were and on the factors affecting such accuracy. Benjamin and Bjork (1996) assessed the processes underlying the accuracy and inaccuracy of metacognitive judgements (in particular those that led to the illusion of knowing) to determine what such processes consisted of. Son and Metcalfe (2000) sought to explain how all these monitoring judgements affected control processes and learning behaviour. Finally, Metcalfe and Kornell (2005) explored how metacognitive monitoring and control affected memory performance. Nelson and Narens (1990) provided a Theoretical Memory Framework, shown in Figure 1, that illustrates the relationships between these judgements and metacognitive processes at an object level and at a meta level.
Figure 1. Nelson and Narens (1990) Main stages in the Theoretical Memory Framework.
From judging the likelihood of recalling a correct response (Judgements of Learning) to judging one’s confidence in the correctness of a response (Retrospective Confidence judgements), metacognitive judgements work together to help us make all kinds of decisions in our everyday lives. These judgements and processes are in constant communication with one another and, in education, can help students and teachers monitor and control learning (Kornell & Metcalfe, 2006; Thiede & Dunlosky, 2015). Metacognitive judgements allow teachers to develop the necessary competences to analyse the effectiveness of their teaching based on students’ classroom progress (Mingjing & Detlef, 2015). Amongst other educational activities, these judgements assist teachers in choosing appropriate classroom activities, selecting engaging learning material, and setting the right level of difficulty for their tasks (Alvidrez & Weinstein, 1999; McElvany et al., 2012; Ohle & McElvany, 2015; Voss et al., 2011). Metacognitive judgements also support students in monitoring and controlling their learning; for example, students can assess their prior knowledge and make predictions on what
they might be able to remember later (future knowledge). Based on these assessments of prior and future knowledge, students make decisions to spend more time studying, to change a study technique that does not seem to improve their learning, or to stop studying because they believe they have reached their desired learning goals. Because of the impact that these decisions have on students’ learning, it is important that their metacognitive judgements be accurate (Dunlosky & Metcalfe, 2009; Lichtenstein, Fischhoff, & Phillips, 1982). Nonetheless, the communication between these judgements can sometimes be unclear because, under many circumstances, metacognitive judgements have been found to be biased and inaccurate (Dunlosky & Lipko, 2007). The inaccuracy of metacognitive judgements can result in bad learning habits and can have negative effects on an individual’s self-concept (Nelson & Dunlosky, 1991). This inaccuracy and bias in metacognitive judgements can be attributed to several internal and external causes. Amongst these causes are the study choices and learning strategies that people use (Thiede, Anderson, & Therriault, 2003), time pressure (Metcalfe, 2002), and the methods that researchers use for eliciting the judgements (Dunlosky & Lipko, 2007). This inaccuracy and bias can lead people to produce poor confidence-accuracy (C-A) correlations and miscalibrations (under- or over-confident judgements) (Fleming & Lau, 2014). Such miscalibrations affect students’ learning to different degrees during their formative and summative assessments. During formative assessments (conducted while the learning process takes place), poor C-A correlations and miscalibrations can give students a false sense of learning. This can lead them to disregard their assignments; for example, neglecting their homework or not preparing well enough for presentations.
During summative assessments (conducted at the end of the learning process), these miscalibrations can give students a high degree of confidence in the correctness of a response that is, in fact, incorrect. Students’
miscalibrations also have an effect on their teachers’ competence to diagnose learning. They can mislead teachers into believing that their students have learnt new concepts when they actually have not. As a result, teachers may not implement activities that could reinforce students’ learning. Consequently, this can lead teachers to assess students’ learning using tasks that have the wrong level of difficulty for the students. Hence, this study aims to analyse the confidence-accuracy (C-A) relationship of foreign language teachers’ and students’ judgements because of the negative effects that inaccurate and biased metacognitive judgements can have on students’ formative and summative assessment. The focus of this study is on the C-A relationship on a summative assessment instrument (a final language test) because of its influence and significance on the students’ learning progress and future linguistic, educational, and professional trajectories. Since retrospective assessments help teachers and students gather the information they need to make decisions concerning learning, this study focuses on postdictions to assess teachers’ and students’ confidence with respect to the difficulty level of a testing instrument. This study also analyses several causes (outside and within the individuals) that have been found to affect the C-A relationship in order to shed some light on the possible sources of judgement miscalibration that might explain any variance found.
2.2 Students’ Judgements

As we can see in the model provided by Nelson and Narens (1990) in Figure 1, predictive metacognitive judgements, such as Ease-of-Learning (EOLs) and Feeling-of-Knowing judgements (FOKs), help students monitor how much they have learnt and estimate how likely they are to remember an answer. On the other hand, postdictive judgements, such as Confidence judgements, Remember-Know judgements, and Source-Monitoring judgements, give students a degree of certainty about the accuracy of their answers (Ariel et al., 2009;
Metcalfe & Kornell, 2005). These postdiction judgements also help students double-check their answers and judge the quality of their responses before making further decisions (Dunlosky & Metcalfe, 2009). Because of the importance of accurately judging the quality of their responses, this study focuses on students’ Retrospective Confidence (RC) judgements. Students’ RC judgements can only be made after a recall or recognition task has been completed; hence their retrospective nature. In education, they refer to judgements made about events that took place in the past and help students determine whether they have given an adequate response to a question. Students make RC judgements to assess how certain they are that their responses are correct; these are the kinds of judgements that students make when answering a cognition question either in a class exercise or in an exam. These judgements can be made both at an item-by-item level and at a global level (all the items as a whole) in a test (Schraw, 2009). Due to the nature of its design, this study collects students’ RC judgements at a global level. In 2009, Schraw described two common approaches to measuring RC judgements: confidence ranges and dichotomous predictions. In the first, an RC judgement ranging from no confidence to absolute confidence is made, normally on a continuous or ordinal confidence scale from 0 to 100. The second yields dichotomous data that predict whether performance will be successful (pass or fail). Non-parametric data analysis techniques, such as the gamma correlation (Goodman & Kruskal, 1954), are appropriate for this kind of judgement measure. This study analyses RC judgements through confidence ranges because this scale is directly comparable to the students’ performance on their language exams.
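To make the dichotomous-prediction measure mentioned above concrete, the following sketch (purely illustrative; it is not part of the study’s materials, and the function and data names are invented) shows how Goodman and Kruskal’s (1954) gamma can be computed from item-by-item confidence ratings and scored answers, as the difference between the numbers of concordant and discordant item pairs divided by their sum:

```python
from itertools import combinations

def goodman_kruskal_gamma(confidence, correct):
    """Goodman & Kruskal's (1954) gamma: (C - D) / (C + D), where C and D
    are the numbers of concordant and discordant item pairs; tied pairs
    are ignored."""
    concordant = discordant = 0
    for (c1, k1), (c2, k2) in combinations(zip(confidence, correct), 2):
        if c1 == c2 or k1 == k2:
            continue  # tied pairs contribute to neither count
        if (c1 - c2) * (k1 - k2) > 0:
            concordant += 1
        else:
            discordant += 1
    if concordant + discordant == 0:
        return float("nan")  # undefined when every pair is tied
    return (concordant - discordant) / (concordant + discordant)

# Invented example: confidence ratings (0-100) and scored answers (1 = correct)
conf = [90, 70, 40, 20]
acc = [1, 1, 0, 0]
print(goodman_kruskal_gamma(conf, acc))  # prints 1.0 (perfect resolution)
```

Gamma ranges from -1 to +1; positive values indicate that higher confidence tends to accompany correct answers, which is why it is read as a measure of relative accuracy (resolution).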
As with many other metacognitive judgements, research has found that the confidence-accuracy (C-A) correlation of students’ RC judgements is not always perfect (Dunlosky & Metcalfe, 2009; Loftus & Ketcham, 1991). The source of inaccuracy of these judgements is
related to an overconfidence effect defined by Lichtenstein et al. (1982) as the Hard-Easy effect. This effect takes place when the calibration of students’ RC judgements is higher or lower than the students’ actual performance; that is, when students believe their answers are correct but they are not, and vice versa (Merkle, 2009). Researchers have found four factors that explain this effect in probability judgements, namely biased judgement, biased choice of stimulus material (ecological model), lack of suitable adjustment of response criteria, and random error judgement (Suantak, Bolger, & Ferrell, 1996). This study includes an analysis of calibration curves to show any possible confidence inaccuracy problems due to the Hard-Easy effect. To calibrate the accuracy of RC judgements, researchers have used at least five different types of performance confidence measures to evaluate their goodness of fit (Allwood, Jonsson, & Granhag, 2005; Nelson, 1996; Schraw, 2009). These types of metacognitive judgement outcome measures include the absolute accuracy index (Yates, 1990), correlation coefficients (Nelson, 1996), the bias index, the scatter index (Yates, 1988), and the discrimination index (Allwood et al., 2005). These outcome measures, together with the constructs they measure and their score interpretation, can be found in Table 1.
Table 1 Schraw (2009): types of metacognitive judgement outcome measures, together with the constructs they measure and their score interpretation.

Construct measured   Outcome measure            Score interpretation
Absolute accuracy    Absolute accuracy index    Discrepancy between a confidence judgement and performance. Measures judgement precision.
Relative accuracy    Correlation coefficient    Relationship between a set of confidence judgements and performance scores. Measures correspondence between judgements and performance.
Bias                 Bias index                 The degree of over- or under-confidence in judgements. Measures direction of judgement error.
Scatter              Scatter index              The degree to which an individual’s judgements for correct and incorrect responses differ in terms of variability. Measures difference in variability of confidence judgements for correct and incorrect items.
Discrimination       Discrimination index       Ability to discriminate between correct and incorrect outcomes. Measures discrimination between confidence for correct and incorrect items.
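To make two of the measures in Table 1 concrete, the absolute accuracy index is commonly computed as the mean squared deviation, and the bias index as the mean signed deviation, between confidence and performance on a common 0-1 scale (Schraw, 2009). A minimal sketch with hypothetical values:

```python
def absolute_accuracy(confidence, performance):
    """Absolute accuracy index: mean squared deviation between confidence
    judgements and performance (both on a 0-1 scale). 0 = perfect precision."""
    n = len(confidence)
    return sum((c - p) ** 2 for c, p in zip(confidence, performance)) / n

def bias_index(confidence, performance):
    """Bias index: mean signed deviation. Positive values indicate
    overconfidence, negative values underconfidence."""
    n = len(confidence)
    return sum(c - p for c, p in zip(confidence, performance)) / n

conf = [0.9, 0.8, 0.6]   # hypothetical confidence judgements
perf = [1.0, 0.5, 0.5]   # corresponding performance scores
print(round(absolute_accuracy(conf, perf), 4))  # 0.0367
print(round(bias_index(conf, perf), 4))         # 0.1 -> slight overconfidence
```

Note that the two indices answer different questions: the squared deviations measure precision regardless of direction, while the signed deviations cancel out unless the errors lean systematically one way.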
Several previous research papers have analysed the C-A relationship using only one type of measure. This has given us partial information about these judgements and the factors affecting them. This paper not only seeks to analyse the discrepancy between students’ RC judgements and their actual performance on a summative testing instrument (a foreign language exam), but also to estimate the degree to which judgements discriminate between differing levels of performance; in other words, this paper analyses both absolute and relative accuracy. Furthermore, different measures are used to corroborate C-A findings.

Calibration, or absolute accuracy, assesses the precision of a confidence judgement compared to actual performance (Maki, Shields, Wheeler, & Zacchilli, 2005). This measure provides an estimate of overall memory retrieval, reflects the extent to which metacognitive judgements are realistic and match performance exactly, and can disclose over- and under-confidence bias (inflated or deflated confidence relative to actual performance) (Lichtenstein et al., 1982). Resolution, or relative accuracy, on the other hand, indicates a person’s ability to differentiate items that are known (better learnt) from items that are unknown (lesser learnt) (Maki et al., 2005). It assesses the relationship between a confidence judgement and performance, measured by the within-subject correlation of performance and postdictions (Maki et al., 2005; Nelson, 1984). Measures of relative accuracy focus on the consistency of RC judgements relative to a set of performance outcomes rather than on their precision on an item-by-item basis. In a testing situation, for example, if a student judges that he would answer 80% of the questions correctly and the actual test results yield 80% correct answers, we can assume “perfect calibration accuracy” even when a close examination of the items reveals poor resolution (high confidence on items answered incorrectly). To assess this relationship, relative accuracy has normally been measured using a correlation coefficient (the gamma correlation in the case of non-parametric dichotomous data, and Pearson’s r in the case of linearly related variables). This study also uses a correlation coefficient to analyse the consistency of students’ RC judgements.

Both types of accuracy are important for students’ self-regulated learning; however, research on metacognitive judgement accuracy has mostly concentrated on calibration (Lichtenstein et al., 1982; Nelson & Dunlosky, 1991). Yet there is scarce research examining both calibration (absolute accuracy) and resolution (relative accuracy) for a typical classroom exam in the field of language learning. For that reason, this study intends to fill this research gap by analysing both absolute and relative accuracy, together with calibration curves to identify Hard-Easy effect problems in students’ RC judgements, thus painting a better picture of the C-A postdiction relationship of foreign language students and teachers.
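The distinction between calibration and resolution drawn above can be sketched numerically. In the hypothetical example below, a student's global judgement matches the test score exactly (perfect calibration at the global level), yet the one failed item carries the highest confidence (poor resolution); the numbers are illustrative only:

```python
item_conf = [0.9, 0.6, 0.7, 0.8, 0.95]   # hypothetical per-item confidence ratings
item_correct = [1, 1, 1, 1, 0]           # 4 of 5 items correct = 80%
global_judgement = 0.80                  # student's global postdiction

# Calibration at the global level: discrepancy of 0 ("perfect calibration")
score = sum(item_correct) / len(item_correct)
print(abs(global_judgement - score))     # 0.0

# Resolution: the incorrect item attracts MORE confidence than the correct ones
mean_conf_correct = (
    sum(c for c, k in zip(item_conf, item_correct) if k) / sum(item_correct)
)
mean_conf_wrong = item_conf[item_correct.index(0)]
print(mean_conf_wrong > mean_conf_correct)  # True -> poor resolution
```

This is why the study reports both kinds of accuracy: a global discrepancy of zero says nothing about whether confidence tracks correctness item by item.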
According to the ecological approach in the study of confidence (Gigerenzer, Hoffrage, & Kleinbolting, 1991), the source of inaccuracy of students’ RC judgements can be related to, amongst other factors, the difficulty level of the tasks that teachers set (Black & Wiliam, 1998; Rodriguez, 2004). A biased selection of questions, chosen by teachers to increase test difficulty, can lead to the Hard-Easy effect; this, consequently, can sway students to miscalibrate their judgements (Gigerenzer et al., 1991; Lichtenstein et al., 1982). Although some calibration studies have shown a positive relation between calibration accuracy and students’ actual achievement (Hacker, Bol, Horgan, & Rakow, 2000), other research results propose that, for students to avoid biased judgements and correctly calibrate their RC judgements, the exam difficulty needs to be appropriate (Schraw, 2009). Hence, accurate teacher judgements are critical when setting the difficulty of testing instruments.
2.3 Teachers’ Judgements

Teachers’ Judgements (TJs) refer to teachers’ ability to revisit their own thoughts and analyse the effectiveness of their instructional performance based on students’ declarative information (Mingjing & Detlef, 2015). They are the judgements that teachers make about their students in order to choose classroom activities, select appropriate learning material, and set the difficulty level of their tasks (Alvidrez & Weinstein, 1999; McElvany et al., 2012; Ohle & McElvany, 2015; Voss et al., 2011).

Due to the complex nature of education, teachers make different types of judgements at different stages of the teaching and learning process (Shavelson & Borko, 1979). Thus we have, according to the different stages of a lesson, pre-instructional judgements, interactive judgements, and post-instructional judgements, each of which is affected by different factors. McNair (1978), for example, found that interactive TJs are affected by immediate student responses to the teaching. Pre- and post-instructional judgements, on the other hand, are often affected by available documentation from the students (e.g. performance on worksheets and tests) and, in some cases, from the parents (Coladarci, 1986).

The TJs that teachers make to set the level of difficulty of classroom tasks and exam questions, namely Judgements of Difficulty (JOD), happen only after teachers have gathered information from the students and evaluated students’ knowledge through oral questions or written work. In other words, they are post-instructional judgements because they take place at the last stage of a lesson. In this sense, they are postdiction judgements because they are affected by the students’ academic achievement, either at the individual or at the class level (Shavelson & Stern, 1981). This study refers to TJs as these kinds of post-instructional judgements.

Teachers’ judgements have been shown to have a high degree of accuracy and validity because of the multidimensional understanding teachers gain about students’ performance from their interactions in the classroom (Hoge & Coladarci, 1989; Hopkins, George, & Williams, 1985; Meisels, Bickel, Nicholson, Xue, & Atkins-Burnett, 2001; Perry & Meisels, 1996). Nonetheless, empirical evidence also shows that TJs are not always accurate and can be biased depending on several factors: teachers’ personal characteristics (age, experience, and qualifications), their own beliefs about education, and available information about their students as well as the teaching and learning environment (Coladarci, 1986; Shavelson & Stern, 1981).

One factor that can bias TJs is students’ ethnicity and background (Darling-Hammond, 1995). The assumption researchers make here is that the generally low achievement of ethnic minorities could negatively affect teachers’ expectations. This can lead teachers to provide different treatment, such as fewer classroom participation opportunities and less feedback; such treatment only favours students from ethnic majorities (Kaiser, Südkamp, & Möller, 2016; Tenenbaum & Ruck, 2007). Since this study is conducted in Germany, a country with a long immigration history, it also takes the students’ ethnic background into account to analyse how it affects foreign language teachers’ postdictions.

Teachers’ experience is another factor that can contribute to the inaccuracy of TJs.
Some researchers have convincingly shown that teachers with five or more years of experience tend to be more sensitive and flexible towards their students’ learning contexts when compared to their novice counterparts (Berliner, 2004; Johansson, Strietholt, Rosén, & Myrberg, 2014). These research results also state that teachers’ experience produces no difference in students’ achievement once teachers have five or more years of experience (Darling-Hammond, 2000). The teachers’ years of experience, along with their teaching qualifications, are also presented and analysed in the present research study.

Another important source of inaccuracy in TJs can be traced to teachers’ poor awareness of the motivational characteristics of their students. Teachers’ inability to detect a decrease in their students’ motivation (based on the rate and quality of classroom participation) can lead to interaction bias (Kaiser, Retelsdorf, Südkamp, & Moeller, 2013). This is particularly important in language learning since motivation is highly correlated with students’ academic achievement (Ellis, 2008; Gardner, 2010). Studies examining TJ accuracy of student achievement and students’ self-reported motivation have shown moderate correlations (Gagne & St. Pere, 2001; Swanson, 1985), suggesting an increase in students’ motivation caused by the teachers’ assessment of their academic performance.

Other factors that may affect TJ accuracy are students’ learning disability (Martínez & Mastergeorge, 2002), gender (Tiedemann, 2002), and classroom conduct (Bennett, Gottesman, Rock, & Cerullo, 1993). Studies on gender have found that male students exhibit “salient” behaviours that help TJ accuracy (Funder, 2012; Tiedemann, 2002). Classroom conduct (also referred to as student performance) has also been found to moderate TJs (Bennett et al., 1993; Mingjing & Detlef, 2015).
Begeny, Krouse, Brown, and Mann (2011) corroborated previous studies showing that teachers are better at judging high-performing students because the interaction frequency between teachers and these students is higher. This interaction bias has been found to reduce TJ accuracy for low-performing students. However, divergent results in other studies have not been able to show the extent to which learning disability, gender, and classroom conduct affect TJs (and thereby their accuracy), or whether they affect TJs at all (Kaiser et al., 2016). This study presents the gender of the students, along with several factors that aim to provide a raw picture of the students’ classroom conduct. These factors include the students’ effort, motivation, and self-confidence.

Research results obtained by Hoge and Coladarci (1989) show, for example, that TJs and empirically tested student achievement had a median correlation of .66 (range between r = .28 and r = .92). Südkamp et al. (2012) corroborated these results and found a mean effect size of Zr = .63 (range between r = -.03 and r = .84) for the relationship between TJs and students’ achievement. Concerning the level of task difficulty, reported findings show that teacher judgement correlations vary between mean r = .33 and mean r = .56, both for under- and overestimating task difficulty (Anders, Kunter, Brunner, Krauss, & Baumert, 2010; McElvany et al., 2009).

To avoid such variability in C-A correlations and obtain valid results, Südkamp et al. (2012) provided a model, based on theoretical considerations and empirical findings, which identifies various aspects affecting the accuracy of TJs. Future researchers need to take the aspects presented in this model into account when examining TJ accuracy in order to pinpoint the source of inaccuracy. This model is shown in Figure 2.
Figure 2. A model of teacher-based judgements of students’ academic achievement (Südkamp, Kaiser, & Möller, 2012).
This model shows that TJ accuracy can be affected not only by teachers’ perceptions of students’ classroom performance, student characteristics, and the classroom environment (Leinhardt, 1983; Llosa, 2008; Rodriguez, 2004), but also by test and judgement characteristics (Südkamp et al., 2012). In other words, in addition to the teachers’ years of experience and level of education, researchers should also look at the classroom environment, the type of test taken (norm-referenced or peer-independent), and the type of judgement made (either a prediction or a postdiction) to avoid inflated correlations. Several previous studies have not, however, taken these factors into consideration when measuring TJ accuracy, thus calling the accuracy of their results into question (Südkamp et al., 2012).

Furthermore, two other important limitations of previous studies are that 1) they have reported the correspondence between TJs and students’ abilities mostly in a predictive context (Coladarci, 1986; Dusek & O'Connell, 1973) and 2) TJs have been expressed only as rankings or general ratings of students’ performance (Hoge & Butcher, 1984; Luce & Hoge, 1978; Oliver & Arnold, 1978). This has not only left us with scarce data concerning postdiction judgements but has also produced inflated correlations when comparing TJs to norm-referenced measures of academic achievement (Hoge & Butcher, 1984; Hoge & Coladarci, 1989) and when using correlational analysis rather than percentage agreement methods to compare judgements (Feinberg & Shapiro, 2003).

Since teachers need to accurately identify pupils’ knowledge in order to tailor their teaching methods to their students’ needs, give adequate feedback, and create tasks with an appropriate level of difficulty (Black & Wiliam, 1998; Hattie & Timperley, 2007), it is vital to analyse TJ accuracy to estimate how precise it is and how its precision helps students correctly calibrate their RC judgements. Such an analysis should take Südkamp et al.’s (2012) model into consideration to avoid high variability of correlations and obtain valid results. This study seeks to obtain judgement accuracy data by analysing how TJs correlate with students’ RC judgements and with students’ actual academic achievement using Südkamp et al.’s (2012) model. This analysis is conducted in the area of foreign language acquisition because this subject has rarely been investigated in previous research studies (Leucht et al., 2012; Mingjing & Detlef, 2015).
2.4 Confidence Scores

To obtain judgement accuracy data, previous studies have used the Certainty of Response Index (CRI) to measure confidence (Hasan, Bagayoko, & Kelley, 1999). This method has been able to distinguish between a lack of knowledge and the presence of misconceptions in learning; it does not, however, provide any data on calibration or resolution (Saglam, 2015). To tackle this problem, Caleon and Subramanian (2010) used confidence ratings (1 for “just guessing”, 2 for “very unconfident”, 3 for “unconfident”, 4 for “confident”, 5 for “very confident”, and 6 for “absolutely confident”) to calculate mean confidence accuracy and confidence bias. Although this proved to be a good technique for obtaining calibration and resolution, the ratings were restricted by each student’s personal understanding of the rating labels (Fritzsche, Kröner, & Dres, 2012).
Crawford and Stankov (1996) showed that confidence can be more reliably measured in typical test-taking situations through answer ratings (expressed as percentages) given immediately after responding to a test question. These answer ratings for all test questions can be averaged to give an overall confidence score (Harvey, 1997). Thus, the C-A relationship can be reliably measured both absolutely and relatively by comparing confidence judgements to general performance scores and to per-question performance scores, using either numerical or categorical data (Allwood et al., 2005). Furthermore, to obtain reliable data, Dunlosky and Bjork (2008) suggested identifying 1) the kind of judgement being made (informed or uninformed), 2) the level at which performance is judged (item-by-item or global), 3) the time when the judgement is made (prediction or postdiction), and 4) the way in which the difference between the judgement and the actual performance is calculated (absolute accuracy or relative accuracy).

Because of the importance of monitoring learning accurately, this paper examines foreign language teachers’ and students’ informed postdictions at a global level to obtain reliable confidence measures of both absolute and relative accuracy. To this end, confidence scores (expressed as percentages) are collected immediately after a language exam is finished. The aim of this study is to discover the degree to which these postdiction judgements offer a similar picture of students’ academic achievement. The research question is presented in Chapter 3.
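The averaging step described by Harvey (1997) is straightforward; a minimal sketch with hypothetical ratings:

```python
# Per-question answer ratings (percentages) given immediately after each answer;
# their mean gives the overall confidence score that is compared to performance.
question_ratings = [70, 85, 60, 90]  # hypothetical ratings for a four-item test
overall_confidence = sum(question_ratings) / len(question_ratings)
print(overall_confidence)  # 76.25
```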
Chapter 3 Research Question

Do Performance Confidence Scores of foreign language teachers and higher education students from the TUM Language Centre correlate with each other and with the students’ actual academic achievement?

To answer this question, it is central to include a systematic comparison of accuracy measures. This study looks at the accuracy of teachers’ and students’ postdiction judgements (students’ RC judgements and TJs) using three different procedures:

1. The absolute accuracy index, to measure absolute accuracy.
2. Calibration curves, to identify the Hard-Easy effect.
3. A correlation coefficient, to measure relative accuracy.

These analyses aim to provide a richer picture of the postdiction judgements’ accuracy, together with their correlation, direction, and magnitude. To pinpoint possible contributing factors to the inaccuracy of TJs and RCs, this study uses correlations between postdiction judgements and teacher, student, and exam characteristics. This study also reports the difficulty level of the overall exam and of the individual exam questions, as this can also contribute to students’ RC judgement inaccuracy.
3.1 Hypothesis

Despite the inconsistent conclusions in previous studies, it is hypothesised in this study that teachers who hold relevant teaching qualifications and have more than five years of teaching experience will have more accurate postdiction scores (Darling-Hammond, 2000). Research findings from Fritzsche et al. (2012) and from Juslin and Olsson (1997) lead to the hypothesis that students will also have accurate postdiction scores.
Absolute accuracy for teachers will be established by the discrepancy between teachers’ postdiction judgements (TJs) and students’ actual performance scores. Similarly, absolute accuracy for students will be established by the discrepancy between students’ postdiction judgements (RC judgements) and their actual performance scores. Relative accuracy will be established by the correlation between teachers’ postdictions and students’ actual performance scores, and by the correlation between students’ postdictions and students’ actual performance scores (Hoge & Coladarci, 1989).

Teachers’ use of the target language in class is also analysed in this study to examine its effect on judgement accuracy; since, to my knowledge, there are no previous studies analysing this variable, a two-tailed hypothesis is adopted here.
Chapter 4 Method
4.1 Sample

This study included ethnically diverse higher education students (N=135; 60.7% male, 39.3% female) with a mean age of 21.95 years (SD=2.51) from a wide variety of degree programmes and levels of education (68.7% bachelor, 30.6% master, 0.7% PhD) at the Technische Universität München (TUM). Amongst the ethnicities found in the sample, there were Germanic students (80.6%), English students (3.7%), Latin students (4.5%), Asian students (6%), and Arabic students (3.7%). Students with Hindu and Slavic backgrounds represented 0.7% of the sample each. These ethnicities do not refer to single countries; rather, they represent a background based on the students’ mother language. Thus, for example, Germanic students comprise students from Germany, Austria, and the Scandinavian countries (Norway, Sweden, Iceland, and Denmark), except for Finland because Finnish has a Uralic origin. Similarly, Latin students in this study refers to students whose mother language was any of the Neo-Latin (Romance) languages, namely Spanish, French, Portuguese, Italian, Romanian, Catalan, etc.

All students were, on average, in the fourth semester of their studies (SD=2.48) and spoke a mean of 2.63 languages (SD=.69). All students spoke English either as a native or a foreign language. A measure of their previous knowledge revealed that students knew about 12.15% (SD=21.42) of the foreign language they studied before they started their classes. Similarly, a measure of their general self-confidence showed that 65.4% of them believed they were good at the language they were learning, whereas 34.6% believed the opposite. Data concerning the students’ learning effort, their focus of study, their motivation to take up the course, and their self-confidence (both positive and negative) is presented in Table 2.
Table 2 Foreign language students’ characteristics

Effort            Few         Some        Most
  Homework        16.3%       41.5%       42.2%
  Attendance      5.9%        25.2%       68.9%
Focus             Vocabulary  Grammar     Both
                  38.8%       47.4%       13.8%
Motivation        Internal    External    Both
                  34.1%       28.1%       37.8%
Self-Confidence   Good        Bad
                  65.4%       34.6%
Confidence        Listening   Reading     Writing
  Positive        8.9%        34.8%       2.2%
  Negative        30.1%       1.5%        30.1%
Homework here refers to the amount of homework the students did during the time the teaching took place. It was coded as 1 = little homework done, 2 = some homework done, and 3 = most homework done. Attendance refers to the number of classes attended during the time the teaching took place. It was coded as 1 = few classes attended, 2 = some classes attended, and 3 = most classes attended. Focus refers to what the students mostly focused their study on. It was coded as 1 = focus on vocabulary, 2 = focus on grammar (grammatical rules and sentence structure), and 3 = focus on both. Motivation refers to the type of motivation that moved students to take up the foreign language. It was coded as 1 = internal motivation, 2 = external motivation, and 3 = both.
Self-Confidence refers to the personal belief the students had about their learning (whether they thought they were good or bad at learning the language in general). It was coded as 1 = they believed they were good at learning the language and 2 = they believed they were bad at learning the language. Finally, positive and negative confidence refer to the specific beliefs the students had about their language skills (whether they thought they were good or bad at the input skills (listening and reading) and the output skill (writing)).

This study also included foreign language teachers (N=17; 23.5% male, 76.5% female). The teachers were on average 45.82 years old (SD=10.39). The languages they taught included German, Spanish, French, Portuguese, Swedish, Norwegian, Italian, Chinese, and Japanese. All teachers were native speakers and were considered experienced since they all had more than five years of teaching experience (M=15.76, SD=8.13). All teachers had teaching qualifications (a teaching degree) either as their main or secondary academic background; that is to say, some teachers had degrees in, for example, law or journalism and then pursued a career in teaching, acquiring the necessary qualifications. These qualifications were awarded either by a university or by a language teacher training institute (e.g. the Goethe-Institut for German teachers, the Instituto Cervantes for Spanish teachers, etc.). The teachers’ level of education comprised 41.2% with a bachelor degree, 52.9% with a master degree, and 5.9% with a PhD. A measure of the amount of the target language the teachers used in class showed that they spoke on average 60.42% (SD=24.16) of the time in the language the students were learning. Table 3 presents the teacher characteristics analysed in this study with their minimum and maximum ranges.
Table 3 Mean, minimum, and maximum values of the foreign language teachers’ characteristics

Teachers’ Characteristics   Mean           Min        Max
Age                         45.82 years    31 years   67 years
Experience                  15.71 years    6 years    36 years
Language Use                60.42%         25%        90%
All teachers and students were selected from the Language Centre at the Technische Universität München (TUM Sprachenzentrum) on a voluntary participation basis. To establish fairness and make the data comparable, the data was collected solely from students at an A1 level of linguistic proficiency (A1, A1.1, and A1.2) in each of the aforementioned languages, according to the Common European Framework of Reference for Languages (CEFR).
4.2 Measurement Instruments

4.2.1 Teachers’ Language Exam. All language exams were achievement tests that aimed to measure how much of the content of the language course had been learnt. The exams were based on the contents of the course book used during the lessons, and the scoring was mostly objective since a criterion established how each question should be marked (full points, half points, or no points) without requiring the judgement of the scorer. All language exams had a duration of 90 minutes. 58.8% of the teachers had already used their exam in previous testing situations with one or more groups; however, all teachers had changed the activities in their exams to adjust them to the needs of their current students. This change in the activities was also made to prevent students at higher linguistic levels (A2, B1, etc.) from sharing information about the questions and tasks in the exam. All exams took place on one of the three campuses of the TUM (Main Campus, Garching, and Weihenstephan) at different times of the day (35.3% in the morning, 52.9% in the afternoon, and 11.8% in the evening). This covered most of the places where the TUM Language Centre offers classes.

Demaray and Elliott (1998) showed that TJ accuracy is higher when teachers are well-informed about the way student achievement is measured; that is, when teachers know the exam that the students are going to sit. Bates and Nettelbeck (2001) corroborated these findings, stating that teachers tend to overestimate students’ achievement on standardised exams they are not familiar with. To avoid this overestimation and increase TJ accuracy, this study used the language exams that the teachers had created themselves using their diagnostic competence. This competence enabled them to analyse what had been taught during the learning process and to appropriately tailor the difficulty level of the questions and tasks in their exams (Black & Wiliam, 1998; Hattie & Timperley, 2007). On the one hand, this approach allowed teachers to make informed judgements, which benefited the C-A relationship because of the substantial influence that such judgements can have on the size of the correlation between TJs and students’ academic achievement (Südkamp et al., 2012). On the other hand, this approach eliminated an overconfidence effect caused by biased item sampling, that is, the tendency of researchers to include several deceptive items in an exam (Gigerenzer et al., 1991).

In their language exams, teachers were free to include the number of questions and tasks they deemed pertinent to evaluate students’ language knowledge. The teachers measured students’ language knowledge in three different linguistic skills, namely listening, reading, and writing.
Grammar and vocabulary knowledge were also tested in the exams but, due to the design of this study, no postdiction data was gathered on them. Because of time limitations, class logistics, and convention, speaking was not included in any exam and remained a skill analysed only in class through direct questioning and oral presentations.

The listening comprehension skill was assessed through questions that asked the students to choose a correct response or fill in gaps after a short listening input; the assessment of this skill lasted on average 10 minutes, and the listening input was played twice. Reading comprehension was tested through true-false questions or through questions requesting information about a short text. Grammar and vocabulary were examined through tasks such as matching, filling in gaps, choosing the correct answer, and reordering sentences. Finally, the writing skill was measured through tasks that asked the students to fill out a form or write a short paragraph. The minimum number of words to be written in the paragraphs varied from language to language; however, no teacher required fewer than 30 words or more than 150 words to fulfil the task. Several of the questions and tasks found in the teachers’ language exams resembled those approved by the Association of Language Testers in Europe (ALTE).
4.2.2 Teachers’ Ethnographic Survey. According to Südkamp et al.’s model of teacher-based judgements of students’ academic achievement, teacher characteristics (such as teaching qualifications and experience) are important factors affecting TJ accuracy. Some studies have corroborated these findings (Darling-Hammond, 2000; Johansson et al., 2014), whereas other studies have presented contradicting results (Demaray & Elliott, 1998). To test either of these findings in the field of language learning, a short survey was provided to the teachers to record the teaching qualifications they held and their years of teaching experience (see Appendix 1). Other data, such as the teachers’ age, gender, language proficiency, and the amount of target language they used in class, were also collected to study how these characteristics might be related to TJ accuracy, since little information has been reported on them in previous studies.

This teachers’ ethnographic survey was translated into German for most teachers (teachers of German, Swedish, Norwegian, Chinese, Japanese, Italian, and Portuguese) (Appendix 2). It was also translated into Spanish for all Spanish teachers (Appendix 3) and into French for the French teacher who participated in the study (Appendix 4). The translations remained true to the linguistic meaning, although some previously given information was omitted. For example, the heads of the Spanish and French departments confirmed that all teachers in their departments were native speakers; for this reason, the question asking whether the teacher was a native speaker was omitted from the Spanish and French surveys. All translations were made by the researcher, but a language teacher of each language who did not take part in the study was asked to correct the translations. Furthermore, a pilot study was conducted with a native speaker of every language (Spanish, French, and German) to assess linguistic understanding and clarity.
4.2.3 Students’ Ethnographic Survey. Students were also provided with an ethnographic survey that gathered information concerning their age, sex, and ethnicity (see Appendix 5). The students’ level of studies (bachelor, master, or PhD) and field of studies (architecture, engineering, etc.) were also collected in this survey. Finally, data pertaining to the students’ motivation to take the course (internal, external, or both), their focus of study (vocabulary, grammar, or both), their prior knowledge of the language, their beliefs about their linguistic abilities (positive and negative self-concept), and their personal effort in the course (the total number of classes attended and the amount of homework done) were also collected. These data were intended to corroborate previous findings and to determine whether such characteristics affect students’ RC.
The students’ ethnographic survey was handed to all participants in English regardless of their nationality or native language. The rationale for this was that all national and international students at the Technische Universität München (TUM), according to university regulations, must hold a B2 or C1 level of English proficiency (according to the CEFR) before being accepted into a bachelor, master, or PhD study programme at the TUM, even when their studies are conducted mainly in German. A pilot study was also conducted with both national (German citizens) and international students (from Colombia, the U.S., and India) to ensure the linguistic understanding and clarity of the survey.
4.2.4 Confidence Scores Paper. To measure TJs and students’ RC judgements, foreign language teachers and students were given a confidence scores paper where they could record a confidence rating (expressed as a percentage) for the exam in general and for the questions that tested the three skills that form the focus of this study (listening, reading, and writing). This confidence scores paper was included in the teachers’ and students’ ethnographic surveys (see Appendices 1, 2, 3, 4, and 5). The confidence scores paper asked teachers to make postdiction judgements of students’ overall exam scores by assessing how many students would answer the exam questions correctly (How difficult do you think that the exam is overall?). It was explained to the teachers that if they believed the exam was 30% difficult, they would expect 70% of their students to pass it. The teachers were also asked to judge the difficulty of every linguistic skill in their exams (How difficult do you think that the listening question(s) is? How difficult do you think that the reading question(s) is? How difficult do you think that the writing question(s) is?). This approach aimed to shed light on the difficulty level of the overall exam and of the skill questions, as perceived by the teacher, which could help to control for the Hard-Easy Effect.
Although it would have been ideal for the confidence scores to capture TJs and students’ RC judgements for every question, the different formats that teachers used to tailor their exams made this approach impossible. Some teachers, for example, included only one question to assess the listening, reading, or writing skills, whereas other teachers included sub-questions or question formats that others did not. This created considerable confusion when all teachers had to fill out a single confidence scores paper that did not fully relate to the format of their exams. To capture TJs and students’ RCs for every question, this research would have needed a personalised confidence scores paper for every language exam. This was logistically impossible due to the limited access that the researcher had to the exam format of every teacher. For that reason, confidence scores were gathered for each skill as a whole, regardless of the number of questions that the teachers used to measure the skill. In this sense, the TJs and students’ RC judgements given for the language skills were global postdictions, since they could involve one or more questions. Students were also asked in their confidence scores paper to judge the difficulty of the overall exam and the difficulty of every linguistic skill in their exams. Students were instructed, as in previous research studies, to write a confidence score ranging from 0 to 100 in the format of a percentage to express their postdictions (Stankov & Crawford, 1997). Should the exam include more than one question for a skill, students were instructed to give an overall confidence score; that is, if the listening section of the exam had two questions and the students were 100% confident of a correct answer on the first question but only 50% confident on the second, they could write a global confidence score of 75% for both questions in the listening section. The same was true for the other two skills tested in the exam.
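The averaging rule described above can be sketched as follows. This is an illustrative sketch only (the ratings were collected on paper in the study), and the function and variable names are hypothetical:

```python
def global_confidence(question_confidences):
    """Average per-question confidence ratings (0-100) into one
    global confidence score for a whole skill section."""
    return sum(question_confidences) / len(question_confidences)

# The worked example from the text: 100% confident on the first
# listening question, 50% confident on the second.
print(global_confidence([100, 50]))  # 75.0
```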
If they could not retrieve an answer to a question, they were instructed to write a 0% degree of confidence. This fine-grained and economical technique for obtaining confidence ratings of the overall exam and of each skill question in a cognitive exam was selected because it taps both
cognitive ability and metacognition (Schraw, 2009), helping researchers understand students’ and teachers’ degree of confidence in an accurate manner. Although it could have been better to obtain a confidence score for every question, this approach can only be implemented when exams are standardised and not tailored by every teacher.
4.3 Design and Procedure
After establishing the goals of the study, its viability, and the benefits for the participants (both students and teachers), a research proposal to conduct the study was presented to the TUM Language Centre (see Appendix 6). Interested in the proposal, the director and vice-director of the TUM Language Centre held a meeting with the researcher to learn more about the importance of foreign language TJ accuracy. As a result of this meeting, they authorised the study and supported the research to the best of their abilities. After this, the heads of the language departments at the TUM Language Centre received an explanation of the master’s research study (see Appendix 7). This document was also translated into three languages, namely Spanish, German, and French, to reach a greater number of teachers and to minimise language barriers (see Appendices 8, 9, and 10 respectively). After the heads of the language departments had read this explanation of the research study, they were asked to pass the information on to the teachers in their language departments to motivate them to participate in the study. Once the language teachers of the A1 level of linguistic proficiency had received and read the explanation of the study, those who expressed their willingness to participate were approached personally and given a detailed explanation of the study to clarify any questions. After they had understood the research goal and their personal and professional benefit from the results, they were told to continue their teaching throughout the semester as they had previously planned, and that their help would be required only after they had tailored their
language exam to examine their students’ linguistic competence and knowledge at the end of the semester. All foreign language teachers taught their lessons during the summer term 2017, according to the TUM calendar. Classes started on April 24th and extended until July 28th; in a few cases, they extended until the first week of August. During their classes, the teachers collected information about the students’ academic performance through classroom activities and homework. Amongst these activities, students used worksheets, prepared oral presentations, completed workbook and online assignments, and wrote short texts aimed at enhancing vocabulary and grammar and at fostering language learning. During the semester, teachers gave both general and individual feedback to their students by correcting their mistakes and checking for understanding. The levels of feedback given are not analysed in this report; however, a total of 5 language classes (Italian, French, German, Chinese, and Norwegian) were observed to enrich the study with insight into the teaching styles used. To avoid the presence of the Hawthorne effect, the researcher acted as a regular student and did not tell the teachers that they were being observed. The teachers only became aware of this observation at the end of the semester, when they agreed to its use in the study. A detailed account of what was observed in these classes and how it relates to metacognitive accuracy can be found in the discussion section of this paper. At the end of the semester, students and teachers were asked to fill out the ethnographic surveys and the confidence scores papers. Teachers had the chance to fill out their surveys once they had prepared their exams; most of them, however, did so on the very day of the exam. Students, on the other hand, only had the opportunity to fill out their surveys on the day of the final exam, after having finished it.
The goal of the study was explained to all students a week before the study took place, and they all signed a consent form in which they voluntarily agreed to participate (see Appendix 8). A maximum of 10 students per class were allowed to
participate in the study in order not to overwhelm the teachers with the extra work that participation in this study represented, that is, marking the exams and sending the actual exam score data to the researcher for statistical analysis. The participation criterion for the students was the voluntary participation of those who had finished the exam before their testing time was over. All students had 90 minutes to finish their exam, but those who finished early and wanted to participate in the study were given the survey. Although several more students expressed their desire to participate in the study, time constraints did not allow their participation because, once the 90 minutes granted to complete the exam were over, the classrooms had to be cleared for other teachers and students to use, and there were no classrooms available for students to complete the surveys. The implications of this sample selection are discussed in the discussion section of this paper. The students gave a confidence score for their language exam at a global level, as well as a global skill confidence score for the three skills tested (listening, reading, and writing), immediately after having answered all the questions on their language exam. Students could choose the percentage that best represented their perceived confidence in the answered questions. This percentage ranged from completely unconfident in the correctness of their answers (labelled 0% accuracy) to totally confident in the correctness of their answers (labelled 100% accuracy). These confidence scores were afterwards transformed into exam scores using the rule of three so that judgement accuracy could be calculated by comparing the confidence scores to the actual exam scores. Students wrote a code both on their surveys and on the top right corner of their exams so that their participation in the study remained anonymous.
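The rule-of-three transformation mentioned above (mapping a 0–100 confidence rating onto a question’s points scale) can be sketched as follows; the function name and the example values are illustrative, not taken from the study’s materials:

```python
def confidence_to_points(confidence_pct, worth_value):
    """Rule of three: a confidence of p% on a section worth w points
    corresponds to p * w / 100 expected points."""
    return confidence_pct * worth_value / 100

# A 75% confidence rating on a writing section worth 28 points
# corresponds to 21 expected points.
print(confidence_to_points(75, 28))  # 21.0
```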
The students could keep their exam while filling out the surveys, so they had a moment to think about the answers they had given and thus provide a more assured confidence score. After the students had finished filling out the
survey, the researcher collected the surveys and the teachers collected the language exams. After having corrected and marked the exams, the teachers sent an email to the researcher containing the students’ actual exam scores, together with the code that each student had created. The researcher then used this code to pair it with the code written on the students’ surveys and thus identify which confidence scores belonged to which exam scores. This approach allowed the statistical analysis of students’ data in an anonymous way.
Chapter 5 Results
All exams had a maximum worth value of 100 points (except for one exam, whose maximum worth value was 85 points; in this exam, the remaining points to reach 100 were accredited to homework). The worth values for the listening, reading, and writing skills, however, varied from exam to exam because every teacher assigned different worth values to these skills according to the complexity of the questions and tasks. In some exams, for example, there was only one listening question and it had a worth value of 5 points. In other exams, the listening question consisted of two or more inputs and/or tasks and had a worth value of 24 points. The same is true for the reading and writing skills tested. Table 4 shows the teachers’ and students’ postdictions, together with the weight each one had on the exam. For example, 14/20 means that the postdiction was 14 points and the total worth value of the question was 20 points.
Table 4
Teachers’ and students’ performance confidence scores (postdictions), together with the worth values of the questions that tested the listening, the reading, and the writing skills.

          Teachers’ postdiction scores              Students’ postdiction scores
          Exam    Listening  Reading  Writing       Exam    Listening  Reading  Writing
Class1    80      14/20      18/20    18/20         82      16/20      16/20    16/20
Class2    75      *          *        *             82      *          *        *
Class3    75      13/14      6/8      5/10          71      13/14      5/8      7/10
Class4    55      7/11       5/8      14/28         73      10/11      7/8      20/28
Class5    70      12/16      13/17    20/30         77      13/16      13/17    21/30
Class6    85      9/15       8/14     13/16         80      13/15      8/14     12/16
Class7    70      5/5        3/4      11/12         75      5/5        3/4      8/12
Class8    70      5/5        3/5      3/5           74      5/5        4/5      4/5
Class9    85      9/17       12/12    13/16         75      9/17       11/12    12/16
Class10   60/85   *          7/12     44/73         69/85   *          10/12    61/73
Class11   70      9/10       10/11    13/19         85      9/10       11/11    13/19
Class12   55      9/10       9/10     20/24         66      7/10       9/10     15/24
Class13   75      4/6        8/10     20/25         78      5/6        8/10     21/25
Class14   60      12/24      13/25    18/26         77      19/24      20/25    18/26
Class15   70      3/4        *        24/39         74      4/4        *        26/39
Class16   50      8/15       11/15    15/30         61      11/15      10/15    15/30
Class17   60      16/20      19/25    10/20         81      16/20      23/25    17/20

Note. * This data was not available (it was either not given by the teacher or not tested in the language exam).
Means and standard deviations of these values cannot all be directly computed because one teacher’s exam was not worth 100 points and several teachers assigned different worth values to the language skills. To make the data comparable for further statistical analysis, all postdictions were used in their original form (a percentage score between 0 and 100) and the actual exam scores were normalised to match this format. Having both the postdictions and the exam scores as percentage ratings allows a varied analysis of the data and a comparison of their magnitudes. Table 5 presents the means and standard deviations of the teachers’ and students’ postdictions, together with the normalised actual exam scores.
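The normalisation described above reduces to a proportional rescaling; the following is an illustrative sketch, not the study’s actual analysis script:

```python
def normalise(score, worth_value):
    """Rescale a raw score to a 0-100 scale so that exams and skills
    with different worth values become comparable."""
    return score * 100 / worth_value

# The one exam worth 85 points: a raw score of 79 points
# normalises to about 92.9 on the 100-point scale.
print(round(normalise(79, 85), 1))  # 92.9
```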
Table 5
Descriptive statistics of foreign language teachers’ and students’ postdictions, together with the actual exam scores.

                          Mean     SD
Teachers’ postdictions
  Exam                    69.15    10.20
  Listening               74.63    17.05
  Reading                 74.45    14.30
  Writing                 68.35    14.69
Students’ postdictions
  Exam                    76.01    6.10
  Listening               84.03    12.75
  Reading                 80.15    11.62
  Writing                 72.34    9.13
Exam scores
  Exam                    85.23    4.73
  Listening               93.99    6.61
  Reading                 93.19    7.56
  Writing                 85.22    8.34
To answer the research question in this paper and find out whether performance confidence scores of foreign language teachers and higher education students correlate with each other and with the actual academic achievement, measures for absolute and relative accuracy (including calibration curves) were computed using the data in Tables 4 and 5.
5.1 Absolute Accuracy
In accordance with the confidence paradigm, an absolute accuracy score for foreign language teachers was calculated by subtracting the actual exam scores from the TJs. Similarly, the students’ absolute accuracy was calculated by subtracting, from the students’ RC judgements,
their actual exam scores. Here, the value 0 represents perfect accuracy, and values above and below 0 represent over- and under-confidence respectively. These over- or under-confidence values, nonetheless, do not by themselves indicate statistical significance. The results of these subtractions are presented in Table 6.
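The absolute accuracy computation described above is a signed difference; a minimal sketch with illustrative names:

```python
def absolute_accuracy(judgement, actual):
    """Postdiction minus actual score: 0 means perfect calibration,
    positive values overconfidence, negative values underconfidence."""
    return judgement - actual

# Class 1 in Tables 4 and 8: the teacher postdicted 80 while the
# students' actual mean exam score was 85, giving -5 (underconfidence).
print(absolute_accuracy(80, 85))  # -5
```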
Table 6
Absolute accuracy of foreign language teachers’ and students’ postdictions.

          Teachers’ postdictions                  Students’ postdictions
          Exam   Listening  Reading  Writing      Exam   Listening  Reading  Writing
Class1    -5     -3         -2       -1           -3     -1         -4       -1
Class2    -15    *          *        *            -8     *          *        *
Class3    -8     0          -1       -2           -12    0          -2       0
Class4    -28    -4         -3       -9           -10    -1         -1       -3
Class5    -20    -3         -3       -5           -13    -2         -3       -4
Class6    -2     -5         -3       -2           -7     -1         -3       -3
Class7    -8     0          -1       1            -3     0          -1       -2
Class8    -6     0          -2       -2           -2     0          -1       -1
Class9    -2     -6         1        -2           -12    -6         0        -3
Class10   -19    *          -3       -25          -10    *          0        -8
Class11   -16    -1         -1       -3           -1     -1         0        -3
Class12   -34    -1         -1       -1           -23    -3         -1       -6
Class13   -4     -1         -2       1            -1     0          -2       2
Class14   -22    -8         -8       -2           -5     -1         -1       -2
Class15   -18    -1         *        -11          -14    0          *        -9
Class16   -33    -7         -2       -7           -22    -4         -3       -7
Class17   -30    -2         -4       -8           -9     -2         0        -1

Note. * This data was not available (it was either not given by the teacher or not tested in the language exam).
To summarise these data, the overall means and standard deviations of the absolute accuracy values were calculated. The descriptive statistics for the absolute accuracy of teachers’ and students’ postdictions are presented in Table 7.
Table 7
Descriptive statistics for absolute accuracy of foreign language teachers’ and students’ postdictions.

             Teachers                Students
             M         SD           M        SD
Exam         -15.88    10.91        -9.12    6.59
Listening    -2.80     2.65         -1.47    1.73
Reading      -2.33     1.99         -1.47    1.30
Writing      -4.75     6.52         -3.19    2.99
5.2 Exam Difficulty
Since the difficulty level of the tasks that teachers set can contribute to the variability in the accuracy of students’ RC judgements (Black & Wiliam, 1998; Rodriguez, 2004), the present study analysed the difficulty level of the overall exam and of the questions testing the listening, reading, and writing skills. For this analysis, the normalised students’ actual exam results were used. As all results were normalised to 100 points, 100 points (which now represented the maximum worth value of any question in the exam) were then subtracted from them. For example, if a skill question was worth 20 points and the student got 17 points (17/20), this gave 85/100 points after normalisation and meant that the question was 15% difficult. A normalised value of 100 means that the student got the answer correct and obtained all the possible points that the
question granted. The students’ actual score values, together with the normalised score values, can be found in Table 8.
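The difficulty computation described above (normalise to 100 points, then subtract the 100-point maximum) can be sketched as follows; the function name is illustrative:

```python
def difficulty(score, worth_value):
    """Normalise a score to 100 points and subtract 100: 0 means all
    points were obtained; more negative values mean a harder question."""
    return score * 100 / worth_value - 100

# The worked example from the text: 17 of 20 points normalises to 85,
# so the question was 15% difficult.
print(difficulty(17, 20))  # -15.0
```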
Table 8
Students’ actual score and normalised score values.

          Actual exam scores                      Normalised exam scores
          Exam   Listening  Reading  Writing      Exam   Listening  Reading  Writing
Class1    85     17         20       17           85     85         100      85
Class2    90     *          *        *            90     *          *        *
Class3    83     13         7        7            83     92         87       70
Class4    83     11         8        23           83     100        100      82
Class5    90     15         16       25           90     93         94       83
Class6    87     14         11       15           87     93         78       93
Class7    78     5          4        10           78     100        100      83
Class8    76     5          5        5            76     100        100      100
Class9    87     15         11       15           87     88         91       93
Class10   79     *          10       69           92     *          83       94
Class11   86     10         11       16           86     100        100      84
Class12   89     10         10       21           89     100        100      87
Class13   79     5          10       19           79     83         100      76
Class14   82     20         21       20           82     83         84       76
Class15   88     4          *        35           88     100        *        89
Class16   83     15         13       22           83     100        86       73
Class17   90     18         23       18           90     90         92       90

Note. * This data was not available (it was either not given by the teacher or not tested in the language exam).
Table 9 shows the means and standard deviations of the subtractions of the total worth values from the students’ normalised exam results, for the overall exam and for the listening, reading,
and writing skills. The value 0 here represents a perfect match between the abilities of the students and the difficulty of the exam and its questions or tasks. This could also mean, however, that the exam and its questions were not difficult at all. Values below 0 represent an increasing level of difficulty: the further the values are from 0, the more difficult a question appears to have been. Nevertheless, these values could also indicate that the students did not practise or study enough. Thus, this provides a rough picture of the difficulty that the teachers set for the overall exam and for the skills tested in the exam.
Table 9
Descriptive statistics of the difficulty of the exam and of the questions that tested the listening, reading, and writing skills.

             Mean      SD
Exam         -14.77    4.73
Listening    -6.01     6.61
Reading      -6.81     7.56
Writing      -14.78    8.34
5.3 Calibration Curves To identify the Hard-Easy Effect on teachers’ and students’ postdictions, calibration curves were plotted using the postdiction judgements (represented on the x-axis) and the actual exam scores (represented on the y-axis). Figures 3 and 4 show the calibration curves for teachers’ and students’ postdictions on the overall exam respectively.
Figure 3. Calibration curve for teachers’ postdictions on the overall exam (teachers’ postdiction judgements on the x-axis; actual exam scores on the y-axis).
Figure 4. Calibration curve for students’ postdictions on the overall exam (students’ postdiction judgements on the x-axis; actual exam scores on the y-axis).
Similarly, to identify the Hard-Easy Effect on teachers’ and students’ postdictions on specific language skills, calibration curves were plotted for the listening, reading, and writing skills, with the postdiction judgements on the x-axis and the actual exam scores on the y-axis. Figures 5 and 6 show these calibration curves.
Figure 5. Calibration curve for teachers’ postdictions on the exam skills (teachers’ postdiction judgements on the x-axis; actual exam scores on the y-axis).
Figure 6. Calibration curve for students’ postdictions on the exam skills (students’ postdiction judgements on the x-axis; actual exam scores on the y-axis).
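The data preparation behind such calibration curves amounts to pairing each postdiction with its actual score and comparing the points to the identity line y = x. A minimal sketch (illustrative only; the thesis does not state which software produced the figures):

```python
def calibration_points(judgements, actuals):
    """Sort (judgement, actual) pairs by judgement for plotting a
    calibration curve; the identity line y = x marks perfect
    calibration, and points above it indicate underconfidence."""
    return sorted(zip(judgements, actuals))

# Teachers' overall-exam postdictions for Classes 1, 3, and 4
# against the actual mean exam scores (values from Tables 4 and 8).
for judgement, actual in calibration_points([80, 75, 55], [85, 83, 83]):
    print(judgement, actual)
```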
5.4 Relative Accuracy
The Goodman-Kruskal gamma correlation, conventionally used to measure the relative accuracy of judgements, was computed to measure the relative accuracy of teachers’ and students’ postdictions. The TJs and students’ RC judgements were correlated with the students’ exam performance scores across skills. A correlation of one (1) indicates that the judgements accurately predicted the performance on one skill relative to another; values closer to zero (0) represent weaker correlations, and zero (0) indicates a lack of correlation. The results of these correlations, together with their significance values (p < .05), are presented in Table 10.
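The Goodman-Kruskal gamma is defined from concordant and discordant pairs; the following is a minimal reference implementation for illustration, not the study’s analysis code (which is not reported):

```python
from itertools import combinations

def goodman_kruskal_gamma(xs, ys):
    """G = (C - D) / (C + D), where C and D count the concordant and
    discordant pairs of observations; tied pairs are ignored."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# Perfectly concordant rankings give G = 1.0.
print(goodman_kruskal_gamma([1, 2, 3], [10, 20, 30]))  # 1.0
```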
Table 10
Gamma correlations of foreign language teachers’ and students’ postdictions with the actual exam scores (overall exam, listening, reading, and writing).

For the teachers’ postdictions, the significant coefficients were G = -.44, .54, .58, .73, .73, .75, .82, and .91 (all p < .01); in particular, the teachers’ reading postdictions correlated with the overall exam at G = .54, p < .01. For the students’ postdictions, the significant coefficients were G = .29 (p = .04), .45, .70, .73, .89, and .91 (p < .01); in particular, the students’ reading postdictions correlated with the overall exam at G = .45, p < .01.

Note. The values not shown were not statistically significant (p > .05).
The Goodman-Kruskal gamma correlation has been used in previous judgement accuracy studies as a standard measure of relative accuracy because of its usefulness for interpreting probability (Nelson, 1996). Nevertheless, this non-parametric measure seems more appropriate when the variable “actual exam scores” is divided into two categories (“pass” or “fail”) than when it is captured through confidence scores from 0 to 100. For this reason, another correlation was used to corroborate the Goodman-Kruskal gamma results. A visual analysis of the probability distributions revealed non-normal distributions of the postdictions and exam scores. Therefore, a non-parametric measure of rank correlation, namely the Spearman rank correlation (Spearman’s rho), was used because of its usefulness with ordinal data (exam scores and confidence scores from 0 to 100) as well as its robustness to outliers (unlike Pearson’s r, which assumes a normal distribution). The two-
tailed test results from these correlations, together with their significance values (p < .05), are shown in Table 11.
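Spearman’s rho is the Pearson correlation of the rank-transformed data; a self-contained sketch for illustration (in practice a statistics package would be used):

```python
def ranks(values):
    """Rank values from 1 upwards, averaging the ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman_rho(xs, ys):
    """Pearson correlation computed on the ranks of xs and ys."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Swapping one adjacent pair out of four ranks yields rho = 0.8.
print(round(spearman_rho([1, 2, 3, 4], [1, 3, 2, 4]), 3))  # 0.8
```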
Table 11
Spearman correlations of foreign language teachers’ and students’ postdictions with the actual exam scores (overall exam, listening, reading, and writing).

For the teachers’ postdictions, the significant coefficients were rs = -.56 (p = .02), .67, .68, .85, .85, .91, and .96 (p < .01); in particular, the teachers’ reading postdictions correlated with the overall exam at rs(15) = .67, p < .01. For the students’ postdictions, the significant coefficients were rs = .57 (p = .02), .78, .83, .93 (p = .02), and .95 (p < .01); in particular, the students’ reading postdictions correlated with the overall exam at rs(15) = .57, p < .05.

Note. The values not shown were not statistically significant (p > .05).
5.5 Correlations
To pinpoint possible contributing factors to the inaccuracy of teachers’ and students’ postdictions, different individual teacher, student, and exam characteristics were examined to see how they related to metacognitive monitoring ability. These characteristics are described in the sample section of this paper. To find out whether the characteristics analysed exhibited a linear or a monotonic relationship with the postdictions, both Pearson’s r and Spearman’s rs were computed. These correlation analyses can be found in Table 12.
Table 12
Pearson’s r and Spearman’s rs correlations between teacher, student, and exam characteristics and postdiction judgements.

The teacher characteristics examined were age, language use, and experience; the student characteristics were age, languages, homework, attendance, study focus, previous knowledge, motivation, and positive and negative self-confidence. The significant correlations were small to moderate: the strongest coefficient for the teachers’ characteristics reached r = .61, rs = .61 (p = .03), whereas most other significant coefficients for teachers and students ranged between |r| = .17 and |r| = .32. Among the students’ characteristics, negative self-confidence correlated negatively with the postdiction scores (up to r = -.47, rs = -.47, p < .01), and positive self-confidence correlated positively (up to r = .34, rs = .36, p < .01).

Note. The values not shown were not statistically significant (p > .05).
Chapter 6 Discussion
This study investigated the accuracy of postdiction judgements of higher education students and foreign language teachers. The main purpose was to study the relation between teacher judgements (TJs), students’ retrospective confidence judgements (RC), and students’ actual performance to see how accurate they were. Accuracy was operationalised as the relationship between these three pieces of data defined in two ways: 1) performance/judgement agreement data and 2) the use of correlations. This relation was analysed using different outcome measures to obtain more holistic results.
6.1 Absolute Accuracy
Under the performance/judgement agreement definition of accuracy, in other words absolute accuracy, both teachers and students showed underconfident results for global postdictions (retrospective confidence) on the overall exam as well as for language skill specific judgements. In spite of this, the students’ RC judgements were better calibrated than the TJs: students’ calibration was 6.76 points more accurate on the overall exam, 1.33 points more accurate on the listening skill, .86 points more accurate on the reading skill, and 1.56 points more accurate on the writing skill. According to Black and Wiliam (1998), one possible reason for the miscalibration of students’ RC judgements is the difficulty level of the questions and tasks that teachers set on their exams. The biased selection of these questions is believed to lead to the Hard-Easy Effect, which can make students miscalibrate their judgements (Gigerenzer et al., 1991; Lichtenstein et al., 1982).
An analysis of the difficulty of the exams and of the questions that tested the listening, reading, and writing skills revealed that the overall exam was more difficult (M = -14.77, SD = 4.73) than the questions that tested the listening and reading skills (M = -6.01, SD = 6.61 for listening and M = -6.81, SD = 7.56 for reading). The tasks that tested the writing skill, however, were as difficult as the overall exam (M = -14.78, SD = 8.34). These results suggest that language teachers can be more accurate when judging the difficulty level of the questions that test the listening and reading skills than when judging the overall exam. This can also explain why students were better calibrated when judging their retrospective confidence on the listening and reading skills and less well calibrated when judging their retrospective confidence on the overall exam. These results are in line with previous research stating that teachers can be both very accurate and not so accurate when estimating the difficulty level of the tasks in their exams. The key to TJ accuracy may reside in testing specific skills rather than global knowledge. Other sources of students’ metacognitive judgement inaccuracy and bias can be attributed to their study choices and learning strategies (Thiede et al., 2003), time pressure (Metcalfe, 2002), and the methods that researchers use to elicit the judgements (Dunlosky & Lipko, 2007). Due to its design, this study did not include an analysis of the learning strategies that the students used; however, their study choices were gathered through the question that asked about the focus of their study. Here we see that, although 47.4% of students focused their study on grammar, their results showed a mean of -9.12 (SD = 6.59) points of under-confidence on the overall exam.
This means that focusing on grammar did not help students to increase their confidence in the questions that tested grammar knowledge, which in some cases made up most of the language exam. A closer look at the specific tasks, their worth values, and students’ interviews could help to clarify whether a study focus on grammar increases
grammar knowledge in a significant way or whether it simply contributes to the enhancement of the other language skills (listening, reading, and writing). The findings in this study confirm the hypothesis that students would have accurate postdiction judgement scores; nevertheless, this was only true for the specific language skills rather than for the overall exam. They also support the research findings of Fritzsche et al. (2012) and of Juslin and Olsson (1997), which show that, although not flawlessly, students can be very accurate when monitoring their cognitive achievement and can express the results of their monitoring processes through confidence ratings. For teachers, the miscalibration of TJs can be due to several reasons. One possible reason is a bias in the population sample and in the judgement requirement. Teachers were required to make a judgement for all their students, which included high-, average-, and low-achieving students. However, the design of the study only captured RC judgements from those students who finished their exam first. This was done, as explained before, so as not to interfere with the natural course of the classes, the exam, or the learning process in general, and to comply with the regulations established by the TUM Language Centre. It is likely, although not certain, that the students whose RC judgements were captured in this study were generally high-achieving students, considering that they finished the test first. This is, nonetheless, only a hypothesis: since the study was completely anonymous and there is no record of who participated, it is impossible to determine whether they were high-, average-, or low-achieving students in general. As with students’ postdiction accuracy, poor calibration of the difficulty of the questions in the exam can be another factor in TJs’ miscalibration.
This is also reflected in the calibration curves, which show a moderately strong hard-easy effect on the overall exam but a weak effect on the questions that tested the listening, reading, and writing skills. To understand the reasons for such miscalibration on the overall exam, future research
studies should focus on the classroom communication between foreign language teachers and students, namely on the amount and type of homework assigned and the amount and type of feedback provided, especially for the tasks that contributed to the students’ understanding of the grammar of the target language.
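To make the calibration analysis concrete, the following sketch shows how a calibration curve and an over/underconfidence bias can be computed from paired confidence ratings and item correctness. All numbers and bin edges are invented for illustration; they are not the thesis data.

```python
# Hedged sketch: calibration bins and over/underconfidence bias from
# paired confidence ratings and item correctness. Data are invented.

confidence = [0.9, 0.8, 0.6, 0.95, 0.5, 0.7, 0.85, 0.6, 0.4, 0.75]
correct    = [1,   1,   0,   1,    1,   0,   1,    1,   0,   1]

# Bin items by stated confidence and compare mean confidence with the
# proportion correct in each bin: bins where confidence exceeds accuracy
# indicate overconfidence; the reverse indicates underconfidence.
edges = [(0.0, 0.6), (0.6, 0.8), (0.8, 1.01)]
for lo, hi in edges:
    items = [(c, k) for c, k in zip(confidence, correct) if lo <= c < hi]
    if items:
        mean_conf = sum(c for c, _ in items) / len(items)
        prop_correct = sum(k for _, k in items) / len(items)
        print(f"bin [{lo:.1f}, {hi:.2f}): confidence {mean_conf:.2f}, "
              f"accuracy {prop_correct:.2f}")

# Overall bias (mean confidence minus mean accuracy). A hard-easy effect
# appears when this bias is clearly positive for a hard item set (here, the
# overall exam) but small for easy ones, so it is computed per item set.
bias = sum(confidence) / len(confidence) - sum(correct) / len(correct)
print(f"over/underconfidence bias = {bias:+.3f}")
```

In this hedged form, the bias statistic mirrors the over- and underconfidence measure used throughout the thesis, while the per-bin comparison corresponds to the points of a calibration curve.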
6.2 Relative Accuracy

The Goodman-Kruskal gamma correlation was used as a standard measure to examine the correlation between the performance confidence scores (postdictions) of foreign language teachers and higher education students and the students’ actual academic achievement (exam scores). This coefficient revealed statistically significant correlations between the overall exam and both the teachers’ and the students’ postdictions for the reading skill (G = .54, p < .01 and G = .45, p < .01, respectively). No other teacher or student postdiction correlated with the overall exam. A Spearman two-tailed test of significance corroborated these results, showing significant positive correlations of rs(15) = .67, p < .01 for the teachers’ postdictions and rs(15) = .57, p < .05 for the students’ postdictions. The use of these two measures is an important contribution of this study because it helps to corroborate previous findings. Tables 10 and 11 in the results section of this paper show consistent relative accuracy results for both the gamma and the Spearman correlations; this is particularly true for the reading and writing skills.

To understand the possible reasons for the correlations found in this study, five classes were observed during the semester. Participatory observation allowed the researcher to analyse the teaching methodologies used and to gather information on the teaching from the students’ perspective. The classes observed covered the following languages: Italian, French, German, Norwegian, and Chinese. To avoid a Hawthorne effect, that is, a change in the participants’ behaviour in response to their awareness of being
observed, the researcher enrolled in the aforementioned language courses as a regular student and completed all the tasks. Neither the teachers nor the students were informed of the researcher’s role until the end of the semester.

The class observations revealed a strong emphasis by teachers on their students’ reading skills. Teachers, as observed across the different language classes, helped students become familiar with the new writing system that each language introduced. Furthermore, they focused on pronunciation and guided students to form acoustic connections with familiar words in their first language. In German, for example, students had to learn to read the letters ä, ö, ü, and ß and understand their sounds and linguistic functions in order to comprehend texts and read them aloud. The Norwegian teacher identified a similar need regarding some Norwegian letters (æ, ø, and å): although these letters were unfamiliar in written form, they recalled sounds common in the students’ mother tongue, which for most of them was German. In French and Italian, apart from some new letters, students had to cope with a variety of diacritical marks (é, è, ê, ë, etc.) that can change the meaning or pronunciation of words; the same is true for Spanish and Portuguese, even though those classes were not observed. Finally, the Chinese teacher focused on both the Chinese writing system and Chinese culture, given the strong link between the two and the difficulty of getting acquainted with a non-Roman writing system. All this was done, as observed in the classes, with two main purposes: 1) to enable students to read and understand the tasks they needed to accomplish, either in the course book or as home assignments (as these were usually written in the target language), and 2) to allow teachers to monitor and correct pronunciation mistakes.
Good pronunciation and intonation are believed to allow foreign language learners to communicate effectively despite minor inaccuracies in vocabulary and grammar (Burns, 2003).
Teachers encouraged and fostered the learning of the reading and writing skills through class participation. These skills received personalised feedback at the process and self-regulation levels (Hattie & Timperley, 2007), two levels of feedback that have been shown to enhance students’ self-evaluation skills and to give them greater confidence to engage in future tasks (Hattie & Timperley, 2007). However, because of time constraints and class sizes, it was not always possible for all students to participate in class and thereby improve their reading and writing skills. Furthermore, not all students completed their assignments, as these were not compulsory for passing the course. The listening skill, in some cases, received feedback only at the task level (Hattie & Timperley, 2007), probably because of the nature of the tasks used to practise it (true-false or fill-in-the-blank tasks). These observations, taken together with the correlation results, help explain the significant correlations found between the overall exam and the teachers’ and students’ postdictions for the reading skill.

Although the gamma and Spearman correlations showed similar results, the Spearman correlations were in some cases higher (stronger). One reason for this discrepancy might be a metacognitive bias, namely the sensitivity of the gamma correlation to students’ tendency to use higher or lower confidence ratings because of, for example, a shy or humble personality (Masson & Rotello, 2009; Nelson, 1984). Even though the design of the study tried to control for this bias, personal and environmental factors are always present in a typical classroom exam situation and can affect the psychological processes of the participants. For this reason, the holistic approach that this study takes is vital for a better analysis and understanding of metacognitive accuracy.
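To make the difference between the two relative-accuracy measures concrete, the following sketch computes both on a small set of invented confidence ratings and exam scores (all numbers hypothetical; Spearman’s rho is implemented directly rather than taken from a statistics package):

```python
# Hedged sketch of the two relative-accuracy measures discussed above:
# Goodman-Kruskal gamma and Spearman's rho. The confidence ratings and
# exam scores below are invented for illustration; they are not thesis data.
from itertools import combinations

confidence = [60, 70, 70, 85, 90, 50, 80]   # hypothetical postdiction ratings
score      = [55, 65, 80, 80, 95, 45, 70]   # corresponding hypothetical scores

# Goodman-Kruskal gamma: (concordant - discordant) / (concordant + discordant)
# over all item pairs. Tied pairs are dropped, which is why gamma is sensitive
# to how narrowly a rater uses the confidence scale (the metacognitive bias
# noted above).
concordant = discordant = 0
for (c1, s1), (c2, s2) in combinations(zip(confidence, score), 2):
    product = (c1 - c2) * (s1 - s2)
    if product > 0:
        concordant += 1
    elif product < 0:
        discordant += 1
gamma = (concordant - discordant) / (concordant + discordant)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Spearman's rho is Pearson's r computed on ranks; ties are kept rather than
# dropped, so rho can differ from gamma on the same data.
rho = pearson(average_ranks(confidence), average_ranks(score))
print(f"gamma = {gamma:.2f}, Spearman rho = {rho:.2f}")
```

Because the two statistics treat tied confidence ratings differently, running both on the same judgement data, as this study does, is a cheap robustness check on the relative-accuracy results.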
6.3 Factors correlated to postdiction accuracy

As part of this holistic approach, this study analyses the correlations between judgement accuracy and several personal and environmental factors that can affect metacognitive
monitoring. For these correlations, both the Pearson and the Spearman coefficients were calculated, 1) because of their individual advantages and 2) so that each could corroborate the findings of the other.
6.3.1 Teachers’ factors.

The influence of teaching qualifications could not be analysed because this factor behaved as a constant: all teachers held teaching qualifications, either as their main or as a secondary academic formation. Although the level of qualification varied (41.2% of the teachers held a bachelor’s degree, 52.9% a master’s degree, and 5.9% a PhD), the sample (N = 17) was not large enough to obtain reliable results. For this reason, the hypothesis that teachers holding teaching qualifications would have more accurate postdiction judgement (TJ) scores could not be tested.

The teachers’ experience, their age, and the amount of the target language they used in class were analysed. No correlations were found between the teachers’ age and their postdiction scores, nor between their postdiction scores and their teaching experience. Furthermore, the hypothesis that teachers with more than five years of experience would have more accurate postdiction judgement (TJ) scores could not be tested because all teachers had more than five years of teaching experience. These results accord with the findings of Darling-Hammond (2000), in which teachers with more than five years of experience showed no differences in calibration.

On the other hand, the amount of the target language that the teachers used in class showed significant positive correlations for both the Pearson correlation, r(15) = .61, p < .05, and the Spearman correlation, rs(15) = .61, p < .05. This suggests that the more teachers speak in the target language, the better they tend to calibrate their postdiction judgements. Further studies should look at the quality of the language used in class; for
example, to analyse whether the instructions that teachers use in class and the instructions given on the exam are the same. Another approach could be to study whether this factor also affects teachers who teach higher levels of linguistic proficiency (A2, B1, B2, C1, or C2) according to the Common European Framework of Reference for Languages (CEFR).
6.3.2 Students’ factors.

Several student characteristics, such as age, the number of languages spoken (as a native or a foreign language), the effort put into the class, the focus of study, previous knowledge, motivation, and self-confidence, were also analysed as part of this holistic approach to understanding the accuracy results. This study concurred with several others in finding that the students’ age did not influence postdiction judgement accuracy. Although the number of languages the students spoke did not correlate with the overall exam, weak but significant correlations were found for the listening, reading, and writing skills with both the Pearson and the Spearman coefficients. This may be due to the experience with these skills that students gain whenever they learn a foreign language.

The effort the students put into the class was operationalised through two factors: 1) the amount of homework the students did and 2) the number of classes they attended. Class attendance did not correlate with the overall exam; however, it correlated with the reading and the writing skills as measured with the Spearman correlation (rs = -.21, p < .05 and rs = -.17, p < .05, respectively). The negative sign of these correlations can be attributed to the negative coding of the variable, because the survey asked the students “How many classes did you miss?” rather than “How many classes did you attend?”. Similarly, the students’ study focus (vocabulary, grammar, or both) also correlated with the reading and writing skills (r(15) = -.18, p < .05 and r(15) = -.32, p < .01, respectively). This suggests that the factors that affect judgement accuracy on the writing skill could also help students give accurate
judgements for the reading skill. This might occur because of the close relationship between these two skills and the likelihood that they influence each other (Eisterhold, 1990). The amount of homework the students did had a weak positive correlation with the overall exam and with the writing skill when measured with the Pearson correlation (r = .22, p < .01 and r = .18, p < .05, respectively). This suggests that the more homework students do, the more accurate their postdictions are on the overall exam and on the questions testing their writing skill. Further research should examine the type of homework done and the type of feedback given in order to understand which types of homework and feedback improve metacognitive monitoring accuracy.

Students’ previous knowledge also showed a weak positive correlation with the overall exam on both the Pearson and the Spearman coefficients (r(15) = .21, p < .05 and rs = .19, p < .05, respectively), although it did not correlate with any of the language skills tested. This can be understood from the perspective of experience, because this was not the first time the students were learning a foreign language or sitting a language exam. The experience gained from learning a foreign language could be crucial to understanding how language examinations are constructed and how this knowledge is extrapolated to other languages.

Because of the important role that motivation plays in education, its effects on confidence accuracy were also analysed. Previous research has suggested that students with integrative motivation could make more accurate postdiction judgements (Norris-Holt, 2001). This study asked students to state whether they had intrinsic, extrinsic, or both types of motivation. The results showed that the students’ motivation to study the language had no effect on their global postdiction accuracy. The last factor analysed was students’ self-confidence.
It was operationalised by asking students whether or not they believed they were good at the language they were learning (general self-confidence) and by asking them about the specific language skills they believed they were good
or bad at (positive and negative self-confidence). Students’ general self-confidence and positive self-confidence correlated positively with the overall exam and with all the language skills. Students’ negative self-confidence correlated only with the listening skill.
6.3.3 Exam factors.

Another important contribution of this study is the analysis of test characteristics (exam factors), undertaken to gain a holistic understanding of the conditions under which the teachers’ and students’ postdictions were made. In their model, Südkamp et al. (2012) showed how this factor can affect students’ test performance. Owing to standards established by the TUM Language Centre to ensure fairness in testing, all exams had the same duration (90 minutes); test time was therefore a constant in this study. Nonetheless, it is important to consider that time pressure has been shown to contribute to inaccuracy and bias in metacognitive judgements (Metcalfe, 2002). In this study, all the participating students finished their exam within the 90 minutes; however, as discussed earlier, they are believed to be high-achieving students. Observations during the exams showed that several other students would have needed more time to finish (or to feel more confident about their answers). Future research could study the effects of this factor on postdiction judgement accuracy.
Thus, with the results obtained, we can build a better picture of foreign language teachers’ and students’ postdiction accuracy and of the factors that affect it. This information contributes to the research by Südkamp et al. (2012) and builds on the model they proposed, which identifies various aspects affecting TJ accuracy. We can suggest that, apart from the teachers’ experience, the amount of the target language to which foreign language teachers expose their students is also an important factor to consider if high variability in the correlations is to be avoided and valid results obtained.
We can advocate that, in the area of foreign language teaching and learning, the students’ focus of study, class attendance, motivation, general self-confidence, and positive self-confidence play a significant role in producing accurate judgements. These factors are brought together in a new model specific to foreign language teaching (Figure 7).
[Figure 7 diagram. Its labels describe the sample and setting: experienced teachers; judgement type: informed postdictions; measure: confidence scores; language usage: 60%; homework: 42%; attendance: 69%; focus: 47% on grammar; self-confidence: 66% good; achievement test, 90 min.]
Figure 7. A model of the factors that affect foreign language teachers’ judgement accuracy of students’ academic achievement.
It must be emphasised that this proposed model is specific to the area of foreign language learning and that these characteristics seemed to have a significant effect on judgement accuracy for the three main linguistic skills (listening, reading, and writing) but not for the overall exam, which also tested grammar and vocabulary. Although some teacher, student, judgement, or test characteristics could also be important in other areas such as maths or
science, every subject ought to have its own model pinpointing the characteristics that affect teachers’ and students’ judgements in that specific area of knowledge, in order to avoid miscalibration effects and to have a positive impact on the students’ learning experiences and educational trajectories.
Chapter 7 Limitations
Research has found that the relationship between teachers and students may affect TJ accuracy, since a closer relationship with students can enable teachers to make more accurate predictions about student achievement (Coladarci, 1986). This study tried to capture such a relationship through the variables “homework” and “attendance”, which measured the amount of homework the students did and the number of classes they missed. Because of the design of this study, however, this relationship was not investigated in depth. The class observations provide some insight, for example into how teachers gave feedback on homework or demanded class participation from students who had missed classes, but they are insufficient to determine the closeness of the teacher-student relationship. Personal interviews could shed more light on this variable in future research.

Although previous studies show that teachers’ experience and qualifications significantly affect TJ accuracy, the difference between novice and experienced teachers could not be analysed here because all teachers had more than five years of teaching experience and were therefore considered experienced. Likewise, the influence of teaching qualifications (a teaching degree) on accuracy could not be examined, since all teachers had formal teacher training. Future studies could include both novice and experienced language teachers, as well as teachers with a greater range of academic formation in teaching (from universities, institutes, etc.), to see how these variables affect judgement accuracy. This research was also confronted with natural limits in the analysis of teacher competence and teaching experience.
Darling-Hammond (2000) concluded in several studies that teaching certification is an important aspect of teacher competence, yet competence is a trait that cannot be easily measured by teaching qualifications alone (Hanushek,
2003). Likewise, teaching experience is a complex construct that cannot be defined simply by years. Future research should aim to identify teachers’ levels of Pedagogical Knowledge (PK) and Pedagogical Content Knowledge (PCK) (Shulman, 1986) to build a better picture of teacher competence and experience. Some teacher characteristics that could also significantly affect TJ accuracy, such as teaching style (student-centred or curriculum-centred) and teachers’ treatment of low- and high-achieving students, could not be studied because they would have required structured classroom observations, which were not logistically feasible in all classes. Future research should seek to fill this gap.

Probably the biggest limitation of this study was the participation criterion for students. Considering the extra effort that participation represented for teachers, the number of students allowed to give a confidence score was limited to 10 per class. Also, as this study did not attempt to interfere with the natural course of the language classes or the exam, all students had the full time that the teachers had planned for the exam (90 minutes), and confidence scores were gathered only from those students who finished before the time limit was up. These students are believed to be the higher-achieving students in their classes. This design also affected the teachers’ postdiction judgements, since teachers were required to make a judgement for all their students, including high-, average-, and low-achieving ones, and they tailored the difficulty of their tests with all students in mind, not only the high-achieving ones. This might be the reason why the data show that students were better calibrated than their teachers.
Although it would ideally have been possible to gather postdiction judgements from all students, this was considered unfeasible because of the amount of work it would have represented for the teachers. A cluster selection of students in every class, covering high-, average-, and low-achieving students, was also not possible because the
students could not predict whether they would finish their exam before the 90 minutes were up and thus have time to participate in the study. Finally, because of limited access to the tests that the teachers designed, the validity and reliability of the language tests were not calculated; such measurements could have enhanced the statistical strength of this study. This study sought to gauge the difficulty of the exams and their questions by subtracting the students’ scores from the exam worth value. This approach, however, yields ambiguous data, since it cannot distinguish, for instance, questions that were too easy from students who were simply well prepared. Future research could analyse exam difficulty more thoroughly and treat test validity and reliability as important variables in measuring postdiction judgement accuracy.
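As an illustration of why the score-minus-worth approach is ambiguous, the following sketch (with invented worth values and mean scores) contrasts the raw gap with a classical proportion-based difficulty index:

```python
# Hedged sketch of the difficulty estimate described above: subtracting mean
# scores from the worth value of each exam part. All numbers are invented.
worth      = {"listening": 20, "reading": 25, "writing": 25, "grammar": 30}
mean_score = {"listening": 17, "reading": 21, "writing": 20, "grammar": 19}

for skill in worth:
    gap = worth[skill] - mean_score[skill]
    # A classical item-difficulty index uses the proportion of the worth value
    # obtained (p = mean score / worth); like the raw gap, a high p alone
    # cannot tell an easy question apart from a well-prepared group.
    p = mean_score[skill] / worth[skill]
    print(f"{skill}: gap = {gap}, difficulty index p = {p:.2f}")
```

Neither statistic resolves the ambiguity by itself; disentangling item difficulty from student preparation would require the validity and reliability analyses suggested above.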
Chapter 8 Conclusion
This study investigated foreign language teachers’ and higher education students’ postdiction judgements to examine their relation to each other and to the students’ actual performance on their language test. Specifically, this study examined the relationships between TJs, students’ RC judgements, and the students’ actual exam scores. This analysis used several measures of confidence judgement accuracy in order to see the degree to which their results agreed with each other. The aim was to determine the extent to which metacognitive postdiction judgements were realistic and could disclose over- and underconfidence bias (confidence inflated or deflated relative to actual performance), and to pinpoint the factors that affected teacher judgement accuracy.

Meeting research and educational demands, this study not only presents the level of accuracy with which teachers’ and students’ scores correlate with each other and with actual performance, but also considers personal and behavioural aspects affecting the quality of the judgements and their accuracy. This was achieved by analysing teacher, student, judgement, and exam characteristics on the basis of the model presented by Südkamp et al. (2012), which reflects theoretical considerations and empirical findings identifying various aspects that affect the accuracy of TJs. This study found that foreign language students were more accurate than their teachers when judging retrospective confidence, and that both foreign language teachers and students were better calibrated when judging specific language skills than when judging the overall exam. Of the three language skills tested (listening, reading, and writing), reading was the most accurately judged.
The findings of this study accord with previous research indicating that teachers, on average, can be moderately accurate in their metacognitive judgements (Hoge & Coladarci, 1989; Südkamp et al., 2012). One important contribution of this study is the use of several measures to calculate teacher and student retrospective judgement accuracy, strengthening results that previous studies obtained through a single measure. Another significant contribution is the examination of how metacognitive judgement accuracy behaves in a real teaching and learning environment and in an exam situation. Such a setting is important because education normally takes place in environments shaped by factors that are controlled for in lab-like experiments and studies.

A third important contribution of this study is the analysis of the percentage of class time in which teachers used the target language, that is, the extent to which students were exposed to the target language in class. To my knowledge, no previous studies have analysed this variable and its effects on judgement accuracy. It is important because learners of foreign languages do not normally have contact with people who speak the language they are learning; the teacher therefore becomes an important source of listening input and learning. In this study, this assumption held for all the languages except German: even though the students were learning German as a foreign language, they were living in Germany and encountered the language in everyday situations.

Finally, the analysis of the teachers’ and students’ factors sets a direction for future research to corroborate (or discard) the correlations found in this study. The reasons that might explain the discrepancy between teachers’ and students’ judgement accuracy are discussed in the limitations section of this paper, along with some suggestions for obtaining better judgement accuracy results.
Finally, a model of the factors that might affect foreign language teachers’ judgement accuracy of students’ academic achievement is presented in the discussion section of this paper.
The findings of this study contribute to the body of research on teacher judgement accuracy and provide a basis for foreign language teachers, and teachers in general, to teach metacognitively, making more accurate judgements when planning their lessons and when evaluating their students both formatively and summatively.
References
Allwood, C. M., Jonsson, A. C., & Granhag, P. A. (2005). The effects of source and type of feedback on child witnesses’ metamemory accuracy. Applied Cognitive Psychology, 19, 331-344.
Alvidrez, J., & Weinstein, R. S. (1999). Early teacher perceptions and later student academic achievement. Journal of Educational Psychology, 91, 731-746. Retrieved from http://dx.doi.org/10.1037/0022-0663.91.4.731
Anders, Y., Kunter, M., Brunner, M., Krauss, S., & Baumert, J. (2010). Diagnostische Fähigkeiten von Mathematiklehrkräften und ihre Auswirkungen auf die Leistungen ihrer Schülerinnen und Schüler. Psychologie in Erziehung und Unterricht, 57, 175-193.
Ariel, R., Dunlosky, J., & Bailey, H. (2009). Agenda-based regulation of study-time allocation: When agendas override item-based monitoring. Journal of Experimental Psychology: General, 138, 432-447. Retrieved from http://dx.doi.org/10.1037/a0015928
Bates, C., & Nettelbeck, T. (2001). Primary school teachers’ judgements of reading achievement. Educational Psychology. doi:10.1080/01443410020043878
Begeny, J. C., Krouse, H. E., Brown, K. G., & Mann, C. M. (2011). Teacher judgments of students’ reading abilities across a continuum of rating methods and achievement measures. School Psychology Review, 40, 23-38.
Benjamin, A. S., & Bjork, R. A. (1996). Retrieval fluency as a metacognitive index. In L. Reder (Ed.), Implicit memory and metacognition (pp. 309-338). Hillsdale, NJ: Erlbaum.
Bennett, R. E., Gottesman, R. L., Rock, D. A., & Cerullo, F. (1993). Influence of behavior perceptions and gender on teachers’ judgments of students’ academic skill. Journal of Educational Psychology, 85, 347-356.
Berliner, D. C. (2004). Expert teachers: Their characteristics, development, and accomplishments. Bulletin of Science, Technology, and Society, 24(3), 200-212. Retrieved from https://www.researchgate.net/publication/255666969_Expert_Teachers_Their_Characteristics_Development_and_Accomplishments
Black, P., & Wiliam, D. (1998). Assessment and classroom learning. Assessment in Education: Principles, Policy, and Practice, 5, 7-73.
Borkowski, J., Carr, M., & Pressley, M. (1987). “Spontaneous” strategy use: Perspectives from metacognitive theory. Intelligence, 11, 61-75.
Burns, A. (2003). Clearly speaking: Pronunciation in action for teachers. National Centre for English Language Teaching and Research, Macquarie University, Sydney NSW 2109.
Caleon, I. S., & Subramaniam, R. (2010). Do students know what they know and what they don’t know? Using a four-tier diagnostic test to assess the nature of students’ alternative conceptions. Research in Science Education, 40(3), 313-337.
Coladarci, T. (1986). Accuracy of teacher judgments of student responses to standardized test items. Journal of Educational Psychology. doi:10.1037/0022-0663.78.2.141
Crawford, J., & Stankov, L. (1996). Age differences in the realism of confidence judgements: A calibration study using tests of fluid and crystallized intelligence. Learning and Individual Differences, 6, 84-103.
Darling-Hammond, L. (1995). Equity issues in performance-based assessment. In Equity and excellence in educational testing and assessment (pp. 89-114). Boston: Kluwer.
Darling-Hammond, L. (2000). Teacher quality and student achievement: A review of state policy evidence. Education Policy Analysis Archives, 8(1), 1-44.
Demaray, M. K., & Elliott, S. N. (1998). Teachers’ judgments of students’ academic functioning: A comparison of actual and predicted performances. School Psychology Quarterly. doi:10.1037/h0088969
Dunlosky, J., & Bjork, R. A. (2008). The integrated nature of metamemory and memory. In J. Dunlosky & R. A. Bjork (Eds.), Handbook of metamemory and memory (pp. 11-28). New York: Psychology Press.
Dunlosky, J., & Lipko, A. R. (2007). Metacomprehension: A brief history and how to improve its accuracy. Current Directions in Psychological Science, 16, 228-232.
Dunlosky, J., & Metcalfe, J. (2009). Metacognition. Thousand Oaks, CA: Sage Publications.
Dusek, J. B., & O’Connell, E. J. (1973). Teacher expectancy effects on the performance of elementary school children. Journal of Educational Psychology, 65, 371-377.
Eisterhold, C. (1990). Reading-writing connections: Toward a description for second language learners. In B. Kroll (Ed.), Second language writing (pp. 88-101). New York, NY: Cambridge University Press.
Ellis, R. (2008). The study of second language acquisition. Oxford: Oxford University Press.
Feinberg, A. B., & Shapiro, E. S. (2003). Accuracy of teacher judgments in predicting oral reading fluency. School Psychology Quarterly. doi:10.1521/scpq.18.1.52.20876
Flavell, J. H. (1979). Metacognition and cognitive monitoring: A new area of cognitive-developmental inquiry. American Psychologist, 34, 906-911.
Fleming, S. M., & Lau, H. C. (2014). How to measure metacognition. Frontiers in Human Neuroscience, 8, 443. Retrieved from http://doi.org/10.3389/fnhum.2014.00443
Frank, D., & Kuhlmann, B. (2016). More than just beliefs: Experience and beliefs jointly contribute to volume effects on metacognitive judgments. Journal of Experimental Psychology: Learning, Memory, and Cognition. Advance online publication.
Fritzsche, E. S., Kröner, S., & Dres, M. (2012). Confidence scores as measures of metacognitive monitoring in primary students? (Limited) validity in predicting academic achievement and the mediating role of self-concept. Journal for Educational Research Online, 4(2), 120-142.
Funder, D. C. (2012). Accuracy of personality judgment. Current Directions in Psychological Science. doi:10.1177/0963721412445309
Gagne, F., & St. Pere, F. (2001). When IQ is controlled, does motivation still predict achievement? Intelligence, 30, 71-100.
Gardner, R. (2010). Motivation and second language acquisition: The socio-educational model. New York: Peter Lang Publishing.
Gigerenzer, G., Hoffrage, U., & Kleinboelting, H. (1991). Probabilistic mental models: A Brunswikian theory of confidence. Psychological Review, 98, 506-528.
Goodman, L. A., & Kruskal, W. H. (1954). Measures of association for cross classifications. Journal of the American Statistical Association, 49, 732-764. doi:10.2307/2281536
Hacker, D. J., Bol, L., Horgan, D., & Rakow, E. A. (2000). Test prediction and performance in a classroom context. Journal of Educational Psychology, 92, 160-170.
Hanushek, E. A. (2003). The failure of input-based schooling policies. The Economic Journal, 113, F64-F68.
Hart, J. T. (1965). Memory and the feeling-of-knowing experience. Journal of Educational Psychology, 56, 208-216.
Harvey, N. (1997). Confidence in judgment. Trends in Cognitive Sciences, 1, 78-82.
Hasan, S., Bagayoko, D., & Kelley, E. L. (1999). Misconceptions and the certainty of response index (CRI). Physics Education, 34(5), 294-299.
Hattie, J., & Timperley, H. (2007). The power of feedback. Review of Educational Research, 77, 81-112.
Hoge, R. D., & Butcher, R. (1984). Analysis of teacher judgements of pupil achievement levels. Journal of Educational Psychology, 76, 777-781.
Hoge, R. D., & Coladarci, T. (1989). Teacher-based judgments of academic achievement: A review of the literature. Review of Educational Research, 59, 297-313.
Hopkins, K. D., George, C. A., & Williams, D. D. (1985). The concurrent validity of standardised achievement tests by content area using teachers' ratings as criteria. Journal of Educational Measurement, 22, 117-182.
Johansson, S., Strietholt, R., Rosén, M., & Myrberg, E. (2014). Valid inferences of teachers' judgements of pupils' reading literacy: Does formal teacher competence matter? School Effectiveness and School Improvement, 25(3), 394-407. doi:10.1080/09243453.2013.809774
Juslin, P., & Olsson, H. (1997). Thurstonian and Brunswikian origins of uncertainty in judgment: A sampling model of confidence in sensory discrimination. Psychological Review, 104, 344-366. doi:10.1037/0033-295X.104.2.344
Kaiser, J., Retelsdorf, J., Südkamp, A., & Möller, J. (2013). Achievement and engagement: How student characteristics influence teacher judgments. Learning and Instruction, 28, 73-84.
Kaiser, J., Südkamp, A., & Möller, J. (2016). The effects of student characteristics on teachers' judgment accuracy: Disentangling ethnicity, minority status, and achievement. Journal of Educational Psychology. doi:10.1037/edu0000156
Koriat, A. (2007). Metacognition and consciousness. In The Cambridge Handbook of Consciousness. New York, NY: Cambridge University Press.
Koriat, A., & Levy-Sadot, R. (1999). Processes underlying metacognitive judgments: Information-based and experience-based monitoring of one's own knowledge. In S. Chaiken & Y. Trope (Eds.), Dual-process theories in social psychology (pp. 483-502). New York: Guilford Publications.
Koriat, A., Sheffer, L., & Ma'ayan, H. (2002). Comparing objective and subjective learning curves: Judgements of learning exhibit increased underconfidence with practice. Journal of Experimental Psychology: General, 131, 147-162.
Kornell, N., & Metcalfe, J. (2006). "Blockers" do not block recall during tip-of-the-tongue states. Metacognition and Learning, 1, 248-261.
Kuhn, D., & Dean, D. (2004). A bridge between cognitive psychology and educational practice. Theory Into Practice, 43(4), 268-273.
Lai, E. R. (2011). Metacognition: A literature review. Pearson.
Leinhardt, G. (1983). Novice and expert knowledge of individual students' achievement. Educational Psychologist, 18, 165-179.
Leucht, M., Triffin-Richards, S., Vock, M., Pant, H. A., & Koeller, O. (2012). Diagnostische Kompetenz von Englischlehrkräften bei der Bewertung von Schülerleistungen mit Hilfe des Gemeinsamen Europäischen Referenzrahmens für Sprachen. Zeitschrift für Entwicklungspsychologie und Pädagogische Psychologie. doi:10.1026/0049-8637/a000071
Lichtenstein, S., Fischhoff, B., & Phillips, L. D. (1982). Calibration of probabilities: The state of the art to 1980. In D. Kahneman, P. Slovic, & A. Tversky (Eds.), Judgment under uncertainty: Heuristics and biases (pp. 306-334). Cambridge, UK: Cambridge University Press.
Llosa, L. (2008). Building and supporting a validity argument for a standards-based classroom assessment of English proficiency based on teacher judgements. Educational Measurement: Issues and Practice, 27(3), 32-42.
Loftus, E. F., & Ketcham, K. (1991). Witness for the defence: The accused, the eyewitness, and the expert who puts memory on trial. New York: St. Martin's Press.
Luce, S. R., & Hoge, R. D. (1978). Relations among teacher rankings, pupil-teacher interactions, and academic achievement: A test of the teacher expectancy hypothesis. American Educational Research Journal, 15, 489-500.
Maki, R. H., Shields, M., Wheeler, A. E., & Zacchilli, T. L. (2005). Individual differences in absolute and relative metacomprehension accuracy. Journal of Educational Psychology, 97, 723-731. doi:10.1037/0022-0663.97.4.723
Martínez, J. F., & Mastergeorge, A. (2002). Rating performance assessments of students with disabilities: A generalizability study of teacher bias. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.
Martínez, J. F., Stecher, B., & Borko, H. (2009). Classroom assessment practices, teacher judgements, and student achievement in mathematics: Evidence from the ECLS. Educational Assessment, 14, 78-102. doi:10.1080/10627190903039429
McElvany, N., Schroeder, S., Baumert, J., Schnotz, W., Horz, H., & Ullrich, M. (2012). Cognitively demanding learning materials with texts and instructional pictures: Teachers' diagnostic skills, pedagogical beliefs and motivation. European Journal of Psychology of Education, 27(3), 403-420. doi:10.1007/s10212-011-0078-1
McElvany, N., Schroeder, S., Hachfeld, A., Baumert, J., Richter, T., Schnotz, W., & Ullrich, M. (2009). Diagnostische Fähigkeiten von Lehrkräften bei der Einschätzung von Schülerleistungen und Aufgabenschwierigkeiten bei Lernmedien mit instruktionalen Bildern. Zeitschrift für Pädagogische Psychologie, 23(3), 223-235.
McNair, K. (1978). Capturing inflight decisions: Thoughts while teaching. Educational Research Quarterly, 3(4), 26-42.
Meeter, M., & Nelson, T. O. (2003). Multiple study trials and judgments of learning. Acta Psychologica, 113, 123-132.
Meisels, S. J., Bickel, D. D., Nicholson, J., Xue, Y., & Atkins-Burnett, S. (2001). Trusting teachers' judgements: A validity study of a curriculum-embedded performance assessment in Kindergarten-Grade 3. American Educational Research Journal, 38(1), 73-95.
Merkle, E. C. (2009). The disutility of the hard-easy effect in choice confidence. Psychonomic Bulletin & Review, 16(1), 204-213. doi:10.3758/PBR.16.1.204
Metcalfe, J. (2002). Is study time allocated selectively to a region of proximal learning? Journal of Experimental Psychology: General, 131, 349-363.
Metcalfe, J., & Kornell, N. (2005). A Region of Proximal Learning model of study time allocation. Journal of Memory and Language, 52, 463-477. doi:10.1016/j.jml.2004.12.001
Mingjing, Z., & Detlef, U. (2015). Teachers' judgements of students' foreign-language achievement. European Journal of Psychology of Education, 30, 21-39. doi:10.1007/s10212-014-0225-6
Nelson, T. O. (1984). A comparison of current measures of the accuracy of feeling-of-knowing predictions. Psychological Bulletin, 95, 109-133.
Nelson, T. O., & Dunlosky, J. (1991). When people's judgements of learning (JOLs) are extremely accurate at predicting perceptual identification and relearning. Journal of Experimental Psychology: General, 2, 282-300.
Nelson, T. O. (1996). Gamma is a measure of the accuracy of predicting performance on one item relative to another item, not the absolute performance on an individual item: Comments on Schraw (1995). Applied Cognitive Psychology, 10, 257-260. doi:10.1002/(SICI)1099-0720(199606)10:33.0.CO;2-9
Nelson, T. O., & Narens, L. (1990). Metamemory: A theoretical framework and new findings. In G. H. Bower (Ed.), The psychology of learning and motivation (Vol. 26, pp. 125-133). New York: Academic Press.
Norris-Holt, J. (2001). Motivation as a contributing factor in second language acquisition. The Internet TESL Journal, III(6). Retrieved from http://iteslj.org/Articles/NorrisMotivation.html
Ohle, A., & McElvany, N. (2015). Teachers' diagnostic competences and their practical relevance. Journal for Educational Research Online, 7(2), 5-10.
Oliver, J. E., & Arnold, R. D. (1978). Comparing a standardised test, an informal inventory, and teacher judgement of third grade reading. Reading Improvement, 15, 56-59.
Perry, N. E., & Meisels, S. J. (1996). How accurate are teacher judgements of students' academic performance? Washington, DC: National Center for Education Statistics.
Rodriguez, M. C. (2004). The role of classroom assessment in student performance on TIMSS. Applied Measurement in Education, 17(1), 1-24.
Saglam, M. (2015). The confidence-accuracy relationship in diagnostic assessment: The case of the potential difference in parallel electric circuits. Educational Sciences: Theory and Practice. doi:10.12738/estp.2016.1.0033
Schraw, G. (2009). A conceptual analysis of five measures of metacognitive monitoring. Metacognition and Learning, 4(1), 33-45. doi:10.1007/s11409-008-9031-3
Schraw, G., Crippen, K. J., & Hartley, K. (2006). Promoting self-regulation in science education: Metacognition as part of a broader perspective on learning. Research in Science Education, 36, 111-139.
Schwartz, B. L., & Metcalfe, J. (1994). Methodological problems and pitfalls in the study of human metacognition. In J. Metcalfe & A. P. Shimamura (Eds.), Metacognition: Knowing about knowing (pp. 93-113). Cambridge, MA: MIT Press.
Sharpley, C. F., & Edgar, E. (1986). Teachers' ratings vs. standardized tests: An empirical investigation of agreement between two indices of achievement. Psychology in the Schools, 23, 106-111.
Shavelson, R. J., & Borko, H. (1979). Research on teachers' decisions in planning instruction. Educational Horizons, 57, 183-189.
Shavelson, R. J., & Stern, P. (1981). Research on teachers' pedagogical thoughts, judgements, decisions, and behaviour. Review of Educational Research, 51, 455-498. doi:10.2307/1170362
Shulman, L. S. (1986). Those who understand: Knowledge growth in teaching. Educational Researcher, 15(2), 4-14.
Son, L. K., & Metcalfe, J. (2000). Metacognitive and control strategies in study-time allocation. Journal of Experimental Psychology: Learning, Memory, and Cognition, 26, 204-221.
Stankov, L., & Crawford, J. D. (1997). Self-confidence and performance on tests of cognitive abilities. Intelligence, 25, 93-109. doi:10.1016/S0160-2896(97)90047-7
Suantak, L., Bolger, F., & Ferrell, W. (1996). The hard-easy effect in subjective probability calibration. Organizational Behavior and Human Decision Processes, 67(2), 201-221. doi:10.1006/obhd.1996.0074
Südkamp, A., Kaiser, J., & Möller, J. (2012). Accuracy of teachers' judgments of students' academic achievement: A meta-analysis. Journal of Educational Psychology, 104, 743-762.
Swanson, B. B. (1985). Teachers' judgements of first graders' reading enthusiasm. Reading Research and Instruction, 21(1), 41-46.
Tenenbaum, H. R., & Ruck, M. D. (2007). Are teachers' expectations different for racial minority than for European American students? A meta-analysis. Journal of Educational Psychology, 99, 253-273. doi:10.1037/0022-0663.99.2.253
Thiede, K. W., & Dunlosky, J. (2015). Toward a general model of self-regulated study: An analysis of selection of items for study and self-paced study time. Journal of Experimental Psychology: Learning, Memory, and Cognition, 30, 21-39. doi:10.1007/s10212-014-0225-6
Thiede, K. W., Anderson, C. M., & Therriault, D. (2003). Accuracy of metacognitive monitoring affects learning of texts. Journal of Educational Psychology, 95, 66-75.
Tiedemann, J. (2002). Teachers' gender stereotypes as determinants of teacher perceptions in elementary school mathematics. Educational Studies in Mathematics, 50(1), 49-62.
Voss, T., Kunter, M., & Baumert, J. (2011). Assessing teacher candidates' general pedagogical/psychological knowledge: Test construction and validation. Journal of Educational Psychology, 103, 952-969. doi:10.1037/a0025125
Yates, J. F. (1990). Judgment and decision making. Englewood Cliffs, NJ: Prentice Hall.
Yates, J. F. (1988). Analysing the accuracy of probability judgements for multiple events: An extension of the covariance decomposition. Organizational Behavior and Human Decision Processes, 41, 281-299. doi:10.1016/0749-5978(88)90031-3
Appendix 1
TUM School of Education – Master Research on Teaching and Learning
Teacher Survey The Relation Between Foreign Language Students and Teachers’ Postdiction Confidence Scores as a Measure of Metacognitive Monitoring Dennis A. Rivera TUM School of Education
Dear teacher, Thank you very much for your help in this research to improve education. This study aims to identify your level of confidence in your students' performance on the test you have developed. Please complete the following questions to the best of your ability. All information is anonymous and will be used only for statistical purposes.
Ethnographic Data Language Skills What language(s) do you teach? _______________________ How much of the language you teach do you use in class? (E.g. 5%) ______ (100% means that you use the language that you teach all the time)
Experience and Qualifications
Sex: Age: _______ Language Proficiency (CEFR) Native speaker C2 C1
How long have you been teaching (in total)? _______ years or _____ months Do you use activities in class (not in moodle)? Yes No If yes, which activities do you use? Videos (music, films) Worksheets Games (you can choose more than one) Others (please specify): ________________________ Do you hold a university degree? Yes No If yes, what degree do you hold? ____________________________________________ Do you hold any of these qualifications in teaching? Bachelor’s Master’s PhD (You can choose more than one option) CELTA / DELTA DaZ/DaF-Zusatzqualifizierung Other (please specify): _________________________
Class and Test Characteristics
Class characteristics
Which language level course do you teach? A1.1 A1.2 A1.3 A1.4 A1 (compact)
Which of these is true for the students in your class? More males More females Equal
What motivation do most students have? Internal (family/travel) External (ECTS credits/work/Erasmus)
Test characteristics
Have you used this test in a former course? Yes No
If yes, did you modify any questions? Yes No Some
Time of the test: ____:____
Confidence Scores Express your answer in percentages from 0% to 100%. (100% means the question is very difficult and NO student will answer it correctly. 0% means the question is very easy and ALL students will answer it correctly).
Test in General A. How difficult do you think that the exam is overall? ____________
Questions on Language Skills 1. How difficult do you think the Listening question(s) is/are (if your exam has one)? ____________ 2. How difficult do you think the Reading question(s) is/are (if your exam has one)? ____________ 3. How difficult do you think the Writing question(s) is/are (if your exam has one)? ____________ Thank you for your participation in this study. We hope the results will help you continue improving your teaching practices on behalf of your students.
Appendix 2
TUM School of Education – Master Research on Teaching and Learning
Lehrer Umfrage Das Verhältnis zwischen den Antwortsicherheiten von Fremdsprachenschülern und Lehrern zum Bewerten Meta-kognitiver Überwachung Dennis A. Rivera TUM School of Education
Liebe Dozenten, vielen Dank für Ihre Hilfe bei dieser Forschungsarbeit, die das Ziel hat, die Lehre zu verbessern. Diese Studie versucht, Ihr Einschätzungsvermögen bezüglich der Schwierigkeit von Prüfungsfragen zu identifizieren. Bitte beantworten Sie die folgenden Fragen. Alle Informationen sind anonym und werden nur zu statistischen Zwecken verwendet.
Ethnographische Daten Sprachkenntnisse
Geschlecht: Welche Sprache lehren Sie? _______________________ Alter: _______ Wie viel von der Sprache, die Sie lehren, benutzen Sie im Unterricht? (z. B. 5%) _____ Sprachkompetenz (CEFR) (100% bedeutet, dass Sie immer die Sprache benutzen, die Sie lehren) Muttersprachler C2 C1
Erfahrung und Qualifikationen
Wie lange lehren Sie (insgesamt)? _______ Jahre / Monate Verwenden Sie extra Aktivitäten im Unterricht? Ja Nein Wenn ja, welche Aktivitäten verwenden Sie? Videos (Musik, Filme) Arbeitsblätter Spiele (Sie können mehr als eins wählen) Andere (bitte spezifizieren): ______________________ Haben Sie einen Hochschulabschluss? Ja Nein Wenn ja, welchen Hochschulabschluss haben Sie? _________________________________________ Welche dieser Qualifikationen für die Lehre haben Sie? Bachelor Master PhD (Sie können mehr als eins wählen) CELTA DaZ/DaF-Zusatzqualifizierung Andere (bitte spezifizieren): ______________________
Klasse und Prüfungsmerkmale Klassenmerkmale Welches Sprachkursniveau lehren Sie? A1.1 A1.2 A1.3 A1.4 A1 (kompakt) Wie viele Studenten gibt es in Ihrem Sprachkurs (ungefähr)? ______ (_____ männlich; _____ weiblich) Welche Motivation haben die Studenten? Intern (Familie/Reisen) Extern (ECTS-Credits/Arbeit/Erasmus)
Prüfungsmerkmale
Haben Sie diese Prüfung in früheren Kursen benutzt? Ja Nein
Wenn ja, haben Sie Fragen geändert? Ja Nein
Zeit der Prüfung: ____:____
Antwortsicherheiten (Confidence Scores) Antworten Sie in Prozent von 0% bis 100%. (100% bedeutet, dass die Frage sehr schwierig ist und KEIN Student sie richtig beantworten wird. 0% bedeutet, dass die Frage sehr einfach ist und ALLE Studenten sie richtig beantworten werden).
Prüfung im Allgemeinen A. Wie schwer glauben Sie ist die Prüfung insgesamt? ____________
Fragen zur Sprachkompetenz 1. Wie schwer ist die Frage (bzw. sind die Fragen), die das Hören prüft (falls vorhanden)? ____________ 2. Wie schwer ist die Frage (bzw. sind die Fragen), die das Textverständnis prüft (falls vorhanden)? ____________ 3. Wie schwer ist die Frage (bzw. sind die Fragen), die das Schreiben prüft (falls vorhanden)? ____________ Vielen Dank für Ihre Hilfe bei dieser Studie. Ich hoffe, dass die Ergebnisse helfen werden, den Unterricht im Allgemeinen weiter zu verbessern.
Appendix 3
Facultad de Educación de la TUM – Maestría en Investigación de la Enseñanza y el Aprendizaje
Encuesta para Profesores La Relación de Confianza Metacognitiva entre Profesores y Estudiantes de Lenguas Extranjeras para medir su Precisión Metacognitiva Estimados Docentes: Muchas gracias por su ayuda en este trabajo de investigación para mejorar prácticas educativas. Esta investigación intenta identificar el grado de dificultad de las preguntas del test que usted ha desarrollado. Por favor conteste las siguientes preguntas lo mejor que pueda. Toda información es de naturaleza anónima y será usada únicamente con propósitos estadísticos.
Información Etnográfica Habilidades Lingüísticas ¿Es el español su lengua nativa? Sí No ¿Qué porcentaje del idioma que usted enseña utiliza usted en clase? (Ej.: 5%) ________
Sexo: Edad: _______
(100% quiere decir que usted siempre habla en el idioma que usted enseña en clase)
Experiencia y Certificaciones ¿Cuánto tiempo lleva enseñando el idioma? _______ años / meses Además del libro, ¿usa otras actividades en clase (aparte de moodle)? Sí No Si contestó sí, ¿qué actividades usa? Videos (canciones, etc.) Actividades extra Juegos (puede elegir más de una opción) Otros (por favor especificar): ______________________ ¿Ha estudiado usted para ser docente? Sí No ¿Qué título universitario tiene usted? ____________________________________________ ¿Cuál de estos certificados en educación tiene? Licenciatura Máster en ELE PhD (puede elegir más de una opción) CELTA Instituto Cervantes DaF Zusatzqualifizierung Otros (por favor especificar): _____________________
Características de la clase y del examen ¿Cuántos estudiantes hay en su clase (en promedio)? _______ (_____ hombres; _____ mujeres) Tiempo total del examen ______ min. Hora: ____:____ AM/PM ¿Qué tipo de motivación tenía la mayoría? Interna (familia, viajes) Externa (ECTS, Trabajo, Erasmus) ¿Ha usado este examen en cursos previos? Sí No Si contestó sí, ¿ha cambiado actividades? Sí No Algunas Solo el orden
Confianza Metacognitiva Por favor escriba su respuesta en porcentajes del 0% al 100%. (100% significa que usted considera que esta pregunta es muy difícil y ningún estudiante contestará correctamente. 0% significa que usted considera que esta pregunta es muy fácil y todos los estudiantes contestarán correctamente).
Examen en General A. ¿Cuán difícil considera usted que es el examen en general?
____________
Preguntas en el Examen sobre competencias lingüísticas 1. ¿Cuán difícil considera usted que es la(s) pregunta(s) de comprensión oral? _________ 2. ¿Cuán difícil considera usted que es la(s) pregunta(s) de comprensión escrita? _________ 3. ¿Cuán difícil considera usted que es la(s) pregunta(s) de producción escrita? _________ Muchas gracias por su ayuda en esta investigación. Esperamos que los resultados le ayuden a continuar mejorando sus prácticas educativas.
Appendix 4
Faculté de L’Education de la TUM – Master dans l’Enseignement et la Recherche
Enquête pour les professeurs La relation de confiance métacognitive entre les professeurs et les étudiants de langues étrangères pour mesurer leur précision métacognitive
Chers Professeurs : Merci beaucoup pour votre aide dans cette investigation pour améliorer l’éducation. Cette étude cherche à identifier le niveau de difficulté des questions de votre examen. Toute information est anonyme et sera utilisée seulement pour faire des analyses statistiques.
Information Ethnographique Compétences Linguistiques Est-ce que le français est votre langue maternelle ? Oui Non Dans quelle proportion parlez-vous français en classe ? (Ex. : 5%) ________
Sexe : Âge : _______
(100% veut dire que vous parlez français tout le temps en classe)
Expérience et études Depuis combien de temps êtes-vous professeur(e) de français ? _______ ans / mois À part le livre, utilisez-vous d'autres activités (pas sur Moodle) ? Oui Non Si oui, quelles activités utilisez-vous ? Vidéos (chansons, etc.) Feuilles de travail Jeux (Vous pouvez cocher plus d'une réponse) Autres (soyez spécifique SVP) : ___________________ Avez-vous étudié la pédagogie ? Oui Non Quelle est votre profession ? ___________________________________________ Quels certificats avez-vous ? Baccalauréat Maîtrise en FLE PhD (Vous pouvez cocher plus d'une réponse) CELTA Instituto Cervantes DaF Zusatzqualifizierung Autres (soyez spécifique SVP) : ____________________
Caractéristiques de la classe et de l'examen Combien d'étudiants y a-t-il dans votre classe (plus ou moins) ? _____ (____ hommes ; ____ femmes) Temps de l'examen : ______ min. Heure : ____:____ AM/PM Quelle motivation ont vos étudiants ? Interne (famille, voyage) Externe (ECTS, travail, Erasmus) Avez-vous donné cet examen avant ? Oui Non Si oui, avez-vous changé les activités ? Oui Non Quelques-unes Seulement l'ordre
Confiance Métacognitive Ecrivez votre réponse en pourcentages de 0% à 100%. (100% veut dire que vous pensez que la question est trop difficile et PERSONNE ne va y répondre correctement. 0% veut dire que vous pensez que la question est trop facile et TOUT LE MONDE va y répondre correctement).
Examen en Général A. Quel est le niveau de difficulté de votre examen, en général ?
____________
Questions de compétences linguistiques dans l'examen 1. Quel est le niveau de difficulté des questions de compréhension orale ? _________ 2. Quel est le niveau de difficulté des questions de compréhension écrite ? _________ 3. Quel est le niveau de difficulté des questions de production écrite ? _________ Merci beaucoup pour votre aide dans cette investigation. Nous espérons que les résultats vous aideront à continuer à améliorer vos techniques d'enseignement.
Appendix 5
TUM School of Education – Master Research on Teaching and Learning
Student Survey
Dear student,
Thank you very much for your participation. This study aims to identify your level of confidence in your answers. Your answers here will NOT affect your exam score at all. Please complete these questions AFTER you have finished your test. To keep all information anonymous, please create a code in the following way:
1. Write the first two letters of your birth city (example: SP for Springfield, USA) _______
2. Write the day of your birthday (example: 02) _______
3. Write the last two letters of your favourite colour (example: UE for blue) _______
Example code: SP02UE Your code: ______________
Please write this code on the TOP RIGHT CORNER of your exam. All data is anonymous and will be used only for research purposes.
Ethnographic Data
Sex: Age: ______ Nationality: _______________ Languages: ___________________
You are: Student TUM Employee Other
Studies: Bachelor Master's PhD
What are you studying? _______________ Semester: _______
Motivation to study this course (You can choose more than one) Family / Friends Required subject
Erasmus Improve CV
Travel & Culture Extra ECTS credits Other (please specify): _________________________
Personal Effort, Knowledge & Beliefs How much homework did you do? None/few Some Most/all How many classes did you miss in total? 0–2 3–6 7 or more What did you focus your study on for this test? Vocabulary Grammar Other: _________ From 0% to 100%, how much of the language in this course did you know before? ________% Do you think you're good at the language you are learning? Yes No Please choose the option(s) that are true for you: A. You think you're good at Listening Reading Writing Grammar (you can choose more than one or none)
B. You think you're not good at Listening Reading Writing Grammar (you can choose more than one or none)
Confidence scores
For statistical purposes. DO NOT WRITE IN THIS GRAY AREA.
Please express how confident you are that your answers are correct in percentages from 0% to 100%. (0% means you are not at all sure your answer is correct and 100% means you are absolutely sure your answer is correct).
Test in General A. How well do you think you answered all the questions in the exam? _________
Questions on Language Skills 1. How confident are you that your answer to the Listening question is correct? __________ 2. How confident are you that your answer to the Reading question is correct? __________ 3. How confident are you that your answer to the Writing question is correct? __________
*If your exam did not have a listening, reading or writing question, you can leave the answer blank. If the listening section has 2 or more questions, please write how confident you are of your answers overall (maybe 100% confident in the first question but 60% confident in the second; so in total 70% or 80% confident of your answers). Thank you for your participation in this study. We hope the results will help all teachers in the Language Center to improve their teaching practices on behalf of all students.
Appendix 6
TUM School of Education – Master Research on Teaching and Learning
Master's Thesis Research Proposal
The Relation Between Foreign Language Students' and Teachers' Postdiction Confidence Scores as a Measure of Metacognitive Monitoring
Dipl.-Ing. Agr. Denise Lichtig, Director, TUM Language Center
Christina Thunstedt, Vice Director, TUM Language Center
Dennis A. Rivera, Master RTL, TUM School of Education
Introduction Since its creation, the TUM Language Center has played a central role in the internationalization of the Technical University of Munich. It offers both students and staff diverse opportunities to develop linguistic competences in a foreign language. Because of the essential help the TUM Language Center provides to language learners, it is important to support its teachers' development and improve pedagogical practices. To this end, the TUM Language Center Quality Assurance Group organises regular teacher training workshops and conducts student evaluations; however, research on the effectiveness of these workshops on teaching practices, and more specifically on improving teacher judgements, is still missing. For this reason, I hereby present the following proposal to the TUM Language Center to conduct a master's research study that will help teachers become more aware of the accuracy of their diagnostic competence and better calibrate the level of difficulty of their testing instruments. In doing so, this master's research study aims to improve educational practices within the TUM Language Center.
Theoretical background of the study Metacognitive judgements (Schraw, Crippen, & Hartley, 2006), in an educational context, help students determine how well they have learnt something in order to decide whether to stop or continue studying (Ariel, Dunlosky, & Bailey, 2009; Metcalfe & Kornell, 2005). They also help teachers develop the diagnostic competence to assess students' academic achievement (Alvidrez & Weinstein, 1999; McElvany et al., 2012; Ohle & McElvany, 2015; Voss, Kunter, & Baumert, 2011). One key metacognitive judgement for students is Retrospective Confidence (RC). RC judgements help students judge the quality of the responses they provide and their certainty of correctness. These judgements are crucial because they are an outcome of the learning process and the basis on which teachers provide feedback. RC judgements are, however, strongly influenced by Teachers' Judgements (TJs) (Black & Wiliam, 1998; Rodriguez, 2004). TJs help teachers make classroom decisions to choose activities, select learning material, and define the appropriate difficulty of tasks (Martínez, Stecher, & Borko, 2009). They are at the core of teachers' diagnostic competence and help them determine how much has been learnt and calibrate the level of difficulty of their testing instruments. Their inaccuracy can negatively affect students' RC judgements and their learning. Although there is research exploring how Teachers' Judgements (TJs) and students' metacognitive judgement accuracy affect performance (Koriat, Sheffer, & Ma'ayan, 2002; Meeter & Nelson, 2003; Kornell & Metcalfe, 2006), it has mainly been investigated in core curriculum subjects, such as mathematics, language arts, and reading (Südkamp, Kaiser, & Möller, 2012). Foreign language, in contrast, is a subject that has rarely been investigated (Leucht, Triffin-Richards, Vock, Pant, & Koeller, 2012; Mingjing & Detlef, 2015). Furthermore, heterogeneous results from these investigations urge
us to pinpoint the sources of discrepancy in these judgements in order to improve education and students' learning outcomes. Because of the importance of both teachers' and students' ability to monitor learning and the accuracy of their judgements, this master's research proposal aims to examine the correlation between foreign language Teachers' Judgements (TJ) and students' metacognitive Retrospective Confidence (RC) judgements through confidence scores. Examining this correlation can help us discover the degree to which foreign language teachers' judgements and students' metacognitive judgements offer a similar picture of students' achievement when compared with each other and with students' test scores. Likewise, this study aims to identify the factors that might explain the variation in the relationship between these two measures. Thus, the research question is:
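The kind of comparison proposed here is commonly made with the Goodman–Kruskal gamma coefficient, the relative-accuracy measure discussed in the literature this thesis cites (Goodman & Kruskal, 1954; Nelson, 1984). The following minimal sketch is illustrative only; the function name and sample numbers are hypothetical and are not data from the study.

```python
from itertools import combinations

def goodman_kruskal_gamma(confidence, correctness):
    """Relative accuracy (Goodman & Kruskal, 1954; Nelson, 1984):
    +1 if higher confidence always accompanies correct answers,
    -1 for the reverse, 0 for no monotone association.
    Tied pairs are excluded, as gamma requires."""
    concordant = discordant = 0
    for (c1, x1), (c2, x2) in combinations(zip(confidence, correctness), 2):
        prod = (c1 - c2) * (x1 - x2)
        if prod > 0:
            concordant += 1
        elif prod < 0:
            discordant += 1
    if concordant + discordant == 0:
        return float("nan")  # all pairs tied: gamma is undefined
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical data: one student's postdiction confidence per item (0-100)
# and whether each item was answered correctly (1/0).
confidence = [90, 60, 80, 30]
correct = [1, 0, 1, 0]
print(goodman_kruskal_gamma(confidence, correct))  # -> 1.0
```

The same function could, in principle, relate teachers' per-question difficulty ratings (inverted, since 100% here means "very difficult") to students' item scores.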
Research Question Are foreign language teachers' and higher education students' judgements related (a) to each other and (b) to actual academic achievement?
Hypothesis Based on previous research in other subjects, Teachers' Judgements (TJ) are expected to be more accurately calibrated to students' Retrospective Confidence (RC) judgement scores and to students' actual achievement depending on the teachers' relevant qualifications and years of experience (Darling-Hammond, 2000; Johansson, Strietholt, Rosén, & Myrberg, 2014). If foreign language teachers' and learners' confidence scores do not correlate with each other, teachers might have an erroneous idea of what has been learnt in class. Thus, they might design a testing instrument unfit for the learners, which could lead learners to err in their metacognitive evaluations and lower their self-concept.
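Calibration in this sense is often quantified as an absolute-accuracy (bias) score: mean confidence minus mean performance, both on the same 0–1 scale (cf. Schraw, 2009). The sketch below, with hypothetical numbers, shows the computation; a positive value signals overconfidence.

```python
def calibration_bias(confidence_pct, correct):
    """Absolute-accuracy (bias) measure: mean confidence minus mean
    performance, both rescaled to 0-1 (cf. Schraw, 2009).
    Positive -> overconfidence; negative -> underconfidence."""
    mean_conf = sum(confidence_pct) / len(confidence_pct) / 100.0
    mean_perf = sum(correct) / len(correct)
    return mean_conf - mean_perf

# Hypothetical postdictions: 75% mean confidence, but only half of
# the items answered correctly -> overconfident by 0.25.
print(calibration_bias([80, 70, 90, 60], [1, 0, 1, 0]))  # -> 0.25
```

Unlike gamma, this bias score is sensitive to the absolute level of confidence, so the two measures answer complementary questions about monitoring accuracy.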
Research Benefits

This master’s research aims to provide foreign language teachers at the TUM Language Centre with a clear picture of their judgements when preparing an assessment instrument, so that they can analyse and improve their diagnostic competence. Foreign language teachers who are aware of their judgement accuracy will help students better calibrate their Retrospective Confidence (RC) judgements and thus bring about better educational outcomes.
Research Ethics

The information collected will not include names, addresses, or any data that could be used to trace back to any specific individual. The data will be used exclusively for research purposes (statistical analysis). No part of this research process will affect teachers, students, or the normal flow of the classroom in any way. All participants will be free to withdraw from the study at any time.
Appendix 7
TUM School of Education – Master Research on Teaching and Learning
MASTER THESIS RESEARCH PROPOSAL

The Relation Between Foreign Language Students’ and Teachers’ Postdiction Confidence Scores as a Measure of Metacognitive Monitoring

FOREWORD
To support the role that the TUM Language Center plays in the internationalisation of the Technical University of Munich, this research proposal aims to help teachers become more aware of the accuracy of their diagnostic competence when they calibrate the level of difficulty of their testing instruments.

THE STUDY
This study examines the correlation between foreign language Teachers’ Judgements (TJ) and students’ Retrospective Confidence (RC) judgements. The examination is made through confidence scores, and the goal is to discover the degree to which they correlate when compared with each other and with students’ test scores. Likewise, this study aims to identify the factors that might explain the variation in the correlation between these two measures.
Dennis A. Rivera Master RTL TUM School of Education Contact Information
[email protected] +49 157 71160553
Thesis Advisor Dipl. Psych. Elisabeth Pieger
RESEARCH REQUIREMENTS
This research requires the assistance of all foreign language teachers at the TUM Language Centre currently teaching an A1-level course (A1.1, A1.2, and A1 compact). Once teachers have constructed their exams, they will fill out a survey to record their confidence scores. On the day of the final exam, teachers will be asked to hand out a different survey to a minimum of 10 students. The students will fill it out after they have finished their exam to record their confidence scores. They will create a code to help teachers identify which survey belongs to which student. Finally, after marking the exams, teachers will be asked to complete the students’ surveys by writing the actual exam scores next to every confidence score (see Appendix 1).

RESEARCH ETHICS
The information collected will not include names or any data that could be used to trace back to any specific individual. The data will be used exclusively for research purposes (statistical analysis). No part of this research process will disturb teachers, students, or the normal flow of the lectures in any way.

RESEARCH BENEFITS
This master’s research aims to provide teachers with a picture of the accuracy of their diagnostic competence when preparing an assessment instrument. Teachers’ accurate diagnostic competence will help students better calibrate their Retrospective Confidence (RC) judgements and bring about better educational outcomes.
Teaching and Learning with Digital Media TUM School of Education Technical University of Munich
Chairholder Prof. Dr. Maria Bannert Teaching and Learning with Digital Media TUM School of Education Technical University of Munich
Contact Information Arcisstrasse 21 80333 Muenchen +49 89 289 24368
[email protected]
Appendix 8
TUM School of Education – Master Research on Teaching and Learning
The Relation Between Foreign Language Students’ and Teachers’ Postdiction Confidence Scores as a Measure of Metacognitive Monitoring Dennis A. Rivera Master RTL TUM School of Education
[email protected]
Dear Student,

Thank you for participating in this master study. Below you will find important information about the study.
1. The Study This study seeks to discover whether you and your teacher identify a similar degree of difficulty in the questions on your exam.
2. The Method You and your teacher are asked to write a number (from 0 to 100) that best represents the difficulty of the questions on your exam; this is called a confidence score. These confidence scores will later be compared with each other and with your true exam results to see how accurate they were.
3. The Importance When the tasks on your exam have an appropriate level of difficulty and you have studied and learnt the language well, you will feel challenged and motivated to learn. On the other hand, if your teacher over- or underestimates the level of difficulty of an exam, this can lead to negative learning outcomes.
4. The Anonymity Confidentiality in this study is provided to the fullest extent possible. The study is completely anonymous to the researcher; however, to compare your confidence scores to your true exam results, your teacher will use the code you created to match your survey with your exam. Your teacher will not record any names, and your answers on the Student Survey will NOT affect your exam score in any way. The data will be used for statistical purposes only, and the results aim to help teachers at the TUM Language Center continue to improve their exams.
5. Voluntary Participation Please keep in mind that your participation in this research is completely voluntary and that you can stop participating at any time for any reason. This decision will not influence your relationship with your teacher, the TUM Language Center, or any other institution in any way. By signing this form, you consent to participate in this master study. You state that you understand the nature of this project and wish to participate. You are not waiving any of your legal rights by signing this form. A copy of this form will be given to you. Your signature below indicates your consent.

Signature: ___________________________ Participant (name):
Date: _______________________
Signature: ___________________________ Researcher: Dennis A. Rivera
Date: _______________________
MRTL – TUM School of Education
This research study has been reviewed by the TUM School of Education and its implementation has been approved by the TUM Language Center. If you have any questions about this process or your rights as a participant in this study, you can contact the Chair of Teaching and Learning with Digital Media of the TUM School of Education: Arcisstrasse 21, 80333 Muenchen, +49 89 289 24368.