Is a Dialogue-Based Tutoring System that Emulates Helpful Co-constructed Relations during Human Tutoring Effective?

Patricia Albacete, Pamela Jordan, and Sandra Katz
Learning Research and Development Center, University of Pittsburgh, Pittsburgh, PA 15260, USA
[email protected]

Abstract. We present an initial field evaluation of Rimac, a natural-language tutoring system which implements decision rules that simulate the highly interactive nature of human tutoring. We compared this rule-driven version of the tutor with a non-rule-driven control in high school physics classes. Although students learned from both versions of the system, the experimental group outperformed the control group. A particularly interesting finding is that the experimental version was especially beneficial for female students.

Keywords: ITS, natural-language tutoring systems, physics education

1 Introduction

Research on one-on-one human tutoring has shown that its highly interactive nature largely accounts for its effectiveness (e.g., [5], [8], [16]). However, machine-learning driven analyses of automated tutoring indicate that neither the amount of interaction during tutoring (e.g., the frequency of exchanges, such as question-answer exchanges), nor the granularity of the interaction (e.g., whether an exchange addresses a problem-solving goal or its sub-goals) predicts how much students learn from tutoring. Instead, they point to what content is addressed and how it is addressed in a particular context ([6], [14]) as key features of tutorial dialogue. This research, in turn, highlights the need to further specify what these features are, so that they can be more generally simulated in natural-language tutoring systems. Since tutoring is carried out through language, some developers of dialogue-based tutoring systems have stressed the need for more research aimed at identifying particular linguistic mechanisms that support learning during tutoring (e.g., [3], [15]). Several research teams, including ours, have responded to this call through various approaches, including machine learning driven analyses of annotated interactions between students and an automated tutor, and statistical analyses of annotated human tutorial dialogue corpora (e.g., [3, 4], [12], [17]). The tie that binds these studies is their focus on the interactions between the student and the tutor, rather than on the contributions of either party in isolation. Correspondingly, this research is grounded in linguistic theories that can help describe the relationships between speaker turns.

For example, Speech Act Theory is well-suited for classifying the intention associated with each speaker's utterance during an instructional dialogue (e.g., to ask a question, make an assertion) and for highlighting interaction patterns, but it is inadequate for capturing how information in the tutor's turn relates to information in the student's turn, and vice versa; in other words, for highlighting where in the interaction knowledge co-construction may be taking place. Rhetorical Structure Theory (RST) captures both informational and intentional relationships between parts of a discourse (spoken or written text, and dialogue) and is generalizable across domains ([13]). For these reasons, we used RST to identify potentially effective rhetorical relationships during live physics tutorial dialogues and to express these relationships as decision rules that could be explicitly encoded within a natural-language tutoring system for physics, Rimac. This work on specifying a set of dialogue decision rules is summarized in the next section; a more detailed description can be found in [11]. The current paper focuses on an initial evaluation of an experimental version of Rimac that deliberately incorporates these decision rules, as compared with a control version of the tutor that does not.

2 Rimac and the Decision Rules that Simulate Interactivity

Rimac is a web-based natural-language tutoring system that scaffolds students in acquiring a deeper understanding of the physics concepts and principles associated with quantitative physics problems. It performs this task through automated reflective dialogues that students engage in after solving quantitative physics problems on paper. Rimac’s dialogues were developed following a common framework for generating automated dialogues known as a directed line of reasoning, or DLR ([7]). During a DLR, the tutor presents a series of carefully ordered questions to the student. If the student answers a question correctly, he advances to the next question in the DLR. Otherwise, the system launches a remedial sub-dialogue and then returns to the main line of reasoning after the sub-dialogue has completed. In order to simulate the interactivity of human tutoring in Rimac, we first used RST to characterize the co-constructed discourse relations that took place during live physics tutoring sessions, by manually annotating a large corpus of reflective discussions between human tutors and students. We focused on inter-speaker relationships that implement abstraction and specification (e.g., part:whole, step:process) since these relationships have been shown to promote learning (e.g., [17]) and on other relationships that are common in the domain of physics, such as comparison and conditional reasoning (e.g., Louwerse et al., 2008; cited in [11]). For example, the student says, “The acceleration would be positive” and the tutor follows up with “Right, the x component of the acceleration would be positive.” According to RST, this exchange would be classified as a whole:part relation, since the student names a vector (acceleration) and the tutor refers to a specific component of that vector. To take another example, a student says, “the velocity is 14” and the tutor specifies, “it is 14 m/s.” This is a co-constructed object:attribute relation in which the student provides a value for velocity and the tutor specifies its units. After we tagged each student-tutor and tutor-student dialogue exchange according to the types of discourse relations that they embodied, we searched for correlations between the frequency of each type of relation
and learning, as measured by students' gain scores from pretest to posttest. We found that the frequency of several types of co-constructed relations in the tagged corpus predicted learning and, moreover, that these correlations varied according to student ability level [11]. In order to express these potentially beneficial tutorial interactions as decision rules that could guide dialogue authoring, we needed to specify the discourse contexts, or "triggering conditions", under which they occurred: for example, whether they tended to take place at the beginning, middle, or end of a dialogue, or whether they were triggered by particular types of student errors. The 11 decision rules that stemmed from this process guide dialogue authoring by specifying how the tutor should respond to different types of student input at each step of the dialogue, that is, after a student's response to each question in the main line of reasoning or during a remedial sub-dialogue. Table 1 presents an example of one of these rules. Along with related rules, the decision rule shown in Table 1 drives Rimac's implementation of co-constructed conditional reasoning relations, which we found predicted learning especially among low pretest scorers [11].

Table 1. A conditional reasoning exchange during human tutoring and the associated decision rule

Reflection Question: A bungee jumper of mass 80 kg just had an exciting ride from the center of a bridge. Unfortunately, the bungee, fully stretched, leaves him 18 meters above the ground. What is the tension in the bungee as he is hanging there?
Student: Tension = weight (student assertion of physical situation)
Tutor: Why does the tension equal the weight in this problem? (tutor prompt for condition(s))
Student: Because there are no other outside forces acting on the bungee/jumper system. (condition)
Discourse relation: T-S: situation:condition
Decision rule that stemmed from conditional relationships such as this: If the student does not provide an explanation to support a claim, especially at the beginning of a reflective dialogue, prompt the student to explain why this claim is true in the given situation.
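To make the DLR framework and the decision rule in Table 1 concrete, the following is a minimal sketch of how a rule-driven reflective dialogue step might be structured. It is illustrative only: the class and function names are our own assumptions, not Rimac's or TuTalk's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """One question in a directed line of reasoning (DLR)."""
    question: str
    correct: set                                      # acceptable short answers (lowercased)
    remediation: list = field(default_factory=list)   # remedial sub-dialogue prompts
    needs_explanation: bool = False                    # apply the Table 1 rule after a bare claim

def run_dlr(steps, ask):
    """Walk the main line of reasoning; on a wrong answer, run the remedial
    sub-dialogue, then return to the main line (the standard DLR framework)."""
    for step in steps:
        answer = ask(step.question + " ").strip().lower()
        if answer in step.correct:
            if step.needs_explanation:
                # Decision rule from Table 1: the claim was not supported,
                # so prompt the student to explain why it holds in this situation.
                ask("Why do you think this? ")
            continue                                   # advance to the next DLR question
        for prompt in step.remediation:                # remedial sub-dialogue
            ask(prompt + " ")

if __name__ == "__main__":
    steps = [
        Step(
            question="Is the acceleration in the same direction or the opposite direction of the motion?",
            correct={"opposite", "opposite direction"},
            remediation=["The object is slowing down. What does that tell you about the direction of the acceleration?"],
            needs_explanation=True,
        ),
    ]
    run_dlr(steps, ask=input)   # 'ask' can be any UI callback; input() keeps the sketch self-contained
```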

Would a version of Rimac that deliberately implements this suite of dialogue decision rules outperform a more traditional version of the system, which does not? To address this question, we conducted an experiment to compare two versions of reflective dialogues in Rimac. For the control version, dialogue authors, who were experienced physics tutors or teachers, were instructed to write the best dialogues that they could by following the standard authoring framework (i.e., DLRs with embedded remedial sub-dialogues). For the experimental version, the same authors were asked to revise the dialogues they had produced for the control version by applying the decision rules in appropriate contexts as frequently as possible, without altering the content discussed during the control version of the dialogues. Table 2 illustrates these two versions of Rimac's dialogues. They were produced using the TuTalk dialogue development toolkit [10]. To keep the accuracy of automatic recognition of student input high, most questions posed by the tutor require a short answer. However, implementation of some decision rules, such as the one shown in Table 1, invites an explanation from the student. Given the complexity of interpreting such responses, and the frustration that students experience when the system misinterprets them, we bypassed this problem by presenting a menu of responses for students to map theirs to, as shown in the first dialogue excerpt in Table 2.

It is important to note that some dialogue rules were unintentionally incorporated within the control dialogues. This is not surprising, given that the decision rules stem from analyses of naturally occurring human tutorial interaction. In other words, it would have been unnatural for dialogue authors to completely avoid employing the interaction patterns embodied in the decision rules while generating the control dialogues. Nonetheless, we verified that the frequency of each rule was significantly lower in the control version of each dialogue than in its experimental counterpart.

Table 2. Dialogue excerpts from the experimental (rule-driven) and control versions of Rimac (T = Tutor; S = Student)

Example 1, rule-driven tutor. Decision rule: If the student does not provide an explanation to support a claim, especially at the beginning of a reflective dialogue, prompt the student to explain why this claim is true in the given situation.
T: Is the direction of the acceleration in the same direction or the opposite direction of the motion?
S: Opposite.
T: Why do you think this?
S: Because there is deceleration so the acceleration has to be opposite the motion.
T: Please select the best match below for the answer you just gave:
   o The problem said so
   o Because when an object is slowing down acceleration opposes its motion
   o Because when there is deceleration the acceleration is always negative
   o Another reason.
T: I agree. Let's now step through the reasoning.

Example 1, control tutor. A correct answer from the student moves him on to the next step in the line of reasoning.
T: Is the direction of the acceleration in the same direction or the opposite direction of the motion?
S: Opposite.
T: Very good. Let's now step through the reasoning.

Example 2, rule-driven tutor. Decision rule: When the student answers a question incorrectly, if possible, state the condition(s) under which the answer would be correct.
S: The final velocity is larger.
T: If the supertanker were speeding up the final velocity would be larger. But the supertanker is slowing down, hence the final velocity is smaller than the initial velocity.

Example 2, control tutor. When the student answers a question incorrectly he receives a remediation.
S: The final velocity is larger.
T: Not quite. The supertanker is slowing down, hence the final velocity is smaller than the initial velocity.
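Example 1 above shows the menu fallback for explanation questions. A rough sketch of that interaction pattern, with hypothetical option text and function names (this is not TuTalk's API), might look like:

```python
def ask_with_menu_fallback(ask, question, menu_options):
    """Invite a free-form explanation, then have the student map their own
    answer onto a small menu so the system never has to parse it."""
    free_response = ask(question)          # recorded, but not interpreted automatically
    menu = "\n".join(f"  {i}. {opt}" for i, opt in enumerate(menu_options, start=1))
    choice = ask(f"Please select the best match below for the answer you just gave:\n{menu}\n> ")
    return free_response, menu_options[int(choice) - 1]

if __name__ == "__main__":
    options = [
        "The problem said so",
        "Because when an object is slowing down acceleration opposes its motion",
        "Because when there is deceleration the acceleration is always negative",
        "Another reason",
    ]
    print(ask_with_menu_fallback(input, "Why do you think this? ", options))
```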

3 System Evaluation

Participants. The study was conducted in high school physics classes at four schools in the Pittsburgh, PA area, following the course unit on kinematics, with 256 students participating: 117 females and 137 males (two students in the experimental condition did not report gender). Students were randomly assigned to conditions: experimental (decision rule-based version; N=125: 54 females, 69 males) and control (standard DLR version; N=131: 63 females, 68 males).

Materials. After consulting with high school physics teachers, we selected three quantitative kinematics problems and developed several reflective dialogues per problem, which addressed their associated concepts. Students engaged in these dialogues after solving the problems. Also with teachers' advice, we developed tests that would allow us to measure students' learning gains after using the tutor. The pretest was isomorphic to the posttest. The tests consist of 14 items: 8 multiple-choice and 6 open-response problems. We developed a rubric to promote consistent scoring across graders, who were experts in the physics content. Tests were scored by one grader and reviewed by another to ensure fidelity to the rubric; adjustments were made when necessary.

Procedure. On the first day of data collection, the teacher gave the pretest in class and assigned four homework problems: the three problems mentioned in the materials section, for which we had developed reflective dialogues, plus an extra problem. The purpose of the extra problem was to control for time on task, since we expected the experimental dialogues to take longer to work through than the control dialogues. This problem was isomorphic to one other assigned problem, to control for content. During the next one or two days (depending on whether classes were 45 min. or 80 min. long), students used Rimac in class. For each homework problem, students watched a video "walkthrough" of a sample solution and then engaged in the problem's reflective dialogues. The videos focused on procedural/problem-solving knowledge, while the dialogues focused on conceptual knowledge. Finally, at the next class meeting, teachers administered the posttest.

Results. Data analysis addressed four objectives, namely, to determine whether: a) students who interacted with the tutor learned, as measured by the amount of gain from pretest to posttest, regardless of treatment condition; b) there was a difference in amount of gain between conditions; c) there was an aptitude-treatment interaction; and d) there were gender differences in learning from one or both versions of the system. The data were first analyzed considering all problems together; then multiple-choice and open-response problems were considered separately. We expected that students' ability to verbalize physics concepts and reasoning would be better fostered by the experimental version of the system, and that open-response items would detect this better than multiple-choice test items. Moreover, open-response problems do not allow for guessing the correct answer to the extent that multiple-choice items do.

Did students learn from the tutoring system? To determine whether students' interaction with Rimac, irrespective of condition, promoted learning, we compared pretest scores with posttest scores by performing paired samples t-tests. The results are shown in Table 3. When all students were considered together, we found a statistically significant difference between pretest and posttest for multiple-choice problems, open-response problems, and all problems combined. When students were considered by condition, we found a statistically significant difference between pretest and posttest for all problems and for multiple-choice problems. However, for open-response problems, we found a significant difference only in the experimental group.
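The pre/post comparisons just described are paired samples t-tests. As a minimal illustration (not the authors' analysis code; the data frame and column names are assumptions), such a comparison could be run as follows:

```python
import pandas as pd
from scipy import stats

# Hypothetical per-student scores; the data frame and column names are assumptions.
scores = pd.DataFrame({
    "condition": ["experimental"] * 3 + ["control"] * 3,
    "pretest":   [5.0, 4.0, 6.0, 5.0, 4.5, 5.5],
    "posttest":  [6.5, 4.5, 7.0, 5.5, 4.5, 6.0],
})

# Paired samples t-test: did students gain from pretest to posttest, irrespective of condition?
t, p = stats.ttest_rel(scores["posttest"], scores["pretest"])
print(f"all students: t = {t:.2f}, p = {p:.3f}")

# The same comparison within each condition separately.
for cond, group in scores.groupby("condition"):
    t, p = stats.ttest_rel(group["posttest"], group["pretest"])
    print(f"{cond}: t = {t:.2f}, p = {p:.3f}")
```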

Taken together, these findings suggest that students in both conditions learned from the system; on average, they could solve a half to one more problem correctly on the posttest than on the pretest. However, the experimental version of the system was perhaps more effective in helping students express the physics reasoning required to solve the open-response problems.

Did one version of the tutor promote learning better than the other? Before testing our hypothesis that students who used the experimental version of Rimac would outperform students who used the control version, we compared pretest scores and time on task between conditions. There were no statistically significant between-group differences in pretest scores considering all problems combined, multiple-choice problems, or open-response problems. However, we found that the mean time on task in the experimental condition (M=48.2 minutes, SD=13.5 minutes) was significantly higher than in the control condition (M=45.04 minutes, SD=12.18 minutes), t(254)=1.963, p=0.05, which prompted us to perform an ANCOVA to test the effect of condition on gain score, using time on task as a covariate to control for its possible effects. (Before performing the ANCOVA we verified that there was no interaction between condition and time on task; none was present, F(1,252)=.001, p=.977.) Controlling for time on task, we found a significant effect of condition on gains for all problems combined (F(1,252)=4.478, p=.035), but not for multiple-choice problems or open-response problems considered separately.

Table 3. Learning from interacting with the system (pretest and posttest means and SDs; normalized mean and SD in parentheses)

All problems
  All students:  pretest M=5.04, SD=2.48 (0.36, 0.18); posttest M=5.87, SD=2.61 (0.42, 0.19); t(255)=7.55, p < 0.01
  Experimental:  pretest M=4.95, SD=2.55 (0.35, 0.18); posttest M=6.00, SD=2.65 (0.43, 0.19); t(124)=6.92, p < 0.01
  Control:       pretest M=5.13, SD=2.42 (0.37, 0.17); posttest M=5.74, SD=2.57 (0.41, 0.18); t(130)=3.92, p < 0.01
Multiple choice
  All students:  pretest M=3.53, SD=1.73 (0.44, 0.22); posttest M=4.19, SD=1.69 (0.52, 0.21); t(255)=6.78, p < 0.01
  Experimental:  pretest M=3.52, SD=1.80 (0.44, 0.23); posttest M=4.33, SD=1.72 (0.54, 0.22); t(124)=5.94, p < 0.01
  Control:       pretest M=3.55, SD=1.67 (0.44, 0.21); posttest M=4.05, SD=1.66 (0.51, 0.21); t(130)=3.72, p < 0.01
Open response
  All students:  pretest M=1.51, SD=1.12 (0.25, 0.19); posttest M=1.68, SD=1.28 (0.28, 0.21); t(255)=2.80, p = 0.01
  Experimental:  pretest M=1.43, SD=1.11 (0.24, 0.19); posttest M=1.68, SD=1.33 (0.28, 0.22); t(124)=2.79, p = 0.01
  Control:       pretest M=1.58, SD=1.14 (0.26, 0.19); posttest M=1.68, SD=1.24 (0.28, 0.21); t(130)=1.19, p = 0.24
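The condition comparison described above, an ANCOVA on gain scores with time on task as a covariate after checking for a condition-by-time-on-task interaction, could be expressed roughly as follows. This is an illustrative sketch using statsmodels, not the authors' analysis code; the data frame and column names are assumptions.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical per-student data; column names are assumptions for illustration.
df = pd.DataFrame({
    "condition":    ["experimental", "experimental", "control", "control"] * 3,
    "gain":         [1.5, 0.5, 0.5, 0.0, 2.0, 1.0, 1.0, 0.5, 0.5, 1.5, -0.5, 1.0],
    "time_on_task": [50, 45, 44, 40, 55, 48, 46, 43, 47, 52, 41, 45],
})

# 1) Check the homogeneity-of-slopes assumption: no condition-by-time-on-task interaction.
interaction = smf.ols("gain ~ C(condition) * time_on_task", data=df).fit()
print(sm.stats.anova_lm(interaction, typ=2))

# 2) ANCOVA proper: effect of condition on gain, controlling for time on task.
ancova = smf.ols("gain ~ C(condition) + time_on_task", data=df).fit()
print(sm.stats.anova_lm(ancova, typ=2))
```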

The results of further between-group comparisons using t-tests are presented in Table 4. Gains were defined as (posttest – pretest), and their normalized versions as (posttest / #problems) – (pretest / #problems). Consistent with the ANCOVA results, when all problems were considered together, the mean gain of the experimental condition was significantly higher than the mean gain of the control condition. However, no significant difference between conditions was found when considering multiple-choice problems and open-response problems individually. Overall, even though the practical difference between gain scores was very modest (on average, about 0.5 items, or about a 4% increase in the number of problems solved correctly), these findings suggest that the experimental version of the system has the potential to outperform its control counterpart.

Table 4. Comparing learning between conditions (gain = posttest – pretest; normalized gain in parentheses)

All problems
  Experimental: gain M=1.06, SD=1.71 (0.08, 0.12)
  Control:      gain M=0.60, SD=1.76 (0.04, 0.13)
  t(254)=2.078, p = 0.04
Multiple choice
  Experimental: gain M=0.81, SD=1.53 (0.10, 0.19)
  Control:      gain M=0.50, SD=1.55 (0.06, 0.19)
  t(254)=1.600, p = 0.11
Open response
  Experimental: gain M=0.24, SD=0.98 (0.04, 0.16)
  Control:      gain M=0.10, SD=0.97 (0.02, 0.16)
  t(254)=1.173, p = 0.24

Did the effect of condition on learning vary depending on student ability? In other words, was there an aptitude-treatment interaction (ATI)? Using course grade as a measure of aptitude, we found no significant interaction between condition and aptitude in their effect on overall gain (F(1,251)=.586, p=.45), multiple-choice gain (F(1,251)=1.751, p=.19), or open-response gain (F(1,251)=.553, p=.46) (the grade of one student was not reported by her teacher). Since grading can vary across schools and teachers, we also investigated ATI using pretest score (i.e., prior knowledge) as a measure of aptitude. Similarly, we did not find a significant interaction between condition and aptitude in their effect on overall gain score (F(1,252)=.048, p=.83), gains on multiple-choice problems (F(1,252)=.096, p=.76), or gains on open-response problems (F(1,252)=1.002, p=.32). These results suggest that the effect of condition on learning does not vary depending on students' ability in physics. Correspondingly, given that students in the decision rule-based condition significantly outperformed students in the control condition, the ATI analyses indicate that students using the decision rule-guided dialogues learned more across ability levels.

Did the effectiveness of each version of the tutor depend on gender? To investigate possible gender differences, we first performed a t-test comparing gains of females with gains of males, for both conditions combined. We found no statistically significant differences between mean gains for all problems combined and for multiple-choice items. However, females' mean gains on open-response items (M=0.37, SD=0.91) were significantly higher than males' mean gains (M=0.001, SD=1.01), t(252)=3.025, p < 0.01.
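The gender comparison above is an independent-samples t-test on gain scores. A minimal illustration (assumed data frame and column names, not the authors' analysis code):

```python
import pandas as pd
from scipy import stats

# Hypothetical per-student gains on open-response items; columns are assumptions.
df = pd.DataFrame({
    "gender": ["F", "F", "F", "M", "M", "M"],
    "open_response_gain": [0.5, 0.0, 1.0, 0.0, -0.5, 0.5],
})

females = df.loc[df["gender"] == "F", "open_response_gain"]
males = df.loc[df["gender"] == "M", "open_response_gain"]

# Independent-samples t-test: do female and male mean gains differ?
t, p = stats.ttest_ind(females, males)
print(f"t = {t:.3f}, p = {p:.3f}")
```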
