Pedagogical Agents that Interact with Learners

Lei Qu, Ning Wang, and W. Lewis Johnson
Center for Advanced Research in Technology for Education (CARTE), USC/ISI
4676 Admiralty Way, Suite 1001, Marina del Rey, CA 90292
{leiqu, ning, johnson}@isi.edu
Abstract

In this paper, we describe a method for pedagogical agents to choose when to interact with learners in interactive learning environments. This method is based on observations of human tutors coaching students in on-line learning tasks. It takes into account the focus of attention of the learner, the learner's current task, and the expected time required to perform the task. A Bayesian network model combines evidence from eye gaze and interface actions to infer learner focus of attention. The attention model is combined with a plan recognizer to detect learner difficulties that warrant intervention. This capability is part of a pedagogical agent able to interact with learners in socially appropriate ways.
1. Introduction

Animated pedagogical agent technology seeks to improve the effectiveness of intelligent tutoring systems by enabling them to interact in a natural, more engaging way. However, work to date has focused mainly on improving the output side of the interface, through the inclusion of expressive, lifelike behaviors [24]. The focus of the work described in this paper is on the input side: enabling the agent to track the learner's activities and infer learner state, so that it can initiate interactions with the learner at appropriate times and in appropriate manners. For instance, we want to enable agents to intervene proactively when the learner appears to be confused or stuck. Proactive interaction is important for pedagogical agents because learners are sometimes reluctant to ask for help [7]. If the learner is truly stuck, it may be appropriate to interrupt and offer assistance. That requires an ability to monitor the learner's activities continuously. Failure to track the learner's activities accurately could cause the agent to interrupt the learner with advice when the learner does not really need it, which could be very annoying.

In this paper, we present our work on enabling pedagogical agents to track learner activities and focus of attention, and to generate analyses that can be used to determine when best to interact with the learner. Our approach monitors both the learner's interface actions and the focus of eye gaze. It infers the learner's focus of attention using a Bayesian network [22], which allows reasoning under uncertainty over various sources of information, as discussed later. It then combines this method for tracking learner focus of attention with a plan recognition capability for interpreting learner actions and forming expectations of future actions. This approach should be of general use for improving the interactivity of intelligent tutoring systems. Our particular motivation for conducting this work is to create pedagogical agents that are able to interact with learners in more socially appropriate ways, sensitive to rules of politeness and etiquette, and able to influence the learner's motivational as well as cognitive state [5, 7, 15].

2. Requirements from Background Study
In an earlier study, we investigated how human tutors coach learners interacting with the Virtual Factory Teaching System (VFTS) [15, 8], an on-line factory system for teaching industrial engineering concepts and skills. We found that tutors used the following types of information, observed and inferred, in deciding when and how to interact with the learner:
• The task that the learner was expected to perform next. As the learner read the tutorial, the tutor would note what the learner was reading, infer what the learner ought to do next based on the content of the tutorial, and then intervene if the learner had difficulty performing that task.
• The learner's focus of attention. The tutor would note the learner's focus of eye gaze, as well as keyboard and mouse actions, in order to infer the learner's focus.
• The learner's self-confidence, inferred from the questions that the learner asked.
• The learner's effort expended, as evidenced by the amount of time that the learner spent reading the tutorial and carrying out the tasks described in it.
We therefore designed the user interface of our new system to give an agent access to sufficient information about the learner, her/his activities, cognitive
and motivational state. The new interface includes three major components:
• The VFTS interface, which reports each keyboard entry and mouse click that the learner performs on it.
• WebTutor, an on-line tutorial used to teach the learner industrial engineering concepts and instructions.
• The Agent Window, whose left part is a text area used to communicate with the agent (or with a human tutor in Wizard-of-Oz mode) and whose right part is an animated character that generates speech and gestures.
The input devices consist of a keyboard, a mouse, and a small camera focused on the learner's face. This interface thus provides information similar to the information that human tutors use in tracking learner activities. We then conducted a Wizard-of-Oz study with the interface to verify that the information collected through it was sufficient to enable an agent to interact in a manner similar to the way human tutors do. In the study, the experimenter was able to view the learner's actions on the display, as well as the view of the face picked up by the camera, in order to infer the approximate focus of gaze. We concluded from these experiments that human tutors were able to track learner actions using these modalities about as well as they could in in-person interactions.
3. Description of Our Method

There are four components in our approach to choosing interaction points:
• WebTutor provides information about what task the learner is working on, as well as the actions the learners perform as they read through the tutorial.
• The plan recognizer in the VFTS monitors the learner's actions and tracks the learner's progress through the task.
• The focus of attention module takes input from the WebTutor, VFTS, and Agent interfaces, as well as eye gaze information, in order to infer the learner's focus of attention.
• The decision module uses focus of attention and plan recognition information to infer learner difficulties such as confusion, and to decide when to intervene.
These four components provide agents with information about learners' states and their expected tasks, so agents are able to detect learners' confusion, interact with them, and help them overcome their difficulties. Our method is similar to that of the Lumière project [18], which also tracks user activities by monitoring actions and tracking tasks. The differences are that our system has more information about learner activity (e.g., eye gaze) as well as more information about the learner's task (from the WebTutor). It can therefore track learner activities with greater confidence, and can focus more on detecting and categorizing learner difficulties.
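To make the data flow among these components concrete, the following sketch (in Python) shows one way a monitoring cycle could wire them together. The class and method names are hypothetical, since the paper does not specify these interfaces at this level of detail; this is an illustrative sketch, not the system's actual implementation.

class AgentController:
    # Hypothetical wiring of the four components described above.

    def __init__(self, web_tutor, plan_recognizer, attention_model, decision_module):
        self.web_tutor = web_tutor              # current task and tutorial events
        self.plan_recognizer = plan_recognizer  # tracks progress through the task
        self.attention_model = attention_model  # Bayesian focus-of-attention model
        self.decision_module = decision_module  # chooses interaction points

    def tick(self, interface_events, gaze_sample):
        # Run one monitoring cycle (e.g., once per second).
        task = self.web_tutor.current_task()
        progress = self.plan_recognizer.update(interface_events, task)
        focus = self.attention_model.update(interface_events, gaze_sample)
        if self.decision_module.detects_difficulty(task, progress, focus):
            return self.decision_module.choose_intervention(task, progress, focus)
        return None  # no intervention; keep observing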
3.1 Tracking learner focus of attention under uncertainty

Information about eye gaze is extremely useful for detecting user focus of attention. In our system we want an eye tracking approach that is unobtrusive and that requires no special hardware or calibration. Common eye tracking methods require special hardware attached to the head, restrict head movement, and/or require calibration [21]. This has limited their applicability in learning systems to laboratory experiments (e.g., [20]).
Figure 1: Screenshot of the eye tracking program

We use a program developed by Larry Kite in the Laboratory for Computational and Biological Vision at USC (shown in Figure 1) to track eye gaze. It estimates the coordinates on the video display that correspond to the focus of gaze. The approach is to discover a mapping from a multidimensional feature space, comprising features extracted from processed real-time video images of the student's head, to the two-dimensional space of x-y pixel coordinates on the video monitor. For this purpose, it tracks the following landmarks: the pupil and two corners of each eye, the bridge of the nose, the tip of the nose, and the left and right nose corners. This technique is robust and can be used easily in almost any environment where a camera is installed. It works without calibration, although accuracy suffers as a result, particularly in the vertical direction. Fortunately, we can combine the eye tracking information with other learner interaction data in order to improve accuracy.

The agent uses two types of information to infer the learner's focus: (1) information with certainty, i.e., mouse click, typing, and window scrolling events in the VFTS, WebTutor, and Agent Window; and (2) information with uncertainty, namely data from the eye tracking program and inferences about the current state based on past events.

Information with certainty. When certain events occur, the agent can infer the focus of the learner's attention with near certainty. For example, if a learner clicks a button or scrolls a window, the learner's focus must be on that window, assuming that the hand and eye are coordinated.
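A minimal sketch of how this certain case might be handled, with hypothetical event and class names: every keyboard, mouse, or scroll event pins the focus of attention to the window that received it, and the age of that evidence is what the uncertain case, discussed next, has to reason about.

import time

CERTAIN_EVENTS = {"click", "keyboard", "scroll"}

class FocusTracker:
    def __init__(self):
        self.focus_window = None      # last window known with near-certainty
        self.last_event_time = None   # when that evidence was observed

    def on_interface_event(self, window: str, kind: str) -> None:
        # Called by the VFTS, WebTutor, or Agent Window for each learner action.
        if kind in CERTAIN_EVENTS:
            self.focus_window = window
            self.last_event_time = time.time()

    def seconds_since_certain_evidence(self) -> float:
        # Age of the last certain observation; used to weight the
        # uncertain, gaze-based evidence described below.
        if self.last_event_time is None:
            return float("inf")
        return time.time() - self.last_event_time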
Information with uncertainty. There are also intervals of time during which the learner is looking at the screen but taking no action. During these intervals the system combines evidence from eye tracking with evidence from previous interface actions to make probabilistic inferences about the learner's focus of attention. For example, if a learner performs a mouse click in the VFTS window at time n, the agent knows the learner's exact focused window in the VFTS at time n. After that, the focus of attention is likely to stay on the VFTS window for a period of time. We found from our experiments that the probability of the learner's focus remaining on the same window after 5 seconds was more than 60%. Table 1 shows the probabilities of the learner's focus remaining in the same window for different intervals following an interface action.

Time after action:  n+1    n+2    n+3    n+4    n+5    n+10
Probability:        >95%   >85%   >75%   65%    60%    –
Table 1: Probabilities of the learner's focus remaining in the same window at different intervals after an interface action.

[...] Effort greater than 1.0 with Progress remaining the same typically means that the learner is failing to make progress on taski; if ErrorTries also remains the same, the learner is probably confused by the tutorial.

Indecision. Indecision measures the learner's degree of hesitancy in making decisions. The agent increases its measure of learner indecision when the time on task exceeds EstDecisionTime and EstActionTime without the learner having performed any action in the VFTS. In this case the agent needs to decide whether to intervene, because the learner is probably confused. An example of indecision is a learner who stares at the tutorial, then stares at the VFTS, but cannot decide how to proceed in the VFTS within the expected time and goes back to the tutorial again. Learners with a high Indecision value are most likely confused or stuck. The relative proportion of gaze activity may indicate the reason for the indecision: for example, a student who spends a long time reading the tutorial may have difficulty deciding how to proceed, whereas a student who spends a long time looking at the VFTS interface may be having difficulty figuring out how to carry out his actions. In a related fashion, it is possible in some contexts to detect instances of learner frustration, as episodes where the learner spends a significant amount of time on a task, tries multiple actions, but fails to make progress in completing the task.

Self-confidence. Self-confidence represents the confidence of learners in solving problems in the learning environment. If learners
perform actions in the VFTS after reading the tutorial without much hesitancy, they probably have high self-confidence. Self-confidence can also be reported by the learner via the WebTutor interface. If the learner has read the tutorial and spent EstDecisionTime deciding what to do next, but Self-confidence remains the same, the agent initiates an intervention.
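As an illustration of how these parameters could drive the intervention decision, the sketch below encodes the rules just described; the increments and thresholds are assumptions made for illustration, not values reported here.

def update_indecision(indecision: float,
                      time_on_task: float,
                      est_decision_time: float,
                      est_action_time: float,
                      acted_in_vfts: bool) -> float:
    # Raise Indecision when the learner exceeds the estimated decision and
    # action times without performing any action in the VFTS.
    if not acted_in_vfts and time_on_task > est_decision_time + est_action_time:
        indecision += 1.0
    return indecision

def decide_intervention(indecision: float,
                        self_confidence_changed: bool,
                        read_tutorial: bool,
                        time_on_task: float,
                        est_decision_time: float) -> bool:
    # Intervene when the learner is highly indecisive, or has read the
    # tutorial and spent the estimated decision time without acting and
    # without any change in reported Self-confidence.
    highly_indecisive = indecision >= 2.0   # illustrative threshold
    stalled = (read_tutorial
               and time_on_task > est_decision_time
               and not self_confidence_changed)
    return highly_indecisive or stalled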
4. Experiments and Evaluation

4.1 Comparison and evaluation of tracking models

We comparatively evaluated three approaches to inferring learner focus of attention: the eye tracking program (ETP), learner actions (LA) in the system, and the Bayesian network model (BNM). In ETP, the output of the eye gaze program is taken as the learner's current focus of attention. The LA method takes the last window in which the learner performed an action as the current focus of attention. BNM infers the learner's focus of attention with the Bayesian network, combining the certain and uncertain information. In all of these experiments, human tutors inferred the learners' focus of attention and recorded it using a program as they coached the learners; these records serve as the baseline.
        NON     VFTS    Agent Window    WebTutor Window
ETP     78%     71%     75%             –
LA      0%      22%     37%             26%
BNM     78%     84%     81%             88%
Table 2: Average accuracies for the different windows of the three approaches to inferring learner focus of attention.

Time after learner changes focused window    Accuracy of Bayesian network's results
1 second                                     > 60%
2 seconds                                    > 75%
> 3 seconds                                  > 85%
Table 3: Accuracy of the Bayesian network's results as a function of the time elapsed after the learner changes the focused window.
As shown in Table 3, the agent is able to detect periods of fixation comparable to what learners do in practice, i.e., how long learners have already been focused on the VFTS or WebTutor. When the agent infers that the learner's focus of attention has been on the "other area" for some time, e.g., more than 5 seconds, it can recognize that the learner is not focused on the system, e.g., the learner has walked away or turned away from the screen.
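The qualitative behavior behind Tables 1 and 3, where an interface action pins the focus and that certainty decays over time while gaze evidence accumulates, can be pictured as a single-step Bayesian update. The decay schedule and the assumed gaze accuracy below are illustrative stand-ins, not the network or the parameters actually used in the system.

import numpy as np

WINDOWS = ["VFTS", "WebTutor", "Agent", "NON"]

def decay_prior(last_action_window: str, seconds_since_action: float) -> np.ndarray:
    # Prior over focus windows: probability mass stays on the window of the
    # last interface action and decays toward the other windows over time
    # (cf. Table 1); the schedule here is illustrative.
    stay = max(0.95 - 0.07 * seconds_since_action, 0.25)
    prior = np.full(len(WINDOWS), (1.0 - stay) / (len(WINDOWS) - 1))
    prior[WINDOWS.index(last_action_window)] = stay
    return prior

def gaze_likelihood(gaze_window: str, gaze_accuracy: float = 0.7) -> np.ndarray:
    # Likelihood of the (noisy) gaze estimate given each candidate focus window.
    lik = np.full(len(WINDOWS), (1.0 - gaze_accuracy) / (len(WINDOWS) - 1))
    lik[WINDOWS.index(gaze_window)] = gaze_accuracy
    return lik

def infer_focus(last_action_window: str, seconds_since_action: float,
                gaze_window: str) -> dict:
    # Posterior over focus windows from the action-based prior and one
    # gaze observation (a single-step Bayesian update).
    posterior = decay_prior(last_action_window, seconds_since_action) \
        * gaze_likelihood(gaze_window)
    posterior /= posterior.sum()
    return dict(zip(WINDOWS, posterior))

# Example: the last click was in the VFTS 8 seconds ago, but gaze now
# points at the WebTutor window; the posterior shifts toward WebTutor.
print(infer_focus("VFTS", 8.0, "WebTutor"))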
Figure 4: Comparison of the three approaches: ETP, LA, and BNM.

Although the focus of attention inferred by the tutors does not match the learners' actual focus completely, it still reflects what the learners were doing, so we compared the results of the three approaches against these baseline data. Figure 4 illustrates the results inferred by the different approaches during the experiments; BNM produces more accurate results than ETP and LA. The vertical axis indicates the four different windows. Table 2 compares the average accuracies of the three approaches. These results indicate that an agent using BNM can track learner attention with greater confidence than one using the ETP or LA approach.
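For reference, the per-window accuracies in Table 2 amount to agreement rates between each approach's inferred focus and the tutor-recorded baseline; the short sketch below shows this computation on made-up label sequences.

from collections import defaultdict

def per_window_accuracy(predicted, baseline):
    # Accuracy of one approach's focus predictions against the tutor-labeled
    # baseline, broken down by window (as in Table 2). Both arguments are
    # sequences of window labels sampled at the same time points.
    correct, total = defaultdict(int), defaultdict(int)
    for pred, true in zip(predicted, baseline):
        total[true] += 1
        correct[true] += int(pred == true)
    return {w: correct[w] / total[w] for w in total}

# Example with invented label sequences:
baseline = ["VFTS", "VFTS", "WebTutor", "NON", "WebTutor"]
bnm_pred = ["VFTS", "VFTS", "WebTutor", "NON", "VFTS"]
print(per_window_accuracy(bnm_pred, baseline))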
Figure 5: A sequence of key parameter values
4.2 Evaluation and discussion of decision model

Figure 5 shows a sequence of key parameter values from the experiments. We recorded the learner's screen to a video file during the experiments, so a tutor could review the video, analyze the learner's motivational state, and identify possible intervention times. For example, from time 40 to 65 the learner's Self-confidence remained the same after EstDecisionTime and EstReadTime had elapsed, and Indecision kept increasing, so it is clear that this was an appropriate time to intervene. This matched the human tutor's observations from the experiments.
5. Future Work

As part of our future work, we plan to extend the user monitoring capability to handle a wider range of ambiguous contexts. For example, the WebTutor interface does not always determine the learner's task unambiguously. Several paragraphs could be visible to the learner at the same time, so the agent must decide among multiple tasks that the learner might be trying to perform. The history of previous activity can help disambiguate: if the learner has completed one task, they are likely to continue on to the next. But we also need to develop a method for classifying the learner's overall pattern of activity, i.e., whether they are exploring the interface, exploring the tutorial, or working through the tutorial. Learners do not always follow the tutorials as written. In our studies we observed that human tutors sometimes interrupt to offer help when the learner seems to be exploring randomly, but once the learner explains that he is exploring, the tutor temporarily ceases offering advice until the tutor observes that the learner is starting work on the tutorial.

Although this work has focused on specific programs and interfaces, the approaches and techniques can be generalized to other domains. When applying them to a new application domain, we need to know the possible actions in each window and the learners' characteristics in that domain. We can then use this information to rebuild and initialize the Bayesian network, making it capable of tracking the user's attention in the new domain. We also need to complete the plan library and redefine the key parameters the agent uses to decide when to intervene in the new domain. The agent can then detect the learner's confusion and decide when to intervene in the new domain; an example of such a domain description is sketched below.
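The following is a hypothetical example of the domain information just discussed: the windows, the actions possible in each, and per-task time estimates that would be used to re-initialize the Bayesian network and plan library. All names and numbers here are invented for illustration.

NEW_DOMAIN = {
    "windows": {
        "SimulatorPanel": {"actions": ["click", "keyboard", "scroll"]},
        "TutorialPane":   {"actions": ["scroll", "click"]},
        "AgentWindow":    {"actions": ["keyboard"]},
    },
    "tasks": [
        {"name": "configure_experiment",   # hypothetical task; times in seconds
         "est_read_time": 60, "est_decision_time": 30, "est_action_time": 45},
        {"name": "run_experiment",
         "est_read_time": 20, "est_decision_time": 15, "est_action_time": 90},
    ],
}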
Acknowledgements This work was supported in part by the National Science Foundation under Grant No. 0121330, and in part by a grant from Microsoft Research. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation.
References

1. Bloom, B.S. The 2 sigma problem: The search for methods of group instruction as effective as one-to-one tutoring. Educational Researcher, 13(6):4-16, 1984.
2. Scholer, A., Rickel, J., Angros, R., & Johnson, W.L. Learning domain knowledge for teaching procedural tasks. In AAAI 2000 Fall Symposium on Learning How to Do Things, 2000.
3. Shaw, E., Johnson, W.L., & Ganeshan, R. Pedagogical agents on the Web. In Proceedings of the Third Annual Conference on Autonomous Agents, 283-290. ACM Press, May 1999.
4. Johnson, W.L., Kole, S., Shaw, E., & Pain, H. Socially intelligent learner-agent interaction tactics. In Proceedings of the International Conference on Artificial Intelligence in Education, submitted, 2003.
5. Johnson, W.L., Rizzo, P., Bosma, W., Kole, S., Ghijsen, M., & van Welbergen, H. Generating socially appropriate tutorial dialog. In Affective Dialogue Systems (ADS04), Kloster Irsee, Germany, 2004.
6. Heckerman, D. A tutorial on learning with Bayesian networks. Technical Report MSR-TR-95-06, Microsoft Research, Redmond, Washington, 1995.
7. Johnson, W.L. Using agent technology to improve the quality of Web-based education. In N. Zhong and J. Liu (Eds.), Web Intelligence. Springer, Berlin, 2002.
8. Dessouky, M.M., Verma, S., Bailey, D., & Rickel, J. A methodology for developing a Web-based factory simulator for manufacturing education. IIE Transactions, 33:167-180, 2001.
9. Nicholson, A.E., & Brady, J.M. Dynamic belief networks for discrete monitoring. IEEE Transactions on Systems, Man and Cybernetics, 24(11):1593-1610, 1994.
10. Albrecht, D.W., Zukerman, I., Nicholson, A.E., & Bud, A. Towards a Bayesian model for keyhole plan recognition in large domains. In Jameson, A., Paris, C., & Tasso, C. (Eds.), Proceedings of the Sixth International Conference on User Modeling (UM '97), 365-376. Springer Wien New York, 1997.
11. Dean, T., & Wellman, M. Planning and Control. Morgan Kaufmann, San Mateo, California, 1991.
12. Burton, R.R. Diagnosing bugs in a simple procedural skill. In Sleeman, D., & Brown, J.S. (Eds.), Intelligent Tutoring Systems, 157-183. Academic Press, 1982.
13. Munro, A., Johnson, M.C., Surmon, D.S., & Wogulis, J.L. Attribute-centered simulation authoring for instruction. In Proceedings of the AI-ED 93 World Conference on Artificial Intelligence in Education, 82-89. Edinburgh, Scotland, 1993.
14. Rickel, J. An intelligent tutoring framework for task-oriented domains. In Proceedings of the International Conference on Intelligent Tutoring Systems, 109-115, Montreal, Canada, 1988.
15. Johnson, W.L. Interaction tactics for socially intelligent pedagogical agents. In Proceedings of the International Conference on Intelligent User Interfaces, 2003.
16. Rickel, J., & Johnson, W.L. Animated agents for procedural training in virtual reality: Perception, cognition, and motor control. Applied Artificial Intelligence, 13:343-382, 1999.
17. D'Souza, A., Rickel, J., Herreros, B., & Johnson, W.L. An automated lab instructor for simulated science experiments. In Proceedings of the Tenth International Conference on AI in Education, 65-76. IOS Press, May 2001.
18. Horvitz, E., Breese, J., Heckerman, D., Hovel, D., & Rommelse, K. The Lumière project: Bayesian user modeling for inferring the goals and needs of software users. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, 256-265, Madison, WI, 1998.
19. del Soldato, T., & du Boulay, B. Implementation of motivational tactics in tutoring systems. Journal of Artificial Intelligence in Education, 6(4):337-378, 1995.
20. Gluck, K.A., Anderson, J.R., & Douglass, S. Broader bandwidth in student modeling: What if ITS were "Eye"TS? In Intelligent Tutoring Systems 2000, 504-513, 2000.
21. Duchowski, A.T. Eye Tracking Methodology: Theory and Practice. Springer-Verlag, London, UK, 2003.
22. Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA, 1988.
23. Zhou, X., & Conati, C. Inferring user goals from personality and behavior in a causal model of user affect. In Proceedings of IUI 2002, 211-218. ACM Press, New York, 2002.
24. Johnson, W.L., Rickel, J.W., & Lester, J.C. Animated pedagogical agents: Face-to-face interaction in interactive learning environments. International Journal of Artificial Intelligence in Education, 11:47-78, 2000.