Development of a Perceptive Interface based on Facial Displays applied to Learning Environments

Javier Arturo Porras Luraschi
UNAM – Faculty of Science
Circuito Exterior, Ciudad Universitaria, Coyoacán, México D.F. 04510
[email protected]
Ana Luisa Solís González Cosío
UNAM – Faculty of Science
Circuito Exterior, Ciudad Universitaria, Coyoacán, México D.F. 04510
(55) 5622 4858
[email protected]
CLIHC'05, October 23-26, 2005, Cuernavaca, México. Copyright is held by the author(s). ACM 1-59593-224-0.
ABSTRACT
This paper proposes a real-time perceptive interface that can enhance the experience in learning environments by using the face as a communication channel with the system. Relevant expressions are assigned to learning situations in which it is important to know the user's interest in the information being displayed. The system is implemented using pattern recognition techniques and computer vision algorithms, and it is evaluated through its results.
Keywords
Facial displays, expression recognition, perceptive interfaces, learning environments, neural networks, optical flow, computer vision, pattern recognition, real time.
1. INTRODUCTION
Ideally, human-computer interfaces should respond to people’s intuitive modes of communication. Recent advances in real time computer vision offer the promise of endowing human-computer interfaces with direct perception of human users. To realize a true perceptive interface, it is necessary to study how humans perceive information and to which information humans are sensitive.
In designing human-computer interaction, human face-to-face conversation has provided an ideal model. One of the major features of face-to-face communication is the multiplicity of communication channels. As the terms "face-to-face" and "interface" indicate, faces play an essential role in communication.
The study of facial expressions has attracted the interest of a number of different disciplines, including psychology, ethology, and interpersonal communication. Facial expressions are viewed in either of two ways. One regards facial expressions as expressions of emotional states. The other views facial expressions in a social context and regards them as communicative signals. The term "facial displays" is equivalent to "facial expressions", but does not carry the connotation of emotion. In this paper, we use the term "facial displays".

2. THEORY OF COMMUNICATIVE FACIAL DISPLAYS

Facial displays can be seen as a communication channel used during interaction between humans, which conveys various kinds of signs over the face. We make two major assumptions [3, 4, 5]:

• Facial displays are primarily communicative. They are used to convey information to other people. The information conveyed may be emotional, or of another kind, for example syntactic information, indications that the speaker is being understood, relationship definition, listener responses, etc. Facial displays can function during interaction as communication on their own; that is, they can send a message independently of other communicative behavior. Facial emblems such as listener comments (agreement or disagreement, disbelief or surprise) are typical examples. Facial displays can also work in conjunction with other communicative behavior (both verbal and nonverbal) to provide information.
• Facial displays are primarily social. They occur for the purpose of communicating information to others. Their occurrence is regulated more by social situations than by any underlying emotional process.
Terminology and individual displays are based on those of Chovil [3] and Ekman [4]. Three major categories are defined in [3] as follows.

Syntactic Displays. These are defined as facial displays that:
• Mark stress on particular words or clauses,
• Are connected with the syntactic aspects of an utterance, or
• Are connected with the organization of the talk.
Speaker Displays. Speaker displays are defined as facial displays that:
• Illustrate the idea being verbally conveyed, or
• Add additional information to the ongoing verbal content.
Listener Comment Displays. These are facial displays made by the person who is not currently speaking, in response to the utterances of the other person. In this case, the facial displays are nonverbal communication messages. This is the kind of message we can get from a user sitting in front of the computer and interacting with an application (learning system, games, internet, etc.).

To read expressions, we first need to determine the expressions from facial movements. Ekman and Friesen have produced a system describing all visually distinguishable facial movements [4, 5]. The system, called the Facial Action Coding System (FACS), is based on an enumeration of all "action units" (AUs) of a face that cause facial movements. The system in this work was developed using FACS.
Expression | Region | AU | Description
Understanding | Right Eye | 42 | Right eye open
Understanding | Left Eye | 42 | Left eye open
Distraction | Head | 51 | Head turn left
Distraction | Head | 52 | Head turn right
Doubt | Mouth | 20 | Lip stretcher
Doubt | Mouth | 15 | Lip corner depressor
Surprise | Mouth | 27 | Mouth stretch
Thinking | Head | 53 | Head up
Table 1. AUs for recognition
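For illustration only, the content of Table 1 can be kept in a small lookup structure like the sketch below; the names (EXPRESSION_AUS, regions_for) are ours and not part of the original system.

```python
# Hypothetical lookup mirroring Table 1: expression -> list of (region, AU, description).
EXPRESSION_AUS = {
    "understanding": [("right_eye", 42, "right eye open"),
                      ("left_eye", 42, "left eye open")],
    "distraction":   [("head", 51, "head turn left"),
                      ("head", 52, "head turn right")],
    "doubt":         [("mouth", 20, "lip stretcher"),
                      ("mouth", 15, "lip corner depressor")],
    "surprise":      [("mouth", 27, "mouth stretch")],
    "thinking":      [("head", 53, "head up")],
}

def regions_for(expression):
    """Return the face regions that have to be inspected to detect an expression."""
    return {region for region, _, _ in EXPRESSION_AUS.get(expression, [])}
```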
CATEGORIZATION OF FACIAL DISPLAYS USED IN THE LEARNING ENVIRONMENT
We must consider the facial displays we are interested in recognizing in the learning environment. The kinds of expressions are very specific:
• Thinking / Remembering. Eyebrow raising or lowering. Eye closing. Pulling back one side of the mouth.
• Facial shrug / "I don't know". Eyebrow flashes. Mouth corners pulled down. Mouth corners pulled back.
• Backchannel / Indication of attention. Eyebrow raising. Mouth corners turned down.
• Doubt. Eyebrows drawn to the center.
• Understanding levels:
  - Confident. Eyebrow raising. Head nod.
  - Moderately confident. Eyebrow raising.
  - Not confident. Eyebrow lowering.
• Agreement. Eyebrow raising.
• Request for more information. Eyebrow raising.
• Evaluation of utterance.
RECOGNIZING FACIAL DISPLAYS
Our specific objective was to propose, implement and analyze a system that could detect the main facial displays in order to improve the teaching process in learning environments. The main expressions that we considered were:
• Understanding
• Distraction
• Surprise
• Doubt
• Thinking
SYSTEM ARCHITECTURE AND IMPLEMENTATION

Our architecture joins a coarse head-movement approach with detailed pattern recognition techniques based on neural networks. The main loop of the application (see Table 2) can be defined as:
1. Capture an image from the device.
2. Find the user's face using the built-in cascades from OpenCV¹.
3. Try to find coarse head movements using optical flow techniques implemented in OpenCV.
4. Try to find detailed patterns, e.g. a smile, over the region where it is possible to find them, using our own implementation of a neural network.

Table 2. Recognition architecture (Source → Face → Region → Recognition: Optical Flow / Pattern Recognition)

Since we used our own neural networks to find detailed patterns, it was necessary to create different modules that could help us build the training patterns. Those modules were:
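As a rough illustration of this loop (not the authors' code), the following sketch uses OpenCV's built-in frontal-face Haar cascade and leaves steps 3 and 4 as stubs; the cascade file name and the detect_* helper names are our own assumptions.

```python
import cv2

# Assumption: the opencv-python package ships this cascade file; adjust the path for your install.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_head_movement(prev_gray, gray):
    """Stub for step 3: coarse head movement from optical flow (see the later sketch)."""
    return None

def detect_region_patterns(face_img):
    """Stub for step 4: detailed patterns (e.g. a smile) found with the trained networks."""
    return None

cap = cv2.VideoCapture(0)              # step 1: capture images from the device
prev_gray = None
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, 1.3, 5)      # step 2: find the user's face
    expression = None
    if prev_gray is not None:
        expression = detect_head_movement(prev_gray, gray)   # step 3: optical flow
    if expression is None and len(faces) > 0:
        x, y, w, h = faces[0]
        expression = detect_region_patterns(gray[y:y+h, x:x+w])  # step 4: neural network
    prev_gray = gray
    if cv2.waitKey(1) == 27:           # Esc to quit
        break
cap.release()
```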
¹ Open Computer Vision library (OpenCV), developed mainly by Intel.
a) Capturing Facial Displays by Program. The objective was to capture different facial displays from users who did not know their expressions were being recorded, using a computer program that asked several questions designed to elicit the expressions we wanted to study, such as understanding, doubt and surprise. The test consisted of 28 questions asked in one thread, while another thread captured the faces and stored them in the expression directory associated with the question. Most users did not realize their expressions were being captured because the camera on the top part of the laptop appeared to be turned off.

Figure 2. Sequence from system
Figure 3. Face capture module
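A minimal sketch of such a two-thread capture program, assuming a webcam accessible through OpenCV; the question list, directory layout and helper names are ours, not the original module.

```python
import cv2, os, threading, time

# Hypothetical questions paired with the expression directory each one is meant to elicit.
QUESTIONS = [("What is 17 * 23?", "thinking"),
             ("Did you understand the last topic?", "understanding")]
current_dir = None                      # expression directory for the active question
stop = threading.Event()

def ask_questions():
    """Thread A: show each question and record which expression folder is active."""
    global current_dir
    for text, expression in QUESTIONS:
        current_dir = expression
        os.makedirs(expression, exist_ok=True)
        input(text + " ")               # wait for the user's answer
    stop.set()

def capture_faces():
    """Thread B: silently grab frames and store them under the active expression folder."""
    cap = cv2.VideoCapture(0)
    n = 0
    while not stop.is_set():
        ok, frame = cap.read()
        if ok and current_dir is not None:
            cv2.imwrite(os.path.join(current_dir, f"face_{n:05d}.png"), frame)
            n += 1
        time.sleep(0.2)                 # roughly five captures per second
    cap.release()

threading.Thread(target=capture_faces, daemon=True).start()
ask_questions()
```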
Verification faces were used to test the neural network rather than to train it; these cases were extremely important because they were used to decide when to stop the training process.
Figure 1. Global structure

Expression | Faces
Distraction | 607
Doubt | 4387
Understanding / Thinking | 4883
Surprise | 320
Total | 10197
Table 3. Faces captured

Expression | Faces
Distraction | 306
Doubt | 208
Understanding | 223
Thinking | 122
Surprise | 335
Verification | 39
Total | 1233
Table 4. Number of faces captured directly
b) Capturing Facial Displays by an Actor. In order to add more faces to our database, a module was made for capturing faces directly from an actor who forced each expression based on the models gathered with the previous module. Ideally this should not be done, since it is preferable to get all the samples from the test, but because of time restrictions it was necessary to gather more expressions in a faster way.
c) Pattern Classifier. From each face, its main components, such as the eyes and mouth, are extracted manually and the sample's histogram is normalized². This module creates an XML file that is used during the training process. The classifier defines an area where it is possible to find the pattern; these results are shown in Table 8.
² When the histogram of an image is normalized, its color palette is used completely, and therefore the information it contains is improved.
Table 5. Normalized histogram (right image)

Expression | Region | Minimum (x, y, scale) | Maximum (x, y, scale)
Understanding | Mouth | 0.27, 0.65, 0.24 | 0.69, 0.99, 0.34
Understanding | R. Eye | 0.53, 0.27, 0.15 | 0.88, 0.48, 0.23
Understanding | L. Eye | 0.11, 0.28, 0.13 | 0.44, 0.49, 0.22
Distraction | R. Eye | 0.51, 0.22, 0.12 | 0.95, 0.51, 0.28
Distraction | L. Eye | 0.03, 0.20, 0.14 | 0.48, 0.53, 0.26
Doubt | Mouth | 0.23, 0.64, 0.26 | 0.73, 0.97, 0.45
Thinking | R. Eye | 0.51, 0.25, 0.15 | 0.89, 0.51, 0.23
Thinking | L. Eye | 0.51, 0.25, 0.15 | 0.89, 0.51, 0.23
Surprise | Mouth | 0.20, 0.64, 0.26 | 0.77, 1.00, 0.46
Table 8. Pattern search area
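A sketch of the per-sample preparation this module performs, under our own assumptions: the region coordinates come from manual marking, histogram normalization is done here with OpenCV's equalizeHist, and the XML layout is invented for illustration.

```python
import cv2
import xml.etree.ElementTree as ET

def prepare_sample(image_path, region_box, size=(20, 20)):
    """Crop a manually marked region (eye or mouth), equalize its histogram and resize it."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    x, y, w, h = region_box
    patch = cv2.equalizeHist(gray[y:y+h, x:x+w])   # use the full grey-level range
    return cv2.resize(patch, size)

def save_training_set(samples, out_path="training.xml"):
    """Write (expression, region, box) records to a simple, invented XML layout.

    The search areas of Table 8 can then be taken as the per-region minimum and
    maximum of the marked (x, y, scale) values over all stored samples.
    """
    root = ET.Element("training_set")
    for expression, region, box in samples:
        e = ET.SubElement(root, "sample", expression=expression, region=region)
        e.text = " ".join(str(v) for v in box)
    ET.ElementTree(root).write(out_path)
```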
d) Network Trainer. Finally, using the training samples created by the classifier, the trainer trains the neural network with the Robbins-Monro back propagation algorithm, which was chosen because it allows adding more training patterns even if the network is already trained. In order to maximize the network's usage, the initial activation values were approximated by their expected value (Figure 4).
Table 6. Classifier module

Table 7 shows the samples needed during the training process.
Expression | Region | Training samples | Subtotal
Understanding | Mouth | 2630 |
Understanding | Right Eye | 3808 |
Understanding | Left Eye | 3393 | 9831
Distraction | Right Eye | 5180 |
Distraction | Left Eye | 7018 | 12198
Doubt | Mouth | 3646 | 3646
Thinking | Right Eye | 1462 |
Thinking | Left Eye | 1489 | 2951
Negatives | Right Eye | 2939 | 2939
Total samples | | | 31565
Table 7. Total samples classified
$$\varepsilon\left(\sum_{i=0}^{K_{L-1}} w_{mi}^{L-1}\, y_i^{L-1}\right) = \sum_{i=0}^{K_{L-1}} w_{mi}^{L-1}\, \varepsilon\!\left(y_i^{L-1}\right) = \frac{1}{2}\sum_{i=0}^{K_{L-1}} w_{mi}^{L-1}$$

Figure 4. Initial activation value
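The paper does not give the full training procedure, so the following is only our reading of it: an online back propagation step whose learning rate decays as eta0/n (a Robbins-Monro schedule, which lets new patterns be added to an already trained network), with biases initialized from the expected weighted input shown in Figure 4 (inputs in [0, 1] have expected value 1/2). All class and parameter names are ours.

```python
import numpy as np

class OnlineMLP:
    """Sketch of incremental back propagation with Robbins-Monro step sizes (eta_n = eta0 / n)."""

    def __init__(self, n_in, n_hidden, eta0=0.5, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (n_hidden, n_in))
        self.W2 = rng.normal(0.0, 0.1, (1, n_hidden))
        # Assumption: biases start at minus the expected weighted input, so the initial
        # activations sit near the sigmoid midpoint (see Figure 4).
        self.b1 = -0.5 * self.W1.sum(axis=1, keepdims=True)
        self.b2 = -0.5 * self.W2.sum(axis=1, keepdims=True)
        self.eta0, self.n = eta0, 0

    @staticmethod
    def _sig(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_one(self, x, target):
        """One stochastic gradient step on a single (pattern, label) pair."""
        x = x.reshape(-1, 1)
        h = self._sig(self.W1 @ x + self.b1)          # hidden activations
        y = self._sig(self.W2 @ h + self.b2)          # network output
        self.n += 1
        eta = self.eta0 / self.n                      # Robbins-Monro step size
        delta2 = (y - target) * y * (1.0 - y)         # output error term
        delta1 = (self.W2.T @ delta2) * h * (1.0 - h) # back propagated error
        self.W2 -= eta * delta2 @ h.T
        self.b2 -= eta * delta2
        self.W1 -= eta * delta1 @ x.T
        self.b1 -= eta * delta1
        return float(y)
```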
Expression | Region | Positives | Negatives
Understanding | Mouth | 90 | 100
Understanding | R. Eye | 100 | 93
Understanding | L. Eye | 90 | 92
Distraction | R. Eye | 100 | 100
Distraction | L. Eye | 75 | 89
Doubt | Mouth | 48 | 99
Thinking | R. Eye | 100 | 100
Thinking | L. Eye | 100 | 100
Surprise | Mouth | 100 | 100
Table 9. Verification samples results
When searching for a pattern, the probable area where it can be found, as defined by the classifier, is scanned starting from the top-left corner, applying the pattern over the whole area until the pattern is found or the bottom-right corner is reached.
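That scan can be sketched as a simple sliding-window loop; the window size, stride, threshold and the classify() callback are placeholders for the trained network of the corresponding region.

```python
def find_pattern(search_img, classify, win=(20, 20), stride=2, threshold=0.9):
    """Slide a window over the search area (top-left to bottom-right) and stop at the
    first position whose classifier response exceeds the threshold."""
    wh, ww = win
    rows, cols = search_img.shape[:2]
    for y in range(0, rows - wh + 1, stride):
        for x in range(0, cols - ww + 1, stride):
            if classify(search_img[y:y+wh, x:x+ww]) >= threshold:
                return (x, y)          # pattern found
    return None                        # bottom-right corner reached without a match
```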
Table 10. Some patterns trained
Figure 6. Pattern search process
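Before looking at how the two techniques are combined below, here is one way the coarse head-movement step could be done with OpenCV's dense optical flow; the thresholds and the mapping to "distraction"/"thinking" follow Table 11, but the numbers are our own guesses.

```python
import cv2
import numpy as np

def coarse_head_movement(prev_gray, gray, face_box, threshold=2.0):
    """Classify coarse head motion inside the face box using Farneback dense optical flow."""
    x, y, w, h = face_box
    flow = cv2.calcOpticalFlowFarneback(prev_gray[y:y+h, x:x+w], gray[y:y+h, x:x+w],
                                        None, 0.5, 3, 15, 3, 5, 1.2, 0)
    dx, dy = float(np.mean(flow[..., 0])), float(np.mean(flow[..., 1]))
    if abs(dx) > threshold and abs(dx) > abs(dy):
        return "distraction"           # dominant horizontal head turn (Table 11)
    if dy < -threshold and abs(dy) > abs(dx):
        return "thinking"              # head moving up in image coordinates (Table 11)
    return None
```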
The final application combined pattern recognition methods and optical flow algorithms, using the scheme of Figure 5. This module could not rely only on the trained patterns because of the similarities between them when using a low resolution camera, so optical flow was used instead to determine coarse movements of the head. Table 11 summarizes which technique was used to find each expression. Figure 5 shows the order in which each pattern or movement is searched for; whenever a pattern is found, the process stops and the recognized expression prevails for that frame.

Expression | Region | Method
Understanding | Mouth | Neural Network
Understanding | Right Eye | Neural Network
Understanding | Left Eye | Neural Network
Distraction | Head | Horizontal Optical Flow
Doubt | Mouth | Neural Network
Thinking | Head | Vertical Optical Flow
Surprise | Mouth | Neural Network
Table 11. Recognition methods

RESULTS

Stable results were obtained for most of the expressions, but lighting and the low video quality remained unresolved problems. The first could be addressed by adding more training samples under different lighting conditions; the latter will not be fixed in the near future because web cameras are built with low resolutions and, since one goal of the architecture was to spread the use of facial recognition systems, we do not propose changing the required resolution, although we have obtained better results with high quality cameras without any major changes.
Figure 7. Recognition using Optic Flow
Figure 8. Recognition using Neural Networks
Figure 5. Recognition algorithm
INTEGRATION OF PERCEPTIVE INTERFACES AND INTELLIGENT TUTORING SYSTEMS
The system is well suited for developing an autonomous intelligent tutor that can communicate and hold a dialog with a real person through facial display recognition [11, 12]. The system architecture in Figure 10 consists of several subsystems: a vision subsystem that processes the motion video input and performs facial display recognition, a facial animation subsystem that generates a three-dimensional face with TTS (Text-to-Speech) to simulate the virtual tutor, and the intelligent tutoring subsystem for the learning environment. The input to the analyzer comes from the facial display recognition module, and the result of the analysis provides the facial display of the user. For the virtual face to respond to the real person, a database of content words with associated facial expression states is used; these are defined in terms of their constituent phonemes and facial expressions.
Figure 10. Integration of Perceptive Interfaces and Intelligent Tutoring Systems
The series of timed phonemes and expressions is then compiled and decomposed into the basic action units that produce the visual changes to be made on the virtual tutor's face. The words to be spoken are transmitted to the vocal synthesizer module.
Figure 9. Facial Communication

INTERACTING WITH THE INTELLIGENT TUTORING SYSTEM THROUGH FACIAL DISPLAY

Input from facial expressions may be used to interact with the application in the learning environment. Relevant expressions are associated with the virtual tutoring system when it is important to know the user's interest in the information being displayed, or when the user is interacting in a virtual learning environment.

CONCLUSION

The proposed architecture could be implemented in real time with acceptable results. Using neural networks is a relatively simple way to approach problems that are complex for computers, such as the recognition of facial expressions. Optical flow performed well when analyzing large-scale movements, but poorly when tracking specific patterns, such as the eye limits, over long periods of time. The Robbins-Monro back propagation algorithm contributes great scalability to the system; this was very useful because the trained networks could be tested early in development to get an idea of how effective they really were.

REFERENCES
1. Ajit, S. Optic Flow Computation: A Unified Perspective. IEEE Computer Society Press, 1991.
2. Chandrasiri, N.P., Naemura. Facial Expression Recognition System with Applications to Facial Animation in MPEG-4. IEICE Trans. Inf. & Syst., Vol. E84-D, No. 8, Aug. 2001.
3. Chovil, N. Communicative Functions of Facial Displays in Conversation. Ph.D. Thesis, University of Victoria, 1989.
4. Ekman, P. and Friesen, W.V. Facial Action Coding System. Consulting Psychologists Press, Palo Alto, California, 1978.
5. Ekman, P. Facial signs: Facts, fantasies and possibilities. In T. Sebeok, editor, Sight, Sound and Sense. Indiana University Press, 1978.
6. Essa, I. Analysis, Interpretation and Synthesis of Facial Expressions. MIT Media Laboratory, 1995.
7. Fontoura, L. and Marcondes, R. Shape Analysis and Classification. CRC Press, 2001.
8. Jähne, B. Digital Image Processing: Concepts, Algorithms, and Scientific Applications. Springer, 1992.
9. Pandzic, I. MPEG-4 Facial Animation: The Standard, Implementation and Applications. John Wiley & Sons, 2002.
10. Solís, A. and Castro, J. Expression Modeling and Speech for Conversational Agents in Virtual Environments. Proceedings of MICAI/TAINA'02, Avances en Inteligencia Artificial, Mérida, Mexico, April 2002.
11. Solís, A. and Ríos, H. Interaction for Virtual Environments based on Facial Expression Recognition. In The Future of VR and AR Interfaces: Multi-modal, Humanoid, Adaptive and Intelligent, Proceedings of the Workshop, IEEE Virtual Reality 2001, Yokohama, Japan.
12. Solís, A., Ríos, H. and Aguirre, E. Facial Expression Recognition and Modeling for Virtual Intelligent Tutoring Systems. Lecture Notes in Artificial Intelligence, Springer, 2000.
13. Theodoridis, S. and Koutroumbas, K. Pattern Recognition. Academic Press, 2003.
14. Torras, C. Computer Vision: Theory and Industrial Applications. Springer-Verlag, 1990.