VIRTUAL CLASSROOM SYSTEM WITH IMPROVED STUDENT FEEDBACK

M. van der Schyff, H.C. Ferreira, W.A. Clarke
Department of Electrical and Electronic Engineering, Rand Afrikaans University, P.O. Box 524, Auckland Park, 2006, South Africa.
Tel: +27-11-489-2463 Fax: +27-11-489-2357 E-mail:
[email protected]

ABSTRACT - Student feedback in traditional virtual classrooms is usually limited to intentional feedback. Information pertaining to the student's actual mental state (bored, interested, content, confused), gathered from facial expressions and gestures, is lost. The focus of this paper is on methods to increase student feedback without placing unnecessary or excessive strain on available resources such as bandwidth, memory and processing power. The method uses indirect feedback (facial actions from student video) and direct feedback (answers to questions and opinions) to estimate the student's state. These estimates are then used to prioritise video transmission / updating from student to teacher. An experimental system is currently being developed and implemented using the Java Media Framework (JMF).

I. INTRODUCTION

Distance education is a formalized teaching system that, through the use of technology, facilitates communication between a teacher and a learner situated at different locations. In a virtual classroom the aim is to provide education in a manner that resembles a traditional classroom as closely as possible. Distance education has many advantages, including low cost, accessibility and the potential to raise the standard of education by increasing access to highly qualified teachers. Unfortunately, before the benefits of distance education can be fully exploited, several factors still need to be addressed. These include computer illiteracy, resistance to change, recognition of qualifications obtained online, cost to students, limited interaction between students and teachers, and the quality of audio and video under limited bandwidth. The main focus of the proposed system is on improving student-teacher interaction by providing teachers with near real-time feedback from students. This would greatly aid in attaining the main goal of virtual classrooms, since the resulting system would more closely resemble a true classroom that can easily adapt according
to prevailing needs. The students would feel part of the class, and the effectiveness of the class would be less dependent on the willingness of students to provide the feedback needed to ensure quality education. That willingness may in fact be higher in a virtual classroom, as many people are more likely to click a button on a computer than to ask a question or make a comment aloud. In current systems, opinion polls, online quizzes and answers to questions provide student feedback, all of which require conscious or intentional input from the student. The unintentional feedback through gestures, expressions and body language that teachers in conventional classrooms use to judge lecture pace and the students' understanding of the coursework is lost. The proposed system uses both intentional and unintentional feedback from the students to prioritise video transmission from student to teacher. The teacher is then presented with video or regularly updated images of the students, in addition to information on average class state, student state, pace adjustment suggestions etc., from which they can derive the information that would be available to them in a conventional classroom. During the development of the system the focus is on the real-time extraction of information from the video monitoring a student, establishing suitable metrics for prioritising the video, and the development of a protocol to control the transmission of video and other information between terminals without adding excessive strain on available resources.

II. SYSTEM OVERVIEW

The proposed virtual classroom consists of a teacher terminal and student terminals. The terminals are connected via a server responsible for coordinating data between the teacher and student terminals. Both the student and teacher terminals are capable of sending and receiving real-time audio and video streams using various standards and compression schemes. This enables performance comparisons between standards while providing the scalability needed to allow the system to work on both
local area networks (LANs) and the Internet. A custom-developed protocol handles the coordination of audio and video between the student terminals, the server and the teacher terminal.
Figure 1: System overview (teacher terminal and student terminals connected through the server).
The system is being implemented using the Java Media Framework (JMF). "JMF provides a unified architecture and messaging protocol for managing the acquisition, processing, and delivery of time-based media." [8] It has built-in support for various video standards (H.263, MPEG, MJPEG, QuickTime etc.) and audio standards (MP3, G.711, GSM etc.). Transmission of audio and video using the Real-time Transport Protocol (RTP) is also provided for most of these standards, with automatic audio and video synchronization. JMF allows the creation of customised effects, codecs, multiplexers, renderers etc. and supports DirectX rendering. Conversion between different formats in the processing chain, as well as threading between the audio and video processes, is handled automatically. This comprehensive media framework significantly decreases development time and makes JMF well suited to the development of an experimental system. One drawback of such a flexible framework may be execution speed. JMF tries to address this issue by providing performance packs (native code) targeted at specific operating systems, but actual JMF performance still needs to be investigated. Using different virtual machines or compiling the code to native code could provide further performance gains. A few videoconferencing and education systems have already been implemented using JMF (see [8] and [9]).

A. Student Terminal

The student terminal is responsible for presenting the audio, video and other content transmitted from the teacher terminal. The terminal consists of a computer containing the developed software, a headset incorporating a microphone, and a webcam that captures live video of the student. The video is used to monitor the student and provide indirect feedback to the teacher.
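As a rough illustration of how such a capture-and-transmit chain might be assembled with JMF, the sketch below locates a webcam through CaptureDeviceManager, wraps it in a Processor configured for RTP output and hands the result to an RTPManager. The device query, the server address and port, and the simple state-polling helper are illustrative assumptions, not details taken from the system described here.

```java
import java.net.InetAddress;
import java.util.Vector;

import javax.media.CaptureDeviceInfo;
import javax.media.CaptureDeviceManager;
import javax.media.Manager;
import javax.media.Processor;
import javax.media.format.VideoFormat;
import javax.media.protocol.ContentDescriptor;
import javax.media.protocol.DataSource;
import javax.media.rtp.RTPManager;
import javax.media.rtp.SendStream;
import javax.media.rtp.SessionAddress;

public class StudentVideoSender {

    public static void main(String[] args) throws Exception {
        // Find a capture device that can deliver raw YUV video (e.g. a USB webcam).
        Vector devices = CaptureDeviceManager.getDeviceList(new VideoFormat(VideoFormat.YUV));
        if (devices.isEmpty()) {
            throw new IllegalStateException("No suitable capture device found");
        }
        CaptureDeviceInfo webcam = (CaptureDeviceInfo) devices.get(0);

        // Create a Processor on the capture device and ask for RTP-ready output.
        Processor processor = Manager.createProcessor(webcam.getLocator());
        processor.configure();
        waitForState(processor, Processor.Configured);
        processor.setContentDescriptor(new ContentDescriptor(ContentDescriptor.RAW_RTP));
        processor.realize();
        waitForState(processor, Processor.Realized);

        // Open an RTP session towards the server (address and port are placeholders).
        DataSource output = processor.getDataOutput();
        RTPManager rtpManager = RTPManager.newInstance();
        SessionAddress local = new SessionAddress();
        SessionAddress server = new SessionAddress(InetAddress.getByName("192.168.0.10"), 22222);
        rtpManager.initialize(local);
        rtpManager.addTarget(server);
        SendStream stream = rtpManager.createSendStream(output, 0);

        stream.start();
        processor.start();
    }

    // Crude state helper: JMF state changes are asynchronous, so poll until reached.
    private static void waitForState(Processor p, int state) throws InterruptedException {
        while (p.getState() < state) {
            Thread.sleep(50);
        }
    }
}
```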
B. Server

The server receives status and control packets from the student and teacher terminals and is responsible for initialising, starting, stopping, terminating and prioritising transmission between all the terminals, using the data contained in the packets. The server achieves this control by sending control packets to the terminals.

C. Teacher Terminal

The teacher terminal captures and transmits video and other information used during the lecture to the student terminals. It also receives video from the students and presents estimates of class activity.

III. MONITORING A STUDENT

Figure 2 shows the procedure used to monitor a student. The process is repeated at timed intervals. The interval (processing rate) and the algorithm used for feature extraction and classification can be changed according to available resources such as processing power. Because the processing is executed in a separate thread, the algorithm used has little effect on the transmission frame rate of the video.

Figure 2: Monitoring a student - capture frame from video stream; extract features; classify student state; construct and transmit status packet; server analyses received packets and transmits control packet to student terminal; student terminal starts / stops transmission.

The procedure is discussed in more detail in the following sections.

A. Capture frame from video stream

Video is captured from a USB webcam pointed at the student at a resolution of 352 x 288 pixels. The video capture frame rate is set to 15 frames per second. For best results the webcam should be placed in such a way that the student's face covers most of a captured video frame.
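JMF exposes frame grabbing as a control on a realized player, so the timed capture step might be sketched as follows. FrameGrabbingControl, BufferToImage and the timer scheduling are standard JMF / Java facilities, but the class itself, the polling interval and the analyse() hook are illustrative assumptions rather than details from the system described here.

```java
import java.awt.Image;
import java.util.Timer;
import java.util.TimerTask;

import javax.media.Buffer;
import javax.media.Player;
import javax.media.control.FrameGrabbingControl;
import javax.media.format.VideoFormat;
import javax.media.util.BufferToImage;

/** Grabs frames from a running JMF player on its own timer thread. */
public class FrameMonitor {

    private final FrameGrabbingControl grabber;
    private final Timer timer = new Timer("frame-monitor", true);

    public FrameMonitor(Player player) {
        // The control is only available on realized players with a video track.
        grabber = (FrameGrabbingControl) player.getControl(
                "javax.media.control.FrameGrabbingControl");
        if (grabber == null) {
            throw new IllegalArgumentException("Player does not support frame grabbing");
        }
    }

    /** Start sampling frames every intervalMs milliseconds. */
    public void start(long intervalMs) {
        timer.scheduleAtFixedRate(new TimerTask() {
            @Override
            public void run() {
                Buffer buffer = grabber.grabFrame();
                if (buffer == null || buffer.getFormat() == null) {
                    return; // nothing captured yet
                }
                VideoFormat format = (VideoFormat) buffer.getFormat();
                Image frame = new BufferToImage(format).createImage(buffer);
                if (frame != null) {
                    analyse(frame); // hand the frame to the feature extractor
                }
            }
        }, 0, intervalMs);
    }

    public void stop() {
        timer.cancel();
    }

    /** Placeholder: feature extraction and classification happen here. */
    protected void analyse(Image frame) {
        // e.g. convert to a BufferedImage and run the frame-difference extractor
    }
}
```

Running the grabbing task on its own timer thread keeps the analysis off the capture and transmission path, as described above.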
Table 1: Symbols and their meanings

Symbol      Meaning
r(f, p)     Red component of a pixel p on frame f.
g(f, p)     Green component of a pixel p on frame f.
b(f, p)     Blue component of a pixel p on frame f.
I(f, p)     Resulting binary value of pixel p on frame f.
w           Frame width.
h           Frame height.
I(p)        Binary value of the pixel located at position p.
x(p)        The group number of pixel p.
s(m)        Size of group m.
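For reference, the colour components of Table 1 can be read from a captured frame once it has been converted to a java.awt.image.BufferedImage; a minimal helper, assuming 24-bit RGB pixel packing, might look like this:

```java
import java.awt.image.BufferedImage;

/** Helpers for reading the Table 1 colour components from a frame. */
final class PixelComponents {

    // Red component r(f, p) of the pixel at (x, y) in frame f.
    static int r(BufferedImage f, int x, int y) {
        return (f.getRGB(x, y) >> 16) & 0xFF;
    }

    // Green component g(f, p).
    static int g(BufferedImage f, int x, int y) {
        return (f.getRGB(x, y) >> 8) & 0xFF;
    }

    // Blue component b(f, p).
    static int b(BufferedImage f, int x, int y) {
        return f.getRGB(x, y) & 0xFF;
    }

    private PixelComponents() {
    }
}
```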
B. Extract features

The next step in the process is to extract features that can be used to help classify the student's state. Currently a very simple method is used, both to test the system as a whole and to obtain a performance reference; other techniques are also being implemented for comparison. Each captured frame is converted into a binary image using the luminance differences between the current and previous frames:

I(f, p) = 1 if r(f, p) > r(f - 1, p) and g(f, p) > g(f - 1, p) and b(f, p) > b(f - 1, p), otherwise I(f, p) = 0.   (1)

This method is very fast to compute, but it relies on the assumption that the student's face is large compared to the background and that the lighting on the student stays uniform for the period between successive frames. Under these conditions most of the luminance differences between frames can be attributed to student movement.

Region labelling (blob colouring) is used to identify 4-connected components in the binary image. Components with the same region label are grouped. Each group has a size property that can be used to remove noise. The size of a group is defined as the number of pixels belonging to the group:

s(m) = Σ_{k=0}^{w×h} J(k), where J(k) = 1 if x(k) = m and J(k) = 0 otherwise.   (2)

A group of pixels can be removed if its size lies between the two thresholds t1 and t2:

x(p) = -1 for t1 < s(x(p)) < t2.   (3)

In this implementation the thresholds were set to relatively small values (t1 = -1 and t2 = 100) to filter out small changes in the image due to speckle noise caused by the lighting used in the room. This method could prove useful whenever the size range of a group / object to be filtered out is known. Setting the group number of a pixel to -1 filters it out. The results of this extraction method are shown in Fig. 3. Other extraction methods and their effect on the system are currently being investigated.

C. Classify student state

The next step is to classify the student's state based on the importance of the movement captured from the webcam. Two main states are defined: Normal / Unknown and Activity. The level of activity is further specified as Average, Above Average or Below Average. The average activity value is specific to a particular student (i.e. above average movement for one student may be normal activity for another). The Activity state and activity level can be subdivided by specific algorithms to provide more information; possible states are Bored, Interested and Confused, or even Head and Eyebrow movement. Other information, such as the page currently shown on the screen in relation to the page being discussed, can also be used to help estimate the student's state. To test the basic system only the main states are used. Better classifiers are being developed to try to incorporate the above states and to determine whether they are needed or even obtainable. Currently the activity is taken to be directly proportional to the number of groups, defined as

n = max(x(p)).   (4)
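The extraction and grouping steps of equations (1) to (4) might be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes frames are available as java.awt.image.BufferedImage objects of equal size, uses an iterative flood fill for the 4-connected labelling, and reports the number of groups that survive the size filter rather than literally evaluating max(x(p)).

```java
import java.awt.image.BufferedImage;
import java.util.ArrayDeque;
import java.util.Deque;

/** Frame-difference feature extractor sketching equations (1) to (4). */
public class ActivityExtractor {

    private final int t1;   // lower size threshold (exclusive)
    private final int t2;   // upper size threshold (exclusive)

    public ActivityExtractor(int t1, int t2) {
        this.t1 = t1;
        this.t2 = t2;
    }

    /** Returns n, the number of pixel groups left after noise removal. */
    public int activity(BufferedImage previous, BufferedImage current) {
        int w = current.getWidth();
        int h = current.getHeight();

        // Equation (1): binary image I(f, p), one entry per pixel.
        boolean[] binary = new boolean[w * h];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int cur = current.getRGB(x, y);
                int prev = previous.getRGB(x, y);
                boolean r = ((cur >> 16) & 0xFF) > ((prev >> 16) & 0xFF);
                boolean g = ((cur >> 8) & 0xFF) > ((prev >> 8) & 0xFF);
                boolean b = (cur & 0xFF) > (prev & 0xFF);
                binary[y * w + x] = r && g && b;
            }
        }

        // Blob colouring: label 4-connected components, x(p) in the paper.
        int[] label = new int[w * h];           // 0 = unlabelled / background
        int groups = 0;
        int[] groupSize = new int[w * h + 1];   // s(m), equation (2)
        for (int start = 0; start < w * h; start++) {
            if (!binary[start] || label[start] != 0) {
                continue;
            }
            groups++;
            Deque<Integer> stack = new ArrayDeque<>();
            stack.push(start);
            label[start] = groups;
            while (!stack.isEmpty()) {
                int p = stack.pop();
                groupSize[groups]++;
                int px = p % w, py = p / w;
                int[][] neighbours = {{px - 1, py}, {px + 1, py}, {px, py - 1}, {px, py + 1}};
                for (int[] q : neighbours) {
                    if (q[0] < 0 || q[0] >= w || q[1] < 0 || q[1] >= h) continue;
                    int idx = q[1] * w + q[0];
                    if (binary[idx] && label[idx] == 0) {
                        label[idx] = groups;
                        stack.push(idx);
                    }
                }
            }
        }

        // Equations (3) and (4): discard groups whose size lies between t1 and t2,
        // then use the number of surviving groups as the activity measure n.
        int n = 0;
        for (int m = 1; m <= groups; m++) {
            if (!(groupSize[m] > t1 && groupSize[m] < t2)) {
                n++;
            }
        }
        return n;
    }
}
```

With the thresholds quoted above (t1 = -1 and t2 = 100), groups smaller than 100 pixels would be discarded as noise before the activity count is taken.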
Another possibility would be to base the activity on the amount of movement, or on movement in specific parts of the frame such as the eyebrows.

D. Construct and transmit status packet

The state information is used to form a status packet, and the packet is placed in a queue to be transmitted at predefined intervals. The student terminal also combines queued packets at predefined intervals or when the queue is full. All packets include timestamps to help determine the delay on the video transmission from a student to the server. The packets use a format similar to that of an MPEG-7 packet and contain video content descriptions, information on the algorithms used, as well as the priorities and states assigned to the video.
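The exact packet layout is not fixed here beyond its MPEG-7-like structure, so the sketch below only illustrates the kind of fields and queueing behaviour involved; all field names and the queue capacity are assumptions for illustration.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Illustrative status packet and queue for the student terminal. */
public class StatusReporter {

    /** One status report; field names are illustrative only. */
    public static class StatusPacket {
        final long timestampMillis = System.currentTimeMillis(); // for delay estimation
        final String studentId;
        final String state;          // e.g. "ACTIVITY" or "NORMAL"
        final String activityLevel;  // e.g. "ABOVE_AVERAGE"
        final String algorithm;      // which extractor/classifier produced the estimate
        final int priority;          // used by the server to prioritise video requests

        StatusPacket(String studentId, String state, String activityLevel,
                     String algorithm, int priority) {
            this.studentId = studentId;
            this.state = state;
            this.activityLevel = activityLevel;
            this.algorithm = algorithm;
            this.priority = priority;
        }
    }

    private final Queue<StatusPacket> queue = new ArrayDeque<>();
    private final int capacity;

    public StatusReporter(int capacity) {
        this.capacity = capacity;
    }

    /** Queue a packet; flush early if the queue is full. */
    public synchronized void report(StatusPacket packet) {
        queue.add(packet);
        if (queue.size() >= capacity) {
            flush();
        }
    }

    /** Called when the queue fills or at the predefined transmission interval. */
    public synchronized void flush() {
        while (!queue.isEmpty()) {
            send(queue.poll());
        }
    }

    /** Placeholder for the UDP transmission of a (possibly combined) packet. */
    protected void send(StatusPacket packet) {
        // serialise and transmit to the server over UDP
    }
}
```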
Figure 3: Examples illustrating the previous frame (a), current frame (b), binary image of frame difference before noise removal (c), after noise removal (d) and after pixel grouping (e). Different colours indicate different groups.
E. Server analyses received packets and transmits control packet to student terminal

The server receives the status packets from the student terminals and places them in a queue for analysis at predefined intervals. After analysis, the sender of the packet conveying the most relevant information (e.g. above average activity, confused) is requested to start transmitting the delayed video. The request is made in the form of a control packet. The information in the status packets is also combined to give the teacher a summary of estimated class activity. This could take the form of a graph showing the number of students in each defined state and / or pace suggestions using arrows etc. For example, if the whole class suddenly shows more activity, the teacher could try to assess whether there is a reason for it, such as boredom or incomprehension of the teacher's last statement. As the expression recognition techniques become more advanced, information such as interest, boredom or confusion could also be provided to the teacher.

F. Student terminal starts / stops transmission

The student terminal receives the control packet and processes the request: video transmission between the student and teacher terminals is started. The video delay is determined by the round-trip time of the packet from student to server and back. The duration of transmission is determined by the duration of the relevant information or a predefined maximum duration. The information in the packets can also be used to change the processing, combining and status transmission intervals, as well as the frame rate, as a form of transmission flow control similar to that of TCP. For example, when network congestion occurs the system could choose to increase the transmission interval and transmit only frames with relevant motion. As the congestion clears, the transmission threshold is decreased until normal transmission resumes or satisfactory video quality is received (purposefully saving bandwidth). This could increase the perceived quality of the transmission. The frames to be transmitted are encoded using normal low-bandwidth video compression techniques. Video is transmitted using RTP and RTCP, while status data is transmitted over UDP.
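As a rough illustration of the TCP-like flow control described above, the sketch below backs the transmission interval off when a status packet's round trip exceeds a congestion threshold and returns it to normal gradually as congestion clears. The back-off factor, the interval bounds and the congestion signal are assumptions for illustration, not values from the system.

```java
/**
 * Adapts the status/frame transmission interval to observed congestion,
 * loosely following the TCP-style back-off described above.
 */
public class TransmissionScheduler {

    private static final long MIN_INTERVAL_MS = 200;    // normal transmission rate
    private static final long MAX_INTERVAL_MS = 5_000;  // slowest rate under heavy congestion

    private long intervalMs = MIN_INTERVAL_MS;

    /** Call once per round trip with the measured delay of a status packet. */
    public void onRoundTrip(long roundTripMs, long congestionThresholdMs) {
        if (roundTripMs > congestionThresholdMs) {
            // Congestion: back off quickly by doubling the interval.
            intervalMs = Math.min(intervalMs * 2, MAX_INTERVAL_MS);
        } else {
            // Congestion clearing: return to the normal rate gradually.
            intervalMs = Math.max(intervalMs - 100, MIN_INTERVAL_MS);
        }
    }

    /** Current interval between transmissions of frames with relevant motion. */
    public long intervalMs() {
        return intervalMs;
    }
}
```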
IV. RESULTS

Experiments using the number of groups as a classifier [1] resulted in frames / video sequences with head and eyebrow movement being classified as Above Average Activity, and most of these frames were therefore transmitted. The effect of the monitoring process (encoding, transmission and analysis) on the transmission frame rate is also less pronounced than in [1], where the processing was done in series with the transmission, whereas it is now done in parallel in a separate thread. The effect of this change on the accuracy of the estimates must still be established.

V. FUTURE WORK

The student monitoring system will be further enhanced through research on the effect of the different developed protocols, feature extractors and classifiers on system performance with regard to speed of execution, accuracy of estimates, bandwidth utilisation and efficiency of transmission. Different ways of presenting the information, and the usefulness of that information, will also be researched and documented as part of a master's thesis.

VI. ACKNOWLEDGEMENT

This work was supported by the NRF under grant number 2053408.

VII. REFERENCES

[1] M. van der Schyff and H.C. Ferreira, "Improving student feedback in virtual classrooms," Proceedings of the Fourteenth Annual Symposium of the Pattern Recognition Association of South Africa, November 27-28, 2003, Langebaan, South Africa, pp. 101-105.
[2] Hossein Arsham, "Impact of the Internet on Learning and Teaching," Education at a Distance (USDLA Journal), Vol. 16, No. 3, pp. 9-20, March 2002.
[3] Peter Eisert and Bernd Girod, "Analyzing Facial Expressions for Video Conferencing," IEEE Computer Graphics and Applications, Vol. 18, No. 5, pp. 70-78, September/October 1998.
[4] Juergen Luettin, "Automatic Expression Analysis: A Survey," http://old-www.idiap.ch/~fasel/html/rr9919-3/, November 2002.
[5] Atul Puri and Alexandros Eleftheriadis, "MPEG-4: An Object-based Multimedia Coding Standard Supporting Mobile Applications," ACM Mobile Networks and Applications Journal, Special Issue on Mobile Multimedia Communications, Vol. 3, Issue 1, pp. 5-32, June 1998.
[6] José M. Martinez, "MPEG-7: Overview of MPEG-7 Description Tools, Part 2," IEEE Multimedia, Vol. 9, No. 3, pp. 83-93, July-September 2002.
[7] Sun Microsystems, Java Media Framework 2.1 Programmers Guide, http://www.java.sun.com/products/javamedia/jmf/2.1.1/index.html, 2003.
[8] Ahmet Uyar, Wenjun Wu, Hasan Bulut and Geoffrey Fox, "An Integrated Videoconferencing System for Heterogeneous Multimedia Collaboration," 7th IASTED International Conference on Internet and Multimedia Systems and Applications, August 2003, Honolulu, Hawaii.
[9] S. Li, S. Chen and M. Shyu, "A Live TV-Quality Distant Learning Multimedia Presentation System for Education," 34th Annual Hawaii International Conference on System Sciences (HICSS-34), Vol. 4, p. 4047, January 2001, Maui, Hawaii.