10th International Symposium on Image and Signal Processing and Analysis (ISPA 2017)
September 18-20, 2017, Ljubljana, Slovenia
Estimation of Students' Attention in the Classroom From Kinect Features

Janez Zaletelj
University of Ljubljana, Faculty of Electrical Engineering, Ljubljana, Slovenia
Email: [email protected]

Abstract—This paper proposes a novel approach to the automatic estimation of students' attention during lectures in the classroom. The approach uses 2D and 3D features obtained by the Kinect One sensor, characterizing both facial and body properties of a student, including gaze point and body posture. Machine learning algorithms are used to train an attention model, providing classifiers which estimate the attention level of an individual student. Human coding of the attention level is used as training data. The experiment included 3 persons whose attention was annotated over a 4-minute period at a resolution of 1 second. We review the available Kinect features, propose features matching visual attention and inattention cues, and present the results of classification experiments.
I. INTRODUCTION
Automated learning analytics is becoming an important topic in the educational community, which needs effective systems to monitor the learning process and provide feedback to the teacher. Recent advances in visual sensors and computer vision methods have enabled automated monitoring of the behaviour and affective states of learners at different levels, from the university level [1] to the pre-school level [2]. Student affective states such as interested, tired or confused are automatically determined from facial expressions [3], [2], [4], and the attention state is computed from different visual cues such as face gaze, head motion, body posture, etc. [1], [5]. In higher education, estimation of long-term student engagement in the learning process is possible through implicit usage data collected within e-learning environments [6].

Sustained student attention during lectures is recognized as an important factor of learning success [7]; however, tracking the attentive state of individual students in the classroom by means of self-reports is difficult and interferes with the learning process, which is also the case for psychophysical data sensors [8]. Visual observation is a non-intrusive method, and video recordings can be used for manual attention coding, but for long-term observations automatic computer vision methods should be applied.

Non-intrusive visual observation and estimation of affective parameters commonly relies on the recorded video (RGB) signal, for example to estimate student engagement from facial expressions [3], [9], to estimate the mood of children [2], or to estimate a driver's vigilance from head pose [10]. Automatic affect detection methods [4] usually require high-quality images and are applicable to single-person observation, which limits their usability in the classroom setting or constrains the accuracy and complexity of the image analysis [1].
Fig. 1. The concept of utilising an RGB+D sensor (Kinect One) in the classroom to observe the behavior of multiple students during the course of a lecture in order to assess their attention to the lecture.
The introduction of low-cost RGB+Depth sensors aimed at computer games (Microsoft Kinect) sparked considerable research interest in various applications, especially those requiring detection of body posture and gestures. One study [5] utilized two Kinect sensors to record the body motions of the teacher and the student during dyadic learning interactions in order to predict learning performance; however, other capabilities of the Kinect sensor have not yet been applied in the context of learning.

The basic idea of our work is to utilize the advanced capabilities of the Kinect One sensor to unobtrusively collect behavioral data of multiple students while they attend traditional lectures in the classroom. We propose a methodology to compute, from the Kinect data, features corresponding to visually observable behaviors, and to apply machine learning methods to build models which predict the attentive state of individual students.

II. KINECT-BASED STUDENT OBSERVATION SYSTEM
In this section we give an overview of the system used to record and extract the features which are used to predict student attention parameters.

A. Experimental Setup

The goal of our experiments was to record and measure user behavior in the classroom. The Kinect One sensor was
set up to observe three students acting as test persons. Kinect Studio software was used to record the incoming data stream to the hard drive. The recording data rate was over 120 MB/s, so an 8 GB in-memory buffer was allocated to reduce performance problems; if the buffer overfills, incoming data frames are lost. The recorded data was played back by the Kinect Studio software at the original rate, which means that the analysis must keep up with 30 frames per second, otherwise data loss occurs. The data was analysed by Matlab scripts using the methods provided by the Kin2 Toolbox for Matlab, which encapsulates the Microsoft Kinect 2 SDK. The first, real-time pass of the analysis captured the video and skeleton data and stored them on the disk drive. The offline analysis of the extracted data was then performed by Matlab scripts.

B. Kinect One Sensor and Its Outputs

The Kinect One sensor provides different types of output streams. Kinect Studio software allows recording of all sensor output to a file and offline playback of the sensor's output streams. Microsoft provides the Kinect for Windows Software Development Kit (SDK), which is able to extract different types of information from the Kinect data stream, including:
• Color image frames, at the maximum frame rate of 30 frames per second, in Full HD resolution.
• Infrared images with a resolution of 512 by 424 pixels.
• Depth images with a resolution of 512 by 424 pixels, encoding the distance from the sensor.
• Body frames, providing information about the skeletons of persons detected by the Kinect hardware. Each body frame contains up to 6 body elements corresponding to actual persons in the scene. Each body element contains the positions of the 25 body joints in 3D space, the tracking state of each joint, and the state of the left and the right hand. The hand states are Open, Closed, Lasso, NotTracked and Unknown.
Based on the color and depth images, further detectors are implemented which result in the following information characterizing the persons' faces:
• Face boxes, providing image coordinates of the detected frontal faces.
• Face points, providing image coordinates of five facial features: the left and right eye, the nose, and the left and right mouth corner.
• Face rotation, providing an estimate of the face orientation (or head pose) given by yaw, pitch and roll angles, which are given relative to the line from the person to the Kinect sensor.
• Face properties, which provide binary classifications of various face gestures and properties, including Happy, Engaged, WearingGlasses, LeftEyeClosed, RightEyeClosed, MouthOpen, MouthMoved and LookingAway.
• HD Face, a detailed 3D face model composed of 1347 mesh vertices.
• 17 facial animation units (AUs), which measure facial expressions such as mouth open/closed, eyebrow raised/lowered, etc.
C. Kinect Coordinate System and Feature Representation

The Kinect One sensor provides RGB, depth and IR frames, each with its own 2D coordinate system. In color (RGB) space, x = 1, y = 1 corresponds to the top left corner of the image and x = 1920, y = 1080 to the bottom right corner. Similarly, in depth and IR space x = 1, y = 1 is the top left and x = 512, y = 424 the bottom right corner of the image. The Kinect SDK provides methods to map color coordinates to depth coordinates and vice versa, so it is possible to obtain the depth of a specific feature identified in the RGB image. We represent color space coordinates by a vector c = [c_x, c_y] and depth/IR coordinates by d = [d_x, d_y].

Camera space refers to the right-handed 3D coordinate system used by the Kinect One. The origin of the coordinate system (x = 0, y = 0, z = 0) is located at the center of the IR sensor; x grows to the sensor's left, y grows up, and z grows out in the direction the sensor is facing. The unit is 1 meter. The Kinect SDK provides mapping methods between camera space and the 2D depth or color spaces. A 3D point in camera space is represented by a vector p_j = [p_{j,x}, p_{j,y}, p_{j,z}].

The basic Kinect feature is the skeleton of a person's body. A skeleton is given as a set of 25 body joints, each specified by its 3D coordinates in the Kinect camera space. At each time instance t, up to 6 skeletons can be detected, and they are returned in an indexed array. The problem is that the skeleton indexing is not consistent through time: when a person disappears from the scene, the remaining skeletons are assigned different indices (a simple re-indexing sketch is given at the end of this subsection). The k-th skeleton at time t is thus given by 25 body joints, and we denote it as S_k(t) = {p_{1,k}(t), p_{2,k}(t), ..., p_{25,k}(t)}. Joint number 4 gives the 3D head position p_{4,k}, which is of special interest for our gaze processing system. By using the camera-to-color space mapping we obtain the 2D image coordinates of the skeleton's head, c_{4,k} = [c_{4,k,x}, c_{4,k,y}].

We define a world coordinate system to estimate the gaze point location of the individual test persons in the classroom. The origin of the world coordinate system is set at floor level in the left corner of the classroom, with the x axis extending along the slide display area and the whiteboard. The y axis represents the height above the floor, and the z axis extends towards the persons in the classroom. The 3D coordinates of a point in the world (classroom) coordinate system are given by the vector p^0 = [p^0_x, p^0_y, p^0_z].
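A consistent per-person index is needed before per-student feature signals can be accumulated. A minimal Matlab sketch of one possible solution, greedy nearest-neighbour matching of the detected head joints to the previously tracked head positions, is given below; the distance threshold, the function name and all variable names are illustrative assumptions rather than the exact implementation of our system.

function ids = assign_person_ids(headsNow, headsPrev, idsPrev, maxDist)
% ASSIGN_PERSON_IDS  Greedy nearest-neighbour matching of the skeletons
% detected at time t to previously tracked persons, using the 3D head
% joints p_{4,k} in camera space, so that each student keeps a persistent index.
%   headsNow  - K-by-3 head positions of the skeletons detected at time t
%   headsPrev - M-by-3 last known head positions of the tracked persons
%   idsPrev   - M-by-1 persistent person indices of the tracked persons
%   maxDist   - maximum allowed head displacement between frames, e.g. 0.3 m
%   ids       - K-by-1 persistent indices (0 marks an unmatched, new person)
K = size(headsNow, 1);
ids = zeros(K, 1);
used = false(numel(idsPrev), 1);
for k = 1:K
    % Euclidean distances to all previously tracked head positions
    d = sqrt(sum(bsxfun(@minus, headsPrev, headsNow(k, :)).^2, 2));
    d(used) = Inf;                     % every tracked person is matched at most once
    if ~isempty(d)
        [dmin, m] = min(d);
        if dmin < maxDist
            ids(k) = idsPrev(m);       % keep the persistent index
            used(m) = true;
        end
    end
end
end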
D. Human Annotation of Attention and Behavior

The literature lacks a consistent definition of a student's attention in the classroom. We assume that there are several measurable components which influence the overall attention level. It is also important to define the time span over which an attention level is estimated, ranging from an overall attention score over the whole length of the lecture down to micro-attention, which changes every second based on internal mental processes, external input, distractions, etc. We thus started with the question of how to define student attention and which visually observable human behaviors are related to it.
Fig. 2. Relation between the computed features based on Kinect signals and the observable behavioral cues. A high correlation between a computed feature and a behavior is shown as a dotted line. The machine learning process is shown at the bottom of the figure.
Five human coders were assigned the task of estimating the micro-attention level on a scale of 1 to 5 with a time granularity of 1 second while observing video footage of the student. They were given no prior instruction on which visual cues to observe, so that their estimates are not biased by prior information, although this also means they might be uncorrelated. They annotated 260 seconds of video footage of 5 test students. We denote these independent encodings of student attention as visual attention scores A^v_{i,j}(t) ∈ {1, 2, 3, 4, 5}. To derive the mean attention score A^m_i(t) of student i at time t, we removed the minimum and the maximum estimate and calculated the mean of the remaining three visual attention scores. Due to estimation noise from the five human coders, this attention score exhibits fast, second-to-second fluctuations which could not be associated with visual behavioral cues. To regularize the attention score we performed median filtering with a 10-second time window, followed by thresholding to three levels (a sketch of this aggregation is given after the list below). The final reference attention of a student is A^r_i(t) ∈ {1, 3, 5} and provides a human estimate of the current attention level of a student as low, medium or high.

By observing the video footage and the attention score we were able to identify body language, facial expressions and other visual behavioral cues which correspond to each level of estimated attention, as shown in Fig. 2. A high level of attention was associated with observing the slides, writing notes, and the body leaning forward. The writing-notes signal is shown in Fig. 3 b) as a blue line, which is clearly correlated with high levels of attention (red line). A medium level was associated with observing the slides, the body leaning forward, the head supported by a hand, and some hand and finger gestures (rolling a pencil, etc.). A low level of attention was associated with gestures expressing tiredness or boredom, such as leaning back, rubbing the neck, scratching the head, yawning, looking away, etc. These observations represent our underlying model connecting student behavior and the attention level.

We manually annotated the presence of specific behaviors, such as writing, for each of the test students. We annotated the starting and ending times at a resolution of one frame (0.1 second) and computed binary signals representing those actions. The set of reference data includes the following features:
• writing, W_i(t) ∈ {0, 1}, which was annotated when the pencil was writing on paper and the student was observing the notes,
• yawning, Y_i(t) ∈ {0, 1},
• supporting head, S_i(t) ∈ {0, 1}, where one hand is supporting or touching the face,
• person's gaze, G_i(t) ∈ {0, 1, 2, 3}, where the numbers represent four gaze directions of a student (looking away, slides, whiteboard, notes),
• and the observed attention A^o_i(t) ∈ {1, 3, 5}.
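A minimal Matlab sketch of the aggregation of coder scores into the reference attention, assuming the five coder scores of one student are stored as a T-by-5 matrix, is given below; the quantization thresholds are illustrative assumptions, since we do not fix them here.

function Ar = reference_attention(Av)
% REFERENCE_ATTENTION  Aggregate the per-second scores of five coders into
% the 3-level reference attention A^r_i(t) in {1, 3, 5}.
%   Av - T-by-5 matrix of visual attention scores A^v_{i,j}(t), values 1..5
%   Ar - T-by-1 reference attention signal

% 1) Drop the minimum and the maximum coder score at every second and take
%    the mean of the remaining three scores, giving A^m_i(t).
AvSorted = sort(Av, 2);
Am = mean(AvSorted(:, 2:4), 2);

% 2) Regularize A^m_i(t) with a 10-second sliding median filter.
T = numel(Am);
AmSmooth = zeros(T, 1);
for t = 1:T
    w = max(1, t - 5):min(T, t + 4);          % 10-sample window
    AmSmooth(t) = median(Am(w));
end

% 3) Threshold to the three levels {1, 3, 5}; the cut points 2.5 and 4.0
%    are illustrative assumptions.
Ar = ones(T, 1);
Ar(AmSmooth >= 2.5) = 3;
Ar(AmSmooth >= 4.0) = 5;
end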
III. STUDENT ATTENTION CLASSIFICATION
This section outlines the computation of the features used within the proposed automatic student attention measurement system and presents the results of the experimental evaluation.
Fig. 3. Human annotation of the attention level of Student 3. In graph a), the mean attention score A^m_i(t) computed from three coder scores is shown as a blue line, the smoothed score as a magenta line, and the final 3-level reference attention A^r_i(t) as a red line. Graph b) shows the attention score and the annotation of writing.
A. Body Posture Features
Upper body posture was found to be highly correlated with student activities such as observing the slides and writing. The features were computed from the body joints given in 3D camera space, p_{j,k}(t). Leaning forward while observing the slides or writing results in a changed head position in the 3D camera space.

1) Head Displacement: In order to characterize changes in the head position during activities such as writing, we computed the displacement of the current head position p_{4,i}(t) of person i from the mean position over the experiment, mean head position p̄_{4,i}, resulting in the 3D displacement vector D_i(t) = p_{4,i}(t) - p̄_{4,i}.

2) Lean Backward: In order to characterize the overall upper body posture, we calculated the vector from the head to the lower spine, d(t) = p_{4,i}(t) - p_{1,i}(t). We then calculated the lean indicator as the angle of this vector to the vertical coordinate axis y, L_i(t) = arctan(d_z(t) / d_y(t)).

B. Face Gaze Point

The Kinect SDK provides an estimate of the head gaze, given as a vector g_i(t) = [γ_{x,i}(t), γ_{y,i}(t), γ_{z,i}(t)], where γ_{x,i}(t) corresponds to head yaw and γ_{z,i}(t) corresponds to pitch. Using the mean head position in 3D camera space, p̄_{4,i}, and the Kinect sensor position in world space, K^0, we calculate the projection of the head gaze onto the x-y plane in world coordinates, resulting in the 2D world gaze point coordinates P_i(t) = [p^0_{i,x}(t), p^0_{i,y}(t)].
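A minimal Matlab sketch of the body posture and gaze point computation is given below; the sensor pose (position K0 and rotation Rk) and the construction of the gaze ray are simplifying assumptions for illustration, not the exact implementation used in our experiments.

function [D, L, P] = posture_and_gaze(pHead, pSpine, gaze, K0, Rk)
% POSTURE_AND_GAZE  Per-frame body posture and gaze point features.
%   pHead  - T-by-3 head joint p_4(t) in Kinect camera space (meters)
%   pSpine - T-by-3 spine base joint p_1(t) in Kinect camera space (meters)
%   gaze   - T-by-2 head orientation [yaw pitch] in radians
%   K0     - 1-by-3 Kinect sensor position in world (classroom) coordinates
%   Rk     - 3-by-3 rotation from camera space to world space (assumed known
%            from the sensor mounting)
%   D      - T-by-3 head displacement D_i(t) from the mean head position
%   L      - T-by-1 lean indicator L_i(t) in radians
%   P      - T-by-2 gaze point [x y] on the wall plane z = 0 in world coords

% Head displacement: D_i(t) = p_4(t) - mean(p_4)
pMean = mean(pHead, 1);
D = bsxfun(@minus, pHead, pMean);

% Lean indicator: d(t) = p_4(t) - p_1(t), L_i(t) = arctan(d_z(t) / d_y(t))
d = pHead - pSpine;
L = atan2(d(:, 3), d(:, 2));

% Gaze point: cast a ray from the mean head position (in world coordinates)
% along the head gaze direction and intersect it with the wall plane z = 0.
h = (Rk * pMean')' + K0;
T = size(gaze, 1);
P = nan(T, 2);
for t = 1:T
    yaw = gaze(t, 1);  pitch = gaze(t, 2);
    % approximate unit ray; it points towards the wall (-z) when yaw = pitch = 0
    ray = [sin(yaw) * cos(pitch), sin(pitch), -cos(yaw) * cos(pitch)];
    if ray(3) < 0
        s = -h(3) / ray(3);            % distance along the ray to the plane z = 0
        gp = h + s * ray;
        P(t, :) = gp(1:2);             % 2D world gaze point P_i(t)
    end
end
end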
Fig. 4. Feature signals for Student 3 (right) over time (260 seconds): a) 2D gaze point (x and y coordinates in meters), b) 3D head displacement from the average position (x, y, z in meters), c) eyes closed (relative, ranging from 0 to 1), d) face deformation, e) mouth open, f) lean backward. Parts of the signals are missing due to detection failures or recording buffer overruns.
C. Facial Features

The facial features are derived from the 17 animation units computed from the detailed 3D face model. They are used to characterize observable behaviors such as yawning and writing. A logistic function is used to preprocess the original values (a sketch of this preprocessing is given after the items below).

1) Eyes closed: computed from the RightEyeClosed indicator; it corresponds to writing and observing the notes.
2) Mouth open: computed from the JawOpen indicator; it corresponds to yawning.
3) Face deformation: computed from the LeftcheekPuff indicator; it corresponds to supporting the head with the left hand.
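A minimal Matlab sketch of the logistic preprocessing is given below; the gain and midpoint values are illustrative assumptions, as we do not fix them here.

function f = squash_au(au, k, x0)
% SQUASH_AU  Logistic preprocessing of a Kinect HD Face animation unit.
%   au - T-by-1 raw animation unit signal (e.g. JawOpen or RightEyeClosed)
%   k  - logistic gain (steepness); the value 10 below is an assumed example
%   x0 - logistic midpoint; the value 0.4 below is an assumed example
%   f  - T-by-1 preprocessed feature in the range (0, 1)
f = 1 ./ (1 + exp(-k * (au - x0)));
end

% Example usage with hypothetical variable names:
%   eyesClosed = squash_au(rightEyeClosedAU, 10, 0.4);
%   mouthOpen  = squash_au(jawOpenAU, 10, 0.4);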
Fig. 5. Results of attention classification on data set A (3 students) for the Bagged Trees method. The confusion matrix is shown for three levels of attention.
Fig. 7. Accuracy of attention prediction for each test person (data set B). The best overall accuracy across all 6 test persons is achieved by the Simple Tree method.
Fig. 6. Predicted attention level (blue line) and annotated attention level (magenta) for Student 3.
D. Experimental Results

We utilized different sets of the proposed features to train classifiers. Data set A consists of the merged signals of three students, with a total of 780 samples per feature. The attention prediction experiment used a set of computed features composed of body posture (the Lean Back feature), the student's gaze point computed from head orientation and position, and the facial features Eyes closed, Face (cheek) deformation and Mouth open.

We tested a range of simple to complex machine learning algorithms, comparing their overall accuracy in estimating the attention level using 10-fold cross-validation: in each iteration the data set is split into a training and a test set, and the prediction results of all iterations are averaged. The linear discriminant and Simple Tree classifiers achieved an accuracy of 75%. The best accuracy was achieved using Bagged Trees, ranging from 85.0% to 86.9% depending on the method parameters. The confusion matrix for this classifier is shown in Fig. 5. An example of the predicted attention level compared to the reference level is shown in Fig. 6.

The second data set, B, consists of the signals of 6 students. The data split procedure was different: the learning set contained 5 students and testing was performed on the remaining student. A general attention model is thus learned from 5 students and applied to predict the attention of a new student. The resulting accuracies for 7 machine learning methods (Simple Tree, Medium Tree, Coarse KNN, Medium KNN, Weighted KNN, Bagged Trees and Subspace KNN) are shown in Fig. 7. Each point represents the accuracy of attention prediction over 260 samples for a selected test person, using a model built from the data of the remaining 5 persons. The overall accuracy of attention prediction on data set B ranged from 0.61 for the Subspace K-Nearest Neighbor method to 0.69 for the Simple Tree method, which is notably lower than the results for data set A. This indicates probable model over-fitting on data set A, which might happen due to the high temporal correlation of the data samples.
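A minimal Matlab sketch of the two evaluation protocols is given below, assuming the Statistics and Machine Learning Toolbox; the variable names, the use of fitctree as a stand-in for the Simple Tree preset, and all parameters are illustrative rather than the exact configuration used in the experiments.

% X      - N-by-F matrix of per-second features (lean, gaze point, eyes
%          closed, face deformation, mouth open)
% Y      - N-by-1 reference attention labels in {1, 3, 5}
% person - N-by-1 index of the student each sample belongs to

% Data set A protocol: 10-fold cross-validation on the pooled samples.
bagged = fitcensemble(X, Y, 'Method', 'Bag');   % Bagged Trees
cvBag  = crossval(bagged, 'KFold', 10);
accA   = 1 - kfoldLoss(cvBag);                  % overall accuracy

% Data set B protocol: leave-one-student-out evaluation.
students = unique(person);
accB = zeros(numel(students), 1);
for s = 1:numel(students)
    test    = (person == students(s));
    tree    = fitctree(X(~test, :), Y(~test)); % decision tree trained on 5 students
    accB(s) = mean(predict(tree, X(test, :)) == Y(test));
end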
IV. CONCLUSION
In this paper we proposed a model to estimate the attention level of students in the classroom using a set of features computed from the data provided by the Kinect One sensor. Based on visual observation of behavioral cues and their correlation with the human-observed attention level, we derived a set of body, gaze and facial features related to the observed student behavior. Further experiments will be performed on a larger set of persons and over longer periods of time in order to validate the results. The proposed automatic attention classification system has potential use as a tool for automated analytics of the learning process.

REFERENCES
[1] D. Dinesh, A. Narayanan, and K. Bijlani, "Student analytics for productive teaching/learning," in 2016 International Conference on Information Science (ICIS), Kochi, India. IEEE, 2016, pp. 97–102.
[2] N. J. Butko, G. Theocharous, M. Philipose, and J. R. Movellan, "Automated facial affect analysis for one-on-one tutoring applications," in Automatic Face & Gesture Recognition and Workshops (FG 2011), 2011 IEEE International Conference on, 2011, pp. 382–287.
[3] J. Whitehill, Z. Serpell, Y.-C. Lin, A. Foster, and J. R. Movellan, "The faces of engagement: Automatic recognition of student engagement from facial expressions," IEEE Transactions on Affective Computing, vol. 5, no. 1, pp. 86–98, Jan. 2014.
[4] R. A. Calvo and S. D'Mello, "Affect detection: An interdisciplinary review of models, methods, and their applications," IEEE Transactions on Affective Computing, vol. 1, no. 1, pp. 18–37, 2010.
[5] A. S. Won, J. N. Bailenson, and J. H. Janssen, "Automatic detection of nonverbal behavior predicts learning in dyadic interactions," IEEE Transactions on Affective Computing, vol. 5, no. 2, pp. 112–125, Apr. 2014.
[6] C. R. Henrie, L. R. Halverson, and C. R. Graham, "Measuring student engagement in technology-mediated learning: A review," Computers & Education, vol. 90, pp. 36–53, 2015. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0360131515300427
[7] E. F. Risko, N. Anderson, A. Sarwal, M. Engelhardt, and A. Kingstone, "Everyday attention: Variation in mind wandering and memory in a lecture," Applied Cognitive Psychology, vol. 26, no. 2, pp. 234–242, 2012. [Online]. Available: http://dx.doi.org/10.1002/acp.1814
[8] C.-M. Chen, J.-Y. Wang, and C.-M. Yu, "Assessing the attention levels of students by using a novel attention aware system based on brainwave signals," British Journal of Educational Technology, 2015. [Online]. Available: http://dx.doi.org/10.1111/bjet.12359
[9] H. Monkaresi, N. Bosch, R. A. Calvo, and S. K. D'Mello, "Automated detection of engagement using video-based estimation of facial expressions and heart rate," IEEE Transactions on Affective Computing, vol. 8, no. 1, pp. 15–28, Jan. 2017.
[10] N. Alioua, A. Amine, A. Rogozan, A. Bensrhair, and M. Rziza, "Driver head pose estimation using efficient descriptor fusion," EURASIP Journal on Image and Video Processing, vol. 2016, no. 1, Jan. 2016.