Soft Biometrics by Modeling Temporal Series of Gaze Cues Extracted in the Wild

Dario Cazzato²(✉), Marco Leo¹, Andrea Evangelista², and Cosimo Distante¹

¹ National Research Council of Italy – Institute of Optics, Arnesano (LE), Italy
² Faculty of Engineering, University of Salento, Lecce, Italy
[email protected]
Abstract. Soft biometric systems have spread in recent years, both to strengthen classical biometrics and as stand-alone solutions, with application scopes ranging from digital signage to human-robot interaction. Among them, the possibility of considering the temporal evolution of human gaze as a soft biometric has recently emerged, and some works in the literature have explored this exciting research line by using expensive and (perhaps) unsafe devices which, moreover, require user cooperation to be calibrated. To the best of our knowledge, the use of a low-cost, non-invasive, safe and calibration-free gaze estimator to obtain soft-biometric data has not been investigated yet. This paper fills this gap by analyzing the soft-biometric performances obtained by modeling the series of gaze directions estimated by combining head poses and eye pupil locations on data acquired by an off-the-shelf RGB-D device.
1 Introduction
Biometrics is the science of establishing the identity of an individual based on physical, chemical or behavioral attributes of the person. In the literature, several features have been employed to achieve the recognition task, such as palmprint [14], iris [22] or fingerprint [24], as well as DNA, face, retina and so on. The diffusion of large-scale biometric systems in both commercial and government applications has increased researchers' awareness of this technology. As a consequence of this rapid growth, the challenges associated with designing and deploying biometric systems have also been highlighted. Hard biometric systems raise many security and privacy issues, since they are based on personal, physiological and behavioral data that could be stolen and misused [3]. Moreover, they need to process information that is not always accessible, or that is available only by means of intrusive devices, in order to obtain the reliability and precision required by the particular application context. In large-scale identification applications, due to the larger number of comparisons to be performed, these systems may not yet be extremely accurate [11], also because of noise in the data, intra-class variation and the non-universality of the biometrics. To improve reliability, different biometrics can be merged in the same solution: to this aim, the work of [9] formulates the problem of multiple biometrics,
showing the potential improvement achievable by multibiometrics. However, the requirements of such a solution increase in terms of computational needs, as well as of overall intrusiveness. For this reason, the concept of soft biometrics has spread in the literature. In [10], soft biometrics are defined as characteristics that provide some information about the individual, but that lack the distinctiveness and permanence to sufficiently differentiate any two individuals. Examples of soft biometrics estimated by computer vision algorithms are age [7] and gender [1], but also race, height, hair color and face shape are classified as soft biometrics. These features can be merged more easily to provide multiple-label classification [8], or in such a way that a set of biometrics enhances another estimation problem, as in [21]. Among well-consolidated soft biometrics, the idea of gaze analysis as a personal distinctive feature has also been taken into account. The milestone of this research area is a series of works which make use of a head-mounted eye tracker, based on the detection of infrared light reflections, to temporally analyze eye movements during predefined stimuli. The final aims of the data analysis range from the evaluation of student behavioral skills [2] to the identification of users among a set of predefined ones [13,6]. Instead of analyzing eye movements, the more recent work in [4] proposes to consider the temporal evolution of the gaze direction as a soft biometric. That study was based on data acquired by a Tobii 1750¹ remote eye tracker, which is expensive, requires user cooperation for an initial calibration, and employs infrared light concentrated on the eye pupils, whose safety is still under discussion (as demonstrated by the recent updates of the standards required for commercial devices²). The work demonstrated that the gaze direction is able to distinguish among users, but its real applicability is limited by its operating modes and hardware requirements. To the best of our knowledge, the use of a low-cost, non-invasive, safe and calibration-free gaze estimator to obtain soft-biometric data has not been investigated yet. This paper fills this gap by analyzing the soft-biometric performances obtained by modeling the series of gaze directions estimated by an innovative framework that combines head poses and eye pupil locations on data acquired by an off-the-shelf RGB-D device. The gaze estimation approach does not require any initial person-dependent calibration, allows the user to freely rotate the head in the field of view of the sensor, and is insensitive to the presence of eyeglasses, beard or hairstyle. Gaze estimation series are probabilistically modeled using Hidden Markov Models (HMMs) [17], and their suitability for biometric purposes has been demonstrated by acquiring, in the wild, different persons watching a benchmark video. The use of HMMs for the classification of gaze cues is another important contribution of the paper: although HMMs have been largely exploited for the classification of biometric traits [18,16], there is only one work that uses an HMM to build a personalized gaze profile from data acquired by a commercial eye tracker [25].
¹ http://www.tobii.com/
² http://www.tobii.com/en/eye-tracking-research/global/support-and-downloads/faqs/501a0000000zX3c/
The rest of the paper is organized as follows: Section 2 introduces the proposed gaze estimation framework, while Section 3 deals with the use of human gaze as a possible soft biometric. The experimental setup and results are presented in Section 4. Section 5 concludes the paper.
2 Gaze Estimation in the Wild
Fig. 1 gives an overall view of the proposed solution. The proposed gaze estimation method works on depth and RGB images extracted from commercial depth sensors, e.g., Microsoft Kinect³ and ASUS Xtion Pro Live⁴. The acquired data are processed by a multistep approach performing, at first, head pose estimation using both the depth and RGB streams. The head pose estimation algorithm computes the exact position of the head with respect to the sensor, in terms of yaw, pitch and roll angles. Head pose information, integrated with the 3D position of the user, can supply a rough estimation of the human gaze: this can be carried out by computing the intersection between the sensor plane and a straight line whose direction in 3D space is defined by the head pose angles [5]. Unfortunately, any gaze estimation that does not take into account the localization of the eye centers is highly inaccurate [19], especially for some kinds of application. For this reason the proposed approach, as a second step, performs pupil localization on the RGB data, using differential geometry and local self-similarity matching, which assures a suitable detection accuracy even under critical acquisition conditions [15]. The information about the pupils is then used to refine the initial gaze estimation; this is done by computing a correction factor for the angles of the 3D model. For the estimation of the rough gaze, only the 3D position of the eye center⁵ and the head pose angles are used. The gaze track is computed as the straight line passing through the eye center with the direction defined by the available head pose angles, and the rough point of regard (POR) is finally estimated as the intersection of this line with a plane that is vertical with respect to the ground and passes through the center of the sensor. Fig. 2 schematizes this procedural step and helps to understand the underlying equations, which are stated with respect to a right-handed coordinate system (with the origin at the sensor, the z axis pointed towards the user and the y axis pointed up). The depth sensor directly gives the length of the segment $AC$, i.e. the distance between the eye of the observer and the considered target plane. The right-angled triangle $ABC$ can then be completely solved as:

$$AB = \frac{AC}{\cos \omega_y}, \qquad BC = \sqrt{AB^2 - AC^2}.$$
³ www.microsoft.com/en-us/kinectforwindows/
⁴ www.asus.com/Multimedia/
⁵ Only the gaze track passing through one eye is considered, since taking into account both eyes would require additional knowledge about how, for each specific person, the eyes are aligned; moreover, a mutual error compensation scheme should be introduced.
Fig. 1. A block diagram of the proposed solution.
Using the same coordinate system, it is also possible to compute the cartesian equation of the gaze ray, as the straight line passing through the points $A = (x_A, y_A, z_A)$ and $B = (x_B, y_B, z_B)$:

$$r: \quad \frac{x - x_A}{x_B - x_A} = \frac{y - y_A}{y_B - y_A} = \frac{z - z_A}{z_B - z_A} \qquad (1)$$

with $z_A = 0$, since the sensor lies on the target plane. The translations of the observer along the x and y axes with respect to the sensor are easily handled by a posteriori algebraic sums. At this point the rough gaze estimation, i.e. the gaze based only on head pose information, is available, and it has to be refined by integrating the information extracted by the pupil locator. To do that, a 3D geometric model of the eye is introduced [20]; in particular, the used model is defined by the following three parameters:

– Eye Center, derived from the 3D overlapped mask (denoted by $EyeCenter = (x_{EyeCenter}, y_{EyeCenter}, z_{EyeCenter})$);
– Pupil Center, derived from the pupil detection module (denoted by $EyePupilCenter = (x_{EyePupilCenter}, y_{EyePupilCenter}, z_{EyePupilCenter})$);
Fig. 2. Gaze estimation by head pose.
– Eyeball Center, a variable that is not visible to the system, and whose position can only be estimated (denoted by $EyeballCenter = (x_{EyeballCenter}, y_{EyeballCenter}, z_{EyeballCenter})$).

The $EyeballCenter$ parameter is computed first: the eye is modeled as a perfect sphere with an estimated radius of 12 mm [23], since the low camera resolution does not allow more accurate models to be considered. $EyeballCenter$ is then estimated as the point that lies 12 mm behind the eye center, along the direction (meant as the straight line) previously computed. Denoting this value by $Radius_{EB}$, it follows that:

$$EyeballCenter = \begin{cases} x = x_{EyeCenter} \pm |Radius_{EB} \cos\omega_x \sin\omega_y| \\ y = y_{EyeCenter} \pm |Radius_{EB} \cos\omega_y \sin\omega_x| \\ z = Radius_{EB} \cos\omega_x \cos\omega_y + z_{EyeCenter} \end{cases} \qquad (2)$$

where the sign $\pm$ depends on a negative or positive pitch (yaw, respectively) angle in the specified coordinate system. This 3D point represents the center of the sphere of the considered eye. From this point, it is possible to compute, with Equation 1, the straight line that passes through $EyePupilCenter$ and $EyeballCenter$. Thus, the $x_{tp}$ and $y_{tp}$ coordinates on the target plane $tp$ (with $z = 0$) are:

$$\begin{cases} x_{tp} = \dfrac{z_{EyeballCenter}}{z_{EyeballCenter} - z_{EyePupilCenter}} \, (x_{EyePupilCenter} - x_{EyeballCenter}) + x_{EyeballCenter} \\[2mm] y_{tp} = \dfrac{z_{EyeballCenter}}{z_{EyeballCenter} - z_{EyePupilCenter}} \, (y_{EyePupilCenter} - y_{EyeballCenter}) + y_{EyeballCenter} \end{cases} \qquad (3)$$
This is the finer gaze estimation that is the outcome of this procedural step. Fig. 3 gives an overall view of the proposed method: the straight line r (replicated, for the reader's clarity, by the parallel straight line r′ passing through the nose)
represents the estimation obtained using head pose information only. This information is used to estimate the 3D coordinates of $EyeballCenter$, and thus to use the new gaze ray to infer the user's point of regard. Note that all the key parameters of the method have been deliberately enlarged in the figure for the sake of clarity.
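To make the two-step geometry concrete, the following minimal sketch (not the authors' code; all function names are illustrative) computes the rough POR from head pose only and the refined POR of Equations 2 and 3. It assumes angles in radians, coordinates in meters in the right-handed sensor frame described above, the target plane at z = 0, and one plausible reading of the sign conventions.

```python
import numpy as np

EYEBALL_RADIUS = 0.012  # 12 mm spherical eye model [23]

def rough_por(eye_center, omega_x, omega_y):
    """Rough point of regard: intersect the head-pose ray through the eye
    center with the target plane z = 0 (Equation 1). omega_x is the pitch
    angle, omega_y the yaw angle (one plausible direction convention)."""
    direction = np.array([np.sin(omega_y) * np.cos(omega_x),
                          np.sin(omega_x),
                          -np.cos(omega_y) * np.cos(omega_x)])
    t = -eye_center[2] / direction[2]  # ray parameter at which z = 0
    return eye_center + t * direction

def eyeball_center(eye_center, omega_x, omega_y):
    """Place the eyeball center 12 mm behind the eye center along the
    head-pose direction (Equation 2); the +/- rule is read here as
    following the signs of the yaw and pitch angles (an assumption)."""
    r = EYEBALL_RADIUS
    x = eye_center[0] + np.sign(omega_y) * abs(r * np.cos(omega_x) * np.sin(omega_y))
    y = eye_center[1] + np.sign(omega_x) * abs(r * np.cos(omega_y) * np.sin(omega_x))
    z = eye_center[2] + r * np.cos(omega_x) * np.cos(omega_y)
    return np.array([x, y, z])

def refined_por(eyeball, pupil):
    """Finer point of regard: intersect the ray from the eyeball center
    through the pupil center with the plane z = 0 (Equation 3)."""
    t = eyeball[2] / (eyeball[2] - pupil[2])
    return eyeball + t * (pupil - eyeball)

if __name__ == "__main__":
    eye = np.array([0.03, 0.00, 0.60])       # an eye 60 cm from the screen
    print("rough POR:", rough_por(eye, omega_x=0.05, omega_y=-0.10))
    ball = eyeball_center(eye, omega_x=0.05, omega_y=-0.10)
    pupil = np.array([0.028, 0.002, 0.588])  # detected pupil (illustrative)
    print("fine POR :", refined_por(ball, pupil))
```

Note that Equation 3 is simply the line-plane intersection of Equation 1 specialized to the ray through $EyeballCenter$ and $EyePupilCenter$.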
Fig. 3. Overall view of the gaze estimation model.
3 Gaze as Soft-Biometrics
This section describes how the gaze outcomes have been used as soft biometrics, i.e. to build a temporal pattern that can be exploited to recognize a target person. In order to represent the real intersection point with the environment and to carry out experimental tests, coordinates are normalized to image-plane coordinates with the following generic formula, valid for both the x and y coordinates of the image plane:

$$c_{norm} = \frac{c - bound_{low}}{bound_{upp} - bound_{low}} \cdot size(I) \qquad (4)$$

where $bound_{upp}$ and $bound_{low}$ are the two bounds, in meters, of the space, and $size(I)$ is the width (or height, depending on the coordinate under consideration) expressed in pixels. After that, a mapping from continuous to discrete outcomes is performed: this helps address distortions in the gaze estimation process. The aforementioned mapping is achieved in practice by partitioning the target scene (i.e. a screen) into a suitable number of regions labeled with predefined symbols (e.g., numbers or letters). This way, every time a subject is observed, a series of symbols is associated with his gaze behavior, and it represents the pattern to be modeled for biometric purposes. To this end, an HMM is associated with each involved subject, obtaining in this way an HMM bank. Each HMM is then trained using one or more series associated with the subject. After training, the learned HMMs are used to get the likelihoods that a given unknown input series belongs to each of the subjects in the training group. The maximum likelihood score
is then considered, and the label corresponding to the subject associated with the winning HMM is finally attached to the unknown series under investigation. The procedural steps described above are summarized in Fig. 4.
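As a concrete illustration of the normalization and symbol-mapping steps, the sketch below uses the setup of Sect. 4, where the screen is a 5 × 3 grid of 512 × 360 px cells labeled 1–15; the metric bounds and the top-left labeling direction are assumptions derived from the 70 × 22.5 cm screen, not taken from the paper.

```python
import numpy as np

def normalize(c, bound_low, bound_upp, size_px):
    """Equation 4: map a metric coordinate to image-plane pixels."""
    return (c - bound_low) / (bound_upp - bound_low) * size_px

def gaze_to_symbol(x_m, y_m,
                   screen_px=(2560, 1080),      # 5*512 x 3*360 pixels
                   bounds_x=(-0.35, 0.35),      # 70 cm wide screen
                   bounds_y=(-0.1125, 0.1125),  # 22.5 cm tall screen
                   cols=5, rows=3):
    """Map a metric gaze point to a grid label in 1..cols*rows; hits outside
    the screen are clamped to the closest pixel, as described in Sect. 4."""
    w, h = screen_px
    px = np.clip(normalize(x_m, *bounds_x, w), 0, w - 1)
    py = np.clip(normalize(y_m, *bounds_y, h), 0, h - 1)
    col = int(px // (w / cols))
    row = int(py // (h / rows))
    return row * cols + col + 1

# A session becomes the symbol series that is fed to the HMM bank:
series = [gaze_to_symbol(x, y) for x, y in [(0.0, 0.0), (-0.30, 0.10), (0.5, 0.0)]]
print(series)  # -> [8, 11, 10]; the last hit is clamped to the right edge
```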
Fig. 4. The performed Soft-Biometrics authentication by HMM.
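One possible realization of the HMM bank of Fig. 4 is sketched below using hmmlearn's CategoricalHMM (assumed version ≥ 0.3); the paper does not name a library, so this is only one way to implement the enroll/identify steps, and the 16 hidden states anticipate the model selection of Sect. 4.

```python
import numpy as np
from hmmlearn.hmm import CategoricalHMM  # assumes hmmlearn >= 0.3

N_STATES = 16  # chosen by the model selection described in Sect. 4

def train_bank(series_per_subject):
    """Train one HMM per enrolled subject on his/her symbol series
    (symbols are the grid labels 1..15, shifted to 0-based for hmmlearn)."""
    bank = {}
    for subject, series_list in series_per_subject.items():
        X = np.concatenate(series_list).reshape(-1, 1) - 1
        lengths = [len(s) for s in series_list]
        model = CategoricalHMM(n_components=N_STATES, n_iter=50,
                               random_state=0)
        bank[subject] = model.fit(X, lengths)
    return bank

def identify(bank, series):
    """Attach to an unknown series the label of the best-scoring HMM
    (maximum log-likelihood over the bank)."""
    X = np.asarray(series).reshape(-1, 1) - 1
    return max(bank, key=lambda subject: bank[subject].score(X))
```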
4 Experimental Results
In order to evaluate the proposed framework, the experimental environment in Fig. 5 has been set up. In particular, a screen of dimensions 70 × 22.5 cm was used. Each user was asked to sit on a chair at a distance of 60 cm from the screen, and the depth sensor was placed on top of the screen, pointing at the user. A Microsoft Kinect sensor was employed to acquire the input data. The screen was split into 15 equal rectangles of 512 × 360 pixels each, labeled from 1 to 15, as shown in Fig. 6. For each gaze point, the rectangle to which it belongs is stored, so that a series of symbols is collected for each test session. All the hits lying outside the screen were automatically replaced with the coordinates of the closest pixel on the screen, and thus associated with the respective rectangle. To map world coordinates into screen coordinates, Equation 4 (which refers to the center of a generic rectangular target) was slightly modified by considering a translation parameter along the y axis equal to the distance between the actual sensor position and the center of the screen. Experiments were carried out on 7 participants who were asked to look at the same video while the system
Fig. 5. The realized environmental setup.
Fig. 6. The used subdivision of the watching area in an example frame of the benchmark video.
recorded and stored their gaze tracks. For each person, 5 different acquisition sessions were performed. The sessions were carried out on different days (spread over 15 days), experiencing different lighting conditions and without imposing any constraint on the users in terms of head rotations, beards or eyeglasses. During each session the same benchmark video ran on the screen: the video shows an F1 car test session and presents a large variability in the position of the relevant objects (the cars) on the screen, due also to frequent changes of camera view. The video is coupled with the audio of the engines, an external feature that could somehow influence the users' behavior. The video is available at [26]. Fig. 7 shows three frames of the selected video. The video lasts 199 seconds, but only the first 170 seconds have been considered. The video has a resolution of 1280 × 720 pixels and a frame rate of 25 fps, and it was shown at
Fig. 7. Three frames of the benchmark video.
fullscreen resolution on a screen of 2560 × 1080 pixels⁶. This causes a black border of 7.7 cm on both the left and the right side of the screen, which was anyway recorded. The synchronization between the gaze estimator and the actual frame shown on the screen was achieved by means of the timing functions of the operating system⁷. For each gaze estimation, the x and y coordinates, computed as described in Equations 3 and 4, were stored. As expected, the gaze was not estimated in every frame of the video, due to computational delay and missed detections of the face/pupil. Missing data were then filled in by Kalman filter predictions [12]. On average, an actual estimate was available every 2.89 frames. In this way, a vector of 1 × 4252 components has been created for each session of each participant. In order to evaluate recognition accuracy, a label from 1 to 7 was manually associated with each participant. The gaze estimation system has been implemented in the Microsoft Visual C++ development environment, with RGB and depth images taken at a resolution of 640 × 480 pixels, at 30 fps. In order to discover whether the proposed framework is suitable for biometric purposes, the analysis of the acquired data was performed as follows. First of all, in order to evaluate the optimal number of hidden states for the HMMs, different configurations (from 2 to 24 states) were considered and trained using the first series of observations of each involved subject. Then the output likelihoods (when the same series is given as input) were added together, and the configuration maximizing the achieved score was retained for the subsequent experiments. Fig. 8 shows the log-likelihood values while varying the HMM configuration. As can be observed, the best value was obtained with a configuration of 16 hidden states. In the second experimental phase, the bank of 7 HMMs with 16 hidden states was exploited for soft-biometric identification. In order to evaluate how accurately the predictive model performs in practice, the experiment was carried out on the available data by performing N times (N = 10) a k-fold (k = 5) cross-validation. In this way, each example was classified 10 times, using 10 different training sets built at each iteration from the 30 examples which did not belong to the same fold as the test examples. The relative confusion matrix is reported in Table 1, from which the following performance figures can be extracted:
⁶ Philips 298P4.
⁷ Microsoft Windows 8.1.
$$precision_M^{k\text{-}fold} = 0.739; \qquad recall_M^{k\text{-}fold} = 0.737.$$

These data are very encouraging, and they demonstrate that it is possible to find a strong relationship between the identity of a person and his or her gaze behavior during the projection of a given video on a screen in front of him or her. This study demonstrated that this holds even when the gaze behavior is recorded by a non-invasive, calibration-free and low-cost system such as the proposed one.
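For completeness, the repeated k-fold protocol described above can be sketched as follows; this is illustrative code, reusing the hypothetical `train_bank`/`identify` helpers from the sketch in Sect. 3 and scikit-learn's RepeatedStratifiedKFold, so that with 5 sessions per subject every fold holds out one session per person.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# train_bank / identify: the hypothetical helpers sketched in Sect. 3.
from gaze_hmm_bank import train_bank, identify

def evaluate(series_list, labels, n_subjects=7, k=5, n=10):
    """Accumulate a confusion matrix over N repetitions of k-fold CV."""
    labels = np.asarray(labels)  # subject labels, 1..7, one per series
    cm = np.zeros((n_subjects, n_subjects), dtype=int)
    cv = RepeatedStratifiedKFold(n_splits=k, n_repeats=n, random_state=0)
    for train_idx, test_idx in cv.split(np.zeros(len(labels)), labels):
        per_subject = {}
        for i in train_idx:  # examples outside the test fold
            per_subject.setdefault(labels[i], []).append(series_list[i])
        bank = train_bank(per_subject)
        for i in test_idx:   # one held-out series per subject
            cm[labels[i] - 1, identify(bank, series_list[i]) - 1] += 1
    return cm
```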
Fig. 8. Log-likelihood values for different numbers of HMM hidden states.

Table 1. Confusion matrix using the proposed framework and k-fold cross-validation (rows: true labels; columns: estimated labels p#1–p#7, with empty cells denoting zero; each row sums to 50, i.e. 5 sessions × 10 classifications, and the count of correct classifications is the diagonal entry).
p#1: 37 2 5 2 1 3
p#2: 41 1 1 4 3
p#3: 4 1 40 4 1
p#4: 3 35 6 6
p#5: 3 1 8 34 4
p#6: 5 5 40
p#7: 6 7 6 31
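The macro-averaged figures reported above follow directly from Table 1; a small numpy sketch of the computation (with rows as true labels and columns as estimates) is:

```python
import numpy as np

def macro_metrics(cm):
    """Macro-averaged precision and recall from a confusion matrix whose
    rows are true labels and columns are estimated labels."""
    diag = np.diag(cm).astype(float)
    recall = float(np.mean(diag / cm.sum(axis=1)))     # diag / row sums
    precision = float(np.mean(diag / cm.sum(axis=0)))  # diag / column sums
    return precision, recall

# Each row of Table 1 sums to 50 (5 sessions x 10 classifications), so the
# per-class recalls are 37/50, ..., 31/50, whose mean is recall_M ~= 0.737.
```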
5 Conclusions
This work introduced a preliminary study to evaluate the biometric identification of individuals on the basis of data acquired by a low-cost, non-invasive, safe and calibration-free gaze estimation framework, consisting of two main components, conveniently combined, that perform the user's head pose estimation and eye pupil localization on data acquired by an RGB-D device. Experimental evidence of the feasibility of using the proposed framework for soft biometrics has been given on a set of users watching a benchmark video in an unconstrained environment. Future work will validate the system through comparisons with a commercial eye tracker, in order to provide quantitative measurements of the system's reliability. Moreover, the possibility of combining the information from both eyes in order to enhance the system will be evaluated.
References
1. Bekios-Calfa, J., Buenaposada, J.M., Baumela, L.: Robust gender recognition by exploiting facial attributes dependencies. Pattern Recognition Letters 36, 228–234 (2014)
2. Busjahn, T., Schulte, C., Sharif, B., Begel, A., Hansen, M., Bednarik, R., Orlov, P., Ihantola, P., Shchekotova, G., Antropova, M., et al.: Eye tracking in computing education. In: Proceedings of the Tenth Annual Conference on International Computing Education Research, pp. 3–10. ACM (2014)
3. Campisi, P.: Security and Privacy in Biometrics. Springer (2013)
4. Cantoni, V., Galdi, C., Nappi, M., Porta, M., Riccio, D.: GANT: Gaze analysis technique for human identification. Pattern Recognition 48(4), 1027–1038 (2015)
5. Cazzato, D., Leo, M., Distante, C.: An investigation on the feasibility of uncalibrated and unconstrained gaze tracking for human assistive applications by using head pose estimation. Sensors 14(5), 8363–8379 (2014)
6. Deravi, F., Guness, S.P.: Gaze trajectory as a biometric modality. In: BIOSIGNALS, pp. 335–341 (2011)
7. Fernández, C., Huerta, I., Prati, A.: A comparative evaluation of regression learning algorithms for facial age estimation. In: Ji, Q., Moeslund, T.B., Hua, G., Nasrollahi, K. (eds.) FFER 2014. LNCS, vol. 8912, pp. 133–144. Springer, Heidelberg (2015)
8. Guo, G., Mu, G.: Joint estimation of age, gender and ethnicity: CCA vs. PLS. In: 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), pp. 1–6. IEEE (2013)
9. Hong, L., Jain, A.K., Pankanti, S.: Can multibiometrics improve performance? In: Proceedings AutoID, vol. 99, pp. 59–64. Citeseer (1999)
10. Jain, A.K., Dass, S.C., Nandakumar, K.: Soft biometric traits for personal recognition systems. In: Zhang, D., Jain, A.K. (eds.) ICBA 2004. LNCS, vol. 3072, pp. 731–738. Springer, Heidelberg (2004)
11. Jain, A.K., Ross, A., Prabhakar, S.: An introduction to biometric recognition. IEEE Transactions on Circuits and Systems for Video Technology 14(1), 4–20 (2004)
12. Kalman, R.E.: A new approach to linear filtering and prediction problems. Journal of Fluids Engineering 82(1), 35–45 (1960)
13. Kasprowski, P., Komogortsev, O.V., Karpov, A.: First eye movement verification and identification competition at BTAS 2012. In: 2012 IEEE Fifth International Conference on Biometrics: Theory, Applications and Systems (BTAS), pp. 195–202. IEEE (2012)
14. Kumar, A., Wong, D.C., Shen, H.C., Jain, A.K.: Personal verification using palmprint and hand geometry biometric. In: Kittler, J., Nixon, M.S. (eds.) AVBPA 2003. LNCS, vol. 2688, pp. 668–678. Springer, Heidelberg (2003)
15. Leo, M., Cazzato, D., De Marco, T., Distante, C.: Unsupervised eye pupil localization through differential geometry and local self-similarity matching. PLoS ONE 9(8), e102829 (2014)
16. Parris, E.S., Carey, M.J.: Language independent gender identification. In: Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 1996), vol. 2, pp. 685–688. IEEE (1996)
17. Rabiner, L.: A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77(2), 257–286 (1989)
18. Roy, A., Halevi, T., Memon, N.: An HMM-based behavior modeling approach for continuous mobile authentication. In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3789–3793. IEEE (2014)
19. Stiefelhagen, R., Zhu, J.: Head orientation and gaze direction in meetings. In: CHI 2002 Extended Abstracts on Human Factors in Computing Systems, pp. 858–859. ACM (2002)
20. Sun, L., Liu, Z., Sun, M.T.: Real time gaze estimation with a consumer depth camera. Information Sciences (2015)
21. Wang, X., Ly, V., Lu, G., Kambhamettu, C.: Can we minimize the influence due to gender and race in age estimation? In: 2013 12th International Conference on Machine Learning and Applications (ICMLA), vol. 2, pp. 309–314. IEEE (2013)
22. Wildes, R.P.: Iris recognition: an emerging biometric technology. Proceedings of the IEEE 85(9), 1348–1363 (1997)
23. Xiong, X., Cai, Q., Liu, Z., Zhang, Z.: Eye gaze tracking using an RGBD camera: a comparison with a RGB solution. In: Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct Publication, pp. 1113–1121. ACM (2014)
24. Ye, Z., Mohamadian, H., Ye, Y.: Information measures for biometric identification via 2D discrete wavelet transform. In: IEEE International Conference on Automation Science and Engineering (CASE 2007), pp. 835–840. IEEE (2007)
25. Yoon, H.J., Carmichael, T.R., Tourassi, G.: Gaze as a biometric. In: SPIE Medical Imaging, pp. 903707–903707. International Society for Optics and Photonics (2014)
26. YouTube: Video (2015). https://www.youtube.com/watch?v=KnJtsZzMJ8Y