Motion Based Human Identification Using Histograms of Optical Flow
Janez Perš, Matej Kristan, Matej Perše, and Stanislav Kovačič
Faculty of Electrical Engineering, University of Ljubljana
[email protected]

Abstract

In this paper, we deal with the problem of extraction and compact representation of human motion from video sequences. Our application domain is video-based identification of a person who is performing card and fingerprint authentication at a control point equipped with a video surveillance camera. We aim to identify a person solely from the characteristics of his or her motion. To achieve this, optical flow is calculated across several seconds of video recording. Then, each image is divided into six segments, and a two-dimensional histogram (direction/amplitude) of optical flow is obtained for each region in each image of the sequence, capturing the dominant motion in the region as its maximum. Changes in the histograms of each segment are observed over time, and a corresponding symbol is generated whenever the position of the maximum in any of the six histograms changes between neighboring images. This way, six symbol sequences are generated for each video recording. They represent a unique motion pattern, which encodes the person's motions while performing the authentication and can be compared to other patterns in the database. By cross-comparing symbol sequences of 112 real-world video recordings of 5 persons, we established that our method yields the correct identification in 82% of the cases.
1
Introduction
Biometric identification has a long history, dating back more than 100 years to when fingerprints were introduced into forensic evidence. In addition to fingerprints, there are many biometric features that can be used for verification of a subject's identity, for example the face, gait, hand geometry, iris, retina, voice, keystroke dynamics or facial thermogram. In this paper, we investigate the possibility of developing a motion-based biometric sensor to increase security at control points in high-security applications. Our main contribution is a method to extract compact representations of human motion, which can be used in the identification process. Human motion has already been used for identification in biometric gait analysis [5, 12, 3, 8]. The characteristics of human gait depend on the physical characteristics of a particular person. Therefore, in the case of gait analysis, motion only conveys information about the physical characteristics of a particular person in a way that can be observed from a distance using computer vision techniques. Our approach is different. Our main assumption is that
people learn to perform routine tasks in a similar way, yet the way they do it differs from person to person. By properly representing the motion of a person, we can build a compact representation of his or her activity at the time of video recording. We rely on the fact that the observed activity is in some way constrained (for example, people are required to do the same task whenever identification is required). It would then be possible to obtain one's identity by comparing the compact representation of his or her motion with the representations stored in the database. Without constraints, this is a difficult problem. In everyday life, human motion is determined by a wide range of tasks, and there are very few common denominators when the motion of two or more individuals is compared. However, there are cases where a certain common activity is required of people, and one of these cases is the "classic" authentication procedure at a control point, where a keycard, a PIN code or a fingerprint may be required from a person wishing to enter a secure area. Without performing the required tasks (showing the card, entering the PIN or presenting the fingerprint), the person will be denied access. Our approach is therefore based on the assumption that people develop their own way of approaching conventional access control systems. We assume that the developed pattern of motion remains constant over time, and that the activity during authentication differs significantly between different people. We do not aim to develop a self-contained motion-based biometric identification system. Our intention is to lay the foundations for the development of a motion-based biometric "soft sensor", i.e., a device which would provide additional complementary information to a biometrics-based access control system in a multi-modal setup [11], but would not exclusively decide whether the submitted identity of a person is the correct one.
In the remainder of the paper, we first discuss why this approach is feasible and present some limited evidence that people indeed develop their own "motion patterns" when dealing with conventional access control systems. Then, we present the relevant related work, with a focus on human detection, gesture recognition and activity recognition. Next, we present our approach to generating compact descriptions of human motion using histograms of optical flow. Finally, we present the results obtained by cross-comparing a database of 112 recordings of five different people.
2
Human motion and behavior as biometric feature
The suitability of the biometric features that are in wide use today has been extensively researched, even before their computerized use was considered. Since we are proposing a new biometric feature, two separate issues have to be addressed:
• Is human motion by itself a suitable biometric feature under the specified conditions, regardless of the actual means of observing and evaluating it?
• If human motion is a suitable biometric feature, which computer vision algorithms do we use for best performance, and how do we generate compact representations for storage in the database of motion/behavior patterns?
2.1
Suitability
Regarding the first problem, Yun [14] specified seven criteria that can be used to assess the suitability of a certain biometric technology and its corresponding features:
• Universality indicates how commonly a certain feature is found in the population.
• Uniqueness indicates how well the feature separates one person from another.
• Permanence indicates how well the feature resists the effects of aging.
• Collectability indicates how easily the feature is acquired for processing.
• Performance indicates the achievable accuracy, speed and robustness of the feature.
• Acceptability indicates the degree of acceptance of the technology by the public in their daily life.
• Circumvention indicates the level of difficulty of circumventing or fooling the system into accepting an impostor.

Universality of human motion is self-evident: provided that we can develop a method for extracting the motion features of one person, such an approach can easily be extended to the whole population. Uniqueness is by far the most difficult problem, as there is no guarantee that different people will exhibit different motion patterns when attempting to gain access using the access control system. Permanence is another problem, as there is no guarantee that the same person will exhibit similar motion in two different authentication attempts, let alone over longer periods of time. In our context, collectability is closely related to our ability to properly capture motion and generate compact motion representations using computer vision algorithms. Performance can be assessed after the vision system is built and tested on a proper database of videos. Acceptability in our case relies on the acceptability of video-based surveillance, and since video surveillance is widespread, we assume this is no longer an issue. The problem of circumvention is not as troublesome as it may appear at first glance: it is true that human behavior can be learned and mimicked, but for positive identification an impostor would need both to trick the "conventional" access control system with its conventional biometric sensors and, in that process, also to mimic the behavior and motion of the chosen person, which should be considerably more difficult than tricking just one of the systems. Additionally, since the process of motion acquisition is non-intrusive, the impostor would not automatically be aware that there is an additional mechanism that verifies motion and behavior before accepting the submitted identity. As we can see, the two most important remaining issues are those of uniqueness and permanence. To check whether humans really exhibit unique and permanent behavior, we collected a database of video recordings and inspected it to assess those two factors before proceeding with the development of the computer vision methods.
2.2

Test database

The test database was recorded at the entry point to the computer vision laboratory, where a "conventional" access control system requires users to identify themselves with their keycard and then confirm their identity by submitting their fingerprint. All recordings were made during day-to-day operation of the security system; no recordings were staged. The layout of the sensors, including the camera, is shown in Figure 1.

Figure 1: Control point with locations of the sensors (camera, fingerprint scanner, keycard reader, door).
All participating users were aware of the recording, as they had to sign a consent form before the experiment was started. The camera (a 640×480-pixel color IEEE 1394 camera) was rotated by 90 degrees to make better use of the available resolution. Clips ranging from 8 to 10 seconds were recorded using motion-detection software to conserve disk space. The recording system was not connected to the access control system, and due to shortcomings of the motion-detection scheme, many recordings missed critical elements of the activity and had to be deleted. After a review of the videos, 112 complete video clips were selected and manually categorized. Additionally, the videos were temporally aligned with respect to the moment when the person's keycard came closest to the keycard reader.
2.3
Evaluation of suitability
After the database has been collected, the videos were visually inspected to evaluate the uniqueness and permanence of
motion and activity. We observed that people indeed developed unique ways of approaching the system; moreover, under the same circumstances they repeatedly performed the same sequence of motions to perform the authentication. This rule was broken mainly under the influence of other factors, such as carrying additional objects, tailgating (entry of multiple persons), the presence of other people distracting the person who was performing the authentication, and other unusual activities (e.g., leaving the lab door open to return without authentication). It was established that human motion during authentication is heavily dependent on the objects people carry. For example, a person entering the lab without carrying anything would perform the authentication with one sequence of motions, but when carrying an object, his or her motion would be different and would usually include shifting the object from one hand to the other. Nevertheless, if the object is commonly carried by a person, we expect that the person would again develop a unique way of authenticating while carrying it. That would allow both personal identification and, additionally, identification of the object that the person was carrying at the time of authentication. Our examination of the database has shown that this assumption is not entirely unfounded. For this effect to be taken into account, we had to categorize the videos both according to subject identity and subject activity (e.g., person X carrying a bag, person Y carrying a notebook, etc.). To illustrate the typical motion patterns, Figure 2 shows frames from multiple recordings of different people.
3
Related work
Most approaches to human identification by motion have focused on the problem of recognizing humans by observing human gait [5, 12, 3, 8]. Human gait is essentially considered the motion of a person's legs, while some researchers [3] include the motion of the arms in their gait recognition schemes as well. However, human gait is not just any motion of the extremities; it is specifically the motion due to human locomotion (walking, running). The context of locomotion in essence normalizes the observed activity: there are many things people can do with their legs and arms, but there are only a few ways a person can walk or run, and the constraints induced by narrowing the context (such as the assumption that gait is periodic) help significantly in the task of gait-based human identification. In our case, the problem is narrowed down by introducing the context of (classic) authentication. This is achieved by observing only people who wish to enter a secured area and have to authenticate using some other authentication method. Therefore, while there is a wide range of motions people could perform with their bodies, just a handful of activities make sense if one wishes to verify his or her identity using a keycard and a fingerprint, for example. Our task of motion-based human recognition is closely related to gesture and activity recognition from images or videos. In general, while activity recognition aims to identify activities, we aim to identify people. Recently published activity recognition algorithms use a variety of methods. Zelnik-Manor and Irani [15], for example, represent image sequences as three-dimensional (spatiotemporal) stacks and perform statistical analysis to detect activity boundaries and activity types. Black, Yacoob et al. [1] used parametric models of optical flow across the image to estimate facial and limb motion and to recognize facial expressions. Yacoob and Davis [13] tracked specific regions on the human face and translated them into symbols using a dictionary of universal expressions. Dai et al. [4] extracted facial action features by observing histograms of optical flow for the lower and upper regions of the face. In 2003, Carlsson [2] demonstrated that walking people can be recognized from features derived by tracking a small number of specific points on the human body. He achieved a 95% recognition rate on a small database of 20 recordings of six different persons. Finally, there are approaches to activity recognition that do not rely on tracking or motion estimation. Lu and Little [9] used Histograms of Oriented Gradients (HOG) descriptors to successfully track and recognize the activity of hockey players; however, the activity recognition was based on the output of the HOG descriptor, not on the tracking results.
4
Methods
In this section, we describe the low-level methods that extract motion, and the high-level methods that represent that motion as a compact sequence of symbols. Each step of the processing pipeline has one or more parameters that affect the behavior of the algorithm; for convenience, a summary of parameter values is provided in Table 1.
4.1
Scene partitioning
To isolate the motion of different parts of the human body, a partitioning scheme was devised. The scene was partitioned into six regions, as shown in Figure 3. To ensure the regions were square, some of the visible area was cropped out. The partitioning into 2×3 regions was chosen as it corresponds to the natural shape of the human body: the upper two regions capture the motion of the head, the middle two the motion of the arms and torso, and the bottom two the motion of the lower parts of the body. A smaller number of regions provides a coarser description, but requires less accurate placement of the human body. Note that partitioning did not affect the image data, as it was applied after the optical flow calculation, to the optical flow data only.
4.2
Histograms of optical flow (HOFs)
Optical flow is one of the basic methods for extracting motion from image sequences. If properly estimated, the motion of every pixel in the image is known. It has often been used in human motion analysis applications, for example [4, 1]. Optical flow has a distinct advantage over other approaches that rely on analyzing pixel brightness: the burden of correctly estimating motion is confined to the optical flow calculation. Properly calculated optical flow is robust to cluttered background and to variability in clothing and skin color. For these reasons, we used optical flow to estimate motion. In our preliminary study, the algorithm for dense estimation of optical flow developed by Black and Anandan [6]
Motion Based Human Identification Using Histograms of Optical Flow
(Figure 2 image strips, four clips each: Person 1, normal entry; Person 2, normal entry; Person 3, normal entry; Person 1, entry with a notebook; frames shown at t = −1 s, t = 0, and t = +1 s.)
Figure 2: Illustration of videos from our database. Four categories of videos are shown, depicting three different people. Each image in each strip shows the same frame from four different authentications of the same person (therefore, frames from four different clips are shown in the same strip). The topmost strip shows the frames captured one second before keycard authentication, the middle one shows the frames captured at the moment when the person's keycard was closest to the keycard reader, and the third strip shows the frames captured one second after that. It can be observed that the behavior of the same person is quite consistent across different authentication attempts.
appeared to provide the best results and was therefore used for the optical flow calculation. Before calculating optical flow, each image was downsampled by a factor of 8 to speed up the algorithm. After that, pixel brightness data was discarded, and only the optical flow vectors were used in subsequent stages of the algorithm. Due to downsampling, the optical flow fields had dimensions of 60×80 pixels. To smooth out temporal spikes in the optical flow, median filtering along the temporal axis was applied to each vector of the optical flow field, with a kernel width of 3 frames. Figure 4 shows actual optical flow vectors obtained from one of the videos in our database. After temporal smoothing, each optical flow field was scaled by a constant value (0.06×8; the factor of 8 compensates for the subsampling of the input images) and carved into six regions according to the partitioning scheme shown in Figure 3. Optical flow scaling is introduced instead of adjusting the histogram amplitude boundaries: with scaling, the amplitude boundaries can stay in a user-friendly range (0..1). For each segment, a two-dimensional histogram of optical flow was calculated, with one dimension being the optical flow amplitude and the other the optical flow direction. The bin edges were set to (0.33, 0.66, 1) on the amplitude axis and (0, 90, 180, 270 degrees) on the direction axis. After the histogram is calculated, the position of the maximum bin indicates the dominant motion in each of the six regions of the observed scene. HOFs bear a certain resemblance to Histograms of Oriented Gradients (HOGs, [9]), as the latter try to establish the dominant direction of edges in an image. In the case of HOFs, the lowest amplitude bin collects low-amplitude motion (too slow to be of any importance) and noise. The bins with the lowest amplitude are discarded immediately after histogram calculation. Therefore, the overall number of samples in the remaining part of the histogram may vary significantly (and may even be zero). The two remaining bins along the amplitude axis collect slow and fast human motions, respectively. The four direction bins collect motion vectors pointing in each of the four directions. Figure 5 shows the 8-bin histograms of optical flow for each of the six image regions. The more intense the color of a bin, the more samples have been collected in that bin. Dominant motions due to the movement of the arms towards the right side of the image are clearly visible. In some regions (C, D and E) the HOF detected slow motion; in others (F) it registered faster motion.

Figure 3: Partitioning of the scene into six square regions, denoted by letters A to F.

Figure 4: Optical flow vectors for one frame from our video database. Note: the actual flow was denser; not all vectors are shown here, for clarity.
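The partitioning and histogram steps described above can be sketched in a few lines. The following is an illustrative sketch, not the authors' implementation: the row-major 3×2 region layout, the center-crop strategy, the wrap-around direction edge at 360 degrees, and all function names are our assumptions.

```python
import numpy as np

def partition_flow(flow, rows=3, cols=2):
    """Split an optical flow field of shape (H, W, 2) into rows*cols square
    regions (A..F in row-major order for the layout of Figure 3). The field
    is center-cropped so that every region is square."""
    h, w = flow.shape[:2]
    s = min(h // rows, w // cols)                  # side of one square region
    top, left = (h - rows * s) // 2, (w - cols * s) // 2
    crop = flow[top:top + rows * s, left:left + cols * s]
    return [crop[r * s:(r + 1) * s, c * s:(c + 1) * s]
            for r in range(rows) for c in range(cols)]

def hof(region, amp_edges=(0.0, 0.33, 0.66, 1.0)):
    """Two-dimensional histogram of optical flow (amplitude x direction) for
    one region of already-scaled flow vectors. The lowest-amplitude row
    (negligible motion and noise) is discarded, leaving 2x4 bins."""
    u, v = region[..., 0].ravel(), region[..., 1].ravel()
    amp = np.clip(np.hypot(u, v), 0.0, 1.0)
    ang = np.degrees(np.arctan2(v, u)) % 360.0     # direction in [0, 360)
    hist, _, _ = np.histogram2d(amp, ang,
                                bins=[list(amp_edges), [0, 90, 180, 270, 360]])
    return hist[1:]                                # drop lowest-amplitude bins

def dominant_motion(hist):
    """(amplitude_bin, direction_bin) index of the maximum HOF bin."""
    i, j = np.unravel_index(np.argmax(hist), hist.shape)
    return int(i), int(j)

# Example: a synthetic 80x60 flow field (a 640x480 frame after 8x
# subsampling) with uniform rightward motion of amplitude 0.5.
field = np.zeros((80, 60, 2))
field[..., 0] = 0.5
regions = partition_flow(field)
print(len(regions))                       # 6
print(dominant_motion(hof(regions[0])))   # slow, rightward: (0, 0)
```

With an 80×60 field, each square region ends up 26×26 vectors, so a little of the border is cropped away, mirroring the cropping mentioned in Section 4.1.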
4.3
Symbol generation
By calculating HOFs across the whole sequence and recording the position of the maximum in each HOF, a person's motion can be compactly described. To retain the temporal alignment of the videos, each sequence of maximum histogram positions was split at the point of alignment (the moment when the person's keycard came closest to the keycard reader, denoted as frame 0). Then, the sequence of maximum positions for the negative frames (i.e., those before the person presented the keycard) is rewritten in reverse order. After this operation, there are two sequences of maximum positions for each HOF from each region of the scene. Such an arrangement guarantees that any comparison between different persons starts from the point where the sequences are aligned, thus minimizing the initial synchronization penalty. This way, we have 12 sequences for a single
Parameter                 Value                              Alternative(s)
Image dimensions          480×640                            —
Partitioning              2×3                                3×4 (more detail)
Image subsampling         8×                                 4×, 2×, none (affects speed)
Optical flow (OF)         Dense robust flow [6]              Lucas-Kanade method [10]
OF median filter kernel   3 frames                           5 frames (less temporal detail)
OF amplitude scaling      0.06×image subsampling factor      —
Amplitude bin edges       (0, 0.33, 0.66, 1)                 (0, 0.25, 0.5, 0.75) — more detail
Direction bin edges       (0, 90, 180, 270)                  (45, 135, 225, 315)
Symbol sequence           Separate (per each segment)        Common (sensitive to temporal coordination)
Distance measure          Normalized Levenshtein             Other string distance measures

Table 1: Summary of algorithm parameters, their values and possible alternatives.
video, two from each HOF. However, such a description still contains the temporal dimension, and variability in the duration of each phase of motion would have an undesired influence on person identification. In our experience, certain outside factors can influence the length of the intervals between consecutive motions, for example the (in)ability of the fingerprint reader to recognize a fingerprint on the first try under less-than-ideal conditions (finger misalignment, sweat). In such a case, the person needs to lift the finger and try to reposition it, which takes some time. Therefore, the representation of motion is further compacted by observing only the changes in the position of the HOF maximum for each segment. Twelve sequences of symbols are generated, each symbol encoding the HOF segment that originated it (see Figure 3) and the amplitude and direction bin after each change in the maximum position. With two useful bins on the amplitude axis and four bins on the direction axis, each HOF can generate up to eight different symbols.
4.4
Distance measure
To compare such a compact motion description to the records in the database, a distance measure is needed. The Levenshtein distance [7] is well suited to the comparison of symbol sequences. However, the plain Levenshtein distance provides only the count of element operations on a string, making it unusable as a relative measure of similarity. To alleviate this, we normalized the calculated Levenshtein distances as follows:
• Calculate the Levenshtein distance between the 12 pairs of symbol sequences.
• Determine the length of the longer sequence in each of the 12 pairs.
• Sum up the 12 individual distances and divide the sum by the sum of the sequence lengths obtained in the second step.
The Levenshtein distance between two strings that have no common elements always equals the length of the longer string. Therefore, after the described normalization, we obtain a distance of 0 when two descriptions are 100% similar, and a distance of 1 when two descriptions (across all 12 symbol sequences) do not share any common symbol. This way, when a compact description of motion is obtained from HOFs, one can find the most similar description in the database using the above algorithm, since normalized Levenshtein distances can be directly compared to each other.
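The normalization above can be written down directly. The following is a minimal sketch (the function names are ours, not from the paper), with symbol sequences represented as Python strings or lists:

```python
def levenshtein(a, b):
    """Plain Levenshtein (edit) distance between two symbol sequences [7],
    computed with the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def normalized_distance(seqs_a, seqs_b):
    """Normalized Levenshtein distance over the 12 per-region symbol
    sequences of two recordings: the sum of the pairwise distances divided
    by the sum of the longer sequence lengths, giving a value in [0, 1]."""
    num = sum(levenshtein(a, b) for a, b in zip(seqs_a, seqs_b))
    den = sum(max(len(a), len(b)) for a, b in zip(seqs_a, seqs_b))
    return num / den if den else 0.0

print(normalized_distance(["ABCA"], ["ABDA"]))  # one substitution: 0.25
print(normalized_distance(["ABC"], ["XYZW"]))   # nothing in common: 1.0
```

As the second example shows, two descriptions with no common symbols reach the maximum distance of 1, matching the property of the Levenshtein distance noted above.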
5
Figure 5: Histograms of optical flow (HOFs) for all six image regions, calculated for one frame from our video database. Dominant motion in four regions is clearly visible.
Experiments and results
To test the performance of the presented approach, we performed all-against-all matching of the clips in the database and calculated the confusion matrix. The result of the search for the closest match was deemed correct if it matched another clip of the same person performing the same activity. The obtained confusion matrix is shown in Table 2.
Person/activity   1/plain   1/notebook   2/plain   3/plain   3/bags   4/plain   5/plain
1/plain              13         0           1         1         0        0         0
1/notebook            2         1           0         1         0        0         0
2/plain               1         0          30         0         0        0         1
3/plain               1         1           1         7         3        0         0
3/bags                0         0           0         1         8        3         0
4/plain               0         0           0         0         0        9         0
5/plain               0         0           2         0         1        1        23

Table 2: Confusion matrix for all clips in the database. The word after the slash (/) denotes the activity: "plain" denotes the usual mode of authentication, without carrying any objects; "notebook" means that the person was carrying a laptop computer; and "bags" means that the person was carrying extra luggage. Numbers denote the number of matches between each of the clips in the categories in the first column and the categories in the first row.
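The all-against-all evaluation can be sketched as a leave-one-out nearest-neighbour search. This is our illustrative reconstruction, with `distance` standing in for the normalized Levenshtein measure of Section 4.4 and all names being our assumptions:

```python
def evaluate(descriptions, labels, distance):
    """Leave-one-out matching: each clip is matched against every other
    clip and the label of its nearest neighbour is recorded. Returns the
    (true label, matched label) pairs and the recognition rate, i.e. the
    fraction of clips whose nearest neighbour shares both person and
    activity."""
    pairs = []
    for i, d in enumerate(descriptions):
        j = min((k for k in range(len(descriptions)) if k != i),
                key=lambda k: distance(d, descriptions[k]))
        pairs.append((labels[i], labels[j]))
    rate = sum(t == m for t, m in pairs) / len(pairs)
    return pairs, rate

# Toy usage with scalar "descriptions" and absolute difference as distance:
pairs, rate = evaluate([0.0, 0.1, 5.0, 5.2],
                       ["1/plain", "1/plain", "2/plain", "2/plain"],
                       lambda a, b: abs(a - b))
print(rate)  # 1.0: every clip's nearest neighbour carries the same label
```

Tallying the `pairs` per category would reproduce a confusion matrix of the kind shown in Table 2.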
It can be seen that the confusion matrix contains the highest values along its diagonal, indicating successful categorization for most of the samples. The overall recognition rate is 82%. However, many of the falsely categorized samples belong to the categories "1/notebook" and "3/bags". It is interesting that in several of those cases the algorithm failed to determine the correct activity (those are clips with "unusual" activity), but correctly identified the person in the clip. If we lower the standard to require only correct recognition of the person, and not of both the person and the activity, the recognition rate climbs to 87%.
6
Discussion and future work
We presented Histograms of Optical Flow (HOFs), which were used to compactly describe human motion from video sequences. As our second contribution, we have shown that such descriptions can be used for person identification under certain circumstances. The final recognition rates of 82% and 87% are comparable to some published results for identification based on gait recognition. This approach differs from the usual biometric features in one important aspect: human motion can be learned, and therefore there is a possibility of an impostor perfectly mimicking another person's motion and thus succeeding in tricking the system. However, we have observed that people do develop characteristic behavior when performing such everyday tasks as authenticating for entry to a secured area, that such behavior is stable over certain periods of time (the recordings span several months), and that such behavior differs between persons. In all plausible scenarios, the described methods could not be used alone, but rather as an additional soft sensor, providing hints to the security system. In other words, such a soft sensor in a multimodal biometric setting could help vote against an imperfect impostor, should he fail to perfectly mimic the person's motion, in addition to scoring poorly on some other biometric tests. The results show that Histograms of Optical Flow (HOFs) perform well on the described problem. In the future, the simple normalized Levenshtein distance should be substituted with more advanced methods, such as Hidden Markov models. Additionally, methods for automatically determining regions of interest should be developed to reduce the number of spurious symbols generated by noise in the optical flow in areas of the image where the person is not present.
Acknowledgement The research presented in this paper has been supported by the Slovenian Ministry of Defence (MORS), contract CiVaBis M2-0156, and in part by the Slovenian Research Agency (ARRS), contracts P2-0232 and L5-6274.
References
[1] Michael J. Black, Yaser Yacoob, and Shanon X. Ju. Recognizing human motion using parameterized models of optical flow. In Mubarak Shah and Ramesh Jain, editors, Motion-Based Recognition, pages 245–269, Boston, 1997. Kluwer Academic Publishers.
[2] Stefan Carlsson. Recognizing walking people. I. J. Robotic Res., 22(6):359–370, 2003.
[3] Naresh Cuntoor, Amit Kale, and Rama Chellappa. Combining multiple evidences for gait recognition. In Multimedia and Expo, 2003. Proceedings of ICME '03, pages III: 113–16, 2003.
[4] Ying Dai, Yoshitaka Shibata, and Dawei Cai. Understanding facial expressions by the hierarchical recognition of genuine emotions. International Journal of Innovative Computing, Information and Control, 1(2):203–214, 2005.
[5] Jeff P. Foster, Mark S. Nixon, and Adam Prugel-Bennett. Automatic gait recognition using area-based metrics. Pattern Recogn. Lett., 24(14):2489–2497, 2003.
[6] Michael J. Black and P. Anandan. The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields. Computer Vision and Image Understanding, 63(1):75–104, 1996.
[7] V. I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966.
[8] James J. Little and Jeffrey E. Boyd. Recognizing people by their gait: The shape of motion. Videre: Journal of Computer Vision Research, 1(2), 1998.
[9] Wei Lwun Lu and James J. Little. Tracking and recognizing actions at a distance. In Proceedings of the ECCV Workshop on Computer Vision Based Analysis in Sport Environments (CVBASE '06), Graz, Austria, May 2006.
[10] B. D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Imaging Understanding Workshop, pages 121–130, 1981.
[11] Arun Ross and Anil K. Jain. Multimodal biometrics: An overview. In Proc. of 12th European Signal Processing Conference (EUSIPCO), pages 1221–1224, Vienna, Austria, September 2004.
[12] Liang Wang, Tieniu Tan, Huazhong Ning, and Weiming Hu. Silhouette analysis-based gait recognition for human identification. IEEE Trans. Pattern Anal. Mach. Intell., 25(12):1505–1518, 2003.
[13] Yaser Yacoob and Larry Davis. Computing spatio-temporal representations of human faces. In Proceedings CVPR '94, pages 70–75, Seattle, WA, USA, 1994.
[14] Yau Wei Yun. The '123' of biometric technology. Synthesis Journal, 2002. http://www.itsc.org.sg/synthesis/2002/biometric.pdf.
[15] Lihi Zelnik-Manor and Michal Irani. Statistical analysis of dynamic actions. IEEE Trans. Pattern Anal. Mach. Intell., 28(9):1530–1535, 2006.