Keyword: Face Detection and Classification
Real-time Face Detection and Classification for ICCTV
Brian C. Lovell
Security and Surveillance Research Group, School of ITEE, The University of Queensland, QLD 4072, Australia
NICTA, 300 Adelaide St, Brisbane, QLD 4000, Australia
Phone: 61-7-3000 0497 Fax: 61-7-3000 0480 Email:
[email protected]
Shaokang Chen
NICTA, 300 Adelaide St, Brisbane, QLD 4000, Australia
Security and Surveillance Research Group, School of ITEE, The University of Queensland, QLD 4072, Australia
Phone: 61-7-3000 0491 Fax: 61-7-3000 0480 Email:
[email protected]
Ting Shan
NICTA, 300 Adelaide St, Brisbane, QLD 4000, Australia
Security and Surveillance Research Group, School of ITEE, The University of Queensland, QLD 4072, Australia
Phone: 61-7-3000 0516 Fax: 61-7-3000 0480 Email:
[email protected]
Real-time Face Detection and Classification for ICCTV
Brian C. Lovell, Shaokang Chen and Ting Shan
Security and Surveillance Research Group, School of ITEE, The University of Queensland
NICTA, Australia
INTRODUCTION
Data mining is widely used in areas such as finance, marketing, communication, web services, surveillance, and security. The continuing growth in computing hardware and consumer demand has led to rapid growth in multimedia data and in the demand for searching it. With the rapid development of computer vision and communication techniques, real-time multimedia data mining is becoming increasingly prevalent. A motivating application is Closed-Circuit Television (CCTV) surveillance. However, most data mining systems concentrate on text-based data because of the relatively mature techniques available, and these techniques are not suitable for CCTV systems. Currently, CCTV systems rely heavily on human beings to physically monitor screens. An emerging problem is that, with thousands of cameras installed, it is uneconomical and impractical to hire the number of people required for monitoring. An Intelligent CCTV (ICCTV) system is thus required to monitor the cameras automatically or semi-automatically.
BACKGROUND
CCTV Surveillance Systems
In recent years, the use of CCTV for surveillance has grown to an unprecedented level. Especially after the 2001 terrorist attacks in New York and the 2005 London bombings, video surveillance has become part of everyday life. Hundreds of thousands of cameras have been installed in public areas all over the world, in places such as train stations, airports, car parks, Automatic Teller Machines (ATMs), vending machines, and taxis. Based on the number of CCTV units on Putney High Street, it is "guesstimated" (McCahill & Norris 2002) that there are around 500,000 CCTV cameras in the London area alone and 4,000,000 cameras in the UK, which suggests approximately one camera for every 14 people in the UK. However, there is currently no efficient system to fully utilize the capacity of such a huge CCTV network. Most CCTV systems rely on humans to physically monitor screens or review the stored video, which is inefficient and makes proactive surveillance impractical. The fact that police found the activities of the terrorists only in recorded video after the attacks in London and New York shows that existing surveillance systems, which depend on human monitoring, are neither reliable nor timely. The need for fully automatic surveillance is pressing.

Challenges of Automatic Face Recognition on ICCTV Systems
Human tracking and face recognition are key requirements for ICCTV systems. Most research on face recognition focuses on high-quality still face images and achieves quite good results. However, automatic face recognition under CCTV conditions remains an open research problem, and many issues must be resolved before it can approach the capability of the human perception system. Face recognition on CCTV is much more challenging. First, the image quality of CCTV cameras is normally low: their resolution is not as high as that of still cameras, and their noise levels are generally higher. Second, the environment of CCTV cameras is only loosely controlled, which introduces large variations in illumination and in the viewing angle of faces. Third, there is generally a strict timing requirement for CCTV surveillance: such a system should perform in near real time, detecting faces, normalizing the face images, and recognizing them.
MAIN FOCUS
Face Detection
Face detection is a necessary first step in any face processing system, and its performance can severely affect the overall recognition performance. Three main approaches have been proposed for face detection: feature-based, image-based, and template matching. Feature-based approaches attempt to utilize a priori knowledge of human face characteristics and detect representative features such as edges, texture, color, or motion.
Edge features have been applied to face detection from the beginning (Colmenarez & Huang 1996), and several variations have been developed (Fan, Yau, Elmagarmid & Aref 2001; Froba & Kublbeck 2002; Suzuki & Shibata 2004). Edge detection is a necessary first step for any edge representation; two commonly used edge operators are the Sobel operator and the Marr-Hildreth operator. Edge features can be computed very quickly but are not robust for face detection in complex environments, as illustrated in the sketch below. Others have proposed texture-based approaches that detect local facial features such as the pupils, lips, and eyebrows, based on the observation that they are normally darker than the regions around them (Huang & Mariani 2000; Hao & Wang 2002).
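As a concrete illustration of the edge-feature step, the following is a minimal sketch of Sobel edge extraction using OpenCV and NumPy. The file name and the gradient-magnitude threshold are illustrative assumptions, not values from this chapter.

```python
import cv2
import numpy as np

# Load a frame in grayscale; "frame.jpg" is a placeholder path.
gray = cv2.imread("frame.jpg", cv2.IMREAD_GRAYSCALE)

# Sobel gradients in x and y; a 3x3 kernel is the classic choice.
gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)

# Gradient magnitude, then a hand-picked threshold to keep strong edges.
magnitude = cv2.magnitude(gx, gy)
edges = (magnitude > 100).astype(np.uint8) * 255  # threshold is illustrative

cv2.imwrite("edges.png", edges)
```

This speed is why edge features suit real-time use; the weakness noted above is that a fixed threshold responds to any strong contour, face or not, in cluttered scenes.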
Color-based face detection derives from the fact that the skin colors of different humans, even of different races, cluster very closely. Several color models are commonly used, including RGB (Satoh, Nakamura & Kanade 1999), normalized RGB (Sun, Huang & Wu 1998), HSI (Lee, Kim & Park 1996), YIQ (Wei & Sethi 1999), YES (Saber & Tekalp 1996), and YUV (Marques & Vilaplana 2000). Among these, HSI has been shown to be very suitable when there is a large variation in feature colors in facial areas such as the eyes, eyebrows, and lips. Motion information is useful for detecting faces or heads when video sequences are available (Espinosa-Duro, Faundez-Zanuy & Ortega 2004; Deng, Su, Zhou & Fu 2005); normally, frame-difference analysis or moving-contour estimation is applied for face region segmentation. Recently, researchers have tended to focus on multiple-feature methods that combine shape analysis, color segmentation, and motion information to locate or detect faces (Qian & Li 2000; Widjojo & Yow 2002); a simple sketch combining the color and motion cues follows.
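To make the color and motion cues concrete, here is a minimal sketch, assuming OpenCV: it thresholds skin-like pixels in HSV (a close relative of the HSI model mentioned above) and combines the result with a frame-difference motion mask. The threshold values and the camera source are illustrative assumptions, not parameters from this chapter.

```python
import cv2
import numpy as np

cap = cv2.VideoCapture(0)          # any video source; 0 = default camera
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

# Rough skin-tone bounds in HSV; real systems tune these per camera.
SKIN_LO = np.array([0, 40, 60], dtype=np.uint8)
SKIN_HI = np.array([25, 180, 255], dtype=np.uint8)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Color cue: pixels whose hue/saturation fall in the skin range.
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    skin = cv2.inRange(hsv, SKIN_LO, SKIN_HI)

    # Motion cue: absolute frame difference against the previous frame.
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    motion = cv2.absdiff(gray, prev_gray)
    _, motion = cv2.threshold(motion, 20, 255, cv2.THRESH_BINARY)
    prev_gray = gray

    # Candidate face regions are both skin-colored and moving.
    candidates = cv2.bitwise_and(skin, motion)
    cv2.imshow("candidates", candidates)
    if cv2.waitKey(1) == 27:       # Esc to quit
        break

cap.release()
cv2.destroyAllWindows()
```

Combining the two masks is one simple instance of the multiple-feature strategy cited above: each cue alone produces many false positives, but their intersection is a far smaller candidate set.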
The template-matching approach can be further divided into two classes: feature searching and face models. Feature-searching techniques first detect prominent facial features such as the eyes, nose, and mouth, and then use knowledge of face geometry to verify the existence of a face by searching for less prominent features (Jeng, Liao, Liu & Chern 1996). Deformable templates are generally used as face models for detection. Yuille et al. (1989) extended the snake technique to describe features such as the eyes and mouth with parameterized templates; the snake energy comprises a combination of valley, edge, image-brightness, peak, and internal energy terms. In Cootes and Taylor's work (1996), a point distribution model is described by a set of labeled points, and Principal Component Analysis (PCA) is used to define a deformable model, as in the sketch below.
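As a rough sketch of the point distribution model idea (a generic PCA shape model, not the exact formulation of Cootes and Taylor), the landmark coordinates of aligned training faces are stacked into vectors, and PCA yields a mean shape plus the main modes of deformation:

```python
import numpy as np

def build_point_distribution_model(shapes, n_modes=5):
    """shapes: (n_samples, n_landmarks, 2) array of aligned landmark sets."""
    X = shapes.reshape(len(shapes), -1)        # flatten each shape to a vector
    mean = X.mean(axis=0)
    # PCA via SVD of the centered data matrix.
    U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    modes = Vt[:n_modes]                       # principal deformation modes
    variances = (S[:n_modes] ** 2) / (len(shapes) - 1)
    return mean, modes, variances

def synthesize_shape(mean, modes, b):
    """Generate a shape from model parameters b: x = mean + modes^T b."""
    return (mean + b @ modes).reshape(-1, 2)

# Illustrative usage with random stand-in data; real input would be
# hand-labeled landmarks (eye corners, nose tip, mouth outline, ...).
rng = np.random.default_rng(0)
shapes = rng.normal(size=(50, 20, 2))
mean, modes, variances = build_point_distribution_model(shapes)
new_shape = synthesize_shape(mean, modes, b=np.zeros(5))
```

Constraining each parameter in b to a few standard deviations of its mode keeps every synthesized shape face-like, which is what makes the model usable for detection by fitting.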
Figure 1. Various face detection techniques.

Image-based approaches treat face detection as a two-class pattern recognition problem and avoid using a priori face knowledge. They use positive and negative samples to train a face/non-face classifier. Various pattern classification methods have been used, including Eigenfaces (Wong, Lam, Siu & Tse 2001), neural networks (Tivive & Bouzerdoum 2004), Support Vector Machines (Shih & Liu 2005), and AdaBoost (Hayashi & Hasegawa 2006).

In summary, there are many varieties of face detection methods, and the choice of a suitable method is heavily application-dependent. Figure 1 shows various face detection techniques and their categories. Generally speaking, feature-based methods are often used in real-time systems when color, motion, or texture information is available. Template-matching and image-based approaches can attain better detection performance than feature-based methods, but most of their algorithms are computationally expensive and difficult to apply in a real-time system.

Pose Invariant Face Recognition
Pose invariant face recognition can be classified into two categories: 2D-based approaches and 3D-based approaches. Although 3D face models can accurately describe the appearance of a human face under pose changes and can attain good recognition results on face images with pose variation, several disadvantages limit their application to the CCTV scenario (Bowyer, Chang & Flynn 2004). First, to construct a 3D face model, 3D scanners have to be installed to replace the existing cameras in the CCTV system, which is very expensive. Second, to acquire 3D data, the depth of field of the scanners has to be well controlled, which limits the range of data acquisition. Third, using 3D scanners to obtain 3D data is time-consuming and cannot be done in real time. Most researchers thus focus on 2D approaches for dealing with pose variation.
Figure 2. 2D based pose invariant face recognition techniques.

The 2D-based approaches can be categorized into three classes: pose invariant features, multiple image, and single image, as shown in Figure 2. Wiskott et al. (Wiskott, Fellous, Kuiger & Malsburg 1997) proposed Elastic Bunch Graph Matching for face recognition, which applies Gabor filters to extract pose invariant features; a small sketch of such a filter bank follows below. Beymer (1994, 1996) used multiple model views to cover different poses from the viewing sphere. Sankaran and Asari (2004) proposed a multi-view approach to Modular PCA (Pentland, Moghaddam & Starner 1994) by incorporating multiple views of the subjects as separate sets of training data. Cootes, Edwards and Taylor (2001) proposed "View-based Active Appearance Models," based on the idea that a small number of 2D statistical models suffice to capture the shape and appearance of a face from any viewpoint. Sanderson et al. (2006, 2007) addressed the pose mismatch problem by extending each frontal face model with artificially synthesized models for non-frontal views.
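For a flavor of the Gabor features used in Elastic Bunch Graph Matching, here is a minimal sketch, assuming OpenCV: a small bank of Gabor kernels at several orientations is convolved with a face image, and the filter responses at a landmark point form a local feature vector. The kernel parameters, file name, and landmark position are illustrative assumptions, not the settings of Wiskott et al.

```python
import cv2
import numpy as np

gray = cv2.imread("face.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float32)

# A small bank of Gabor kernels at 8 orientations (one scale, for brevity).
kernels = [
    cv2.getGaborKernel(ksize=(21, 21), sigma=4.0, theta=theta,
                       lambd=10.0, gamma=0.5)
    for theta in np.arange(0, np.pi, np.pi / 8)
]

# Convolve the image with each kernel.
responses = [cv2.filter2D(gray, cv2.CV_32F, k) for k in kernels]

# The response magnitudes at a landmark (say, an eye corner) form a
# "jet": a local feature vector that is relatively tolerant to pose.
y, x = 60, 40  # illustrative landmark position
jet = np.array([abs(r[y, x]) for r in responses])
print(jet)
```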
Towards Real-time Trials on ICCTV
The authors developed Adaptive Principal Component Analysis (APCA) to improve the robustness of PCA to nuisance factors such as lighting and expression (Chen & Lovell 2003, 2004). They extended APCA to deal with face recognition under varying pose (Shan, Lovell & Chen 2006) by applying an Active Appearance Model to estimate pose and synthesize a frontal-view image. However, as with most face recognition algorithms, the experiments were performed on popular databases that contain only still camera images of relatively high resolution; very few tests have been done on video databases (Aggarwal 2004; Gorodnichy 2005). We recently constructed a near real-time face detection and classification system and tested it on an operational surveillance system installed in a railway station. The system is composed of four modules: the communication module exchanges data with the surveillance system; the face detection module uses AdaBoost-based cascade face detectors to detect multiple faces in an image; the face normalization module detects facial features such as the eyes and nose to align and normalize face images; and the face classification module uses the APCA method to compare face images with gallery images in the face database. The system is implemented in C++ and runs at 7 to 10 frames per second on an Intel dual-core PC. Figure 3 illustrates the system structure; a simplified sketch of the pipeline is given after the figures below.

Our system works well in a controlled environment: if the face images are manually aligned and of size 50 by 50 pixels or larger, it achieves recognition rates of up to 95%. In fully automatic tests in complex, uncontrolled environments, however, the recognition rate drops significantly, due to the combined effects of variations in lighting, pose, expression, registration error, and image resolution. Figure 4 shows real-time recognition results on two frames of a video sequence obtained from the railway station.
Figure 3. Real-time face detection and classification system structure.
Figure 4. Real-life trial of face recognition on ICCTV. The green boxes show the detected faces and the labels indicate the identity of the person, from Shan et al. (2007).
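The following is a minimal Python sketch of such a detect-normalize-classify pipeline, with two caveats: it uses OpenCV's stock Haar cascade (an AdaBoost-based cascade detector in the spirit of the one described above, not the authors' own), and a plain PCA nearest-neighbor classifier stands in for APCA, whose implementation is not given in this chapter. File names and sizes are illustrative.

```python
import cv2
import numpy as np

FACE_SIZE = (50, 50)  # the trial reports good results at 50x50 and above

# AdaBoost-based cascade detector shipped with OpenCV (Viola-Jones style).
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_and_normalize(frame):
    """Detect faces; return boxes and cropped, equalized grayscale vectors."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    boxes = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    faces = []
    for (x, y, w, h) in boxes:
        patch = cv2.resize(gray[y:y + h, x:x + w], FACE_SIZE)
        faces.append(cv2.equalizeHist(patch).flatten().astype(np.float32))
    return boxes, faces

def train_pca(gallery_vectors, n_components=20):
    """Learn a PCA subspace from enrolled gallery faces (stand-in for APCA)."""
    X = np.stack(gallery_vectors)
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    basis = Vt[:n_components]
    return mean, basis, (X - mean) @ basis.T  # projected gallery

def classify(face, mean, basis, gallery_proj, labels):
    """Nearest neighbor in the PCA subspace."""
    p = (face - mean) @ basis.T
    return labels[int(np.argmin(np.linalg.norm(gallery_proj - p, axis=1)))]
```

In the real system, the normalization step would use detected eye and nose positions for rotation and scale alignment rather than a plain resize, and APCA would reweight the PCA axes against lighting and expression; this sketch only fixes the overall module structure.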
FUTURE TRENDS
Modern surveillance systems produce enormous video archives, and their rate of growth is accelerating as video resolution and the number of cameras increase due to heightened security concerns. At present these archives are often erased after a few weeks, partly because of the cost of storage, but also because the archives have diminished value when there is no automatic way to search them for events of interest. Face recognition provides one means to search these data for specific people. Identification reliability would be enhanced if it could be combined with, say, human gait recognition, clothing appearance models, and height information. Moreover, human activity recognition could be used to detect acts of violence and suspicious patterns of behavior. Fully integrated automatic surveillance systems are certainly the way of the future.
CONCLUSION
With increasing security demands, multimedia data mining techniques for CCTV, such as face detection and recognition, will deeply affect our daily lives in the near future. However, current surveillance systems, which rely heavily on human operators, are neither practical, scalable, nor economical, and this creates much interest in ICCTV systems for security applications. The challenge is that existing computer vision and pattern recognition algorithms are neither reliable nor fast enough for large-database and real-time applications. But the performance and robustness of such systems will increase significantly as more attention is devoted to these problems by researchers.
REFERENCES
Aggarwal, G., Roy-Chowdhury, A. K., & Chellappa, R. (2004). "A System Identification Approach for Video-based Face Recognition." Proceedings of the International Conference on Pattern Recognition, Cambridge, August 23-26.
Beymer, D. J. (1994). "Face recognition under varying pose." Proceedings of the International Conference on Computer Vision and Pattern Recognition, 756-761.
Beymer, D., & Poggio, T. (1995). "Face Recognition from One Example View." Proceedings of the International Conference on Computer Vision, 500-507.
Beymer, D. (1996). "Feature correspondence by interleaving shape and texture computations." Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 921-928.
Bowyer, K. W., Chang, K., & Flynn, P. (2004). "A survey of approaches to three-dimensional face recognition." Proceedings of the International Conference on Pattern Recognition, 1, 358-361.
Chen, S., & Lovell, B. C. (2003). "Face Recognition with One Sample Image per Class." Proceedings of Australian and New Zealand Intelligent Information Systems, 83-88.
Chen, S., & Lovell, B. C. (2004). "Illumination and Expression Invariant Face Recognition with One Sample Image." Proceedings of the International Conference on Pattern Recognition, 1, 300-303.
Colmenarez, A. J., & Huang, T. S. (1996). "Maximum likelihood face detection." Proceedings of the International Conference on Automatic Face and Gesture Recognition, 307-311.
Cootes, T. F., Edwards, G. J., & Taylor, C. J. (2001). "Active appearance models." IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6), 681-685.
Cootes, T. F., & Taylor, C. J. (1996). "Locating faces using statistical feature detectors." Proceedings of the International Conference on Automatic Face and Gesture Recognition, 204-209.
Deng, Y-F., Su, G-D., Zhou, J., & Fu, B. (2005). "Fast and robust face detection in video." Proceedings of the International Conference on Machine Learning and Cybernetics, 7, 4577-4582.
Espinosa-Duro, V., Faundez-Zanuy, M., & Ortega, J. A. (2004). "Face detection from a video camera image sequence." Proceedings of the International Carnahan Conference on Security Technology, 318-320.
Fan, J., Yau, D. K. Y., Elmagarmid, A. K., & Aref, W. G. (2001). "Automatic image segmentation by integrating color-edge extraction and seeded region growing." IEEE Transactions on Image Processing, 10(10), 1454-1466.
Froba, B., & Kublbeck, C. (2002). "Robust face detection at video frame rate based on edge orientation features." Proceedings of the International Conference on Automatic Face and Gesture Recognition, 327-332.
Gorodnichy, D. (2005). "Video-Based Framework for Face Recognition in Video." Proceedings of the Canadian Conference on Computer and Robot Vision, 330-338.
Govindaraju, V. (1996). "Locating human faces in photographs." International Journal of Computer Vision, 19(2), 129-146.
Hao, W., & Wang, K. (2002). "Facial feature extraction and image-based face drawing." Proceedings of the International Conference on Signal Processing, 699-702.
Hayashi, S., & Hasegawa, O. (2006). "A Detection Technique for Degraded Face Images." Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1506-1512.
Huang, W., & Mariani, R. (2000). "Face detection and precise eyes location." Proceedings of the International Conference on Pattern Recognition, 4, 722-727.
Jeng, S-H., Liao, H-Y. M., Liu, Y-T., & Chern, M-Y. (1996). "Extraction approach for facial feature detection using geometrical face model." Proceedings of the International Conference on Pattern Recognition, 426-430.
Lee, C. H., Kim, J. S., & Park, K. H. (1996). "Automatic human face location in a complex background using motion and color information." Pattern Recognition, 29(11), 1877-1889.
Marques, F., & Vilaplana, V. (2000). "A morphological approach for segmentation and tracking of human faces." Proceedings of the International Conference on Pattern Recognition, 1064-1067.
McCahill, M., & Norris, C. (2002). "Urbaneye: CCTV in London." Centre for Criminology and Criminal Justice, University of Hull, UK.
Pentland, A., Moghaddam, B., & Starner, T. (1994). "View-based and modular eigenspaces for face recognition." Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 84-91.
Qian, G., & Li, S. Z. (2000). "Combining feature optimization into neural network based face detection." Proceedings of the International Conference on Pattern Recognition, 2, 814-817.
Rowley, H. A., Baluja, S., & Kanade, T. (1998). "Neural network-based face detection." IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1), 23-38.
Saber, E., & Tekalp, A. M. (1996). "Face detection and facial feature extraction using color, shape and symmetry-based cost functions." Proceedings of the International Conference on Pattern Recognition, 3, 654-658.
Sanderson, C., Bengio, S., & Gao, Y. (2006). "On Transforming Statistical Models for Non-Frontal Face Verification." Pattern Recognition, 39(2), 288-302.
Sanderson, C., Shan, T., & Lovell, B. C. (2007). "Towards Pose-Invariant 2D Face Classification for Surveillance." International Workshop on Analysis and Modeling of Faces and Gestures, 276-289.
Sankaran, P., & Asari, V. (2004). "A multi-view approach on modular PCA for illumination and pose invariant face recognition." Proceedings of the Applied Imagery Pattern Recognition Workshop, 165-170.
Satoh, S., Nakamura, Y., & Kanade, T. (1999). "Name-It: naming and detecting faces in news videos." IEEE Multimedia, 6(1), 22-35.
Shan, T., Lovell, B. C., & Chen, S. (2006). "Face Recognition Robust to Head Pose from One Sample Image." Proceedings of the International Conference on Pattern Recognition, 1, 515-518.
Shan, T., Chen, S., Sanderson, C., & Lovell, B. C. (2007). "Towards Robust Face Recognition for Intelligent-CCTV Based Surveillance Using One Gallery Image." IEEE International Conference on Advanced Video and Signal Based Surveillance.
Shih, P., & Liu, C. (2005). "Face detection using distribution-based distance and support vector machine." Proceedings of the International Conference on Computational Intelligence and Multimedia Applications, 327-332.
Sun, Q. B., Huang, W. M., & Wu, J. K. (1998). "Face detection based on color and local symmetry information." Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition, 130-135.
Suzuki, Y., & Shibata, T. (2004). "Multiple-clue face detection algorithm using edge-based feature vectors." Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 5, V-737-V-740.
Tivive, F. H. C., & Bouzerdoum, A. (2004). "A face detection system using shunting inhibitory convolutional neural networks." Proceedings of the IEEE International Joint Conference on Neural Networks, 4, 2571-2575.
Viola, P., & Jones, M. (2001). "Rapid object detection using a boosted cascade of simple features." Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1, I-511-I-518.
Wei, G., & Sethi, I. K. (1999). "Face detection for image annotation." Pattern Recognition Letters, 20(11-13), 1313-1321.
Widjojo, W., & Yow, K. C. (2002). "A color and feature-based approach to human face detection." Proceedings of the International Conference on Control, Automation, Robotics and Vision, 1, 508-513.
Wiskott, L., Fellous, J. M., Kuiger, N., & von der Malsburg, C. (1997). "Face recognition by elastic bunch graph matching." IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7), 775-779.
Wong, K. W., Lam, K. M., Siu, W. C., & Tse, K. M. (2001). "Face segmentation and facial feature tracking for videophone applications." Proceedings of the International Symposium on Intelligent Multimedia, Video and Speech Processing, 518-521.
Yang, M. H., Kriegman, D. J., & Ahuja, N. (2002). "Detecting faces in images: a survey." IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(1), 34-58.
Yuille, A. L., Cohen, D. S., & Hallinan, P. W. (1989). "Feature extraction from faces using deformable templates." Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 104-109.
TERMS AND THEIR DEFINITIONS
Active Appearance Model: An Active Appearance Model (AAM) represents the shape and appearance variations of objects. A statistical shape model and an appearance model are learned from landmarks in the sample data and then correlated by applying Principal Component Analysis to the training data.
AdaBoost: AdaBoost is short for Adaptive Boosting. It is a boosting learning algorithm that combines many weak classifiers into a single powerful classifier, adaptively increasing the weights of misclassified training samples so that subsequent weak classifiers focus on them. It is mainly used to fuse many binary classifiers into a strong classifier.
Closed-Circuit Television: Closed-Circuit Television (CCTV) is a system that transmits signals from video cameras to specific end points, such as monitors and servers. Signals are transmitted securely and privately over the network. CCTV is mainly used for surveillance applications.
Face Detection: Determining whether or not there are any faces in an image and returning the location of each face.
Pose: In face recognition, pose refers to the orientation of the face in 3D space, including head tilt and rotation.
Intelligent CCTV: Intelligent CCTV (ICCTV) is a CCTV system that implements surveillance functions automatically or semi-automatically. Advanced techniques from computer vision, pattern recognition, data mining, artificial intelligence, and communications are applied to make CCTV smart.
Pose Estimation: Estimating the head pose orientation in 3D space, normally as angles of head tilt and rotation.