Technische Universität München
Lehrstuhl für Mensch-Maschine-Kommunikation

Detection and Tracking of Objects for Behavioral Analysis in Sensor Networks

Dejan Arsić

Complete reprint of the dissertation approved by the Faculty of Electrical Engineering and Information Technology of the Technische Universität München for the attainment of the academic degree of a

Doktor-Ingenieur (Dr.-Ing.)

Chairman:
Examiners of the Dissertation:

.......................................... 1. .......................................... 2. ..........................................

The dissertation was submitted to the Technische Universität München on . . and accepted by the Faculty of Electrical Engineering and Information Technology on . .

Abstract

CCTV systems are omnipresent in daily life and are widely accepted as a tool to provide a high level of security in public and private places. The past has unfortunately shown that such systems are not capable of preventing crimes: the video streams would have to be analyzed on the fly in order to react in time, instead of using the recorded data merely as a forensic tool. Due to the massive amount of data, this is an inherently expensive task for human operators. Fully automated systems are therefore desired, which assist security staff and allow them to take action immediately and stop potential offenders. The aim of this thesis is to provide a system for behavior detection with all required components. The main focus is set on high robustness and low complexity, as all system components should be able to operate in real world setups. In order to implement such a system, a wide range of components is required. The first step includes the detection of static and moving objects, where object representation is conducted in increasing detail. A simple detection of scene changes, without consideration of the object class, is followed by a more detailed analysis. Depending on the scenario, humans, body parts, faces, or even fiducial points have to be detected. Having detected objects of interest in a frame, correspondences have to be established over time, which is a prerequisite for behavioral analysis. While tracking usually relies on the underlying detector, and both the Kalman filter and the condensation algorithm suffer from the same limitations as the detector, a novel feature based tracking algorithm will be presented. Although the presented 2D detection and tracking techniques show an outstanding performance, further robustness can be gained by utilizing the known algorithms in multi camera systems, where homography between views is used to overcome the occlusion problem. It will be demonstrated that the combination of multi-layer homography and 2D tracking approaches is capable of robustly tracking persons and lowers the number of ID changes drastically. The frequent appearance of so called ghost objects will also be addressed, and a possible solution will be presented. Further, it will be shown that the extracted motion patterns and trajectories can be used for behavioral analysis. Current machine learning approaches, for instance Hidden Markov Models or Support Vector Machines, require the availability of data to learn a generalizing model for an activity. The performance of these methods will be presented for facial actions and the observation of seated persons in aircraft. As data is usually sparse and cannot be acquired for every scenario, the analysis of trajectories can also be performed utilizing expert knowledge, which will be demonstrated for the complex task of left luggage detection.


Zusammenfassung

Video surveillance systems are used ever more frequently to maintain security in public and private facilities. At present they serve only as a deterrent and as a means of solving crimes, since only specially trained security personnel can analyze the video data in real time. The detection of potential threats should therefore be automated in order to reduce costs and to relieve the personnel. To this end, systems for the detection and tracking of persons in video sequences have been developed in this work. Since occlusions of relevant regions frequently occur in 2D scenarios, this problem has been successfully addressed by means of homography between multiple camera perspectives. In this way, person-related features can now be extracted and examined for suspicious behavior patterns with promising results using pattern recognition methods.


Acknowledgments

This thesis was written during my work as a researcher at the Institute for Human Machine Communication at the Technische Universität München. A lot of people were involved in making things happen and supported me through the last five years. First of all I want to thank Prof. Gerhard Rigoll for giving me the opportunity to work at his institute and for guiding me through my research activities. Although we did not always see the red line in the beginning, with his support we could make it visible at last. He created this special condition where I could work as a scientist and always had feedback from industry, which has been valuable input. He gave me the freedom to try new things and find my niche. Furthermore I would like to thank my colleagues at the institute for the great time we spent working together. Among these were some who had quite an impact on my research and also became good friends. In particular I would like to thank Dr. Björn Schuller for every single piece of advice he gave me and for every more or less scientific discussion we had, reaching back to the days when I was still a student. The time was "legen... 'and now wait for it' ...dary!". Of course I also have to thank my office mate Benedikt Hörnler for the interesting discussions and the daily DOW and DAX alert watch. Special thanks go to Florian "Brother Flo" Völk, Klaus "Jean Jaques" Laumann, Jakob "Jackl" Putner, Daniel "Xare" Menzel, Stefan "Captain" Kerber and Markus "Franz-Joseph" Fruhmann. Always remember: "Oane hamma oiwei no drunga". I would also like to thank Florian Eyben and Martin Wöllmer for discussing and trying out some unconventional approaches. Last but not least, thanks to Peter Brand, Heiner Hundhammer and Ernst Ertl for the hardware and software support. I also have to mention my excellent students, who helped to implement my ideas with their hard and passionate work: Stefan Wimmer, Martin Ruß, Yang Liyong, Michael Schmitt, Martin Hofmann, Encho Hristov, Anton Gatev, Nicolas Lehment, Manuel Stein, Claudia Tiddia, Christoph Münch, Wang Qi, Michael Wohlmuth, Aleksandar Rangelov, Atanas Lyutskanov, Luis Roalter, Doris Lang and Martin Straubinger. Finally I would like to thank my parents and my brother Oliver for giving me the opportunity to study and for all the sacrifices they made for my education. Without their help through hard times, their faith, and their hard work this would not have been possible at all.

Dejan Arsić


Contents

1. Introduction
   1.1. Motivation
   1.2. Structure of Visual Surveillance Systems
   1.3. Outline

2. Detection of Stationary and Moving Objects
   2.1. Change Detection
      2.1.1. Temporal Difference Images
      2.1.2. Foreground Segmentation
      2.1.3. Removing Cast Shadows and Highlights
      2.1.4. Blob Detection and Preprocessing
   2.2. Supplementing Foreground Segmentation with 3D Range Data
      2.2.1. Foreground Segmentation in Range Data
      2.2.2. Separation of Humans in 3D Data
      2.2.3. Evaluation
   2.3. Detection of Skin Colour Regions
      2.3.1. Physically Motivated Skin Locus Model
      2.3.2. Modeling Skin Color With A Single Gaussian
   2.4. Pedestrian Detection Based on Haar Basis Features
      2.4.1. Over Complete Representation with Haar Features
      2.4.2. Component Based Representation of the Pedestrian Class
      2.4.3. Feature Selection with AdaBoost
      2.4.4. Training Procedure
      2.4.5. Pedestrian Detection Results
   2.5. Face Detection
      2.5.1. Face Detection with Multi Layer Perceptrons
      2.5.2. Face Detection With a Boosted Detector Cascade
      2.5.3. Face Detection Evaluation and Post Processing
   2.6. Facial Feature Extraction by Matching Elastic Bunch Graphs
      2.6.1. Gabor Wavelets and Jets
      2.6.2. Bunch Graphs and Bunch Similarity
      2.6.3. The Matching Procedure
      2.6.4. Evaluation of the Fiducial Point Localization
   2.7. Closure

3. Object Tracking
   3.1. Reliable Tracking of Blobs
   3.2. Kalman Filtering for Visual Tracking
      3.2.1. The Kalman Filter
      3.2.2. Visual Tracking with Kalman Filters
      3.2.3. Evaluation of the Kalman Filter
   3.3. Object Tracking with the Condensation Algorithm
      3.3.1. The Condensation Algorithm
      3.3.2. Introducing Multiple Cues
      3.3.3. Extension to Multiple Hypothesis Tracking
      3.3.4. Evaluation of the Condensation Algorithm
   3.4. Feature Based Tracking
      3.4.1. Tracking of Gabor Jets
      3.4.2. Tracking of SIFT Features
   3.5. Object Tracking with Deformable Feature Graphs
      3.5.1. Feature Graphs
      3.5.2. Tracking of Feature Graphs
      3.5.3. The Dynamic Feature Graph
      3.5.4. Re-Identification of Lost Objects with Deformable Feature Graphs
      3.5.5. Graph Tracking Evaluation
   3.6. Closure

4. Multi Camera Object Detection and Tracking
   4.1. Data Acquisition in Smart Sensor Networks
   4.2. Camera Geometry
      4.2.1. The Pin Hole Camera Model
      4.2.2. Tsai's Calibration Method
   4.3. A Short Review On Homography
      4.3.1. The Homographic Transformation
      4.3.2. The Homography Constraint
      4.3.3. Semi-Automated Computation of H
   4.4. Planar Homography
      4.4.1. Object Localization Using Homography in the Ground Plane
      4.4.2. Aligning Object Fragments in the Ground Plane Applying Heuristics
      4.4.3. Estimating the Height of a Detected Object
   4.5. Multi Layer Homography
      4.5.1. Object Localization Utilizing Multi Layer Homography
   4.6. False Positive Elimination and Handling
      4.6.1. Combining Multiple Layers For False Positive Detection
      4.6.2. Cutting Blobs to Remove Floating Ghost Objects
      4.6.3. Applying Geometrical Constraints
      4.6.4. False Positive Handling
   4.7. Multi Camera Object Tracking
      4.7.1. Combining 2D and 3D Tracking Methods
   4.8. Tracking Evaluation
   4.9. Closure

5. Behavior Detection Applications
   5.1. Facial Expression Recognition
      5.1.1. Feature Extraction
      5.1.2. Recognition of Facial Actions
      5.1.3. Evaluation of Facial Activity Recognition Systems
   5.2. The SAFEE Onboard Threat Detection System
      5.2.1. Low Level Feature Extraction
      5.2.2. Suspicious Behavior Detection
      5.2.3. System Evaluation
      5.2.4. The SAFEE SBDS On-line Demonstrator
      5.2.5. The SAFEE Access Control and Recognition System
   5.3. Activity Monitoring in Meetings
      5.3.1. Recognition of the Relevant View
   5.4. Recognition of Low Level Trajectory Events
      5.4.1. Stationary Object Detection
      5.4.2. Discriminating Between Walking and Running
      5.4.3. Detection of Splits and Mergers
      5.4.4. Detection of Group Movements
   5.5. The PETS 2007 Challenge
      5.5.1. Loitering Person Detection
      5.5.2. Left Luggage Detection
   5.6. Closure

6. Conclusion and Outlook
   6.1. Conclusion
   6.2. Future Developments

A. Databases
   A.1. The SAFEE Facial Activity Corpus
   A.2. The FEEDTUM Corpus
   A.3. The Airplane Behavior Corpus
   A.4. The AMI Database
   A.5. The PETS2007 Multi Camera Database
   A.6. The PROMETHEUS Outdoor Scenario

B. Classification Methods
   B.1. Distance Measures
   B.2. Support Vector Machines
   B.3. Neural Networks
   B.4. Hidden Markov Models

C. Smart Sensors
   C.1. Sensors
      C.1.1. CCD Sensors
      C.1.2. Photonic Mixture Devices
      C.1.3. Infrared Thermography
   C.2. Processing Units
      C.2.1. Mini PCs
      C.2.2. Digital Signal Processor Boards

Acronyms

List of Symbols

Bibliography

Chapter 1. Introduction

1.1. Motivation

Terrorism nowadays seems to be omnipresent in daily life and represents a threat to modern society. An assault's aim is to harm or kill as many people as possible, in order to emphasize the terrorist's view of the world and to demonstrate his or her power. Therefore highly frequented and infrastructurally sensitive targets are usually chosen as crime scenes. These could be train stations, airports or even football stadiums. Hence the authorities' and the public's demand for systems to prevent possible threats has been growing in recent years. This has led to the installation of a huge number of Closed Circuit Television (CCTV) systems in public places, both visible and invisible. In practice video surveillance has unfortunately failed in the prevention of crimes. Most CCTV cameras are unmonitored, and the vast majority of benefits lie either in forensic use or in deterring potential offenders, as these might be easily recognized and detected [Wel08]. This way events can be analyzed after the fact in order to solve a crime. Threat prevention can only be guaranteed by being proactive, identifying known precursors and possible threats, and directing security personnel to investigate and intercept before it is too late [Pep07]. Highly trained human CCTV operators can perform this task with little to no error. However, it is simply not economical to employ as many staff members as there are cameras in service. Alternatively an operator could monitor multiple video streams at the same time, see fig. 1.1. This is of course limited by human performance, as high demands are placed on the operator's concentration to reliably monitor public spaces. It is commonly agreed that only approximately 0.01% of the recorded runtime contains material of interest. Thus it is likely that this small portion will be overlooked by the operator. Therefore it seems desirable to support operators in their work for the sake of a far shorter response time. This might help to recognize potential threats in time, so that forces can be sent directly to the scene. Video analytics applications can optimize the performance of the surveillance infrastructure by actively monitoring all cameras and automatically detecting potential threats. An alarm can be displayed on the operator's display and his attention can stay focused on one screen, where just the camera feed has to be adjusted. In contrast to a multi screen environment, less fatigue should appear, which would lead to a far more attentive workforce.



Figure 1.1.: Operator in front of multiple screens at the Police Service of Northern Ireland [Pol08]

The cost of such a system would be restricted to hardware procurement and maintenance, as simple passive cameras evolve into smart sensors [Gok98]. These are highly integrated devices including a camera and a processing unit to analyze the video material on the fly. For large sensor networks an additional central unit is required to accumulate the individual outputs. Due to the fast growing market, smart sensors are steadily becoming cheaper and are easy to maintain. After taking a smart sensor out of the box and switching it on, it should be fully functional [Kra08]. There is no need for additional hardware and highly trained staff to install such a device, as it is pre-configured and ready to operate as soon as it is plugged in. Recent technological developments have produced a whole range of new sensors that are applicable to surveillance applications and can overcome the general weaknesses of traditional video cameras. These are usually highly dependent on the current illumination of the scene, as they are generally not able to operate in darkness and suffer from inconvenient illumination influences, such as overexposure if a light source suddenly appears, or dark shadows in case highlights are compensated. Therefore traditional cameras are now being supplemented or even replaced by new sensors. Illumination influences can be overcome by applying thermal infrared cameras, which are independent of light and can detect objects that emit energy [Han05]. Furthermore, traditional 2D sensors need to be calibrated for 3D scene recovery and suffer from the well known occlusion problem in crowded scenes. Therefore 3D sensors, for instance based on time of flight, have been introduced to overcome these problems [Rin07]. Although microphones can be used to pick up events or emotions, and even to localize objects [Sch07a, Alg06], audio surveillance will not be discussed in this work. Considering privacy and ethical issues, most people consider wiretapping more critical than video surveillance, as they are already used to cameras and speech is considered more intimate than just an image. While surveillance is usually employed for security related applications, newly developed systems can easily be utilized in various other environments. Popular new research fields have emerged from classical video surveillance and have been devoted to aiding people. So called cognitive systems try to identify human needs and to determine the adequate reaction.



Figure 1.2.: Possible structure of a surveillance system (sensor, object detection, object tracking, object recognition, behavior detection)

One of these is the development of so called smart homes, which are often targeted at elderly people living alone at home. These often need assistance with a variety of ordinary activities in daily life, but do not want to depend on anybody. A smart home could for instance monitor the inhabitant's health status and inform an ambulance in case of emergency. Even simple things, such as forgetting to turn off an iron or the stove while sleeping, could be monitored in order to prevent possible threats. Further, such systems could be utilized in industrial environments to raise productivity, by identifying the current state of work and automatically supplying tools and material, or to prevent accidents at work.

1.2. Structure of Visual Surveillance Systems

This thesis will focus on the implementation of various surveillance applications and a possible algorithmic foundation for each component. Fig. 1.2 illustrates a block diagram representing the individual steps in a surveillance system, which can easily be extended to a network based surveillance system [Ham05]. This structure will also be the basis for this thesis. After the mandatory image acquisition, objects of interest, such as pedestrians, faces or luggage items, have to be detected. In the past various approaches have been investigated, which mainly put focus on a high detection rate and a low false positive rate. For on-line applications a trade-off between computational effort and accuracy has to be made, to be able to process video streams in time. Real time capability therefore is one of the key demands for the detector. Depending on the application scenario, it is required to detect body parts such as hands, faces or legs. Hereby the level of detail can be driven as far as required, so that even eyelids or fingers are modeled independently. Furthermore, discriminative visual features can be extracted for further analysis. Once the localization task has been accomplished, the objects have to be traced over time.


A unique Identity (ID) is assigned to each individual and a spatio-temporal trajectory can be established. An optional object recognition module can be used for further robustness and consistent labeling. These trajectories are stored in a storage module. Subsequently the stored trajectories and other extracted features are analyzed by a behavior detection module. This module usually takes temporal and spatial changes into account to model specific behaviors with a wide variety of approaches [Bux03]. After a detailed behavioral analysis the system should be able to send out an alert to the authorities or security staff.
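Read as a processing chain, the structure of fig. 1.2 amounts to a set of narrow interfaces between detection, tracking and behavior analysis. The following Python sketch only illustrates this interface view under that assumption; the class and function names (Detector, Tracker, BehaviorModel, process_frame) are hypothetical and not part of the thesis.

```python
from dataclasses import dataclass, field
from typing import List, Protocol, Tuple

@dataclass
class Track:
    track_id: int                                                   # consistent ID over time
    boxes: List[Tuple[int, int, int, int]] = field(default_factory=list)  # per-frame bounding boxes

class Detector(Protocol):
    def detect(self, frame) -> List[Tuple[int, int, int, int]]: ...

class Tracker(Protocol):
    def update(self, detections: List[Tuple[int, int, int, int]]) -> List[Track]: ...

class BehaviorModel(Protocol):
    def analyze(self, tracks: List[Track]) -> List[str]: ...

def process_frame(frame, detector: Detector, tracker: Tracker, behavior: BehaviorModel) -> List[str]:
    """One pass through the pipeline of fig. 1.2: detect, track, analyze."""
    detections = detector.detect(frame)       # objects of interest in the current frame
    tracks = tracker.update(detections)       # associate detections with existing IDs
    return behavior.analyze(tracks)           # e.g. alerts such as "left luggage"
```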

1.3. Outline

The previously presented structure of a general visual surveillance system has been chosen as a rough roadmap to guide the reader through this thesis. Far more detailed information will be provided in this work, which is structured as follows:

• Chapter 2: The baseline of every surveillance system is a robust detection of persons in the scene, as their behavior has to be subsequently analyzed. Depending on the application scenario, the representation of a person can be conducted in varying detail. The most primitive method is blob detection, which simply indicates the presence of an object without any discriminative abilities. Further, detectors for the human body can be created using machine learning techniques. In this work detectors for the entire body, body parts and faces have been trained based on labeled training data. In an additional step facial feature points are extracted using elastic bunch graph matching, which is considered a prerequisite for facial action recognition.

• Chapter 3: Having detected objects either in one or in multiple views, it is inevitable to associate consistent IDs with the persons throughout the entire video sequence. This is required by the subsequent behavior interpretation module, as observations gathered from one frame are usually not sufficient. The first step of each tracking algorithm is the initialization, where any convenient detector can be applied. Subsequently, corresponding objects have to be detected in image sequences. Therefore the well known Kalman filter and a blob tracking approach will be presented first. Further, the condensation algorithm will be presented with two major extensions: tracking will be made more reliable by the introduction of multiple cues, and an extension to multiple hypothesis tracking will be presented. While most motion models, either rule based or probabilistic, are based on measurements created by the baseline detector and are therefore rather generalizing than discriminative, feature based tracking systems extract features related to the object itself. Besides the traditional tracking of single features, a novel graph representation of objects based on geometrical relationships between extracted SIFT features will be presented.


This step makes tracking far more reliable, as even a re-recognition can be performed.

• Chapter 4: Traditional 2D surveillance systems usually suffer from occlusions and lack precise person localization, as depth information is not available. In this chapter a novel multi camera object detection and tracking system, based on multi-layer homography, will be presented. After a brief review of common camera geometry and the mathematical foundation of homographic transformations, object localization in multiple overlapping views will be briefly discussed. While previous works have already shown promising results utilizing planar homography, it will be shown that a 3D scene reconstruction is possible by extending the system to multiple layers. This extension raises the object localization performance and avoids the segmentation of persons into multiple objects. Furthermore a novel false positive handling module is introduced, which eliminates so called ghost objects.

• Chapter 5: Having extracted features, motion patterns and trajectories with the previously described methods, it is possible to analyze human behaviors. Two stages can usually be discriminated: simple so called Low Level Activities (LLA) and more complex scenarios that can be composed from a wide range of activities. An LLA itself can be further decomposed into so called Low Level Features (LLF), which are easily detectable. The crucial task is then to recognize behaviors of interest with an adequate method. Various methods, both static and dynamic, will be investigated in order to recognize facial actions, behaviors of persons in public transportation systems and luggage related events. All recognition systems will be described in the context of real application scenarios and evaluated on real data, if available.

• Chapter 6: The results presented in this work will be summed up in the conclusion. As most of the implemented algorithms have been evaluated with real data, a short assessment of the real world usability of the approaches will be given. Although these may not work perfectly, the results will be interpreted with respect to their applicability. Furthermore, an outlook on future research directions will be given, providing information on improvements and even new directions.




Chapter 2. Detection of Stationary and Moving Objects

The robust detection of changes and objects in the observed video is a prerequisite for each video surveillance system [Ham05], where varying levels of detail are required. The probably most basic solution is a pixel based classification, which provides information on whether a pixel position is occupied by an object or not. In most common systems this step is a simple segmentation of the current frame into foreground and background [Nas06], allowing a statement whether new objects are present or not. Further, models for possibly meaningful colors can be created using prior knowledge, which will be demonstrated exemplarily for skin color, but can also be applied in other scenarios, such as traffic sign recognition [Pau07]. The limited information gained by the simple presence of an object is usually insufficient for the task of automated surveillance. Nevertheless it can be considered a valuable preprocessing step, which draws the system's attention to the detected location and limits the region of interest. This region has to be classified subsequently, in order to gain information on what kind of object caused the change [Lin07]. Depending on the surveillance application, a wide range of objects has to be detected and modeled with varying detail. For instance a person walking on a highway should trigger an alarm, in contrast to a car. For this purpose commonly model based approaches are used to classify objects. Two tasks have to be discriminated [Eve08]: the classification task, which means the prediction of the presence or absence of an object in the image [Zha07], and the detection task, where the bounding box and label of an object have to be predicted [Fel08]. As this work is mainly about human behavior, a robust person detection system is required, allowing the discrimination between a human and a non-human class. Classification is performed in increasing detail. While the detection of complete humans is already a challenging task, furthermore body parts, faces, and even fiducial points¹ are extracted with varying approaches.

¹ Fiducial points are distinctive points in the face, e.g. eyebrow, mouth and eye corners.



2.1. Change Detection

2.1.1. Temporal Difference Images

The detection of moving objects is of great interest in video surveillance tasks, as mostly only changing image regions are of importance [Abd06]. Foreground segmentation approaches, see sec. 2.1.2, only handle changes compared to a static background model and do not state whether an object is moving or not. Additionally, the model has to be maintained and updated, which is rather difficult in highly cluttered scenes with changing lighting conditions and background motion. Therefore methods detecting differences between two images without modeling are desired. A simple and yet effective method to detect changes in a video sequence are temporal difference images [Tia05]. Two subsequent images I(x, y, t) and I(x, y, t - 1) are simply subtracted:

D(x, y, t) = |I(x, y, t) - I(x, y, t - 1)|.   (2.1)

The resulting difference image D(x, y, t) is subsequently thresholded to suppress camera noise and to allow small illumination changes. Fig. 2.1 illustrates the result of the temporal difference, which can be used to approximate an object's shape. This can be explained by background regions being covered, respectively becoming visible in the next frame, while the object is moving. For objects with relatively uniform texture only the shape is detected, whereas the inside is neglected. These holes in the shape make a subsequent blob analysis of connected regions rather difficult. However, this method can also be applied subsequently to extract so called global motion features [Zob03], describing intensity and direction of changes in the image. The overall intensity i(x, y, t) of changes within the image can be determined with

i(x, y, t) = \frac{\sum_{x,y} D(x, y, t)}{\sum_{x,y} 1}.   (2.2)

Subsequently the location of the center of motion \vec{m} = [m_x, m_y] can be computed by

m_x = \frac{\sum_{x,y} x\, D(x, y, t)}{\sum_{x,y} D(x, y, t)}, \qquad m_y = \frac{\sum_{x,y} y\, D(x, y, t)}{\sum_{x,y} D(x, y, t)}.   (2.3)

Since the activities are frequently independent of the location, as this usually contains only limited information, changes of the movement's direction and magnitude are used by computing the difference of means \Delta\vec{m} = \vec{m}(t) - \vec{m}(t - 1). As the computation of the center of motion requires two frames, \Delta\vec{m} models the motion over three frames. To distinguish between motions which cause large or small changes between two adjacent frames, the mean absolute deviation \vec{\sigma} = [\sigma_x, \sigma_y] is defined by

\sigma_x = \frac{\sum_{x,y} D(x, y, t)\,|x - m_x|}{\sum_{x,y} D(x, y, t)}, \qquad \sigma_y = \frac{\sum_{x,y} D(x, y, t)\,|y - m_y|}{\sum_{x,y} D(x, y, t)}.   (2.4)



Figure 2.1.: Exemplary difference image of two subsequent frames. The black line on the right side illustrates ∆m while the white line illustrates the variance.

Changes within a series of variances are furthermore modeled with \Delta\vec{\sigma} = \vec{\sigma}(t) - \vec{\sigma}(t - 1). Fig. 2.1 shows the computed features for a passenger in a scene recorded in an Airbus A380. Changes of the center of motion are illustrated by the black line. Given the image size, these can be considered almost constant. The variance, in contrast, drawn in white, is by far wider, indicating massive upward movement with a slight drift to the right hand side. In the scene the passenger has been standing up, which confirms the computed results.
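A minimal NumPy sketch of eqs. 2.1-2.4 is given below. The function name and the noise threshold of 15 gray levels are illustrative assumptions, not values taken from the thesis.

```python
import numpy as np

def global_motion_features(prev_frame, frame, threshold=15):
    """Global motion features from a temporal difference image (eqs. 2.1-2.4).

    prev_frame, frame: grayscale images as 2D arrays.
    Returns the overall intensity, the center of motion (m_x, m_y) and the
    mean absolute deviations (sigma_x, sigma_y).
    """
    # eq. 2.1: absolute temporal difference, thresholded to suppress camera noise
    diff = np.abs(frame.astype(np.float32) - prev_frame.astype(np.float32))
    diff[diff < threshold] = 0.0

    total = diff.sum()
    if total == 0:                                  # no motion at all
        return 0.0, (0.0, 0.0), (0.0, 0.0)

    h, w = diff.shape
    ys, xs = np.mgrid[0:h, 0:w]

    intensity = total / diff.size                   # eq. 2.2: overall intensity of change
    m_x = (xs * diff).sum() / total                 # eq. 2.3: center of motion
    m_y = (ys * diff).sum() / total
    sigma_x = (diff * np.abs(xs - m_x)).sum() / total   # eq. 2.4: mean absolute deviation
    sigma_y = (diff * np.abs(ys - m_y)).sum() / total
    return intensity, (m_x, m_y), (sigma_x, sigma_y)
```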

2.1.2. Foreground Segmentation

Given a CCTV system with static camera positions, it should be possible to discriminate between background regions and foreground objects. This information can be used to limit the exhaustive search of most detection algorithms to regions of interest. A rather naive description of this task could be the detection of differences between the current frame I(x, y, t) and a background image B(x, y, t) with

FG(x, y, t) = |I(x, y, t) - B(x, y, t)|,   (2.5)

resulting in the foreground image FG(x, y, t), as visualized in fig. 2.2. Unfortunately the background image usually is not fixed and has to adapt to:

• Illumination changes
   – gradual changes, such as daylight to night
   – sudden changes, such as switching a light source on or off in indoor scenarios
• Motion changes
   – camera oscillation
   – moving background patterns, such as bushes or water surfaces
• Changes in the background
   – objects being removed from the scene
   – objects being added to the scene


Figure 2.2.: Example for simple background subtraction. The first two images on top are the background and the current frame. The frame difference is given in the top right image. The lower row shows results for a defined threshold of a minimum intensity change of 0, 10 and 25

In order to fulfill these requirements various approaches have been presented [Pic04]. Due to the lack of annotated databases no quantitative measurements are available. Thus a qualitative performance analysis is given on some example images, extracted from a wide range of application scenarios.

Foreground Segmentation by Thresholding Difference Images

Taking the previous naive assumption into account, a simple difference between an image of the empty scene B(x, y, t) (fig. 2.2a)) and the current frame I(x, y, t) (fig. 2.2b)) can be computed, which is illustrated in fig. 2.2c). As can easily be seen, the inevitable camera noise also produces differences. These can be removed by introducing a threshold \theta_{FG}:

FG(x, y, t) = \begin{cases} 1 & \text{if } |I(x, y, t) - B(x, y, t)| > \theta_{FG} \\ 0 & \text{else} \end{cases}   (2.6)

which results in a binary image mask FG(x, y, t). This threshold denotes the required change in intensity of a pixel position to be detected as a change.



Figure 2.3.: Exemplary rg values of a background pixel with two means.

In order to obtain reliable results, \theta_{FG} has to be chosen carefully, as a too small threshold results in a very noisy binary image (fig. 2.2d)), whereas relevant areas might be filtered out with a too large threshold (fig. 2.2f)). A sufficient binary image is illustrated in fig. 2.2e). Using a static background image will result in a rising error rate over time, as the environment will change either slowly or rapidly. Therefore it has been proposed to compute the median over the last n frames, which is rather fast, but requires some additional memory [Lo01]. This of course is only an issue on highly integrated Digital Signal Processor (DSP) boards with limited memory. Therefore a so called running average [Wre97] is computed with

B(x, y, t + 1) = \alpha I(x, y, t) + (1 - \alpha) B(x, y, t),   (2.7)

and the background model is updated after every time step. The learning rate \alpha is typically set empirically to 0.01, but has to be adjusted according to the application scenario in order to deliver a reliable output. The main disadvantage of this approach is the involvement of foreground regions, as the entire image is updated. Even moving objects are used for averaging, which will create an incorrect mean value in the background model, especially if an object stays static for a short time. Thus only background regions, which are obtained from the actual foreground mask FG(x, y, t), are used for updating:

B(x, y, t + 1) = (1 - FG(x, y, t))\,\alpha I(x, y, t) + \big(1 - (1 - FG(x, y, t))\,\alpha\big) B(x, y, t).   (2.8)
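The thresholded difference image and the running average with selective updating (eqs. 2.6-2.8) can be sketched in a few lines of NumPy. The default values alpha = 0.01 and theta_fg = 25 follow the values mentioned above; taking the channel-wise maximum of the difference image for color input is an additional assumption of this sketch.

```python
import numpy as np

def segment_foreground(background, frame, theta_fg=25):
    """Thresholded difference image (eq. 2.6): 1 where |I - B| exceeds theta_fg."""
    diff = np.abs(frame.astype(np.float32) - background)
    if diff.ndim == 3:
        diff = diff.max(axis=2)          # foreground if any color channel changed enough
    return (diff > theta_fg).astype(np.uint8)

def update_background(background, frame, fg_mask, alpha=0.01):
    """Running-average update with selective updating (eqs. 2.7/2.8).

    Pixels marked as foreground keep their old background value; all others
    are blended with the current frame using the learning rate alpha.
    """
    frame = frame.astype(np.float32)
    blended = alpha * frame + (1.0 - alpha) * background        # eq. 2.7
    fg = fg_mask.astype(np.float32)
    if fg.ndim == 2 and background.ndim == 3:                    # broadcast mask over channels
        fg = fg[..., None]
    return fg * background + (1.0 - fg) * blended                # eq. 2.8 (selective update)
```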

Adaptive Background Modeling Applying Gaussian Mixture Models

Real world scenes unfortunately involve illumination changes, scene changes and moving objects. Fig. 2.3 illustrates an rg-scatter diagram for a pixel located on the ground plane. Due to changing lighting conditions, here clouds in front of the sun, the chosen patch on the ground plane changes its color frequently but in a non-periodic manner. A system with a fixed threshold cannot deal with the two observed means, which makes a multi-modal representation necessary [Way02].


Therefore Stauffer and Grimson [Sta99b] introduced a system based on a Gaussian Mixture Model (GMM), which has been extended by Živković [Ziv04a]. Each pixel changes its value over time t, resulting in observations X_i. At any time t the only thing known about a pixel is its history, a sequence X of observations:

X = \{X_1, ..., X_t\} = \{I(x, y, i) : 1 \leq i \leq t\}.   (2.9)

Given constant lighting conditions and a static background, a pixel's values would be quite constant. In order to absorb noise created by small illumination changes, each value is described by a single Gaussian distribution centered at the mean pixel value. Respecting multiple background observations, each pixel is modeled by a set of K states, where the number of states denotes the number of possible background variations. In order to provide a meaningful representation, some other aspects, besides the tracking of lighting changes, have to be considered. A newly added static object is treated as foreground as long as it has not been present for a longer time period than the previous object. This can lead to accumulated errors in the segmentation task and thus result in poor tracking behavior. Therefore recent changes may be more important than the determined model parameters. With a pixel's history and a defined number of distributions K, depending on the available computational power, the probability p(X_t) of observing a pixel value is

p(X_t) = \sum_{i=1}^{K} \omega_{i,t}\, \nu(X_t, \mu_{i,t}, \Sigma_{i,t}).   (2.10)

The weight \omega_i can be considered the a priori probability of a surface appearing in the pixel view, \mu_i is the mean value of the Gaussian at time t, and \Sigma_i is the covariance matrix associated with the Gaussian. For computational reasons the dimensions of X, here the RGB channels, are assumed to be independent and of the same statistics, allowing the following form for the covariance:

\Sigma_{k,t} = \sigma_k^2\, \mathbf{I},   (2.11)

with \mathbf{I} denoting the identity matrix. The Gaussian probability density function \nu is thereby defined by

\nu(X_t, \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}\, |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(X_t - \mu_t)^T \Sigma^{-1} (X_t - \mu_t)}.   (2.12)

This problem formulation can now be solved with an approximation of the Expectation Maximization (EM) algorithm [Dem77]. As its exact implementation for every pixel is too costly, a new pixel value is simply matched against all K Gaussian components. In case the pixel value is within \theta_\sigma standard deviations of the distribution, it is considered background. This threshold can be set for each scenario independently, yet experience has shown reliable results at \theta_\sigma = 2.5\sigma. In case none of the distributions fulfills this requirement, the least probable one is replaced by a new distribution with the current value as mean and a high variance, but a low weight. The weights are then adjusted as follows:

\omega_{k,t} = (1 - \alpha)\,\omega_{k,t-1} + \alpha\, V_{k,t},   (2.13)

where \alpha is the learning rate and V_{k,t} is 1 for the matched model and 0 for the remaining ones. The learning rate \alpha can be adjusted with the number of frames it takes to incorporate an object into the background. Small update times enable the handling of rapid lighting changes, but at the same time stationary objects are included into the background model after a short time period. Longer update times, in contrast, are favourable in case stationary objects also have to be tracked robustly. Subsequently the weights are normalized, so that \sum_{i=1}^{K} \omega_{i,t} = 1. The parameter \alpha describes a time varying gain, which determines the speed at which the distribution's parameters change. The weights \omega_i can be considered a low-pass filtered average of the probabilities that pixel values have matched a model k. While the parameters of the unmatched distributions remain constant, \mu_i and \sigma_i are updated with

\mu_{i,t} = (1 - \rho_i)\,\mu_{i,t-1} + \rho_i\, X_t,   (2.14)

\sigma_{i,t}^2 = (1 - \rho_i)\,\sigma_{i,t-1}^2 + \rho_i\, (X_t - \mu_{i,t})^T (X_t - \mu_{i,t}),   (2.15)

\rho_i = \alpha\, \nu(X_t | \mu_i, \sigma_i),   (2.16)

where \rho_i is effectively another low-pass filter, which is only defined by data matching a model. New objects, which are added to the background, do not affect the previously built models, as only the least probable one is replaced. The original distribution still exists with the same parameters but a lower weight \omega_i and can be quickly reincorporated into the background. Since each pixel's model parameters change from frame to frame, it is necessary to determine which Gaussians are most likely produced by the background process. Experience shows that these usually have the most supporting evidence and the least variance at the same time. This can easily be explained by the low variance of a static object. In contrast, new objects will not match any distribution and therefore create a new one with a high variance. Even in case an object accidentally fits a distribution, it usually increases the variance of the existing distribution. Additionally, a moving object's variance is considered larger than that of a background pixel until the object stops. In order to model these observations a method to detect background processes is required. All K states are ranked after re-estimating the mixture's parameters by the value of \omega/\sigma, which is proportional to the peak amplitude of the weighted distribution and rises with decreasing variance. This way the most likely background distributions remain on top of the list, while the less probable ones gravitate towards the end of the list and are eventually replaced by a new distribution. Finally the first B of the ranked states, whose summed up weights exceed a threshold \theta_{BG}, are chosen as background model, with

B = \arg\min_{n} \Big( \sum_{i=1}^{n} \omega_i > \theta_{BG} \Big).   (2.17)



Figure 2.4.: A foreground detection example from the Prometheus database. On the left side the empty background and the current frame are shown. The upper row shows the results with an update time of α = 0.01, the lower one with α = 0.001. The threshold θσ is set to θσ = 4, θσ = 8 and θσ = 16 rising from left to the right.

In this context \theta_{BG} is a measure for the amount of data that should at least be accounted for by the background. If small values are chosen for \theta_{BG}, the background model is usually unimodal, while high values for \theta_{BG} result in a multi-modal background representation, allowing more than one color to be included into the background model. While K can basically be set empirically and is capable of modeling most situations, it is inevitable to set both \theta_\sigma and \alpha individually for each scene and sensor setup. The influence of the parameter setting is illustrated with a sequence from the Prediction and inteRpretatiOn of huMan bEhaviour based on probabilisTic structures and HeterogEneoUs sensorS (PROMETHEUS)² dataset [Nta09], which has been recorded in outdoor conditions. Fig. 2.4 shows the empty scene and the image containing a person in the first column on the left hand side. The first row shows examples with a constant update time of \alpha = 0.01, whereas \alpha is set to \alpha = 0.001 in the lower one. The threshold \theta_\sigma rises from four to 16 in both rows. As can be seen, a too small variance causes a large amount of false positives, which can be lowered by broadening the maximum allowed variance. Shorter update times, in contrast, eliminate fast changes and create fewer false positives, at the cost of a larger number of false negatives. The person is included in the background after a while, even though he is walking along the way. A higher update time therefore has to be preferred.

² EU funded FP7 project, FP7-ICT-214901, PROMETHEUS: Prediction and inteRpretatiOn of huMan bEhaviour based on probabilisTic structures and HeterogEneoUs sensorS
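For practical use, the adaptive GMM model of Stauffer and Grimson with Živković's extensions is available as the MOG2 background subtractor in OpenCV. The sketch below shows how the parameters discussed above roughly map onto that implementation; the input file name and the concrete parameter values are placeholders, not settings from the thesis.

```python
import cv2

# Adaptive GMM background model (Zivkovic's extension) as implemented in OpenCV.
# history controls how fast the model adapts (roughly 1/alpha frames),
# varThreshold is the squared distance threshold corresponding to theta_sigma,
# detectShadows marks shadow pixels with the value 127 instead of 255.
subtractor = cv2.createBackgroundSubtractorMOG2(
    history=1000,         # longer history ~ smaller learning rate alpha
    varThreshold=16.0,    # e.g. theta_sigma = 4 standard deviations
    detectShadows=True,
)

cap = cv2.VideoCapture("scene.avi")      # hypothetical input sequence
while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg_mask = subtractor.apply(frame)    # 0 = background, 127 = shadow, 255 = foreground
    fg_only = cv2.threshold(fg_mask, 200, 255, cv2.THRESH_BINARY)[1]  # drop shadow pixels
cap.release()
```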



Figure 2.5.: Component changes in RGB and HSV color space due to shadows. The black arrows indicate the passage of a shadow over the image point. All other changes are created by persons passing over the observed pixel.

2.1.3. Removing Cast Shadows and Highlights

Common foreground segmentation and motion detection techniques are capable of detecting changes of the texture, but cannot provide any information on the cause of a change. Image regions can be altered although they are not occluded by any object in the scene. The most frequent observations are shadows, which are created by objects totally or partially occluding direct light from a light source. According to [Jia92], two kinds of shadow can appear:

• Self-shadows are the parts of the object which are not illuminated by a light source.
• Cast shadows, as visible in the examples in fig. 2.4, are projected onto the scene by objects and can be further divided into umbra regions, where direct light is totally blocked, and penumbra regions, where light is only partially blocked, thus creating a soft transition from dark to light [Sta99a].

A Deterministic Non-Model Based Approach in HSV Color Space

In order to suppress these effects, it seems reasonable to examine the color changes of the shadowed regions and how shadows are created in general. A human observer is able to recognize the color of an object located in the shadow, although its value in the Red Green Blue color space (RGB) has changed. Therefore it seems reasonable to convert the images into a more adequate color space, such as the Hue Saturation Value color space (HSV) [Gon90], which models human visual perception [Her98] and is more sensitive to brightness changes [Pra01]. Furthermore, a shadow cast on a background surface does not significantly change a pixel's hue, but at the same time the pixel becomes darker and lowers its saturation. This observation is plotted in fig. 2.5, both for the RGB and the HSV color space. The arrows indicate changes created by shadows.



Figure 2.6.: Exemplary rule based shadow detection. A simple difference of the background and the actual input creates multiple false positive pixels, as shown in the third image. After shadow detection only the white marked pixel regions remain, whereas the detected shadow, shown in black, is removed.

As can be seen, in the RGB space the intensities are slightly lower, whereas in the HSV space the hue remains almost unchanged while saturation and value change drastically. For a shaded point SP(x, y, t), according to [Duq05], the following rule can be formulated:

SP(x, y, t) = \begin{cases} 1 & \text{if } \theta_{low} \leq \frac{I_V(x, y, t)}{B_V(x, y, t)} \leq \theta_{up} \;\wedge\; (I_S(x, y, t) - B_S(x, y, t)) \leq \theta_S \;\wedge\; |I_H(x, y, t) - B_H(x, y, t)| \leq \theta_H \\ 0 & \text{else} \end{cases}   (2.18)

where the channels of the image I_{HSV}(x, y, t) are compared to the current background image B_{HSV}(x, y, t) in HSV color space. The values for the thresholds have to be chosen carefully depending on the scenery and are usually determined empirically. \theta_{low} depends on the strength of the light source and the reflectance of the objects in the scene. Intense light and highly reflective objects imply lower values for \theta_{low}. Experiments have shown that values ranging from 0.75 to 0.85 are suitable for most tasks. The parameter \theta_{up} ranges between 0.9 and 0.97 and is used to avoid the incorrect classification of shadows as part of a moving object. The maximal variation of saturation and hue is modeled by \theta_S and \theta_H, where \theta_S is set to 15% of the camera's saturation range and the hue should not differ by more than 60 degrees. Figure 2.6 illustrates the results of the shadow detection technique on a sample from the 2006 dataset of the workshop on Performance Evaluation of Tracking and Surveillance (PETS). On the left hand side the original image is given. In the middle the thresholded areas are illustrated. As can be seen, the shadow is also classified as foreground. White areas in the right image illustrate the foreground regions, while the black ones represent the shadows detected by the shadow removal algorithm.
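A compact NumPy/OpenCV sketch of the HSV shadow rule (eq. 2.18) is given below. The threshold defaults follow the ranges stated above; the handling of OpenCV's half-degree hue scale and the hue wrap-around are implementation assumptions of this sketch.

```python
import cv2
import numpy as np

def shadow_mask_hsv(frame_bgr, background_bgr,
                    theta_low=0.8, theta_up=0.95, theta_s=0.15 * 255, theta_h=60):
    """Rule based shadow detection in HSV space (eq. 2.18), after [Duq05].

    Returns a boolean mask that is True where a pixel is classified as shadow.
    Threshold values may need tuning for a given scene.
    """
    img = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    bg = cv2.cvtColor(background_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    h_i, s_i, v_i = cv2.split(img)
    h_b, s_b, v_b = cv2.split(bg)

    ratio = v_i / np.maximum(v_b, 1e-6)               # brightness ratio I_V / B_V
    hue_diff = np.abs(h_i - h_b)
    hue_diff = np.minimum(hue_diff, 180 - hue_diff)   # OpenCV hue lies in [0, 180), wrap around

    return ((ratio >= theta_low) & (ratio <= theta_up) &
            ((s_i - s_b) <= theta_s) &
            (hue_diff <= theta_h / 2))                # 60 degrees = 30 in OpenCV's half-degree scale
```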




Figure 2.7.: Model for shadow and highlight detection. The actual color has to be located outside the cylinder’s bound to be classified as foreground.

A Statistical Non-Parametric Approach

As already stated in sec. 2.1.3, shadowed regions usually keep the same color as before, just a little darker. In contrast, highlighted regions, as illustrated in fig. 2.8, are by far lighter than the original surface, although having the same color. Such sudden illumination changes could be removed by a faster update time within the background model, at the cost of modeling short term stationary objects as background. Both observations can be used to detect highlighted and shaded areas during the foreground modeling process with negligible additional computational cost. Humans tend to assign a constant color to an object, no matter how the illumination changes over time or space. This phenomenon is frequently entitled human color constancy [Pra03]. The perceived color mainly depends on the surface's spectral reflection properties, which are invariant to illumination, scene changes, and geometry. On perfectly matte surfaces the color can be considered the product of illumination and spectral reflectance. Therefore the applied model has to separate the brightness and chromaticity components of an object [Hur89]. An expected background pixel value E(x, y, t), computed for instance with a Gaussian mixture model, is represented by its mean \mu_i and variance \sigma_i in RGB space. It can be altered by either shadow or illumination into the actually perceived pixel value I(x, y, t), as indicated in fig. 2.7. The task is now to determine the distortion between I(x, y, t) and E(x, y, t) by decomposing it into brightness and chromaticity components. The so called brightness distortion b_{dist}(x, y, t) is obtained by minimizing

\phi(b_{dist}(x, y, t)) = (I(x, y, t) - b_{dist}(x, y, t)\, E(x, y, t))^2,   (2.19)



Figure 2.8.: Example of rapidly changing lighting conditions within five frames. The sun is now able to shine directly into the terminal, highlighting some regions.

which represents a pixel's brightness compared to the expected value. b_{dist}(x, y, t) = 1 indicates constant brightness, while values larger or smaller than one are created by a pixel that is lighter or darker than expected. The chromaticity distortion CD(x, y, t) is simply the orthogonal distance between E(x, y, t) and I(x, y, t):

CD(x, y, t) = \| I(x, y, t) - b_{dist}(x, y, t)\, E(x, y, t) \|.   (2.20)

With the estimated mean \mu(x, y, t) = [\mu_r(x, y, t), \mu_g(x, y, t), \mu_b(x, y, t)] and the associated variance \sigma(x, y, t) = [\sigma_r(x, y, t), \sigma_g(x, y, t), \sigma_b(x, y, t)], eq. 2.19 and eq. 2.20 become

b_{dist}(x, y, t) = \frac{ \frac{I_r(x,y,t)\,\mu_r(x,y,t)}{\sigma_r^2(x,y,t)} + \frac{I_g(x,y,t)\,\mu_g(x,y,t)}{\sigma_g^2(x,y,t)} + \frac{I_b(x,y,t)\,\mu_b(x,y,t)}{\sigma_b^2(x,y,t)} }{ \left[\frac{\mu_r(x,y,t)}{\sigma_r(x,y,t)}\right]^2 + \left[\frac{\mu_g(x,y,t)}{\sigma_g(x,y,t)}\right]^2 + \left[\frac{\mu_b(x,y,t)}{\sigma_b(x,y,t)}\right]^2 }   (2.21)

and

CD(x, y, t) = \sqrt{ \left[\frac{I_r(x,y,t) - b_{dist}(x,y,t)\,\mu_r(x,y,t)}{\sigma_r(x,y,t)}\right]^2 + \left[\frac{I_g(x,y,t) - b_{dist}(x,y,t)\,\mu_g(x,y,t)}{\sigma_g(x,y,t)}\right]^2 + \left[\frac{I_b(x,y,t) - b_{dist}(x,y,t)\,\mu_b(x,y,t)}{\sigma_b(x,y,t)}\right]^2 }.   (2.22)

18

2.1. Change Detection

Figure 2.9.: Exemplary shadow detection with a statistic approach. The original foreground is shown in the middle. On the right side the detected shaded regions are illustrated in grey.

s σCD (x, y, t) =

Pt

i=0 (CD(x, y, i))

t

2

(2.24)

With the previously made assumptions and adequate thresholds it is possible to create an object mask M (x, y, t), which indicates the type of the pixel with  d y, t) > θCD ∨ bd  F G : CD(x,  dist (x, y, t) < θdist,min , else     B : d bd dist (x, y, t) < θdist,1 ∨ bdist (x, y, t) < θdist,2 , else M (x, y, t) = (2.25)   SP : bd dist (x, y, t) < 0, else     H : otherwise d Both bd dist (x, y, t) and CD(x, y, t) are the normalized values

and

bdist (x, y, t) bd , dist (x, y, t) = σbdist (x, y, t)

(2.26)

d y, t) = CD(x, y, t) CD(x, σCD (x, y, t)

(2.27)

of bdist (x, y, t) and CD(x, y, t). This is necessary to enable the use of a single threshold for all pixels instead modeling each pixel individually. The original background is detected if both brightness and chromacity are similar for E(x, y, t) and I(x, y, t). Shaded background SP (x, y, t) has similar chromacity but lower brightness than the original background, whereas highlighted regions have a by far higher brightness, yet similar chromacity. A pixel is classified as foreground if its chromacity has different values than the background image. However a lower bound θbdist ,min is additionally introduced to avoid the misclassification of pixels with very low RGB values. As all chromacity lines meet in origin, it might be considered similar to any other point and be classified as shadow.


Figure 2.10.: Exemplary foreground detection with and without highlight detection for the images shown in fig. 2.8. False positive pixels are largely removed on the right-hand side.

The elimination of shadows performs quite similarly to the rule-based approach presented above, but is still preferred, as no additional computation is required and the modeling can be performed in parallel to the background model. Further, it is able to suppress the influence of highlights. An example is given in fig. 2.9, where the original image can be seen on the left and the extracted foreground is illustrated in the middle. After applying the shadow detection, the regions marked in gray in the right image can be removed. As illustrated in fig. 2.10, the proposed method consequently removes most of the false positives created by drastic illumination changes. Nevertheless, not all of them are removed, as in some cases overexposure has been observed.
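As a sketch of how the distortion measures and the classification rule of eqs. 2.21-2.27 could be applied per frame, the following NumPy snippet is given. The array names and threshold values are illustrative assumptions, and for clarity the brightness test is performed on b_dist directly, using a band around 1, rather than on the normalized value.

```python
import numpy as np

def classify_pixels(I, mu, sigma, sigma_cd, th_cd=4.0, th_b_lo=0.6, th_b_hi=1.4):
    """Sketch of the statistical shadow/highlight test (eqs. 2.19-2.25).

    I, mu, sigma: HxWx3 float arrays (current frame, per-pixel background mean/std).
    sigma_cd: HxW normalization factor from eq. (2.24).
    Threshold values are placeholders and must be tuned per scenario.
    Returns an integer mask: 0 = foreground, 1 = background, 2 = shadow, 3 = highlight.
    """
    eps = 1e-6
    # brightness distortion, eq. (2.21)
    b = np.sum(I * mu / (sigma ** 2 + eps), axis=2) / \
        (np.sum((mu / (sigma + eps)) ** 2, axis=2) + eps)
    # chromaticity distortion, eq. (2.22)
    cd = np.sqrt(np.sum(((I - b[..., None] * mu) / (sigma + eps)) ** 2, axis=2))
    # normalized chromaticity distortion, eq. (2.27)
    cd_hat = cd / (sigma_cd + eps)

    mask = np.zeros(b.shape, dtype=np.uint8)                     # default: foreground
    similar_chroma = cd_hat <= th_cd
    mask[similar_chroma & (b >= th_b_lo) & (b <= th_b_hi)] = 1   # background
    mask[similar_chroma & (b < th_b_lo)] = 2                     # shadow (darker)
    mask[similar_chroma & (b > th_b_hi)] = 3                     # highlight (brighter)
    return mask
```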

2.1.4. Blob Detection and Preprocessing

For a further analysis of regions of interest, it is required to determine areas representing an object or a group of objects. Therefore connected regions have to be detected within the created binary images. In order to receive reliable results, some additional preprocessing steps are inevitable. These are illustrated in fig. 2.11 and performed in the following order:

• Gaussian Smoothing: in the first step, a Gaussian filter operation [Dav04] is applied on the gray scale image I(x, y) produced by the foreground segmentation. This way the image is blurred and most of the noise, commonly created by false positives, is removed. By blurring the image most of the holes within the objects are filled, allowing a more detailed 2D abstraction. The two-dimensional filter is defined by

\[ \nu(x,y) = \frac{1}{2\pi\sigma_i^2}\, e^{-\frac{r^2}{2\sigma_i^2}} \qquad (2.28) \]

with the variance σ_i and the blur radius r² = x² + y², indicating the filter size. The result is illustrated in fig. 2.11b).


Figure 2.11.: Exemplary blob detection results and preprocessing steps, starting with an input image in a). b) The smoothed image, c) after erosion and dilatation, d) the detected blobs, e) the assigned boundaries and bounding boxes

• Thresholding: subsequently a binary image is created by simply thresholding the smoothed gray scale image with a pixel-wise decision

\[ I'(x,y,t) = \begin{cases} 0 & \text{if } I(x,y) \leq \theta \\ 1 & \text{if } I(x,y) > \theta . \end{cases} \qquad (2.29) \]

The threshold is determined empirically and aligned to the values of shadows and highlights in this case. Fig. 2.11c) shows the image after thresholding.

• Erosion and Dilatation: the morphological neighborhood operations erosion and dilatation are frequently used to modify and analyze the form of an object, where the operational result is usually one or zero. In order to fill small holes or cracks and smoothen an object's contour, a dilatation is performed [Jäh95]. It creates a white pixel if at least one white pixel is present in the neighborhood, which commonly consists of the eight-connected neighbors of the pixel. Small objects and object noise are removed by eroding the image, which turns a white pixel into a black one in case at least one black pixel is present in the neighborhood. Both operations can usually be combined and performed with binary operators. An exemplary result is shown in fig. 2.11.

After preprocessing, blobs can be detected in the binary image. Various approaches have been used for connected component analysis [Sha01],[Gon90]. Probably the most popular among them is the so-called two-pass method. A 3 × 3 connected components operator scans every pixel in the image and decides whether a pixel of I(x, y) is or is not an element of the background. In case a white pixel is found, the previously scanned pixels are examined, which are the three above and the one to the left of the current pixel. The following operations are subsequently performed:

• If none of the four neighbors has been labeled, assign a new label
• If only one neighbor has been previously labeled, assign its label to I(x, y)


Figure 2.12.: 2 Persons approaching a sofa in a smart room. From left to right: NIR image of the scene with texture information, the aligned range image, and the segmented foreground region

• If more than one of the neighbors has been labeled before, assign the smallest label to I(x, y) and store the equivalence of the neighboring labels.

In the second pass the connected components operator once more scans the image and relabels every foreground pixel with the lowest equivalent label. After the labels have been assigned, see fig. 2.11d), the bounding boxes and boundaries of the present objects can be extracted, as illustrated in fig. 2.11e).
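A compact sketch of the preprocessing chain and the connected component analysis described above is given below. It uses OpenCV primitives as a stand-in for the thesis' own implementation, and all parameter values are placeholders.

```python
import cv2
import numpy as np

def detect_blobs(fg_gray, blur_sigma=2.0, threshold=127, min_area=50):
    """Sketch of the preprocessing chain of sec. 2.1.4 using OpenCV primitives.

    fg_gray: 8-bit gray scale foreground image from the segmentation step.
    Parameter values are illustrative and have to be tuned per scenario.
    Returns bounding boxes (x, y, w, h) of the detected blobs.
    """
    # 1) Gaussian smoothing to suppress isolated false positives (eq. 2.28)
    smoothed = cv2.GaussianBlur(fg_gray, (0, 0), blur_sigma)
    # 2) pixel-wise thresholding into a binary mask (eq. 2.29)
    _, binary = cv2.threshold(smoothed, threshold, 255, cv2.THRESH_BINARY)
    # 3) dilatation fills small holes, erosion removes speckle noise
    kernel = np.ones((3, 3), np.uint8)          # eight-connected neighborhood
    binary = cv2.dilate(binary, kernel)
    binary = cv2.erode(binary, kernel)
    # 4) connected component analysis (replaces the explicit two-pass labeling)
    n, labels, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
    boxes = []
    for i in range(1, n):                       # label 0 is the background
        x, y, w, h, area = stats[i]
        if area >= min_area:
            boxes.append((x, y, w, h))
    return boxes
```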

2.2. Supplementing Foreground Segmentation with 3D Range Data

Most common object detection methods, either based on foreground segmentation [Sta99b] or a trained object detector [Pap00], suffer from limited discriminative abilities in case two or more objects occlude each other. Besides the insufficient detection abilities, tracking is also made harder by merging or separating objects. To cope with this, Elgammal and Davis [Elg01] presented a general framework which uses maximum likelihood estimation and occlusion reasoning to obtain the best arrangement for humans. However, a single view is often not sufficient to detect and track objects due to severe occlusion, which usually requires the utilization of multiple camera views [Kha06]. This approach is usually only applicable in large spaces, as it requires overlapping, complementary views. Furthermore, person localization usually fails if the feet are not included within the foreground blob. Hence it is necessary to evaluate other methods for the detection and segmentation task in images. With new emerging technologies it is possible to overcome these problems. In order to overcome the partial occlusion problem, it has been suggested to use a Photonic Mixture Device (PMD) [Ars09b, Wal07], which is based on time of flight (cf. C.1.2). This camera creates a 3D view of the scene and computes the distance between the camera and the object. Fig. 2.12 displays an exemplary view of the PMD sensor, showing two persons approaching a sofa. Both range and textural information can be captured by the camera. Further, it can be observed that both objects are connected in the foreground image and hence only one object is detected.


Figure 2.13.: Comparison of foreground segmentation in range data [a)+b)] and textural information [c)+d)]. Evidently the segmentation in the range image is more exact.

2.2.1. Foreground Segmentation in Range Data

In order to detect objects in range images, a common adaptive foreground segmentation method, based on the work presented by Stauffer and Grimson [Sta99b], is applied. Each pixel of the image is modeled by K Gaussian mixtures. This seems reasonable, as each pixel's variance due to noise can be modeled. Usually K = 3 Gaussians are sufficient to compute a model for background, foreground and shadow separately. The parameter α, denoting the update time in frames, has to be set carefully. Experiments have shown that a long update time is required: especially if persons remain next to stationary background objects, they would otherwise be modeled as background after a while. This approach can basically be applied both to the range data and the Near Infrared (NIR) data without influencing the further processing chain. However, experiments have shown that the foreground segmentation performs more robustly on the range image. In the NIR data, foreground and background are frequently confused due to the reflective properties of the materials in the scene, as illustrated in fig. 2.13. This does not necessarily happen in every sensor setup, and the choice should therefore be adapted to each setup.

2.2.2. Separation of Humans in 3D Data

After the first step, the segmentation of foreground regions in the sensor data, the foreground can be evaluated further. The resulting binary foreground mask FG(x, y, t) is multiplied with the depth image DI(x, y, t), leaving only the region of interest behind. As can be seen in fig. 2.12, there is still range information left to be utilized, although it is basically only one big blob. Now that the depth information is limited to the foreground region, it is possible to


Figure 2.14.: From left to right: The segmented foreground region, the detected depth gradients, two foreground blobs after thresholding.

separate occluding objects. Therefore a so-called depth gradient

\[ G_z(x,y,t) = \frac{\delta I(x,y,t)}{\delta z} \qquad (2.30) \]

is computed for the remaining foreground. As the data is basically represented by a two-dimensional matrix, the gradient can be determined from the image gradients G_x(x, y, t) and G_y(x, y, t) by

\[ G_z(x,y,t) = \sqrt{G_x^2 + G_y^2} = \sqrt{\left(\frac{\delta I(x,y,t)}{\delta x}\right)^2 + \left(\frac{\delta I(x,y,t)}{\delta y}\right)^2}. \qquad (2.31) \]

The resulting gradients G_x(x, y, t), G_y(x, y, t) and G_z(x, y, t) are visualized in fig. 2.14. Similar to an image gradient, which detects boundaries and high contrasts, the depth gradient detects gaps in the range data. Obviously the gradients with the highest intensity have been detected mainly at the object boundaries, while some gradients with a smaller absolute value are observed within the objects. As the range image is quite noisy, the entire foreground blobs show gradients with values larger than zero. In order to extract the area of interest GI(x, y, t), here the two separated persons, it has been decided to threshold the gradient image and remove all gradient values larger than a predefined threshold θ_dist:

\[ GI(x,y,t) = \begin{cases} 0 & \text{if } G_z(x,y,t) > \theta_{dist} \\ G_z(x,y,t) & \text{else.} \end{cases} \qquad (2.32) \]

Experience has shown that the threshold has to be set individually for each application scenario, as it denotes the minimum distance between two objects. Now that only the inner surfaces of the persons remain, these can be used to detect blobs within the gradient image with connected components analysis [Gon90]. The detected blob boundaries hence are the real object boundaries, as illustrated in fig. 2.15. Subsequently the number of objects in the scene is estimated by applying connected components analysis and counting the number of blobs. This method unfortunately creates some noise in the detection process, as it is highly dependent on the quality of the foreground segmentation. Hence small additional objects can often be observed in the segmented data, raising the


Figure 2.15.: Separated persons in the NIR image

number of detected persons. As a thresholding based on object size is not applicable, since objects that are almost entirely occluded would also be removed, another method had to be found. Examining the output showed that these small additional segments are only visible for a short period of time and appear quite rarely. Therefore it is sufficient to window the frame-wise output and remove short-time peaks in the detected image sequence. Experiments have shown that a simple majority vote within a window of ten frames is ideal for most sequences.
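The following sketch illustrates the depth-gradient segmentation and the majority vote over a ten-frame window, assuming the range image and the binary foreground mask are already available; the function and variable names are illustrative.

```python
import numpy as np
from scipy import ndimage

def count_persons(depth, fg_mask, grad_thresh):
    """Sketch of the depth-gradient segmentation of sec. 2.2.2.

    depth: HxW range image, fg_mask: HxW binary foreground mask.
    grad_thresh encodes the minimum distance between two objects and is
    scenario dependent; all names here are illustrative.
    """
    roi = depth * fg_mask                        # restrict range data to the foreground
    gy, gx = np.gradient(roi)                    # image gradients G_x, G_y
    gz = np.hypot(gx, gy)                        # depth gradient magnitude, eq. (2.31)
    inner = (gz <= grad_thresh) & (fg_mask > 0)  # keep inner object surfaces, eq. (2.32)
    _, n_blobs = ndimage.label(inner)            # connected components analysis
    return n_blobs

def smooth_counts(counts, win=10):
    """Majority vote over a sliding window, removing short-time spurious detections."""
    smoothed = []
    for i in range(len(counts)):
        window = counts[max(0, i - win + 1):i + 1]
        smoothed.append(int(np.bincount(window).argmax()))
    return smoothed
```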

2.2.3. Evaluation

The presented method has the ability to segment objects partially occluding each other in range data and to create an object shape for each individual in the scene [Ars09b]. Due to the lack of annotated data, an evaluation of localization accuracy has been omitted. For first trials the number of detected objects in an image frame is compared to hand-labeled ground truth, containing the correct number of persons present in the scene. Tab. 2.1 shows the current results for the smart home scenario recorded for the PROMETHEUS database [Nta09]. For evaluation purposes, 1000 frames with a constant spacing of ten frames have been extracted from the first recorded scenario; this way a sequence with a length of 10,000 frames has been covered. The table lists the detected number [#] of persons summed over all frames together with the provided ground truth, indicating a low false positive rate, which can be explained by errors in the background model. As more meaningful measures, the average number of persons per frame avg[#] and the accuracy acc are also given. These results show that in 95% of all frames the number of blobs has been detected correctly. Further, a gain of twelve percent in accuracy has been observed after windowing and averaging the detector output. Both approaches show significantly better results than the simple blob detection in the foreground image. The presented approach is currently implemented in MATLAB for evaluation purposes. Even this implementation has been able to process twelve frames per second, which


                 [#]     avg[#]   acc. [%]
ground truth     1400    1.4      –
2D data only     1173    1.2      61.3
gradient only    1803    1.8      84.3
with averaging   1498    1.5      95.3

Table 2.1.: Detected number of persons in the smart home scenario after the detection of depth gradients and averaging.

indicates that real-time processing performance can be expected from an optimized implementation in C++. This should enable an integration into existing surveillance systems without notable additional requirements on processing capacities.

2.3. Detection of Skin Colour Regions

The detection of skin color regions is usually not sufficient to detect faces or hands robustly, yet it provides a reasonably good guess. Due to the presented methods' low computational effort, they can be applied to limit the search areas of more complex methods to the most probable regions. In the past various approaches have already been developed, aiming for a mathematical representation of skin color. Within the computer vision community various different color spaces have been introduced, each with different characteristics. For the task of skin color detection the normalized RGB color space (rgb) has frequently been proposed [Vez03]. It is easily obtained from the RGB values with

\[ r = \frac{R}{R+G+B}, \quad g = \frac{G}{R+G+B}, \quad b = \frac{B}{R+G+B}. \qquad (2.33) \]

As the sum of the three normalized components is constant, r + g + b = 1, one dimension can be omitted. The remaining components are independent of the brightness of the source RGB values and are therefore often called pure colors. The task is now to define a region in rg-space that contains skin color. Unfortunately, modeling this region is rather difficult. Depending on the spectrum of the light source, the observed skin color might differ dramatically between two scenario setups. Therefore it is inevitable to consider the overall setup before creating or applying a model. This constraint is taken into account by the two approaches described in the following sections.

2.3.1. Physically Motivated Skin Locus Model

Störring [Stö99] investigated the influence of varying color temperature of the light source on the image formation process. In the first place the skin reflectance properties are considered, which can be described by a dichromatic reflection model. Light L, reflected from a point on nonuniform material, can be described as a mixture of light L_surf reflected


at the surface and light L_body reflected by the material body:

\[ L = L_{surf} + L_{body}. \qquad (2.34) \]

The corresponding layers of the human skin are the thin epidermis at the surface and the thicker dermis beneath it. Only 5% of the incident light is reflected at the epidermis surface, whereas the rest enters the skin, where it is absorbed and scattered by the two layers. The epidermis can be modeled as an optical filter and absorbs some of the light, which is usually neglected. The transmitted light depends on its wavelength and the dopa-melanin concentration, which differs between humans. In the dermis, whose optical properties are basically the same for all humans, the light is mainly absorbed. Therefore the skin color is determined by the epidermis transmittance and hence depends only on the dopa-melanin concentration. The color appearance of an object can now be described by its reflective properties and the Correlated Color Temperature (CCT) of the light the object is exposed to. The spectrum of the light source can be characterized by the CCT, where a low CCT gives material a reddish appearance, while a high CCT illuminates the material in a more bluish way. After the transformation into the rgb color space, all human skin colors are located in a narrow band. Changing the CCT will consequently move the skin color area in the rg-plane. Using a white balanced camera with constant parameters, the skin colored region forms the so-called skin locus. The position of the skin locus, as visualized in fig. 2.16, can be described as a sickle shaped region located near the white spot in the rg-plane. Two quadratic functions can be used to describe the lower bound θ_low and the upper bound θ_up of the sickle. White color is removed by the definition of a circle W_r centered at the white point with r = 0.33 and g = 0.33. The following parameters have shown reliable results in most application scenarios:

\[ \theta_{low} = -1.8423\,r^2 + 1.5294\,r + 0.0422 \qquad (2.35) \]

\[ \theta_{up} = -0.7279\,r^2 + 0.6066\,r + 0.1766 \qquad (2.36) \]

\[ W_r = (r - 0.33)^2 + (g - 0.33)^2 < 0.004 \qquad (2.37) \]

The final classification is then performed with

\[ SM(x,y,t) = \begin{cases} 1 & \text{if } (g < \theta_{up}) \wedge (g > \theta_{low}) \wedge (W_r > 0.004) \\ 0 & \text{else,} \end{cases} \qquad (2.38) \]

where SM(x, y, t) is the resulting skin mask. As can be seen in the detection example shown in fig. 2.17 on the left side, the performance is quite acceptable for an aircraft scenario. Some errors were made in regions obviously not containing skin, such as the passengers' trousers and the seats. This illustrates the weakness of this approach, the attempt to model many different lighting conditions, even though only four CCTs were modeled here. Some colors might be part of the skin locus in a particular sensor setup, but not in another one. Therefore the position and form of the skin locus in rg-space still has to be adapted, which would require an additional training step [Mar03].
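A minimal sketch of the skin locus test, assuming a white balanced camera and using the coefficients of eqs. 2.35-2.38, could look as follows.

```python
import numpy as np

def skin_locus_mask(img_rgb):
    """Sketch of the skin locus test of sec. 2.3.1 in normalized rg space.

    img_rgb: HxWx3 RGB image. The polynomial coefficients and the white-point
    circle follow eqs. (2.35)-(2.38) and assume a white balanced camera.
    """
    rgb = img_rgb.astype(np.float64)
    s = rgb.sum(axis=2) + 1e-6                          # R + G + B
    r = rgb[..., 0] / s
    g = rgb[..., 1] / s
    theta_low = -1.8423 * r**2 + 1.5294 * r + 0.0422    # lower bound of the sickle
    theta_up  = -0.7279 * r**2 + 0.6066 * r + 0.1766    # upper bound of the sickle
    w_r = (r - 0.33)**2 + (g - 0.33)**2                 # distance to the white point
    return (g > theta_low) & (g < theta_up) & (w_r > 0.004)
```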


Figure 2.16.: Left: The region covered by the skin locus model. Right: The region covered by a GMM, trained for an indoor aircraft scenario.

2.3.2. Modeling Skin Color With A Single Gaussian

In real life applications the physically motivated approach frequently includes a far too large region of the rg-space and thus produces false positives. In order to deal with more difficult lighting situations the locus has to be further adapted, which requires additional costly measurements. As an alternative, a single Gaussian can be applied to model an elliptical joint probability function, which is defined as

\[ p(I(x,y)\,|\,skin) = \frac{1}{2\pi\,|\sigma_i|^{\frac{1}{2}}}\; e^{-\frac{1}{2}\,(I(x,y)-\mu_i)^T \sigma_i^{-1}\,(I(x,y)-\mu_i)}. \qquad (2.39) \]

In this case I(x, y) denotes a pixel's value in the normalized rg color space and µ_i and σ_i are the distribution's mean and covariance. The model parameters are estimated with manually selected skin color patches in an initial training phase [Yan99] by

\[ \mu_i = \frac{1}{n} \sum_{j=1}^{n} I_j(x,y) \qquad (2.40) \]

and

\[ \sigma_i = \frac{1}{n-1} \sum_{j=1}^{n} (I_j(x,y) - \mu_i)(I_j(x,y) - \mu_i)^T, \qquad (2.41) \]

where n is the total number of skin examples. In an exhaustive training phase with images collected from TV shows and scenarios recorded in an Airbus A380 mock-up, the following parameters have been estimated:

\[ \mu_S = \begin{pmatrix} 0.4212 \\ 0.3151 \end{pmatrix}, \qquad \Sigma_S = 10^{-2} \begin{pmatrix} 0.4440 & -0.2164 \\ -0.2164 & 0.1459 \end{pmatrix} . \qquad (2.42) \]


Figure 2.17.: Skin color detection with skin locus (left) and a single Gaussian (right)

The resulting region covered in rg-space is illustrated in fig. 2.16 on the right side. Compared to the skin locus, the computed parameters enclose a far smaller region. The probability p(I(x, y)|S) can now be used to estimate how skin-like a pixel is [Men00]. A binary mask SM(x, y, t), analogous to the approach presented in sec. 2.3.1, can be created by simply thresholding the skin probability:

\[ SM(x,y,t) = \begin{cases} 1 & \text{if } p(I(x,y)\,|\,S) > \theta_{low} \\ 0 & \text{else} \end{cases} \qquad (2.43) \]

with a previously defined threshold θ_low for the minimum skin probability. An already thresholded probability mask is displayed in fig. 2.17 on the right-hand side. As can be seen, faces and hands are robustly detected in the image. Compared to the other approach the amount of false positives is drastically lower, which can be explained by the narrower rg region defined as skin color. Although only a single Gaussian has been applied to estimate the skin color region, remarkable results have been achieved. While mixtures of Gaussians are able to model multiple skin-like regions [Jon02] with high accuracy, the single Gaussian is frequently favored due to its lower computational effort. The large region defined by the skin locus is advantageous in unknown environments with a white balanced camera, though some additional noise is induced. As soon as a closed setting is chosen, a trained Gaussian's narrow band is the right choice. Within this work the skin color regions have mainly been used to limit the search regions of face detection systems with higher computational effort. Therefore further preprocessing steps have not been conducted, though some of the noise could be removed and a quite exact positioning of arm and face candidates could be performed. For the detection process only the relative amount of skin located within a sampling window will be considered with

\[ p_S(rect(x,y)\,|\,skin) = \frac{\sum_{i=x}^{x+b_s} \sum_{j=y}^{y+b_s} SM(i,j)}{b_s^2}, \qquad (2.44) \]

where bs is the size of a quadratic sampling window.
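The single Gaussian model and the window-based skin ratio of eq. 2.44 can be sketched as follows; the distribution parameters are taken from eq. 2.42, while the function names and implementation details are illustrative.

```python
import numpy as np

# Parameters from eq. (2.42): trained rg skin model
MU_S = np.array([0.4212, 0.3151])
SIGMA_S = 1e-2 * np.array([[0.4440, -0.2164],
                           [-0.2164, 0.1459]])

def skin_probability(img_rgb, mu=MU_S, sigma=SIGMA_S):
    """Per-pixel skin likelihood with a single Gaussian in rg space, eq. (2.39)."""
    rgb = img_rgb.astype(np.float64)
    s = rgb.sum(axis=2, keepdims=True) + 1e-6
    rg = (rgb / s)[..., :2] - mu                         # centered rg coordinates
    inv = np.linalg.inv(sigma)
    maha = np.einsum('...i,ij,...j->...', rg, inv, rg)   # Mahalanobis distance
    norm = 1.0 / (2.0 * np.pi * np.sqrt(np.linalg.det(sigma)))
    return norm * np.exp(-0.5 * maha)

def window_skin_ratio(skin_mask, x, y, bs):
    """Relative amount of skin inside a bs x bs sampling window, eq. (2.44)."""
    patch = skin_mask[y:y + bs, x:x + bs]
    return patch.sum() / float(bs * bs)
```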


2.4. Pedestrian Detection Based on Haar Basis Features

The detection of pedestrians is an inherently hard task, as their appearance changes rapidly due to perspective, clothing, pose, and illumination. Further difficulties appear due to cluttered background and partial occlusions [Gra07a]. First attempts for the localization of persons have been presented in sec. 2.1.2. These suffer from limited discriminative abilities, as they only register changes within the scene and require a static camera. Furthermore, the background has to be more or less static. The high variability of the pedestrian class has made it difficult to choose the best fitting method. Therefore various approaches have been investigated in the past [Gav99]. Most of these intend to find an adequate set of features to generalize the entire human and detect it by an exhaustive search over a scale-space pyramid. Each extracted window is subsequently tested with a classifier. In this work an approach first introduced in [Moh01] is investigated and extended. In its initial form an over-complete representation with Haar features is used to describe the human class. In order to discriminate between humans and non-humans the resulting feature vector is classified by a Support Vector Machine (SVM), which is a popular approach to solve two-class problems. As real world scenarios suffer from occlusion and varying pose and perspective, it seems reasonable to segment the human class and perform the detection task with a part based classifier. Further, it will be shown that the large resulting feature vectors can be drastically reduced by selecting the most discriminative features with AdaBoost.

2.4.1. Over Complete Representation with Haar Features

A fundamental principle of signal analysis is that a function f(x) can be represented by a linear combination of simple basis functions ψ_i(x):

\[ f(x) = \sum_i \omega_i\, \psi_i(x), \qquad (2.45) \]

where the coefficients ω_i indicate the weights of the corresponding basis functions. A function space W, being used to approximate a function f, is thereby defined as a set of Haar basis functions [Haa10]. As the baseline of this representation the Haar mother wavelet is defined by

\[ \psi(x) := \begin{cases} 1 & \text{for } 0 \leq x < \tfrac{1}{2} \\ -1 & \text{for } \tfrac{1}{2} \leq x < 1 \\ 0 & \text{otherwise} \end{cases} \qquad (2.46) \]

All Haar wavelets can be derived by scaling and translating the mother wavelet ψ(x):

\[ \psi_{m,n}(x) := 2^{-\frac{m}{2}}\, \psi(2^{-m}x - n). \qquad (2.47) \]

A function f(x) can be approximated at a certain resolution m by a linear combination of translated Haar wavelets

\[ f'(x) = \sum_n v_{m,n}\, \psi_{m,n}(x), \qquad (2.48) \]

where the coefficients v_{m,n} are calculated by the projection of f(x) on each basis function

\[ v_{m,n} = \langle \psi_{m,n}, f \rangle = \int_{n2^m}^{(n+1)2^m} \psi_{m,n}(x)\, f(x)\, dx \qquad (2.49) \]

f'(x) will be referred to as the approximation of f(x) in the function space W^m, or the representation of f(x) at resolution m. A second set of functions, orthogonal to the Haar wavelets, can be derived from

\[ \varphi(x) := \begin{cases} 1 & \text{for } 0 \leq x < 1 \\ 0 & \text{otherwise} \end{cases} \qquad (2.50) \]

which is called the Haar scaling function. By scaling and translating, an orthogonal function space W^m is obtained with

\[ \varphi_{m,n}(t) := 2^{-\frac{m}{2}}\, \varphi(2^{-m}t - n) \qquad (2.51) \]

to approximate the function f(x) in W^m:

\[ A_m(f) = \sum_n \mu_{m,n}\, \varphi_{m,n} \qquad (2.52) \]

\[ \mu_{m,n} = \langle \varphi_{m,n}, f \rangle \qquad (2.53) \]

A_m(f) is used to indicate that this operation on the signal is similar to calculating the mean average of adjacent values. This leads to an algorithm, also known as the Fast Haar Transform, where the scaling coefficients s and the wavelet coefficients w are calculated recursively by

\[ s_{m,n} = \sum_{k\in\mathbb{Z}} h_{k-2n}\, s_{m+1,k} \qquad (2.54) \]

\[ w_{m,n} = \sum_{k\in\mathbb{Z}} g_{k-2n}\, s_{m+1,k}, \qquad (2.55) \]

with the filter coefficients

\[ h = \{\ldots, 0, 0, \tfrac{1}{2}, \tfrac{1}{2}, 0, 0, \ldots\} \qquad (2.56) \]

\[ g = \{\ldots, 0, 0, -\tfrac{1}{2}, \tfrac{1}{2}, 0, 0, \ldots\} \qquad (2.57) \]

The scaling coefficients are simply the averages of adjacent coefficients at the coarser resolution, while the wavelet coefficients are the mean differences. Up to now only projections for 1D signals have been addressed. As image processing is intended, which obviously requires 2D signals, an extension of the superposition principle using Haar


wavelets is needed. 2D wavelets can be derived by the tensor product of two 1D wavelet transforms [Pap00]. For Haar wavelets three 2D wavelet basis functions with strong responses to vertical (ψ_ver), horizontal (ψ_hor) and diagonal (ψ_diag) boundaries are defined by

\[ \psi_{ver}(x,y) = \psi(x) \otimes \varphi(y) \qquad (2.58) \]

\[ \psi_{hor}(x,y) = \varphi(x) \otimes \psi(y) \qquad (2.59) \]

\[ \psi_{diag}(x,y) = \psi(x) \otimes \psi(y) \qquad (2.60) \]

The wavelet representation of an image has several advantages for the object detection task. Besides the fast calculation, it maps the pixel domain to a multi-resolutional representation. This means that the signal can be analyzed in a sequence of subspaces W^0 ⊂ W^1 ⊂ W^2 ⊂ ... W^m ⊂ W^{m+1}, where W^{m+1} describes finer details of the signal than W^m. Intensity differences over large regions can be observed at a coarse resolution, while the details of local regions are visible at a fine resolution. A strong response of a coefficient indicates a high similarity between the signal of interest and the corresponding wavelet, while weak responses denote an orthogonal signal. No response to any coefficient occurs in uniform image regions. The motivation to use wavelets for object detection is evidently that the transformation can capture visual features, such as boundaries and intensity differences, and map different images of one object class to similar features. It should be noted that besides the big advantages of simplicity and computational efficiency the Haar approach also inherits one significant drawback: the transformation is not optimal with respect to its stability. This can be simply verified by observing the response of the vertical Haar wavelet. If the filter coefficients are computed exactly at a sharp white-black vertical edge, the resulting value will be very high, indicating high signal similarity at the actual image position. If the filter is shifted by even a very small amount to the right or left, the coefficient will be significantly smaller. Consequently the Haar feature transform cannot be considered invariant to small signal shifts. The utilized approach is based on so-called Haar-like basis features [Vio01], which are reminiscent of Haar wavelets. In contrast to the strict theoretical constraints on translation, scaling and appearance of the wavelets, the Haar-like features can be defined more flexibly. This flexibility enables one to adapt the filters to some degree to a specific object detection problem by modifications of their structure. Deviations from the defined translations allow for some amount of redundancy in feature space, making the classification task more stable. For the person detection system the following seven features, as displayed in fig. 2.18, will be used:

• three two-rectangle features at a resolution of eight, 16 and 32 pixels, for the repre-


Figure 2.18.: Exemplary image representations with all seven feature types for three sizes of bs = 8, 16, 32 with an overlap of 75%.

sentation of local and global intensity differences in horizontal, vertical and diagonal direction, such as the boundaries of shoulders, body and legs
• two three-rectangle features at a resolution of eight, 16 and 32 pixels, for the representation of horizontal and vertical lines of different size, for instance arms and legs
• one center-surround feature at a resolution of eight, 16 and 32 pixels, for the representation of circular and rectangular areas, like the head or the hands
• and a feature \sqrt{\psi_{ver}^2 + \psi_{hor}^2}, as a non-linear combination of the horizontal and vertical feature at a resolution of eight, 16 and 32 pixels, used to approximate the length of the gradient at the feature location

In order to gain a meaningful representation, features are extracted with 75% overlap, creating a redundant feature set. This should create a better fitting model for the pedestrian class. The distance of two neighboring features will accordingly be ¼ b_s in x and y direction, with b_s denoting the block size of a feature. This representation is referred to as a feature transform with quadruple density.


Images with a resolution of 128 × 64 pixels will be used for the training procedure. Although this might appear very small, this size should be sufficient for the detection task. Additionally, it limits the number of possible features within the image. In order to represent characteristic image regions within the pedestrian or the negative class, the use of varying feature sizes seems advisable. The seven Haar-like features will be calculated at the resolutions of 32 × 32, 16 × 16, and 8 × 8 pixels. The first resolution of 32 pixels results in 15 × 5 = 75 feature coefficients, the resolution of 16 pixels in 29 × 13 = 377 coefficients, and the resolution of 8 pixels in 61 × 29 = 1769 feature coefficients. These coefficients need to be computed for all seven feature classes. Although there are three possible color channels to compute the Haar wavelets' values, only one meaningful feature is chosen: the filter is applied in every channel separately and subsequently the largest absolute value is chosen as the most discriminative feature. Doing this provides some more robustness to varying colors in the pedestrian class and hence more stability. Further color invariance can be gained by the computation of the absolute values of the filter responses. In contrast to the face detection task [Vio01], the direction of the image gradients does not matter. For instance a person wearing a black jacket could be standing in front of a white background, which results in a feature response f(x). If the constellation were inverted, meaning a person wearing a white jacket in front of a black background, the response's sign would be inverted to −f(x). In the face detection task, in contrast, the sign is important to describe geometrical relationships between image regions, such as a dark eye brow being located above the lighter eye lid.

Besides their discriminative abilities, Haar-like features are popular due to their very convenient and computationally efficient evaluation. In image processing tasks the Haar feature can simply be computed with help of the integral image II(x, y). Each pixel in the integral image represents the sum of all pixels in the original image contained in the rectangle formed by the pixel itself as the lower-right corner and the image's origin in the upper left corner at pixel position (1, 1):

\[ II(x',y') = \sum_{x \leq x',\ y \leq y'} I(x,y). \qquad (2.61) \]

The integral image II(x, y) can be computed in one pass over the image with the following recurrences

\[ tmp(x,y) = tmp(x,y-1) + I(x,y), \qquad II(x,y) = II(x-1,y) + tmp(x,y), \qquad (2.62) \]

where tmp(x, y) is a temporary image storage. This image representation is favorable, as the sum of a rectangular area can be calculated by only four sum operations, no matter which size the rectangular area has. Thus a two-rectangle feature can be calculated by nine additions: four additions for each rectangle and one for the subsequent subtraction. This


means that the number of operations, equivalent to the computational cost, is constant for features of all sizes, which is a desirable property under real-time considerations.
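A short sketch of the integral image and of a two-rectangle feature evaluated with four lookups per rectangle is given below; it uses zero-based indexing and illustrative function names rather than the exact implementation used in this work.

```python
import numpy as np

def integral_image(img):
    """Integral image II(x, y) as in eq. (2.61): sum over the rectangle spanned by
    the origin and (x, y), computed in one pass via cumulative sums."""
    return img.astype(np.float64).cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, x, y, w, h):
    """Sum of the pixels in the w x h rectangle with upper-left corner (x, y),
    using only four lookups on the integral image (0-based indices)."""
    a = ii[y + h - 1, x + w - 1]
    b = ii[y - 1, x + w - 1] if y > 0 else 0.0
    c = ii[y + h - 1, x - 1] if x > 0 else 0.0
    d = ii[y - 1, x - 1] if y > 0 and x > 0 else 0.0
    return a - b - c + d

def two_rect_vertical(ii, x, y, bs):
    """Vertical two-rectangle Haar feature of block size bs: difference between
    the left and the right half of the block."""
    half = bs // 2
    return rect_sum(ii, x, y, half, bs) - rect_sum(ii, x + half, y, half, bs)
```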

2.4.2. Component Based Representation of the Pedestrian Class

The previously described pedestrian representation has been developed to model the entire human body, and hence to find a generative model of the whole body. Due to varying pose, self-occlusions, and occlusions generated by other objects, it is hard to create a model for each possible constellation. These constraints aggravate the robust detection of objects in real world situations, as these are usually more crowded than the utilized training images. Therefore it has been proposed to split the human body into several discriminative parts and represent these by multiple models [Moh01, Mik04], which significantly raises the detection performance. The most intuitive segmentation of the human body includes arms, legs, head and torso. Consequently, independent classifiers have to be trained for each component, which can be very complex due to the various poses and perspectives appearing in 2D images. Furthermore, large amounts of annotated data are required for training and evaluation. The arising question is how one could represent the complex human body structure. Forsyth and Fleck proposed to either manually create or learn so-called body plans [For97]. Due to the sparsity of manually annotated body part data, an empirical segmentation has been conducted in two variations [Ste08, Qi08], for which the human body structure of the training examples has been studied intensively. For first trials the entire feature vector, as presented in the previous section, has been divided into three parts of equal size, forming three independent feature vectors. These were formed by a simple horizontal segmentation of the human body class. As shown in fig. 2.19 on the left, the vectors were formed by the upper third of the training image, containing head and shoulders, the middle third representing the torso, and the lower third containing the legs and feet. The feature vectors are used to train three independent support vector machines. A second, more complex representation is illustrated in fig. 2.19 on the right. It includes the body segmentation into five parts, namely the arms, the legs, and the head including the upper body. The limbs were trained independently for the left and right arm and leg respectively, which resulted in five trained classifiers. All feature vectors are normalized independently, reducing the effect of features far away from the normalized value. It is expected that this side effect leads to a better detection performance if persons appear under non-uniform illumination. The second challenge, besides the robust detection of body parts, is to find an appropriate approach to group single decisions and decide whether a person is present within a sampling window. While Wu and Nevatia compute the likelihood for the presence of multiple humans [Wu07a], in this work an approach based on heuristics is employed. After utilizing the part based classifiers, these are combined by sliding a window with a


Figure 2.19.: Both variations of part based classifiers for human detection. Left a simple model with three segments, right a more complex one with five elements

size of 64 × 128 pixels over the image at different scales and checking if at least two thirds of the body parts are detected. Allowing a few false negative detections enhances the overall detection performance, while keeping false positives at a low value. Further, this approach adds some generalizing ability, as more variability in composing the whole human from body parts is provided. Experiments have shown that a simple vote within a window is not sufficient to robustly dismiss false positives. Therefore a geometrical human model is additionally incorporated. This uses some general, obvious information, such as:

• The legs are located below head and arms.
• The arms are located above the legs and usually below the head. Further, they are located at the borders of the torso.
• The head is the topmost object.

Further, the principle of so-called head units, defining geometrical distances of body parts based on the head size, is employed to estimate the most plausible positions of all body parts. All these observations are used to build a geometrical human representation. Although the training data has been aligned quite strictly, meaning constant position and proportion of the body parts, it seems reasonable to allow some additional variation of the human appearance. Therefore each body part's position is allowed to vary in a predefined region around the originally computed mean position. Additionally, the scale factor is allowed to differ by a factor of 0.2.
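A toy version of such a heuristic combination step is sketched below; the part names, the two-thirds vote and the ordering checks are illustrative simplifications of the rules described above, not the exact model used in the thesis.

```python
def plausible_person(parts):
    """Toy check of the geometric rules of sec. 2.4.2 on part hypotheses.

    parts: dict mapping part names ('head', 'left_arm', 'right_arm',
    'left_leg', 'right_leg') to bounding boxes (x, y, w, h), or None if the
    corresponding part detector did not fire.
    """
    found = {k: v for k, v in parts.items() if v is not None}
    if len(found) < 2 * len(parts) // 3:           # at least two thirds detected
        return False

    def cy(box):                                   # vertical center of a box
        return box[1] + box[3] / 2.0

    head = found.get('head')
    legs = [found[k] for k in ('left_leg', 'right_leg') if k in found]
    arms = [found[k] for k in ('left_arm', 'right_arm') if k in found]
    if head and legs and any(cy(leg) <= cy(head) for leg in legs):
        return False                               # legs must lie below the head
    if arms and legs and any(cy(arm) >= cy(leg) for arm in arms for leg in legs):
        return False                               # arms must lie above the legs
    return True
```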

2.4.3. Feature Selection with AdaBoost

One could of course extract and use all features presented in sec. 2.4.1 in order to create a model for the detection task. With an overlap of 75% and the feature sizes of eight, 16, and 32 pixels, a total of 15 547 values will be extracted and hence occupy 485.8 kbit for each training image or extracted window during the detection process. This results in a very time consuming training phase and very large support vectors in case classification


is performed with SVMs [Pap00]. Therefore it seems indispensable to drastically reduce the number of features and select the ones with the most discriminative abilities. This way the processing speed can be raised drastically while keeping accuracy at a high level or even increasing it. A possibility would be the use of a Principal Component Analysis (PCA) and to subsequently ignore feature dimensions with low variance. The problem with this approach is that features are selected only with respect to their individual characteristics, without taking the relevance of a single feature in combination with others into account. Freund and Schapire presented the so-called AdaBoost algorithm [Fre95] in order to create an ensemble of features. The basic idea is to combine so-called weak classifiers h_i(x), which perform slightly better than chance, into a new, more complex and stronger classifier h(x). The assumption is simply that a weighted vote should perform far better than chance. Initial weights are assigned to each used example according to its class and the number of examples in each class. The complete procedure is summarized in Algorithm 1.

Algorithm 1 The AdaBoost algorithm
  w_{1,l} ← 1/(2 l_+) for positive examples, w_{1,l} ← 1/(2 l_−) for negative examples
  for r = 1 to R do
    w_{r,l} ← w_{r,l} / Σ_j w_{r,j}
    for j = 1 to n do
      determine the threshold θ_weak for the possible weak classifier h_j
      calculate the classification error ε_j = Σ_l w_{r,l} |h_j(x_{l,j}) − y_l|
    end for
    choose the weak classifier h_j → h_r with the smallest weighted error ε_j → ε_r
    β_r ← ε_r / (1 − ε_r)
    e_l ← 0 if x_l is correctly classified by h_r, else e_l ← 1
    w_{r+1,l} ← w_{r,l} β_r^{1−e_l}
  end for

For each feature x_i of the training set a threshold value and its parity are determined, such that the lowest classification error on the weighted training set L is obtained. Thereby the labels y_k are estimated using just the weak classifier on feature x_i, and the weighted error is computed. The feature with the smallest weighted error is subsequently selected, and the current example weights are updated according to the actual misclassifications performed by the weak classifier. Incorrectly classified examples hence will be weighted higher than correctly classified ones. Therefore in the next


Figure 2.20.: Some exemplary positive chosen examples containing humans and negative examples

iteration a feature which mainly focuses on the previously misclassified examples will be chosen. This procedure avoids the selection of features with high discriminative abilities that make errors on the same examples. A weaker classifier is therefore favored if it increases the overall performance. The classical AdaBoost algorithm is also able to perform a classification with high accuracy at reasonable computational effort. Nevertheless, Bartlett et al. [Bar05] have shown that AdaSVMs, meaning an initial feature selection with AdaBoost and a subsequent classification with an SVM, outperform both single classifiers. This can be explained by the excellent discriminative abilities of the SVM in combination with the highly reduced feature size.
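The following sketch outlines AdaBoost-based feature selection with simple threshold stumps, roughly following Algorithm 1; the coarse threshold grid and the data layout are simplifying assumptions, not the exact procedure used here.

```python
import numpy as np

def adaboost_select(X, y, n_rounds):
    """Minimal AdaBoost feature selection with threshold stumps (cf. Algorithm 1).

    X: (n_samples, n_features) Haar feature matrix, y: labels in {0, 1}.
    Returns the indices of the selected features.
    """
    n, m = X.shape
    w = np.where(y == 1, 1.0 / (2 * (y == 1).sum()), 1.0 / (2 * (y == 0).sum()))
    chosen = []
    for _ in range(n_rounds):
        w = w / w.sum()                              # normalize the weights
        best = (None, None, None, np.inf)            # (feature, threshold, parity, error)
        for j in range(m):
            for thr in np.percentile(X[:, j], [10, 25, 50, 75, 90]):
                for parity in (1, -1):
                    pred = (parity * X[:, j] < parity * thr).astype(int)
                    err = np.sum(w * np.abs(pred - y))
                    if err < best[3]:
                        best = (j, thr, parity, err)
        j, thr, parity, err = best
        chosen.append(j)
        beta = err / max(1.0 - err, 1e-12)           # beta_r = eps_r / (1 - eps_r)
        pred = (parity * X[:, j] < parity * thr).astype(int)
        w = w * beta ** (1 - np.abs(pred - y))       # down-weight correct examples
    return chosen
```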

2.4.4. Training Procedure

In order to create a meaningful model and compute valuable support vectors, a large database containing both positive and negative examples is required. Therefore the Pedestrian Finder Database (PFIND) [Pel05], containing 4 108 person images with 64 × 128 resolution, has been used. By mirroring the images the number of person images is doubled. Additionally, 51 261 negative examples have been created by random choice from urban images without persons. Some example images, both positive and negative, contained in the PFIND image pool are illustrated in fig. 2.20. 1 000 positive and negative examples are randomly picked out to create an unseen hold-out set, which can be used to test each stage of the iterative training procedure. Papageorgiou et al. [Moh01] proposed to train an SVM with a huge amount of positive and negative examples. Unfortunately it is hard to handle the huge amount of available data due to the computational effort. On the one hand training becomes very complex, and on the other hand the models created by the LibSVM toolbox [Cha01] become very large, which results in a very slow detection process, where loading the models into memory is the most time consuming part. As only very few support


Algorithm 2 Iterative Training
  P_train is the training set with n_{+,train} positive examples x_+ and n_{−,train} negative examples x_−
  P_out is a hold-out set with n_{+,out} positive examples x_+ and n_{−,out} unseen negative examples x_−
  P_0 is the initial small training set with n_{+,0} positive and n_{−,0} negative examples, with P_0 ⊂ P_train
  E ← number of errors allowed (e.g. 100)
  r ← 0
  repeat
    train classifier h_r(x) with P_r
    control progress with h_r(x_i) by classifying samples from P_out
    P_tmp ← empty temporary dataset
    repeat
      classify examples x_i randomly chosen from P_train with h_r(x_i)
      if misclassified (or correctly classified with low probability) then
        add x_i to P_tmp
      end if
    until E errors are reached
    P_{r+1} ← P_r + P_tmp
    r ← r + 1
  until abort criterion is reached (e.g. number of iterations, max. amount of examples, error rate)
  train the final classifier h_R(x) with P_r

vectors are required to approximate the most favorable hyperplane, it seems reasonable to use only the most relevant examples. A sample's relevance can hardly be determined a priori and consequently must be determined online. Therefore the SVM is initialized with 100 positive and 100 negative examples in the first place. The resulting classifier is then applied to randomly chosen data from the remaining image pool. After 100 of the examples have been misclassified, these are added to the training data and a new model is created by retraining the SVM. This procedure is repeated until a predefined number of iterations is reached, no training material is added, or the hold-out set is correctly classified. After each iteration the amount of training data grows and performance rises with the number of examples. The evaluation will show that this procedure raises the performance of the classifier (in comparison with all-at-once training) by taking into account a higher number of examples and therefore a more general mathematical model.
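A condensed sketch of the iterative training loop of Algorithm 2 is given below, using scikit-learn's SVC as a stand-in for the LibSVM models; the parameter values and the abort criterion are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

def iterative_training(X_pool, y_pool, X_holdout, y_holdout,
                       init_per_class=100, errors_per_round=100, max_rounds=20):
    """Sketch of the bootstrapped training loop of sec. 2.4.4 (cf. Algorithm 2)."""
    rng = np.random.default_rng(0)
    pos = np.flatnonzero(y_pool == 1)[:init_per_class]
    neg = np.flatnonzero(y_pool == 0)[:init_per_class]
    train_idx = set(np.concatenate([pos, neg]))
    for _ in range(max_rounds):
        idx = np.fromiter(train_idx, dtype=int)
        clf = SVC(kernel='rbf').fit(X_pool[idx], y_pool[idx])
        if clf.score(X_holdout, y_holdout) > 0.99:    # abort criterion on hold-out set
            break
        # collect misclassified examples from the remaining pool
        new, candidates = [], rng.permutation(len(y_pool))
        for i in candidates:
            if i in train_idx:
                continue
            if clf.predict(X_pool[i:i + 1])[0] != y_pool[i]:
                new.append(i)
            if len(new) >= errors_per_round:
                break
        if not new:                                   # nothing left to add
            break
        train_idx.update(new)
    return clf
```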

2.4.5. Pedestrian Detection Results

In order to evaluate the chosen approach, two data sets with varying difficulty have been chosen. The easier one, the PFIND database [Pel05], has been recorded at the Institute for Human Machine Communication. The validation set contains 50 images recorded


Filter       1,2,3   1,2,3   1,2,3   all     all     all     all     all
Parts        1       1       1       1       1       1       3       5
Iterative    no      yes     yes     yes     yes     yes     yes     yes
Boost        no      no      yes     yes     yes     yes     yes     yes
Features     1326    1326    6633    6634    3094    15477   15477   15477
Classifier   SVM     SVM     Ada     SVM     SVM     SVM     SVM     SVM
DR [%]       81.75   82.34   78.54   80.75   82.15   84.13   86.31   85.71
FPPW [%]     0.060   0.031   0.027   0.023   0.029   0.037   0.021   0.053

Table 2.2.: Evaluation of different training strategies for person detection based on Haar Basis Features.

outdoors, where a total of 75 individuals have been filmed. Though the data has been valuable for training purposes, it has not been used for evaluation, as a detection rate of 100% at 10⁻⁴ FPPW has already been achieved with the first configuration. Therefore a second dataset with 209 images, recorded by INRIA3 [Dal05], has been used. This database is considered far more difficult due to the presence of multiple persons and partial occlusions in crowded scenarios. Fig. 2.21 illustrates some images from the INRIA data set including the detected humans. For all trials a dense search with a sliding window of 64 × 128 pixel size has been performed. The shift in x and y direction has been set to two pixels, while the scale has been changed by a factor of 1.2. Both databases have been examined for the detection rate

\[ dr = \frac{\text{detected humans}}{\text{humans in database}} \qquad (2.63) \]

and the false positives per window

\[ FPPW = \frac{\text{false positives}}{\text{checked windows}}. \qquad (2.64) \]

Unfortunately it is hard to compare the results with existing approaches, as the evaluation procedure is not standardized, i.e. which requirements hypotheses have to fulfill in order to be counted as correctly classified. Within this work hypotheses are considered to be correct in case their center is located within the ground truth center area with a tolerance of 1/8 of the model's size in horizontal direction and 1/16 of the model size in vertical direction. The model size is allowed to differ between 0.5 and 2 times the ground truth vertical size. These margins are set quite strictly for the conducted experiments. Obviously less strict bounds would result in higher detection rates and lower FPPW.

Some chosen results are given in tab. 2.2 for various parameter setups. As seen, the original approach, based on the classification of Haar features with SVMs [Moh01], which has 3

Institut National De Recherche En Informatique Et En Automatique - INRIA Rhone-Alps, Montbonnot,France


Figure 2.21.: Exemplary human detection results with data from the INRIA data base without further processing of the hypothesis.

been re-implemented, is easily outperformed by a reduced feature set and the iterative training approach. It is remarkable that the combination of AdaBoost and SVM tends to create slightly better results than using AdaBoost as a classifier. Further improvements have been achieved by training individual body parts rather than the entire body. Both approaches, horizontal splitting and the more complex modeling with five parts, performed significantly better than the holistic approaches. The presented approach achieved better results than the initial Haar feature based approach suggested in [Moh01]. Slightly better results have been reported by Dalal and Triggs [Dal05], who used histograms of oriented gradients and achieved an 89% detection rate at 10⁻⁴ FPPW. These results have been confirmed in [Yeh06], where a detector cascade was used.

2.5. Face Detection

The detection of humans in real world scenarios, as described in sec. 2.4, is sufficient for some basic surveillance tasks. For a more sophisticated analysis of human behavior the body has to be analyzed further. One of the most important cues is the face, which is able to express a wide range of emotions. These are commonly agreed to be a good indicator for a person's intentions and feelings [Pan06]. Further, the detection of humans is often not sufficient to recognize re-entering persons, as these can change their appearance drastically, for instance by dressing differently. The face is usually a far more robust pattern and hence is used for identification and authentication tasks [Zha03]. Therefore it is inevitable to robustly localize faces in the image. According to Yang et al. [Yan02] the main challenges of the face detection task are: varying pose, presence or absence of structural components, such as beards or glasses, facial expression, partial occlusions, lighting conditions, and the image orientation. From a wide range of possible


Figure 2.22.: Exemplary faces for training of face detectors.

approaches [Yan02, Zha03], the following two have been chosen and further investigated. The first one is based on a trained MLP [Row98], while the second one relies on a boosted set of Haar-like features [Vio01], which have already been used for human detection. In order to restrict the detection problem and simplify post processing, only upright frontal faces are used in this work. A large amount of frontal faces is provided by the FERET database4 [Phi00], which has been used to train both systems. As the images show faces in different sizes and with different parts of the upper body, the faces had to be cropped and rescaled in a standardized way. It has been decided to use a quadratic region that only includes the face itself. The forehead, ears and neck are not considered as facial region, as these show a large variety of different appearances [Wal06a]. The region of interest has been defined by the anchor points illustrated in fig. 2.22, which represent the centers of the eyes and mouth. These have been chosen according to the MPEG-4 standard [Ost98]. Based on the desired quadratic form, the position of the anchor points can be set without respecting the desired block size bs. The center of the mouth is set to the horizontal image center and its distance to the lower border is set to 1/5 bs of the vertical image size. Likewise both eye centers are located at a distance of 1/5 bs from the upper left and right corner. Some additional robustness can be gained by loosening the above mentioned restrictions and changing the parameters slightly. The previously upright faces are rotated randomly with a maximum angle of 15° in both directions. Further, the scale factor is varied by approx. 10%, and the anchor points are shifted up to two pixels in any direction. This procedure is performed with the original training material, in order to create a larger and more representative training set.

2.5.1. Face Detection with Multi Layer Perceptrons

Rowley et al. presented a system for face detection which is entirely based on a Multi Layer Perceptron (MLP) [Row98]. The network parameters are estimated in an iterative

4 FERET: Facial Recognition Technology Database


Figure 2.23.: The face detection process with the included preprocessing steps. Each sub-window is evaluated by 26 receptive fields.

training process, where common patterns in the given training material are determined. The actual face detection procedure is illustrated in fig. 2.23, and starts with a pyramidal scale-space sampling of the image with a window size of 20 × 20 pixels. All extracted sub-windows are subsequently preprocessed, where usually an illumination correction and a histogram equalization are performed. These windows are then presented to a neural network [Jor96] with multiple hidden layers. Thereby the image is split into multiple parts and presented to the receptive fields. This procedure intends to model the retinal structure of the human eye. As seen in fig. 2.23 the sub-window is split into four quadratical regions with 10 × 10 pixels size, 16 regions with 5 × 5 pixels size and six overlapping horizontal stripes with 20 × 5 pixels. All outputs of the 26 receptive fields are subsequently propagated to a hidden layer with nine cells. Both the inputs and the outputs of all sub-networks are fully connected. The network’s output is trained to create a vector y = [1, −1]T for positive samples and the inverted vector for negative examples. Training two output elements enhanced the detection performance drastically. Faces have been extracted from the FERET database and have been used as positive examples in order to train the seystem. Unfortunately it is hard to choose a meaningful set of negative examples. Therefore a bootstrapping algorithm has been proposed [Sun98]. The basic idea is similar to the approach presented in sec. 2.4.4. After an initial training of the NN, the resulting system is applied on a set of images without faces and 1000 false positives are chosen to retrain the system. This procedure is repeated until a sufficient detection rate is reached or no additional training material is left.

2.5.2. Face Detection With a Boosted Detector Cascade

Viola and Jones [Vio01] proposed an extension to the object detection system based on Haar-like basis features, which has already been used for pedestrian detection. Papageorgiou


Figure 2.24.: Cascade of classifiers with n stages

and Poggio [Pap00] initially suggested to extract Haar features with a constant overlap and only three sizes, which already creates a huge set of features. This has been further increased by extracting every possible feature size and position within a sampling window of 24 × 24 pixels size. Therefore the feature size basically varies from 1 × 2 to 24 × 24. Not only quadratic but generally rectangular features, with m × n pixel size in width and height, are allowed in favor of a more generalizing feature set. Any possible position in the sampling window, beginning at (1, 1) up to (24 − m, 24 − n), is used to extract features. With an initial set of four Haar feature types approximately 45 396 values can be extracted, which is by far larger than the sum of pixels in the image. Lienhart [Lie02] extended the system to 14 features, resulting in 117 941 values. In contrast to the initial approach the features’ signs are considered as important, as it is used to describe regional differences. As seen in fig. 2.25 the first feature is chosen to describe the intensity difference between the dark eye brow region and the lighter eye lid. In its original form the AdaBoost algorithm [Fre95] is used to enhance classification performance by combining weak classifiers to a stronger one. In common training setups detection performance is improved by adding more features to the classifier, which consequently increases processing time. Therefore a so called attentional cascade has been proposed in [Vio01], which should speed up the detection process dramatically. The key insight is that an overwhelming majority of the sub-windows within any image is negative. Therefore a cascade, as illustrated in fig. 2.24, is designed to reject a large amount of negative sub-windows with a small, yet effective, set of boosted classifiers, which should be able to detect all faces. In the first stages simple classifiers are applied to reject as many false windows as possible and to consequently gain processing speed. More complex classifiers, meaning the extraction of larger feature sets, are applied in the later stages to keep false positive rate as low as possible. Each cascade is trained using the AdaBoost algorithm, where an abort criterion is the minimum amount of eliminated false positives and the minimum detection rate per cascade. Fig. 2.25 illustrates both features determined by the first detector cascade. Nevertheless the detection performance of a two- or three-feature classifier is far from acceptable as a detection system, yet the cascade of classifiers should increase the overall classification performance. Given a trained cascade of classifiers the false positive rate



Figure 2.25.: Illustration of the first two chosen classifiers. Both focus on the eye region.

Given a trained cascade of classifiers, the false positive rate FPPW of the entire classifier is

FPPW = \prod_{i=1}^{n} FPPW_i ,    (2.65)

where n is the number of cascade stages and FPPW_i is the false positive rate of the i-th stage. The overall detection rate dr of the cascaded classifier is

dr = \prod_{i=1}^{n} dr_i ,    (2.66)

with dr_i as the detection rate of the i-th stage. Given concrete abort criteria for the training of the individual stages, the detector's performance can be easily estimated. In case the detection rate of each stage is set to a minimum of 99% and the false positive rate to a maximum of 50%, the overall performance of n = 10 stages is dr = 0.99^{10} ≈ 0.90 and the false positive rate is FPPW = 0.5^{10} ≈ 0.00098. The number of features extracted during the detection process is not exactly predictable, as it depends on the image structure and the faceness of a sub-window. Nevertheless the amount of required operations is highly limited, as each window only passes through the cascade until it is rejected and only rarely passes all stages.
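The cascade logic and the rates of eqs. (2.65) and (2.66) can be summarized in a brief sketch; the per-stage score functions and thresholds are placeholders, not the trained detector itself.

```python
# Sketch of the attentional cascade of fig. 2.24 and the rates of eqs. (2.65)-(2.66).
# The stage score functions and thresholds are placeholders for trained boosted stages.

def cascade_rates(stage_detection_rates, stage_fppw):
    """Overall detection rate dr and false positive rate FPPW of the full cascade."""
    dr, fppw = 1.0, 1.0
    for dr_i, fp_i in zip(stage_detection_rates, stage_fppw):
        dr *= dr_i
        fppw *= fp_i
    return dr, fppw

def classify_subwindow(window, stages):
    """stages: list of (score_fn, threshold) pairs; reject as early as possible."""
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False           # rejected by an early, cheap stage
    return True                    # survived all stages -> face hypothesis

# ten stages with 99% detection and 50% false positives each:
print(cascade_rates([0.99] * 10, [0.5] * 10))    # approximately (0.90, 0.00098)
```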

2.5.3. Face Detection Evaluation and Post Processing

Before the implemented detector systems can be applied and extended to view invariant systems, it is necessary to test the performance of both systems. Due to the increased processing power of current hardware, the required processing time is not a limiting factor anymore: the MLP based system can process up to ten fps, while the integral image based system processes 12 fps on a current state of the art Central Processing Unit (CPU) with 2.4 GHz. Therefore the systems have only been evaluated for detection and false positive rates, which are subsequently compared to the values reported in the literature.



Figure 2.26.: Two exemplary images from the CMU data set with hypotheses created by the NN based face detector.

Evaluation has been performed with the "upright set" of the CMU database5 [Row97]. While the integral image based approach reaches a 89.0% detection rate at a 0.03% false positive rate, the neural network performs slightly better with a 90.3% detection rate and 0.025% false positives. An extension to face detection in omnidirectional views has further been implemented in [Wal04]. Fig. 2.26 shows some hypotheses computed with the MLP. As can be seen, each face location is indicated by multiple detections and some false positives are still visible. For further processing it is advisable to merge multiple hypotheses and remove possible face locations which are only detected once. In order to merge detections, all hypotheses are compared with each other regarding position and size:

• In a first step all neighbors of a hypothesis are determined. A hypothesis counts as a neighbor in case its size differs by a maximum of ±40% and the distance of the hypotheses' centers is smaller than a quarter of the larger bounding box.

• Hypotheses with more than two neighbors are grouped into one. Instead of a simple mean of the center positions, a weighted mean is computed, where the weights are the probabilities created by the detector. By this means more probable locations are favored.

• Hypotheses without neighbors and all already accumulated hypotheses are removed.

With these assumptions most of the false positives can be robustly removed. Further, each face is extracted only once, as illustrated in fig. 2.27, which will enhance the processing speed of subsequent processing steps.

5 Database created at the Carnegie Mellon University.
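A compact sketch of the merging heuristic is given below; a hypothesis is assumed to be a (center x, center y, box size, detector probability) tuple, and this data layout is an illustrative assumption.

```python
# Sketch of the merging heuristic described above. A hypothesis is assumed to be
# (cx, cy, size, prob): bounding box center, box size and detector probability.
import numpy as np

def merge_hypotheses(hyps):
    hyps = np.asarray(hyps, dtype=float)
    n = len(hyps)
    merged, used = [], np.zeros(n, dtype=bool)
    for i in range(n):
        cx, cy, s, _ = hyps[i]
        neighbors = []
        for j in range(n):
            cj_x, cj_y, sj, _ = hyps[j]
            size_ok = abs(sj - s) <= 0.4 * s                      # size within +/- 40%
            dist_ok = np.hypot(cj_x - cx, cj_y - cy) < max(s, sj) / 4.0
            if i != j and size_ok and dist_ok:
                neighbors.append(j)
        if len(neighbors) > 2 and not used[i]:                    # more than two neighbors
            idx = [i] + neighbors
            used[idx] = True
            weights = hyps[idx, 3]                                # detector probabilities
            merged.append(np.average(hyps[idx, :3], axis=0, weights=weights))
    return merged                                                 # isolated hypotheses are dropped
```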



Figure 2.27.: Exemplary face detection result before and after merger of relevant hypotheses.

2.6. Facial Feature Extraction by Matching Elastic Bunch Graphs

Having detected a face with the previously presented methods, a further analysis can be performed. This is required for a wide variety of computer vision tasks, including both face recognition [Tur91] and facial action recognition [Pan00], which are both considered as security relevant application scenarios. While face recognition is commonly only used for access control and the identification of persons, facial activities are a reliable indicator for feelings and behaviors of individuals [Meh68]. Therefore it is required to reliably extract meaningful facial features. Most approaches rely on the detection of fiducial points6 that can be used to describe so called Facial Action Units (FAU), as defined in the Facial Action Coding System (FACS) [Ekm78]. The FACS has been constructed by analyzing facial motions that are usually created by muscular activity, such as inner brow raising, upper lid raising, or mouth corner lowering. Various concepts for FAU detection have been implemented in the past [Zen09], where most approaches relied on the detection of Facial Definition Parameters (FDP), which are based on points defined in the MPEG-4 standard [Pan03]. Among all alternative approaches Elastic Bunch Graph Matching (EBGM) has been chosen due to its additional applicability to face recognition [Wis97a]. The basic idea of a graph representation is to describe an object by nodes and the distances between nodes, which represent the graph's edges. Thereby the nodes are modeled by Gabor jets, which are commonly used for image representation.

2.6.1. Gabor Wavelets and Jets

In order to detect relevant points in the face, the image structure needs to be analyzed. The orientation of edges and the frequency of intensity changes are probably the most popular structural elements. One of many approaches for image representation is based on so called Gabor wavelets. These are biologically motivated convolution kernels in the shape of plane waves, which are restricted by a Gaussian envelope [Dau88]. Gabor wavelets can be used for image representation by varying a set of parameters, which results in the detection of different structures in the image.

6 Distinctive points in the face, e.g. eyebrows, eye corners, or nostrils.



Figure 2.28.: An ensemble of odd (a) and even (b) Gabor filters [Lee96]

A set of convolution coefficients for kernels with different orientations and frequencies, computed at the same image location, is referred to as a jet J_j(\vec{x}). It describes a small region of gray values around the pixel location \vec{x} = (x, y) and is based on a wavelet transform, which is defined as a convolution

J_j(\vec{x}) = \int I(\vec{x}\,') \Psi_j(\vec{x} - \vec{x}\,')\, d^2\vec{x}\,' .    (2.67)

Hereby \Psi_j(\vec{x}) represents a family of Gabor kernels

\Psi_j(\vec{x}) = \frac{k_j^2}{\sigma^2} \exp\left(-\frac{k_j^2 \vec{x}^2}{2\sigma^2}\right) \left[\exp(i\vec{k}_j\vec{x}) - \exp\left(-\frac{\sigma^2}{2}\right)\right]    (2.68)

in the shape of plane waves with the wave vector \vec{k}_j, restricted by a Gaussian envelope function [Lee96]. Commonly a set of five different frequencies \chi = 0, \ldots, 4 and eight orientations \upsilon = 0, \ldots, 7,

\vec{k}_j = \begin{pmatrix} k_{jx} \\ k_{jy} \end{pmatrix} = \begin{pmatrix} k_\chi \cos\varphi_\upsilon \\ k_\chi \sin\varphi_\upsilon \end{pmatrix}, \qquad k_\chi = 2^{-\frac{\chi+2}{2}}\pi, \qquad \varphi_\upsilon = \upsilon\frac{\pi}{8},    (2.69)

with index j = \upsilon + 8\chi [Wis97b], is considered as sufficient for a detailed image representation. The parameter \sigma, which is usually set to 2\pi, thereby controls the width \sigma/k of the Gaussian envelope. An ensemble of odd and even Gabor filters is illustrated in fig. 2.28, which also shows that all kernel functions can be generated from one mother wavelet by simple rotation and scaling operations. Each of these features has different discriminative abilities, as shown in fig. 2.29. Depending on the filter's orientation, edges in a predefined direction are stressed in the filtered image. These should include the most relevant features for faces. A single set of Gabor coefficients unfortunately has only limited discriminative abilities.



Figure 2.29.: An image after convolution with four exemplarily chosen filters. As can be seen, the main orientation is responsible for the detected edges.

Therefore it is reasonable to describe an image point with multiple filter responses and sum these up to a so called jet J_j(\vec{x}). Commonly 40 coefficients, with five frequencies and eight orientations, are obtained from one image point and are stored as a vector. A jet can be described by

J_j = a_j \exp(i\phi_j),    (2.70)

with magnitudes a_j, which are slowly changing for small displacements, and phases \phi_j, which rotate at a rate given by the wave vector \vec{k}_j. Fig. 2.30 illustrates the creation of a jet; only a limited amount of Gabor filters is displayed for visualization purposes. The use of Gabor wavelets is favourable as these are robust against lighting changes in the image, since being DC-free. By normalizing the jets some robustness against varying contrast can be achieved. While the phase changes drastically with translation, the amplitude is quite robust against a limited amount of rotation and distortion. This can be explained by the limited localization in space and frequency. Due to the phase's high variance in image points located only few pixels apart, it seems reasonable to ignore the phase [Lad93] and compute the similarity S_a(J, J') of two jets based on the amplitude with

S_a(J, J') = \frac{\sum_j a_j a'_j}{\sqrt{\sum_j a_j^2 \sum_j a'^2_j}}.    (2.71)

This way small displacements with high influence on the phase can be ignored. Nevertheless the phase should not be neglected, as it is the only possibility to discriminate between image regions with similar amplitudes. These are usually very similar for nearby pixel locations, while the phase might drastically change with a little displacement. The phase-sensitive similarity function S_\phi(J, J') is able to compensate the phase shift with the term \vec{d}\vec{k}_j,

S_\phi(J, J') = \frac{\sum_j a_j a'_j \cos(\phi_j - \phi'_j - \vec{d}\vec{k}_j)}{\sqrt{\sum_j a_j^2 \sum_j a'^2_j}},    (2.72)

which is valid for object locations with small displacements \vec{d}. To compute S_\phi(J, J') the displacement \vec{d} has to be determined in the first place.



Figure 2.30.: 40 Gabor coefficients are computed for an image point and subsequently summed up to a jet.

This can be done by maximizing S_\phi(J, J') in its Taylor expansion [Fle90] with

S_\phi(J, J') \approx \frac{\sum_j a_j a'_j \left[1 - 0.5(\phi_j - \phi'_j - \vec{d}\vec{k}_j)^2\right]}{\sqrt{\sum_j a_j^2 \sum_j a'^2_j}}.    (2.73)

Assuming that \frac{\delta}{\delta d_x} S_\phi = \frac{\delta}{\delta d_y} S_\phi = 0 and solving for \vec{d} leads to

\vec{d}(J, J') = \begin{pmatrix} d_x \\ d_y \end{pmatrix} = \frac{1}{\Gamma_{xx}\Gamma_{yy} - \Gamma_{xy}\Gamma_{yx}} \times \begin{pmatrix} \Gamma_{yy} & -\Gamma_{yx} \\ -\Gamma_{xy} & \Gamma_{xx} \end{pmatrix} \begin{pmatrix} \Phi_x \\ \Phi_y \end{pmatrix},    (2.74)

if \Gamma_{xx}\Gamma_{yy} - \Gamma_{xy}\Gamma_{yx} \neq 0, with

\Phi_x = \sum_j a_j a'_j k_{jx} (\phi_j - \phi'_j),    (2.75)

\Gamma_{xy} = \sum_j a_j a'_j k_{jx} k_{jy},    (2.76)

and \Phi_y, \Gamma_{xx}, \Gamma_{yx}, \Gamma_{yy} defined correspondingly. This way it is possible to determine the displacement between two jets computed at nearby locations, given that their Gabor kernels overlap. Displacements of up to half the wavelength of the highest frequency kernel can be determined with small error.
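The construction of a jet according to eqs. (2.67)–(2.69) and the amplitude similarity of eq. (2.71) can be sketched as follows; the kernel support of 33 × 33 pixels is a truncation chosen for brevity, and image points are assumed to lie sufficiently far from the border.

```python
# Sketch of a Gabor jet (eqs. 2.67-2.69): 5 frequencies x 8 orientations per pixel.
import numpy as np

SIGMA = 2.0 * np.pi       # sigma = 2*pi as stated in the text

def gabor_kernel(chi, upsilon, size=33):
    """Complex Gabor kernel of eq. (2.68) for frequency index chi and orientation upsilon."""
    k = 2.0 ** (-(chi + 2) / 2.0) * np.pi
    phi = upsilon * np.pi / 8.0
    kx, ky = k * np.cos(phi), k * np.sin(phi)
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    envelope = (k ** 2 / SIGMA ** 2) * np.exp(-k ** 2 * (x ** 2 + y ** 2) / (2 * SIGMA ** 2))
    # plane wave minus its DC part, restricted by the Gaussian envelope
    return envelope * (np.exp(1j * (kx * x + ky * y)) - np.exp(-SIGMA ** 2 / 2))

def compute_jet(image, px, py):
    """40 complex coefficients J_j at pixel (px, py), far enough from the border."""
    jet = []
    for chi in range(5):
        for upsilon in range(8):
            ker = gabor_kernel(chi, upsilon)
            h = ker.shape[0] // 2
            patch = image[py - h:py + h + 1, px - h:px + h + 1].astype(float)
            jet.append(np.sum(patch * ker[::-1, ::-1]))     # convolution as in eq. (2.67)
    return np.asarray(jet)

def similarity_amplitude(jet_a, jet_b):
    """Amplitude based jet similarity S_a of eq. (2.71)."""
    a, b = np.abs(jet_a), np.abs(jet_b)
    return np.dot(a, b) / np.sqrt(np.dot(a, a) * np.dot(b, b))
```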

2.6.2. Bunch Graphs and Bunch Similarity

In the previous section Gabor wavelets have been introduced to compute the similarity between image patches. Unfortunately their discriminative abilities are not sufficient to detect feature points by an exhaustive search over the entire image, as in every case some similar patch will be found. To prevent random matches, a threshold could be introduced as a requirement for a minimum similarity. This solution is not sufficient for the localization of fiducial points either, as usually the point with the largest similarity is detected


without further consideration of other features. Indeed the location of these points has to follow strict geometrical properties, e.g. the eyes are above the mouth, the nose is located between eyes and mouth, etc. Following these constraints, a graph representation has been chosen to incorporate the geometry of faces. A labeled image graph IG hence consists of N_{nodes} nodes, which are located on the fiducial points \vec{p}_n, and N_{edges} edges between the nodes. The nodes are described by the jets J_n, with n = 1, \ldots, N_{nodes}, and the edges are labeled with the distances

\Delta\vec{p}_e = \vec{p}_i - \vec{p}_j    (2.77)

of edges connecting two nodes i and j. The node locations and edge distances usually vary, depending both on pose and individual face, although these should represent the same fiducial point. Such a graph is able to describe an individual face and localize the features in a new image quite exactly. Admittedly this is only possible for an individual face and therefore one person. A more general representation is required to cover a wide range of possible variations of faces, such as different types of beards, differently shaped eyes and mouths, or variations due to sex, race, age, and weight. Obviously it would be way too expensive to cover all feature variations by single graphs. Therefore it has been suggested [Wis97b] to combine graphs into a stack-like representation, the so called Bunch Graph (BG), see fig. 2.31. Each created model has the same grid structure referring to the same fiducial points, which are now described by an ensemble of jets, the so called bunch. A mouth corner bunch, for instance, may include jets from open, closed, female, and male mouths, to cover all these variations. In order to fit the bunch graph to an unknown face, the best fitting jet from the bunch dedicated to the fiducial point is selected. Therefore the entire bunch graph is able to cover a far larger range of faces than model graphs are actually available. Up to now only the basic principle of face representations with graphs has been described, without any remarks about the generation of graphs. This is simply done by manually labeling images and assigning unique labels to nodes and edges. Therefore the fiducial points illustrated in fig. 2.32 have been selected. These are easy to label and should be of sufficient quality to be reliably detected. In order to match the manually created graph to an unknown face, the graph similarity has to be evaluated. It depends on the jet similarities and the distortion of the image grid relative to the bunch graph grid. For a bunch graph BG_m with model graphs m = 1, \ldots, M and the image graph IG the similarity is defined as

S_{BG}(IG, BG) = \frac{1}{N_{nodes}} \sum_{n=1}^{N_{nodes}} \max_m\left(S_\phi(J_n^{IG}, J_n^{BG_m})\right) - \frac{\lambda}{N_{edges}} \sum_{e=1}^{N_{edges}} \frac{(\Delta\vec{p}_e^{\,IG} - \Delta\vec{p}_e^{\,BG})^2}{(\Delta\vec{p}_e^{\,BG})^2},    (2.78)



Figure 2.31.: a) An exemplary image graph for the representation of faces. b) Multiple graphs can be combined into a bunch graph with bunches at the fiducial points.

where n = 1, \ldots, N_{nodes} nodes and e = 1, \ldots, N_{edges} edges are evaluated. The parameter \lambda determines the relative importance of jets and metric structure. Since the bunch graph provides multiple jets for each point, the best fitting one is selected and used for comparison; it is called the local expert for that face.
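A minimal sketch of eq. (2.78) is given below; jets are assumed to be complex coefficient vectors, and the phase-sensitive similarity S_φ is approximated by the amplitude similarity S_a for brevity.

```python
# Sketch of the bunch graph similarity of eq. (2.78).
# image_jets: list of N jets of the image graph; bunch_jets: list of N bunches,
# each a list of model jets; image_edges / bunch_edges: arrays of shape (E, 2)
# with the edge vectors Delta p_e of both graphs.
import numpy as np

def jet_similarity(jet_a, jet_b):
    # amplitude similarity S_a used here as a stand-in for the phase-sensitive S_phi
    a, b = np.abs(jet_a), np.abs(jet_b)
    return np.dot(a, b) / np.sqrt(np.dot(a, a) * np.dot(b, b))

def bunch_graph_similarity(image_jets, bunch_jets, image_edges, bunch_edges, lam=2.0):
    # node term: for every fiducial point the best fitting jet acts as "local expert"
    node_term = np.mean([max(jet_similarity(j_img, j_mod) for j_mod in bunch)
                         for j_img, bunch in zip(image_jets, bunch_jets)])
    # edge term: squared relative distortion of the image grid w.r.t. the bunch grid
    image_edges = np.asarray(image_edges, dtype=float)
    bunch_edges = np.asarray(bunch_edges, dtype=float)
    distortion = np.sum((image_edges - bunch_edges) ** 2, axis=1) \
                 / np.sum(bunch_edges ** 2, axis=1)
    return node_term - lam * np.mean(distortion)
```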

2.6.3. The Matching Procedure

In the previous section a general representation of faces with bunch graphs has been presented. It can be used to detect fiducial points in an image and to create an image graph that maximizes the similarity with the bunch graph. In order to find the optimal solution, an approach based on heuristics, which gradually reduces the graph's degrees of freedom, has been implemented [Mün08]. The procedure is conducted as follows:

1. Initialization: the position of the face is estimated either by the detection of skin color regions or by utilizing a face detector as described before. The bunch graph is simplified by computing the average magnitudes of the jets in each bunch. This averaged graph is subsequently used to evaluate its similarity on a square lattice with a spacing of four pixels, using the similarity function S_a(J, J') without phase information. After locating the best fitting location, the scanning is repeated around this position with a spacing of one pixel. This position is then used for the next step.

2. Refinement of position and size: subsequently the bunch graph is varied both in position and size to determine a better matching configuration. The bunch is once more evaluated in a 3 × 3 region around the position computed in step one. This time the bunch's size is changed at each position and the bunches are not averaged. Experiments have shown that the scale factors 1.2 and 0.8 are sufficient in case the face has previously been correctly localized. Since the distance vectors \vec{p}_e^{\,B} are also transformed, the metric similarity can be neglected, as it does not have any effect. The best fitting jet is computed for each variation, and the best variation is kept for the next step.



Figure 2.32.: Example for precise localization of fiducial points on the left. Mismatches usually appear around the mouth, as seen on the right.

Some additional accuracy can be reached by rotating the graph in the image plane, to handle slightly rotated faces, although these are not part of the initial model.

3. Local distortion: in an almost random sequence the position of each individual node is varied to further increase the similarity to the FBG. Now the metric similarity is taken into account by setting \lambda = 2 and using the vectors \vec{p}_e^{\,B} as obtained in step 2.

The resulting image graph can subsequently be used to represent individual faces and to model facial activities.
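Step 1 of the procedure can be sketched as a coarse-to-fine scan; compute_jet and similarity_amplitude refer to the helpers sketched at the end of sec. 2.6.1, and the averaged bunch jets, node offsets, and the border margin are illustrative assumptions.

```python
# Sketch of the initialization step: scan a coarse four-pixel lattice with the
# averaged bunch graph, then refine around the best hit with one-pixel spacing.
# compute_jet / similarity_amplitude are the helpers sketched at the end of
# sec. 2.6.1; avg_jets and node_offsets (positions relative to the graph center)
# are assumed to be precomputed from the bunch graph.

def graph_score(image, center, node_offsets, avg_jets):
    score = 0.0
    for (dx, dy), ref_jet in zip(node_offsets, avg_jets):
        jet = compute_jet(image, center[0] + dx, center[1] + dy)
        score += similarity_amplitude(jet, ref_jet)
    return score / len(avg_jets)

def coarse_to_fine_scan(image, node_offsets, avg_jets, margin=40):
    h, w = image.shape
    coarse = [(x, y) for y in range(margin, h - margin, 4)
                     for x in range(margin, w - margin, 4)]
    best = max(coarse, key=lambda c: graph_score(image, c, node_offsets, avg_jets))
    fine = [(best[0] + dx, best[1] + dy) for dy in range(-3, 4) for dx in range(-3, 4)]
    return max(fine, key=lambda c: graph_score(image, c, node_offsets, avg_jets))
```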

2.6.4. Evaluation of the Fiducial Point Localization

In order to evaluate the performance of this approach, the detected fiducial points have been compared to manually labeled data and the actual displacement has been measured. This way an average deviation from the actual position could be determined. The entire SAFEE Facial Action Database (SAFEE-FAC) [Ars06], see app. A.1, contains 405 video sequences of four basic activities, namely neutral, laughing, yawning, and speaking, which should cover a wide range of muscular movements. As only the first frame of each sequence has been annotated, evaluation has been performed on still images. 75 images have been used to create the reference FBG, while the remaining 330 were used for matching. An exemplary result is displayed in fig. 2.32 on the left hand side, which shows very accurate localization. Measurements confirm this observation with an average deviation of four pixels at an average face size of 200 × 200 pixels. Errors mainly appear in the region around the mouth, see fig. 2.32 on the right, as this is the most elastic object in the image. Further, the performance highly depends on the initialization of the bunch graph. If its size or position are completely misestimated, the refinement process will only find a suboptimal position. Nevertheless this accuracy should be sufficient for further experiments.



2.7. Closure

This chapter has presented various approaches for the detection of humans and a more detailed body part representation. All utilized methods have been chosen due to their simplicity and the resulting low computational effort, as real time applicability has been demanded. Though being quite effective at a considerably low complexity, further improvements for the required object detection tasks are desired in the future.

An initial guess for possible object locations based on GMMs has been presented in sec. 2.1.2, which provided considerably robust detection results in combination with a supplementing shadow removal approach. Multiple, frequently occurring patches are used for a multi-modal background representation. Unfortunately some background regions appear only rarely and are often created by a moving object, such as a door [Mil07]. Furthermore stationary objects are not considered and cause various problems: they are slowly incorporated into the background and either fully or partially occluded by other objects, resulting in one potentially larger object. Therefore the use of so called dual backgrounds with various update speeds could be integrated for additional robustness [Por08]. The skin color detection algorithms need little to no improvement in controlled environments, as different models can be trained for each setup. In situations with changing light, such as in an aircraft with dimmed light during a night flight and full illumination during boarding, an adaptive system aligning illumination changes, similar to the Brightness Transfer Function (BTF) [Jav05], is required.

Both pedestrian and face detection have been performed with holistic models in the past, meaning that a model for the entire object has been trained. As has been shown for the human class, a further segmentation seems reasonable. Such an approach can be easily transferred to any other object, e.g. a car can be segmented into tires, bumper, windows etc. Further, all presented approaches were initially implemented for two class problems, where sub-windows have been checked on whether they contain the desired object or not. With rising complexity of the surveillance task a wider range of different objects needs to be detected. Applying multiple models in parallel is possible but time consuming. Therefore approaches trained for multiple object classes should be introduced in the future. Furthermore the training procedure has been rather laborious up to now, as exact bounding boxes had to be determined for each sample and the view had to be constant, yet allowing some variance. With novel approaches using a so called bag of features [Jia07, Laz06] and a visual dictionary [Chu07], a rotation invariant detection system can be implemented based on a training set of images containing the desired objects without bounding boxes.

The creation of face representations with bunch graphs produces reliable results for frontal faces, but usually demonstrates a weaker performance for other gazes. An additional refinement with a local feature search might help to overcome this problem [Kuo05]. Furthermore a detailed comparison of localization results with other approaches, such as Active Appearance Models [Coo99], should be conducted in the future.


Chapter 3. Object Tracking

Various object detection methods have been described in the previous chapter, which can be applied either to still images or to video sequences. A more detailed behavioral analysis requires the observation of persons over a longer time period and the maintenance of their unique ID throughout the entire sequence [Jav02]. This way it is possible to associate trajectories with individuals and even body parts, which can subsequently be analyzed for anomalies. Furthermore the tracking aspect can also be considered from a computational view, as it can be used to speed up the object detection process itself: instead of sampling every possible object position in each video frame, the search can be limited to a specific region of interest. In most common surveillance systems this task can be formulated as a correspondence problem between two subsequent frames or even an entire sequence of frames [Set87], where various approaches have been described in [Col00]. Object tracking can basically be performed on various levels that highly depend on the used initialization method, and thus either object or motion detection based approaches are used as baseline. Simple foreground segmentation is frequently used as initialization step, and the resulting labeled blobs have to be associated within an image sequence, where usually only spatial information is employed without any further analysis of the underlying object [Son06]. Such techniques usually fail in very crowded situations, where blob association is not that simple. Therefore motion information from the past, such as the last position, can be utilized to predict the current state and enhance robustness [Lu05], where the Kalman filter is usually chosen for linear prediction [Kal60]. With more complex model based detection methods it is possible to combine motion patterns and the detector model into a reliable tracker [Mck96], where the condensation algorithm has to be mentioned in particular [Isa98]. Using the underlying detector model as measurement technique frequently leads to mix-ups, as the model is created to generalize a class, e.g. faces, instead of discriminating between objects of the same class [Mac00]. Therefore it seems reasonable to analyze the object's appearance and search for features that enable a more adequate representation, which is often estimated from a set of examples. This representation must consequently be detected reliably in each image [Com03]. With advances in image feature representation, which are mostly used for object recognition [Laz03], it is possible to create a separate model for each object in the scene and track these reliably.



A candidate point is accepted as an extremum if its difference-of-Gaussian value is either larger than the maximum or smaller than the minimum of the values \Delta\nu(x_i, y_j, \sigma) of its neighbors in scale space,

\Delta\nu(x, y, \sigma) > \max \Delta\nu(x_i, y_j, \sigma) \quad \text{or} \quad \Delta\nu(x, y, \sigma) < \min \Delta\nu(x_i, y_j, \sigma);    (3.20)

in this case it is used as a base for a SIFT feature.

Accurate Keypoint Localization

Once a keypoint candidate has been located, a fit to the nearby data is performed, allowing to discard points with low contrast or points being poorly localized along an edge. Therefore Brown and Lowe [Bro02] developed a method fitting a 3D quadratic function to local sample points. These extrema are now used as keypoints for the following localization step. First the Taylor expansion of the scale space function \Delta\nu(x, y, \sigma) around the initial position of the feature is evaluated:

\Delta\nu(\vec{x}) = \Delta\nu + \frac{\delta\Delta\nu^T}{\delta\vec{x}}\vec{x} + \frac{1}{2}\vec{x}^T\frac{\delta^2\Delta\nu}{\delta\vec{x}^2}\vec{x},    (3.21)

where \vec{x} = (x, y, \sigma) is the offset from the initial extremal position. Since the derivative \frac{\delta\Delta\nu(\vec{x})}{\delta\vec{x}} is set to zero around the final position, the corresponding offset can be computed with

\hat{x} = -\frac{\delta^2\Delta\nu}{\delta\vec{x}^2}^{-1}\frac{\delta\Delta\nu}{\delta\vec{x}}.    (3.22)

In case \hat{x} is larger than 0.5 in any dimension, the extremum is assigned to the next discrete point in that direction and the computation is repeated. The final offset \hat{x} is added to the original position of the feature, resulting in the refined position for that point. Unstable extrema with low contrast can be rejected by evaluating \Delta\nu(\hat{x}) with

\Delta\nu(\hat{x}) = \Delta\nu + \frac{1}{2}\frac{\delta\Delta\nu^T}{\delta\vec{x}}\hat{x}.    (3.23)

Experience has shown that all extrema with a value |\Delta\nu(\hat{x})| smaller than 0.03 can be rejected from further processing, as the contrast can be considered as too low. Additionally, extrema along edges will be removed, as these are frequently unstable to even small amounts of noise, although the difference-of-Gaussian has a strong response there. Therefore the principal curvature can be computed with a 2 × 2 Hessian matrix

h(\Delta\nu) = \begin{pmatrix} \Delta\nu_{xx} & \Delta\nu_{xy} \\ \Delta\nu_{xy} & \Delta\nu_{yy} \end{pmatrix},    (3.24)

which is computed at the location and scale of the keypoint. In the case of a straight edge, a high curvature across the edge but a small one perpendicular to it will be detected. As the Eigenvalues of h are proportional to the principal curvatures and only their ratio is required, the expensive computation of the Eigenvalues can be avoided [Har88]. This allows the computation of the sum of the largest and smallest Eigenvalues from the trace of h with

Tr(h(\Delta\nu)) = \Delta\nu_{xx} + \Delta\nu_{yy},    (3.25)

and of their product from the determinant

Det(h(\Delta\nu)) = \Delta\nu_{xx}\Delta\nu_{yy} - (\Delta\nu_{xy})^2.    (3.26)

With this assumption it is only required to check whether the ratio of the principal curvatures is below a predefined threshold, defined by the ratio \tau between the largest magnitude Eigenvalue and the smaller one,

\frac{Tr(h(\Delta\nu))^2}{Det(h(\Delta\nu))} < \frac{(\tau + 1)^2}{\tau},    (3.27)

and subsequently eliminate keypoints with a larger curvature. Fig. 3.12 illustrates all keypoints localized in an image from the PETS2006 challenge, which are supposed to have discriminative properties. These points are obviously located at the object boundaries where edges can be observed, whereas in uniform areas no keypoints have been localized.
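The contrast and edge checks of eqs. (3.23)–(3.27) can be condensed into a short sketch; the DoG value and Hessian entries at the refined position are assumed to be given, and τ = 10 is an illustrative choice.

```python
# Sketch of the keypoint rejection tests of eqs. (3.23)-(3.27).
# dog_value: interpolated DoG response at the refined position x_hat;
# dxx, dyy, dxy: second derivatives of the DoG at that location and scale.

def keep_keypoint(dog_value, dxx, dyy, dxy, contrast_thr=0.03, tau=10.0):
    # contrast check: reject low-contrast extrema (eq. 3.23)
    if abs(dog_value) < contrast_thr:
        return False
    # edge check via the trace/determinant ratio of the Hessian (eqs. 3.24-3.27)
    trace = dxx + dyy
    det = dxx * dyy - dxy ** 2
    if det <= 0:
        return False                    # curvatures of different sign -> discard
    return trace ** 2 / det < (tau + 1) ** 2 / tau
```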



Figure 3.12.: Keypoint locations in an image from PETS2006. Most discriminative keypoints are located at object boundaries.

Orientation Assignment

In order to achieve invariance to image rotation and a consistent orientation based on local image properties, an orientation vector is assigned to each keypoint. The orientation is further used to compute a unique descriptor for each keypoint. The primary orientation is calculated from the dominant gradients around the feature on its particular scale. In a first step, the weight \omega(x, y) and orientation \upsilon(x, y) are calculated for the image pixels around the feature with

\omega(x, y) = \sqrt{(L(x+1, y) - L(x-1, y))^2 + (L(x, y+1) - L(x, y-1))^2}    (3.28)

and

\upsilon(x, y) = \tan^{-1}\left(\frac{L(x, y+1) - L(x, y-1)}{L(x+1, y) - L(x-1, y)}\right).    (3.29)

An orientation histogram is subsequently formed from the gradient orientations of sample points around the keypoint. The orientation values \upsilon(x, y) are then weighted with the gradient magnitude \omega(x, y). The dominating value of the histogram is used as the primary orientation for that particular feature. The orientation histogram has a total of 36 bins covering the 360 degrees of orientation.

The Local Image Descriptor

With the previous operations image location, scale, and orientation have been assigned to each keypoint. In order to provide discriminative properties to each feature point, a descriptor, which is as invariant as possible, has to be computed. Since the descriptor is oriented along the main orientation and within a particular level of the scale space, rotation and scale invariance are achieved. If, for example, an object was first seen in the foreground of the view C_i and then appears tilted in the FOV of camera C_j, the descriptor will nevertheless be the same. To compute the descriptor, the already known magnitudes and orientations around the feature are used.



Figure 3.13.: Two examples of SIFT descriptors with 4 × 4 histogram bins.

The coordinates x and y are, however, subject to a rotary transformation to account for the tilt of the main orientation, resulting in the transformed coordinates \hat{x} and \hat{y}:

\begin{pmatrix} \hat{x} \\ \hat{y} \end{pmatrix} = R(\phi_{main}) \begin{pmatrix} x \\ y \end{pmatrix}.    (3.30)

Then the weighted gradients of the samples around the SIFT feature are computed:

D(\hat{x}, \hat{y}) = \omega(\hat{x}, \hat{y})\,\upsilon(\hat{x}, \hat{y}).    (3.31)

To avoid the recalculation of these values, a simple interpolation between the already computed values of \omega(x, y) and \upsilon(x, y) is applied. A 16 × 16 pixel region around every chosen sample is used and split into 16 regions. For each of those, a gradient histogram of the values D(\hat{x}, \hat{y}) with 8 bins is calculated, resulting in 128 values describing the particular feature in a rotation invariant fashion. Two exemplary descriptors with 16 histogram bins are visualized in fig. 3.13.

Matching of SIFT Descriptors

SIFT features have been introduced to find corresponding objects in large databases. Therefore the descriptors are at first extracted from the database and stored in an array. In the case of tracking, the features from the previous frame I(x, y, t−1) are stored in a temporary database. Subsequently features are extracted from the new frame I(x, y, t). These are compared to the stored descriptors by computing the Euclidean distance between two features D_i(t) and D_j(t−1), where the most similar one is considered as match. In order to avoid mismatches a threshold is introduced, as otherwise a random feature would be assigned. Hence a match is detected if

\sqrt{\sum_{i=1}^{128} (D_i(t-1) - D_j(t))^2} < \theta_{SIFT}.    (3.32)

This way keypoints can be matched either in subsequent frames or even in different camera perspectives.
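The matching rule of eq. (3.32) can be sketched as a nearest neighbor search with a distance threshold; the descriptor arrays and the threshold value θ_SIFT are illustrative assumptions.

```python
# Sketch of the frame-to-frame matching rule of eq. (3.32).
# prev_desc, curr_desc: arrays of shape (N, 128) and (M, 128).
import numpy as np

def match_descriptors(prev_desc, curr_desc, theta_sift=0.6):
    matches = []
    for i, d_prev in enumerate(prev_desc):
        dists = np.linalg.norm(curr_desc - d_prev, axis=1)   # Euclidean distances
        j = int(np.argmin(dists))
        if dists[j] < theta_sift:                            # accept only close matches
            matches.append((i, j))
    return matches
```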



Figure 3.14.: Exemplary matching of descriptors within two frames. The computation on the entire image produces unnecessary descriptors, which are removed in the next steps: either by comparing the descriptor positions, the computation of descriptors in foreground regions or the removal of descriptors in the background.

As the upper example in fig. 3.14 illustrates, all keypoints within two images are matched without considering whether an object has moved or not. Nevertheless this example shows the performance of the matching procedure: most features are assigned correctly, though some false assignments can be observed. These can be removed by setting a higher threshold for the similarity measure. In order to limit the tracking to regions of interest, the search area should be restricted to foreground regions. Therefore various possibilities have been evaluated with different results. It is possible to remove most of the background by comparing the positions of matched keypoints and removing those whose position has not changed. This way only moving objects should be matched, but some false positive matches between foreground and background are still incorporated into the tracking process, as illustrated in fig. 3.14 in the second row. These can be avoided by applying foreground segmentation. Hereby it should be noted that computing keypoints only on the foreground creates far fewer keypoints, as the neighborhood is dramatically changed. Therefore it seems reasonable to compute the SIFT features on the entire image and subsequently remove descriptors located in the background, which will produce fewer matches between foreground and background. After all processing steps there are still some mismatches observable, where some are matched to a wrong position within the same object and others even between different objects. Despite these errors an ID maintenance is still possible.


Figure 3.15.: Simplified SIFT mesh consisting of four elements. The descriptors are illustrated in simplified form as 2 × 2 regions, aligned to the main orientation.

The blob with the most matches is considered as the same object. The tracking procedure is then simply conducted by counting the number of point correspondences between all present blobs: the more matches are detected between two blobs, the higher the probability that they correspond to the same object in two different frames.

3.5. Object Tracking with Deformable Feature Graphs

3.5.1. Feature Graphs

Most common tracking algorithms based on local feature patterns suffer from false ID assignments, as mismatches frequently appear. These are usually observed at object boundaries or between very similar objects; in unfavorable situations one person's foot might this way be confused with another person's head. Therefore it is not reasonable to simply match features and, e.g., compare the object IDs of matching feature points. Further improvement can be achieved by including the features' spatial arrangement [Gom03]. Hence a so called Deformable Feature Graph (DFG), or mesh, is introduced. Similar techniques have already been applied both for tracking [Tan05, Luo07] and face recognition [Kis07]. Due to their discriminative abilities, graphs should also be applicable to reassign a unique ID to a lost track. Such a mesh can be considered as an undirected graph, where the nodes are represented by features and the edges represent the spatial relationship between the nodes. The object graph O_i is thereby represented as the matrix

O_i = \begin{pmatrix} \vec{P}_1 & D_1 \\ \vdots & \vdots \\ \vec{P}_N & D_N \end{pmatrix},    (3.33)

where \vec{P}_i denotes the feature position and D_i the computed descriptor of node i.
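The graph of eq. (3.33) can be sketched as a simple container holding node positions and descriptors; the optional lifetime counters anticipate the update mechanism of sec. 3.5.3 and are an illustrative addition.

```python
# Sketch of the object graph O_i of eq. (3.33): one node per row, position plus descriptor.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class FeatureGraph:
    positions: np.ndarray                  # shape (N, 2), node positions P_i
    descriptors: np.ndarray                # shape (N, 128), SIFT descriptors D_i
    lifetimes: np.ndarray = field(default=None)   # per-node age counters (see sec. 3.5.3)

    def __post_init__(self):
        if self.lifetimes is None:
            self.lifetimes = np.zeros(len(self.positions), dtype=int)

    def edge_lengths(self):
        """Pairwise node distances, i.e. the edge labels used for the refinement."""
        diff = self.positions[:, None, :] - self.positions[None, :, :]
        return np.linalg.norm(diff, axis=-1)    # shape (N, N)
```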



3.5.2. Tracking of Feature Graphs

The basic idea of the graph tracking algorithm is to initialize a tracking graph

O_{TR} = \{(\vec{P}_{TR}^1, D_{TR,1}), \ldots, (\vec{P}_{TR}^N, D_{TR,N})\}    (3.34)

and detect it in the subsequent frame. Therefore a new image graph

O_{IM} = \{(\vec{P}_{IM}^1, D_{IM,1}), \ldots, (\vec{P}_{IM}^M, D_{IM,M})\},    (3.35)

with a possibly different number M ≠ N of descriptors D_{TR,i} and D_{IM,j}, is computed in the subsequent frame. The challenge is now to find parts of O_{TR} within O_{IM}. As the new mesh is computed either on the entire frame or on pre-segmented regions, the number of nodes might differ significantly. Nevertheless it can be expected that O_{TR} is either fully or partially included in O_{IM}. Although a non-rigid pattern matching would presumably create the best matching results, it is not applied due to the high computational effort. Instead, a spatially limited nearest neighbor matching around the expected object location is performed. Therefore the Euclidean distance

d(D_{TR,i}, D_{IM,j}) = \sqrt{(D_{TR,i} - D_{IM,j})^T (D_{TR,i} - D_{IM,j})}    (3.36)

is computed between the descriptors D_{TR,i} ∈ O_{TR} of the tracking mesh and the descriptors D_{IM,j} ∈ O_{IM} located in the image mesh. The search area is thereby limited to a predefined region d(\vec{P}_{TR}^i, \vec{P}_{IM}^j) < \theta_{max} in order to keep the computational effort low. Mismatches in image regions with low contrast are avoided by accepting a match {i, j} only if its weighted Euclidean distance is still smaller than that of the next best match k,

\omega\, d(D_{TR,i}, D_{IM,j}) < d(D_{TR,i}, D_{IM,k}) \;\rightarrow\; \{i, j\}.    (3.37)

Experience has shown more reliable results for the weighted Euclidean distance than for simple thresholding of the distance measures of descriptor vectors. All possible alignments found by the local feature comparison are hence stored in a list A{i, j}. An exemplary spatial configuration of descriptors after matching is visualized in fig. 3.16. Furthermore, invalid matches are removed by considering the features' spatial arrangement, which is stored in matrices. Therefore the graphs O_{TR} and O_{IM} have to be compared to each other. In this work two iterative approaches, a distance based and an angle based one, have been investigated as follows.

Iterative Distance Based Refinement

In order to compare both graphs, a representation of the edges based on keypoint distances will be introduced first.



Figure 3.16.: Spatial configuration of a feature graph for the person located in the bounding box in the left image. SIFT features are illustrated both in a coordinate system and in the texture image. Circles indicate matched descriptors, crosses remaining nodes.

The matching elements from A{i, j} are arranged in exactly the same order as in the list in two matrices ∆P_{TR} and ∆P_{IM} that contain all possible distances between the feature positions in the corresponding graphs. The entries ∆P_{TR}^{ij} and ∆P_{IM}^{ij} are determined with

∆P_{TR}^{ij} = d(\vec{P}_{TR}^i, \vec{P}_{TR}^j), \quad \text{with } \vec{P}_{TR}^i \text{ and } \vec{P}_{TR}^j ∈ O_{TR},    (3.38)

and

∆P_{IM}^{ij} = d(\vec{P}_{IM}^i, \vec{P}_{IM}^j), \quad \text{with } \vec{P}_{IM}^i \text{ and } \vec{P}_{IM}^j ∈ O_{IM},    (3.39)

respectively, where d(\vec{P}^i, \vec{P}^j) is the Euclidean distance between two points. In case all descriptors are matched and no distortion occurs, both graphs will be exactly the same except for a possible scaling factor s. For a mesh with three features located at \vec{P}_1, \vec{P}_2, and \vec{P}_3 the graph representations would be

∆P_{TR} = \begin{pmatrix} 0 & d(\vec{P}_{TR}^1, \vec{P}_{TR}^2) & d(\vec{P}_{TR}^1, \vec{P}_{TR}^3) \\ d(\vec{P}_{TR}^1, \vec{P}_{TR}^2) & 0 & d(\vec{P}_{TR}^2, \vec{P}_{TR}^3) \\ d(\vec{P}_{TR}^1, \vec{P}_{TR}^3) & d(\vec{P}_{TR}^2, \vec{P}_{TR}^3) & 0 \end{pmatrix},

∆P_{IM} = \begin{pmatrix} 0 & d(\vec{P}_{IM}^1, \vec{P}_{IM}^2) & d(\vec{P}_{IM}^1, \vec{P}_{IM}^3) \\ d(\vec{P}_{IM}^1, \vec{P}_{IM}^2) & 0 & d(\vec{P}_{IM}^2, \vec{P}_{IM}^3) \\ d(\vec{P}_{IM}^1, \vec{P}_{IM}^3) & d(\vec{P}_{IM}^2, \vec{P}_{IM}^3) & 0 \end{pmatrix}.    (3.40)

Given exact matches and just a scale difference s, the difference between both graphs,

∆P_{TR} − \frac{1}{s}∆P_{IM} = |∆P_{TR} − ∆P_{TR}| = 0,    (3.41)

results in an empty matrix.


Anyway, in real tracking scenarios distortion and mismatches will definitely occur, which creates a matrix with entries different from zero,

∆P_{TR} − \frac{1}{s}∆P_{IM} = \begin{pmatrix} 0 & \epsilon_1 & \epsilon_2 \\ \epsilon_1 & 0 & \epsilon_3 \\ \epsilon_2 & \epsilon_3 & 0 \end{pmatrix},    (3.42)

with \epsilon_i being distances different from zero. In case only the third point had been assigned incorrectly, \epsilon_1 is set to zero. This error now creates values unequal to zero both in the third row and the third column, as the point has been at the third position in the list. Therefore it seems reasonable to remove the point with the largest mean deviation from A{i, j}, by comparing the means of each column. This procedure is subsequently repeated until a maximum allowed error \theta_{dist} is reached. For a robust elimination of wrong assignments it is essential to estimate the correct scale factor s. The obvious solution of trying all possible values in a loop is rather inefficient and leads to frequent errors. A better solution has been achieved by an approximation of s in each iteration step by solving the following minimization problem,

\hat{s} = \underset{s}{\operatorname{argmin}} \left\| ∆P_{TR} − \frac{1}{s}∆P_{IM} \right\|,    (3.43)

and using \hat{s} in the next iteration step. However this process cannot guarantee an optimal solution, as a local minimum might be found. Besides the difficult estimation of the scale factor s and the danger of running into a local minimum, the threshold \theta_{dist} needs to be set very carefully. It has to be adjusted dynamically to different scales of O_{TR}, as the internally used coordinate systems do not necessarily correspond.

Iterative Angle Based Refinement

A more robust approach for the refinement can be achieved by including the angles between the graphs' nodes. Therefore the distance vectors between the points in each mesh are calculated with

\vec{d}_{TR}^{ij} = \begin{pmatrix} x_{TR}^{ij} \\ y_{TR}^{ij} \end{pmatrix} = \vec{P}_{TR}^i − \vec{P}_{TR}^j, \qquad \vec{d}_{IM}^{ij} = \begin{pmatrix} x_{IM}^{ij} \\ y_{IM}^{ij} \end{pmatrix} = \vec{P}_{IM}^i − \vec{P}_{IM}^j.    (3.44)

Subsequently a matrix Ξ containing the angles ξ_{ij} between the edges of O_{TR} and O_{IM} is calculated by


ξ_{ij} = (γ_1 \|\vec{d}_{TR}^{ij}\| + γ_2)\, \arccos\left(\frac{\vec{d}_{IM}^{ij\,T}\, \vec{d}_{TR}^{ij}}{\|\vec{d}_{TR}^{ij}\|\, \|\vec{d}_{IM}^{ij}\|}\right),    (3.45)

where the factors γ_1 and γ_2 allow for a more flexible matching by penalizing angular deviations to distant points. For this purpose \|\vec{d}_{TR}^{ij}\| has to be normalized. Furthermore, points with a distance lower than a threshold \theta_{shift} are assumed to have zero angular deviation. This is necessary as some points shift differently at differing scales: a point at a low scale can stay stationary, while a point on a higher scale shifts from the left to the right of the nearby low-scale point. To avoid the resulting high penalty, all angular deviations resulting from points with

d(\vec{P}_{TR}^i, \vec{P}_{IM}^j) < \theta_{shift}    (3.46)

are ignored. As in the distance-based approach, the mean deviation for every match is determined. In this case, the deviation can be referenced directly to a certain angular deviation. In contrast to the distance based approach, now the quality of a match can be judged directly and quantitatively. If there is any deviation above a preset limit, the match with the maximum deviation is deleted and the whole process repeated until a satisfying result is obtained.
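The iterative angle based refinement can be sketched as follows; γ1, γ2, θ_shift, and the stopping threshold are placeholders, short edges (rather than the literal distance of eq. (3.46)) are ignored for simplicity, and the normalization of the edge lengths is an illustrative choice.

```python
# Sketch of the iterative angle based refinement (eqs. 3.44-3.46).
# p_tr, p_im: arrays of shape (K, 2) with the positions of the K matched nodes,
# in the same order as the match list A{i, j}.
import numpy as np

def angle_deviations(p_tr, p_im, gamma1=0.5, gamma2=0.5, theta_shift=3.0):
    d_tr = p_tr[:, None, :] - p_tr[None, :, :]            # edge vectors d_TR^ij
    d_im = p_im[:, None, :] - p_im[None, :, :]            # edge vectors d_IM^ij
    n_tr = np.linalg.norm(d_tr, axis=-1)
    n_im = np.linalg.norm(d_im, axis=-1)
    cos = np.sum(d_tr * d_im, axis=-1) / np.maximum(n_tr * n_im, 1e-9)
    angle = np.arccos(np.clip(cos, -1.0, 1.0))
    weight = gamma1 * n_tr / max(n_tr.max(), 1e-9) + gamma2   # normalized edge lengths
    xi = weight * angle                                   # xi_ij of eq. (3.45)
    xi[n_tr < theta_shift] = 0.0                          # ignore very short edges
    np.fill_diagonal(xi, 0.0)
    return xi

def refine_matches(p_tr, p_im, max_mean_dev=0.35):
    """Repeatedly discard the match with the largest mean angular deviation."""
    keep = np.arange(len(p_tr))
    while len(keep) > 2:
        xi = angle_deviations(p_tr[keep], p_im[keep])
        mean_dev = xi.mean(axis=0)
        worst = int(np.argmax(mean_dev))
        if mean_dev[worst] <= max_mean_dev:
            break
        keep = np.delete(keep, worst)
    return keep                                           # indices of retained matches
```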

3.5.3. The Dynamic Feature Graph

The matching and refinement of feature graphs has been explained in the last sections, where it has been assumed that two graphs in subsequent frames can simply be compared. A graph hence describes a person or object in a convenient way. As humans, for example, are deformable objects and change their appearance continuously during the tracking process, the graph consequently also changes its form and descriptor nodes, which can disappear, reappear, or appear for the first time. While tracking a graph, these changes have to be considered to create a more reliable person representation. The importance of frequent updates of the graph representation becomes clearer if the average lifetime of SIFT features is considered. Fig. 3.17 illustrates the number of matched nodes that remain over time after being initialized in the first frame. Obviously a large decay in the detected features can be observed. Even if a higher threshold θ is set for the mesh refinement, the number of nodes decreases drastically; the suppression of adding new features hence leads to a fast loss of tracks. Therefore it seems reasonable to add newly detected features. The most obvious updating strategy is probably to track the mesh and randomly add features in the region of the graph without further analysis.


Figure 3.17.: Exemplary SIFT feature decay for two objects in a scene from the PETS2007 dataset [Leh08] (number of detected features and of mesh features over time in frames).

Experience has shown that this leads to overcrowding of nodes and especially incorporates unstable features. These usually appear for very few frames only and do not help to discriminate objects over a longer horizon. Therefore an additional graph O_{CA} with the same structure as the tracking graph is created. This third graph contains all possible new features which are not yet included into the tracking process, and is referred to as the candidate graph. While tracking is performed, nodes in the candidate graph are matched to nodes in the image graph O_{IM} using only the Euclidean distance of the descriptor vectors, without further refinement. Each time a descriptor D_i is detected, its assigned lifetime value α is incremented; it is decremented in case the descriptor is not detected in the current frame. Once a feature reaches α < 0 it is removed from the candidate mesh. The features in the tracking graph O_{TR} are treated in the same way. By combining this update mechanism with the local search and refinement, features that are not re-detected, for instance on a moving arm, are likely to become candidates again in a subsequent update step. The entire update process is visualized exemplarily for four frames in fig. 3.18. Tracking is initialized with the detected features 1, 2, and 3. In the next frame two features are lost and a new feature is detected. One of the two lost features re-appears after a while and is therefore not deleted, while the other one reaches a lifetime smaller than zero after t = t_3 and is therefore suspended. The new feature number four reaches a lifetime α_4 ≥ α_{min} and is subsequently added to the tracking graph.


Figure 3.18.: Exemplary update process of a feature graph. Lost features are deleted after a while. New features have to appear for a minimum time before being incorporated into the graph [Ars08b].

As the original feature positions are always stored relative to the graph's position at the moment of initialization, the new feature position \vec{P}_{TR}^i in the tracking graph can only be determined via the transformation of its position \vec{P}_{IM}^j in the image graph with

\vec{P}_{TR}^i = H_{TR,IM}\, \vec{P}_{IM}^j,    (3.47)

where H_{TR,IM} is the homography between the two coordinate systems. The determined position is only an approximation, as the feature positions in O_{TR} are elastic rather than fixed, and some error

\epsilon_i = H_{TR,IM}\, \vec{P}_{IM}^i − \vec{P}_{TR}^j    (3.48)

is experienced. An approximation H'_{TR,IM} for the transformation matrix can be computed by minimizing the error \epsilon with

H'_{TR,IM} = \underset{H_{TR,IM}}{\operatorname{argmin}} \left\| H_{TR,IM}\, \vec{P}_{IM}^i − \vec{P}_{TR}^j \right\|.    (3.49)

Since the tracking procedure requires stable points, H_{TR,IM} has to be determined as accurately as possible. This is only possible if there are spatially stable points present in O_{TR}; otherwise new positions are computed incorrectly and consequently destabilize the tracking graph. The presented update procedure unfortunately creates some instability, especially on the limbs and in fast moving regions. Both torso and shoulders are more rigid and therefore more stable, so they can be tracked far more reliably. Up to now it has only been investigated how to update and handle a dynamic feature graph. The main error source of the update process lies in the selection of appropriate features. It would be possible to compute features on the entire image, which however does not lead to an optimal configuration. Therefore a simple foreground segmentation, as presented in sec. 2.1.2, is applied and blobs are extracted with a subsequent connected component analysis. These are subsequently assigned to the individual feature graphs. The problem formulation is simplified by assuming that the blob covers approximately the same region as the feature graph, which reduces the assignment to determining the largest overlapping


regions for a correct assignment. However, the blobs are just a good indicator to restrict the search area for new features. Further challenges, such as a merger of blobs, make a flawless assignment of features to individuals difficult. The updates entirely rely on the limitation of the tracking region by foreground segmentation, and therefore the correct assignment of features is difficult. Hence a reliable blob tracking is performed in parallel to detect splits and mergers, which results in the following special situations (a possible handling is sketched below):

• New object appears: in case a blob is not assigned to any graph, a new one is created and a new ID is assigned.

• Occlusion of tracked objects: in case two graphs are assigned to one blob, the update procedure is suspended to avoid the merger of meshes.

• Object splits: objects usually split due to partial occlusions or a failure of the background model. The objects are subsequently tracked separately, which can lead to ambiguities in case a person has been partially occluded, as illustrated in fig. 3.19. Nevertheless the real object is usually tracked further.

• Loss of track: in case a track is lost for a couple of frames, the mesh is not tracked further and is deleted. Unfortunately, the loss can also result from rapid changes in appearance. Therefore new features have to be included faster if a large amount of features is deleted in each time step. If the track is finally lost due to a failure in feature updates, usually a new graph with a new ID is created.
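A possible handling of the listed cases is sketched below; overlap() and new_graph() are assumed helpers, and the overlap ratio and the maximum number of unsupported frames are illustrative parameters.

```python
# Sketch of the special-case handling listed above. overlap(blob, graph) and
# new_graph(blob) are assumed helpers; graphs maps object IDs to feature graphs,
# lost counts the frames each ID has spent without a supporting blob.

def update_tracking_state(blobs, graphs, lost, overlap, new_graph,
                          min_overlap=0.3, max_lost=5):
    next_id = max(graphs, default=0) + 1
    suspended, supported = set(), set()
    for blob in blobs:
        hits = [gid for gid, g in graphs.items() if overlap(blob, g) > min_overlap]
        if not hits:                               # new object appears -> new ID
            graphs[next_id], lost[next_id] = new_graph(blob), 0
            supported.add(next_id)
            next_id += 1
        elif len(hits) > 1:                        # occlusion: several graphs, one blob
            suspended.update(hits)                 # freeze their update procedure
            supported.update(hits)
        else:                                      # regular tracking or object split
            supported.add(hits[0])
    for gid in list(graphs):                       # loss of track
        lost[gid] = 0 if gid in supported else lost.get(gid, 0) + 1
        if lost[gid] > max_lost:
            del graphs[gid], lost[gid]
    return suspended                               # caller skips graph updates for these IDs
```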

3.5.4. Re-Identification of Lost Objects with Deformable Feature Graphs

Most common tracking techniques can maintain IDs while the observed object is visible. Confusions usually appear in case of a merger or if an object disappears and re-appears after some time. Re-appearances usually happen if an object is totally occluded or leaves the FOV and subsequently shows up in the same or possibly even a completely different view or scene. The re-identification of objects is frequently performed either based on color histograms [Gra07b] or on biometric features, such as the face [Tur91]. The latter is frequently disregarded, as a person's face is not necessarily visible or the gaze may differ drastically. Color histograms have shown great performance especially in tracking applications [Ziv04b], but frequently fail to re-identify a person due to changing lighting conditions [Jav05]. Therefore a more robust representation is desired. SIFT features have already been utilized to detect and recognize objects in images in the past [Ham08, Low04]. As these are used for the tracking application already, it seems reasonable to apply them to create an object model without additional computational cost.



Figure 3.19.: Exemplary update process of a feature graph. Lost features are deleted after a while. New features have to appear for a minimum time before being incorporated into the graph.

In case a person's track is lost, a model graph O_{Ref} with N nodes is stored as reference and can subsequently be compared to a newly acquired graph O_{IM} with M nodes. Similar to the previously described case, a graph similarity measure is required. While up to now two graphs have simply been matched, it is now required to detect similar feature constellations in two graphs whose scale and orientation may differ, see fig. 3.20. It can be assumed that only few descriptors will match, as large amounts have to be rejected because the limbs, clothing, and bags change rapidly and hence are removed. Throughout the recognition task only stable core regions, such as the torso, will be used. The matching procedure can be interpreted as a combinatorial problem, where the assignment of descriptors in O_{Ref} and O_{TR} has to be optimized [Ber04, Ren05]. This can be done by minimizing the cost function c(\vec{a}) of the following Quadratic Assignment Problem (QAP)

c(\vec{a}) = \vec{a}^T W_1 \vec{a} + \vec{\omega}_2^T \vec{a},    (3.50)

with an assignment vector \vec{a} ∈ [0, 1], where a one indicates an assignment of two descriptors i and j, and a zero indicates two non-matching descriptors. Further, W_1 and \vec{\omega}_2 are introduced as weights, which have to be determined in the first place. As this assignment problem is computationally rather expensive and an optimal solution often cannot be found in finite time [Sah76], an approximative solution will be utilized in the following.



Figure 3.20.: An exemplary reference graph O_{Ref} and a possibly distorted version O_{TR}. Similar descriptors are shaded in similar grey values.

Checking all possible assignments of O_{TR} and O_{Ref} is avoided by reducing the possible matches: the Euclidean distance of all descriptors,

d(D_{Ref,i}, D_{TR,j}) = \sqrt{(D_{Ref,i} − D_{TR,j})^T (D_{Ref,i} − D_{TR,j})},    (3.51)

is computed and subsequently thresholded. This way only matching descriptors are included into the assignment problem. Furthermore, nodes from O_{TR} are allowed to be similar to more than one node in O_{IM} and vice versa. With these matches it is possible to create a catalog of candidates

C = [\vec{C}_{TR}, \vec{C}_{Ref}, \vec{C}_A],    (3.52)

where \vec{C}_{TR} contains all descriptors D_{TR} that were matched in O_{Ref}, and the corresponding match D_{Ref} is stored in \vec{C}_{Ref}. The vector \vec{C}_A is a simple helper variable, which groups multiple assignments from O_{TR} → O_{Ref}. The maximum number of entries in \vec{C}_A is determined by the number n_{Ref} of matched reference nodes. Consequently a vector \vec{a} with n entries, which are all initially set to one, can be created. Using these assignments it is possible to compute \vec{\omega}_2 with the Euclidean distance

\omega_2^i = \sqrt{(D_{TR,i} − D_{Ref,i})^T (D_{TR,i} − D_{Ref,i})}.    (3.53)

Both \vec{a} and \vec{\omega}_2 can now be used to compute the cost of the descriptor assignments with

c_d(\vec{a}) = \vec{\omega}_2^T \vec{a}.    (3.54)

By now only the descriptors have been used for comparison. In the next step the spatial arrangement of the nodes in O_{Ref} and O_{TR} is computed and compared. Therefore eq. 3.45 can be altered to

W_1^{ij} = (γ_1 \|\vec{d}_{Ref}^{ij}\| + γ_2)\, \arccos\left(\frac{\vec{d}_{Ref}^{ij\,T}\, \vec{d}_{TR}^{ij}}{\|\vec{d}_{Ref}^{ij}\|\, \|\vec{d}_{TR}^{ij}\|}\right),    (3.55)


where \vec{d}_{TR}^{ij} and \vec{d}_{Ref}^{ij} are the distance vectors of the corresponding points. All values on the matrices' diagonal, and all values indicating perfect correspondences between D_{Ref}^i and D_{TR}^j, are set to zero in this form. Subsequently the cost

c_{spatial}(\vec{a}) = \vec{a}^T W_1 \vec{a}    (3.56)

can be minimized. A perfect match is reached if c_{spatial}(\vec{a}) = 0, while the best combination of assignments just minimizes the cost function. For further optimization the number of elements is increased by allowing so called non-correspondence entries, which allows the inclusion of similar descriptors although the spatial configuration completely differs. The corresponding entries of \vec{\omega}_2 are initially set to zero, and to one in \vec{a}. In order to provide a robust graph matching, some more constraints V have to be introduced. Since every additional node raises the computational effort and adds uncertainty, the number of non-correspondences is limited to a predefined percentage of real matches with

\vec{V}_1 \vec{a} > N_{Zero},    (3.57)

where \vec{V}_1 is a vector containing ones at the positions with non-correspondences. Although multiple assignments of reference descriptors are allowed to enhance robustness, usually only one assignment is reasonable. With n being the maximum number of matched nodes in O_{Ref},

V_2 \vec{a} = \vec{t},    (3.58)

where \vec{t} is an n-dimensional vector. The matrix V_2 is initialized as an empty n_{TR} × n_{Ref} matrix and filled by considering the n_{TR} elements of \vec{C}_A. Subsequently an assignment of

V_2^{k, C_{A,k}} = 1    (3.59)

is performed. Likewise, multiple assignments of tracking to reference descriptors have to be avoided. A further constraint is the requirement to force an alignment of each node in O_{Ref}, resulting in

\vec{V}_3 = [1 \cdots 1], \quad \dim = N_{Ref}, \qquad \vec{V}_3 \vec{a} = N_{Ref}.    (3.60)

All these constraints can be summed up into one matrix with

V = \begin{pmatrix} \vec{V}_1 \\ V_2 \\ \vec{V}_3 \end{pmatrix}.    (3.61)



Figure 3.21.: Exemplary tracking of persons located within a designated tracking region. Partial occlusions are resolved without problems.

Keeping these constraints in mind, it is now required to solve the quadratic assignment problem

c(\vec{a}) = \min_{\vec{a}}\ \vec{a}^T W_1 \vec{a} + \vec{\omega}_2^T \vec{a}, \quad \text{subject to the constraints summarized in } V.    (3.62)

As it is a complex task to find an exact solution, an approximation has been described in [Mac03], which can be used in this work. In a first step the minimum cost of a set of assignments in W_1 is computed with

\tilde{W}_1^i = \min \sum_j W_1^{ij} a_j.    (3.63)

A lower bound c_{low} of the overall cost can now be estimated using a descriptor similarity \vec{c},

c_{low}(\vec{a}) = (\vec{c} + \tilde{W}_1)^T \vec{a}.    (3.64)

Thereby it is possible to compute a first assignment a_0 by minimizing this energy,

a_0 = \underset{a}{\operatorname{argmin}}\ c_{low}(\vec{a}).    (3.65)

Starting with a gradient search from a0 it is possible to determine a value aF with c(aF ) = min cost(a). The computed result does not necessarily resemble the best possible solution, as false assignments might be inevitable due to bad parameter selection. Nevertheless it is a good initial guess for the similarity of two graphs. The model with the smallest cost should consequently be created by the same person as the actually tracked graph. In order to avoid intrusions a threshold θcost can be set, rejecting matches with a too low similarity.

86

3.5. Object Tracking with Deformable Feature Graphs

Figure 3.22.: Re-assignment of the correct tracking ID after a total occlusion [Gat08].

3.5.5. Graph Tracking Evaluation Due to the lack of video data with annotated bounding boxes, it is hard to provide a quantitative quality measure. Therefore it has been decided to evaluate a small set of data, where bounding boxes have been automatically created in multiple views by the implemented multi camera tracking system, which is based on homographies between views. The bounding boxes were subsequently aligned with the given ground truth data, which only indicates the persons’ positions on the ground plane. As the most interesting scenes are the ones where ambiguities appear, all scenes in which ID changes were detected have been used for evaluation. The homography system created 15 ID changes in the eight sequences of the PETS2007 data set [Ars08b], resulting in 60 2D scenes which had to be analyzed in detail, as four views are provided. Usually these ambiguities are created by partially or totally occluding blobs in all available views, which indicates difficult tracking conditions. Furthermore, all other consistently labeled trajectories were also transformed back into 2D to demonstrate the approach’s performance. Evaluation has shown that the SIFT based tracker creates most ID changes in the same situations as homography, as it loses track. In total 19 confusions have been created, of which 17 appeared at the same position as in the multi camera system. The remaining two were detected during regular tracking. Though some confusions still appear, deformable graphs are a reliable additional source in the multi camera tracking framework. Furthermore the algorithm has been evaluated by inspection with unlabeled data for qualitative results. Fig. 3.21 shows a tracking sequence for persons being localized in a predefined tracking region. During the tracking process there were no critical events detected, besides some partial occlusions. Obviously, these can be handled without difficulty. The almost complete occlusion of the person passing behind the girl standing in front at the left hand side should be noted. An exemplary re-assignment of an ID is shown in fig. 3.22, where two persons have been manually initialized. The person is totally occluded by a passing man and therefore his trajectory is lost. After re-appearance the object ID is correctly re-assigned. The task is eased in this special case as both pose and position do not change drastically.

87

Chapter 3. Object Tracking

Figure 3.23.: Re-assignment of the correct tracking ID in differing FOV [Leh08].

This approach has been initially implemented to re-identify lost tracks within the same view after severe occlusion. Due to the high performance it has been decided to apply it also to re-identification in non-overlapping FOVs. First experiments have been conducted with the airport scenario of the PROMETHEUS Database [Nta09] and promising results have been obtained. Both the limited database size and the early project state did not allow further investigation in this work, but motivate further research. Fig. 3.23 displays an example of a correct ID assignment although angle, scale, and size drastically differ. Furthermore the large amount of correctly assigned descriptors is noticeable. The advantage of the graph matching procedure becomes obvious if the descriptor matches of two different persons are compared. By far more descriptors are usually confused and a recognition cannot be reliably performed based on the nodes themselves.

3.6. Closure In this chapter various tracking approaches have been exploited. It should be noted that all of these require detection at some point and hence performance is depending on the initialization procedure. In case this fails, the entire tracking cue will not operate as expected. The rather primitive blob tracking approach, based on position and overlap of subsequent foreground regions, can be considered as quite effective initial guess for more sophisticated approaches as long as no overlaps are observed. The same conclusion can be made both for Kalman filtering and the condensation algorithm, which can be considered as probabilistic modeling of object motion. Each of these relies on measurements provided by the baseline detection system, which are often based on blobs or a trained model for an object class. These are usually used to generalize a class rather than to discriminate between objects. Hence these are basically not suitable for reliable tracking of multiple objects and used to keep the number of re-initializations as low as possible. Nevertheless both approaches are popular and sufficient for simple sequences. With more sophisticated

88

3.6. Closure methods for the creation of models for dynamic processes, such as Hidden Markov Models (HMM) or Dynamic Bayesian Networks (DBN) [FVJ01], a more exact estimation of motion parameters can be achieved [Che01]. However, more efficient solutions to estimate inference have to be found before DBNs are widely applied [Yil06]. Feature point based tracking approaches have been identified as more discriminative, here especially SIFT features, as these can also be used for recognition tasks [Lop08]. With the extension to the tracking of feature graphs a by far more robust object representation has been formulated. Thereby the geometrical structure, both distance and angles between features, has been considered, resulting in a scale and rotation invariant directed graph. Though being quite robust even with partial occlusions, the update procedure has yet to be refined in order to guarantee long term stability of the created model [Mat04]. This happens to be rather complex in crowded scenarios, where an automated assignment of new points is hard to conduct, and novel segmentation methods have to be considered [Leh09]. For future developments in the tracking context it would be reasonable to additionally incorporate object shape information, that could be used to restrict the search region for other features. The object silhouette, such as head and shoulders or a hand, can commonly not be modeled by simple geometrical shapes. Thereby especially contour tracking [Kan04] tends to show reliable results. Moreover recent works have demonstrated that contextual information should be considered during the detection and recognition phase, meaning that for instance a person cannot walk along the walls of a building [Tor03].

89

Chapter 3. Object Tracking

90

Chapter 4. Multi Camera Object Detection and Tracking One of the major goals of automated visual surveillance systems is to detect objects in a scene and to track these over time. The most challenging problem thereby is to segment people in complex scenes, where high object density leads to heavy occlusions. To model individual behaviors these have to be resolved robustly. Tracking techniques based on a single view, such as as KLT Features [Shi93] or the mean shift algorithm [Com00], are able to track objects robustly, but require an initialization of single objects prior to the group formation and the subsequent handling of merge and split events [Niu04]. Nevertheless, with current 2D tracking approaches it is not possible to segment people entering as a group. To cope with this, Elgammal and Davis [Elg01] presented a general framework which uses maximum likelihood estimation and occlusion reasoning to obtain the best arrangement for people. However, a single view often is not sufficient to detect and track objects due to severe occlusion, which as a fact requires the utilization of multiple camera views. Likewise, camera networks frequently are applied to extend the limited field of view of one camera, performing tracking in each sensor separately and fusing this information [Agg99]. In order to deal with dense crowds, the cameras should be mounted to view defined regions from different perspectives. Within these perspectives corresponding objects now have to be located. Appearance based methods, such as matching color [Orw99], lead to frequent errors due to different color settings and lighting situations in the individual sensors. Some enhancement can be achieved by adding further features, like the face [Num03], to enhance robustness. Approaches based on geometrical information purely rely on geometrical constraints between views, using calibrated data [Yue04] or homography between uncalibrated views, which e.g. Khan [Kha06] suggested to localize feet positions. A similar algorithm has been applied by Eshel [Esh08] to detect heads. However, as Khan’s approach only localizes feet, it consequently tends to segment persons into further parts. In these respects a novel extension to this framework is herein presented, applying homography in multiple layers to successfully overcome the problem of aligning multiple segments belonging to one person. As convenient side effect the localization performance will increase dramatically. Furthermore the issue of frequently appearing

91

Chapter 4. Multi Camera Object Detection and Tracking ghost objects, which disturb the tracking process, will be addressed. Instead of increasing the number of cameras [Kha06], a new method, based on geometric constraints and an additional preprocessing step, will be introduced to lower the false positive rate. The maintenance of object IDs over time is crucial to the subsequent behavior detection task and can be performed in the 3D domain, creating frequent ID changes. It is suggested to combine the presented 3D tracking method with standard 2D tracking methods, to lower assignment errors. The high performance of the proposed algorithms will be demonstrated on the challenging PETS2007 data set [Fer07] (see app. A.5), which has been created for the comparison of tracking and behavior detection systems applying multiple camera views. This set not only fits the desired application scenario, but its public availability allows for competitive evaluation. In total nine scenarios were recorded from four fields of views. For the relevant persons in the scenarios ground truth is provided both for the person’s position and the time stamp for the event.

4.1. Data Acquisition in Smart Sensor Networks Compared to single sensor environments, the installation of multiple sensors creates new challenges in both sensor placement and data acquisition. Following list gives a short overview on some things to be taken care of: ˆ Sensor Placement: inconvenient sensor positions frequently harden the surveillance task without any necessity. For the current homography framework the overlap of the fields of view should be maximized to cover large regions of the scenery. Otherwise some areas will only be visible in one perspective at the cost of the advantages of a multi camera system. Especially in small and narrow environments it is difficult to accomplish this task, as the field of view is restricted. Therefore it is advantageous to use optics with a short focal length, providing a wide angled view at the cost of additional radial distortion. Cameras observing adjacent regions should have at least some overlap, to enable consistent labeling [Cal08] and avoid the task of re-recognizing people in different views. This of course is often inevitable in a real world scenario. ˆ Data storage and compression: due to the massive amount of data to be stored in CCTV applications, video compression is inevitable. A trade off between the used compression codec, bit rate, and resulting compression artifacts has to be made. This usually results in additional motion and lower object detection performance in subsequent processing steps. A frequently used setting is resembled by the MPEG-4 codec [Koe02] with a high bit rate. ˆ Synchronization: for a reliable surveillance system it is inevitable to synchronize the video streams at frame level. This procedure will guarantee a sensor fusion with-

92

4.2. Camera Geometry

Figure 4.1.: A person clapping with his hands visible from three fields of view. The manually labeled frame is used for synchronization

out delay, which would cause errors in estimating a persons position and behavior. For on line applications there are basically two concurring versions. One would be the use of a Network Time Protocol (NTP) server [Mil85] as reference for all computing units. Each sensor will add a time stamp in every single frame. In the processing unit only frames with equal time stamps are used for a combined analysis. Point Gray presented a hardware based solution as alternative, which requires processing rather in a central unit than directly in the smart camera. For instance, in case more than one Dragonfly [Poi08] camera is present on the IEEE1394 bus, the cameras automatically synchronize their acquisition time. Of course the acquired frames can subsequently be processed in a distributed and parallelized fashion. These methods can also be applied for off-line processing tasks. If this specialized hardware is not available or the additional cost of installation is not feasible for data recordings, a manual synchronization procedure can be performed. Therefore a short visual signal, which is visible in all cameras, can be used as trigger. This could be a camera’s flash, a person standing in sight of the cameras and performing a clap with his hands [Lo04] or a clapperboard as known from the movies. Fig. 4.1 shows a sample of a synchronized video created with the clapping technique.

4.2. Camera Geometry Camera calibration is required to extract metric information from 2D images [Zha00]. Therefore the internal geometric and optical characteristics (intrinsic parameters) need to be determined. Additionally a sensor’s 3D position, relative to a world coordinate system (extrinsic parameters), has to be known. While various calibration methods have been implemented in the last years [Sun05], this work relies on Tsai’s calibration method [Tsa87], as experience has shown sufficient accuracy for the multi camera application. In this section the fundamentals of camera geometry will be described in the first place, followed by a detailed description of Tsai’s method.

93

Chapter 4. Multi Camera Object Detection and Tracking

4.2.1. The pin hole camera model To be able to calibrate a set of cameras, the geometrical characteristics of an optical system have to be known. A camera basically collects light through a lens and bundles it in the optical center. This process is commonly described by the pinhole model [J¨ah95], shown in fig 4.2. The lens is located at distance f to the image plane, which denotes the focal length of the lens, which has to be calibrated. A ray of light, reflected by an object point in 3D computer coordinates p~ = (x, y, z) or world coordinates P~ = (xw , yw , zz ), is passing an infinitesimally small hole, which represents the lens center, and intersects the image plane at p~ = (xu , yu , −f ), resulting in the undistorted image coordinates p~ = (xu , yu ). The relationship between the 3-D and 2-D world is thereby given by xu = −

fx , z

yu =

fy , z

(4.1)

which can be computed by the use of similar triangles [Pra91]. As both coordinate systems are parallel to the image plane they are simply scaled by the factor fz . Objects are therefore reproduced in the correct ratio of world coordinates: Neither their seize nor the distance of the objects can be reconstructed from the created image. This relationship can be mathematically described by a perspective transformation matrix M in homogeneous coordinates with 

1 0

0

  0 1 0   M= 0 0 1   0 0 −1  f 1 0

0



 0   0  .  0   0

(4.2)

The use of homogeneous coordinates enables a simple formulation of concatenated operators, such as rotation and translation, which can be considered as matrix multiplication. According to Roberts [Rob66] homogeneous coordinates can be created by a simple dimension change from n to n + 1  



sx1



x1    sx2       x2    → · · ·  .  ···       sxn    xn s

(4.3)

Where s denotes a scale factor between differently scaled coordinate systems and is commonly set to s = 1. For further computation the vector is normalized by simple division

94

4.2. Camera Geometry

y

 P

z

x

p

pinhole Image plane

f

focal plane

z

object plane

Figure 4.2.: Pinhole model

of all components by s: 



x1



x1 s x2 s

    x2          ...  →  ...     xn   xn    s s 1

     .   

(4.4)

Utilizing this formulation the vector product 0 p~ = MP~ 0

(4.5)

results in the corresponding normalized image plane coordinates    0 p~ =   

sx sy sz s+

sz f



     ⇒ p~ =    

fx f −x fy f −y fz f −z

  . 

(4.6)

A specific image point p~ can be back projected into three-dimensional object space through an inverse perspective transformation in an analogue way. A two-dimensional image can be interpreted as perspective transformation of a threedimensional scene onto the image plane. Thereby a relationship between two coordinate systems has to be created, as illustrated in fig. 4.3. Setting the center of image plane as origin of the image coordinates, points can be transformed from one view to another with basic transformations, such as translation, rotation, scaling and perspective projection [J¨ah95]. First the origin of the world coordinates is translated into the origin of the camera coordinate system by the translation vector T. Then the orientation of the shifted coordinates is changed by rotation about suitable axes. Mathematically this process can

95

Chapter 4. Multi Camera Object Detection and Tracking be interpreted as matrix multiplication, which represents the rotation matrix R, and a subsequent translation by T p~ = RP~ + T. (4.7) With the introduced model the image formation process can be represented by simple matrix multiplications in the 4 × 4 domain. Therefore the translation vector can be written as   1 0 0 TX   TX    0 1 0 TY     . T =  TY  → T =  (4.8)  0 0 1 T Z   TZ 0 0 0 1 Due to the three available axes, the original rotation  R11 R12 R13   R21 R22 R23 R=  R  31 R32 R33 0

0

0

matrix with the form  0  0  , 0   1

(4.9)

can be decomposed into three subsequent rotations about the axes X, Y and Z with R = Rx Ry Rz . The individual components are determined by the appropriate rotation angles θ, φ and ψ   1 0 0 0    0 cosθ −sin 0   , Rx =   0 sinθ cosθ 0   0 0 0 1   cosφ 0 sinφ 0    0 1 0 0   , Ry =  (4.10)   −sinφ 0 cosφ 0  0 0 0 1   cosψ −sinψ 0 0    sinψ cosψ 0 0  . Rz =   0 1 0 0    0 0 0 1 The scale factor S between the two coordinate  s1 0   0 s2 S=  0 0  0 0

96

systems can be changed with  0 0  0 0  , s3 0   0 1

(4.11)

4.2. Camera Geometry

o oi

y

Zw

x p = x u , y u p = x d , y d 

y Yw

x

z  P = x w , y w , z w  Xw

Figure 4.3.: Tsai’s extended pinhole camera model with additional distortion

and the perspective projection P is already known as 



1 0

0

0

  0 1 P=  0 0  0 0

0

 0  . 0   0

1 −1 f

(4.12)

In this place the advantage of the homogeneous formulation becomes apparent: A sequence of transformations can be modeled as subsequent matrix multiplications. As the multiplication is a non commutative operation the order is crucial and has to be respected. Additionally all required matrices for the transformation from world to image coordinates can be composed of these elementary matrices with following decomposition: M = SPRz Ry Rx T.

(4.13)

The model up to now does not consider lens distortion. As seen in fig. 4.3 the ray of light is not passing the estimated image point but is distorted to xu = xd + Dx and

(4.14)

yu = yd + Dy , where (xd , yd ) is the real image coordinate on the plane and (xu , yu ) represent the com-

97

Chapter 4. Multi Camera Object Detection and Tracking puted undistorted image coordinates. The distortion can be modeled with Dx = xd (κ1 r2 + κ2 r4 , . . .) (4.15)

and Dy = yd (κ1 r2 + κ2 r4 , . . .),

p where r = x2d + yd2 and the distortion parameters κi need to be calibrated. Depending on the used convention for image representation in computer coordinates, the real image coordinates have to be transformed into computer coordinates (xf , yf ) with 0

xf = sx dx−1 xd + Cx and

(4.16)

yf = d−1 y yd + Cy , where (xf , yf ) represents the row and column numbers of image pixel in computer frame memory and (Cx , Cy ) is the center position of the computer frame memory. The distances of adjacent sensor elements in X− and Y − direction are denoted by dx and dy , providing Ncx d0x = dx , (4.17) Nf x with the number Ncx of sensor elements in a row and the number Nf x of sampled pixels in a row. The uncertainty scale factor sx has to be computed during calibration. In case a CCD camera is utilized, it is advisable to set dy , Ncx and Nf x to one. For other sensors the distance between adjacent sensor elements in x direction is usually set to xd Nf x X= . (4.18) dx Ncx In order to understand the calibration process it is necessary to provide an exact relationship between world and computer image coordinates. Using the knowledge about the camera model it is possible to create this relationship with x 0 −1 0 2 s−1 x dx X + sx dx Xκ1 r = f z (4.19) y 0 2 dy Y + dy Y κ1 r = f , z p 0 2 2 where r = (s−1 x dx X) + (dy Y ) . By a substitution of eq. 4.7 one receives r1 xw + r2 yw + r3 zw + Tx r7 xw + r8 yw + r9 zw + Tz (4.20) r4 xw + r5 yw + r6 zw + Ty 0 2 dy Y + dy Y κ1 r = f . r7 xw + r8 yw + r9 zw + Tz Applying these relationships it is possible to compute correspondences between world and computer coordinates. Normally the parameters dx and dy are provided by the camera manufacturer, (Ncx ,Nf x ) are set to one and (Cx , Cy ) is the center of the computer image. The remaining parameters, that need to be calibrated, can be basically split into two categories: The so called extrinsic parameters which transform world into computer coordinates, here the translation vector T and the rotation matrix R. Further the so called intrinsic parameters, here the effective focal length f , lens distortion κi and the uncertainty scale factor s. 0 −1 0 2 s−1 x dx X + sx dx Xk1 r = f

98

4.2. Camera Geometry

Figure 4.4.: Calibration Pattern used for the Calibration of the PETS2007 data set. The markers on the floor are visible in all views.

4.2.2. Tsai’s Calibration Method In order to calibrate a camera, only the parameters mentioned before are required. These include the rotation matrix R, the translation matrix T, the focal length f , the scale factor s, and the distortion κi . Currently systems usually depend on known point correspondences between a 2D image and the real world. The required points can be either manually labeled or automatically detected [Sze01]. These points can be either located on a plane [Sir07], such as a pattern on the floor in fig. 4.4, or on a 3D object, where often two or three orthogonal planes are used [Cha97]. Given a set of such point correspondences it should be possible to determine the calibration parameters by solving a system of equations which matches world coordinates P~ = (xw , yw , zz ) to image coordinates p~ = (x, y). Azis and Karara [AA71]developed the so called Discrete Linear Transformation (DLT), which restricted the calibration to solving a set of linear equations. Unfortunately this technique does not include the inevitable impact of lens distortion. As soon as this is required the DLT requires a full non-linear search. Therefore Tsai introduced a novel technique in [Tsa87], considering radial lens distortion, as illustrated in the utilized model in fig. 4.3. In order to calibrate cameras, point correspondences between world and camera coordinates are required in the first place. The positions (xf , yf ) of all i calibration points can be determined either manually or automatically. Further, the known parameters dx , dy Ncx and Nf x are obtained by the camera manufacturer, and (Cx , Cy ) is set to the image center. Using eq. 4.16 the distorted positions of the calibration points can be determined

99

Chapter 4. Multi Camera Object Detection and Tracking with 0 xdi = s−1 x dx (xf i − Cx )

ydi = dy (yf i − Cy )

(4.21)

Next it is possible to compute the translation matrix in two steps. First the five unknowns Ty−1 r1 , Ty−1 r2 , Ty−1 Tx , Ty−1 r4 , Ty−1 r5 can be determined by setting up following linear equation system with the known correspondences of world and computer coordinates   Ty−1 r1  −1    h i  Ty r2   ydi xwi ydi ywi ydi −xdi xwi −xdi ywi  Ty−1 Tx  (4.22)  = Xdi .    T −1 r4   y  Ty−1 r5 This equation system becomes overdetermined with more than five calibration points, which means that the unknowns can be determined. Utilizing eq. 4.22, it is possible to compute Ty by introducing a new 2 × 2 matrix # " r # " r2 1 r10 r20 T T = r5y r5y , (4.23) C= 0 0 r4 r5 Ty Ty where C is a submatrix of the rotation matrix R. If no column or row vanishes, Ty can be computed with s Sr − [Sr2 − 4(r10 r50 − r40 r20 )]2 Ty = , (4.24) 2(r10 r50 − r40 r20 )2 0

0

0

0

where Sr = r12 +r22 +r42 +r52 . With an arbitrarily chosen point i and the initial assumption that the sign of Ty is +1, following equations can be solved: r1 = (Ty−1 r1 )Ty r2 = (Ty−1 r2 )Ty , r4 = (Ty−1 r4 )Ty , r5 = (Ty−1 r5 )Ty ,

(4.25)

Tx = (Ty−1 Tx )Ty , x = r1 xw + r2 yw + Tx , y = r4 xw + r5 yw + Ty , where the unknowns Ty−1 r1 , Ty−1 r2 , Ty−1 Tx , Ty−1 r4 , Ty−1 r5 have been already computed in eq. 4.22. In case x and xd have the same sign, and y and yd have the same sign then the sign of Ty is +1, else it is −1. Having determined the correct sign of Ty it is possible to recompute the equations in eq. 4.25 and compute the correct value of Tx Tx = (Ty−1 Tx )Ty ,

100

(4.26)

4.2. Camera Geometry

sensor location

Figure 4.5.: Calibration error depending on the object’s distance to the sensor. The lines indicate the resulting error in distance and direction, comparing ground truth and computed image to world coordinate transformation

and the rotation matrix with 

r1 r2

(1 − r12 − r22 )1/2



  R =  r4 r4 s(1 − r42 − r52 )1/2  , r7 r8

(4.27)

r9

where s = −sgn(r1 r4 + r2 r5 ) and r7 , r8 , r9 are determined from the outer product of the first two rows using the orthonormal right -handed property of R. In a last step both the focal length f and Tz are estimated by establishing a linear equation system " # f [yi − dy yi ] = wi dy yi , (4.28) Tz with yi = r4 xwi + r5 xwi + Ty and wi = r7 xwi + r8 ywi . Given several calibration points this overdetermined equation system can be easily solved. The obtained result for Tz and f can be used as initial guess for a standard optimization scheme such as steepest descent to exactly determine the parameters f, Tz and κ1 using d0y y + dy yκ1 r2 = f

r4 xw + r5 yw + r6 zw + Ty r7 xw + r8 yw + r9 zw + Tz

(4.29)

from eq. 4.20. Applying this calibration technique allows the computation of point correspondences between world and image coordinates. While it has been reported to be quite exact, precision

101

Chapter 4. Multi Camera Object Detection and Tracking depends on a wide range of influences, which will be demonstrated with the example provided in fig. 4.5. In the first place the positions of calibration points have to be measured exactly, as these resemble the ground truth in world coordinates. Especially in case large spaces are covered in one field of view and there is no regular pattern available this easily becomes a problem, because high accuracy is required. Large distances become a more evident problem when the corresponding points have to be detected in image coordinates. The red patterns in fig. 4.5 are used as calibration markers. Objects located further away from the sensor position naturally appear smaller in the image than objects located near the sensor. Therefore these objects cover smaller regions in the image and for instance corners cannot be extracted precisely, even if a manual localization is performed. Both uncertainties and minor errors in the estimation of κ1 and Tz , lead to small but noticeable errors. These have been visualized in fig. 4.5, where the known point correspondences have been transformed from image to world coordinates. The lines demonstrate the displacement, which is up to 0.5 m for regions located far away from the camera. These errors have to be considered during the following localization task.

4.3. A Short Review On Homography 4.3.1. The Homographic Transformation Homography [Har03] is a special case of projective geometry. It enables the mapping of points in spaces with different dimensionality Rn [Est]. Hence a point p~ observed in a view 0 can be mapped into its corresponding point p~ in another perspective or even coordinate system. Fig. 4.6a) illustrates this for the transformation of a point p~ in world coordinates 0 R3 into the image pixel p~ in R2 0

p~ = (x, y) → p~ = (x, y, z).

(4.30)

Planar homographies, here the matching of image coordinates into the ground plane, in contrast only require an affine transformation from R2 → R2 . This can be interpreted as a simple rotation with R and translation with T 0

p~ = R~p + T.

(4.31)

In order to avoid the expensive task of two processing steps it seems reasonable to combine both steps in one operation. By moving the computation into a higher dimension the transformation can be represented by a single matrix multiplication R T 0 p~ = RT~p = (4.32) p~ = H~p 0 1 by a homography matrix H in homogeneous coordinates. The resulting function h h h 11 12 13 0 p~ = h21 h22 h23 p~ with det H 6= 0, h31 h32 h33

102

(4.33)

4.3. A Short Review On Homography XI YI

XI

p'

YI

Zw

a)

Yw Xw

p

Zw

b)

Yw Xw

Figure 4.6.: a) Transformation of point p in world coordinates to the point p’ in image coordinates. b) A rectangular image region mapped into the corresponding ground plane. The result can be interpreted as the shadow of the original object, here the dotted rectangle.

contains all required components for translation h13 , h23 , scale and rotation h11 , h22 , shear and rotation h12 , h21 and the homogeneous scaling factor h33 . With h31 = 0 and h32 = 0, H describes an affine transformation matrix. A projective transformation with additional distortion can be achieved by setting h31 and h32 to values different to zero. In total 8 degrees of freedom are being provided by H, defining projective transformations [Shi02]. Among these are following special cases:

ˆ Isometry: transformation from one metric space into another one preserving the scale factor using rotation and translation. cosθ −sinθ T x 0 p~ = sinθ cosθ Ty p~, with  ∈ ±1. (4.34) 0 1 0 ˆ Similarity transformation: in case the scale factor s is changed by the operation a similarity transformation is conducted, preserving the object ratio scosθ s − sinθ T x 0 p~ = ssinθ scosθ Ty p~, with s ∈ R. (4.35) 0 1 0 ˆ Affine Transformations: the geometrical operations of translation, scaling, shear and rotation are combined in a affine transformation with a cosθ −a − sinθ T 12 x 11 0 a22 cosθ Ty p~, with a ∈ R. p~ = a21 sinθ (4.36) 0 0 1

103

Chapter 4. Multi Camera Object Detection and Tracking

Figure 4.7.: Homographies for the 4 views of the PETS2006 data set

In case an object is detected in an image, as illustrated in fig. 4.6 b), each pixel inside the boundaries is transformed into world coordinates. Due to the restriction to R2 there is no height information available. The entire object is transformed into the ground plane, resulting in a stretched rectangular shape. This can be interpreted as an object’s shadow created by a light source at the sensor’s original position.

Within this work the homography between views, in particular the camera view and a virtual top view, see fig. 4.7, has been computed with the camera calibration technique introduced by Tsai [Tsa87]. The four different fields of view of the PETS2006 data set are shown on the left hand side. Subsequently the entire image is transformed into the ground plane, providing a synthetic bird’s eye view. Due to lens distortion and calibration errors some of the views are a little blurred. As only one plane is known, even elevated objects, e.g. walls or pedestrians, are also mapped into the ground plane. Calibration is not necessarily required as shown by Calderara et al. [Cal08], who were determining changes in supporting views to compute the homography on-line.

104

4.3. A Short Review On Homography

Ci

Hij

Cj pj 

pi  Plane Parallax P Piercing Points

Planar Surface π

Figure 4.8.: The homography constraint visualized with a cylinder standing on a planar surface

4.3.2. The Homography constraint This section describes the characteristics of the projective geometry between multiple cameras and a plane in world coordinates, introduced in [Kha06]. For better understanding, the problem will be limited to two cameras in the first place, though an extension is possible without additional effort. A point p~π located on the plane π is visible as p~iπ in view Ci and as p~jπ in a second view Cj . Applying eq. 4.32, p~iπ and p~jπ can be determined with p~iπ = Hiπ p~π and p~jπ = Hjπ p~π , (4.37) where Hi,π , denotes the transformation between view Ci and the ground plane π. The composition of both perspectives results in a homography [Har03] p~jπ = Hjπ H−1 ~iπ = Hij p~iπ iπ p

(4.38)

between the images planes. This way each pixel in a view can be transformed into another arbitrary view, given the projection matrices for the two views. A 3D point p~π located off the plane π visible at location p~iπ in view Ci can also be warped into another image with p~w = H~piπ , with p~w 6= p~2π . The resulting misalignment is called plane parallax. As illustrated in fig. 4.8 the homography projects a ray from the camera center Ci through a pixel p~ and extends it until it intersects with the plane π, which is referred to as piercing point of a pixel and the plane π. The ray is subsequently projected into the camera center of Cj , intersecting the second image plane at p~w . As can be seen points in the image plane do not have any plane parallax, whereas those off the plane have considerable one. Each scene point p~π located on an object in the 3D scene and on plane π, will therefore be projected into a pixel p~1π , p~2π , · · · , p~nπ in all available n views if the projections are located in detected foreground regions F Gi with p~iπ ∈ F Gi .

(4.39)

105

Chapter 4. Multi Camera Object Detection and Tracking

Figure 4.9.: A scene viewed by a video and a thermal infrared sensor. Next to the images the extracted foreground regions are illustrated. The yellow rectangles indicate the lowest point in foreground blob.

Furthermore each point p~i can be determined by a transformation between view i and an arbitrary chosen one indexed with j p~iπ = Hij p~jπ ,

(4.40)

where Hij is the homography of plane π from view i to j. Given a foreground pixel p~i ∈ F Gi in view Ci , with its piercing point located inside the volume of an object inside the scene, the projection p~j = Hij p~i ∈ F Gj (4.41) lies in the foreground region F Gj . This proposition, the so called homography constraint, is segmenting out pixel corresponding to ground plane positions of objects and helps resolving occlusions, see sec. 4.4.1. The homography constraint is not necessarily limited to the ground plane and can be used in any other plane in the scene, as it will be shown in sec. 4.5. For the localization of objects the ground plane seems sufficient to find objects touching the ground plane. In the context of pedestrians a detection of feet is performed, which will be explained in the following sections.

4.3.3. Semi-Automated Computation of H For most common calibration techniques point correspondences between the real world and the image plane are inevitable, which are also required for the homography between two views Ci and Cj . Unfortunately there is frequently no convenient method to find point correspondences in the sensor data. Reasons may be a very low resolution as for PMD data or a sensor technology measuring other values than the intensity values of visible light, such as thermal infrared. Fig. 4.9 shows a thermal image and a RGB image taken from a similar view. As it can be seen the calibration pattern, here the crosses on the floor, are not visible and there is almost no other structure in the image visible. A practical solution would be the use of additional markers, which are based on temperature differences, such as ice water on warm ground, parallel to the visual markers. This is

106

4.3. A Short Review On Homography

∆x

∆y

∆z

Indoor Setup: 0.05 0.06 0.08 Outdoor Setup: 0.07 0.08 0.25 Table 4.1.: Error for the semi automated calibration process in meters. The accuracy in x and y direction is considered to be sufficient. In contrast the error in z is by far too large.

often not possible due to the distance of a sensor to the ground or simply logistic reasons. While the homography Hij between views can be computed on-line [Cal08] the following approach requires the manual calibration of the visual sensor. It could be avoided, but then there would be no information related to the world coordinates available, which is required for evaluation and backup purposes. In order to simplify the correspondence problem this method is limited to one object located in the scene. At first a foreground segmentation is performed in both sensors resulting in the object F Gi and F Gj . The blobs’ lowest points p~n,min , which usually touches the ground plane, can be determined with connected components analysis in the both views and a subsequent computation spatially lowest Y coordinate. Supposing that the detected points are corresponding and the homography constraint, as defined in eq. 4.40, is fulfilled it can be assumed that H1π p~1,min = H2π p~2,min = p~π .

(4.42)

Now that the point correspondences between real world and image coordinates in the thermal image are known, it is possible to perform any preferred calibration method. In case more than the required seven points are available, the calibration becomes more exact and less error can be observed. Fig. 4.9 shows an exemplary view recorded for the Prometheus database [Nta09]. As can be seen the markers are visible both in thermal infrared and the normal camera, which can be used for manual calibration. This data can subsequently be used for the evaluation of the semi automated calibration task. In total two setups have been tested during the recordings, one outdoors and the other one indoor. For both scenarios the manually annotated points have been used as reference to compute the calibration error, which is given for both setups in tab. 4.1. These points are transformed into world coordinates and should be located on the corners of the calibration grid. Though the calibration is quite exact, some error is still noticeable if transformed data is compared with the markers on the ground plane. The error in X and Y direction is at an low 0.05 m in average. In contrast the deviation in Z direction is up to 0.25 m.

107

Chapter 4. Multi Camera Object Detection and Tracking

Sensor 1

Sensor n

Image Capture

Image Capture

Foreground Segmentation

Foreground Segmentation

Homography

Homography

Data Storage

r1

Fusion

a)

Object Region

r2

b)

Figure 4.10.: a) Scheme of object localization with planar homography. b) Object detection example applying homography on 2 exemplary views in the PETS2007 dataset. The frames are first thresholded and subsequently transformed into the ground plane. The fusion is a simple threshold of the sum of polygons of all available camera data.

4.4. Planar Homography 4.4.1. Object localization Using Homography in the Ground Plane Now that it is possible to compute point correspondences from the 2D space to the 3D world and vice versa, it is also possible to determine the number of objects and their exact location in a scene. Fig. 4.10a) gives a brief overview of the applied processing steps, which are illustrated in an example with PETS2007 data in fig. 4.10b). In the first stage a synchronized image acquisition is needed, in order to compute the correspondences of moving objects in the current frames C1 , C2 , . . . , Cn . Additionally the sensors should be set up keeping in mind that the observed region should be as large as possible and direct occlusions of the sensor should be avoided. Therefore a field of view looking down on the scenery from an elevated point would be preferable. Subsequently a foreground segmentation is performed in all available smart sensors to

108

4.4. Planar Homography detect changes from the empty background B(x, y) [Kha06] : F Gi (x, y, t) = Ii (x, y, t) − Bi (x, y)

(4.43)

where the appropriate technique to update the background pixel, here based on Gaussian Mixture Models [Ziv04a], is chosen for each sensor individually. It is advisable to set parameters, such as the update time, separately in all sensors to guarantee a high performance. Computational effort is reduced by masking the images with a predefined tracking area. Now the homography Hiπ between a pixel p~i in the view Ci and the corresponding location on the ground plane π can be determined. In all views the observations x1 , x2 , . . . , xn can be made at the pixel positions p~1 , p~2 , . . . , p~n . Let X resemble the event that a foreground pixel p~i has a piercing point within a foreground object with the probability P (X|x1 , x2 , . . . , xn ). With Bayes’ law p(X|x1 , x2 , . . . , xn ) ∝ p(x1 , x2 , . . . , xn |X)p(X),

(4.44)

the first term on the right side is the likelihood of making an observation x1 , x2 , ..., xn given an event X happens. Assuming conditional independence, the term can be rewritten to p(x1 , x2 , . . . , xn |X) = p(x1 |X) × p(x2 |X) × . . . × p(xn |X).

(4.45)

According to the homography constraint, a pixel within an object will be part of the foreground object in every view p(xi |X) ∝ p(xi ), (4.46) where p(xi ) is the probability of xi belonging to the foreground. An object is then detected in the ground plane when p(X|x1 , x2 , . . . , xn ) ∝

n Y

p(xi )

(4.47)

i=1

exceeds a threshold θlow . In order to keep computational effort low, it is feasible to transform only regions of interest [Ars07a]. These are determined by thresholding the entire image, resulting in a binary image, before the transformation and the detection of blobs with a simple connected component analysis. This way only the binary blobs are transformed into the ground plane instead of probabilities. Therefore eq. 4.47 can be simplified to n X p(X|x1 , x2 , . . . , xn ) ∝ p(xi ) (4.48) i=1

without any influence on the performance. The value of theta θlow is usually set dependent on the number n of camera sensors to θlow = n − 1, in order to provide some additional robustness in case one of the views accidentally fails. The thresholding on sensor level has a further advantage compared to the so called soft threshold [Kha06, Bro01], where the entire probability map is transformed and probabilities are actually multiplied as in eq. 4.47. A small probability or even xi = 0 would result in a small overall probability,

109

Chapter 4. Multi Camera Object Detection and Tracking XI2

XI1

YI2

YI1

XI2

XI1

YI2

YI1

Zw

a)

Yw Xw

Zw

b)

Yw Xw

Figure 4.11.: a) Planar homography for object detection.b) Resolving occlusions by adding further views.

whereas the thresholded sum is not affected that dramatically. Using the homography constraint hence solves the correspondence problem in the views C1 , C2 , . . . , Cn , as illustrated in fig 4.11a) for a cubic object. In case the object is human, only the feet of the person touching the ground plane will be detected. The homography constraint additionally resolves occlusions, as can be seen in fig. 4.11a). Pixel regions located within the detected foreground areas, indicated in dark gray on the ground plane, and representing the feet, will be transformed to a piercing point within the object volume. Foreground pixel not satisfying the homography constraint are located off the plane, and are being warped into background regions of other views. The piercing point is therefore located outside the object volume. All outliers indicate regions with high uncertainty, as there is no depth information available. This limitation can now be used to detect occluded objects. As visualized in fig. 4.11b) one cuboid is occluded by the other one in view C1 , as apparently foreground blobs are merged. The right object’s bottom side is occluded by the larger object’s body. In contrast both objects are visible in view C2 , resulting in two detected foreground regions. A second set of foreground pixel, located off the ground plane π, in view C2 will now satisfy the homography constraint and localize the occluded object. This process allows the localization of feet positions, although they are entirely occluded, by creating a kind of see through effect. The implemented algorithm, as illustrated in an abstract example in fig. 4.10a) can be described as following: ˆ Foreground objects ui,j are detected in all n views and a binary map is created. Subsequently n object boundaries can be extracted utilizing connected components analysis in the binary image ˆ Object boundaries are then being transformed into a predefined reference view

Uij = Hiπ uij .

(4.49)

Though any of the views can be chosen, the most convenient one is a top view on the ground plane, visualizing spatial relationships between objects.

110

4.4. Planar Homography

Figure 4.12.: Detection example applying homographic transformation in the ground plane. Detected object regions are subsequently projected into view 3 of the PETS2006 data set. The regions in yellow represent intersecting areas. As can be seen, some objects are split into multiple regions. These are aligned in a subsequent tracking step.

ˆ Next the intersections of the polygons are computed. These can be calculated by a plane-sweep algorithm within the reference view. The binary represented regions Ri (x, y) ( ) 1 if p(x, y) ∈ Uij Ri (x, y) = (4.50) 0 else

located within detected foreground, are now transformed into the ground plane. In a subsequent step these values are summed up to Rs (x, y) =

n X

Ri (x, y).

(4.51)

i=1

ˆ The resulting map Rs (x, y) is subsequently thresholded with the previously defined parameter θ to encounter possible object regions ORi (x, y) ( 1 if B(x, y) ≥ θ ORi (x, y) = (4.52) 0 else

This is usually computed with θlow = n−1 to obtain higher reliability in the tracking process. ˆ Finally coherent regions indicating feet positions are indexed applying a simple connected component analysis.

The results of the fusion step are shown in fig. 4.12, where the yellow regions on the left hand side represent possible object positions. For an easier post processing the resulting intersections are interpreted as circular object q regions ORi with center point p~j (x, y, t) and

its radius rj (t), which is given by rj (t) = region.

Aj (t) , π

where Aj (t) is the size of the intersecting

111

Chapter 4. Multi Camera Object Detection and Tracking

O2

O1

R3 R4

O3

R1

1 Create Copy

2 Predict new position

5 Compute motion vector

R2

3 Compute alignement vector

6 Create new object

4 Align objects and regions

7 Remove objects

Figure 4.13.: Computation Steps of the tracking algorithm

4.4.2. Aligning Object Fragments in the Ground Plane Applying Heuristics Depending on the actual pose of the human body, it is frequently represented by multiple object regions. Especially walking persons will produce at least two possible positions due to the distance and the separation of feet, see the red circles in fig 4.14, thus regions related to the same object have to be combined in a post processing step. Therefore the detected regions ORj (t), defined by the center point p~j (t) and its radius rj (t) have to be matched to the related object Oi ORj → Oi (4.53) As the alignment of regions is quite difficult in the spatial domain, the combination is performed during tracking based on simple heuristics. A look ahead strategy [Kha06] with blobs stacked in a space time volume and subsequent analysis using normalized cuts [Shi00] could be applied. As in on line applications the future is not known only elapsed tracks are required. Inspired by Kalman filters [Kal60] a simple but yet effective tracking, as visualized in fig. 4.13 can be performed [Ars07a, Hof07]. Each object Oi (t) is uniquely described by its centre point p~i (t), motion vector ~vi , radius ri (t) and its ID. At first a copy of previously detected object positions is created with Oi (t) = Oi (t − 1).

(4.54)

In case the object age is older than 2 frames, its new position is estimated with p~i,est (t) = p~i (t − 1) + ~vi (t − 1),

112

(4.55)

4.4. Planar Homography 1

Weight of region

O1 r1

r2

r3 0

0

Distance of region and object center

Figure 4.14.: Weights for three regions aligned to an object and resulting alignments on PETS2006 data

where ~vi (t − 1) is a motion vector ~vi (t) = p~i (t) − p~i (t − 1),

(4.56)

which has been computed in the previous frame. In the next step the detected object regions ORj with center p~j are aligned to the predicted object positions p~i,est . Therefore a vector k~i , describing the relationship between regions and objects, is computed. This is modeled by a Gaussian normal derivation, where regions located nearer to the object’s center are assigned a higher weight than the ones further away. ~ki is thereby computed by  1  ki  2     ki  ||~pi,est (t) − p~j (t)||2 j ~   with ki = exp − ki =  , (4.57)  σ2  ...  kij where σi represents the standard deviation and is set to σ = [0.3, . . . , 0.5], which approximates the radius of a human person. Fig. 4.14 illustrates the creation of the alignment vector ~ki in case of three present regions. The result would be: ~k1 = [0.91 0.85 0.05]T , indicating 2 of 3 regions belonging to object O1 (t). The objects’ positions can now be determined exactly as the weights in the alignment vector are known. Regions with a large weight are therefore favored in contrast to low weighted ones. Additionally the surface area is also considered, as small objects tend to be false positives. Considering all these constraints, the new object position is computed with P 0.1pi,est (t) + j 2π(rj )2 kij p~j (t) p~i (t) = , (4.58) P 0.1 + j 2π(rj )2 kij where the constant factor 0.1 is used to include the predicted position into the alignment procedure. Now it is checked whether all regions are assigned to an existing object or if

113

Chapter 4. Multi Camera Object Detection and Tracking

Figure 4.15.: Track consistency applying simple heuristics in an example from the PETS2006 challenge.

they are representing a new one. In case the distance between a region ORi and an object Oj is larger than a predefined threshold θlow , here the object size ||~pj − p~i || > ri (t) ⇒ create object,

(4.59)

a new object is created and a new ID is assigned. As last step objects disappearing from the scenery have to be detected. To overcome the problem of a short tracking loss an object is only removed from the list, if no regions are assigned a couple of frames, here after five frames without assignment: ||~pi − p~j || > ri + rj

&&

f rames > θlow ⇒ remove object

(4.60)

Exemplary results for the alignment process are given in fig. 4.15, where the small red

114

4.4. Planar Homography

P1 Ps

P0

PF

Figure 4.16.: Computation of the epipolar line an the intersection with an orthonormal plane

regions are correctly assigned to the objects visualized by circles with a fixed radius of 0.25m in different shades of blue. Each tone is assigned to a unique person ID, which is used to create trajectories on the ground plane. The maintenance of the assigned IDs is illustrated in fig. 4.15 with a short sequence from the PETS2006 challenge. As it can be seen the persons are tracked consistently and only the actual feet positions are visualized in the images. In case the feet are located outside the field of view the feet position is only visible in the top view.

4.4.3. Estimating the height of a Detected Object The tracking approach presented in sec. 4.4.2 tends to create ID changes in crowded scenarios, especially in case two persons stand very close to each other. Therefore it seems appropriate to integrate a second, possibly more robust, cue. Even though only transformations into the ground plane are available up to now, the object’s height can be approximated with only one more transformation, as illustrated in fig. 4.16. This procedure could help to separate people based on differeing sizes. Assuming that a blob’s lowest point represents the feet, the blob’s topmost point should be the tip of the head and its world coordinates are supposed to be located in the area above the feet. This point is located on a line g, which can be determined with two points 

x0





x1 − x0



    g = p~0 + λ~v = p~0 + λ(~p1 − p~0 ) =  y0  + λ  y1 − y0  . z0

(4.61)

z1 − z0

The blob’s highest point is transformed into the ground plane at z0 = 0 m and into z1 = 1.2 m, defining the points p1 and p0 of the line. As the X− and Y − coordinates of feet and head do not necessarily correspond, a plane π, orthonormal to the ground, is created with E : ~n~x + D = 0

(4.62)

115

Chapter 4. Multi Camera Object Detection and Tracking

Figure 4.17.: Estimated shape and height of a person in the Pets2007 dataset, Scene 08, Frame 281

Its normal vector has the same direction, as the line g       v1 x1 − x0 n1       ~n =  v2  =  y1 − y0  =  n2  0

0

(4.63)

n3

and ~x is set to the foot locations on the ground   xF   ~x =  yF 

(4.64)

0 Applying (4.63) and (4.64) D is given by D = −n1 xF − n2 yF − n3zF = −n1 xF − n2 yF

(4.65)

The parameter λ is thereby computed by intersecting line g with plane π λ=

−n1 x0 − n2 y0 − D n1 v1 + n2 v2

(4.66)

With (4.61) and (4.66) the coordinates of the plane line intersection can be determined with   xS = y0 + λv1   (4.67)  yS = x0 + λv2  . zS = z0 + λv3 The height zs is now only depending on the chosen height zw of the second point, as the lower one has been set to Zw (P0 ) = 0m. Hence the object height can be set to zs = z0 + λv3 = 0 + λv3 = λv3 .

(4.68)

Depending on the actual camera configuration and extracted foreground regions the results from the various available views may differ extremely. This is usually caused by multiple

116

4.5. Multi Layer Homography objects, which are detected as one huge blob. By computation of the height in all available views and averaging the results an approximated one is given. Fig. 4.17 shows the computed height of a person in the PETS2007 dataset, where 1unit is scaled to 1.8m in world coordinates. With the same technique it is possible to project the object shape of an object onto a plane at the estimated location [Woh08].

4.5. Multi Layer Homography The major drawback of planar homography is the restriction to the detection of objects touching the ground. This will lead to the following unwanted phenomena: ˆ Pedestrians are split into two objects: a human person usually has two legs and therefore two feet touching the ground, but unfortunately not necessarily positioned next to each other. Walking people will show a distance between their feet of up to one meter. Computing intersections in the ground plane consequently results in two object positions per person. Fig. 4.15 illustrates the detected regions for all four persons present in the scene. This can be solved by combining two or more regions into one candidate, but at the same time additional ambiguities will be created if multiple persons are standing next to each other. So the arising question to answer would be: which foot belongs to whom? ˆ Low localization performance: while people are walking along a place the feet are not necessarily touching the ground, which is a natural phenomenon. As it is assumed that the blobs‘ lowest points are touching the ground, there is a shift in the localization. Furthermore in some cases the foot further away from the camera is being picked as lowest point, as the other one is being lifted at the moment. ˆ Lack of spatial information: as only the position of the feet is determined, remaining information on body shape and posture is dismissed. As a consequence distances between objects and individuals cannot be determined exactly. For instance a person might try to reach an object with her arm and be just few millimeters away from touching it, though the computed distance would be almost 1meter.

4.5.1. Object Localization Utilizing Multi Layer Homography

To resolve these limitations, it seems reasonable to reconstruct the observed scenery as a 3D model. Various techniques have already been applied for this purpose: recent works mostly deal with the composition of so called visual hulls from an ensemble of 2D images [Lau94, Kut98], which requires a rather precise segmentation in each smart sensor and the use of 3D constructs like voxels or visual cones, which are subsequently intersected in the 3D world. A comparison of scene reconstruction techniques can be found in [Sei06]. An approach for 3D reconstruction of objects from multiple views applying homography has already been presented in [Kha07].



Figure 4.18.: a) Computation of layer intersections using two points. b) Transformed blobs in multiple layers. c) 3D reconstruction of a cuboid.

All required information can be gathered by fusion of silhouettes in the image plane, which can be resolved by planar homography. With a large set of cameras or views a quite precise object reconstruction can be achieved, which is not required for this work. This approach can be altered to localize objects and approximate the occupied space with low additional effort [Ars08b], which will improve the detection and tracking performance. The basic idea is to compute the intersections of transformed object boundaries in additional planes, as illustrated in fig. 4.18b). This transformation can be computed rapidly by taking epipolar geometry into account, which is computationally more efficient than computing the transformation for each layer separately. All possible transformations of an image pixel I(x, y) are located on an infinite line g in world coordinates $(x_w, y_w, z_w)$. This line can be described by two points $p_1$ and $p_2$ and the following equations

$$x_w(z_w) = G_x z_w + b_x, \qquad y_w(z_w) = G_y z_w + b_y, \qquad (4.69)$$

with the gradients $G_x$ and $G_y$

$$G_x = \frac{x_2 - x_1}{z_2 - z_1}, \qquad G_y = \frac{y_2 - y_1}{z_2 - z_1} \qquad (4.70)$$

and the offsets $b_x$ and $b_y$ in the direction of X and Y

$$b_x = x_1 - \frac{x_2 - x_1}{z_2 - z_1} z_1, \qquad b_y = y_1 - \frac{y_2 - y_1}{z_2 - z_1} z_1. \qquad (4.71)$$

By computing the homography in just two layers, as illustrated in fig. 4.18a), eq. 4.69 can be simplified to

$$x_w(z_w) = (x_w(z_1) - x_w(z_2))\, z_w + x_w(z_2), \qquad y_w(z_w) = (y_w(z_1) - y_w(z_2))\, z_w + y_w(z_2) \qquad (4.72)$$

by choosing $z_1 = 1\,$m and $z_2 = 0\,$m.


Figure 4.19.: Detection example on PETS2007 data projected in all 4 camera views. All persons, except the lady in the ellipse, have been detected and labeled consistently. The error occurred already in the foreground segmentation.

Therefore only two transformations, which can be precomputed, are required for the subsequent processing steps. This procedure is usually only valid for a linear stretch in space, which can be assumed in most applied sensor setups. The procedure described in sec. 4.4.1 is applied for each desired layer, resulting in intersecting regions at various heights, as illustrated in fig. 4.18 b) and c). The object's height is not required, as the polygons only intersect within the region above the person's position. In order to track humans it has been decided to use ten layers with a distance of 0.20 m covering the range of 0.00 m - 1.80 m, as this is usually sufficient to separate humans; only the head would be missing in case the person is by far taller. The ambiguities created by the planar homography approach are commonly resolved by the upper body. Therefore the head, which is usually smaller than the body, is not required. The computed intersections have to be aligned in a subsequent step in order to reconstruct the objects' shapes. Assuming that an object does usually not float above another one, all layers can be stacked into one layer by projecting the intersections of the single layers onto the floor. This way a top view is simulated by a simple summation of the pixels $\vec{P} = (x_w, y_w, z_w)$ of all layers into one common ground floor layer $GF(x, y)$:

$$GF(x_w, y_w) = \sum_{l=1}^{n} \vec{P}(x_w, y_w, z_l). \qquad (4.73)$$

Subsequently a connected component analysis is applied in order to assign unique IDs to all possible object positions in the projected top view. Each ID is then propagated to the layers above the ground floor, providing a mapping of the object regions in the single layers. Besides the exact object location, additional volumetric information, such as height, width and depth, is extracted from the image data, providing a more detailed scene representation than the simple localization.


Figure 4.20.: Creation of false positive regions in case multiple objects are located in a scene: Example for the creation of false positives in the ground plane and floating ghost objects due to multi layer homography.

Figure 4.19 shows detected object positions warped into the top view of the scene and the multi-layer representation of the scene. For visualization purposes the extracted object regions are subsequently projected back into the single image views, see fig. 4.19, where cylinders approximate the object volume. The operating area has been restricted to the predefined area of interest, which is the region with the marked coordinate system. As can be seen, occlusions can be resolved easily without any errors. One miss, the lady marked with the black ellipse, occurred because of an error in the foreground segmentation: she had been standing in the same spot even before the background model was created and has therefore not been detected.
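The complete localization step can be sketched as follows. This is a minimal illustration, assuming for each camera a set of foreground contour polygons and two precomputed homographies mapping the image plane into the world planes z = 0 m and z = 1 m; the grid rasterization, the OpenCV usage and all names are assumptions rather than the original implementation.

import numpy as np
import cv2

def contour_at_height(contour_px, H0, H1, z):
    # Transform a blob contour into the layer at height z (in meters) by
    # linear interpolation between its projections into z=0 and z=1, eq. (4.72).
    pts = contour_px.reshape(-1, 1, 2).astype(np.float32)
    w0 = cv2.perspectiveTransform(pts, H0).reshape(-1, 2)
    w1 = cv2.perspectiveTransform(pts, H1).reshape(-1, 2)
    return w0 + z * (w1 - w0)

def localize(contours, H0s, H1s, layers, grid_shape, world_to_grid):
    # contours[n]: list of blob contours of camera n in image coordinates.
    ground = np.zeros(grid_shape, np.int32)
    for z in layers:
        views = []
        for cont_n, H0, H1 in zip(contours, H0s, H1s):
            layer_img = np.zeros(grid_shape, np.uint8)
            for c in cont_n:
                poly = world_to_grid(contour_at_height(c, H0, H1, z))
                cv2.fillPoly(layer_img, [poly.astype(np.int32)], 1)
            views.append(layer_img)
        # Intersection of all fields of view marks occupied space in this layer.
        intersection = np.minimum.reduce(views)
        ground += intersection          # stack layers onto the ground, eq. (4.73)
    # Unique IDs for all candidate object positions in the simulated top view.
    num, labels = cv2.connectedComponents((ground > 0).astype(np.uint8))
    return ground, labels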

4.6. False Positive Elimination and Handling

The ability to detect partially occluded objects applying homography with high accuracy comes at the cost of a possibly large number of false positives [Kha06] or so called ghost objects [Mic08]. In case only one single object is present in the scene no errors will occur. As soon as two or more objects are visible, additional post processing steps have to be performed. Depending on the constellation of objects and cameras, the boundaries of the transformed blobs may create additional intersections. These usually appear in regions covered by the objects, which are hence not visible in all views, as illustrated in fig. 4.20. Ambiguities like these can be resolved by adding further fields of view. It has thus been commonly agreed to use more cameras to reduce the number of false positives and increase the number of true positives [Kha06, Esh08]. In real world applications, however, the amount of hardware and the computational effort are supposed to be kept as low as possible. Furthermore, ghost objects are created in higher layers, as transformations are drawn into the direction of the camera location. Hereby usually a floating ghost object is created.


Figure 4.21.: Rule based fusion approach for object localization in multiple layers

4.6.1. Combining Multiple Layers For False Positive Detection

Ghost objects most frequently appear because of the upper parts of transformed blobs, which intersect with other blobs. Two of those may intersect at the transformed hip or head position of a person. Applying multi layer homography at the possible object location would create small objects occluded by larger ones, which also have to be considered. Therefore a rule based approach combining detections in multiple layers, illustrated in fig. 4.21, has been presented in [Ars07a]. Assuming that objects have to have intersections in at least two planes and a minimum size, which is small enough to also cover luggage items, false positive candidates can be detected. Additionally, one of the intersections has to be located at layer height l = 0 m or at l = 0.3 m, which ensures that objects touch the ground. A minimal sketch of this rule set is given below.
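The following sketch illustrates such a rule based check; the representation of a candidate as a dictionary mapping layer heights to intersection areas, as well as the thresholds, are assumptions chosen purely for illustration.

# Candidate representation (assumed): {layer_height_in_m: intersection_area_in_m2}
MIN_AREA = 0.05          # assumed minimum size, small enough to cover luggage items
GROUND_LAYERS = (0.0, 0.3)

def is_valid_object(candidate):
    # Rule 1: intersections of sufficient size in at least two layers.
    supported = [z for z, area in candidate.items() if area >= MIN_AREA]
    if len(supported) < 2:
        return False
    # Rule 2: one of the intersections has to lie at l = 0 m or l = 0.3 m,
    # i.e. the object has to touch the ground.
    return any(z in GROUND_LAYERS for z in supported)

# Everything failing the rules is treated as a false positive candidate,
# cf. sec. 4.6.4, rather than being discarded immediately.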

4.6.2. Cutting Blobs to Remove Floating Ghost Objects

Applying the multi-layer approach to complete blobs tends to produce even more ghost objects than the purely ground plane based one. These usually appear in upper layers as floating objects, as can be seen in fig. 4.20. When the transformations are moved to a higher layer, they are drawn closer to the camera position and therefore tend to accidentally create additional intersections where no object is present. Simply removing all floating objects is not suitable, as such a region might actually be a real object detected in a visual hull based approach [Fra03]. Errors like these can be avoided by cutting the blobs prior to the actual transformation. Therefore a blob's height in world coordinates $h_w$ is estimated. Assuming that a blob's lowest point is touching the ground, its height $h_i$ is computed in image coordinates as the difference of the image Y-coordinates of the blob's lowest point ($Y_{min}$) and its highest point ($Y_{max}$),

$$h_i = Y_{min} - Y_{max}. \qquad (4.74)$$

The height of the required layer $l_i$ in image coordinates can subsequently be determined by weighting the blob's height with the ratio of the layer's height $l_w$ and the object size in world coordinates

$$l_i = \frac{l_w}{h_w}\, h_i. \qquad (4.75)$$

Finally the blob is cut along a straight line at the level

$$l_{cut} = Y_{min} - l_i. \qquad (4.76)$$

Figure 4.22.: From left to right: the original image with its foreground regions visualized (red) in the binary and the real image. The result of cutting the blob at height l = 0.3 m is indicated again in both image types.

The result of this procedure is visualized in fig. 4.22, where the red polygon indicates the original object boundaries and the green one shows the region transformed after cutting the blob at the layer l = 0.3 m. As can easily be seen, no perspective correction has been performed, i.e. the blob is simply cut along a horizontal image line instead of a line exactly parallel to the ground floor. Extensive experiments have shown that this simplification delivers similar results. All floating ghost objects have been removed by applying this method.
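One plausible reading of this procedure is sketched below, following eqs. (4.74)-(4.76) with the usual image convention of the Y axis pointing downwards, so that Y_min denotes the image row of the blob's lowest (feet) point; the mask representation and all names are illustrative assumptions.

import numpy as np

def cut_blob(mask, h_w, l_w):
    # mask: binary foreground mask of a single blob (rows correspond to image Y).
    rows = np.where(mask.any(axis=1))[0]
    y_feet, y_head = rows.max(), rows.min()   # lowest and highest blob point
    h_i = y_feet - y_head                     # blob height in pixels, eq. (4.74)
    l_i = l_w / h_w * h_i                     # layer height in pixels, eq. (4.75)
    l_cut = int(round(y_feet - l_i))          # cut level, eq. (4.76)
    cut = mask.copy()
    cut[l_cut:, :] = 0      # discard the part below the layer before warping it
    return cut

# For the layer l_w = 0.3 m of a person with estimated height h_w = 1.8 m,
# roughly the lowest sixth of the blob would be removed prior to transformation.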

4.6.3. Applying Geometrical Constraints

In contrast to the blob cutting based approach, the false positive elimination based on geometrical constraints additionally examines all possible object locations in the field of view of each camera [Mic08], [Ars08c]. Experience has shown that a real object $O_i$ is commonly by far larger than a false positive candidate $FP_i$, as the latter are created only by parts of the corresponding foreground blobs, which are located above the approximated feet region. Additionally, ghost objects are usually occluded by real objects. Both constraints can be checked by transforming all possible object locations $O_{wi}$ in world coordinates back into all n views

$$O_{ni} = H_{\pi n}\, O_{wi}, \qquad (4.77)$$

as illustrated in fig. 4.23. This provides detailed information on the object arrangement in each view. With a rule based approach the exact positioning is examined. If a back projected region $O_{ni}$ is located within another one and its surface $A_{ni}$ is smaller than that of the intersecting one in all views, that is

$$\forall n: \; O_{ni} \in O_{nj} \wedge A_{ni} < A_{nj} \quad \text{with } j \neq i, \qquad (4.78)$$


Figure 4.23.: Left: The original image and the resulting 3D reconstruction. In total eleven objects have been detected. Right: Detections projected into the 3rd view. The white regions indicate false positives. Five false positives have been removed, as shown in dark blue on the lower right [Hri08].

it can be assumed that $O_{wi}$ is a false positive candidate. Fig. 4.23 shows an example taken from the PETS2007 data set, which has been used as an indicator for the algorithm's performance. The pink regions indicate eleven possible object locations, although only five persons are present in the scene. By removing the backprojections fulfilling eq. 4.78, drawn in white, five out of six ghost objects are eliminated. The last remaining one could not be removed, as it already originates from a faulty foreground segmentation. Due to the lack of labeled material no quantitative evaluation could be conducted. Tests have shown that the number of overall detected objects has been decreased by 35% in all eight PETS2007 scenarios, without any effect on the tracking evaluation.
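The containment and size test of eq. (4.78) can be sketched with axis-aligned bounding boxes as follows; the box representation and the helper names are assumptions, as the original work operates on the back-projected regions themselves.

def contains(outer, inner):
    # Axis-aligned boxes as (x0, y0, x1, y1); True if inner lies within outer.
    return (outer[0] <= inner[0] and outer[1] <= inner[1] and
            inner[2] <= outer[2] and inner[3] <= outer[3])

def area(box):
    return max(0, box[2] - box[0]) * max(0, box[3] - box[1])

def is_ghost(i, boxes_per_view):
    # boxes_per_view[n][k]: back-projection of world candidate k into view n, eq. (4.77).
    # Candidate i is marked as a false positive candidate if, in every view,
    # it lies inside some larger candidate j, cf. eq. (4.78).
    for boxes in boxes_per_view:
        covered = any(contains(boxes[j], boxes[i]) and area(boxes[i]) < area(boxes[j])
                      for j in range(len(boxes)) if j != i)
        if not covered:
            return False
    return True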

4.6.4. False Positive Handling

The detected regions are not necessarily false positives, as there is no real evidence for this. It is not possible to make a statement on areas occluded in all fields of view, as there could be a hidden object. Thus a new class, the false positive candidate, is introduced and taken into account by the tracking application.


Figure 4.24.: Handling multiple models for one object. Where the standard Kalman filter fails (first row), the additional model can reassign identities if a tracking alternative seems more reasonable.

If a new candidate appears, there are three cases the algorithm has to consider, looking back at past events:

• A tracked object disappears: in case an object disappears, the tracking process is continued with a false positive candidate fulfilling the required constraints.

• A new object appears: if a new object appears in the middle of the observed region and is not created by a forking trajectory, a false positive candidate region is updated to an object region.

• No object related events are detected: the candidate is included in the tracking process and still treated as a candidate.

4.7. Multi Camera Object Tracking

By warping the object locations into the ground plane, most of the available information is lost. The applied algorithm therefore has to label the detected objects consistently using only their actual coordinates and previously gathered information. Kalman filters [Kal60] have shown reliable results in tracking objects in 2D. Consistent labeling, however, requires smooth object movement without jumps, short disappearances or suddenly appearing objects. These constraints cannot be guaranteed by the detection module, where especially disappearances often occur. Fig. 4.24 illustrates the arising problem: in case an already tracked object disappears for a few frames, its ID might be assigned to a new object appearing near the original position. After reappearing, the original object would be handled as a new object. The resulting ID change cannot be reverted afterwards.


Figure 4.25.: Frame 304 in scene S03 of the PETS2007 dataset. Both the original detected positions and the interpolated trajectories are drawn on the ground floor. The object number 12 on top left in the top view is a false positive candidate object, not included into the tracking process.

Even if no new object interferes, an object disappearing for a while might be lost forever. These events usually occur in very crowded situations, when tracking fails or false positive candidates are not removed.

Therefore a multiple hypothesis tracking with a memory of past events and the introduction of so called object candidates is applied, as illustrated in fig. 4.24. In case an object disappears for a few frames, a newly appearing object is assigned to an existing ID if the Euclidean distance is smaller than a given threshold and its probability of being produced by the corresponding Kalman process is larger than another threshold. Being a potentially new object, a temporary maintenance counter $c_{tmp}$ is set to zero without any effect on the age counter $c_{age}$, and it is treated as an object candidate with low priority. The region near the previous predictions is memorized and a high priority is assigned to this region. A reappearing object will consequently also be matched to all existing trajectories and to the ones left behind by objects that vanished up to $t_{max} = 15$ frames ago. Although it is further away from the new prediction, it is considered as a candidate because it is located in a highly prioritized region. Additionally, previous predictions are reviewed for plausibility. In case the new candidate fits these better and has a higher priority, it can reclaim its original ID.
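A strongly simplified sketch of this candidate handling is given below; the data structures and thresholds are assumptions intended only to illustrate the reclaiming of a lost ID.

import numpy as np

T_MAX = 15          # frames a vanished track is remembered
DIST_THRESH = 0.5   # assumed gating distance in meters

class LostTrack:
    def __init__(self, track_id, last_prediction, frame):
        self.track_id, self.pos, self.frame = track_id, np.asarray(last_prediction), frame

def assign_detection(det_pos, frame, lost_tracks, next_id):
    # det_pos: (x, y) of a newly appearing detection in world coordinates.
    det_pos = np.asarray(det_pos)
    # Prefer the memorized, highly prioritized regions of recently lost tracks.
    candidates = [lt for lt in lost_tracks
                  if frame - lt.frame <= T_MAX
                  and np.linalg.norm(det_pos - lt.pos) < DIST_THRESH]
    if candidates:
        best = min(candidates, key=lambda lt: np.linalg.norm(det_pos - lt.pos))
        lost_tracks.remove(best)
        return best.track_id, next_id        # original ID is reclaimed
    # Otherwise start a low priority object candidate with a fresh ID.
    return next_id, next_id + 1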

An exemplary result is illustrated in fig. 4.25 with frame 304 of scene S03 of the PETS2007 dataset. Both the originally detected positions and the interpolated trajectories are drawn onto the ground floor.


4.7.1. Combining 2D and 3D Tracking Methods

Though the homography framework provides an effective method to exactly localize objects in 3D space and resolve occlusions, one side effect must not be neglected: its performance still depends on the segmentation in each smart sensor. Objects merging in all views will consequently create a merger in the 3D domain. Due to the lack of texture or other significant patterns it is hard to reacquire the individuals and maintain separate tracks. Therefore a set of additional distinctive features has to be introduced. The most obvious solution is the addition of image data, which has previously only been used to detect foreground regions, and to perform 2D tracking separately in each image [Ars08c], where in fact any approach can be used. 2D objects are initialized in all n views by simple detection of foreground regions $FG_{nj}$. In a subsequent step the corresponding regions are determined once more with the homography framework. Having consistent labels for the blobs in each view, the 2D tracker, here based on deformable feature graphs [Tan05], is initialized for all foreground regions recognized as objects. While performing the 3D tracking as described in chapter 3, the projections into the single views are checked for consistency. This is done by comparing a simple majority vote of the n sensor decisions with the homography tracker's output. In case one of the decisions is outnumbered, the assigned ID is changed to the majority's vote.
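The consistency check can be sketched as a simple majority vote; the function below is an illustrative assumption of how the per-view 2D decisions and the 3D homography tracker's decision could be fused.

from collections import Counter

def fuse_ids(view_ids, homography_id):
    # view_ids: ID assigned to the object by the 2D tracker of each camera view;
    # homography_id: ID assigned by the 3D homography tracker.
    votes = Counter(view_ids)
    votes[homography_id] += 1
    majority_id, _ = votes.most_common(1)[0]
    # If the 3D decision is outnumbered, it is changed to the majority's vote.
    return majority_id

# fuse_ids([3, 3, 7, 3], 7) -> 3: the 2D trackers overrule the homography tracker.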

4.8. Tracking Evaluation

As stated earlier, various tracking algorithms have been presented in the past and have consequently attracted the interest of the industry. A common evaluation strategy has to be defined not only for the comparison of state of the art systems but also to reveal weaknesses and allow further development of the algorithms. Therefore a set of metrics has to be defined, analyzing tracking results both on a frame wise detection level [LM08] and on trajectory level [Yin07]. To provide meaningful studies, the video test data has to be characterized by means of complexity [Ell02], ranging from easy sceneries to difficult ones. These categories were also used for the classification of the PETS2007 scenarios, which were the basis for developing and testing the multi camera tracking framework presented in this work. Unfortunately, ground truth has not been provided for all persons present in the scene, but only for individuals related to a specific event. Therefore it is difficult to provide exact numbers on true positives, false positives, false negatives and split or merged trajectories. Hence, the sparse data provided has been evaluated on a frame basis in the first part, determining the average localization error

$$\epsilon = \frac{1}{T}\sum_{t=1}^{T} \sqrt{(x_g(t) - x_d(t))^2 + (y_g(t) - y_d(t))^2} \qquad (4.79)$$

by averaging the distance between the ground truth position $(x_g(t), y_g(t))$ and the detected position $(x_d(t), y_d(t))$ over time $t$. Trajectory alignment is performed by grouping the detection with the lowest distance to the ground truth labels.


Figure 4.26.: Distance of ground truth and detected object for both objects annotated in scene S03. The average distance is 0.13 m. Both persons are confused after they merge and split in frame 524, visualized with the changing plot style, denoting their ID.

Respecting the tracking ID of the assigned objects, the number of ID changes, i.e. ID(t) ≠ ID(t − 1) within a trajectory, can be determined. Both measures are shown for scene S03 in fig. 4.26. The displacement error in each annotated frame, averaging 0.13 m over all 8 scenes, is given on the y-axis, while ID changes are indicated by changes of the line type. In this case the reason for the ID change can be seen easily: both individuals are mixed up after a merger in frame 524, which could not be resolved in the 3D tracking domain, and ID1 is additionally changed another time at the end of the track. These ID changes can be partially eliminated by combining 2D and 3D tracking approaches [Ars08c], such as the tracking of deformable feature graphs [Tan05]. Table 4.2 shows detailed numbers for all approaches presented in this work. The localization performance has constantly risen from tracking in the ground plane to the multi layer approach, with 0.12 m less displacement, while a further increase of the number of layers had little influence. Likewise, the number of ID changes has been decreased from 30 to 18 by introducing multiple layers. Utilizing height as an additional cue for planar tracking resulted in only two fewer ID changes, due to the very similar size of the persons walking in the scene. Combining the 3D tracking approach with 2D methods eliminated 16 further ID changes, leaving only two behind.
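Both evaluation measures are straightforward to compute once detections have been aligned with the ground truth; the following sketch assumes aligned arrays of positions and IDs and is not taken from the original evaluation scripts.

import numpy as np

def average_localization_error(gt_xy, det_xy):
    # gt_xy, det_xy: arrays of shape (T, 2) with aligned ground truth and
    # detected positions in world coordinates, cf. eq. (4.79).
    return float(np.mean(np.linalg.norm(gt_xy - det_xy, axis=1)))

def count_id_changes(ids):
    # ids: sequence of tracker IDs assigned to one ground truth trajectory.
    ids = np.asarray(ids)
    return int(np.sum(ids[1:] != ids[:-1]))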


Figure 4.27.: Exemplary tracking scene from the PETS2007 data set. IDs are maintained within all four available views until a non-resolvable merger appears. The blue false positive is created by an error in the foreground segmentation.


Approach                 δ         ID changes
Ground plane             0.25 m    30
Ground plane + height    0.25 m    28
5 planes                 0.15 m    18
10 planes                0.13 m    18
2D + 10 planes           0.13 m    2

Table 4.2.: Evaluation of homography approaches on the complete PETS2007 data set. The layer based approaches showed less displacement δ and ID changes than tracking in the ground plane. Combining 2D and 3D tracking delivered most reliable ID maintenance.

Fig. 4.27 illustrates the tracking performance and ID maintenance over four overlapping fields of view. As can be seen, tracking is performed correctly as long as no merger occurs. In this case even the combined approach failed, leaving an ID change behind. As only a limited set of colors has been used for visualization, newly appearing persons are eventually assigned a color which has been used previously, such as the red bounding box.

4.9. Closure

In this chapter an extension to Khan's [Kha06] planar homography tracking approach, the transformation of detected foreground blobs into multiple layers, has been presented. This approach to object localization does not rely on a person's feet position and therefore does not erroneously segment a person into multiple parts. Consequently, the typically arising alignment problem could be overcome, as the entire body region is reconstructed, which also resolves some ambiguities. The approach works reliably enough for surveillance applications in scenes without overcrowding. Nevertheless it cannot be applied out of the box, as calibration is still required. Therefore it is advisable to calibrate cameras automatically [Ste99], where trajectories in multiple views are used for a rough planar alignment. This procedure is preferred to other approaches that need to find point correspondences with e.g. SIFT features [Che07]: depending on the settings of the cameras the illumination can vary on a large scale, making it almost impossible to find point correspondences. Furthermore, the frequent appearance of ghost objects created by an unfavourable arrangement of cameras and persons in a scene has been successfully addressed. It could be shown that most of the usually appearing false positives can be removed by preprocessing and the additional consideration of geometrical constraints. As could also be demonstrated, the combination of 2D and 3D tracking techniques further significantly enhances the handling of ambiguities. However, in the future these two complementary approaches could be incorporated at an earlier stage, trying to perform tracking with a reconstructed 3D figure. Therefore it would be necessary to detect textural


correspondences in all available 2D views and match these onto a model. Such a model could be constructed applying the information gathered from the 3D reconstruction approach in multiple layers and the introduction of a so called stick-man model [Pop07]. Moreover, the introduction of sensors other than cameras, as shown e.g. by Ahlberg et al. [Ahl08], could help in the segmentation of groups, which could especially be enhanced by the use of 3D sensors, such as a photonic mixture device or thermal infrared sensors. Even the use of acoustic signals could further increase localization precision, as shown in [Zho08]. Furthermore, the tracking has to be extended to non overlapping camera regions. Therefore a re-identification of humans, for instance based on colour histograms [Gra07b], could be used, or detected trajectories could be linked in adjacent camera regions [Kay08]. Once more appropriate training material becomes available, the use of more sophisticated behaviour detection techniques should become possible, which could also include the analysis of gait [Wan06] and gestures in the 3D domain [Wu07b].


Chapter 5. Behavior Detection Applications

Various methods for the robust detection and tracking of objects have been presented in the previous sections, whereby various levels of detail have been considered for object representation. The created trajectories and changes in motion patterns can now be used by a behavior interpretation module, which subsequently either triggers an alarm signal or reacts to the observed activity [Hu04]. This module basically matches an unknown observation sequence against stored reference samples and performs a comparison. The basic problem is to find a meaningful representation of human behavior, which is a quite challenging task even for highly trained human operators, who indeed should be experts in the field. A wide range of classifiers based on statistical learning theory has been employed in the past in order to recognize different behaviors. The probably most popular approaches involve the use of dynamic classifiers, such as HMMs [Oli00] or Dynamic Time Warping (DTW) [Tak94]. Nevertheless, static classifiers, e.g. SVMs or Neural Networks (NN), should be further explored, as these may outperform dynamic ones. Due to the high complexity of human behavior analysis there is currently no methodology available that can be utilized for every application scenario. Therefore this work focuses on the evaluation of existing classification algorithms and their applicability to a wide range of scenarios.

5.1. Facial Expression Recognition

The analysis of facial expressions is considered one key aspect in surveillance and human machine interaction applications. Humans tend to express moods or interest rather unconsciously with their face, which is therefore considered a meaningful additional source for behavior modeling. Even spoken language can be analyzed in further detail, as a person's mood does not necessarily correspond to the spoken content [Rus97]. While a wide range of approaches has already been explored [Pan00], this work focuses on an optimal real time capable solution for security relevant scenarios. The presented approaches will be evaluated with two databases.


Figure 5.1.: 20 defined fiducial points (left) and the geometrical description (right)

5.1.1. Feature Extraction

In order to achieve high precision and keep the computational effort as low as possible, it has been decided to work with a considerably small set of meaningful feature points located in the face. This way large parts of the face can be discarded and the required amount of data can be reduced drastically. Based on the physiology of the face, the MPEG-4 standard defines feature points which are possibly relevant for facial expressions [Ost98]. Out of the set of predefined points, 20 have been chosen to describe faces, as these can be detected automatically and are sufficient to describe facial movement. All required fiducial points are illustrated in figure 5.1 on the left hand side. The extracted coordinates do not generalize faces in a person independent fashion: the size of eyes, mouth and eyebrows normally varies from person to person, and the faces' orientation has not yet been considered. In a first step, the size of the faces is therefore normalized and they are rotated into an upright position. For this purpose the angle between the eye corners is computed, as these are usually located at the same height. Further, all images are scaled to the same size by aligning the distance between both eyes to 200 pixels. Although the data is normalized, a large person dependency and a large variance between the faces can still be observed. Therefore the distances between relevant feature points are determined and a graph is created, which describes the geometrical relationships between the feature points. The 35 chosen distance measures are illustrated in figure 5.1 on the right hand side. As only points with potentially differing distances are required, a representation with a fully connected graph seems unnecessary. Considering that image sequences are available, it is reasonable to use information retrieved from motion within the face. This can be modeled either by the change of the point positions or by the change of distances between nodes in the mesh of feature points. Speed and acceleration can be derived by computing the first and second derivative over time.


Figure 5.2.: Three different states within one yawn

As a result, each face in a frame is represented by a total of 125 features: the 20 x/y-coordinates (p), 35 distance measures (d), 35 speeds (s) as well as 35 accelerations (a).
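The normalization and the geometric feature computation can be sketched as follows; the eye corner indices, the chosen point pairs and the use of numpy are assumptions for illustration purposes only.

import numpy as np

def normalize_face(points, left_eye=0, right_eye=1, target_dist=200.0):
    # points: (20, 2) array of fiducial point coordinates of one frame.
    d = points[right_eye] - points[left_eye]
    angle = np.arctan2(d[1], d[0])            # rotate eye corners onto one height
    c, s = np.cos(-angle), np.sin(-angle)
    R = np.array([[c, -s], [s, c]])
    rotated = points @ R.T
    scale = target_dist / np.linalg.norm(d)   # align inter-eye distance to 200 px
    return rotated * scale

def geometric_features(frames, pairs):
    # frames: list of (20, 2) normalized point sets; pairs: 35 index tuples (i, j).
    dists = np.array([[np.linalg.norm(f[i] - f[j]) for i, j in pairs] for f in frames])
    speed = np.gradient(dists, axis=0)        # first derivative over time
    accel = np.gradient(speed, axis=0)        # second derivative over time
    return dists, speed, accel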

5.1.2. Recognition of Facial Actions

As both databases used for the evaluation of the implemented algorithms, namely FEEDTUM (the Facial Expression and Emotion Database recorded at TUM) and SAFEE-FAC (the Facial Action Corpus recorded for the SAFEE project), see sec. A, contain video sequences of constant length, static classifiers seemed to be the first choice to classify the resulting feature vectors. This procedure has been applied to both databases with a wide range of combined feature sets. As video sequences, even if their length is constant, are dynamic sequences of observations, additional trials with HMMs have been conducted on entire sequences in the first place, to provide warping capabilities. Facial activities have been further analyzed for common patterns, which resulted in the observation of short motions that are repeatedly performed. Figure 5.2 shows three different stages of the facial action yawning. In this specially selected example, a neutral start is followed by a mouth opening, finally becoming a yawn. In a real world scenario it is difficult to segment captured data such that precise start and end points as well as the transitions between the states are obtained. As a result, a short sequence of approximately one second may contain several different actions. For instance, a laughing face may transform into a yawning one without temporarily returning to the neutral state. If complete actions have to be modeled, all possible transitions need to be considered, and an increasing number of facial activities consequently leads to an exponential growth of required models. Therefore a decomposition of facial behaviors into so called submotions has been proposed in [Ars06]. These may contain transitions between different actions, the actions themselves and common characteristics. The number of required submotions is obviously larger than the number of facial actions that need to be detected. However, further facial actions can now be added and simply described by the available submotions. This increase in facial actions will not necessarily lead to a larger number of required models.



A major problem is the definition of submotions which describe facial actions satisfactorily and are also detectable. A set of submotions has been collected manually by inspecting video material; in order to show the performance of the decomposition, this is sufficient. An entropy based automatic decision might be introduced in case the submotion approach proves reliable. Considering the limited amount of training data, only six submotions were pre-defined. Their order has been manually labeled within the training material. The submotions were chosen by observing the state of the mouth: in particular a closed, an open, an opening and a closing mouth as well as an open mouth with its maximum opening either in horizontal or vertical direction were used. The start and end frames have not been assigned in the transcriptions. A sample of one second length can contain up to 5 submotions. Hence the database of 405 facial actions contains nearly 1500 submotions. Their length varies between 3 and 10 frames and does not depend on the actual class. HMMs are applied for the first task, as these can cope with dynamic sequences of variable length, like the applied submotions. In order to train the required models, only the order and number of submotions for each sample in the database are given; neither the length nor start and end frames are known to the training system. Each submotion is represented by a three or four state left-right continuous HMM and trained using the Baum-Welch algorithm [Bau72]. During the training process the submotions are aligned to the training data via the Viterbi algorithm in order to find the start and end frames of the contained submotions. For the recognition of sequences of occurring submotions, the Viterbi algorithm is applied again. In the second stage the sequence of submotions gained from the previous stage has to be interpreted in order to recognize higher level facial actions. Therefore a dictionary containing the facial actions and various possible submotion based representations of each activity is required. As the recognized submotion sequences suffer from insertions, deletions and confusions, a simple table lookup is not suited for the interpretation. Hence, dynamic programming is utilized to align the sequence of submotions in an optimal way to the corresponding facial actions. A recognized sequence of submotions can for instance be compared with dictionary entries by computing the Levenshtein distance. This approach assumes that the order of submotions is crucial for the recognition of the represented facial units. To show that the frequency of the submotions within the sequences is not as significant as their order, SVMs are applied for comparison. The feature vector contains the number of appearances of each submotion, and thus has six entries for the actual implementation; it therefore represents the frequencies of submotions without an order. As shown in the results in sec. 5.1.3, these frequencies are less significant for the interpretation of the facial actions, which proves that the facial action is determined rather by the order of the submotions than by their frequencies.
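The second stage can be illustrated with a Levenshtein based dictionary lookup; the dictionary entries and submotion symbols below are invented placeholders, and the original system uses a dynamic programming alignment of this kind rather than exactly this code.

def levenshtein(a, b):
    # Classic dynamic programming edit distance between two symbol sequences,
    # tolerating insertions, deletions and confusions of submotions.
    prev = list(range(len(b) + 1))
    for i, sa in enumerate(a, 1):
        cur = [i]
        for j, sb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (sa != sb)))   # confusion
        prev = cur
    return prev[-1]

def interpret(recognized, dictionary):
    # dictionary: {facial_action: [possible submotion sequences]}
    best = min((levenshtein(recognized, ref), action)
               for action, refs in dictionary.items() for ref in refs)
    return best[1]

# Illustrative call with invented submotion symbols:
# interpret(['closed', 'opening', 'open_v', 'closing'],
#           {'yawn': [['closed', 'opening', 'open_v', 'closing', 'closed']],
#            'laugh': [['closed', 'opening', 'open_h', 'closing']]})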


accuracy [%]    p      d      s      d+s    d+s+a   p+d    G      G+G'   all
features        40     35     35     70     110     75     800    1600   935
SVM             87.6   52.6   50.5   65.9   73.2    86.6   56.7   74.7   71.1
HMM             71.3   47.5   45.3   43.3   58.6    64.2   44.3   67.8   60.7

Table 5.1.: Recognition results for four classes with the SAFEE-FAC data set. Both SVMs and HMMs have been applied on entire scenes, where the static SVM classification performed significantly better.

5.1.3. Evaluation of Facial Activity Recognition Systems

In order to evaluate both implemented facial activity recognition systems, two databases, namely FEEDTUM and SAFEE-FAC (see sec. A.1 and A.2), have been used. While the FEEDTUM database contains the “big six” emotions plus neutral, as defined by Ekman [Ekm84], the SAFEE-FAC contains four facial activities not related to emotion, namely yawning, speaking, neutral and laughing. In both cases classifiers based on SVMs and HMMs have been trained to compare dynamic and static classifiers. The submotion based approach has only been considered for the SAFEE-FAC, as the video sequences in the FEEDTUM corpus could not be segmented further. Nevertheless, the results are a valuable indicator for the classifiers' performance. All methods have been applied with different feature sets, which contained combinations of distance measures (d), speed (s) and acceleration (a). Furthermore, the impact of just the positions of the fiducial points (p) and of the Gabor coefficients (G) has been tested. Table 5.1 shows the results of a five fold stratified cross evaluation with holistic approaches and various applied feature sets using the SAFEE-FAC. The SVM based approach obviously performs significantly better than the dynamic recognition with HMMs (HMM). Tab. 5.2 shows the recognition rate of the submotions (SUB) and their subsequent interpretation as facial actions via SVMs (SVM) and dynamic programming (DP). Note that due to temporal constraints Gabor coefficients have only been used for tests with holistic approaches; further improvements were not expected, due to their weak performance in the first trials. Within all approaches the distance measures seem to be more reliable than just the normalized coordinates of the fiducial points, whereas the first derivative produces lower rates. Combining positions and geometric distances results in the most reliable recognition results. Adding acceleration also has a positive effect on the recognition rate, at the cost of a growing number of features. Comparing the classification of a whole sequence and the analysis of a series of submotions, the decomposition of facial actions performs better than the standard approaches.


accuracy [%]    p      d      s      d+s    d+s+a   p+d
features        40     35     35     70     110     75
SUB             60.2   65.9   55.6   72.9   76.5    75.2
SVM             69.9   79.6   42.8   82.3   83.6    81.3
DP              70.2   84.1   54.2   87.9   90.1    85.7

Table 5.2.: The considerably low recognition performance of submotions (SUB) is enhanced by using dynamic programming (DP) for the classification of complete sequences.

A promising recognition rate of 90.1 % has been reached using all available features based on fiducial points. The submotion based approach outperforms the holistic classification with HMMs, which reaches 71.3 % recognition rate, by far. Furthermore, the novel submotion decomposition leads to a recognition rate improvement of three percent compared to the initially proposed SVM based approach. The results also show the importance of the interpretation of a sequence of submotions, as mistakes within the submotion recognition can be compensated: although only 76.5 % of the submotions were recognized, the fusion yielded far better results. The results also show the advantage of static classification, as SVMs with a constant number of frames perform significantly better than the holistic HMM based approaches. A major performance boost has been achieved by the introduction of submotions, which is also the best reported result for this dataset [Ars06]. Similar results have been achieved for the FEEDTUM dataset with seven classes, though the recognition rate is lower due to the larger number of classes. The static classification once more outperforms the dynamic classification with HMMs. These results outperform the original implementation with Macro Motion Blocks [Wal06b] by 4 % and reach similar results as Active Appearance Models (AAM) combined with SVMs [Bes07]. Recent improvements based on the refinement of AAMs with edge images showed higher classification results of up to 92 % [Mar08], which indicates that the AAM parameters are more reliable than the applied EBGM technique.

5.2. The SAFEE Onboard Threat Detection System

The aim of the EU funded project Security of Aircraft in the Future European Environment (SAFEE, FP6 grant number AIP3-CT-62003-503521) is to increase on-board security in aircraft. Its general idea is to implement a system which is capable of automatically detecting potentially threatening situations.


accuracy [%]    p      d      s      d+s    d+s+a   p+d    G      G+G'   all
features        40     35     35     70     110     75     800    1600   935
SVM             65.3   44.9   46.9   51.0   52.3    50.3   50.2   44.9   55.1
HMM             55.6   42.1   39.8   45.6   51.2    49.2   44.3   41.3   54.7

Table 5.3.: Overview of recognition results for seven classes with the FEEDTUM data set.

The Suspicious Behavior Detection System (SBDS) work package, which has been an integral part of the On-board Threat Detection System (OTDS) subproject, aimed to provide a systematic solution to monitor people within enclosed spaces in the presence of heavy occlusion, to analyze these observations and to derive threat intentions [Car08]. These are subsequently reported to the crew, which then decides what kind of steps to take in order to (re-)gain control of the current situation. Threats may include unruly passenger behavior (due to intoxication etc.), potential hijack situations, and numerous other events of importance to both flight crew and ground staff. Indicators of such events have been collected in a set of so called Pre-determined Indicators (PDI), such as nervousness or frequent visits to the lavatory. These PDIs have been assembled into complex scenarios, which can be interpreted as combinations and temporal sequences of so called low level activities (LLA). In order to detect these LLAs, the following three systems have been employed by various Subject Matter Experts (SME):

• Acoustic Event Detection
• Tracking of passengers
• Low level visual stress detection

All observations are subsequently fed into a common scene-understanding module, which produces a reasoned output from the various elements. In order to achieve this aim, aircraft are equipped with cameras observing the aisles, the cockpit region and all seated passengers. Additionally, microphones have been installed, which are spread all over the cabin. Fig. 5.3 illustrates possible sensor positions in the aircraft. In this work the modules for seated behavior detection have been implemented, and their outputs have been sent to a scene interpretation module [Car08]. In order to robustly pick up visual stress indicators, it is important to know how an LLA can be characterized. According to experts, these can be further decomposed into so called low level features (LLF), which can be chosen with respect to their detectability [Ars05b]. These LLFs can subsequently be combined to detect a more complex behavior.


Figure 5.3.: Cabin map of the SAFEE demonstrator. Only two exemplary seat cameras are illustrated, although eight have been installed.

Experts have thereby identified the following six security relevant activities: aggression, cheer, intoxication, nervousness, neutrality and tiredness. In order to model these, the so called Airplane Behavior Corpus (ABC) has been recorded, as described in app. A.3. The expertise in face detection and tracking gathered throughout this work has further been used to design an Access Control and Recognition System (ACRS). This work package aimed to guarantee that the passenger boarding at the moment is the same person as the one who checked in. Therefore a reliable face recognition system has been implemented.

5.2.1. Low Level Feature Extraction

Discussions within the SAFEE consortium have revealed large scale privacy issues for surveillance tasks in public transportation systems. While video cameras have been widely accepted, as the public has already become accustomed to the presence of CCTV, the processing of acoustic features seems to be more problematic: it is commonly agreed that eavesdropping is not accepted and is considered a privacy violation, while cameras are more likely to be accepted for security reasons. Therefore this work has been restricted to visual cues. Nevertheless, the speech emotion community has recently shown great interest in the security domain [Sch08c] and has investigated various approaches for this classification task [Sch07b] with promising results. A major issue for the aircraft application scenario is the required real time capable feature extraction and classification. Additionally, the features have to be extracted robustly from every frame.


Figure 5.4.: Visualization of applied features: Extracted global motions, face detection and facial feature extraction, skin motion detection in the left and right half of the image.

Facial feature points as defined by the MPEG-4 standard [Ost98] have been discarded, as they are not visible in all frames; nevertheless, these have been investigated by Wimmer [Wim08]. Instead, it has been decided to focus on global motion features [Zob03], extracted from various parts of the image based on a simple difference image D(x, y, t) = I(x, y, t) − I(x, y, t − 1). These features are extracted, as illustrated in fig. 5.4, from different image parts and categorized as follows:

• Global Motion Features: global motion features (g) are computed from the entire image of D(x, y, t) in the first place.

• Face Motions: real time face tracking based on an initialization with an MLP [Row98], combined with the condensation algorithm [Isa98], is utilized to restrict feature extraction to the facial region, where face motions can be computed. Besides the computation on the entire detected face (fc), features are extracted from three different face regions (f3) in order to model the face in more detail. With the upper half split into two parts, both the left and right eye positions are approximated, where the most dominant movement is produced by eye blinks. Furthermore, a feature extraction is performed in the lower half of the face, which models movements of the mouth.

• Skin Motions: in a third stage hands are detected by applying a simple skin detection algorithm [MS00]. As the position of the face is already known, it can be assumed that the remaining skin parts represent hands and arms. Skin motion features are either computed for the entire frame (sc) or separately for the right and left arm (s2), by simply splitting the video stream in the middle.
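A minimal sketch of the difference image based motion features is given below; the statistics returned per region are merely illustrative, as the actual nine global motion descriptors follow [Zob03] and are not reproduced here.

import numpy as np

def difference_image(frame_t, frame_prev):
    # D(x, y, t) = I(x, y, t) - I(x, y, t-1) on grayscale frames.
    return np.abs(frame_t.astype(np.int16) - frame_prev.astype(np.int16)).astype(np.uint8)

def region_motion_features(diff, region):
    # region: (x0, y0, x1, y1) within the frame, e.g. the whole image for g,
    # the detected face for fc, or one of the sub-regions for f3 / s2.
    x0, y0, x1, y1 = region
    d = diff[y0:y1, x0:x1].astype(np.float32)
    total = d.sum() + 1e-6
    ys, xs = np.mgrid[0:d.shape[0], 0:d.shape[1]]
    cx, cy = (xs * d).sum() / total, (ys * d).sum() / total   # center of motion
    sx = np.sqrt(((xs - cx) ** 2 * d).sum() / total)          # spread of motion
    sy = np.sqrt(((ys - cy) ** 2 * d).sum() / total)
    return np.array([d.mean(), d.std(), d.max(), cx, cy, sx, sy])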

In addition to the face motion features the face’s displacement (∆fc ) is determined in x and y direction based both on the face detector output and the smoothed detector output.


                g    fc    f3    sc    s2    c
#(features)     9    9     36    9     18    81

Table 5.4.: Number of features extracted in each frame for each LLF class.

Face motion in depth is considered by the computation of scale changes. Furthermore, the position changes of the left and right eye brows (∆brow) are computed over time. Both the face displacement (∆f) and the eye brow movements (∆brow) have been added to the facial features f3. Table 5.4 shows the number of features extracted per frame for each type and for the entire feature set (c).

5.2.2. Suspicious Behavior Detection

So far, up to 81 features have been extracted from each frame within a video sequence, which subsequently need to be classified. Various classifiers have been evaluated with the extracted feature sets in order to find the most appropriate solution. The utilized classifiers can basically be divided into two groups. Dynamic classifiers are able to classify feature sequences of variable length, which are naturally given by video sequences of different durations; HMMs have been established as the standard method for the classification of data of unknown length. As the samples contain only a single activity, no further segmentation has been required and detection could be performed without the use of a grammar. Static classifiers, on the other hand, are only able to process feature vectors of constant length. Therefore the video sequences have to be preprocessed in order to guarantee a fixed vector length. The probably most convenient solution is to process every single frame in a video independently, which automatically results in feature vectors with a fixed number of entries. This method incorporates various drawbacks: classification becomes computationally more expensive, and hence real time performance cannot be guaranteed; additionally, dynamic feature changes, which are important to describe a behavior, are neglected. By segmenting each video with a window of a constant length of n frames – here trials have been conducted with 25 frames without any overlap – dynamic changes can still be respected. Subsequently, the resulting feature vector $\vec{x}$ with the size N = 25 × #(features) can be classified. From the variety of static classifiers, SVMs and a Bayesian Network (BN) based approach have been chosen for the classification task. Due to the segmentation of the video sequence, a subsequent fusion of the segment based classifications has to be performed. This is done either by a simple majority vote or by a further analysis of the obtained probabilities. In order to provide a reliable evaluation, various feature configurations have been tested. Likewise, it has been tested whether the classification of combined large vectors (early fusion) or the classification of smaller feature sets and a subsequent fusion (late fusion) should be preferred [Ars07b].
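Building the fixed-length segment vectors is straightforward; the following sketch assumes a per-frame feature matrix and is purely illustrative.

import numpy as np

def segment_features(frame_features, window=25):
    # frame_features: array of shape (num_frames, num_features) for one video.
    frame_features = np.asarray(frame_features)
    n = frame_features.shape[0] // window
    segments = frame_features[:n * window].reshape(n, -1)
    # Each row now holds window * num_features values and can be fed to a
    # static classifier such as an SVM; leftover frames are discarded here.
    return segments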


Figure 5.5.: Hierarchical classification of unruly and normal behavior patterns with SVMs. The introduction of two additional classes enhances the overall recognition rate of the six class problem.

As most common classifiers produce scores for each class, these can be combined into a stronger classifier. Although Jaeger [Jae06] proposes to combine classifier outputs based on their confidence, a simple accumulation of scores is frequently sufficient in case the classifiers can be considered independent. The largest combined probability is subsequently used as detector output. Up to now the behavior detection task has been considered a six class problem, due to the classes the SBDS intends to detect. Naturally, a reduction of the number of classes will result in higher detection results. Unnecessary classes can either be removed, leaving only the interesting ones behind, or several classes can be merged into one single class with more training examples. There are basically only two important classes for the passenger surveillance task: unruly behavior, including aggression, nervousness and intoxication, and neutral behavior, including cheer, neutrality and tiredness. Nevertheless it is important to receive more detailed information on the observed behavior than just an alert signal. Therefore a hierarchical approach is proposed [Ars09a]. The arising question is how to design the classifier's branches, as the tree can be built up in a large variety of ways. This work suggests introducing two more classes, namely unruly and normal, for the above mentioned reasons, as illustrated in fig. 5.5. In the first stage the by far simpler two class problem of separating neutral and unruly behavior is solved; experiments have shown a sufficient performance of SVMs for this task. Subsequently, two three class problems, depending on the output of the previous stage, have to be solved. Therefore two separate models have been trained for the unruly and the neutral activities.
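A sketch of the hierarchical classification is given below, using scikit-learn as a stand-in for the SVM implementation actually employed; the class names, the probability based routing and all identifiers are assumptions for illustration.

import numpy as np
from sklearn.svm import SVC

UNRULY = ['aggressive', 'nervous', 'intoxicated']

def train_tree(X, y):
    # X: (n_segments, n_features) array, y: array of class labels per segment.
    X, y = np.asarray(X), np.asarray(y)
    # Stage 1: unruly vs. normal; stage 2: one three-class SVM per branch.
    coarse = np.array(['unruly' if lbl in UNRULY else 'normal' for lbl in y])
    stage1 = SVC(probability=True).fit(X, coarse)
    stage2 = {
        'unruly': SVC(probability=True).fit(X[coarse == 'unruly'], y[coarse == 'unruly']),
        'normal': SVC(probability=True).fit(X[coarse == 'normal'], y[coarse == 'normal']),
    }
    return stage1, stage2

def classify_sequence(segments, stage1, stage2):
    # Late fusion: accumulate the segment scores over the entire sequence.
    branch_scores = np.sum(stage1.predict_proba(segments), axis=0)
    branch = stage1.classes_[np.argmax(branch_scores)]
    fine = stage2[branch]
    fine_scores = np.sum(fine.predict_proba(segments), axis=0)
    return fine.classes_[np.argmax(fine_scores)]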

5.2.3. System Evaluation

In this section the results of the suspicious behavior classification task are presented and compared to other works. Due to the very limited amount of data available in the ABC, a 5-fold Stratified Cross Validation (SCV) strategy has been used for a reliable evaluation. This way, disjoint sets have been trained and tested covering the entire database.


accuracy [%]    g      f      f3     s      s2     g+f+s   c      sum(g,f,s)
SVM2class       73.1   72.0   70.9   76.2   75.5   75.2    69.2   80.5
SVMunruly3      72.9   77.9   74.3   75.1   73.3   79.5    68.3   81.2
SVMnormal3      64.3   71.4   70.6   67.9   63.8   73.9    61.4   74.1
SVM6class       55.1   54.9   52.1   51.6   51.8   57.5    44.7   60.1
HMM25f6         45.3   41.7   41.3   43.1   40.9   47.5    29.7   49.3

Table 5.5.: Recognition rates of windows with 25 frames. Results are provided for two classes (SVM2class), for three classes where unruly (SVMunruly3) and neutral (SVMnormal3) were separated and the classification of all six classes with SVMs (SVM6class) and HMMs (HMM25f6).

Tab. 5.5 shows the results of a 5-fold SCV on segments with a constant length of 25 frames for all applied feature sets. All approaches had to classify a total of 4511 frames, while the classification of unruly and normal behaviors included only about half of the data. Due to the lower dimensionality, the classification of the smaller feature sets outperforms the classification of the larger ones. This observation is especially noticeable for the entire feature set, as the number of features drastically exceeds the number of examples. Summing up the scores of the three smaller sets, namely global motion (g), skin motion (s) and face motion (f), yields the best recognition rate for all approaches. This can easily be explained, as each activity is probably best characterized by one or more of the feature types, so the corresponding classifiers are able to separate the samples better than others. As noted previously, the reduction of classes and a smaller feature vector size result in a higher recognition rate, which can also be seen in tab. 5.5. Further, it has to be noted that unruly behaviors are discriminated far better than normal ones, which indicates the small inter class variance of the normal behaviors. As entire video sequences have to be classified, the segment decisions have to be combined in a final step. The classifier outputs are combined by a simple addition of the scores over the duration of the entire video sequence. As it is unlikely that the weakest classifiers will outperform the stronger ones after temporal integration, only the results for the most promising feature sets are shown in tab. 5.6. As can be seen, the temporal integration of the single classifiers is able to remove some of the errors of the segment based recognition process, where especially the classifiers of the tree based approach reach high classification rates. Once more it can be observed that the elements of the normal class are harder to discriminate than those of the unruly class. The tree based classification (SVMtree6) shows the best performance for this classification task. The recognition rates of the HMM based classification of segments (HMM25f6) do not differ significantly from the approach of classifying entire sequences with HMMs (HMMseq6), and both fall behind the static classification with SVMs.


accuracy [%]    g      f      s      g+f+s   sum(g,f,s)
SVM2class       81.8   79.9   84.5   85.9    87.9
SVMunruly3      81.2   85.2   83.3   87.0    90.7
SVMnormal3      66.7   83.8   70.4   79.6    75.2
SVM6class       59.3   57.3   56.6   60.8    66.5
SVMtree6        69.3   72.5   71.7   73.8    74.9
HMM25f6         50.1   45.3   46.7   49.5    52.6
HMMseq6         49.2   46.8   47.2   51.3    52.3
FL              31.8   27.5   29.1   34.8    32.1
NN              28.1   27.2   26.3   29.1    30.2
BN              41.9   38.7   33.2   40.9    42.3

Table 5.6.: Recognition rates for entire sequences by accumulation of segment based classification results. The discrimination between the neutral and unruly class (SVM2class) obviously performs well. The subsequently resulting three class problem performs even better in case only unruly samples (SVMunruly3) are considered, while neutral samples are harder to discriminate (SVMnormal3). Finally the results for the six class problem are provided. Obviously the presented tree based classification outperforms all other ones.

Tab. 5.6 also shows the results of experiments with other classification techniques, in particular BN [Cha91], NN [Jor96] and Fuzzy Logic (FL) [Zim86]. As all of these perform far weaker than SVMs and HMMs, no further information on their functionality and configuration is given at this place. Though previous work with synthetic data has shown the applicability of BNs to the behavior detection problem [Ars05c], the promising results could not be repeated for real data. A more detailed analysis of the achieved results is provided by the confusion matrix displayed in tab. 5.7, which illustrates the confusions between all six classes. As can be seen, nervous behavior is recognized best, whereas intoxication is recognized worst; the f1 measures of the other classes are distributed almost equally. In the past, other promising approaches have been evaluated with the ABC corpus and shall now be compared to the presented results. Deformable models have been fitted to the passengers' faces in [Sch07c] in order to extract features. After feature reduction with a sequential floating forward search, the remaining features were used for a time series analysis, which resulted in a 61.1 % recognition rate. This shows the advantage of the approach presented in this work, which does not rely on complex facial features. Further experiments with acoustic behavior recognition in [Sch08c] demonstrated a classification approach with a large set of low level audio descriptors and functionals, resulting in a 73.3 % recognition rate.



truth     aggr   cheer   intox   nerv   neu   tired     [#]    f1 [%]
aggr        85       0       6      2     3       0      96      77.3
cheer       11      78       3      0    13       0     105      81.3
intox        5       5      17      2     4       0      33      58.6
nerv         8       0       3     73     9       0      93      83.9
neu          7       2       0      4    64       2      79      71.1
tired        8       2       2      0     8      34      54      75.5

Table 5.7.: Confusions of behaviors and f1 -measures by use of SVM in a 5-fold SCV with 3 separate feature sets on the ABC


Figure 5.6.: An Airbus A340 mock-up has been used as SAFEE demonstration platform. The seats’ arrangement and camera positions can be seen on the left side. An exemplary field of view is shown on the right.

Evidently both acoustic and the presented visual behavior detection methods operate at about the same level. More reliable results can be achieved by combining audiovisual features [Sch08a], which has been done in [Wim08] with a recognition rate of 81.1%, albeit relying on more complex facial features.

5.2.4. The SAFEE SBDS On-line Demonstrator

The presented SBDS has also been integrated into an Airbus A340 mock-up for demonstration purposes. Ordinary webcams with VGA resolution have been mounted underneath the overhead bins and were observing two seats at the same time. For computational reasons the grabbed frame is split into two parts according to the arrangement of the seats. The camera installation and an exemplary field of view are displayed in fig. 5.6, where eight seats are observed with four cameras. The processing hardware, see app. C.2.1, has been placed in the overhead bins and has been connected to the central observation unit via network. For demonstration purposes a complex scenario has been developed by the British authorities, which only required the robust detection of aggressive and nervous behavior.



Figure 5.7.: Extraction of global motions, skin motion and facial features in an aircraft environment.

Figure 5.8.: The SAFEE on-board threat detection system. The last five alerts are shown in the table on the left side and visualized in the cabin-map on the right.

The implemented system, which is based on SVMs, has therefore been re-trained with features extracted from the ABC, resulting in a model of the two required classes and neutral behavior as initial state. Global motions, facial features and skin motion can be detected robustly, although the camera perspective has been changed compared to the training set, see fig. 5.7. All required features were computed on-line, and 25 consecutive frames were subsequently classified by three separately trained SVMs. As a single decision is not required every second and is likely to be wrong, five outputs are accumulated and the class with the highest vote is finally chosen. This way the classification rate has been raised drastically. Further, the computational effort of maintaining a behavior database which administered the single detector outputs could be kept low, as MySQL is only able to process a limited number of requests at a time. The final decision is sent to the OTDS system and alerts are displayed on a terminal screen, as illustrated in fig. 5.8. The last five events, the assigned probabilities, and the seat numbers are displayed on the left hand side. For visualization purposes alerts are additionally indicated in a cabin map on the correct seat number.

4 MySQL has been used to maintain a behavior database and provide a communication platform between various modules.
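The accumulation of five consecutive segment decisions described above can be sketched as follows; the window size, class labels and interface are assumptions for illustration and do not reproduce the original demonstrator code.

```python
from collections import Counter

class MajorityVoter:
    """Collects segment-level classifier decisions and emits the class with
    the highest vote once a fixed number of outputs has been accumulated."""

    def __init__(self, window=5):
        self.window = window
        self.buffer = []

    def update(self, label):
        self.buffer.append(label)
        if len(self.buffer) < self.window:
            return None                 # not enough segment decisions yet
        decision = Counter(self.buffer).most_common(1)[0][0]
        self.buffer = []                # start accumulating the next block
        return decision

voter = MajorityVoter(window=5)
final = None
for label in ["neutral", "aggressive", "aggressive", "neutral", "aggressive"]:
    final = voter.update(label) or final
print(final)  # -> 'aggressive'
```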



truth        aggressive   nervous   neutral     [#]   rec [%]
aggressive           20         1         0      21        95
nervous               8        18         3      29        66
neutral               4         2       424     430        98

Table 5.8.: Confusions of behaviors and detection rates for the SAFEE demonstrator with three classes

The system has been evaluated during various demonstration sessions, where actors were playing scenarios according to predefined scripts. They were not told how to act aggressively or nervously, as different interpretations of the behaviors had to be picked up. Tab. 5.8 shows the results of the evaluation procedure. It should be noted that four persons have been continuously monitored for 30 minutes, resulting in 120 minutes of video data. They were usually behaving neutrally during this time. The 21 appearances of aggression and 29 nervous activities had an average length of 15 seconds, resulting in 50 · 15 = 750 seconds of unruly behavior. Therefore the passengers were behaving neutrally for 107.5 minutes, which can be split into 430 segments. The classes seem quite unbalanced for a significant evaluation at first glance, but represent a real flight by far better than a balanced validation set. Unruly situations are hopefully rarely observed during a flight, which makes neutral the dominant class, and it thus has to be detected very robustly. Both the authorities and security staff favor systems with a low false positive rate, even at the cost of the detection rate. As shown in tab. 5.8, neutral and aggressive behavior can be picked up very robustly, while nervous behavior is frequently confused with aggressive. This can be partially explained by the actors' overacting, which has been confirmed by experts in the field of behavior analysis. Most remarkable is the very low error rate of the detection of neutral behavior. These results with person independent data confirm the results of the trained system evaluated with a 5-fold SCV on the airplane behavior corpus. The high performance of this approach can of course be explained by the small number of classes. Nevertheless it has been shown that behaviors can be detected on-line with a disjoint validation set.

5.2.5. The SAFEE Access Control and Recognition System

Besides the detection of suspicious behaviors, it is desired to guarantee that only passengers and crew members related to a specific flight are allowed to board an aircraft. Up to now a passenger's ID is only verified at check-in and potentially compared to the authorities' databases of criminals at the moment tickets are sold. A person could therefore let somebody else buy a ticket and check in. Subsequently he could take the boarding pass and enter the aircraft without any problems. Furthermore even a corrupt staff member could gain access to an airplane, although he is not responsible for the aircraft at that time.



Figure 5.9.: Exemplary FOV at a check-in. The extracted faces are stored on a smart card.

Therefore an identification mechanism is desired both by the authorities and the carriers. According to various airlines such a system can only be accepted if it does not disturb the boarding process and keeps downtimes at a minimum. Therefore it has been decided to use faces to identify passengers and crew, as the recognition process is non-invasive and can be performed quite robustly because of the short temporal difference between check-in and boarding. As passengers are usually unknown when booking a flight, a so-called enrollment has to be conducted. Thereby an image of each person's face has to be taken. It is advisable to do this during check-in at the counter or quick check-in terminal, as shown in fig. 5.9, where the ground staff could kindly ask the passenger to look into a camera, which will subsequently trigger the enrollment software. This will start the previously presented face tracker, localize the passenger's face and create suitable face models. These are subsequently stored on a boarding pass with a Radio Frequency ID (RFID) tag. In order to avoid errors created by the face detection module and enhance recognition performance, multiple face images are stored on a RFID tag. Due to the limited memory on long range ID tags, it is advisable to store only model parameters. After the boarding pass is written, it is handed out to the passenger, who is now allowed to proceed to the gate. As soon as the passenger actually boards the aircraft, meaning that he passes the plane's door, he will present the boarding pass to a RFID reader with a limited reading range of 10 cm. This procedure triggers the access control software. In a first step faces are detected and tracked in the aircraft's entrance area. As space is very limited in the aircraft, the camera's position has to be chosen carefully. It has to pick up only the person entering the aircraft, and his face should be visible in an upright position.

5 Scene from the NBC series "Friends", episode 4x15 "The One With All The Rugby".
6 The used RFID tags with MIFARE technology had a range of 10 cm. This distance has been chosen as a compromise between reading/writing speed and handling, as the tag can be located in a wallet.



Figure 5.10.: Exemplary FOV in the entrance area of an Airbus A340. The lower row shows the face detection result with and without background modeling. The false positive on the left side is eliminated by the GMM which has been used for background modeling.

Further, the position should be chosen so that queuing passengers do not occlude the faces of passengers boarding after them. Fig. 5.10 shows three exemplary fields of view in the entrance area of an A340 mock-up, which have been chosen as suitable. Fortunately the area is quite large and there is some time available to track the passengers' faces. For these trials an IEEE-1394 camera with 1 024 × 768 pixels resolution has been utilized. As can be seen, large parts of the image are usually just background regions, in case passengers are not boarding in crowds. In order to save processing time and avoid false positives in the background regions, a foreground detection algorithm is applied to restrict the region of interest to the passenger's body. As shown in fig. 5.10, the processed area is drastically smaller and the person's face is detected robustly. The extracted faces are then propagated to the face recognition software, which first creates face models. Before a model for the applied Pseudo-2D Hidden Markov Model (P2D-HMM) approach is created, a sequence of M-dimensional feature vectors is generated [Eic99, Bev08]. This is done by computing a block wise two dimensional Discrete Cosine Transformation (DCT) of N × N-pixel sized blocks with

C(u, v) = \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} f(x, y) \cos\frac{(2x+1)u\pi}{2N} \cos\frac{(2y+1)v\pi}{2N} \qquad (5.1)

and storing the first ten DCT-coefficients for each block. The first ten DCT-coefficients C(u, v) are those for which the condition (u + v) ≤ 3 is fulfilled, which corresponds to



Figure 5.11.: Structure of a 3 x 3-state left-right P2D-HMM

the lowest horizontal and vertical cosine frequencies. The feature vector sequence is determined by scanning the columns of the image, repeatedly shifting a sliding window by d pixels down and computing the DCT-coefficients in each block. If the lower boundary of the image is reached, the sliding window is positioned on top of the column located d pixels to the right of the current one. This procedure is repeated until the right corner on the lower boundary of the image is reached. Since d < N, the scanning windows overlap. The DCT feature vectors are subsequently quantized using k-means clustering, resulting in discrete observation symbols, from which a discrete model of the observation probabilities is created. All created models are now compared to the stored models. As multiple models are available, a majority vote can be applied to enhance recognition performance. Recognition is conducted with a P2D-HMM, a modification of the popular Hidden Markov Models. The step from one dimensional HMMs to P2D-HMMs is conceptually simple: each column of a two dimensional field is modeled as a one dimensional HMM. Additional so-called marker states are included at the beginning of each HMM. Transitions are allowed either from the last state of a column back to its marker state (column-wise auto transition) or to the marker state of the next column (column-wise forward transition), which is illustrated in fig. 5.11. This leads to a warping ability in two directions. The mentioned marker states contain a maximum output probability for a unique value which is also added to the beginning of every value sequence describing a column. Since the overall structure of this P2D-HMM can be transformed into a 1D-HMM, all computations can be done as for the 1D-HMM. First evaluation attempts of the P2D-HMM approach have been performed using the FERET database [Phi00], which contains a total of 3 737 images of 1 195 individuals.

7 FERET: Facial Recognition Technology database, created by the National Institute of Standards and Technology (NIST).
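The block-wise DCT scan and the subsequent vector quantization can be illustrated with the following sketch, which implements eq. (5.1) directly; block size N, shift d and the toy codebook (which would normally be obtained by k-means clustering) are illustrative assumptions, not the original configuration.

```python
import numpy as np

def block_dct_features(image, N=8, d=4):
    """Scan a grayscale face image column by column with an N x N window
    shifted by d pixels (d < N, i.e. overlapping), compute the 2D DCT of
    each block according to eq. (5.1) and keep the ten coefficients with
    u + v <= 3."""
    H, W = image.shape
    # 1D cosine basis of eq. (5.1), basis[u, x] = cos((2x+1)*u*pi / (2N))
    basis = np.array([[np.cos((2 * x + 1) * u * np.pi / (2 * N)) for x in range(N)]
                      for u in range(N)])
    keep = [(u, v) for u in range(N) for v in range(N) if u + v <= 3]  # 10 coefficients
    features = []
    for left in range(0, W - N + 1, d):          # move right column by column
        for top in range(0, H - N + 1, d):       # slide the window downwards
            block = image[top:top + N, left:left + N]
            coeffs = basis @ block @ basis.T     # separable, unnormalized 2D DCT
            features.append([coeffs[u, v] for (u, v) in keep])
    return np.array(features)

def quantize(features, codebook):
    """Map each DCT feature vector to the index of its nearest codebook entry."""
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return np.argmin(dists, axis=1)

face = np.random.rand(64, 32)                    # stand-in for a detected face region
feats = block_dct_features(face)
codebook = feats[np.random.choice(len(feats), 16, replace=False)]  # toy codebook
symbols = quantize(feats, codebook)              # discrete observations for the P2D-HMM
```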



states-cb       7x7-2000   7x7-500   6x6-100   5x5-500
accuracy [%]       93.64     92.74     92.46     90.46

Table 5.9.: Identification Results with P2DHMMs on the FERET Database for a varying number of states and codebook entries (cb)

Figure 5.12.: Exemplary boarding sequence in an Airbus A340 mock-up. The passenger thereby has to look directly into the camera (left) to be detected in the current FOV (middle). The ACRS will subsequently create an output on a terminal display (right).

The models are trained with the commonly known methods with a varying number of states and codebook entries (cb). Table 5.9 shows recognition results for different configurations, with a maximum recognition rate of 93.6%. The results have been confirmed for the given scenario with the ABC, provided a frontal face was available. For gazes different from the trained ones, the recognition rate decreases. The on-line demonstration system performs similarly if people are told to stop at the door and look into the camera, as shown in fig. 5.12. After the comparison of the newly acquired model and the stored one, the status of the passenger will be displayed on a terminal screen. The requirement of looking straight into the camera is of course not suitable for a real world application, as it disturbs the boarding process and introduces new time-consuming stops. Therefore it seems advisable to investigate novel view-independent techniques, such as the matching of profile and front views [Wal05] or the creation of 3D face models for recognition either from omni-directional views [Sch08b] or with the use of 3D cameras [Kit05].

5.3. Activity Monitoring in Meetings

Most employees dislike business meetings because of the effort, the duration and the low efficiency. The Augmented Multi-party Interaction - Distance Access (AMIDA) project [Al 06] attempts to increase the efficiency by the use of modern machine-learning techniques. One of the main ideas of the AMIDA project is that a camera selection can be performed in smart meeting rooms, which are equipped with several cameras, so that

8 EU funded FP6 project: AMIDA.



Figure 5.13.: Exemplary views in the AMI Corpus, which contains four close-up views, two left/right views and a global view. Features are not extracted from the global view. Nevertheless all seven views were used for the automated camera selection task.

the most relevant information is shown in the created output video. This video could be used to broadcast a video stream of the meeting on-line, even on small devices such as cell phones, in order to catch up on missed parts of a currently ongoing meeting or to store summaries of past meetings. In order to create this video two approaches have been followed. A complex segmentation of meetings according to the actually detected event, such as monologue, discussion, or agreement, has been presented in [Rei06]. This is useful for browsing through the meeting and rapidly finding key sequences after the meeting has taken place. For on-line applications it would be interesting to choose one of the cameras mounted in the meeting room and only transmit the currently most interesting video stream. The task is now to carefully select the correct camera, which contains the most active attendee in the current situation [Al 07]. The following approach has been implemented and evaluated with parts of the Augmented Multi-party Interaction (AMI) database, see sec. A.4.

5.3.1. Recognition of the Relevant View

In order to solve the camera selection task, the features presented in sec. 5.2.1 have been once more employed. As microphones were available, simple acoustic features were further extracted. These simply describe which of the four participants is actively speaking in each video frame, resulting in a four dimensional feature vector with the binary information 1 for a person speaking and 0 for being quiet in the current frame. A SVM based approach for behavioral analysis has been presented in [Ars07b] and can be easily adopted to the meeting scenario. Each video sequence is segmented by a window with a constant length of 25 frames without overlap. The minimum length of a selected shot is thereby set to 1 s. In contrast to a frame based decision, motion changes over time can thus be modeled in a finer way. Features are extracted from every segment, and the resulting vector, with a constant size of N = 25 × #(features), is subsequently classified by a SVM.
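A possible way to assemble such fixed-size segment vectors is sketched below; the per-frame feature dimensionality and the aggregation of the binary speaker flags per segment are illustrative assumptions (chosen here so that the dimensionalities happen to line up with tab. 5.10), not a description of the original feature composition.

```python
import numpy as np

def build_segment_vectors(frame_features, speaker_flags, window=25):
    """Stack 25 consecutive frames of visual features into one segment
    vector and append the four binary speaker-activity flags once per
    segment (here: per-participant majority over the segment)."""
    num_segments = frame_features.shape[0] // window
    vectors = []
    for s in range(num_segments):
        chunk = frame_features[s * window:(s + 1) * window]
        flags = speaker_flags[s * window:(s + 1) * window]
        active = (flags.mean(axis=0) >= 0.5).astype(float)  # assumed aggregation
        vectors.append(np.concatenate([chunk.ravel(), active]))
    return np.array(vectors)

# Toy example: 100 frames, 324 visual features per frame, 4 speaker flags
vis = np.random.rand(100, 324)
aud = np.random.randint(0, 2, size=(100, 4))
X = build_segment_vectors(vis, aud)
print(X.shape)  # (4, 8104)
```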



type            C     C+s     LR    LR+s     All   All+s
#(features)  8100    8104   8100    8104   16200   16204
RR [%]       38.5    61.0   32.5    55.3    39.2    60.8

Table 5.10.: Recognition results on video data with the different feature sets

A couple of combinations of video modes and modalities have been tested for the recognition of the seven classes, representing the seven possible fields of view. In the first place only visual features have been used for the creation of models. Fig. 5.13 illustrates three exemplary fields of view, which were used for feature extraction. In order to monitor the attendees of a meeting, four close up views (C), two left/right views (LR), and a global view are being recorded. Due to the perspective of the seventh view no features could be extracted in this view. Nevertheless it has been employed in the camera selection task. In order to show which cameras are reasonable, the close up views, the left/right views and all views were used separately for the training procedure. Secondly acoustic features were added to each group of views (+s). Table 5.10 illustrates recognition results for the various feature sets after a four-fold SCV evaluation [Ars08a]. Training and test set have been chosen manually, which guaranteed that actors would not appear in both sets. This way a person independent classification has been performed. As can be seen, in all cases the involvement of acoustic features boosted the performance drastically by more than 20% on average. It is also remarkable that the smaller and easier to handle close up feature set performs as well as the entire set. Reasons might be the drastically smaller amount of data and the weak performance of the LR view in general, which might even disturb the classification of the entire feature set. The achieved recognition rate (RR) of 61% outperforms most known approaches. A rule based decision and a Two-Layer-HMM have been presented in [Al 07], where the best recognition rate has been 51.5% using a Two-Layer-HMM. Further, a Graphical Model (GM) based approach has been implemented, reaching a RR of 46.6% [Hör09b]. One of the main reasons might be the presegmentation of the videos in this approach, which allowed for a more general analysis than forcing a decision for every frame. Comparable results with a recognition rate of 61.8% or a FER of 38.1% have been achieved using semantic features and HMMs [Hör09a].

5.4. Recognition of Low Level Trajectory Events

The recognition of complex events on trajectory level requires a detailed analysis of temporal events. A trajectory can be interpreted as an object projected into the ground plane, and therefore techniques from the 2D domain can be used. According to Francois [Fra04] and Choi [Cho08] the most relevant events are defined as follows: continue, appear, disappear, split, and merge.

9 Each FOV is referred to as a video mode. Seven views were available for the recognition task.


As only split and merge cannot be handled by the tracking module, these have to be handled separately. Additionally, motion patterns, such as speed and stationarity, are being analyzed.

5.4.1. Stationary Object Detection

For some scenarios, such as left luggage detection, objects not altering their spatial position have to be picked up in a video sequence. Due to noise in the video material or slight changes in the detector output, e. g. the median of a particle filter, the object location is slightly jittering. A simple spatial threshold over time is usually not adequate, because the jitter might vary in intensity over time. Therefore the object position p_i(t) is averaged over the last N frames:

\bar{p}_i = \frac{1}{N} \sum_{t'=t-N}^{t} p_i(t') \qquad (5.2)

Subsequently the normalized variance in both x- and y-direction

\sigma_i(t) = \frac{1}{N} \sum_{t'=t-N}^{t} \left( p_i(t') - \bar{p}_i \right)^2 \qquad (5.3)

is computed [Auv06, Ars07a]. This step is required to smooth noise created by the sensors and errors during image processing. Stationarity can then be assumed for objects with a lower variance than a predefined threshold θ:

stationarity = \begin{cases} 1 & \text{if } \sigma_i(t) < \theta \\ 0 & \text{else} \end{cases} \qquad (5.4)

where 1 indicates stationarity and 0 represents walking or running. Given only the location coordinates this method does not discriminate between pedestrians and other objects, enabling the stationarity detection for any given object in the scene. A detection example is illustrated in fig. 5.14.
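A minimal Python sketch of this stationarity test, following eqs. (5.2)-(5.4); the window length N and the threshold θ are illustrative values, not the ones used in the experiments.

```python
import numpy as np

def is_stationary(positions, N=25, theta=0.05):
    """Average the last N ground-plane positions, compute the variance
    around that mean per axis and threshold it (eqs. 5.2-5.4).
    positions: sequence of (x, y) tuples in meters, at least N entries."""
    window = np.asarray(positions[-N:], dtype=float)
    mean = window.mean(axis=0)                       # eq. (5.2)
    variance = ((window - mean) ** 2).mean(axis=0)   # eq. (5.3), per axis
    return bool(np.all(variance < theta))            # eq. (5.4)

# A slightly jittering object is still reported as stationary
track = [(2.0 + 0.01 * np.sin(t), 1.0 + 0.01 * np.cos(t)) for t in range(30)]
print(is_stationary(track))  # True
```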

5.4.2. Discriminating Between Walking and Running

Various gait recognition systems [Wan06, Che06], based on machine learning techniques, have been investigated in the past. Their aim is to recognize pedestrians from gait or discriminate between different kinds of gait, such as walking, running or sneaking. These are commonly trained with 2D data acquired from a predefined field of view, which cannot be guaranteed in a real world situation. Retraining these algorithms for every possible system setup is a rather expensive task, as video material has to be collected and annotated. Considering the trajectories projected into a virtual top view, a human operator would probably analyze the object's speed to discriminate between walking and running. This observation is utilized in this work. Defining walking as movement up to a


maximum speed, here v_max = 6 km/h = 1.66 m/s, which requires that a person is not detected as stationary, and faster movements as running, a simple thresholding operation can be performed:

s(t) = \begin{cases} \text{walking} & \text{if } v_i(t) < v_{max} \text{ and not stationary} \\ \text{running} & \text{if } v_i(t) > v_{max} \end{cases} \qquad (5.5)

The speed v_i(t) can be computed easily with the covered distance in meters and the frame rate of the captured video in frames per second (fps):

v_i(t) = \sqrt{(x_i(t) - x_i(t-1))^2 + (y_i(t) - y_i(t-1))^2} \cdot fps. \qquad (5.6)

Once again jitter in the detection process is flattened by averaging the frame based results over time. Experience has shown that the summation of up to 25 frames is sufficient for this task. While the discrimination between walking and running relies solely on the covered distance, the direction of motion can simply be computed as the difference between two adjacent positions \vec{d}_i(t) = \vec{p}_i(t) - \vec{p}_i(t-1).
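The speed-based rule of eqs. (5.5) and (5.6) can be written down in a few lines; the frame rate and the positions in the example are assumed values.

```python
import numpy as np

def speed(p_now, p_prev, fps=25.0):
    """Covered ground-plane distance scaled by the frame rate, eq. (5.6)."""
    return np.hypot(p_now[0] - p_prev[0], p_now[1] - p_prev[1]) * fps

def gait_label(p_now, p_prev, stationary, v_max=1.66, fps=25.0):
    """Rule from eq. (5.5): walking below v_max (unless stationary), running above."""
    if stationary:
        return "stationary"
    return "walking" if speed(p_now, p_prev, fps) < v_max else "running"

print(gait_label((0.05, 0.0), (0.0, 0.0), stationary=False))  # ~1.25 m/s -> walking
print(gait_label((0.10, 0.0), (0.0, 0.0), stationary=False))  # ~2.50 m/s -> running
```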

5.4.3. Detection of Splits and Mergers

According to Hu [Per06] splits and merges have to be detected in order to maintain IDs in the tracking task. Guler [Gul01] tried to handle these as low level events describing more complex scenarios, such as people getting out of cars or forming crowds. A merger usually appears in case two previously independent objects O_1(t) and O_2(t) unite to a normally bigger one

O_{12}(t) = O_1(t-1) \cup O_2(t-1). \qquad (5.7)

This observation is usually made in case two objects are either located extremely close to each other or touch one another in 3D, whereas in 2D a partial occlusion might be the reason for a merger. In contrast, two objects O_1^1(t) and O_1^2(t) can be created by a splitting object O_1(t-1), which might have been created by a previous merger. While others usually analyze object texture and luminance [Vig01], the herein applied rule based approach only relies on the object position and the regions' sizes. Disappearing and appearing objects have to be recognized during the tracking process, in order to incorporate a split or merge:

• Merge: one object disappears, but two objects can be mapped onto one and the same object during tracking. In an optimal case both surfaces would intersect with the resulting bigger surface

O_1(t-1) \cap O_{12}(t) \;\wedge\; O_2(t-1) \cap O_{12}(t). \qquad (5.8)

• Split: similar to the merge, two objects at frame t are mapped to one object at time t-1, where the objects both intersect with the old splitting one

O_1^1(t) \cap O_1(t-1) \;\wedge\; O_1^2(t) \cap O_1(t-1). \qquad (5.9)
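A sketch of these two rules, using axis-aligned bounding boxes as a stand-in for the object surfaces; the box representation and the example coordinates are simplifying assumptions.

```python
def overlap(a, b):
    """Axis-aligned boxes (x1, y1, x2, y2); True if they intersect."""
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def detect_merge(obj1_prev, obj2_prev, obj_now):
    """Eq. (5.8): both previous objects intersect the new, bigger one."""
    return overlap(obj1_prev, obj_now) and overlap(obj2_prev, obj_now)

def detect_split(obj_prev, obj1_now, obj2_now):
    """Eq. (5.9): both current objects intersect the old, splitting one."""
    return overlap(obj1_now, obj_prev) and overlap(obj2_now, obj_prev)

# A person (one blob) drops a bag: the old blob overlaps both new blobs -> split
person_prev = (0.0, 0.0, 1.0, 2.0)
person_now, bag_now = (0.4, 0.0, 1.2, 2.0), (0.0, 0.0, 0.3, 0.4)
print(detect_split(person_prev, person_now, bag_now))  # True
```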

5.4.4. Detection of Group Movements

As in various cases persons are interacting with each other, it seems reasonable to model combined motions. This can be done according to the direction of movement, proximity of objects, and velocity. As the direction of motion can be computed easily, it is possible to elongate the motion vector \vec{v} and compute intersections with interesting objects or other motion vectors. Further, the distance between object positions can easily be determined with

d_{ij} = \sqrt{(x_i(t) - x_j(t))^2 + (y_i(t) - y_j(t))^2}. \qquad (5.10)

Thereby the most relevant LLAs can be detected by applying simple heuristics, as already employed for left luggage detection [Auv06]. Among the required activities the following need to be detected:

• Approaching a stationary object or person: the mean motion vector is simply elongated and intersections with stationary persons or objects are computed. If an intersection is detected and maintained for a time t > θ, the person is approaching a stationary object.

• Two persons walking or standing next to each other: the distance between all objects in the scene is computed continuously over time. In case the distance is constant over time, allowing some variance, or getting smaller over time, and both persons are heading in the same direction with the same speed, the objects are considered walking or standing next to each other, as illustrated in fig. 5.14.

• A person following another one: two persons are heading in the same direction for a time t > θ for a pre-defined threshold.

• Two persons approaching each other: the distance of two persons is getting smaller over time and the elongated motion vectors are intersecting at any time.

Utilizing these simple rules, it is possible to model all cases with a simple yet effective method.
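As an example of such a heuristic, the following sketch checks whether two tracks qualify as walking or standing next to each other; the distance and angle tolerances are assumed values.

```python
import numpy as np

def walking_together(track_a, track_b, dist_tol=0.5, angle_tol=0.5):
    """Heuristic for 'walking or standing next to each other': the mutual
    distance (eq. 5.10) stays roughly constant and both motion vectors
    point in the same direction."""
    a, b = np.asarray(track_a, float), np.asarray(track_b, float)
    dists = np.linalg.norm(a - b, axis=1)
    stable_distance = dists.max() - dists.min() < dist_tol
    va, vb = a[-1] - a[0], b[-1] - b[0]
    if np.linalg.norm(va) < 1e-6 or np.linalg.norm(vb) < 1e-6:
        return stable_distance                      # both (nearly) standing still
    cos_angle = np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb))
    return stable_distance and cos_angle > np.cos(angle_tol)

a = [(t * 0.1, 0.0) for t in range(25)]
b = [(t * 0.1, 0.8) for t in range(25)]
print(walking_together(a, b))  # True: same direction, constant 0.8 m apart
```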

5.5. The PETS 2007 Challenge

5.5.1. Loitering Person Detection

According to the British authorities, a person usually observes a scene for a while until the supposedly right point in time has come, prior to performing a potential threat. He is waiting for external circumstances, which have to be met.




Figure 5.14.: Example for detected low level trajectory events. The scene contains two stationary and one walking person. All three LLAs have been detected. Additionally the small distance between object 1 and 3 has been detected. The persons' positions on the ground plane are illustrated on the left, the IDs are illustrated in the tracking view in the middle, and a close up of the scene is provided on the right.

Observations are frequently performed from a well-defined place in the scenery, where the person tries not to draw attention to himself, requiring steady movement in a crowded environment. Therefore it is important to monitor the visibility of pedestrians in sensitive areas. The PETS2007 challenge defines loitering as a subject being located in the field of view longer than a predefined hard time threshold, here θ_loit > 60 s [Fer07]. This kind of behavior can be easily recognized with a rule based approach, which has been integrated into the person tracking module [Ars07a]. While tracking, an individual object's age, meaning the time an object has been visible in the scene, can be determined by simply counting the frames for which an object ID is maintained, and subsequently weighting the number of frames with the number of frames captured per second:

age = \frac{framecount}{fps}. \qquad (5.11)

Tracks older than θ_loit will now trigger an alarm, as visualized in fig. 5.15 a). With this simple approach both loitering persons in the dataset have been recognized, while providing ten additional alerts, which were not annotated in the provided ground truth. The subsequent visualization and manual analysis of the outputs showed by far better results than the simple comparison of detected alerts and the manually annotated ground truth. All actors were usually strolling around for some time before dropping off or swapping a piece of luggage, hence creating six additional alerts. In practice these would not be considered as errors, because an operator's attention is drawn to a possible threat even earlier, which is actually wanted. Analysis has been performed on blob level, not discriminating between left luggage and pedestrians. The integration of a luggage piece detector as presented in [Dam08] or a pedestrian detection system [Pap00] could eliminate two more questionable false positives. Challenging lighting situations, creating ghost objects as in fig. 5.15 b), are the reason for two further errors.
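The underlying bookkeeping can be sketched as a small track-age counter; the thresholds, the frame rate and the (omitted) handling of disappearing IDs are simplified assumptions.

```python
class LoiteringDetector:
    """Counts the frames each track ID has been visible (eq. 5.11) and
    reports the IDs whose age exceeds the loitering threshold."""

    def __init__(self, fps=25.0, theta_loit=60.0):
        self.fps = fps
        self.theta_loit = theta_loit
        self.frame_counts = {}

    def update(self, visible_ids):
        """Call once per frame with the IDs currently present in the scene."""
        for track_id in visible_ids:
            self.frame_counts[track_id] = self.frame_counts.get(track_id, 0) + 1
        return [tid for tid, count in self.frame_counts.items()
                if count / self.fps > self.theta_loit]

detector = LoiteringDetector()
alerts = []
for frame in range(1600):           # ~64 s of video at 25 fps
    alerts = detector.update([1])   # track 1 stays in view the whole time
print(alerts)  # [1]
```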



Figure 5.15.: Example for a correctly detected loitering person (a) and a backpack detected as loitering person (b).

5.5.2. Left Luggage Detection

After the failed terror attacks of Cologne, Germany, in 2006, so-called backpack bombers are in the focus of the authorities' attention. People are not supposed to leave their luggage unattended in public spaces and should report left items to security forces, which is stressed by frequent announcements at airports and train stations. A luggage item could contain explosives or other dangerous goods and is therefore considered a potential threat. Besides the immediate detection of a potential threat, an automated system could provide a precise time stamp for forensic analysis and create a link between the owner and the left item. Luggage is considered attended as long as its owner is located within a three meter radius. Leaving this radius for more than 25 s will result in an alarm event. The luggage item should not be re-attended by the owner or attended by a second party within this predefined time period. Due to the lack of training material and the large amount of possible camera views and poses, most machine learning techniques seem inappropriate, though for instance temporal boosting has shown promising results [Smi05, Can07]. Therefore a system based on simple heuristics, inspired by Auvinet [Auv06], has been implemented in [Ars07a]. In contrast to other works [Bha07, Kra06, Dam08] this approach relies only on detected object positions in the ground plane and does not require the detection of a so-called luggage class. A simple set of rules, as visualized in fig. 5.16, can be interpreted as a set of few detectable low level activities on trajectory level, where split and merge are commonly agreed to be among the most relevant ones [Fra04, Cho08]. The steps of the activity "Leaving Luggage Behind" can be summed up as follows:

• Trajectory split: a person carrying a piece of baggage is usually detected as one blob in all camera views. Dropping off an item will obviously create a second object next to the original position. The analysis of the trajectory will result in an object split, where the new object inherits all object properties of the old one.




Figure 5.16.: Schematic and exemplary visualization of leaving a piece of luggage. During the tracking of a person a forking event is recognized, resulting in a stationary item and a moving object. The luggage item is considered as left behind if its original owner is located further than 3m for more than 25s.

The relationship between both objects is memorized within the event handling module. Considering the size of luggage, which is usually smaller than a person, it has been decided to treat the smaller object as candidate.

• Stationary object detected: abandoned baggage is usually of stationary nature and not moving at all. Even humans standing or sitting in a spot would show some minimal motion. Therefore it can be assumed that baggage can be represented by small stationary objects, where some variance is allowed. Its size can easily be determined via homography. Nevertheless bigger objects, e. g. a ski bag, have to be considered.

• Person leaves stationary object behind: in the next step the distance d between the stationary object and the moving person is measured with

d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2} \qquad (5.12)

and subsequently thresholded. In case d > 3 m, as defined in the PETS2007 challenge, the stationary luggage item is considered as unattended. This state will change to left luggage in case it has been valid for more than 25 s or the owner's track is lost and cannot be reacquired.



Figure 5.17.: Unattended luggage warnings for scene S07 and S08 of the PETS2007 dataset, followed by an alert in S08.

A person other than the owner entering the three meter warning zone should not affect the scenario recognition module.
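A sketch of the resulting rule as a small state machine; it assumes that the split and the stationarity test have already been handled upstream, and while the radius and timing constants follow the PETS2007 definition, all names and the interface are illustrative.

```python
import math

class LeftLuggageMonitor:
    """After a split has produced a stationary item, check the owner's
    distance every frame: beyond 3 m the item counts as unattended, and
    after 25 s (or if the owner's track is lost) it is reported as left."""

    def __init__(self, fps=25.0, warn_radius=3.0, alarm_after=25.0):
        self.fps = fps
        self.warn_radius = warn_radius
        self.alarm_frames = int(alarm_after * fps)
        self.unattended_count = 0

    def update(self, item_pos, owner_pos):
        """item_pos: (x, y) of the stationary item; owner_pos: (x, y) of the
        original owner, or None if the track has been lost.
        Returns 'attended', 'unattended' or 'left'."""
        if owner_pos is not None:
            d = math.hypot(item_pos[0] - owner_pos[0], item_pos[1] - owner_pos[1])
            if d <= self.warn_radius:
                self.unattended_count = 0           # owner (re-)attends the item
                return "attended"
        self.unattended_count += 1
        return "left" if self.unattended_count >= self.alarm_frames else "unattended"

monitor = LeftLuggageMonitor()
status = "attended"
for frame in range(30 * 25):                        # owner stays 5 m away for 30 s
    status = monitor.update((0.0, 0.0), (5.0, 0.0))
print(status)  # 'left'
```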

The PETS2007 dataset contains two left luggage scenarios, which had to be detected without any training material. This task is considered by far more complex than the detection of loitering persons. Especially the split and merge detection tends to be rather difficult in very crowded situations. The maximum distance between two objects has to be set carefully and be fitted to each application scenario. Here the maximum distance for splits was set to 0.5 m, resulting in no false positive and no miss for left luggage detection. Table 5.11 shows the time stamps for the detection of unattended and abandoned luggage. These were consistent with the ground truth, allowing a delay of one second. Additionally the position of the original owner (x_own, y_own) is indicated if the person is still located in the tracking region. The x- and y-coordinates of a bag left behind are also determined and compared to the labeled position P(0, 0), resulting in an average difference of approximately only 0.09 m, confirming the high localization performance of the multi camera person tracking system. Fig. 5.17 shows the resulting unattended luggage warnings for both scenes, S07 and S08, of the PETS2007 dataset. The warning in scene S08 is followed by an alert. No alert has been raised in scene S07, as the luggage item has been picked up by the owner again.

5.6. Closure

This chapter has demonstrated various approaches for human activity detection. While both facial expression and seated person activity recognition rely on machine learning techniques, the analysis of trajectories has been entirely conducted with a set of simple heuristics. This observation is quite notable, as data collection and annotation is a time-consuming and expensive task. Further, the captured data possibly has to be



Scene          td      x       y     x_own   y_own
S07 unatt.   1491   -0.01   -0.20      NA      NA
S08 unatt.   1147   -0.14    0.04   -2.38   -1.05
S08 left     1773   -0.12    0.04      NA      NA

Table 5.11.: Timestamps and positions for unattended and left luggage in S07 and S08. NA (not available) indicates that the original owner has walked out of the scene.

adapted to various sensor setups or even re-recorded, which is avoided by the presented approach. Therefore, systems based on expert knowledge seem favorable. Nevertheless machine learning techniques are required in more complex scenarios, where a wide range of influences and features have to be considered. Human activity, represented by a set of meaningful low level features, can be recognized by various classifiers, which had to be evaluated for each scenario. It has not been possible to determine one method that is capable of coping with all application scenarios. Nevertheless both HMMs and SVMs have shown reliable results in most scenarios. Although behaviors are represented by dynamic feature sequences of varying length, a segmentation and static classification with SVMs frequently outperformed HMMs. The results presented in this chapter are a quite good indicator for further development, as they illustrate that human activities can be discriminated automatically. Nevertheless further research has to be conducted in a wide range of fields, beginning with a more detailed representation of the human body [Cha06], where the limbs are modeled in detail for a more elaborate motion analysis. With advances in technology, the main problem of data collection can be partially solved by applying so-called motion capture suits that create a 3D model of the person. It should be possible to match this data to video sequences recorded from an arbitrary view [Ram03]. Especially a probabilistic behavior model, even if it is based on expert knowledge only, should improve the activity recognition task. Integrating further observations in, e. g., a Bayesian network will enhance the correct interpretation, although a person's behavior could have multiple causes. In case LLAs are modeled by such a network, it seems reasonable to use a dynamic model, for instance a DBN, to model state transitions between observed LLAs [Car06].


Chapter 6. Conclusion and Outlook

The goal of this thesis has been to investigate and implement algorithms for automated video surveillance applications. The behavioral analysis is thereby usually the last step in the processing chain, and requires a range of inputs prior to the activity recognition task. The main focus during the selection of appropriate algorithms and their implementation has been set on real time capability and robustness. Therefore mainly simple, yet powerful, algorithms have been chosen. As shown in the introduction, the structure of most common surveillance systems is based on detection, tracking, and a subsequent analysis for unruly behavior of persons using trajectories and motion patterns. Though this segmentation into various tasks might not be the optimal solution, promising results have been achieved throughout this work. The results of each component will be briefly discussed in the following section. Furthermore possible improvements for the implemented components will be pointed out in the outlook section.

6.1. Conclusion

Prior to the extraction of motion patterns or trajectories, the regions and objects of interest have to be detected. This has been done with rising complexity, where the simple presence of objects can be indicated by change or foreground detection. In a further step the limited region of interest has been searched for human bodies and faces. Hereby approaches based on Haar basis features achieved detection rates above 90% at a low false positive rate for both classes. Furthermore, it has been shown that a body part based pedestrian detection performs significantly better than holistic approaches, which indicates a more generalizing model creation. The face class has been further refined by the detection of fiducial points using elastic bunch graphs. As only a limited amount of information can be derived from a single observation, the detected objects have to be tracked over time in a sequence of observations. This is required for activity recognition on the one hand and to speed up the detection process on the other hand. Tracking is usually based on the underlying detector, which is frequently applying a generalizing model for the localization task. Both Kalman filters and the


condensation algorithm demonstrated a high ID maintenance in scenarios containing few objects. Confusions are frequently created in crowded scenes, as the object appearance is not being taken into account and the models lack discriminative abilities. Therefore a representation with a deformable feature graph has been introduced to overcome this problem. SIFT features are extracted from the object region and traced over time using their descriptors and spatial configuration. The graph representation is thereby considered more stable than the ordinary matching of SIFT descriptors, as mismatches can be avoided. In order to maintain stable models, the graph has to be updated permanently, and by applying graph similarity measures even re-recognition could be performed.

Due to the limited depth perception of traditional video cameras, both object detection and tracking cannot be reliably conducted in crowded scenarios, where heavy occlusions are observable. Therefore the use of multiple camera tracking systems, which cover a scene from different fields of view, has been introduced in the past to resolve occlusions. The presented approach is based on the commonly known homography constraint. This is usually applied to localize feet positions in overlapping views, which results in a segmentation of objects and a low localization precision. Hence this work suggests applying homography in multiple layers and creating a rough 3D model of the scene, which has drastically lowered the localization error to 0.13 cm. Furthermore, the arising problem of ghost objects, which are created in inconvenient constellations, has been addressed. The number of falsely detected objects could be lowered by utilizing geometrical constraints and introducing a false positive candidate class. Classical multi camera tracking approaches further rely on the detected ground floor position for ID maintenance. While the popular Kalman filter has demonstrated quite reliable performance, 15 ID changes have been observed during tests on the PETS2007 dataset. The introduction of a further cue, here 2D tracking with deformable feature graphs, led to a higher ID maintenance and resulted in only two ID changes. After the extraction of trajectories and motion patterns, potential threats can be detected in video sequences. While facial expressions have been examined in general, security related activities of seated and walking persons have been tested in an application context. These arose from projects funded by the European Union or from the PETS challenge. It has been shown that all presented approaches perform reasonably using recorded data. Both seated person behavior analysis, either in aircrafts or in meetings, and facial activity recognition have been conducted with machine learning techniques. Thereby it is not yet possible to determine an approach that is suitable for all tasks, as HMMs and SVMs performed with similar accuracy. Nevertheless the reliability of the fusion of low level features has been impressively demonstrated. These approaches usually suffer from the need for large amounts of training data. It has been shown that trajectories can be robustly analyzed using simple heuristics. This way loitering people and left luggage could be recognized without any flaws.


6.2. Future Developments

6.2. Future Developments The present thesis has shown the impressive current state of automated video surveillance systems. For most of the tasks, from facial expression to luggage related event recognition, reliable systems could be implemented with focus on real time capability. However, research activities cannot be concluded at the current state, but could use the presented work as baseline for further research. Though object detection and localization operate sufficiently for the presented tasks, all steps still require special attention, beginning with the considerably simple task of foreground segmentation, which usually requires manual tweaking of parameters depending on environmental influences. These should be modeled automatically, so that even parameter adjustment is not required anymore. Nevertheless, the detected blobs are a valuable source for more complex detectors based on machine learning techniques. As most objects of interest are either highly deformable or usually appear in complex scenarios, holistic approaches with narrow constraints are not practicable. Therefore a modular definition of objects seems reasonable, as has been shown with first attempts to detect body parts. Considering the use of basic geometric objects, their arrangements and introducing novel visual words, a by far more reliable object representation should be possible. With newly emerging object detection approaches current trackers based on the underlying detector output might also perform more reliable, as the measurement will be performed more robustly. Unfortunately this assumption is only correct for small amounts of objects, due to the lack of discriminative abilities. As shown appearance based methods are more robust and should be further investigated. Besides the obvious introduction of novel features that are distinctive at small scales and little contrast, especially the update procedure requires further research. Current methods cannot determine the confidence of a detected feature point and subsequently incorporate weights into the feature graph. Further, dismissed features are removed from the graph at the moment, although these might be used for re-recognition of objects after reentry or total occlusion. Therefore distinctive 3D models could be estimated from 2D data for further enhancement. A general weakness of all investigated approaches is the need of motion modeling, which is frequently faulty due to the nonlinearity of human motion. This could be modeled with more elaborate methods such as DBNs or HMMs. Next, multi camera person tracking applications have been successfully introduced for a detailed scene analysis in crowded spaces that cannot be reliably monitored from one field of view. Although localization performance is acceptable in moderately crowded scenes, overcrowding cannot be handled by the homography approach. Therefore a more sophisticated segmentation method could be utilized in each sensor view, to create more information that can be exploited by the fusion framework. Further incorporation of 2D information, e.g. texture, could also enhance tracking performance. While this has been already demonstrated in single views, a 3D person model could be created in case the


person can be reconstructed and the texture can be mapped onto the resulting polygon. All features, either motion patterns or trajectories, have been employed for behavioral analysis. As shown, even simple heuristics can lead to remarkable results in some application scenarios. Nevertheless it has not been possible to demonstrate a method fitting the needs of all interesting behaviors. Extracting more stable features will probably enhance recognition performance, as this is the most common error source. With the introduction of further LLAs, which could be modeled with BNs, a more complex scenario description can be designed. This step unfortunately requires the collection of more training data. While the behavior of a single person can be considered as a sequence of states, methods to correlate behaviors of multiple individuals are required. Coupled DBNs could be used for this task, although the inference problem has not yet been solved. Summed up, the methods presented in this work could be used as assistance for a human operated CCTV system, helping staff to focus attention on noticeable persons at a low false positive rate, though at the same time ensuring minimal false negatives. However, additional research will be needed to fully automate a surveillance system in the far future.


Appendix A. Databases

A.1. The SAFEE Facial Activity Corpus

Common databases that are related to facial activities usually contain the six basic emotions suggested by Ekman [Ekm84], and neutral as an additional one [HUM04]. Especially potentially security relevant actions, such as yawning or activities that are not related to emotion, are usually not included in public databases. Consequently it has been decided to create a new database with real facial actions, the so-called SAFEE Facial Expression Corpus (SAFEE-FAC). It has been decided to record laughing, speaking, yawning and other facial expressions. It can be assumed that laughing, speaking and a wide range of other activities can be naturally produced in an interview situation. However, real yawning cannot be simulated, as most people consider this activity as simply opening the mouth, where a maximum distance between upper and lower lip is frequently chosen. It is noticeable that even untrained persons can easily discriminate between a fake and a real yawn. Therefore most of the recordings were performed late at night or early in the morning, because of the high probability of observing tired test subjects.

A flight situation has been simulated during the recordings. The passengers were told what is happening on board and they had to react to the announcements. The resulting reactions were recorded with a PAL resolution camera, whose field of view usually covered the face and upper body of the test subject. Speaking can be considered real, as can laughing, since the test subjects were amused by some of the announcements. The persons were also advised not to suppress any activities; this way some real yawns could be recorded within the short sessions. After segmenting the video material, sequences were chosen in such a manner that the face had to be looking almost straight into the camera for at least 15 frames. Rotation in the image plane was no issue. This way at least 101 samples for each of the classes have been extracted, with an average length of 25 frames. The face is about 200 × 200 pixels large. Begin and end of the sequences were not fixed, so any transitions between facial activities may appear.



Figure A.1.: Examples from the SAFEE-FAC. From left to right: speaking, laughing, yawning and neutral movements.

Figure A.2.: Examples for all seven emotions in the FEEDTUM database (anger, happiness, disgust, neutral, fear, sadness, surprise).

A.2. The FEEDTUM Corpus

As emotional databases with real data have been sparse for a long time, a new database, the so-called FEEDTUM database, has been recorded at the Institute for Human Machine Communication [Wal06b]. The corpus currently contains the six basic emotions described by Ekman, namely anger, disgust, fear, happiness, sadness and surprise, for each of the 18 recorded subjects. Each of the emotions has been recorded three times. Additionally faces with neutral facial expression have been included, resulting in seven classes. To elicit the emotions as naturally as possible it has been decided to play several carefully selected stimulus video clips and record the participants' reactions. For this purpose a video monitor together with a camera mounted on top was employed, which enables a direct frontal view. Both devices were controlled by dedicated software that induced the desired emotions and started the recordings at the expected times.



A.3. The Airplane Behavior Corpus

Experts in the field of aircraft security defined a set of six activities, which could be important indicators for potential threats and therefore should be detected. These are namely aggressive, cheerful, intoxicated, nervous, neutral and tired. Due to the lack of freely available databases, a large database has been recorded. In order to obtain equivalent conditions for several subjects and diverse classes, acted data has been recorded. It is believed that mood induction procedures create more realistic reactions. Therefore a scenario has been developed, which leads the subjects through a guided storyline. Five speakers have recorded announcements, which a hidden test conductor plays to the actors. As a general framework a vacation flight with return flight was chosen, consisting of 13 and 10 scenes, such as start, serving the wrong food, turbulences, conversation with a neighbor or falling asleep. Respecting a possible setup of the camera in the seat's back rest in front of each passenger, the activity radius is restricted to the head including the upper body, see figure A.3. A seat for the subject was positioned in front of a blue screen. A condenser microphone and a DV-camera were mounted without occluding the subjects. Eight actors, both male and female, between 25 and 48 years in age, took part in these recordings, which created a total of 11.5 h of video material. This has been pre-segmented and annotated by three experienced male persons. In total 460 video clips with an average length of 8.4 s have been recorded.

A.4. The AMI Database

For this task a subset of the AMI-Corpus [Car05], which has been recorded for the AMI project at the IDIAP Smart Meeting Room, with a total length of three hours was created. The subset contains 36 meetings with a length of five minutes each. The IDIAP room is equipped with seven cameras, 22 microphones, a white board and a projector with a screen. Four of these cameras record close-up views of the four participants, two cameras show the left and the right side of the table, respectively, and one camera captures the white board and the projector screen, as displayed in fig. A.4. Eight microphones capture close-talking audio which is used for the video-editing task.

A.5. The PETS2007 Multi Camera Database

The PETS 2007 benchmark data set presents four typical security relevant problems at a busy airport terminal. The first is the detection of luggage left unattended for more than 25 s. This seems a relevant task for law enforcement, as luggage containing explosives or chemical threats could be left behind. The difficulty in this task is to reliably detect luggage items and additionally determine the owner of the luggage, in case they have left the item unattended for at least 25 s. For this problem several approaches have been

1 EU funded FP6 project: AMI (Augmented Multi Party Interaction).



Figure A.3.: Examples for all six emotions in the Airplane Behavior Corpus (aggressive, nervous, cheerful, neutral, intoxicated, tired).

Figure A.4.: FOV of all seven cameras used for recordings in the IDIAP meeting room.



Figure A.5.: All four views of the PETS 2007 data set

already described in the PETS 2006 challenge [Thi06]. The second task is the detection of loitering people. A person is considered loitering if she enters a view and stays there for at least 60 s. As a third scenario luggage theft is considered. Theft is defined as an item of luggage moved further than three meters away from the original owner. A variation would be two individuals swapping a luggage item as a fourth scenario. Whether the initial owner notices this procedure or not should not be an issue, as both scenarios are realistic. For each of the four scenarios two data sets, recorded from four camera views, are provided. The views were carefully chosen to provide the best possible camera constellation both for single and multiple camera tracking.

A.6. The PROMETHEUS Outdoor Scenario

One of the integral parts of the PROMETHEUS corpus is the security related outdoor scenario [Nta09]. It has been recorded in an outdoor facility using three synchronized overview Firewire cameras with a resolution of 1076 × 768 pixels. These were utilized to track persons along the paths and the lawn in the scene. The cameras were set up respecting the scene geometry, in order to resolve occlusions created by trees and bushes. Furthermore lenses with a short focal length have been installed to enlarge the field of view. Additionally a detail camera with PAL resolution has been installed at the ATM, providing a more detailed view of the relevant region. This way even the persons' limbs could be modeled. Furthermore a photonic mixer device, which creates a depth image of the scene, has been used in front of the ATM and can be used to resolve occlusions in dense environments.



Figure A.6.: All four views of the PROMETHEUS outdoor scenario

As the recordings were conducted in a public place, multiple people and groups could be observed in the video material. Eleven actors have been engaged to simulate both luggage and ATM related events, which will be addressed in this work. The actors were told to draw money at a simulated ATM machine and eventually queue behind a person operating the ATM. Throughout the three hours of video material the behavior of operating the ATM has been recorded 12 times, whereas only six robberies occurred. While an actor was drawing money, in some cases another actor had been instructed to rob the person at the ATM. The robber would approach the person, grab the money or hand bag, and run away in a random direction. As the actors did not know when they might be robbed, the reactions were quite spontaneous and diverse. Some were shouting and following the thief, others were just standing in front of the ATM and screaming for help. Screams have been recorded by a microphone array behind the ATM, although audio is not used in this part of the work. In order to be able to evaluate the system's performance, the entire three hours of video material has been manually annotated. Thereby the persons' positions have been determined in world coordinates for every fifth frame of the sequence. Furthermore the timestamps of ATM incidents have also been annotated.


Appendix B. Classification Methods

Various classification methods have been evaluated throughout this work without any details on their functionality or possible configurations. This section provides a brief summary of distance measures, SVMs, NNs, and HMMs and explains the basic principle of all four classifiers. The interested reader can find more detail in the referenced literature.

B.1. Distance Measures

Probably the simplest, and yet quite effective, classifiers are distance based. The general idea is to compute the distance between a sample x and a reference vector x_κ for a class κ. An exhaustive learning phase is usually not required, as the sample is compared to the reference during the classification task. This fact allows a very flexible framework, as new classes can be added at any time. The classification task itself can be quite time consuming, as all possible references have to be compared. Thereby the entries of the reference and sample vector with dimensionality n are compared. The most common measure [Fuk87] is the Euclidean distance

d(\vec{x}, \vec{x}_\kappa) = \sqrt{\sum_{i=1}^{n} (x_i - x_{\kappa,i})^2},   (B.1)

which is a special case of the so called Minkowski measure. In order to compare two probability density functions, such as histograms, the so called Bhattacharyya distance is computed with

d(\vec{x}, \vec{x}_{REF}) = \sqrt{1 - \sum_{i=1}^{n} \sqrt{x_i \, x_{\kappa,i}}}.   (B.2)

Based on the chosen distance measure, the recognized class κ_r is determined by evaluating the smallest distance with

\kappa_r = \underset{\kappa}{\operatorname{argmin}} \; d(\vec{x}, \vec{x}_\kappa).   (B.3)
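As a minimal illustration of eqs. B.1 to B.3, a nearest-reference classifier can be sketched in a few lines of Python; the reference vectors, class labels, and test sample below are arbitrary toy values.

import numpy as np

def euclidean(x, ref):
    # eq. B.1: Euclidean distance between sample and reference
    return np.sqrt(np.sum((x - ref) ** 2))

def bhattacharyya(x, ref):
    # eq. B.2: distance between two normalized histograms
    return np.sqrt(max(0.0, 1.0 - np.sum(np.sqrt(x * ref))))

def classify(x, references, distance=euclidean):
    # eq. B.3: choose the class whose reference has the smallest distance
    return min(references, key=lambda label: distance(x, references[label]))

# toy example with two reference histograms (placeholder class names)
refs = {"class_a": np.array([0.7, 0.2, 0.1]),
        "class_b": np.array([0.1, 0.3, 0.6])}
print(classify(np.array([0.6, 0.3, 0.1]), refs, distance=bhattacharyya))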

B.2. Support Vector Machines

Support Vector Machines, as introduced by Vapnik [Vap95], are a popular classification approach, as they are independent of the vector size and are able to create a highly generalizing model. Based on statistical learning theory, SVMs are capable of robustly separating two classes after analyzing an initial training set

\Lambda = \{(\vec{x}_i, y_i) \mid i = 1, \ldots, I\} \quad \text{with} \quad \vec{x}_i \in \mathbb{R}^n \text{ and } y_i \in \{+1, -1\},   (B.4)

where \vec{x}_i denotes a training sample, which can be of any dimensionality n. The set of all training samples is split into positive examples \vec{x}_i \in \Omega_1 with y_i = +1 and negative examples \vec{x}_j \in \Omega_2 with y_j = -1. Given the data's linear separability in the feature space \mathbb{R}^n [Bur98], the data is separated by a so called hyper plane \pi(\vec{w}, b), defined by a vector \vec{w} and a margin b,

\pi(\vec{x}_i) = \vec{w}^T \vec{x}_i + b.   (B.5)

For each example the so called functional margin, the distance between a feature and the hyper plane, can be computed with

\gamma(\vec{x}_i) = y_i (\vec{w}^T \vec{x}_i + b),   (B.6)

where \gamma(\vec{x}_i) > 0 denotes a correctly classified input vector \vec{x}_i. In case \vec{w} and b are normalized with \frac{1}{\|\vec{w}\|}, the margin is simply the Euclidean distance. The margin of separation \zeta_\Lambda consequently is the minimum absolute value of all distances between feature vectors \vec{x}_i \in \Lambda and the hyper plane \pi:

\zeta_\Lambda = \min_i \left( \gamma(\vec{x}_i) \right).   (B.7)

A reliable and discriminative classifier for unseen data can only be achieved if this margin is maximized. Therefore a hyper plane which separates the classes in \Lambda with a maximum value of \zeta_\Lambda has to be found by solving the following optimization problem:

\min_{\vec{w}, b, \gamma} -\gamma \quad \text{subject to} \quad y_i(\vec{w}^T \vec{x}_i + b) \geq \gamma, \quad \|\vec{w}\|^2 = 1.   (B.8)

Alternatively it is possible to set the margin to \gamma = 1 and minimize the norm of \vec{w}, which alters the optimization problem to

\min_{\vec{w}, b} \frac{1}{2}\|\vec{w}\|^2 \quad \text{subject to} \quad y_i(\vec{w}^T \vec{x}_i + b) \geq 1, \quad \gamma = 1.   (B.9)

This quadratic programming problem can be solved by transformation to the Lagrangian function L(\vec{w}, b, \alpha) with the Lagrange multipliers \alpha_i \geq 0. Therefore the Lagrangian's primal form

L(\vec{w}, b, \alpha) = \frac{1}{2}\|\vec{w}\|^2 - \sum_{i=1}^{I} \alpha_i \left( y_i(\vec{w}^T \vec{x}_i + b) - 1 \right)   (B.10)

has to be maximized with respect to \alpha_i and minimized with respect to \vec{w} and b. Therefore the derivatives of L,

\frac{\partial}{\partial b} L(\vec{w}, b, \alpha) = 0, \qquad \frac{\partial}{\partial \vec{w}} L(\vec{w}, b, \alpha) = 0,   (B.11)

become zero, which consequently results in

\sum_{i=1}^{I} \alpha_i y_i = 0, \qquad \vec{w} = \sum_{i=1}^{I} \alpha_i y_i \vec{x}_i.   (B.12)

Figure B.1.: Optimal separation of two classes with a maximal margin.

The hyper plane can now be represented with the resulting support vectors, which are defined by the points located on the margin, i.e. those with Lagrange multipliers \alpha_i > 0. All other data points can be ignored, hence providing a memory efficient representation of the decision function. The dual representation of the optimization problem can be derived by substituting the constraints in eq. B.12 into the primal form in eq. B.10:

L_D(\alpha) = \sum_{i=1}^{I} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{I} \alpha_i \alpha_j y_i y_j (\vec{x}_i^T \vec{x}_j)   (B.13)

reveals the optimal maximal margin hyper plane, with the constraints

\alpha_i \geq 0 \quad \text{and} \quad \sum_{i=1}^{I} \alpha_i y_i = 0.   (B.14)

The optimization problem can now be solved by maximizing L_D, which results in

h(\vec{x}) = \operatorname{sgn}(\vec{w}^T \vec{x} + b) = \operatorname{sgn}\left( \sum_{i=1}^{I} \alpha_i y_i (\vec{x}_i^T \vec{x}) + b \right)   (B.15)

and can be used for the classification of unseen examples. In common real applications it is usually not possible to linearly separate two or more classes, which means that inevitably some samples are located on the wrong side of the separation plane. Therefore the so called kernel trick has been introduced [Chr04], which transforms the feature space \mathbb{R}^n into a new, usually higher dimensional, feature space \mathbb{R}^{n_e} with

\Theta : \mathbb{R}^n \rightarrow \mathbb{R}^{n_e}.   (B.16)

A so called kernel function K(x, y) with

K(x, y) = \Theta(x)^T \Theta(y)   (B.17)

is used to transform the initial input space \mathbb{R}^n into the feature space \mathbb{R}^{n_e}. The kernel function itself has to be symmetric, satisfy the Cauchy-Schwarz condition, and yield a positive semi-definite kernel matrix. In the past the following kernels have been employed:

• Polynomial kernels with K(x, y) = (x^T y + 1)^p, where p denotes the polynomial's degree

• Radial Basis Functions (RBF) with K(x, y) = e^{-\frac{\|x - y\|^2}{2\sigma^2}}, with the standard deviation \sigma of the Gaussian function

• Sigmoid kernel K(x, y) = \tanh(k \, x^T y + \theta), with the offset \theta and the amplification k.

There is no general method to determine the optimal kernel and parameter configuration for a classification problem. This can only be done empirically by evaluating different models and configurations. In order to separate feature vectors, the following decision function can consequently be used:

d_{\vec{w},b}(x) = \vec{w}^T \Theta(x) + b.   (B.18)

Up to now it is only possible to separate two classes, which is not sufficient for most classification problems. There are a few possibilities to expand the SVM to K > 2 classes:

• One against all: K SVM models are created with \Omega_1 = \Omega_k and \Omega_2 = \{\Omega \setminus \Omega_k\} with k = 1, \ldots, K. After performing all K decisions the one with the largest d_{\vec{w},b}(x) is chosen.

• One vs one: Hastie et al. presented a classification method based on pairwise coupling [Has98]. Hereby 0.5 K(K-1) binary decisions are performed and the final decision is made in a majority vote. Although the computational effort is higher than with other approaches, the classification performance seems to be outstanding.

Though other approaches are also given in the literature, these two have shown satisfying performance and have been used throughout this work.
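For orientation, the following sketch shows how such a multi-class SVM with an RBF kernel could be set up. It uses scikit-learn's SVC, which wraps the LIBSVM library [Cha01] referenced in this work, and toy data plus default kernel parameters in place of the actual feature sets and the empirically tuned configurations described above.

# Illustrative only: an RBF-kernel SVM trained on toy data with one-vs-one
# multi-class handling. Kernel parameters are placeholders and would have to
# be tuned empirically, as discussed in the text.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, scale=0.5, size=(20, 2)) for m in (0, 2, 4)])
y = np.repeat([0, 1, 2], 20)                      # three toy classes

clf = SVC(kernel="rbf", C=1.0, gamma="scale",     # K(x,y) = exp(-gamma*||x-y||^2)
          decision_function_shape="ovo")          # pairwise (one vs one) decisions
clf.fit(X, y)
print(clf.predict([[0.1, 0.2], [3.9, 4.1]]))      # expected: classes 0 and 2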

B.3. Neural Networks

Another static classification method is the artificial neural network (ANN) [Jor96], which is biologically inspired by nervous systems and is able to model arbitrary functions. Each network is constructed from so called neurons, which send out electric impulses via their axons. Each output, depending on the input and the applied activation function, is propagated to other neurons in the network architecture. As illustrated in fig. B.2, the neuron's inputs \vec{x} = x_1, \ldots, x_N are weighted by \vec{w} = w_1, \ldots, w_N in the first place.


Figure B.2.: Exemplary model of a neuron.

An offset b can further be added if required. All weighted elements are subsequently summed up with

x' = b + \sum_{i=1}^{N} w_i x_i,   (B.19)

and processed by a usually nonlinear neuron activation function T(x'). This function can be regarded as a threshold, as a high value activates the neuron, and hence a high edge steepness is commonly favored. Among the most commonly used transfer functions are thresholds, sigmoid, and tanh functions. Such neurons can now be combined into a complex network architecture, allowing a representation of more elaborate functions. Hereby the most popular topology seems to be the so called feed forward network or multi layer perceptron. These are generally split into three parts, where neurons are only connected with neurons in the following layers. Although a fully connected network topology is possible, it is not necessarily required to model arbitrary functions. The successive layers can be split into the following (a minimal sketch of the resulting forward pass is given after this list):

• Input layer: The first layer is built up with N input nodes for the elements \vec{x} = x_1, \ldots, x_N.

• Hidden layer: The input is subsequently propagated to one or more hidden layers and processed within these. Thereby the weight vectors of all perceptrons in each layer l are represented by a matrix W_l = \{\vec{W}_i\}, resulting in the output y_l = W_l \vec{x}_l + \vec{b}_l. Propagating the output y_l to the next layer l+1, the result can be computed with

y_{l+1} = W_{l+1}(W_l \vec{x}_l + \vec{b}_l) + \vec{b}_{l+1}.   (B.20)

• Output layer: Finally an output vector \vec{y} = y_1, \ldots, y_M is computed, which represents the result of the classification task.
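The forward pass through such a topology amounts to the repeated application of eqs. B.19 and B.20 followed by the activation function; the sketch below uses randomly initialized weights, a sigmoid activation, and an arbitrary 3-4-2 layout purely for illustration.

# Minimal feed-forward pass for a 3-4-2 multi layer perceptron (eqs. B.19/B.20).
# Weights and biases are random placeholders; a trained network would obtain
# them from the backpropagation procedure described in the text.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)   # input -> hidden
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)   # hidden -> output

x = np.array([0.5, -1.2, 0.3])          # input vector x = x_1, ..., x_N
h = sigmoid(W1 @ x + b1)                # weighted sum plus activation (eq. B.19)
y = sigmoid(W2 @ h + b2)                # output vector y = y_1, ..., y_M (eq. B.20)
print(y)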

In order to classify input vectors, it is required to decode the output \vec{y}(\vec{x}, W). Given K possible classes \Omega_\kappa, each is represented by a binary combination of the K outputs y_i. Obviously the weights have to be adjusted during the learning phase for the classification task. This is commonly done by the so called backpropagation algorithm [Rum86], which will be explained with a single neuron for reasons of simplicity.

Figure B.3.: Exemplary topology of a feed forward network.

As known from other data driven approaches, a large training set \Lambda with training examples \vec{x}_i and the corresponding desired output \vec{y} is required. An example \vec{x} is presented to a randomly initialized network and the output \vec{y}\,' is determined. This is subsequently compared to the desired output \vec{y} using the quadratic error as metric:

\varepsilon(\vec{x}_i, W) = \frac{1}{2}(y_i - y_i')^2.   (B.21)

The idea of the backpropagation algorithm is now to propagate the error signal \varepsilon back to all neurons whose outputs were used as input signals for the neuron. Their weights are updated according to the computed error,

w_i(k+1) = w_i(k) - \beta \left( \frac{\partial \varepsilon(\vec{x}, W)}{\partial w_i} \right),   (B.22)

in order to minimize the error with a pre-defined step width \beta and the gradient \frac{\partial \varepsilon(\vec{x}, W)}{\partial w_i}. Using eq. B.21 for the quadratic error, the gradient is computed with

\frac{\partial \varepsilon(\vec{x}, W)}{\partial w_i} = -(y - y')\, x_i.   (B.23)

This procedure is repeated until a defined number of iterations is reached or the error no longer changes.
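For a single neuron with an identity activation, the update rule of eqs. B.21 to B.23 can be written out directly; the toy regression target, learning rate, and iteration count in the sketch below are arbitrary choices for illustration.

# Gradient descent for a single linear neuron (eqs. B.21-B.23).
# The toy task fits y = 2*x1 - x2 + 0.5; beta and the iteration count are
# arbitrary illustration values.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5

w, b, beta = np.zeros(2), 0.0, 0.1
for _ in range(200):
    for xi, yi in zip(X, y):
        y_hat = w @ xi + b                 # forward pass (eq. B.19)
        grad = -(yi - y_hat)               # gradient of 0.5*(y - y')^2, eq. B.23
        w -= beta * grad * xi              # weight update, eq. B.22
        b -= beta * grad
print(w, b)                                # approaches [2, -1] and 0.5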

B.4. Hidden Markov Models

In contrast to the previously described approaches, HMMs, as described by Rabiner for speech recognition [Rab89], allow a probabilistic model of time series with variable length. This characteristic seems suitable for the behavior recognition task, as most activities usually vary in duration. HMMs are based on Markov Models (MM), which are commonly used to model stochastic processes that only depend on the previous state. A simple MM, such as illustrated in fig. B.4, usually consists of N states S_1, \ldots, S_N.


Figure B.4.: Exemplary Markov model with three states and five transitions.

While time t proceeds, transitions between states or self transitions can be observed, where the transition probabilities can be computed as

a_{ij} = P(\alpha_{t+1} = S_j \mid \alpha_t = S_i),   (B.24)

with i = 1, \ldots, N and j = 1, \ldots, N. The following conditions have to be respected:

a_{ij} \geq 0 \quad \text{and} \quad \sum_{j=1}^{N} a_{ij} = 1.   (B.25)

The current state S_t of the model is observable at any time. The classic MM can now be extended further by assigning a probability of the observation x_t to each state. The resulting sequence x = x_0, \ldots, x_{T-1}, with a length of T observations, corresponds to a sequence that has to be classified in the application process. Now that every state is able to emit so called observations with a probability of

b_i(x_t) = P(x_t \mid \alpha_t = S_i),   (B.26)

the transitions between states cannot be followed anymore and hence are regarded as hidden. The observation probabilities are commonly modeled by approximating the original PDF with a Gaussian function

\nu(\vec{x}, \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{\frac{n}{2}} |\Sigma_i|^{\frac{1}{2}}} \, e^{-\frac{1}{2}(\vec{x}-\mu_i)^T \Sigma_i^{-1} (\vec{x}-\mu_i)},   (B.27)

with its mean \mu_i, the covariance matrix \Sigma_i, and \vec{x} \in \mathbb{R}^n. As a single Gaussian is not able to model multiple deviations, it is reasonable to represent these by a weighted sum of K Gaussians,

b_i(\vec{x}) = \sum_{k=1}^{K} c_{i,k} \, \nu(\vec{x}, \mu_{i,k}, \Sigma_{i,k}),   (B.28)

with the weighting factors c_{i,k}, where k = 1, \ldots, K and \sum_{k=1}^{K} c_{i,k} = 1. Given the high amount of parameters for the representation of observations, a large set of training material \Lambda is required to determine these. The transition probabilities a_{ij} are now formulated as a matrix A = \{a_{ij}\}. In case A is fully equipped with entries a_{ij} > 0, it is commonly regarded as an ergodic HMM, meaning that transitions from one state into any other state are possible.


Figure B.5.: Exemplary left-right HMM with three states and additional start and stop state.

A variety of applications, such as speech recognition, in contrast favors topologies where probabilities a_{ij} = 0 are allowed. The most popular among these is the left-right model topology, allowing only transitions from S_i to S_j with j \geq i. Thereby all elements below the diagonal of the matrix are set to zero. Further, the number of states is extended to N+2, introducing a start state S_0 and an end state S_{N+1}. An HMM, as shown in fig. B.5, is hence fully described by the parameters A and b_i with

\lambda = \{A, b_1, \ldots, b_N\}.   (B.29)

Such an HMM now represents one class \Omega_\kappa \in \Omega, which leads to the requirement of M HMMs \lambda_\kappa for all M possible classes. All trained HMMs are concurring during the recognition phase, and the class \kappa_e with the highest probability

\kappa_e = \underset{\kappa}{\operatorname{argmax}} \; p(x \mid \lambda_\kappa) \, P(\Omega_\kappa)   (B.30)

is chosen for an observed sequence x. The observation probability of a sequence x with T observations is now computed for each HMM \lambda_\kappa with respect to p(x \mid \lambda_\kappa). This could be done by computing all possible paths through each model, which is basically possible though computationally far too expensive, as about 2T \cdot N^T operations are required. Therefore the forward algorithm is utilized to lower the computational effort to roughly T \cdot N^2 operations. This is achieved by introducing a so called forward probability q_t(j), which provides the probability of reaching state S_j at time t and correctly producing the partial feature sequence x_0, \ldots, x_t. Thereby the forward probabilities are initialized with

q_{t=0}(j) = a_{0,j} \, b_j(x_0).   (B.31)

With a recursive computation step the remaining probabilities

q_{t+1}(j) = \left( \sum_{i=1}^{N} q_t(i) \, a_{ij} \right) b_j(x_{t+1})   (B.32)

can be determined with 0 \leq t \leq T-2 and 1 \leq j \leq N. In order to compute the observation probability of a sequence for an HMM \lambda_\kappa, the sum over all partial probabilities is computed with

p(x \mid \lambda_\kappa) = \sum_{i=1}^{N} q_{T-1}(i).   (B.33)
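A compact implementation of the forward recursion in eqs. B.31 to B.33 is sketched below; it assumes a discrete observation alphabet with a simple lookup table b_j(x) and toy parameters instead of the Gaussian mixture emissions of eq. B.28.

# Forward algorithm (eqs. B.31-B.33) for a discrete-observation HMM.
# a0, A and B are toy parameters; real models would use the GMM emissions
# of eq. B.28 instead of a lookup table.
import numpy as np

a0 = np.array([0.6, 0.4])                    # entry probabilities a_{0,j}
A  = np.array([[0.7, 0.3],                   # transition matrix a_{ij}
               [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1],              # b_j(x) for 3 discrete symbols
               [0.1, 0.3, 0.6]])

def forward(x):
    """Return p(x | lambda) via the forward probabilities q_t(j)."""
    q = a0 * B[:, x[0]]                      # initialization, eq. B.31
    for obs in x[1:]:
        q = (q @ A) * B[:, obs]              # recursion, eq. B.32
    return q.sum()                           # termination, eq. B.33

print(forward([0, 1, 2, 2]))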

An alternative decoding algorithm is the Viterbi algorithm, which, in contrast to the forward algorithm, only determines the most probable path through an HMM. Therefore it is required to store the order \Psi_t(j) of states that are passed through on the optimal path, and the probability \delta_t(j) of the most likely path. The Viterbi algorithm is first initialized with

\delta_1(j) = a_{0,j} \, b_j(x_1) \quad \text{and} \quad \Psi_1(j) = 0,   (B.34)

where 1 \leq j \leq N. The values for the other time steps can subsequently be determined in a recursive manner:

\delta_t(j) = \max_{1 \leq i \leq N} \left( \delta_{t-1}(i) \, a_{ij} \right) b_j(x_t), \qquad \Psi_t(j) = \underset{1 \leq i \leq N}{\operatorname{argmax}} \left( \delta_{t-1}(i) \, a_{ij} \right), \quad \text{with } 1 \leq t \leq T-1, \; 1 \leq j \leq N.   (B.35)

Subsequently the probability of the best path can be determined as

p^* = \max_{1 \leq i \leq N} \left( \delta_T(i) \right), \qquad r_T^* = \underset{1 \leq i \leq N}{\operatorname{argmax}} \left( \delta_T(i) \right).   (B.36)

Further, in case the order of the states is required, backtracking is performed with

r_t^* = \Psi_{t+1}(r_{t+1}^*), \quad \text{where } t = T-2, T-3, \ldots, 0.   (B.37)
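The Viterbi recursion and the backtracking step of eqs. B.34 to B.37 translate into a similarly short routine; the toy parameters match those of the forward-algorithm sketch above, and the time index starts at 0 as in eq. B.31.

# Viterbi decoding (eqs. B.34-B.37): most probable state sequence for a
# discrete-observation HMM with toy parameters a0, A, B.
import numpy as np

def viterbi(x, a0, A, B):
    T, N = len(x), len(a0)
    delta = np.zeros((T, N))                 # probability of the best partial path
    psi = np.zeros((T, N), dtype=int)        # predecessor bookkeeping
    delta[0] = a0 * B[:, x[0]]               # initialization, eq. B.34
    for t in range(1, T):
        trans = delta[t - 1][:, None] * A    # delta_{t-1}(i) * a_{ij}
        psi[t] = trans.argmax(axis=0)        # eq. B.35, argmax over i
        delta[t] = trans.max(axis=0) * B[:, x[t]]
    path = [int(delta[-1].argmax())]         # best final state, eq. B.36
    for t in range(T - 1, 0, -1):            # backtracking, eq. B.37
        path.append(int(psi[t][path[-1]]))
    return delta[-1].max(), path[::-1]

print(viterbi([0, 1, 2, 2],
              np.array([0.6, 0.4]),
              np.array([[0.7, 0.3], [0.4, 0.6]]),
              np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])))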

Prior to the application of HMMs it is required to generate models for each class, which is basically the task of finding a set of parameters that maximizes the probability p(x \mid \lambda) of generating a sequence x:

\lambda^* = \underset{\lambda}{\operatorname{argmax}} \; p(x \mid \lambda).   (B.38)

In order to determine the necessary parameters, commonly the EM algorithm is used as an iterative approximation procedure. In the expectation step the expected value is determined with the current parameter set, while in the maximization step the estimated parameters are optimized. Similar to the forward probabilities, a so called backward probability

s_T(i) = 1, \quad 1 \leq i \leq N, \qquad s_t(i) = \sum_{j=1}^{N} a_{ij} \, b_j(x_{t+1}) \, s_{t+1}(j), \quad \text{with } t = T-2, T-3, \ldots, 0; \; 1 \leq i \leq N,   (B.39)

is estimated. Given an observation sequence, it is now possible to determine the probability \gamma of a state S_i at time t:

\gamma_t(i) = \frac{q_t(i) \, s_t(i)}{p(x \mid \lambda)}.   (B.40)


Figure B.6.: Exemplary Trellis diagram for a HMM with three states and the optimum path.

Further, the transition probability \zeta_t(i,j) from state S_i to S_j at time t is determined with

\zeta_t(i,j) = \frac{q_t(i) \, a_{ij} \, b_j(x_{t+1}) \, s_{t+1}(j)}{p(x \mid \lambda)}.   (B.41)

The HMM's parameters can now be re-estimated by

a_{0j} = \gamma_0(j),   (B.42)

a_{ij} = \frac{\sum_{t=1}^{T-1} \zeta_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}, \quad \text{and}   (B.43)

b_j(k) = \frac{\sum_{t=0,\, x_t = k}^{T-1} \gamma_t(j)}{\sum_{t=0}^{T-1} \gamma_t(j)}.   (B.44)

As the result of this iterative procedure is only an asymptotic approximation of the HMM's parameters, an abort criterion is required during the training process. This is usually either a predefined number of iterations or reaching a local minimum of the parameter changes.
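A single re-estimation step following eqs. B.40 to B.44 is sketched below for the discrete toy model used above; the forward and backward passes are computed explicitly, and the emission update uses the discrete formulation of eq. B.44 instead of Gaussian mixtures.

# One Baum-Welch iteration (eqs. B.40-B.44) for a discrete-observation HMM.
# q and s are the forward and backward probability matrices; all parameters
# are toy values for illustration only.
import numpy as np

x  = [0, 1, 2, 2, 1]
a0 = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
T, N, K = len(x), len(a0), B.shape[1]

q = np.zeros((T, N)); s = np.ones((T, N))
q[0] = a0 * B[:, x[0]]
for t in range(1, T):                        # forward pass, eq. B.32
    q[t] = (q[t - 1] @ A) * B[:, x[t]]
for t in range(T - 2, -1, -1):               # backward pass, eq. B.39
    s[t] = A @ (B[:, x[t + 1]] * s[t + 1])
p = q[-1].sum()                              # p(x | lambda), eq. B.33

gamma = q * s / p                            # state occupancy, eq. B.40
zeta = np.array([np.outer(q[t], B[:, x[t + 1]] * s[t + 1]) * A / p
                 for t in range(T - 1)])     # transition posterior, eq. B.41

a0_new = gamma[0]                                                   # cf. eq. B.42
A_new = zeta.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]          # cf. eq. B.43
B_new = np.array([[gamma[np.array(x) == k, j].sum() for k in range(K)]
                  for j in range(N)]) / gamma.sum(axis=0)[:, None]  # cf. eq. B.44
print(A_new, B_new, sep="\n")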


Appendix C. Smart Sensors

C.1. Sensors

C.1.1. CCD Sensors

The most common sensors in surveillance tasks are probably video cameras, where a wide range of sensor types can be integrated, while the optic characteristics are more or less the same. Thereby Charged Coupled Device (CCD) sensors can be considered as the standard. These consist of a matrix of light sensitive photo diodes, which are referred to as pixels. Transparent electrodes are located above a doped semiconductor, while electrons are located below. The light collected by the lens emits its energy to the semiconductor, which creates electrons and holes. The electrons are stored in the cell just as in a condenser, and the stored charge is proportional to the amount of energy absorbed from the light. The charges are subsequently shifted by one position at a time, until all packets are processed. To utilize these sensors in photo and video cameras, color filters are located in an alternating pattern above the elements of the CCD sensor. The color information of an image pixel is hence computed from multiple neighboring sensor elements. Most sensors, which use the so called Bayer pattern¹, are equipped with 50% green and 25% red and blue filters above the sensor elements. The resolution of a camera now depends on the number of cells located on the sensor chip, and influences the level of visible detail. As high resolutions create large amounts of data, it is inevitable to choose the resolution of the used camera according to the scenario. For man machine interaction tasks a low resolution camera, for instance a webcam with 640 × 480 pixels, is considered sufficient due to the proximity. Tasks that require an overview of a scene usually need a by far higher resolution. The PETS2007 data has been recorded with PAL resolution, while the PROMETHEUS database has been recorded with 1024 × 768 pixels. This has been necessary due to the larger distance between objects and camera.

C.1.2. Photonic Mixture Devices

Standard video cameras only provide a projection of the 3D world onto 2D data. Thereby valuable depth information is inevitably lost and can hardly be reconstructed.

¹ The Bayer pattern determines the order of RGB filters above the cells.



Figure C.1.: a)+b) Exemplary view of a CCD and PMD sensor in an Airbus A340 mock-up. c)+d) Exemplary view of a CCD and thermal sensor recorded for the PROMETHEUS project.

To overcome this problem, novel so called PMD sensors, which are based on the time of flight principle, have been developed. In its most simple form a ray of light is emitted by a light source and reflected by a surface, and the target distance is measured by determining the turn-around time from sender to receiver. A PMD camera, in contrast, illuminates the entire scene with modulated infrared light. The scene can now be observed by an intelligent pixel array, where each pixel measures the turn-around time of the modulated light. These intelligent pixels are commonly realized in CMOS technology and capture the reflected illumination. Several so called smart pixels are combined to create the 3D surface reconstruction of the scene. To calculate the distance between objects in the scene and the camera, the autocorrelation function of the optical and electrical signal is computed with four samples N_1, \ldots, N_4 that are each shifted by 90°. Hence the phase \varphi, which is proportional to the distance, can be computed with

\varphi = \arctan\left( \frac{N_1 - N_3}{N_2 - N_4} \right).   (C.1)

Further, the strength of the signal

a = \frac{\sqrt{(N_1 - N_3)^2 + (N_2 - N_4)^2}}{2}   (C.2)

and the offset of the samples

b = \frac{N_1 + N_2 + N_3 + N_4}{4},   (C.3)

which represents the gray value of each pixel, can be computed. The distance d between target and camera now depends on the modulation frequency f_{mod} and the used wavelength \lambda_{mod}. As the light has to cover the distance between sender and receiver twice, the maximum measurable distance is \lambda_{mod}/2. The distance can now be computed with

d = \frac{c \, \varphi}{4\pi f_{mod}}.   (C.4)

An exemplary depth image recorded in an Airbus A340 mock-up is illustrated in fig. C.1, where the distance has been encoded by the RGB values. As seen red pixels are located near to the sensor, while blue pixels are further away.
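The per-pixel evaluation of eqs. C.1 to C.4 is straightforward; in the sketch below the four samples are synthetic values, the modulation frequency of 20 MHz is only an assumed example, and arctan2 is used instead of arctan to keep the correct quadrant.

# Phase, amplitude, offset and distance of a single PMD pixel (eqs. C.1-C.4).
# N1..N4 are synthetic correlation samples; f_mod = 20 MHz is an assumption.
import numpy as np

N1, N2, N3, N4 = 1260.0, 1410.0, 980.0, 830.0   # samples shifted by 90 degrees
f_mod = 20e6                                    # assumed modulation frequency in Hz
c = 299792458.0                                 # speed of light in m/s

phi = np.arctan2(N1 - N3, N2 - N4)              # eq. C.1 (arctan2 keeps the quadrant)
a = np.hypot(N1 - N3, N2 - N4) / 2.0            # eq. C.2, signal strength
b = (N1 + N2 + N3 + N4) / 4.0                   # eq. C.3, gray value offset
d = c * phi / (4.0 * np.pi * f_mod)             # eq. C.4, distance in meters

print(f"phase {phi:.3f} rad, amplitude {a:.1f}, offset {b:.1f}, distance {d:.2f} m")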



Figure C.2.: Illustration of the size of a MiniPC used within the SAFEE project

C.1.3. Infrared Thermography

Each object with a temperature above absolute zero emits thermal energy according to the black-body radiation law. The amount of radiation emitted by an object increases with its temperature, which can be measured by determining the intensity of the emitted radiation. Thermographic cameras are able to detect radiation in the infrared range of the electromagnetic spectrum (900 - 14 000 nm). Common visual sensors are usually not sensitive within this part of the spectrum, therefore new sensor types are required. One possible solution is a quantum well infrared photodetector, which consists of semiconductor materials with multiple quantum wells. Current affordable cameras have a resolution of 320 × 240 pixels, which is basically sufficient for the object detection task. As the emitted radiation does not depend on external illumination, it can be visualized even in situations where humans cannot see anything. Fig. C.1c) and d) show an image recorded with a standard CCD camera and a thermal infrared image. The persons' limbs and heads are obviously warmer than the background of the scene, resulting in an easy foreground segmentation task. Clothing unfortunately has a very similar temperature as the background, resulting in a more difficult segmentation.

C.2. Processing Units

C.2.1. Mini PCs

A smart sensor can basically be described as a sensor combined with an additional processing unit. Thereby the most primitive case would be a sensor that is hooked up to an off the shelf computer. The connection is usually established either via USB, IEEE 1394, or network². In a real world application it is unfortunately quite difficult to install a large amount of off the shelf computers, which are approximately the size of standard desktop PCs. Within the SAFEE project it would have been impossible to integrate all required computers in a demonstration platform. Therefore a specialized solution had to be found. Due to the large cost of the preferred solution, the use of DSP boards (see sec. C.2.2), an alternative has been required.

² The connection type has to be chosen carefully, as the length of cables is usually limited.



Figure C.3.: The utilized DSP board and its architecture.

Both limiting factors, here space and energy consumption, finally led to the use of hardware initially designed for mobile applications, in particular notebooks. While being almost as powerful as desktop hardware, the energy consumption and produced heat of Intel's mobile Core2Duo architecture are remarkably lower. This allows the integration of such notebook components into so called mini PCs, as they are already used for Apple's MacMini or Aopen's MiniPC³, see fig. C.2. With dimensions of 16 cm × 16 cm × 5 cm this has been the preferred platform for the SAFEE demonstrator. Although the size has been very limited, a Core2Duo with 2.4 GHz, 2 GB of RAM, and a 160 GB hard drive have been installed, providing a wide range of ports. All algorithms regarding behavior detection, face tracking, and recognition were operating on such a system with openSUSE 10.3 as operating system.

C.2.2. Digital Signal Processor Boards

Various application scenarios, such as the observation of aircrafts, have special demands on the utilized hardware and do not allow the use of standard PCs. In order to minimize the required space, weight, and energy consumption, an alternative had to be found. Thereby highly integrated DSP boards came up as the favored solution. These enable a compact hardware configuration and avoid the use of mechanical components, such as hard drives or optical drives. Further, a wide range of algorithms relevant for video processing applications have already been optimized for parallel computation, resulting in a very efficient use of the limited resources. The used DSP board has been equipped with a 720 MHz C64xx CPU by Texas Instruments. Utilizing a VLIW architecture, a total of eleven operation pipelines can be constructed from two independent data paths. Each path consists of 32 registers and four computation units, here an ALU, a bit-shifter, a multiplication and an addition/subtraction unit. A total of 32 Mb of RAM is connected with the CPU via a 256-bit bus. Besides the powerful processing unit, three video and one audio input are available, where specialized MPEG encoders for video are already integrated on the DSP. An optimized version of the condensation algorithm has been implemented and ported to the DSP architecture. While tracking could be performed with 8 frames per second, the initialization step took about 5 s.

³ Although there are industrial solutions with smaller form factors, consumer hardware has been chosen due to lower cost.


Acronyms

AAM          Active Appearance Models.
ABC          Airplane Behavior Corpus.
ACRS         Access Control and Recognition System.
AMI          Augmented Multi-party Interaction.
AMIDA        Augmented Multi-party Interaction - Distance Access.
BN           Bayesian Network.
BT           Blob Tracking.
BTF          Brightness Transfer Function.
CCD          Charged Coupled Device.
CCT          Correlated Color Temperature.
CCTV         Closed Circuit Television.
CPU          Central Processing Unit.
DBN          Dynamic Bayesian Networks.
DCT          Discrete Cosine Transformation.
DFG          Deformable Feature Graph.
DLT          Discrete Linear Transformation.
DSP          Digital Signal Processor.
DTW          Dynamic Time Warping.
EBGM         Elastic Bunch Graph Matching.
EM           Expectation Maximization.
FACS         Facial Action Coding System.
FAU          Facial Action Unit.
FDP          Facial Definition Parameters.
FL           Fuzzy Logic.
FOV          Field of View.
GM           Graphical Model.
GMM          Gaussian Mixture Model.
HMM          Hidden Markov Models.
HSV          Hue Saturation Value color space.
ID           Identity.
LLA          Low Level Activities.
LLF          Low Level Features.
MHT          Multiple Hypotheses Tracking.
MLP          Multi Layer Perceptron.
MM           Markov Models.
NIR          Near Infrared.
NN           Neural Network.
NTP          Network Time Protocol.
OTDS         On-board Threat Detection System.
P2D-HMM      Pseudo-2D Hidden Markov Model.
PCA          Principle Component Analysis.
PDI          Pre-determined Indicators.
PETS         Performance Evaluation of Tracking and Surveillance.
PFIND        Pedestrian Finder Database.
PMD          Photonic Mixture Device.
PROMETHEUS   Prediction and inteRpretatiOn of huMan bEhaviour based on probabilisTic structures and HeterogEneoUs sensorS.
QAP          Quadratic Assignment Problem.
RFID         Radio Frequency ID.
RGB          Red Green Blue color space.
rgb          normalized RGB color space.
RR           Recognition Rate.
SAFEE        Security of Aircraft in the Future European Environment.
SBDS         Suspicious Behavior Detection System.
SCV          Stratified Cross Validation.
SIFT         Scale Invariant Feature Transform.
SME          Subject Matter Experts.
SVM          Support Vector Machine.

List of Symbols ∆PIM

ij ∆PIM ∆PTijR ∆m ~ ∆~σ ∆ν(x, y, σ) Ψj (~x) Σi Ξ α χ i κi λ µi ν ωi φj π ψi (x) ρi σi ~σ τ θBG θlow θσ θup υ

Matrix containing all distances of feature positions in the image graph. Matrix containing all distances of feature positions in the tracking graph. Distance of features i and j in the image graph. Distance of features i and j in the tracking graph. Difference of Center of motion in x and y direction. Difference of Variance in x and y direction. Difference of Gaussians. Gabor Jet. Covariance. Matrix containing angles in a feature graph. Learning rate / Update time. Frequency of Gabor Wavelet. Error. Distortion parameter. Importance of jets. Mean. Gaussian Probability Density Function. Weight. Phase. A plane. A basis function. Lowpass Filter. Variance. Variance of the center of motion. Ratio between two eigenvalues. Minimum amount of data for BG model. Lower Bound for Thresholding. Maximum allowed deviation. Upper Bound for Thresholding. Orientation of wavelets/gradients.

A{i, j} aj

List of matched features. Amplitude.

∆PTR

187

List of Symbols ~a

Assignment Vector.

B(x, y, t) BHSV (x, y, t) BG bdist (x, y, t) bs bx by

Background Image. Background Image in HSV space. Bunch Graph. Brightness Distortion. Block Size. Offset in X. Offset in Y.

Ci ~A C

Field of view number i. Helper Variable grouping multiple assignments from OT R → ORef . Candidate Vector containing all descriptors DRef ,i matched in OT R . Candidate Vector containing all descriptors DTR,i matched in ORef . Chromacity Distortion. Center of image in computer coordinates. Cost Function.

~ Ref C ~T R C CD(x, y, t) (Cx , Cy ) c(~a) DIM,j DTR,i Di F G(x, y, t) DI(x, y, t) D(x, y, t) d~

188

dr

Descriptor of the Image Feature Graph. Descriptor of the Tracking Feature Graph. SIFT descriptor. Foreground Image. Depth Image. Difference Image. Distance / Displacement vector. Detection rate.

E(x, y, t)

Expected Image.

FPPW Ft+1,t f f (x)

False Positives Per Window. Transition Matrix of the Kalman filter. Focal Length. An arbitrary function f(x).

GF (x, y) GI(x, y, t) Gx (x, y, t) Gy (x, y, t) Gz (x, y, t)

Ground Floor Layer. Gradient Image. Gradient in X. Gradient in Y. Depth Gradient.

List of Symbols g

Line.

H Hi,π Ht h(x) hi (x)

Homography Matrix. Transformation matrix of a point in view Ci to the ground plane. Measurement Matrix of the Kalman filter. Strong classifier. Weak classifier.

IHSV (x, y, t) I(x, y, t) II(x, y) I(x, y) IG i(x, y, t) i

Frame of a video sequence in HSV space. Frame of a video sequence. An integral image. A simple image. Image Graph. Intensity of changes. Counting Variable.

Jj (~x)

Gabor Jet.

K ~kj

Number of states/mixtures. Shape of a plane wave.

L L(x, y, σ) L

Training Data Set. Laplacian Scale Space. Light.

M M (x, y, t) m ~ = [mx , my ]

Perspective transformation matrix in homogeneous coordinates. Object Mask. Center of motion.

Nf x Ncx n

Number of sampled elements in a row. Number of sensor elements in a row. Number of elements.

OIM ORef OT R Oi ORi (x, y)

Image Feature Graph. Reference Feature Graph. Tracking Feature Graph. Object. Object region.

P

Perspective Projection Matrix in homogenous coordinates.

189

List of Symbols p(Xt ) Probability of observing Xt . p~ = (xu , yu ) Point in undistorted image coordinates. p~ = (x, y) Point in image coordinates. ~ P = (xw , yw , zz ) Point in world coordinates. p~ = (x, y, z) Point in computer coordinates.

190

R Rs (x, y) Ri (x, y)

Rotation Matrix in homogenous coordinates. Sum of all transformations into the ground plane. Binary representation of transformations of all objects in view i into reference view.

S SP (x, y, t) Sa (J, J 0 ) Sφ (J, J 0 ) SM (x, y, t) s

Scale Matrix in homogenous coordinates. A shaded point. Amplitude. Amplitude. Skin Mask. Scale Factor.

T tmp(x, y) t

Translation vector in homogenous coordinates. A temporary image. Time.

ui,j

Foreground object i in view j.

V Vk,t

Matrix with constraints to be met for graph matching. Matched Model Variable.

Wm w

Function Space with dimension m. Wavelet Factor.

Xi (xf , yf ) (xd , yd ) (xu , yu )

An Observation in a process. Computer coordinates. Distorted coordinates. Undistorted coordinates.

Bibliography [AA71]

Abdel-Aziz, Y.I., Karara, H.M. Direct Linear Transformation Into Object Space Coordinates in Close-Range Photogrammetry. In Proceedings Symposium on Close Range Photogrammetry, University of Urbana Champaign, Urbana, Il, USA, 1–18, 1971. 99

[Abd06]

Abdelkader, M. F., Chellappa, R., Zheng, Q., Chan, A. L. Integrated Motion Detection and Tracking for Visual Surveillance. In Proceedings of the Fourth IEEE International Conference on Computer Vision Systems, ICVS’06, Washington, DC, USA, 2006. 8

[Agg99]

Aggarwal, J. K., Cai, Q. Human Motion Analysis: A Review. Journal on Computer Vision and Image Understanding, 73(3): 428–440, 1999. 91

[Ahl08]

Ahlberg, J., Arsi´ c, D., Ganchev, T., Linderhed, A., Menezes, P., Ntalampiras, S., Olma, T., Potamitis, I., Ros, J. Prometheus: Prediction and interpretation of human behavior based on probabilistic structures and heterogeneous sensors. In Proceedings 18th ECCAI European Conference on Artificial Intelligence, ECAI 2008, Patras, Greece, 2008. 130

[Al 06]

Al Hames, M., Hain, T., Cernocky, J., Schreiber, S., Poel, M., Mueller, R., Marcel, S., van Leeuwen, D., Odobez, J.-M., Ba5, S., Bourlard, H., Cardinaux, F., Gatica-Perez, D., Janin, A., Motlicek, P., Reiter, S., Renals, S., van Rest, J., Rienks, Rutger, Rigoll, G., Smith, K., Thean, A., Zemcik, P. Audio-Visual Processing in Meetings: Seven Questions and Current AMI Answers. In Proceedings MLMI, 2006. 150

[Al 07]

Al Hames, M., H¨ ornler, B., Mu ¨ ller, R., Schenk, J., Rigoll, G. Automatic Multi-Modal Meeting Camera Selection for Video-Conferences and Meeting Browsing. In Proceedings 8th International Conference on Multimedia and Expo, ICME 2007, Beijing, China, 2074–2077, 2007. 151, 152

[Alg06]

Alghassi, H., Tafazoli, S., P. Lawrence, P. The Audio Surveillance Eye. In Proceedings IEEE International Conference on Automated Video and Signal Based Surveillance,AVSS’06, November 2006. 2

[Ars04]

Arsi´ c, D. Definition und Entwicklung eines Systems zur automatischen Detektion von Verhaltensmustern. Diploma Thesis at the Institute for Human Machine Communication, Technische Universit¨at M¨ unchen, Germany, October 2004. 64

[Ars05a] Arsi´ c, D., Wallhoff, F., Schuller, B., Rigoll, G. Bayesian Network Based Multi Stream Fusion for Automated Online Video Surveillance. In Proceedings EUROCON 2005, IEEE, Belgrade, Serbia, 995–998, November 2005. 68 [Ars05b] Arsi´ c, D., Wallhoff, F., Schuller, B., Rigoll, G. Video Based Online Behavior Detection Using Probabilistic Multi-Stream Fusion. In Proceedings IEEE International Conference on Image Processing ICIP2005, Genoa, Italy, 606–609, September 2005. 63, 137

191

Bibliography [Ars05c] Arsi´ c, D., Wallhoff, F., Schuller, B., Rigoll, G. Vision-Based Online MultiStream Behavior Detection Applying Bayesian Networks. In Proceedings 6th International Conference on Multimedia and Expo ICME 2005, Amsterdam, The Netherlands, 1354–1357, July 2005. 143 [Ars06]

Arsi´ c, D., Schenk, J., Schuller, B., Rigoll, G. Submotions for Hidden Markov Model Based Dynamic Facial Action Recognition. In Proceedings IEEE International Conference on Image Processing,ICIP2006, Atlanta, GA, USA, 673–676, October 2006. 53, 67, 133, 136

[Ars07a] Arsi´ c, D., Hofmann, M., Schuller, B., Rigoll, G. Multi-Camera Person Tracking and Left Luggage Detection Applying Homographic Transformation. In Proceedings Tenth IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, PETS 2007, Rio de Janeiro, Brazil, October 2007. 109, 112, 121, 153, 156, 157 [Ars07b] Arsi´ c, D., Schuller, B., Rigoll, G. Suspicious Behavior Detection in Public Transport by Fusion of Low-Level Video Descriptors. In Proceedings 8th International Conference on Multimedia and Expo ICME 2007, Beijing, China, 20018–20021, June 2007. 140, 151 [Ars08a] Arsi´ c, D., Ho ¨rnler, B., Rigoll, G. Automated Video Editing for Meeting Scenarios Applying Multimodal Low Level Feature Fusion. In Proceedings 5th Joint Workshop on Machine Learning and Multimodal Interaction, MLMI 08, Utrecht, The Netherlands, 2008. 152 [Ars08b] Arsi´ c, D., Lehment, N., Hristov, E., H¨ ornler, B., Schuller, B., Rigoll, G. Applying Multi Layer Homography for Multi Camera Tracking. In Proceeedings Second ACM/IEEE International Conference on Distributed Smart Cameras, ICDSC2008, Stanford, CA, USA, September 2008. 81, 87, 118 [Ars08c] Arsi´ c, D., Schuller, B., Rigoll, G. Multiple Camera Person Tracking in multiple layers combining 2D and 3D information. In Proceedings Workshop on Multi-camera and Multi-modal Sensor Fusion Algorithms and Applications (M2SFA2), 2008, Marseille, France, October 2008. 122, 126, 127 [Ars09a] Arsi´ c, D., H¨ ornler, B., Schuller, B., Rigoll, G. A Hierarchical Approach for Visual Suspicious Behavior Detection in Aircrafts. In Proceedings 16th IEEE International Conference on Digital Signal Processing, Special Session “Biometric recognition and verification of persons and their activities for video surveillance” , DSP2009, Santorini, Greece, July 2009. 141 [Ars09b] Arsi´ c, D., H¨ ornler, B., Schuller, B., Rigoll, G. Resolving Partial Occlusions in Crowded Environments Utilizing Range Data and Video Cameras. In Proceedings 16th IEEE International Conference on Digital Signal Processing, Special Session “Fusion of Heterogeneous Data for Robust Estimation and Classification” , DSP2009, Santorini, Greece, July 2009. 22, 25 [Auv06]

192

Auvinet, E., Grossmann, E., Rougier, C., Dahmane, M., Meunier, J. Left Luggage Detection Using Homographies and Simple Heuristics. In Proceedings of the ninth IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, PETS2006, New York, NY, USA, October 2006. 153, 155, 157

Bibliography [Bar05]

Bartlett, M. S., Littlewort, G., Frank, M., Lainscsek, L., Fasel, I., Movellan, J. Recognizing Facial Expression: Machine Learning and Application to Spontaneous Behavior. In Proceedings IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2005, volume 2, 568–573, June 2005. 38

[Bau72]

Baum, L. E. An Inequality And Associated Maximalization Technique In Statistical Estimation For Probabilistic Function Of Markov Processes. In Inequalities, volume 3, 1–8, 1972. 134

[Ber04]

Berg, A.C., Berg, T.L., Malik, J. Shape Matching and Object Recognition using Low Distortion Correspondences. Technical report, EECS Department, University of California, Berkeley, December 2004. 83

[Bes07]

Beszedes, M., Culverhouse, P. Facial Emotions and Emotion Intensity Levels Classification and Classification Evaluation. In Proceedings of British Machine Vision Conference, BMVC2007, 2007. 136

[Bev08]

Bevilacqua, V., Cariello, L., Carro, G., Daleno, D., Mastronardi, G. A face recognition system based on Pseudo 2D HMM applied to neural network coefficients. Soft Computing - A Fusion of Foundations, Methodologies and Applications, 12(7): 615–621, 2008. 148

[Bha07]

Bhargava, M., Chen, Chia-Chih, Ryoo, M.S., Aggarwal, J.K. Detection of abandoned objects in crowded environments. In Proceedings IEEE Conference on Advanced Video and Signal Based Surveillance, AVSS 2007, London, UK, 271–276, September 2007. 157

[Bro01]

Broadhurst, A., Drummond, T.W., Cipolla, R. A Probabilistic Framework for Space Carving. In Proceedings Eighth IEEE International Conference on Computer Vision, ICCV 2001, 388–393, 2001. 109

[Bro02]

Brown, M., Lowe, D.G. Invariant Features from Interest Point Groups. In Proceedings British Machine Vision Conference, BMVC2002, 656–665, 2002. 70

[Bur98]

Burges, C. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2): 121–167, 1998. 172

[Bux03]

Buxton, Hilary. Learning and understanding dynamic scene activity: a review. Journal on Image Vision Computing, 21(1): 125–136, 2003. 4

[Cal08]

Calderara, S., Cucchiara, R., Prati, A. Bayesian-Competitive Consistent Labeling for People Surveillance. IEEE Transactions on Pattern Analysis Machine Intelligence, 30(2): 354–360, 2008. 92, 104, 107

[Can07]

Canotilho, P., Moreno, R. P. Detecting Luggage Related Behaviors Using a New Temporal Boost Algorithm. In Proceedings Tenth IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, PETS 2007, IEEE, Rio de Janeiro, Brazil, October 2007. 157

[Car05]

Carletta, J., Ashby, S., Bourban, S., Flynnand, M., Guillemot, M., Hain, T., Kadlec, J., Karaiskos, V., Kraaij, W., Kronenthal, M., Lathoud, G., Lincoln, M., Lisowska, A., McCowan, I., Post, W., Reidsma, D., Wellner, P. The AMI Meetings Corpus. In Proceedings of the Measuring Behavior symposium on Annotating and Measuring Meeting Behavior, 2005. 167

193

Bibliography [Car06]

Carter, N., Young, D., Ferryman, J. A Combined Bayesian Markovian Approach for Behaviour Recognition. In Proceedings of the 18th International IEEE Conference on Pattern Recognition, ICPR06, 761–764, Washington, DC, USA, 2006. 160

[Car08]

Carter, N., Ferryman, J. The SAFEE On-Board Threat Detection System. In International Conference on Computer Vision Systems, 79–88, May 2008. 137

[Cha91]

Charniak, E. Bayesian Networks Without Tears: Making Bayesian Networks More Accessible to the Probabilistically Unsophisticated. AI Magazine, 12(4): 50–63, 1991. 143

[Cha97]

Chatterjee, C., Roychowdhury, V.P., Chong, E.K.P. A Nonlinear GaussSeidel Algorithm for Noncoplanar and Coplanar Camera Calibration with Convergence Analysis. Computer Vision and Image Understanding, 67(1): 58–80, 1997. 99

[Cha99]

Cham, T., Rehg, J.M. A multiple hypothesis approach to figure tracking. In Proceedings IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR1999, volume 2, 1999. 64

[Cha01]

Chang, Chih-Chung, Lin, Chih-Jen. LIBSVM: a library for support vector machines. 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm. 38

[Cha06]

Chadwicke, O.C., Gonzalez, G., Loper, M. Dynamical Motion Vocabularies for Kinematic Tracking and Activity Recognition. In Proceedings of the 2006 IEEE Conference on Computer Vision and Pattern Recognition Workshop, CVPRW06, 147, Washington, DC, USA, 2006. 160

[Che01]

Chen, Y., Rui, Y., Huang, T.S. JPDAF based HMM for real-time contour tracking. In Proceedings of the 2001 IEEE Conference on Computer Vision and Pattern Recognition, CVPR2001, volume 1, 543–550, 2001. 89

[Che06]

Chen, D., Liao, H. Mark, Shih, S. Continuous Human Action Segmentation and Recognition Using a Spatio-Temporal Probabilistic Framework. In Proceedings of the Eighth IEEE International Symposium on Multimedia, ISM06, 275–282, Washington, DC, USA, 2006. 153

[Che07]

Cheng, Z., Devarajan, D., Radke, R. J. Determining vision graphs for distributed camera networks using feature digests. EURASIP Journal on Applied Signal Processing, 2007(1): 220–220, 2007. 129

[Cho08]

Choi, J., Cho, Y., Cho, K., Bae, S., Yang, H. S. A View-based Multiple Objects Tracking and Human Action Recognition for Interactive Virtual Environments. The International Journal of Virtual Reality, 7: 71–76, September 2008. 153, 157

[Chr04]

Christianini, N., Taylor, J. Shawe. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004. 173

[Chu07]

Chum, O., Zisserman, A. An Exemplar Model for Learning Object Classes. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition, CVPR07, 1–8, June 2007. 54

[Col00]

Collins, R.T., Lipton, A.J., Kanade, T. Introduction to the special section on video surveillance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8): 745–746, Aug 2000. 55

194

Bibliography [Com00] Comaniciu, D., Ramesh, V., Meer, P. Real-time tracking of non-rigid objects using mean shift. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition, CVPR2000, volume 2, 142–149, 2000. 91 [Com03] Comaniciu, D., Ramesh, V., P. Meer, P. Kernel-based object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(5): 564–577, May 2003. 55, 59 [Coo99]

Cootes, T., Taylor, C. Statistical models of appearance for computer vision. Technical report, University of Manchester, Wolfson Image Analysis Unit, Imaging Science and Biomedical Engineering, Manchester, UK., September 1999. 54

[Cue05]

Cuevas, E., Zaldivar, D., Rojas, R. Kalman filter for vision tracking, Technical Report B 05-12. Technical report, Berlin, Germany, 2005. 58, 59

[Dal05]

Dalal, N., Triggs, B. Histograms of Oriented Gradients for Human Detection. In Proceedings IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2005., volume 1, 886–893, 2005. 40, 41

[Dam08] Damen, D., Hogg, D. Detecting Carried Objects in Short Video Sequences. In Proceedings of the 10th European Conference on Computer Vision, ECCV 2008, Marseille, France, 154–167, 2008. 156, 157 [Dau88]

Daugman, J.G. Complete discrete 2-D Gabor transforms by neural networks for image analysis and compression. IEEE Transactions on Acoustics, Speech and Signal Processing, 36(7): 1169–1179, Jul 1988. 47

[Dav04]

Davies, E.R. Machine Vision : Theory, Algorithms, Practicalities. Morgan Kaufmann, December 2004. 20

[Dem77] Dempster, A.P., Laird, N.M., Rubin, D.B. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, 39(Series B): 1–38, 1977. 12 [Duq05]

Duque, D., Santos, H., Cortez, P. Moving object detection unaffected by cast shadows, highlights and ghosts. In Proceedings IEEE International Conference on Image Processing (ICIP) 2005, Genoa, Italy, III–413–16, September 2005. 16

[Eic99]

Eickeler, S., Mu ¨ ller, S., Rigoll, G. High Performance Face Recognition Using Pseudo 2-D Hidden Markov Models. In Proceedings European Control Conference, ECC, 1999. 148

[Ekm78] Ekman, P., Friesen, W. Facial Action Coding System: A Technique for the Measurement of Facial Movement. Consulting Psychologists Press, Palo Alto, 1978. 47 [Ekm84] Ekman, P., Scherer, K. Approaches To Emotion. Lawrence Erlbaum Associates, 1984. 135, 165 [Elg01]

Elgammal, A.E., Davis, L.S. Probabilistic framework for segmenting people under occlusion. In Proceedings Eighth IEEE International Conference on Computer Vision, ICCV 2001, Vancouver, British Columbia, Canada, volume 2, 145–152, July 2001. 22, 91

[Ell02]

Ellis, T. Performance Metrics and Methods for Tracking in Surveillance. In Third IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, PETS2002, Copenhagen, Denmark, 26–31, June 2002. 126

195

Bibliography [Esh08]

Eshel, R., Moses, Y. Homography based multiple camera detection and tracking of people in a dense crowd. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition, CVPR2008, 2008, Anchorage, Alaska, USA, 1–8, June 2008. 91, 120

[Est]

Estrada, F.J., Jepson, A.D., Fleet, D. Planar Homographies, Lecture Notes Foundations of Computer Vision. 102

[Eve08]

Everingham, M., Gool, L. Van, Williams, C. K. I., Winn, J., Zisserman, A. The PASCAL Visual Object Classes Challenge 2008 (VOC2008) Results. http://www.pascal-network.org/, 2008. 7

[Fel08]

Felzenszwalb, P., McAllester, D., Ramanan, D. A Discriminatively Trained, Multiscale, deformable part model. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2008, 1–8, June 2008. 7

[Fer07]

Ferryman, J., Tweed, D. An Overview of the PETS 2007 Dataset. In Proceedings Tenth IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, PETS 2007, Rio de Janeiro, Brazil, October 2007. 92, 156

[Fle90]

Fleet, D.J., Jepson, A.D. Computation of component image velocity from local phase information. International Journal of Computer Vision, 5(1): 77–104, 1990. 50

[For97]

Forsyth, D.A., Fleck, M. Body plans. In Proceedings IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR97, 678, Washington, DC, USA, 1997. 35

[Fra03]

Franco, J.-S., Boyer, E. Exact polyhedral visual hulls. In British Machine Vision Conference BMVC03, volume 1, 329–338, 2003. 121

[Fra04]

Francois, A. R. J. Real-Time Multi-Resolution Blob Tracking. In IRIS Technical Report, IRIS-04-422, University of Southern California, Los Angeles, USA, April 2004. 153, 157

[Fre95]

Freund, Y., Schapire, R. A decision-theoretic generalization of on-line learning and an application to boosting. In Proceedings European Conference on Computational Learning Theory, 23–37, 1995. 37, 44

[Fuk87]

Fukunaga, K. Introduction to Statistical Pattern Recognition. Academic Press, 1987. 171

[FVJ01]

F. V. Jensen, Finn V. Bayesian Networks and Decision Graphs. Information Science and Statistics. Springer, July 2001. 89

[Gat08]

Gatev, A. Investigating 2D Object Tracking Approaches. Master Thesis at the Institute for Human Machine Communication, Technische Universit¨at M¨ unchen, Germany, September 2008. 87

[Gav99]

Gavrila, D.M. The visual analysis of human movement: a survey. Elsevier Journal Computer Vision and Image Understanding, 73(1): 82–98, 1999. 30

[Gok98]

Gokstorp, M., Forchheimer, R. Smart vision sensors. In Proceedings IEEE International Conference on Image Processing, ICIP1998, Chicago, Illinois, USA, volume 1, 479–482, 1998. 2

196

Bibliography [Gom03] Gomila, C., Meyer, F. Graph-based object tracking. In Proceedings IEEE International Conference on Image Processing, ICIP2003, volume 2, II–41–4 vol.3, September 2003. 75 [Gon90]

Gonzalez, R. C., Wintz, P. Digital Image Processing, Second Edition. Addison Wesley, 1990. 15, 21, 24

[Gra07a] Grabner, H., Roth, P., Bischof, H. Is Pedestrian Detection Really a Hard Task? In Proceedings Tenth IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, PETS 2007, IEEE, Rio de Janeiro, Brazil, October 2007. 30 [Gra07b] Gray, D., Brennan, Shane, Tao, H. Evaluating Appearance Models for Recognition, Reacquisition, and Tracking. In Proceedings Tenth IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, PETS 2007, IEEE, Rio de Janeiro, Brazil, October 2007. 82, 130 [Gul01]

Guler, S. Scene and Content Analysis from Multiple Video Streams. In Proceedings of the 30th IEEE Workshop on Applied Imagery Pattern Recognition, AIPR01, 119, 2001. 154

[Haa10]

Haar, A. Zur Theorie der orthogonalen Funktionensysteme. Mathematische Annalen, 69: 331–371, 1910. 30

[Ham05] Hampapur, A., Brown, L., Connell, J., Ekin, A., Haas, N., Lu, M., Merkl, H., Pankanti, S. Smart Video Surveillance: Exploring the Concept of Multiscale Spatiotemporal Tracking. IEEE Signal Processing Magazine, 22(2): 38–51, March 2005. 3, 7 [Ham08] Hamdoun, O., Moutarde, F., Stanciulescu, B., Steux, B. Person ReIdentification in Multi-Camera System by Signature Based on Interest Point Descriptors Collected on Short Video Sequences. In Proceedings Second ACM/IEEE International Conference on Distributed Smart Cameras, ICDSC2008, Stanford, CA, USA, 1–8, September 2008. 82 [Han05]

Han, J., Bhanu, B. Human Activity Recognition in Thermal Infrared Imagery. In Proceedings IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops, CVPR2005 Workshops, 17–17, June 2005. 2

[Har88]

Harris, C., Stephens, M. A Combined Corner and Edge Detection. In Proceedings of The Fourth Alvey Vision Conference, 147–151, 1988. 71

[Har03]

Hartley, R., Zisserman, A. Multiple View Geometry in Computer Vision. Cambridge University Press, March 2003. 102, 105

[Has98]

Hastie, T., Tibshirani, R. Classification by Pairwise Coupling. In Advances in Neural Information Processing Systems, volume 10. MIT Press, 1998. 174

[Her98]

Herodotou, N., Plataniotis, K.N., Venetsanopoulos, A.N. A color segmentation scheme for object-based video coding. IEEE Symposium on Advances in Digital Filtering and Signal Processing, 25–29, Jun 1998. 15

[Hof07]

Hofmann, M. 3D Person Tracking applying Homographic transformations in Multiple layers. Bachelor Thesis at the Institute for Human Machine Communication, Technische Universit¨at M¨ unchen, Germany, July 2007. 112

197

Bibliography [H¨or09a] Ho c, D., Schuller, B., Rigoll, G. Boosting Multi-Modal Camera ¨rnler, B., Arsi´ Selection With Semantic Features. In Proceedings 10th International Conference on Multimedia and Expo ICME 2009, Cancun, Mexico, June 2009. 152 [H¨or09b] Ho c, D., Schuller, B., Rigoll, G. Graphical Models For Multi¨rnler, B., Arsi´ Modal Automatic Video Editing in Meetings. In Proceedings 16th IEEE International Conference on Digital Signal Processing, DSP2009, Special Session “Fusion of Heterogeneous Data for Robust Estimation and Classification” Santorini, Greece, 2009. 152 [Hri08]

Hristov, E. Implementation of a 3D Tracking System Based on Multilayer Homography. Master Thesis at the Institute for Human Machine Communication, Technische Universit¨at M¨ unchen, Germany, July 2008. 123

[Hu04]

Hu, W., Tan, T., Wang, L., Maybank, S. A survey on visual surveillance of object motion and behaviors. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 34(3): 334–352, August 2004. 131

[HUM04] HUMAINE - Network of Excellence. http://emotion-research.net/, 2004. 165 [Hur89]

Hurlbert, A.C. The Computation of Color. Technical report, Cambridge, MA, USA, 1989. 17

[Isa98] Isard, M., Blake, A. Condensation - conditional density propagation for visual tracking. International Journal of Computer Vision, 29(1): 5–28, 1998. 55, 61, 139

[Jae06] Jaeger, S. From Informational Confidence to Informational Intelligence. In Proceedings 10th International Workshop on Frontiers in Handwriting Recognition, IWFHR, 173–178, October 2006. 141

[Jav02] Javed, O., Shah, M. Tracking and Object Classification for Automated Surveillance. In Proceedings of the 7th European Conference on Computer Vision, ECCV02, 343–357, London, UK, 2002. Springer-Verlag. 55

[Jav05] Javed, O., Shafique, K., Shah, M. Appearance Modeling for Tracking in Multiple Non-Overlapping Cameras. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR05, volume 2, 26–33, Washington, DC, USA, 2005. 54, 82

[Jäh95] Jähne, B. Digital image processing (3rd ed.): concepts, algorithms, and scientific applications. Springer-Verlag, London, UK, 1995. 21, 94, 95

[Jia92] Jiang, C., Ward, M.O. Shadow identification. 606–612, June 1992. 15

[Jia07] Jiang, Y., Ngo, C., Yang, J. Towards optimal bag-of-features for object categorization and semantic video retrieval. In Proceedings of the 6th ACM International Conference on Image and Video Retrieval, CIVR'07, 494–501, New York, NY, USA, 2007. ACM. 54

[Jon02] Jones, M.J., Rehg, J.M. Statistical Color Models with Application to Skin Detection. International Journal of Computer Vision, 46(1): 81–96, 2002. 29

[Jor96] Jordan, M. Neural Networks. MIT AI Laboratory, AI Memo, (4562), March 1996. 43, 143, 174

[Kal60] Kalman, R.E. A New Approach to Linear Filtering and Prediction Problems. Transactions of the ASME–Journal of Basic Engineering, 82(Series D): 35–45, 1960. 55, 58, 112, 124

[Kan04] Kang, J., Cohen, I., Medioni, G. Object reacquisition using invariant appearance model. In Proceedings of the 17th IEEE International Conference on Pattern Recognition, ICPR2004, volume 4, 759–762, August 2004. 89

[Kay08] Kayumbi, G., Anjum, N., Cavallaro, A. Global trajectory reconstruction from distributed visual sensors. In Proceedings Second ACM/IEEE International Conference on Distributed Smart Cameras, ICDSC2008, Stanford, CA, USA, 1–8, September 2008. 130

[Kha06] Khan, S.M., Shah, M. A Multiview Approach to Tracking People in Crowded Scenes Using a Planar Homography Constraint. In Proceedings of the European Conference on Computer Vision, ECCV 2006, Graz, Austria, 133–146, 2006. 22, 91, 92, 105, 109, 112, 120, 129

[Kha07] Khan, S.M., Yan, P., Shah, M. A homographic framework for the fusion of multi-view silhouettes. In Proceedings Eleventh IEEE International Conference on Computer Vision, ICCV2007, Rio de Janeiro, Brazil, October 2007. 117

[Kis07] Kisku, D.R., Rattani, A., Grosso, E., Tistarelli, M. Face Identification by SIFT-based Complete Graph Topology. In Proceedings IEEE Workshop on Automatic Identification Advanced Technologies, 63–68, June 2007. 69, 75

[Kit05] Kittler, J., Hilton, A., Hamouz, M., Illingworth, J. 3D Assisted Face Recognition: A Survey of 3D Imaging, Modeling and Recognition Approaches. In Proceedings IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, CVPR2005 Workshops, 114–114, June 2005. 150

[Koe84] Koenderink, J.J. The structure of images. Biological Cybernetics, 50: 363–396, 1984. 69

[Koe02] Koenen, R. Overview of the MPEG-4 Standard. In International Organisation for Standardisation, ISO N4668. http://www.chiariglione.org/mpeg/standards/MPEG-4/MPEG-4.htm, March 2002. 92

[Kra06] Krahnstoever, N., Tu, P., Sebastian, T., Perera, A., Collins, R. Multi-view Detection and Tracking of Travelers and Luggage in Mass Transit Environments. In Proceedings of the Ninth IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, PETS 2006, IEEE, New York, NY, USA, October 2006. 157

[Kra08] Krahnstoever, N. Applications and challenges for multi-camera and multi-sensor vision systems. In Multi-camera and Multi-modal Sensor Fusion: Algorithms and Applications (M2SFA2 2008), Marseille, France, in conjunction with ECCV 2008, 2008. 2

[Kuo05] Kuo, P., Hillman, P., Hannah, J. Improved facial feature extraction for model-based multimedia. In Proceedings 2nd IEE European Conference on Visual Media Production, CVMP2005, 137–146, 2005. 54

[Kut98] Kutulakos, K., Seitz, S. A theory of shape by space carving. Technical Report TR692, Computer Science Department, University of Rochester, 1998. 117

[Lad93] Lades, M., Vorbruggen, J.C., Buhmann, J., Lange, J., von der Malsburg, C., Wurtz, R.P., Konen, W. Distortion invariant object recognition in the dynamic link architecture. IEEE Transactions on Computers, 42(3): 300–311, March 1993. 49

[Lau94] Laurentini, A. The Visual Hull Concept for Silhouette-Based Image Understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(2): 150–162, 1994. 117

[Laz03] Lazebnik, S., Schmid, C., Ponce, J. Affine-invariant local descriptors and neighborhood statistics for texture recognition. In Proceedings Ninth IEEE International Conference on Computer Vision, ICCV2003, 649–655, October 2003. 55

[Laz06] Lazebnik, S., Schmid, C., Ponce, J. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition, CVPR06, 2169–2178, Washington, DC, USA, 2006. IEEE Computer Society. 54

[Lee96] Lee, T.S. Image Representation Using 2D Gabor Wavelets. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(10): 959–971, 1996. 48

[Leh08] Lehment, N. Tracking and Recognition of Persons with Deformable Meshes. Bachelor Thesis at the Institute for Human Machine Communication, Technische Universität München, Germany, July 2008. 80, 88

[Leh09] Lehment, N., Arsić, D., Lyutskanov, A., Schuller, B., Rigoll, G. Supporting Multi Camera Tracking by Monocular Deformable Graph Tracking. In Proceedings Eleventh IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, PETS 2009, Miami, FL, USA, June 2009. 89

[Lie02] Lienhart, R., Maydt, J. An extended set of Haar-like features for rapid object detection. In Proceedings International Conference on Image Processing, ICIP2002, volume 1, 900–903, 2002. 44

[Lin94] Lindeberg, T. Scale-space theory: A basic tool for analysing structures at different scales. Journal of Applied Statistics, 21: 225–270, 1994. 69

[Lin07] Lin, H.Y., Wei, J.-Y. A Street Scene Surveillance System for Moving Object Detection, Tracking and Classification. In Proceedings IEEE Intelligent Vehicles Symposium, 1077–1082, June 2007. 7

[Liu08] Liu, M., Wu, C., Zhang, Y. Multi-resolution optical flow tracking algorithm based on multi-scale Harris corner points feature. In Proceedings Chinese Control and Decision Conference, CCDC 2008, Shenyang, China, 5287–5291, July 2008. 68

[LM08] Lazarević-McManus, N., Renno, J.R., Makris, D., Jones, G.A. An object-based comparative methodology for motion detection based on the F-Measure. Computer Vision and Image Understanding, 111(1): 74–85, 2008. 126

[Lo01] Lo, B.P.L., Velastin, S.A. Automatic Congestion Detection System for Underground Platforms. In Proceedings 2001 International Symposium on Intelligent Multimedia, Video and Speech Processing, Hong Kong, May 2001. 11

[Lo04] Lo, D., Goubran, R.A., Dansereau, R.M. Multimodal talker localization in video conferencing environments. In Proceedings 3rd IEEE International Workshop on Haptic, Audio and Visual Environments and Their Applications, HAVE 2004, 195–200, October 2004. 93

[Lop08] Lopez-Garcia, F. SIFT features for object recognition and tracking within the IVSEE system. In Proceedings 19th IEEE International Conference on Pattern Recognition, ICPR2008, 1–4, December 2008. 89

[Low04] Lowe, D.G. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60: 91–110, 2004. 69, 70, 82

[Lu05] Lu, S., Tsechpenakis, G., Metaxas, D.N., Jensen, M.L., Kruse, J. Blob Analysis of the Head and Hands: A Method for Deception Detection. In Proceedings of the 38th Annual Hawaii International Conference on System Sciences, HICSS'05, January 2005. 55

[Luo07] Luo, J., Ma, Y., Takikawa, E., Lao, S., Kawade, M., Lu, B.-L. Person-Specific SIFT Features for Face Recognition. In Proceedings IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP2007, volume 2, II-593–II-596, April 2007. 69, 75

[Mac00] MacCormick, J., Blake, A. A Probabilistic Exclusion Principle for Tracking Multiple Objects. International Journal of Computer Vision, 39(1): 57–71, 2000. 55

[Mac03] Maciel, J., Costeira, J.P. A global solution to sparse correspondence problems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25: 187–199, 2003. 86

[Mar03] Martinkauppi, B., Soriano, M., Pietikäinen, M. Detection of Skin Color under Changing Illumination: A Comparative Study. In Proceedings of the 12th International IEEE Conference on Image Analysis and Processing, ICIAP'03, 652, Washington, DC, USA, 2003. 27

[Mar08] Martin, C., Werner, U., Gross, H.M. A Real-time Facial Expression Recognition System based on Active Appearance Models using Gray Images and Edge Images. In Proceedings Eighth IEEE International Conference on Face and Gesture Recognition, FG08, 2008. 136

[Mat04] Matthews, L., Ishikawa, T., Baker, S. The template update problem. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6): 810–815, June 2004. 89

[Mck96] Mckenna, S., Gong, S. Tracking Faces. In Proceedings International IEEE Conference on Automatic Face & Gesture Recognition, 271–276. IEEE Computer Society Press, 1996. 55

[Meh68] Mehrabian, A. Communication without words. Psychology Today, 2(4): 53–56, 1968. 47

[Men00] Menser, B., Wien, M. Segmentation and tracking of facial regions in color image sequences. In Proceedings SPIE Visual Communication and Image Processing, VCIP, 731–741, 2000. 29

[Mic08] Michoud, B., Bouakaz, S., Guillou, E., Briceno, H. Largest Silhouette-Equivalent Volume for 3D Shapes Modeling without Ghost Object. In Proceedings Workshop on Multi-camera and Multi-modal Sensor Fusion Algorithms and Applications (M2SFA2), Marseille, France, October 2008. 120, 122

[Mik02] Mikolajczyk, K., Schmid, C. An affine invariant interest point detector. In Proceedings European Conference on Computer Vision, ECCV, 128–142. Springer Verlag, September 2002. 69

[Mik04] Mikolajczyk, K., Schmid, C., Zisserman, A. Human detection based on a probabilistic assembly of robust part detectors. In Proceedings European Conference on Computer Vision, ECCV2004, volume I, 69–81, 2004. 35

[Mil85] Mills, D.L. Network Time Protocol (NTP). Network Working Group Request for Comments: 958, 1–14, 1985. 93

[Mil07] Miller, A., Shah, M. Foreground Segmentation in Surveillance Scenes Containing a Door. In Proceedings IEEE International Conference on Multimedia and Expo, ICME2007, 1822–1825, July 2007. 54

[Mün08] Münch, C. Facial Action Recognition Applying Elastic Bunch Graph Matching. Bachelor Thesis at the Institute for Human Machine Communication, Technische Universität München, Germany, September 2008. 52

[Moh01] Mohan, A., Papageorgiou, C., Poggio, T. Example-Based Object Detection in Images by Components. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(4): 349–361, 2001. 30, 35, 38, 40, 41

[MS00] Soriano, M., Martinkauppi, B., Huovinen, S., Laaksonen, M. Skin detection in video under changing illumination conditions. In Proceedings 15th International Conference on Pattern Recognition, Barcelona, Spain, 839–842, 2000. 139

[Nas06] Nascimento, J.C., Marques, J.S. Performance evaluation of object detection algorithms for video surveillance. IEEE Transactions on Multimedia, 8(4): 761–774, 2006. 7

[Nee01] Needham, C.J., Boyle, R.D. Tracking multiple sports players through occlusion, congestion and scale. In Proceedings British Machine Vision Conference, BMVC2001, 93–102, 2001. 64

[Niu04] Niu, W., Long, J., Han, D., Wang, Y.-F. Human Activity Detection and Recognition for Video Surveillance. In IEEE International Conference on Multimedia and Expo, Taipei, Taiwan, 719–722, June 2004. 91

[Nta09] Ntalampiras, S., Arsić, D., Störmer, A., Ganchev, T., Potamitis, I., Fakotakis, N. PROMETHEUS Database: A Multi-Modal Corpus For Research On Modeling And Interpreting Human Behavior. In Proceedings 16th IEEE International Conference on Digital Signal Processing, Special Session “Fusion of Heterogeneous Data for Robust Estimation and Classification”, DSP2009, Santorini, Greece, 2009. 14, 25, 88, 107, 169

[Num03] Nummiaro, K., Koller-Meier, E., Svoboda, T., Roth, D., Van Gool, L.J. Color-Based Object Tracking in Multi-camera Environments. In 25th Pattern Recognition Symposium, DAGM'03, number 2781 in LNCS, 591–599, 2003. 91

[Oli00] Oliver, N., Rosario, B., Pentland, A.P. A Bayesian Computer Vision System for Modeling Human Interactions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8): 831–843, 2000. 131

[Orw99] Orwell, J., Remagnino, P., Jones, G.A. Multi-camera colour tracking. In Proceedings Second IEEE Workshop on Visual Surveillance, VS'1999, Fort Collins, CO, USA, 14–21, July 1999. 91

[Ost98] Ostermann, J. Animation Of Synthetic Faces in MPEG-4. Computer Animation, 49–51, 1998. 42, 132, 139

[Pan00] Pantić, M., Rothkrantz, L. Automatic Analysis of Facial Expressions: The State of the Art. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12): 1424–1445, December 2000. 47, 131

[Pan03] Pandzić, I., Forchheimer, R. MPEG-4 Facial Animation: The Standard, Implementation and Applications. John Wiley & Sons, Inc., New York, NY, USA, 2003. 47

[Pan06] Pantić, M., Patras, I. Dynamics of facial expression: recognition of facial actions and their temporal segments from face profile image sequences. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 36(2): 433–449, 2006. 41

[Pap00] Papageorgiou, C., Poggio, T. A Trainable System for Object Detection. International Journal of Computer Vision, 38(1): 15–33, 2000. 22, 32, 37, 44, 156

[Pau07] Paulo, C.F., Correia, P.L. Automatic Detection and Classification of Traffic Signs. In Proceedings of the Eighth IEEE International Workshop on Image Analysis for Multimedia Interactive Services, WIAMIS07, Washington, DC, USA, 2007. IEEE Computer Society. 7

[Pel05] Pellkofer, A. Detecting Pedestrians With Neural Networks. Master Thesis at the Institute for Human Machine Communication, Technische Universität München, Germany, October 2005. 38, 39

[Pep07] Pepperdog Ltd. A Pepperdog Whitepaper: A Video Analytics Primer for CCTV Users. July 2007. 1

[Per06] Perera, A.G., Srinivas, C., Hoogs, A., Brooksby, G., Hu, W. Multi-Object Tracking Through Simultaneous Long Occlusions and Split-Merge Conditions. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR'06, 666–673, Washington, DC, USA, 2006. 154

[Phi00] Phillips, P.J., Moon, H., Rizvi, S.A., Rauss, P.J. The FERET Evaluation Methodology for Face-Recognition Algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(10): 1090–1104, 2000. 42, 149

[Pic04] Piccardi, M. Background subtraction techniques: a review. In IEEE International Conference on Systems, Man and Cybernetics, volume 4, 3099–3104, October 2004. 10

[Poi08] Point Grey Research. Dragonfly Camera Synchronization. October 2008. 93

[Pol08] Police Service of Northern Ireland. Press Releases. October 2008. 2

[Pop07] Poppe, R. Vision-based human motion analysis: An overview. Journal on Computer Vision and Image Understanding, 108(1-2): 4–18, 2007. 130

[Por08] Porikli, F., Ivanov, Y., Haga, T. Robust abandoned object detection using dual foregrounds. EURASIP Journal on Advances in Signal Processing, 2008(1): 1–10, 2008. 54

[Pra91] Pratt, W.K. Digital image processing (2nd ed.). John Wiley & Sons, Inc., New York, NY, USA, 1991. 94

[Pra01] Prati, A., Mikić, I., Grana, C., Trivedi, M.M. Shadow detection algorithms for traffic flow analysis: a comparative study. In Proceedings IEEE Intelligent Transportation Systems, Oakland, CA, USA, 340–345, August 2001. 15

[Pra03] Prati, A., Mikić, I., Trivedi, M.M., Cucchiara, R. Detecting moving shadows: Formulation, algorithms and evaluation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25, 2003. 17

[Qi08] Qi, W. Component based Object Detection. Master Thesis at the Institute for Human Machine Communication, Technische Universität München, Germany, October 2008. 35

[Rab89] Rabiner, L. A Tutorial On Hidden Markov Models And Selected Applications In Speech Recognition. In Proceedings of the IEEE, volume 77, 257–286, 1989. 176

[Ram03] Ramanan, D., Forsyth, D.A. Automatic Annotation of Everyday Movements. Technical Report UCB/CSD-03-1262, EECS Department, University of California, Berkeley, July 2003. 160

[Rei06] Reiter, S., Schuller, B., Rigoll, G. A combined LSTM-RNN-HMM-Approach for Meeting Event Segmentation and Recognition. In Proceedings International IEEE Conference on Acoustics, Speech, and Signal Processing, ICASSP2006, Toulouse, France, volume II, 393–396. IEEE, 2006. 151

[Ren05] Ren, X., Berg, A.C., Malik, J. Recovering human body configurations using pairwise constraints between parts. In Proceedings 10th International Conference on Computer Vision, ICCV05, volume 1, 824–831, 2005. 83

[Rin07] Ringbeck, T., Hagebeuker, B. A 3D time of flight camera for object detection. In Proceedings 8th Conference on Optical 3-D Measurement Techniques, Zurich, Switzerland, July 2007. 2

[Ris04] Ristić, B., Arulampalam, S., Gordon, N. Beyond the Kalman filter: particle filters for tracking applications. Artech House, Boston, MA; London, 2004. 61

[Rob66] Roberts, L.G. Homogeneous Matrix Representation and Manipulation of N-Dimensional Constructs. Technical Report MS-1405, Lincoln Lab, MIT, Cambridge, Mass., USA, 1966. 94

[Row97] Rowley, H., Baluja, S., Kanade, T. Rotation Invariant Neural Network-Based Face Detection. Technical Report CMU-CS-97-201, Computer Science Department, Pittsburgh, PA, December 1997. 46

[Row98] Rowley, H., Baluja, S., Kanade, T. Neural Network-Based Face Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1): 23–38, January 1998. 42, 139

[Rum86] Rumelhart, D.E., Hinton, G.E., Williams, R.J. Learning internal representations by error propagation. MIT Press Computational Models Of Cognition And Perception Series, 318–362, 1986. 175

[Rus97] Russell, J., Fernandez-Dols, J. The psychology of facial expression. Cambridge University Press, 1997. 131

[Sah76] Sahni, S., Gonzalez, T. P-Complete Approximation Problems. Journal of the ACM, 23(3): 555–565, 1976. 83

[Sch07a] Schuller, B. Mensch, Maschine, Emotion - Erkennung aus sprachlicher manueller Interaktion. VDM Verlag Dr. Müller, Saarbrücken, 2007. ISBN 978-3-8364-1522-4, 240 S. 2

[Sch07b] Schuller, B., Wimmer, M., Arsić, D., Rigoll, G., Radig, B. Audiovisual Behavior Modeling by Combined Feature Spaces. In Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2007, Honolulu, Hawaii, USA, 15.-20.04.2007, volume II, 733–736. IEEE, 2007. 138

[Sch07c] Schuller, B., Wimmer, M., Arsić, D., Rigoll, G., Radig, B. Audiovisual Behavior Modeling by Combined Feature Spaces. In Proceedings ICASSP 2007, IEEE, Honolulu, Hawaii, USA, 15.-20.04.2007, April 2007. 143

[Sch08a] Schröder, M., Cowie, R., Heylen, D., Pantic, M., Pelachaud, C., Schuller, B. Towards responsive Sensitive Artificial Listeners. In Proceedings 4th International Workshop on Human-Computer Conversation, Bellagio, Italy, 06.-07.10.2008. 144

[Sch08b] Schreiber, S., Störmer, A., Rigoll, G. Omnidirectional Tracking and Recognition of Persons in Planar Views. In Proceedings IEEE International Conference on Image Processing, ICIP2008, San Diego, CA, USA, 1476–1479, October 2008. 150

[Sch08c] Schuller, B., Wimmer, M., Arsić, D., Moosmayr, T., Rigoll, G. Detection of Security Related Affect and Behaviour in Passenger Transport. In Proceedings 9th Interspeech 2008 incorporating 12th Australasian International Conference on Speech Science and Technology, SST 2008, Brisbane, Australia, 265–268. ISCA, 2008. 138, 143

[Sei06] Seitz, S., Curless, B., Diebel, J., Scharstein, D., Szeliski, R. A Comparison and Evaluation of Multi-View Stereo Reconstruction Algorithms. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition, CVPR, New York, NY, volume 1, 519–528, June 2006. 117

[Set87] Sethi, I.K., Jain, R. Finding trajectories of feature points in a monocular image sequence. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9(1): 56–73, 1987. 55

[Sha01] Shapiro, L.G., Stockman, G.C. Computer Vision. Prentice Hall, January 2001. 21

[Shi93] Shi, J., Tomasi, C. Good Features to Track. Technical report, Ithaca, NY, USA, 1993. 68, 91

[Shi00] Shi, J., Malik, J. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8): 888–905, August 2000. 112

[Shi02] Shirley, P. Fundamentals of Computer Graphics. A. K. Peters, Ltd., Natick, MA, USA, 2002. 103

[Sir07] Sirisantisamrid, K., Tirasesth, K., Matsuura, T. An Algorithm for Coplanar Camera Calibration. In Proceedings of the Third International IEEE Conference on Information Hiding and Multimedia Signal Processing, 596–599, Washington, DC, USA, 2007. 99

[Smi05] Smith, P., da Vitoria Lobo, N., Shah, M. TemporalBoost for event recognition. In Proceedings Tenth IEEE International Conference on Computer Vision, ICCV2005, volume 1, 733–740, October 2005. 157

[Son06] Song, X., Nevatia, R. Robust Vehicle Blob Tracking with Split/Merge Handling. In Proceedings First International Evaluation Workshop on Classification of Events, Activities and Relationships, CLEAR 2006, Southampton, UK, volume 4122 of Lecture Notes in Computer Science, 216–222. Springer, 2006. 55

[Stö99] Störring, M., Andersen, H.J., Granum, E. Skin colour detection under changing lighting conditions. In Proceedings of the Seventh International Symposium on Intelligent Robotic Systems, Coimbra, Portugal, 187–195, 1999. 26

[Sta99a] Stauder, J., Mech, R., Ostermann, J. Detection of moving cast shadows for object segmentation. IEEE Transactions on Multimedia, 1(1): 65–76, 1999. 15

[Sta99b] Stauffer, C. Adaptive background mixture models for real-time tracking. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Fort Collins, USA, 246–252, 1999. 12, 22, 23

[Ste99] Stein, G.P. Tracking from multiple view points: Self-calibration of space and time. In Proceedings IEEE International Conference on Computer Vision and Pattern Recognition, CVPR99, volume 1, 519–527, 1999. 129

[Ste08] Stein, M. A framework for detection and tracking of pedestrians. Bachelor Thesis at the Institute for Human Machine Communication, Technische Universität München, Germany, April 2008. 35

[Sun98] Sung, K.-K., Poggio, T. Example-based learning for view-based human face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1): 39–51, January 1998. 43

[Sun05] Sun, W., Cooperstock, J.R. Requirements for Camera Calibration: Must Accuracy Come with a High Price? In Seventh IEEE Workshops on Application of Computer Vision, WACV/MOTIONS '05, volume 1, 356–361, January 2005. 93

[Sze01] Szenberg, F., Carvalho, P.C., Gattass, M. Automatic Camera Calibration for Image Sequences of a Football Match. In Proceedings of the Second International Conference on Advances in Pattern Recognition, ICAPR '01, 301–310, London, UK, 2001. Springer-Verlag. 99

[Tak94] Takahashi, K., Seki, S., Kojima, E., Oka, R. Recognition of dexterous manipulations from time-varying images. In Proceedings of the 1994 IEEE Workshop on Motion of Non-Rigid and Articulated Objects, 23–28, November 1994. 131

[Tan05] Tang, F., Tao, H. Object Tracking With Dynamic Feature Graph. In Proceedings 2nd Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, PETS05, 25–32, October 2005. 75, 126, 127

[Tao00] Tao, H., Sawhney, H.S., Kumar, R. A Sampling Algorithm for Tracking Multiple Objects. In Proceedings of the International Workshop on Vision Algorithms: Theory and Practice, 53–68, London, UK, 2000. Springer-Verlag. 63

[Thi06] Thirde, D., Li, L., Ferryman, J. Overview of the PETS2006 Challenge. In Proceedings of the Ninth IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, PETS 2006, IEEE, New York, NY, USA, October 2006. 169

[Tia05] Tian, Y., Hampapur, A. Robust Salient Motion Detection with Complex Background for Real-Time Video Surveillance. In Proceedings of the IEEE Workshop on Motion and Video Computing, WACV/MOTION'05, 30–35, Washington, DC, USA, 2005. IEEE Computer Society. 8

[Tom98] Tommasini, T., Fusiello, A., Trucco, E., Roberto, V. Making good features track better. In Proceedings IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR1998, 178–183, June 1998. 68

[Tor03] Torralba, A. Contextual Priming for Object Detection. International Journal on Computer Vision, 53(2): 169–191, 2003. 89

[Tsa87] Tsai, R. A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses. IEEE Journal of Robotics and Automation, 3(4): 323–344, August 1987. 93, 99, 104

[Tur91] Turk, M., Pentland, A. Face recognition using eigenfaces. In Proceedings IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR91, 586–591, 1991. 47, 82

[Van01] Van der Merwe, R., Wan, E.A. The square-root unscented Kalman filter for state and parameter-estimation. In Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP'01, volume 6, 3461–3464, 2001. 60

[Vap95] Vapnik, V. The Nature of Statistical Learning Theory. Springer Verlag, 1995. 171

[Vez03] Vezhnevets, V., Sazonov, V., Andreeva, A. A survey on pixel-based skin color detection techniques. In Proceedings Graphicon 2003, 2003. 26

[Vig01] Vigus, S., Bull, D.R., Canagarajah, C.N. Video Object Tracking Using Region Split and Merge and a Kalman Filter Tracking Algorithm. In Proceedings International Conference On Image Processing, ICIP2001, Thessaloniki, Greece, 650–653, October 2001. 154

[Vio01] Viola, P., Jones, M. Robust Real-time Object Detection. In Proceedings Second International Workshop On Statistical and Computational Theories of Vision - Modeling, Learning, Computing, and Sampling, Vancouver, July 2001. 32, 34, 42, 43, 44

[Wal04] Wallhoff, F., Zobl, M., Rigoll, G., Potucek, I. Face Tracking in Meeting Room Scenarios Using Omnidirectional Views. In Proceedings 17th IEEE International Conference on Pattern Recognition, ICPR 2004, Cambridge, UK, volume 4, 933–936, 2004. 46, 61

[Wal05] Wallhoff, F., Arsić, D., Stadermann, J., Schuller, B., Rigoll, G. Hybrid Profile Recognition on the Mugshot Database. In Proceedings EUROCON 2005, IEEE, Belgrade, Serbia, 1405–1408, November 2005. 150

[Wal06a] Wallhoff, F. Entwicklung und Evaluierung neuartiger Verfahren zur automatischen Gesichtsdetektion, Identifikation und Emotionserkennung. PhD thesis, Technische Universität München, Germany, 2006. 42

[Wal06b] Wallhoff, F., Schuller, B., Hawellek, M., Rigoll, G. Efficient Recognition of Authentic Dynamic Facial Expressions on the Feedtum Database. In IEEE International Conference on Multimedia and Expo, ICME2006, 493–496, 2006. 136, 166

[Wal07] Wallhoff, F., Ruß, M., Rigoll, G., Göbel, J., Diehl, H. Improved Image Segmentation Using Photonic Mixer Devices. In International Conference on Image Processing, ICIP2007, San Antonio, Texas, USA, 16.-19.09.2007, volume VI, 53–56. IEEE, 2007. 22

[Wan06] Wang, L. Abnormal Walking Gait Analysis Using Silhouette-Masked Flow Histograms. In Proceedings of the 18th International Conference on Pattern Recognition, 473–476, Washington, DC, USA, 2006. IEEE Computer Society. 130, 153

[Way02] Power, P.W., Schoonees, J.A. Understanding Background Mixture Models for Foreground Segmentation. In Proceedings Image and Vision Computing New Zealand 2002, 267–271, November 2002. 11

[Wel04] Welch, G., Bishop, G. An Introduction to the Kalman Filter. Technical report, 2004. 60

[Wel08] Welsh, B., Farrington, D. Effects of closed circuit television surveillance on crime. In Campbell Systematic Reviews, volume 17, December 2008. 1

[Wim08] Wimmer, M., Schuller, B., Arsić, D., Radig, B., Rigoll, G. Low-Level Fusion of Audio and Video Feature for Multi-Modal Emotion Recognition. In Proceedings 3rd International Conference on Computer Vision Theory and Applications VISAPP, Funchal, Madeira, Portugal, volume 2, 145–151, 2008. 139, 144

[Wis97a] Wiskott, L., Fellous, J.M., Krüger, N., von der Malsburg, C. Face recognition by elastic bunch graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7): 775–779, 1997. 47

[Wis97b] Wiskott, L., Fellous, J.M., Krüger, N., von der Malsburg, C. Face recognition by elastic bunch graph matching. In Proceedings 7th International Conference on Computer Analysis of Images and Patterns, CAIP'97, Kiel, number 1296, 456–463. Springer-Verlag, 1997. 48, 51

[Wit84] Witkin, A. Scale-space filtering: A new approach to multi-scale description. In Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP '84, volume 9, 150–153, 1984. 69

[Woh08] Wohlmuth, M. 3D Reconstruction of Tracking Scenarios Applying Multi Sensor Fusion. Bachelor Thesis at the Institute for Human Machine Communication, Technische Universität München, Germany, March 2008. 117

[Wre97] Wren, C., Azarbayejani, A., Darrell, T., Pentland, A. Pfinder: Real-Time Tracking of the Human Body. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19: 780–785, 1997. 11

[Wu07a] Wu, B., Nevatia, R. Detection and Tracking of Multiple, Partially Occluded Humans by Bayesian Combination of Edgelet based Part Detectors. International Journal of Computer Vision, 75(2): 247–266, 2007. 35

[Wu07b] Wu, C., Aghajan, H. Model-based human posture estimation for gesture analysis in an opportunistic fusion smart camera network. In Proceedings IEEE Conference on Advanced Video and Signal Based Surveillance, AVSS2007, 453–458, September 2007. 130

[Yan99] Yang, M.H., Ahuja, N. Gaussian mixture model for human skin color and its application in image and video databases. In Proceedings of SPIE99, San Jose, CA, 458–466, 1999. 28

[Yan02] Yang, M., Kriegman, D., Ahuja, N. Detecting Faces in Images: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(1): 34–58, January 2002. 41, 42

[Yeh06] Zhu, Q., Yeh, M.-C., Cheng, K.-T., Avidan, S. Fast Human Detection Using a Cascade of Histograms of Oriented Gradients. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR2006, 2006. 41

[Yil06] Yilmaz, A., Javed, O., Shah, M. Object tracking: A survey. ACM Computing Surveys (CSUR), 38(4): 13, 2006. 89

[Yin07] Yin, F., Makris, D., Velastin, S.A. Performance Evaluation of Object Tracking Algorithms. In Proceedings Tenth IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, PETS 2007, IEEE, Rio de Janeiro, Brazil, October 2007. 126

[Yue04] Yue, Z., Zhou, S.K., Chellappa, R. Robust two-camera tracking using homography. In Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP2004, Montreal, Quebec, Canada, 3: 1–4, 17-21 May 2004. 91

[Zen09] Zeng, Z., Pantić, M., Roisman, G.I., Huang, T.S. A Survey of Affect Recognition Methods: Audio, Visual, and Spontaneous Expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(1): 39–58, January 2009. 47

[Zha00] Zhang, Z. A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(11): 1330–1334, November 2000. 93

[Zha03] Zhao, W., Chellappa, R., Phillips, P.J., Rosenfeld, A. Face recognition: A literature survey. ACM Computing Surveys, 35(4): 399–458, 2003. 41, 42

[Zha07] Zhang, J., Marszalek, M., Lazebnik, S., Schmid, C. Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study. International Journal on Computer Vision, 73(2): 213–238, 2007. 7

[Zho08] Zhou, H., Taj, M., Cavallaro, A. Target Detection and Tracking With Heterogeneous Sensors. IEEE Journal of Selected Topics in Signal Processing, 2(4): 503–513, August 2008. 130

[Zim86] Zimmermann, H.J. Fuzzy sets, decision making and expert systems. B.V. Kluwer, Deventer, The Netherlands, 1986. 143

[Ziv04a] Zivković, Z. Improved Adaptive Gaussian Mixture Model for Background Subtraction. In Proceedings 17th IEEE International Conference on Pattern Recognition, ICPR'04, 28–31, Washington, DC, USA, 2004. 12, 109

[Ziv04b] Zivković, Z., Krose, B. An EM-like algorithm for color-histogram-based object tracking. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR2004, volume 1, 798–803, June-July 2004. 82

[Zob03] Zobl, M., Wallhoff, F., Rigoll, G. Action Recognition in Meeting Scenarios using Global Motion Features. In Proceedings Fourth IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (PETS-ICVS), Graz, Austria, 32–36, March 2003. 8, 139
