Collège doctoral N° attribué par la bibliothèque |__|__|__|__|__|__|__|__|__|__|

THÈSE

pour obtenir le grade de
Docteur de l'Ecole des Mines de Paris
Spécialité « Informatique temps réel – Robotique - Automatique »

présentée et soutenue publiquement par

Hicham GHORAYEB

le 12/09/2007

CONCEPTION ET MISE EN ŒUVRE D'ALGORITHMES DE VISION TEMPS REEL POUR LA VIDEO SURVEILLANCE INTELLIGENTE

Directeur de thèse : Claude LAURGEAU

Jury :
M. Patrick MEYRUEIS (Univ. Strasbourg), Rapporteur
M. Patrick SIARRY (Paris XII), Rapporteur
M. Mohamed AKIL (ESIEE), Examinateur
M. Bruno STEUX (EMP), Examinateur
M. Claude LAURGEAU (EMP), Examinateur
M. Fernand MEYER (EMP), Examinateur
M. Karl SCHWERDT (VIDATIS), Examinateur

À Hussein et Najah, mes parents


Remerciements

Cette thèse n'aurait vu le jour sans la confiance, la patience et la générosité de mon directeur de thèse, Monsieur Claude Laurgeau, directeur du centre de robotique de l'Ecole des Mines de Paris. Je voudrais aussi le remercier pour le temps et la patience qu'il m'a accordés tout au long de ces années, d'avoir cru en mes capacités et de m'avoir fourni d'excellentes conditions logistiques et financières.

Mes plus sincères remerciements vont également à Monsieur Bruno Steux, qui, en agissant à titre d'encadrant, a fortement enrichi ma formation. Ses conseils techniques et ses commentaires auront été fort utiles. De plus, les conseils qu'il m'a prodigués tout au long de la rédaction ont toujours été clairs et succincts, me facilitant grandement la tâche et me permettant d'aboutir à la production de cette thèse.

J'aimerais par ailleurs souligner la collaboration importante avec Monsieur Mohamed Akil, professeur à l'ESIEE. Notre collaboration a porté sur l'accélération des détecteurs visuels d'objets sur des cartes graphiques et des cartes FPGA. Dans le cadre de cette collaboration, j'ai encadré trois groupes d'étudiants de l'ESIEE dans leurs stages. Les stages ont abouti à des prototypes de validation de l'idée.

Je voudrais également remercier de tout mon coeur Monsieur Fawzi Nashashibi, pour ses précieux conseils et son support moral tout le long de la thèse.

Je remercie Monsieur Patrick Siarry, professeur à l'université Paris XII, pour l'intérêt qu'il a apporté à mon travail en me consacrant de son temps pour rapporter sur le présent travail. Je remercie également Monsieur Patrick Meyrueis, professeur à l'université de Strasbourg, pour la rapidité avec laquelle il m'a accordé du temps pour rapporter sur le présent travail.

Je remercie Monsieur Karl Schwerdt et Monsieur Didier Maman, fondateurs de la société VIDATIS, pour leur collaboration dans le co-encadrement du stage de Khattar Assaf.

Je remercie les stagiaires avec qui j'ai travaillé : Khattar Assaf, Hamid Medjahed et Marc-André Lureau.

Enfin, une pensée émue pour mes amis avec qui j'ai partagé la L007 pendant ces années de thèse : Jacky, Clément, Jorge, Marc et Amaury. Sans oublier les autres collègues dans l'aquarium : Safwan, Xavier, Rasul et Laure. Et surtout ceux qui sont dans des bureaux avec une salle de réunion : Fawzi, François, Samer, Iyad et Ayoub. Et les collègues qui sont dans des bureaux éparpillés, comme Bogdan.


Résumé

Mots clés : reconnaissance des formes, analyse vidéo intelligente, systèmes de transport intelligents, vidéo surveillance intelligente, GPGPU, apprentissage automatique, boosting, LibAdaBoost, soustraction du fond adaptative, projet PUVAME.

Ce travail de thèse a pour sujet l'étude des algorithmes de vision temps réel pour l'analyse vidéo intelligente. Ces travaux ont eu lieu au laboratoire de robotique de l'Ecole des Mines de Paris (CAOR).

Notre objectif est d'étudier les algorithmes de vision utilisés aux différents niveaux dans une chaîne de traitement vidéo intelligente. On a prototypé une chaîne de traitement générique dédiée à l'analyse du contenu du flux vidéo. En se basant sur cette chaîne de traitement, on a développé une application de détection et de suivi de piétons. Cette application est une partie intégrante du projet PUVAME. Ce projet a pour objectif d'améliorer la sécurité des personnes dans des zones urbaines près des arrêts de bus.

Cette chaîne de traitement générique est composée de plusieurs étapes : détection d'objets, classification d'objets et suivi d'objets. D'autres étapes de plus haut niveau sont envisagées comme la reconnaissance d'actions, l'identification, la description sémantique ainsi que la fusion des données de plusieurs caméras. On s'est intéressé aux deux premières étapes. On a exploré des algorithmes de segmentation du fond dans un flux vidéo avec caméra fixe. On a implémenté et comparé des algorithmes basés sur la modélisation adaptative du fond.

On a aussi exploré la détection visuelle d'objets basée sur l'apprentissage automatique en utilisant la technique du boosting. Pour cela, on a développé une librairie intitulée LibAdaBoost qui servira comme un environnement de prototypage d'algorithmes d'apprentissage automatique. On a prototypé la technique du boosting au sein de cette librairie. On a distribué LibAdaBoost sous la licence LGPL. Cette librairie est unique par les fonctionnalités qu'elle offre.

On a exploré l'utilisation des cartes graphiques pour l'accélération des algorithmes de vision. On a effectué le portage du détecteur visuel d'objets basé sur un classifieur généré par le boosting pour qu'il s'exécute sur le processeur graphique. On était les premiers à effectuer ce portage. On a trouvé que l'architecture du processeur graphique est la mieux adaptée pour ce genre d'algorithmes.

La chaîne de traitement a été implémentée et intégrée à l'environnement RTMaps. On a évalué ces algorithmes sur des scénarios bien définis. Ces scénarios ont été définis dans le cadre de PUVAME.


Abstract

Keywords: intelligent video analysis, intelligent transportation systems, intelligent video surveillance, GPGPU, boosting, LibAdaBoost, adaptive background subtraction, PUVAME project.

In this dissertation, we present our research work, carried out at the Center of Robotics (CAOR) of the Ecole des Mines de Paris, which tackles the problem of intelligent video analysis.

The primary objective of our research is to prototype a generic framework for intelligent video analysis. We optimized this framework and configured it to cope with specific application requirements. We consider a people tracker application extracted from the PUVAME project. This application aims to improve the safety of people in urban zones near bus stations. We then improved the generic framework for video analysis, mainly its background subtraction and visual object detection stages.

We developed a library for machine learning, specialized in boosting for visual object detection, called LibAdaBoost. To the best of our knowledge, LibAdaBoost is the first library of its kind. We make LibAdaBoost available to the machine learning community under the LGPL license.

Finally, we adapted the boosting-based visual object detection algorithm so that it could run on graphics hardware. To the best of our knowledge, we were the first to implement visual object detection with the sliding-window technique on graphics hardware. The results were promising, and the prototype performed three to nine times faster than the CPU.

The framework was successfully implemented and integrated into the RTMaps environment. It was evaluated at the final session of the PUVAME project and demonstrated its reliability over various test scenarios elaborated specifically for that project.


Contents

I   Introduction and state of the art

1   French Introduction
    1.1  Algorithmes
    1.2  Architecture
    1.3  Application

2   Introduction
    2.1  Contributions
    2.2  Outline

3   Intelligent video surveillance systems (IVSS)
    3.1  What is IVSS?
         3.1.1  Introduction
         3.1.2  Historical background
    3.2  Applications of intelligent surveillance
         3.2.1  Real time alarms
         3.2.2  User defined alarms
         3.2.3  Automatic unusual activity alarms
         3.2.4  Automatic Forensic Video Retrieval (AFVR)
         3.2.5  Situation awareness
    3.3  Scenarios and examples
         3.3.1  Public and commercial security
         3.3.2  Smart video data mining
         3.3.3  Law enforcement
         3.3.4  Military security
    3.4  Challenges
         3.4.1  Technical aspect
         3.4.2  Performance evaluation
    3.5  Choice of implementation methods
    3.6  Conclusion

II  Algorithms

4   Generic framework for intelligent visual surveillance
    4.1  Object detection
    4.2  Object classification
    4.3  Object tracking
    4.4  Action recognition
    4.5  Semantic description
    4.6  Personal identification
    4.7  Fusion of data from multiple cameras

5   Moving object detection
    5.1  Challenge of detection
    5.2  Object detection system diagram
         5.2.1  Foreground detection
         5.2.2  Pixel level post-processing (Noise removal)
         5.2.3  Detecting connected components
         5.2.4  Region level post-processing
         5.2.5  Extracting object features
    5.3  Adaptive background differencing
         5.3.1  Basic Background Subtraction (BBS)
         5.3.2  W4 method
         5.3.3  Single Gaussian Model (SGM)
         5.3.4  Mixture Gaussian Model (MGM)
         5.3.5  Lehigh Omni-directional Tracking System (LOTS)
    5.4  Shadow and light change detection
         5.4.1  Methods and implementation
    5.5  High level feedback to improve detection methods
         5.5.1  The modular approach
    5.6  Performance evaluation
         5.6.1  Ground truth generation
         5.6.2  Datasets
         5.6.3  Evaluation metrics
         5.6.4  Experimental results
         5.6.5  Comments on the results

6   Machine learning for visual object detection
    6.1  Introduction
    6.2  The theory of boosting
         6.2.1  Conventions and definitions
         6.2.2  Boosting algorithms
         6.2.3  AdaBoost
         6.2.4  Weak classifier
         6.2.5  Weak learner
    6.3  Visual domain
         6.3.1  Static detector
         6.3.2  Dynamic detector
         6.3.3  Weak classifiers
         6.3.4  Genetic weak learner interface
         6.3.5  Cascade of classifiers
         6.3.6  Visual finder
    6.4  LibAdaBoost: Library for Adaptive Boosting
         6.4.1  Introduction
         6.4.2  LibAdaBoost functional overview
         6.4.3  LibAdaBoost architectural overview
         6.4.4  LibAdaBoost content overview
         6.4.5  Comparison to previous work
    6.5  Use cases
         6.5.1  Car detection
         6.5.2  Face detection
         6.5.3  People detection
    6.6  Conclusion

7   Object tracking
    7.1  Initialization
    7.2  Sequential Monte Carlo tracking
    7.3  State dynamics
    7.4  Color distribution Model
    7.5  Results
    7.6  Incorporating Adaboost in the Tracker
         7.6.1  Experimental results

III Architecture

8   General purpose computation on the GPU
    8.1  Introduction
    8.2  Why GPGPU?
         8.2.1  Computational power
         8.2.2  Data bandwidth
         8.2.3  Cost/Performance ratio
    8.3  GPGPU's first generation
         8.3.1  Overview
         8.3.2  Graphics pipeline
         8.3.3  Programming language
         8.3.4  Streaming model of computation
         8.3.5  Programmable graphics processor abstractions
    8.4  GPGPU's second generation
         8.4.1  Programming model
         8.4.2  Application programming interface (API)
    8.5  Conclusion

9   Mapping algorithms to GPU
    9.1  Introduction
    9.2  Mapping visual object detection to GPU
    9.3  Hardware constraints
    9.4  Code generator
    9.5  Performance analysis
         9.5.1  Cascade Stages Face Detector (CSFD)
         9.5.2  Single Layer Face Detector (SLFD)
    9.6  Conclusion

IV  Application

10  Application: PUVAME
    10.1  Introduction
    10.2  PUVAME overview
    10.3  Accident analysis and scenarios
    10.4  ParkNav platform
          10.4.1  The ParkView platform
          10.4.2  The CyCab vehicule
    10.5  Architecture of the system
          10.5.1  Interpretation of sensor data relative to the intersection
          10.5.2  Interpretation of sensor data relative to the vehicule
          10.5.3  Collision Risk Estimation
          10.5.4  Warning interface
    10.6  Experimental results

V   Conclusion and future work

11  Conclusion
    11.1  Overview
    11.2  Future work

12  French conclusion

VI  Appendices

A   Hello World GPGPU
B   Hello World Brook
C   Hello World CUDA

Part I Introduction and state of the art

Chapter 1

French Introduction

Contents
    1.1  Algorithmes
    1.2  Architecture
    1.3  Application

Pour accentuer la bonne lisibilité de cette thèse, et pour permettre au lecteur de se créer une vision globale du contenu avant même d'entamer la lecture, nous voudrions brièvement exposer ici les points clés de la thèse, accompagnés de quelques remarques. Ces quelques lignes vont, comme nous le croyons, servir à une meilleure orientation dans l'ensemble de ce traité.

Cette thèse étudie les algorithmes de vision temps réel pour l'analyse vidéo intelligente. Ces travaux ont eu lieu au laboratoire de robotique de l'Ecole des Mines de Paris (CAOR). Notre objectif est d'étudier les algorithmes de vision utilisés aux différents niveaux dans une chaîne de traitement vidéo intelligente. On a prototypé une chaîne de traitement générique dédiée à l'analyse du contenu du flux vidéo. En se basant sur cette chaîne de traitement, on a développé une application de détection et de suivi de piétons. Cette application est une partie intégrante du projet PUVAME (Protection des Vulnérables par Alarmes ou Manoeuvres). Ce projet a pour objectif d'améliorer la sécurité des personnes dans des zones urbaines près des arrêts de bus.

Cette chaîne de traitement générique est composée de plusieurs étapes : détection d'objets, classification d'objets et suivi d'objets. D'autres étapes de plus haut niveau sont envisagées comme la reconnaissance d'actions, l'identification, la description sémantique ainsi que la fusion des données de plusieurs caméras. On s'est intéressé aux deux premières étapes. On a exploré des algorithmes de segmentation du fond dans un flux vidéo avec caméra fixe. On a implémenté et comparé des algorithmes basés sur la modélisation adaptative du fond. On a aussi exploré la détection visuelle d'objets basée sur l'apprentissage automatique en utilisant la technique du boosting. Pour cela, on a développé une librairie intitulée LibAdaBoost qui servira comme un environnement de prototypage d'algorithmes d'apprentissage automatique. On a prototypé la technique du boosting au sein de cette librairie. On a distribué LibAdaBoost sous la licence LGPL. Cette librairie est unique par les fonctionnalités qu'elle offre.

On a exploré l'utilisation des cartes graphiques pour l'accélération des algorithmes de vision. On a effectué le portage du détecteur visuel d'objets basé sur un classifieur généré par le boosting pour qu'il s'exécute sur le processeur graphique. On était les premiers à effectuer ce portage. On a trouvé que l'architecture du processeur graphique est la mieux adaptée pour ce genre d'algorithmes. La chaîne de traitement a été implémentée et intégrée à l'environnement RTMaps. On a évalué ces algorithmes sur des scénarios bien définis. Ces scénarios ont été définis dans le cadre de PUVAME.

Le document est divisé en quatre parties principales, suivies d'une conclusion générale et des annexes :

1. Nous commençons dans le Chapitre 2 par une introduction en anglais du manuscrit. Le Chapitre 3 présente une discussion générale sur les systèmes d'analyse vidéo intelligente pour la vidéo surveillance.

2. Dans le Chapitre 4, on décrit une chaîne de traitement générique pour l'analyse vidéo intelligente. Cette chaîne de traitement pourra être utilisée pour prototyper des applications de vidéo surveillance intelligente. On détaille ensuite certaines parties de cette chaîne de traitement dans les chapitres qui suivent : on discute dans le Chapitre 5 des différentes implémentations de l'algorithme de modélisation adaptative du fond pour la segmentation des objets en mouvement dans une scène dynamique avec caméra fixe. Dans le Chapitre 6, on décrit une technique d'apprentissage automatique nommée boosting, utilisée pour la détection visuelle d'objets. Le suivi d'objets est introduit dans le Chapitre 7.

3. Dans le Chapitre 8, on présente la programmation par flux de données, ainsi que l'architecture des processeurs graphiques. La modélisation en utilisant le modèle par flux de données est décrite dans le Chapitre 9.

4. On décrit le projet PUVAME dans le Chapitre 10.

5. Finalement, on décrit dans le Chapitre 11 la conclusion de nos travaux et notre vision sur le futur travail à mener. Une conclusion en français est fournie dans le Chapitre 12.

1.1  Algorithmes

Cette partie est consacrée, en cinq chapitres, aux algorithmes destinés à être constitutifs d'une plateforme générique pour la vidéo surveillance. L'organisation opérationnelle d'une plateforme logicielle pour la vidéo surveillance est d'abord présentée dans un chapitre. La détection d'objets en mouvement est ensuite traitée dans le chapitre suivant avec des méthodes statistiques et d'autres non linéaires. Des algorithmes sont présentés, basés sur une approche adaptative d'extraction de l'arrière-plan. La détection d'objets est dépendante de la scène et doit prendre en compte l'ombrage et le changement de luminosité. La question des fausses alarmes se pose pour renforcer la robustesse. À cet effet, deux approches sont traitées : l'approche statistique non paramétrique et l'approche déterministe. L'évaluation des algorithmes considérés se base sur des métriques prises en compte dans deux cas de scènes : un cas interne et un cas externe.

L'apprentissage pour la détection d'objets est considéré dans l'avant-dernier chapitre. Des algorithmes de boosting sont introduits, notamment AdaBoost avec un algorithme génétique pour l'apprentissage faible. Un ensemble de classifieurs faibles est étudié et implémenté, allant des classifieurs rectangulaires de Viola et Jones aux classifieurs de points de contrôle et d'autres. On présente la librairie logicielle LibAdaBoost. LibAdaBoost constitue un environnement unifié pour le boosting. Cet environnement est ouvert et exploitable par ceux ayant à valider des algorithmes d'apprentissage.

Le dernier chapitre traite du tracking d'objets avec une méthode de Monte Carlo. La méthode est robuste sous différentes situations d'illumination et est renforcée par l'intégration du détecteur visuel par boosting.

1.2  Architecture

Cette partie, en deux chapitres, propose une application temps réel de la détection de visages avec une implémentation spécifique pour l'accélération des algorithmes sur processeur graphique. L'accent est mis sur l'implémentation GPU, travail de recherche en plein développement, notamment pour paralléliser et accélérer des algorithmes généraux sur un processeur graphique. Après une description succincte et pédagogique de l'architecture du GPU, de son modèle de programmation et des logiciels de programmation des GPU, on décrit en détail l'implémentation, les optimisations et les différentes stratégies possibles de partitionnement de l'application entre CPU et GPU. La solution adoptée est d'implanter le classifieur sur le GPU, les autres étapes de l'application étant programmées sur le CPU. Les résultats obtenus sont très prometteurs et permettent d'accélérer d'un facteur 10.

1.3  Application

Cette partie applicative est consacrée à une application des travaux de la thèse dans le cadre du projet PUVAME, qui vise à concevoir une alarme de bord pour alerter les chauffeurs de bus, en cas de risque de collision avec des piétons, aux abords des stations d'autobus et aux intersections.

Chapter 2

Introduction

Contents
    2.1  Contributions
    2.2  Outline

While researching the materials presented in this thesis I have had the pleasure of covering a wider range of the discipline of Computer Science than most graduate students. My research began with the European project CAMELLIA (Core for Ambient and Mobile intELLigent Imaging Applications). CAMELLIA aims to prototype a smart imaging core dedicated to the mobile and automotive domains. I was in charge of the implementation of image stabilization and motion estimation algorithms [SAG03a], as well as the design of a motion estimator core [SAG03b]. I was soon drawn into studying the details of data-parallel computing on stream architectures, and more specifically on graphics hardware. I also focused on intelligent video analysis with machine learning tools.

In this dissertation, we are primarily concerned with studying an intelligent video analysis framework. We focused on object detection and background segmentation. We explored the use of graphics hardware for accelerating computer vision algorithms as a low-cost solution. We evaluated our algorithms in several contexts: video surveillance and urban road safety.

2.1  Contributions

This dissertation makes several contributions to the areas of computer vision and general-purpose computation on graphics hardware:

• We have evaluated several implementations of state-of-the-art adaptive background subtraction algorithms. We have compared their performance and tried to improve these algorithms using high-level feedback.

• We present boosting to the computer vision community as a tool for building visual detectors. The pedagogic presentation is based on boosting as a machine learning framework: we start from abstract machine learning concepts and develop the computer vision concepts for features, finders and others.

• We have also developed a software library called LibAdaBoost. This library is a machine learning library extended with a boosting framework, and the boosting framework is extended to support visual objects. To the best of our knowledge, this is the first library specialized in boosting which provides both an SDK for the development of new algorithms and a user toolkit for launching learnings and validations.

• We present a streaming formulation for a visual object detector based on visual features. Streaming is a natural way to express low-level and medium-level computer vision algorithms, and modern high-performance computing hardware is well suited for the stream programming model. The stream programming model helps to organize the visual object detector computation optimally for execution on graphics hardware.

• We present an integration of an intelligent video analysis block within a larger framework for urban area safety.

Efficient implementation of low-level computer vision algorithms is an active research area in the computer vision community, and is beyond the scope of this thesis. Our analysis started from a general framework for intelligent video analysis. We factorized a set of algorithms at several levels: pixel level, object level and frame level. We focused on pixel-level algorithms and integrated these algorithms into a generic framework, in which we developed state-of-the-art approaches to instantiate the different stages.

2.2  Outline

We begin in Chapter 3 with a background discussion of intelligent video surveillance systems. In Chapter 4 we describe a generic framework for intelligent video analysis that can be used in a video surveillance context. We then detail some stages of this generic framework in the following chapters: Chapter 5 discusses different implementations of adaptive background modeling algorithms for moving object detection, Chapter 6 describes machine learning for visual object detection with boosting, and Chapter 7 discusses object tracking. In Chapter 8 we present the stream programming model and programmable graphics hardware. The streaming formulation and implementation of visual object detection is described in Chapter 9.


We describe the PUVAME (Protection des Vulnérables par Alarmes ou Manoeuvres) application in Chapter 10. Finally, we suggest areas for future research, reiterate our contributions, and conclude this dissertation in Chapter 11.

This thesis is a testament to the multi-faceted nature of computer science and the many and varied links within this relatively new discipline. I hope you will enjoy reading it as much as I enjoyed the research which it describes.


Chapter 3

Intelligent video surveillance systems (IVSS)

Contents
    3.1  What is IVSS?
         3.1.1  Introduction
         3.1.2  Historical background
    3.2  Applications of intelligent surveillance
         3.2.1  Real time alarms
         3.2.2  User defined alarms
         3.2.3  Automatic unusual activity alarms
         3.2.4  Automatic Forensic Video Retrieval (AFVR)
         3.2.5  Situation awareness
    3.3  Scenarios and examples
         3.3.1  Public and commercial security
         3.3.2  Smart video data mining
         3.3.3  Law enforcement
         3.3.4  Military security
    3.4  Challenges
         3.4.1  Technical aspect
         3.4.2  Performance evaluation
    3.5  Choice of implementation methods
    3.6  Conclusion

3.1  What is IVSS?

3.1.1  Introduction

Intelligent video surveillance addresses the use of automatic video analysis technologies in video surveillance applications. This chapter attempts to answer a number of questions about intelligent surveillance: What are the applications of intelligent surveillance? What are the system architectures? What are the key technologies? And what are the key technical challenges?

3.1.2  Historical background

Advances in computing power, the availability of large-capacity storage devices and high-speed network infrastructure have made inexpensive multi-sensor video surveillance systems possible. Traditionally, the video outputs are processed online by human operators and are usually saved to tapes for later use only after a forensic event. The increase in the number of cameras in ordinary surveillance systems overloaded both the operators and the storage devices and made it impossible to ensure proper monitoring of sensitive areas for long periods. In order to filter out redundant information and increase the response time to forensic events, helping the human operators by detecting important events in video with smart video surveillance systems has become an important requirement. In the following sections we group video surveillance systems into three generations, as in [Ded04].

• First generation surveillance systems (1GSS, 1960-1980) were based on analog subsystems for image acquisition, transmission and processing. They extended the human eye in a spatial sense by transmitting the outputs of several cameras monitoring a set of sites to the displays in a central control room. They had major drawbacks: they required high bandwidth, archiving and retrieval of events was difficult due to the large number of video tapes required, and online event detection depended only on human operators with limited attention spans.

• Second generation surveillance systems (2GSS, 1980-2000) were hybrids in the sense that they used both analog and digital subsystems to resolve some drawbacks of their predecessors. They made use of the early advances in digital video processing methods that provide assistance to the human operators by filtering out spurious events. Most of the work during 2GSS focused on real-time event detection.

• Third generation surveillance systems (3GSS, 2000-present) provide end-to-end digital systems. Image acquisition and processing at the sensor level, communication through mobile and fixed heterogeneous broadband networks, and image storage at central servers benefit from low-cost digital infrastructure. Unlike previous generations, in 3GSS some part of the image processing is distributed toward the sensor level by the use of intelligent cameras that are able to digitize and compress acquired analog image signals and perform image analysis algorithms.

Third generation surveillance systems assist human operators with online alarm generation defined on complex events and with offline inspection as well. 3GSS also handle distributed storage and content-based retrieval of video data. They can be used both to provide the human operator with high-level data, helping him make decisions more accurately and in a shorter time, and to index and search stored video data offline effectively. Advances in the development of these algorithms would lead to breakthroughs in applications that use visual surveillance.

3.2  Applications of intelligent surveillance

In this section we describe a few applications of smart surveillance technology [HBC+03]. We group the applications into three broad categories: real time alarms, automatic forensic video retrieval, and situation awareness.

3.2.1  Real time alarms

A smart surveillance system can generate two types of alerts: user-defined alarms and automatic unusual activity alarms.

3.2.2  User defined alarms

In this type the system detects a variety of user-defined events that occur in the monitored space and notifies the user in real time, so it is up to the user to evaluate the situation and to take preventive actions. Some typical events are presented below (a minimal sketch of a zone-based motion alarm is given at the end of this subsection).

Generic alarms. These alerts depend on the movement properties of objects in the monitored space. Following are a few common examples:

• Motion Detection: this alarm detects movement of any object within a specified zone.

• Motion Characteristic Detection: these alarms detect a variety of motion properties of objects, including a specific direction of object movement (entry through an exit lane) and object velocity bounds checking (an object moving too fast).

• Abandoned Object alarm: this detects objects which are abandoned, e.g., a piece of unattended baggage in an airport, or a car parked in a loading zone.

• Object Removal: this detects movements of a user-specified object that is not expected to move, for example, a painting in a museum.

Class specific alarms. These alarms use the type of object in addition to its movement properties. As examples:

• Type Specific Movement Detection: consider a camera that is monitoring runways at an airport. In such a scene, the system could provide an alert on the presence or movement of people on the tarmac but not of aircraft.

• Statistics: example applications include alarms based on people counts (e.g., more than one person in a security locker) or people densities (e.g., a discotheque crowded beyond an acceptable level).

Behavioral alarms. These detect adherence to, or deviation from, learnt models of motion patterns. Such models are based on training and analyzing movement patterns over extended periods of time. These alarms are used in specific applications and use a significant amount of context information, for example:

• Detecting shopping groups at retail checkout counters, and alerting the store manager when the length of the queue at a counter exceeds a specified number.

• Detecting suspicious behavior in parking lots, for example, a person stopping and trying to open multiple cars.

High Value Video Capture. This is an application which augments real time alarms by capturing selected clips of video based on pre-specified criteria. This becomes highly relevant in the context of smart camera networks, which use wireless communication.
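As an illustration of the generic alarms above, the sketch below evaluates a zone-based motion alarm on a binary foreground mask. It is a minimal example written in Python: the zone format, the pixel-count threshold and the function name are assumptions of this sketch, not part of any system described in this chapter.

import numpy as np

def zone_motion_alarm(foreground_mask: np.ndarray,
                      zone: tuple,
                      min_active_pixels: int = 50) -> bool:
    """Raise a motion alarm when enough foreground pixels fall inside a zone.

    foreground_mask: binary image (0 = background, 1 = foreground).
    zone: (x, y, width, height) rectangle defining the monitored area.
    min_active_pixels: sensitivity threshold for triggering the alarm.
    """
    x, y, w, h = zone
    region = foreground_mask[y:y + h, x:x + w]
    return int(region.sum()) >= min_active_pixels

# Hypothetical usage: a 480x640 mask with activity near the top-left corner.
mask = np.zeros((480, 640), dtype=np.uint8)
mask[40:80, 50:90] = 1
print(zone_motion_alarm(mask, zone=(30, 30, 100, 100)))  # True

The same structure extends naturally to the other generic alarms, e.g. an abandoned-object alarm would additionally require the region to stay active over a minimum number of frames.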

3.2.3  Automatic unusual activity alarms

Unlike the user-defined alarms, here the system generates alerts when it detects activity that deviates from the norm. The smart surveillance system achieves this by learning normal activity patterns. For example, when monitoring a street the system learns that vehicles move about on the road and people move about on the sidewalk. Based on this pattern, the system will provide an alarm when a car drives on the sidewalk. Such unusual activity detection is the key to effective smart surveillance, as the user cannot manually specify all the events of interest.

3.2.4  Automatic Forensic Video Retrieval (AFVR)

The capability to support forensic video retrieval is based on the rich video index generated by automatic tracking technology. This is a critical value-add of smart surveillance technologies. Typically the index consists of measurements such as object shape, size and appearance information, temporal trajectories of objects over time, object type information, and in some cases specific object identification information. In advanced systems, the index may contain object activity information.

During a major incident, investigative agencies may have access to hundreds of hours of video surveillance footage drawn from a wide variety of surveillance cameras covering the areas in the vicinity of the incident. The task of manually sifting through hundreds of hours of video for investigative purposes is almost impossible. However, if the collection of videos were indexed using visual analysis, it would enable the following ways of retrieving the video (see the sketch after this list):

• Spatio-Temporal Video Retrieval: an example query in this class would be "Retrieve all clips of video where a blue car drove in front of the 7/11 store on 23rd Street between 2pm on the 26th of July and 9am on the 27th of July at speeds greater than 25 mph."
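The sketch below illustrates how such a spatio-temporal query could be answered from a track index. The record fields (object type, dominant color, zone label, timestamp, speed) and the query interface are assumptions made for illustration; real AFVR indexes are considerably richer.

from dataclasses import dataclass

@dataclass
class TrackRecord:
    """One indexed observation produced by the tracking stage (illustrative schema)."""
    clip_id: str
    object_type: str      # e.g. "car", "person"
    dominant_color: str   # e.g. "blue"
    zone: str             # semantic location label, e.g. "store_front"
    timestamp: float      # seconds since some epoch
    speed_mph: float

def spatio_temporal_query(index, object_type, color, zone, t_start, t_end, min_speed):
    """Return the clip ids matching a query such as the blue-car example above."""
    return sorted({
        r.clip_id for r in index
        if r.object_type == object_type
        and r.dominant_color == color
        and r.zone == zone
        and t_start <= r.timestamp <= t_end
        and r.speed_mph > min_speed
    })

# Hypothetical index with two records, only the first of which matches.
index = [
    TrackRecord("clip_042", "car", "blue", "store_front", 1000.0, 31.0),
    TrackRecord("clip_043", "car", "red", "store_front", 1200.0, 28.0),
]
print(spatio_temporal_query(index, "car", "blue", "store_front", 0.0, 2000.0, 25.0))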

3.2.5  Situation awareness

Ensuring total security at a facility requires systems that continuously track the identity, location and activity of objects within the monitored space. Typically, surveillance systems have focused on tracking location and activity, while biometrics systems have focused on identifying individuals. As smart surveillance technologies mature, it becomes possible to address all three key challenges in a single unified framework, giving rise to joint location, identity and activity awareness, which, when combined with the application context, becomes the basis for situation awareness.

3.3  Scenarios and examples

Below are some scenarios that smart surveillance systems and algorithms might handle:

3.3.1  Public and commercial security

• Monitoring of banks, department stores, airports, museums, stations, private properties and parking lots for crime prevention and detection
• Patrolling of highways and railways for accident detection
• Surveillance of properties and forests for fire detection
• Observation of the activities of elderly and infirm people for early alarms and measuring effectiveness of medical treatments
• Access control

3.3.2  Smart video data mining

• Measuring traffic flow, pedestrian congestion and athletic performance
• Compiling consumer demographics in shopping centers and amusement parks
• Extracting statistics from sport activities
• Counting endangered species
• Logging routine maintenance tasks at nuclear and industrial facilities
• Artistic performance evaluation and self learning

3.3.3  Law enforcement

• Measuring speed of vehicles
• Detecting red light crossings and unnecessary lane occupation

3.3.4  Military security

• Patrolling national borders
• Measuring flow of refugees
• Monitoring peace treaties
• Providing secure regions around bases
• Assisting battlefield command and control

3.4  Challenges

Probably the most important types of challenges in the future development of smart video surveillance systems are:

3.4.1  Technical aspect

A large number of technical challenges exist, especially in visual analysis technologies. These include robust object detection, tracking objects in crowded environments, tracking articulated bodies for activity understanding, and combining biometric technologies such as face recognition with surveillance to generate alerts.


Figure 3.1: Block diagram of a basic surveillance system architecture

3.4.2  Performance evaluation

Performance evaluation is a very significant challenge in smart surveillance systems. Evaluating video analysis systems requires large amounts of annotated data, and annotation is an expensive and tedious process that can itself introduce significant errors. Together, these issues make performance evaluation a major challenge.

3.5  Choice of implementation methods

A surveillance system depends on the underlying architecture: sensors, computing units and communication media. The basic surveillance system architecture presented in Figure 3.1 is likely to remain the most ubiquitous in the near future. In this architecture, cameras are mounted in different parts of a facility and are all connected to a central control room. The revolution in the camera industry and the improvements in communication media have transformed these architectures and opened the door to new applications (see Figure 3.2).


Figure 3.2: Block diagram of an advanced surveillance system architecture

3.6  Conclusion

In this chapter we presented an overview of intelligent video surveillance systems. Such a system consists of an architecture based on cameras, computing servers and visualization terminals. The surveillance application is built upon this architecture and is adapted to the domain of the monitored scene. In the next chapters we consider a basic intelligent surveillance architecture; the corresponding algorithms are easily transported to more advanced architectures.

Surveillance applications can be described formally regardless of their distribution over the computing units (servers, or embedded processors in the case of smart cameras). We can abstract this notion using a high-level application prototyping system such as RTMaps. An application consists of a diagram of components connected to each other: cameras are represented by source components, servers by processing components, and all visualization and archiving tasks are represented as sink components. Once the application is prototyped, it is easy to implement it on the real architecture and to distribute the processing modules over the different computing units. A minimal sketch of this component abstraction is given below.
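The following Python sketch is a minimal stand-in for the source/processing/sink component diagram described above. It is an illustration only: the class and function names are assumptions of this sketch and do not correspond to the actual RTMaps API.

class Component:
    """Base class for a node in the application diagram (illustrative, not the RTMaps API)."""
    def process(self, data):
        raise NotImplementedError

class CameraSource(Component):
    def __init__(self, frames):
        self.frames = frames          # stand-in for a live camera stream
    def read(self):
        yield from self.frames

class MotionDetector(Component):
    def process(self, frame):
        return {"frame": frame, "objects": ["object_1"]}  # placeholder detection

class Display(Component):
    def process(self, result):
        print(f"frame {result['frame']}: {len(result['objects'])} object(s)")

def run_pipeline(source, processors, sink):
    """Push each frame through the chain of processing components into the sink."""
    for frame in source.read():
        data = frame
        for p in processors:
            data = p.process(data)
        sink.process(data)

run_pipeline(CameraSource(frames=[0, 1, 2]), [MotionDetector()], Display())

In a distributed deployment, each component of such a diagram can be mapped to a different computing unit without changing the application description.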

Part II Algorithms

Chapter 4

Generic framework for intelligent visual surveillance

Contents
    4.1  Object detection
    4.2  Object classification
    4.3  Object tracking
    4.4  Action recognition
    4.5  Semantic description
    4.6  Personal identification
    4.7  Fusion of data from multiple cameras

The primary objective of this chapter is to give a general overview of the overall process of an intelligent visual surveillance system. Figure 4.1 shows the general framework of visual surveillance in dynamic scenes. The prerequisites for effective automatic surveillance using a single camera include the following stages: object detection, object classification, object tracking, object identification, action recognition and semantic description. Although some stages require interchange of information with other stages, this generic framework provides a good structure for the discussion throughout this chapter. In order to extend the surveillance area and overcome occlusion, fusion of data from multiple cameras is needed; this fusion can involve all the above stages.

This chapter is organized as follows: section 4.1 describes the object detection stage. Section 4.2 introduces the object classification stage. Object tracking is presented in section 4.3. Section 4.4 presents the action recognition stage, and section 4.5 the semantic description stage. Section 4.6 describes the personal identification stage. Section 4.7 describes the fusion of data from multiple cameras.


Figure 4.1: A generic framework for intelligent visual surveillance

4.1  Object detection

The first step in nearly every visual surveillance system is moving object detection. Object detection aims at segmenting regions corresponding to moving objects from the rest of the image. Subsequent processes such as classification and tracking are greatly dependent on it. The process of object detection usually involves background modeling and motion segmentation, which intersect each other during processing. Due to dynamic changes in natural scenes, such as sudden illumination and weather changes or repetitive motions that cause clutter, object detection is a difficult problem to handle reliably.

4.2  Object classification

In real-world scenes, different moving objects may correspond to different moving targets. For instance, the image sequences captured by surveillance cameras mounted in outdoor parking scenes probably include humans, vehicles and other moving objects such as flying birds and moving clouds. It is very important to recognize the type of a detected object in order to track it reliably and analyze its activities correctly. Object classification can be considered as a standard pattern recognition issue. At present, there are two major categories of approaches toward moving object classification: shape-based and motion-based methods [WHT03]. Shape-based classification makes use of the objects' 2D spatial information, whereas motion-based classification uses temporally tracked features of objects for the classification solution.
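As a toy illustration of shape-based classification, the sketch below uses two classical shape cues, the bounding-box aspect ratio and a dispersedness measure (perimeter squared over area). The thresholds and labels are illustrative assumptions, not values used in this thesis.

def classify_blob(width: float, height: float, area: float, perimeter: float) -> str:
    """Toy shape-based classifier: people tend to be tall and dispersed,
    vehicles wide and compact. Thresholds are illustrative only."""
    aspect_ratio = height / max(width, 1e-6)
    dispersedness = (perimeter ** 2) / max(area, 1e-6)
    if aspect_ratio > 1.5 and dispersedness > 60.0:
        return "human"
    if aspect_ratio < 1.0:
        return "vehicle"
    return "other"

print(classify_blob(width=30, height=80, area=1200, perimeter=300))   # human
print(classify_blob(width=120, height=60, area=5000, perimeter=350))  # vehicle

A motion-based classifier would instead accumulate temporal cues such as periodicity of the tracked silhouette over several frames.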

4.3  Object tracking

After object detection, visual surveillance systems generally track moving objects from one frame to another in an image sequence. The objective of tracking is to establish the correspondence of objects and object parts between consecutive frames of the video stream. The tracking algorithms usually have considerable intersection with lower stages (object detection) and higher stages (action recognition) during processing: they provide cohesive temporal data about moving objects which are used both to enhance lower-level processing such as object detection and to enable higher-level data extraction such as action recognition.

Tracking in video can be categorized according to the needs of the applications it is used in or according to the methods used for its solution. For instance, in the case of human tracking, whole-body tracking is generally adequate for outdoor video surveillance, whereas tracking of body parts is necessary for some indoor surveillance and higher-level action recognition applications. There are two common approaches to tracking objects as a whole [Ame03]: one is based on correspondence matching, the other carries out explicit tracking by making use of position prediction or motion estimation; a minimal sketch of these two ingredients follows. Useful mathematical tools for tracking include the Kalman filter, the Condensation algorithm, dynamic Bayesian networks, the geodesic method, etc. Tracking methods are divided into four major categories: region-based tracking, active-contour-based tracking, feature-based tracking and model-based tracking. It should be pointed out that this classification is not absolute, in that algorithms from different categories can be integrated together.
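The sketch below illustrates the two generic ingredients mentioned above, position prediction and correspondence matching, using a constant-velocity predictor and greedy nearest-neighbour association. It is a minimal illustration; the gating distance and data layout are assumptions of this sketch, and real trackers (Kalman filter, Condensation, etc.) are far more elaborate.

import math

def predict(track):
    """Constant-velocity prediction of the next position of a track."""
    (x, y), (vx, vy) = track["pos"], track["vel"]
    return (x + vx, y + vy)

def associate(tracks, detections, gate=30.0):
    """Greedy nearest-neighbour matching between predicted tracks and new detections."""
    matches = {}
    free = list(range(len(detections)))
    for tid, track in tracks.items():
        if not free:
            break
        px, py = predict(track)
        best = min(free, key=lambda j: math.hypot(detections[j][0] - px,
                                                  detections[j][1] - py))
        if math.hypot(detections[best][0] - px, detections[best][1] - py) <= gate:
            matches[tid] = best
            free.remove(best)
    return matches  # track id -> detection index

tracks = {1: {"pos": (100.0, 50.0), "vel": (5.0, 0.0)}}
detections = [(104.0, 51.0), (300.0, 200.0)]
print(associate(tracks, detections))  # {1: 0}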

4.4  Action recognition

After successfully tracking the moving objects from one frame to another in an image sequence, the problem of understanding objects' behaviors (action recognition) from image sequences follows naturally. Behavior understanding involves the analysis and recognition of motion patterns. Understanding of behaviors may simply be thought of as the classification of time-varying feature data, i.e., matching an unknown test sequence with a group of labeled reference sequences representing typical behaviors. The fundamental problem of behavior understanding is to learn the reference behavior sequences from training samples, and to devise both training and matching methods that cope effectively with small variations of the feature data within each class of motion patterns.

This stage is outside the focus of this thesis. Otherwise, the boosting technique presented in Chapter 6 could be used to learn typical behaviors and to detect unusual behaviors. The boosting technique developed within LibAdaBoost is made generic and could be extended to cover the data model of behaviors, trajectories and other types of data.

4.5  Semantic description

Semantic description aims at describing object behaviors in a natural language suitable for nonspecialist operators of visual surveillance systems. Generally there are two major categories of behavior description methods: statistical models and formalized reasoning. This stage is outside the focus of this thesis.

4.6  Personal identification

The problem of "who is now entering the area under surveillance?" is of increasing importance for visual surveillance. Such personal identification can be treated as a special behavior understanding problem. For instance, human face and gait are now regarded as the main biometric features that can be used for people identification in visual surveillance systems [SEN98]. In recent years, great progress in face recognition has been achieved. The main steps in face recognition for visual surveillance are face detection, face tracking, face feature detection and face recognition. Usually, these steps are studied separately. Therefore, developing an integrated face recognition system involving all of the above steps is critical for visual surveillance. This stage is outside the focus of this thesis.

4.7  Fusion of data from multiple cameras

Motion detection, tracking, behavior understanding, and personal identification discussed above can be realized by single-camera visual surveillance systems. Multi-camera visual surveillance systems can be extremely helpful because the surveillance area is expanded and multiple-view information can overcome occlusion. Tracking with a single camera easily generates ambiguity due to occlusion or depth; this ambiguity may be eliminated from another view. However, visual surveillance using multiple cameras also brings problems such as camera installation (how to cover the entire scene with the minimum number of cameras), camera calibration, object matching, automated camera switching and data fusion.

Chapter 5

Moving object detection

Contents
    5.1  Challenge of detection
    5.2  Object detection system diagram
         5.2.1  Foreground detection
         5.2.2  Pixel level post-processing (Noise removal)
         5.2.3  Detecting connected components
         5.2.4  Region level post-processing
         5.2.5  Extracting object features
    5.3  Adaptive background differencing
         5.3.1  Basic Background Subtraction (BBS)
         5.3.2  W4 method
         5.3.3  Single Gaussian Model (SGM)
         5.3.4  Mixture Gaussian Model (MGM)
         5.3.5  Lehigh Omni-directional Tracking System (LOTS)
    5.4  Shadow and light change detection
         5.4.1  Methods and implementation
    5.5  High level feedback to improve detection methods
         5.5.1  The modular approach
    5.6  Performance evaluation
         5.6.1  Ground truth generation
         5.6.2  Datasets
         5.6.3  Evaluation metrics
         5.6.4  Experimental results
         5.6.5  Comments on the results


Detection of moving objects, and the segmentation between background and moving objects, is an important step in visual surveillance. Errors made at this abstraction level are very difficult to correct at higher levels. For example, when an object is not detected at the lowest level, it cannot be tracked and classified at the higher levels. The more accurate the segmentation at the lowest level, the better the object shape is known and, consequently, the easier the (shape-based) object recognition tasks become.

Frequently used techniques for moving object detection are background subtraction, statistical methods, temporal differencing and optical flow. Our system adopts a method based on combining different object detection algorithms for background modeling and subtraction, to achieve robust and accurate motion detection and to reduce false alarms. This approach is suitable for both color and infra-red sequences and is based on the assumption of a stationary surveillance camera, which is the case in many surveillance applications.

Figure 5.1 shows an illustration of intermediate results within a moving object detection stage. Each new frame (Figure 5.1a) is used to update the existing background image (Figure 5.1b). A threshold is applied to the difference image between the current frame and the background image, and a binary image is produced (Figure 5.1c) which indicates the areas of activity. Finally, segmentation methods applied on the binary image can extract the moving objects (Figure 5.1d).

This chapter is organized as follows: section 5.1 provides a non-exhaustive list of challenges that any object detection algorithm has to handle. Section 5.2 presents a generic template for an object detection system. In section 5.3 we describe our implementation of state-of-the-art methods for adaptive background differencing. Shadow and light change detection are described in section 5.4. The improvement of these methods using high-level feedback is presented in section 5.5. A comparison of the different approaches on benchmark datasets is presented in section 5.6.

5.1 Challenge of detection

There are several problems that a moving object detection algorithm must solve correctly. A non-exhaustive list of these problems is the following [TKBM99]:

• Moved objects: a background object can be moved. These objects should not be considered part of the foreground forever after.

• Time of day: gradual illumination changes alter the appearance of the background.

• Light switch: sudden changes in illumination and other scene parameters alter the appearance of the background.

• Waving trees: backgrounds can vacillate, requiring models which can represent disjoint sets of pixel values.

• Camouflage: a foreground object's pixel characteristics may be subsumed by the modeled background.

• Bootstrapping: a training period absent of foreground objects is not available in some environments.

• Foreground aperture: when a homogeneously colored object moves, change in the interior pixels cannot be detected. Thus, the entire object may not appear as foreground.

• Sleeping person: a foreground object that becomes motionless cannot be distinguished from a background object that moves and then becomes motionless.

• Waking person: when an object initially in the background moves, both it and the newly revealed parts of the background appear to change.

• Shadows: foreground objects often cast shadows which appear different from the modeled background.

Figure 5.1: Object detection example: (a) input frame, (b) background image model, (c) foreground pixel-image, (d) detected objects

5.2 Object detection system diagram

Figure 5.2: Object detection system diagram

Figure 5.2 shows a generic system diagram for our object detection method. The system starts with a background scene initialization step. The next step detects the foreground pixels by using the background model and the current frame from the video stream. This pixel-level detection method depends on the background model in use, and its output is used to update the background model so that it adapts to dynamic scene changes. The foreground detection step is followed by a pixel-level post-processing stage which applies noise removal filters. Once the enhanced foreground pixel-map is ready, connected components analysis is used to label the connected regions. The labeled regions are then filtered to eliminate small regions caused by environmental noise in the region-level post-processing step. More advanced filters are considered for merging overlapping isolated regions and will be detailed later. In the final step of the detection process, a number of object features are extracted for subsequent use by high level stages.
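To make the data flow of Figure 5.2 concrete, here is a minimal C++ sketch that chains the first stages (foreground differencing, morphological cleanup, background adaptation) on a grayscale frame stored as a flat array. The Frame type, the thresholds and the function names are assumptions made for this example; they are not the interface of the actual system, and the connected components and feature extraction stages of sections 5.2.3-5.2.5 are only indicated by a comment.

#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <vector>

struct Frame { int w = 0, h = 0; std::vector<uint8_t> px; }; // row-major grayscale

// Pixel-level foreground detection (simple absolute difference, BBS-style).
std::vector<uint8_t> foreground_mask(const Frame& f, const Frame& bg, int tau) {
    std::vector<uint8_t> m(f.px.size(), 0);
    for (size_t i = 0; i < f.px.size(); ++i)
        m[i] = std::abs(int(f.px[i]) - int(bg.px[i])) > tau ? 1 : 0;
    return m;
}

// Pixel-level post-processing: one 3x3 erosion or dilation pass.
std::vector<uint8_t> morph(const std::vector<uint8_t>& m, int w, int h, bool erode) {
    std::vector<uint8_t> out(m.size(), 0);
    for (int y = 1; y < h - 1; ++y)
        for (int x = 1; x < w - 1; ++x) {
            int acc = erode ? 1 : 0;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx) {
                    int v = m[(y + dy) * w + (x + dx)];
                    acc = erode ? (acc & v) : (acc | v);
                }
            out[y * w + x] = uint8_t(acc);
        }
    return out;
}

// Background adaptation with a running average (in the spirit of equation 5.5).
void adapt_background(Frame& bg, const Frame& f, float alpha) {
    for (size_t i = 0; i < bg.px.size(); ++i)
        bg.px[i] = uint8_t((1.f - alpha) * bg.px[i] + alpha * f.px[i]);
}

// Orchestration of one frame; blob labeling, region filtering and feature
// extraction (sections 5.2.3 to 5.2.5) would consume the returned mask.
std::vector<uint8_t> process_frame(Frame& bg, const Frame& f) {
    auto mask = foreground_mask(f, bg, /*tau=*/30);
    mask = morph(mask, f.w, f.h, /*erode=*/true);
    mask = morph(mask, f.w, f.h, /*erode=*/false);
    adapt_background(bg, f, 0.05f);
    return mask;
}

int main() {
    Frame bg{8, 8, std::vector<uint8_t>(64, 50)};
    Frame f = bg;
    f.px[3 * 8 + 3] = 200;                      // one isolated "moving" pixel
    auto mask = process_frame(bg, f);
    int count = 0;
    for (uint8_t v : mask) count += v;
    std::printf("foreground pixels: %d\n", count);  // 0: morphology removes isolated noise
}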

5.2.1 Foreground detection

A common approach to detecting the foreground regions is adaptive background differencing, where each video frame is compared against a reference, often called the background image or background model. The major steps in an adaptive background differencing algorithm are background modeling and foreground detection. Background modeling uses the new frame to calculate and update a background model. This background model provides a pixel-wise statistical representation of the entire background scene. Foreground detection then identifies the pixels in the video frame that cannot be adequately explained by the background model, and outputs them as a binary candidate foreground mask.

Several methods for adaptive background differencing have been proposed in the recent literature. All of these methods try to effectively estimate the background model from the temporal sequence of frames. However, there is a wide variety of techniques, and both the expert and the newcomer to this area can be confused about the benefits and limitations of each method. In our video surveillance system we have selected five algorithms for modeling the background:

1. Basic Background Subtraction (BBS).
2. W4 method (W4).
3. Single Gaussian Model (SGM).
4. Mixture Gaussian Model (MGM).
5. Lehigh Omni-directional Tracking System (LOTS).

The first method is a basic background subtraction algorithm (BBS) [HA05]. This is the simplest algorithm and it provides a lower benchmark for the other algorithms, which are more complex but based on the same principle. The second algorithm is denoted W4 and operates on gray scale images. Three parameters are learned for each pixel to model the background: minimum intensity, maximum intensity and maximum absolute difference in consecutive frames. This algorithm incorporates the noise variations into the background model. The third method is used in Pfinder [WATA97] and is denoted here SGM (Single Gaussian Model). This method assumes that each pixel is a realization of a random variable with a Gaussian distribution. The first and second order statistics of this distribution are independently estimated for each pixel.

The fourth method is an adaptive mixture of multiple Gaussians (MGM) as proposed by Stauffer and Grimson in [CG00]. Every pixel of the background is modeled using a mixture of Gaussians. The weights of the mixture and the parameters of the Gaussians are adapted with respect to the current frame. This method has the advantage that multi-modal backgrounds (such as moving trees) can be modeled. Among the tested techniques, this is the one with the most complex background model. The fifth approach (LOTS), proposed by Boult in [BMGE01], is an efficient method designed for military applications that maintains two background models. In addition, the approach uses high and low per-pixel thresholds. The method adapts the background by incorporating the current image with a small weight. At the end of each cycle, pixels are classified as false detection, missed detection or correct detection. The original point of this algorithm is that the per-pixel thresholds are updated as a function of this classification.

5.2.2 Pixel level post-processing (Noise removal)

The outputs of foreground region detection algorithms generally contain noise and therefore are not appropriate for further processing without special post-processing. Various factors cause this noise in foreground detection, such as:

1. Camera noise: this is the noise caused by the camera's image acquisition components. The intensity of a pixel that corresponds to an edge between two differently colored objects in the scene may be set to one of the object colors in one frame and to the other object color in the next frame.

2. Reflectance noise: when a light source (for instance the sun) moves, it makes some parts of the background scene reflect light. This phenomenon makes the foreground detection algorithms fail and detect reflectance as foreground regions.

3. Background colored object noise: some parts of the objects may have the same color as the reference background behind them. This resemblance causes some of the algorithms to detect the corresponding pixels as non-foreground, and objects to be segmented inaccurately.

4. Shadows and sudden illumination change: shadows cast on objects are detected as foreground by most of the detection algorithms. Also, sudden illumination changes (e.g. turning on lights in a monitored room) make the algorithms fail to detect actual foreground objects accurately.

Morphological operations, erosion and dilation [Hei96], are applied to the foreground pixel map in order to remove noise caused by the first three of the items listed above. Our aim in applying these operations is to remove noisy foreground pixels that do not correspond to actual foreground regions, and to remove noisy background pixels near and inside object regions that are actually foreground pixels. Erosion, as its name implies, erodes one-unit thick boundary pixels of foreground regions. Dilation is the reverse of erosion and expands the foreground region boundaries with one-unit thick pixels. The subtle point in applying these morphological filters is deciding on the order and amount of these operations. The order of these operations affects the quality, and the amount affects both the quality and the computational complexity of noise removal. Removal of shadow regions and detecting and adapting to sudden illumination changes require more advanced methods which are explained in section 5.4.

5.2.3 Detecting connected components

After detecting foreground regions and applying post-processing operations to remove noise and shadow regions, the filtered foreground pixels are grouped into connected regions (blobs) and labeled by using a two-level connected component labeling algorithm presented in [Hei96]. After finding individual blobs that correspond to objects, the bounding boxes of these regions are calculated.

5.2.4 Region level post-processing

Even after removing pixel-level noise, some small artificial regions remain due to inaccurate object segmentation. In order to eliminate this type of region, the average region size (γ) in terms of pixels is calculated for each frame, and regions whose size is smaller than a fraction (α) of the average region size (Size(region) < α ∗ γ) are deleted from the foreground pixel map. Also, due to segmentation errors, some parts of the objects are found disconnected from the main body. In order to correct this effect, the bounding boxes of regions that are close to each other are merged together and the region labels are adjusted.

5.2.5 Extracting object features

Once we have segmented regions, we extract features of the corresponding objects from the current image. These features are the size (S), the center of mass (Cm), the color histogram (Hc) and the silhouette contour of the object's blob. Calculating the size of the object is trivial: we just count the number of foreground pixels that are contained in the bounding box of the object. In order to calculate the center of mass point, Cm = (xCm, yCm), of an object O, we use the following equation:

x_{C_m} = \frac{\sum_i x_i}{n}, \qquad y_{C_m} = \frac{\sum_i y_i}{n}    (5.1)

where n is the number of pixels in O. The color histogram Hc is calculated over the monochrome intensity values of the object pixels in the current image. In order to reduce the computational complexity of operations that use Hc, the color values are quantized. Let N be the number of bins in the histogram; then every bin covers 255/N color values. The color histogram is calculated by iterating over the pixels of O and incrementing the stored value of the corresponding color bin in the histogram Hc. So for an object O the color histogram is updated as follows:

H_c\!\left[\frac{c_i}{N}\right] = H_c\!\left[\frac{c_i}{N}\right] + 1, \quad \forall c_i \in O    (5.2)

where c_i represents the color value of the i-th pixel. In the next step the color histogram is normalized to enable appropriate comparison with other histograms in later steps. The normalized histogram \hat{H}_c is calculated as follows:

\hat{H}_c[i] = \frac{H_c[i]}{\sum_i H_c[i]}    (5.3)
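As a concrete illustration of equations 5.1 to 5.3, the sketch below computes the center of mass and a quantized, normalized intensity histogram for one blob given as a list of foreground pixels. The Pixel and BlobFeatures structures are assumptions of the example, and the bin index is computed with a guard so that it always stays inside the histogram.

#include <cstdint>
#include <cstdio>
#include <vector>

struct Pixel { int x, y; uint8_t intensity; };    // one foreground pixel of a blob

struct BlobFeatures {
    double cx = 0, cy = 0;             // center of mass (equation 5.1)
    std::vector<double> hist;          // normalized histogram (equations 5.2 and 5.3)
};

BlobFeatures extract_features(const std::vector<Pixel>& blob, int bins) {
    BlobFeatures f;
    f.hist.assign(bins, 0.0);
    if (blob.empty()) return f;
    for (const Pixel& p : blob) {
        f.cx += p.x;
        f.cy += p.y;
        int b = (p.intensity * bins) / 256;        // quantized bin, always < bins
        f.hist[b] += 1.0;                          // accumulation as in equation 5.2
    }
    f.cx /= blob.size();                           // equation 5.1
    f.cy /= blob.size();
    for (double& h : f.hist) h /= blob.size();     // normalization, equation 5.3
    return f;
}

int main() {
    std::vector<Pixel> blob = {{10, 20, 200}, {11, 20, 210}, {10, 21, 40}};
    BlobFeatures f = extract_features(blob, 16);
    std::printf("center of mass: (%.2f, %.2f)\n", f.cx, f.cy);
}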

5.3 Adaptive background differencing

This section describes the object detection algorithms based on the adaptive background differencing technique used in this work: BBS, W4, SGM, MGM and LOTS. The BBS, SGM and MGM algorithms use color images while W4 and LOTS use gray scale images.

5.3.1 Basic Background Subtraction (BBS)

There are different approaches to this basic scheme of background subtraction in terms of foreground region detection, background maintenance and post-processing. This method detects targets by computing the difference between the current frame and a background image for each color channel (RGB). The implementation can be summarized as follows:

• Background initialization and threshold: motion detection starts by computing a pixel-based absolute difference between each incoming frame It and an adaptive background frame Bt. The pixels are assumed to contain motion if the absolute difference exceeds a predefined threshold level τ. As a result, a binary image is formed where active pixels are labeled with 1 and non-active ones with 0, according to:

|I_t(\phi) - B_t(\phi)| > \tau    (5.4)

The operation in (5.4) is performed for all image pixels φ.

• Filtering: the resulting thresholded image usually contains a significant amount of noise. A morphological filtering step is therefore applied with a 3x3 mask (dilation and erosion) that eliminates isolated pixels.

• Background adaptation: to take into account slow illumination changes, which is necessary to ensure long-term tracking, the background image is subsequently updated by:

B_i^{t+1}(\phi) = \alpha I_i^t(\phi) + (1 - \alpha) B_i^t(\phi)    (5.5)

where α is the learning rate. In our experiments we use α = 0.15 and the threshold τ = 0.2. These parameters stay constant during the experiment. Figure 5.3 shows the results.

Figure 5.3: The BBS results (a) current frame (b)(c) Foreground detection
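A minimal per-pixel sketch of the BBS scheme (equations 5.4 and 5.5) follows. RGB values are assumed to be normalized to [0, 1] so that the thresholds τ = 0.2 and α = 0.15 quoted above apply directly; since the text does not specify how the three channel tests are combined, the any-channel rule used here is an assumption.

#include <cmath>
#include <cstdint>
#include <vector>

// One color frame: interleaved RGB, values normalized to [0, 1].
struct FrameRGB { int w, h; std::vector<float> rgb; };   // size = w * h * 3

// BBS: a pixel is active when the absolute difference exceeds tau on any
// channel (equation 5.4); the background then follows the running average
// of equation 5.5.
std::vector<uint8_t> bbs_step(FrameRGB& bg, const FrameRGB& frame,
                              float tau = 0.2f, float alpha = 0.15f) {
    const int n = frame.w * frame.h;
    std::vector<uint8_t> mask(n, 0);
    for (int p = 0; p < n; ++p) {
        bool active = false;
        for (int c = 0; c < 3; ++c)
            if (std::fabs(frame.rgb[3 * p + c] - bg.rgb[3 * p + c]) > tau)
                active = true;
        mask[p] = active ? 1 : 0;
        for (int c = 0; c < 3; ++c)                      // equation 5.5
            bg.rgb[3 * p + c] = alpha * frame.rgb[3 * p + c]
                              + (1.f - alpha) * bg.rgb[3 * p + c];
    }
    return mask;   // a 3x3 erosion/dilation pass would follow (section 5.2.2)
}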

5.3.2 W4 method

This algorithm was proposed by Haritaoglu in [CG00]. W4 has been designed to work with only monochromatic stationary video sources, either visible or infrared. It is designed for outdoor detection tasks, and particularly for night-time or other low light level situations.

• Background scene modeling: W4 obtains the background model even if there are moving foreground objects in the field of view, such as walking people, moving cars, etc. The background scene is modeled by representing each pixel by three values: the minimum intensity (m), the maximum intensity (M), and the maximum intensity difference (D) between consecutive frames. In our implementation we selected an array V containing N consecutive images, where V^i(φ) is the intensity of a pixel location φ in the i-th image of V. The initial background model for a pixel location, [m(φ), M(φ), D(φ)], is obtained as follows:

\begin{bmatrix} m(\phi) \\ M(\phi) \\ D(\phi) \end{bmatrix} = \begin{bmatrix} \min_i V^i(\phi) \\ \max_i V^i(\phi) \\ \max_i |V^i(\phi) - V^{i-1}(\phi)| \end{bmatrix}    (5.6)

• Classification: each pixel is classified as background or foreground according to (5.7). Given the values of m, M and D, a pixel I(φ) is considered a foreground pixel if:

|m(\phi) - I(\phi)| > D(\phi) \quad \text{or} \quad |M(\phi) - I(\phi)| > D(\phi)    (5.7)

• Filtering: after classification, the resulting image usually contains a significant amount of noise. A region-based noise cleaning algorithm is applied, composed of an erosion operation followed by a connected component analysis that removes regions with less than 50 pixels. The morphological operations dilation and erosion are then applied.

Figure 5.4: The W4 results (a) current frame (b)(c) Foreground detection
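The per-pixel W4 model of equation 5.6 and the classification test of equation 5.7 can be sketched as follows on grayscale frames; the frame layout and the training loop are assumptions of this example.

#include <algorithm>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Per-pixel W4 background model: minimum m, maximum M and maximum
// inter-frame difference D, learnt over N training frames (equation 5.6).
struct W4Model {
    std::vector<uint8_t> m, M, D;

    void train(const std::vector<std::vector<uint8_t>>& frames) {
        const size_t n = frames.front().size();
        m.assign(n, 255); M.assign(n, 0); D.assign(n, 0);
        for (size_t t = 0; t < frames.size(); ++t)
            for (size_t i = 0; i < n; ++i) {
                m[i] = std::min(m[i], frames[t][i]);
                M[i] = std::max(M[i], frames[t][i]);
                if (t > 0) {
                    int d = std::abs(int(frames[t][i]) - int(frames[t - 1][i]));
                    if (d > D[i]) D[i] = uint8_t(d);
                }
            }
    }

    // Equation 5.7: foreground when the pixel deviates from m or M by more than D.
    bool is_foreground(size_t i, uint8_t v) const {
        return std::abs(int(m[i]) - int(v)) > D[i] ||
               std::abs(int(M[i]) - int(v)) > D[i];
    }
};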

5.3.3 Single Gaussian Model (SGM)

The Single Gaussian Model algorithm was proposed by Wren in [WWT03]. In this section we describe the different steps of our implementation of this method.

• Modeling the background: we model the background as a texture surface; each point on the texture surface is associated with a mean color value and a distribution about that mean. The color distribution of each pixel is modeled with a Gaussian described by a full covariance matrix. The intensity and color of each pixel are represented by a vector [Y, U, V]^T. We define µ(φ) to be the mean [Y, U, V] of a point on the texture surface, and U(φ) to be the covariance of that point's distribution.

• Update models: we update the mean and covariance of each pixel φ as follows:

\mu^t(\phi) = (1 - \alpha)\mu^{t-1}(\phi) + \alpha I^t(\phi)    (5.8)

U^t(\phi) = (1 - \alpha)U^{t-1}(\phi) + \alpha \nu(\phi)\nu(\phi)^T    (5.9)

where I^t(φ) is the pixel of the current frame in YUV color space, α is the learning rate (in the experiments we have used α = 0.005) and ν(φ) = I^t(φ) − µ^t(φ).

• Detection: for each image pixel I^t(φ) we must measure the likelihood that it is a member of each of the blob models and of the scene model. We compute for each pixel at position φ the log likelihood L(φ) of the difference ν(φ). This value gives rise to a classification of individual pixels as background or foreground:

L(\phi) = -\frac{1}{2}\nu(\phi)^T (U^t)^{-1} \nu(\phi) - \frac{1}{2}\ln|U^t| - \frac{3}{2}\ln(2\pi)    (5.10)

A pixel φ is classified as foreground if L(φ) > τ, otherwise it is classified as background, where τ is a threshold; in the experiments we use τ = −300. The results obtained are shown in Figure 5.5.

Figure 5.5: The SGM results (a) current frame (b)(c) Foreground detection
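Below is a minimal per-pixel sketch of the SGM update (equations 5.8 and 5.9) and of a log-likelihood test in the spirit of equation 5.10. To keep it short, the full covariance matrix is replaced by a diagonal one and the YUV conversion is left out, which are simplifying assumptions; the comparison direction used for the final threshold (low likelihood means foreground) is also an assumption of the sketch.

#include <cmath>
#include <vector>

// Per-pixel single-Gaussian model with a diagonal covariance over (Y, U, V).
struct GaussPixel {
    double mean[3] = {0, 0, 0};
    double var[3]  = {400, 400, 400};     // initial variance (assumed)
};

// Equations 5.8 and 5.9 restricted to the diagonal of U(phi).
void sgm_update(GaussPixel& g, const double yuv[3], double alpha = 0.005) {
    for (int c = 0; c < 3; ++c) {
        double nu = yuv[c] - g.mean[c];
        g.mean[c] = (1.0 - alpha) * g.mean[c] + alpha * yuv[c];
        g.var[c]  = (1.0 - alpha) * g.var[c]  + alpha * nu * nu;
    }
}

// Log-likelihood of the current value under the model (equation 5.10 with a
// diagonal covariance).
double sgm_loglik(const GaussPixel& g, const double yuv[3]) {
    const double kPi = 3.14159265358979323846;
    double L = -1.5 * std::log(2.0 * kPi);
    for (int c = 0; c < 3; ++c) {
        double nu = yuv[c] - g.mean[c];
        L += -0.5 * nu * nu / g.var[c] - 0.5 * std::log(g.var[c]);
    }
    return L;
}

// Classification sketch: a poorly explained pixel (very low likelihood) is
// treated as foreground; the sign convention is an assumption of the example.
bool sgm_is_foreground(const GaussPixel& g, const double yuv[3], double tau = -300) {
    return sgm_loglik(g, yuv) < tau;
}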

5.3.4 Mixture Gaussian Model (MGM)

In this section we present our implementation of the original version of the adaptive Mixture of multiple Gaussians Model for background modeling. This model was proposed by Stauffer and Grimson in [HA05]. It can robustly deal with lighting changes, repetitive motions, clutter, introducing or removing objects from the scene and slowly moving objects. Due to these promising features, this model has been a popular choice for modeling complex and dynamic backgrounds. The steps of the algorithm can be summarized as follows:

• Background modeling: in this model, the value of an individual pixel (e.g. scalars for gray values or vectors for color images) over time is considered as a pixel process, and the recent history of each pixel, {X_1, ..., X_t}, is modeled by a mixture of K Gaussian distributions. The probability of observing the current pixel value then becomes:

P(X_t) = \sum_{i=1}^{K} w_{i,t} \, \eta(X_t, \mu_{i,t}, \sigma^2_{i,t})    (5.11)

where K is the number of distributions, w_{i,t} is an estimate of the weight (what portion of the data is accounted for by this Gaussian) of the i-th Gaussian G_{i,t} in the mixture at time t, µ_{i,t} is the mean value of G_{i,t}, σ²_{i,t} is the covariance matrix of G_{i,t} and η is a Gaussian probability density function:

\eta(X_t, \mu, \sigma^2) = \frac{1}{(2\pi)^{\frac{n}{2}} |\sigma^2|^{\frac{1}{2}}} \, e^{-\frac{1}{2}(X_t - \mu_t)^T (\sigma^2)^{-1} (X_t - \mu_t)}    (5.12)

The choice of K depends on the available memory and computational power. Also, for computational efficiency the covariance matrix is assumed to be of the form:

\sigma^2_{k,t} = \alpha_k^2 I    (5.13)

which assumes that the red, green and blue color components are independent and have the same variance.

• Update equations: the procedure for detecting foreground pixels is as follows. When the system starts, the K Gaussian distributions of a pixel are initialized with a predefined mean, a high variance and a low prior weight. When a new pixel is observed in the image sequence, its RGB vector is checked against the K Gaussians until a match is found. A match is defined if the pixel value is within γ = 2.5 standard deviations of a distribution. Next, the prior weights of the K distributions at time t, w_{k,t}, are updated as follows:

w_{k,t} = (1 - \alpha) w_{k,t-1} + \alpha M_{k,t}    (5.14)

where α is the learning rate and M_{k,t} is 1 for the model which matched the pixel and 0 for the remaining models. After this approximation the weights are renormalized. The adaptation speed of the model is governed by 1/α. The parameters µ and σ² of the matched distribution are updated as follows:

\mu_t = (1 - \rho)\mu_{t-1} + \rho X_t    (5.15)

\sigma^2_t = (1 - \rho)\sigma^2_{t-1} + \rho (X_t - \mu_t)^T (X_t - \mu_t)    (5.16)

\rho = \alpha \, \eta(X_t \mid \mu_k, \sigma_k)    (5.17)

If no match is found for the new observed pixel, the Gaussian distribution with the least probability is replaced by a new distribution with the current pixel value as its mean, an initially high variance and a low prior weight.

• Classification: in order to classify a new pixel as foreground or background, the K Gaussian distributions are sorted by the value of w/σ. This ordered list of distributions reflects the most probable backgrounds from top to bottom since, by (5.14), background pixel processes make the corresponding Gaussian distributions have larger prior weights and smaller variances. The first B distributions are then defined as background:

B = \arg\min_b \left( \sum_{k=1}^{b} w_k > T \right)    (5.18)

T is the minimum portion of the pixel data that should be accounted for by the background. If a small value is chosen for T, the background is generally unimodal. Figure 5.6 shows the results.
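A per-pixel sketch of the mixture update described above (equations 5.11 to 5.18), written for grayscale values to keep it compact. The number of components, the initial variance, the learning rate and the handling of the no-match case are illustrative choices, and the matched-component update follows the standard Stauffer-Grimson form.

#include <algorithm>
#include <cmath>
#include <vector>

struct Gaussian { double w, mean, var; };            // one mixture component

struct MixturePixel {
    std::vector<Gaussian> g;                          // K components
    explicit MixturePixel(int K = 3) : g(K, Gaussian{1.0 / K, 128.0, 900.0}) {}

    static double pdf(double x, double m, double var) {
        double d = x - m;
        return std::exp(-0.5 * d * d / var) / std::sqrt(2.0 * 3.14159265358979 * var);
    }

    // Updates the model with one observation and returns true when the value
    // is classified as background.
    bool update(double x, double alpha = 0.01, double T = 0.7) {
        int match = -1;
        for (size_t k = 0; k < g.size(); ++k)
            if (std::fabs(x - g[k].mean) < 2.5 * std::sqrt(g[k].var)) { match = int(k); break; }
        if (match < 0) {                              // replace the least probable component
            size_t worst = 0;
            for (size_t k = 1; k < g.size(); ++k)
                if (g[k].w < g[worst].w) worst = k;
            g[worst] = Gaussian{0.05, x, 900.0};
        }
        double wsum = 0;
        for (size_t k = 0; k < g.size(); ++k) {       // equation 5.14
            g[k].w = (1 - alpha) * g[k].w + alpha * (int(k) == match ? 1.0 : 0.0);
            wsum += g[k].w;
        }
        for (Gaussian& c : g) c.w /= wsum;            // renormalize the weights
        if (match >= 0) {                             // equations 5.15 to 5.17
            Gaussian& c = g[match];
            double rho = alpha * pdf(x, c.mean, c.var);
            c.mean = (1 - rho) * c.mean + rho * x;
            c.var  = (1 - rho) * c.var  + rho * (x - c.mean) * (x - c.mean);
        }
        // Background test (equation 5.18): sort by w/sigma and check whether the
        // matched component lies within the cumulative weight T.
        std::vector<int> order(g.size());
        for (size_t k = 0; k < g.size(); ++k) order[k] = int(k);
        std::sort(order.begin(), order.end(), [&](int a, int b) {
            return g[a].w / std::sqrt(g[a].var) > g[b].w / std::sqrt(g[b].var); });
        double cum = 0;
        for (int k : order) {
            cum += g[k].w;
            if (k == match) return true;              // matched a background component
            if (cum > T) break;
        }
        return false;                                 // foreground
    }
};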

5.3.5 Lehigh Omni-directional Tracking System (LOTS)

The target detector proposed in [BMGE01] operates on gray scale images. It uses two background images and two per-pixel thresholds. The steps of the algorithm can be summarized as follows:

Figure 5.6: The MGM results (a) Current frame (b)(c) Foreground detection

• Background and threshold initialization: the first step of the algorithm, performed only once, assumes that during a period of time there are no targets in the image. In this ideal scenario, the two backgrounds are computed easily. The image sequences used in our experiments do not have target-free images, so B1 is initialized as the mean of 20 consecutive frames and B2 is given by:

B_2 = B_1 + \mu    (5.19)

where µ is an additive Gaussian noise with η(µ = 10, σ = 20). The per-pixel threshold T_L is then initialized to the difference between the two backgrounds:

T_L(\phi) = |B_1(\phi) - B_2(\phi)| + \nu(\phi)    (5.20)

where ν represents noise with a uniform distribution in [1, 10] and φ is an image pixel. The higher threshold T_H is then computed by:

T_H(\phi) = T_L(\phi) + V    (5.21)

where V is the sensitivity of the algorithm; we have chosen V = 40 in our implementation.

• Detection: we create D, which contains the minimum of the differences between the new image I and the backgrounds B1 and B2:

D(\phi) = \min_j |I(\phi) - B_j(\phi)|, \quad j = 1, 2    (5.22)

D is then compared with the threshold images. Two binary images D_L and D_H are created. The active pixels of D_L and D_H are the pixels of D that are higher than the thresholds T_L and T_H respectively.

• Quasi connected components analysis (QCC): QCC computes, for each thresholded image D_L and D_H, images D_sL and D_sH with 16 times smaller resolution. Each element in these reduced images holds the number of active pixels in a 4x4 block of D_L and D_H respectively. Both images are then merged into a single image that labels pixels as detected, missed or insertions. This process labels every pixel and also deals with targets that are not completely connected, considering them as only one region.

• Labeling and cleaning: a 4-neighbor connected components analyzer is then applied to this image and the regions with less than Λ pixels are eliminated. The remaining regions are considered as detected. The detected pixels are the pixels of D_L that correspond to detected regions; the insertions are the active pixels of D_L that do not correspond to detected regions; and the missed pixels are the inactive pixels of D_L.

• Background and threshold adaptation: the backgrounds are updated as follows:

B_i^{t+1}(\phi) = \begin{cases} (1 - \alpha_2) B_i^t(\phi) + \alpha_2 I^t(\phi) & \phi \in T^t \\ (1 - \alpha_1) B_i^t(\phi) + \alpha_1 I^t(\phi) & \phi \in N^t \end{cases}    (5.23)

where φ ∈ T^t represents a pixel labeled as detected and φ ∈ N^t a missed pixel. Generally α_2 is smaller than α_1; in our experiments we have chosen α_1 = 0.000306 and α_2 = α_1/4, and only the background corresponding to the smaller difference D is updated. Using the pixel labels, the thresholds are updated as follows:

T_L^{t+1}(\phi) = \begin{cases} T_L^t(\phi) + 10 & \phi \in \text{insertion} \\ T_L^t(\phi) - 1 & \phi \in \text{missed} \\ T_L^t(\phi) & \phi \in \text{detected} \end{cases}    (5.24)

The results obtained are shown in Figure 5.7.

Figure 5.7: The LOTS results (a) Current frame (b)(c) Foreground detection
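The per-pixel bookkeeping of LOTS (equations 5.19 to 5.24) can be sketched as follows; the quasi connected components stage is omitted and the pixel labels are assumed to be supplied by it, so this is only an illustration of the adaptation rules, not a complete detector.

#include <algorithm>
#include <cmath>

enum class Label { Detected, Missed, Insertion };

// Per-pixel state of the LOTS detector: two backgrounds and two thresholds.
struct LotsPixel {
    float B1, B2;        // backgrounds (equation 5.19)
    float TL, TH;        // low and high thresholds (equations 5.20 and 5.21)
};

// Minimum difference against the two backgrounds (equation 5.22).
inline float lots_difference(const LotsPixel& p, float I) {
    return std::min(std::fabs(I - p.B1), std::fabs(I - p.B2));
}

// Background and threshold adaptation (equations 5.23 and 5.24); the label
// comes from the QCC/labeling stage described above.
void lots_adapt(LotsPixel& p, float I, Label label, float a1 = 0.000306f) {
    const float a2 = a1 / 4.f;
    const float a  = (label == Label::Detected) ? a2 : a1;
    // Only the background closer to the current image is updated.
    float& B = (std::fabs(I - p.B1) <= std::fabs(I - p.B2)) ? p.B1 : p.B2;
    B = (1.f - a) * B + a * I;

    if (label == Label::Insertion)      p.TL += 10.f;   // too sensitive here
    else if (label == Label::Missed)    p.TL -= 1.f;    // not sensitive enough
    p.TH = p.TL + 40.f;                                 // sensitivity V = 40
}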

5.4 Shadow and light change detection

The algorithms described above for motion detection perform well in indoor and outdoor environments and have been used for real-time surveillance for years. However, most of these algorithms are susceptible to both local (e.g. shadows and highlights) and global illumination changes (e.g. the sun being covered or uncovered by clouds). Shadows cause the motion detection methods to fail in segmenting only the moving objects, and make higher levels, such as object classification, perform inaccurately. Most existing shadow detection algorithms use either chromaticity or stereo information to cope with shadows and sudden light changes. In our video surveillance system we have implemented two methods based on chromaticity: Horprasert [TDD99] and Chen [CL04].

Shadows are due to the occlusion of the light source by an object in the scene (Figure 5.8). The part of an object that is not illuminated is called self-shadow, while the area projected onto the scene by the object is called cast shadow and is further classified into umbra and penumbra. The umbra corresponds to the area where the direct light is totally blocked by the object, whereas in the penumbra area it is partially blocked. The umbra is easier to see and detect, but is more likely to be misclassified as a moving object. If the object is moving, the cast shadow is more properly called a moving cast shadow [JRO99], otherwise it is called a still shadow. The human visual system seems to rely on the following observations for reliable shadow detection:

1. A shadow is dark but does not significantly change either the color or the texture of the background it covers.

2. A shadow is always associated with the object that casts it and with the behavior of that object.

3. The shadow shape is the projection of the object shape on the background. For an extended light source, the projection is unlikely to be perspective.

4. Both the position and the strength of the light source are known.

5. Shadow boundaries tend to change direction according to the geometry of the surfaces on which they are cast.

Figure 5.8: Illumination of an object surface

5.4.1 Methods and implementation

Shadow detection processes can be classified in two ways: they are either deterministic or statistical, and they are either parametric or non-parametric. A statistical non-parametric approach uses probabilistic functions to describe class membership, so that parameter selection is not a critical issue. In the current work, to cope with shadows, we selected, implemented and evaluated two methods: a statistical non-parametric approach introduced by Horprasert et al. in [TDD99] and a deterministic non-model based approach proposed by Chen et al. [CL04]. Our objective is to complete our video surveillance prototype with shadow removal cores with known performance and limitations.

Statistical non-parametric approach

A good understanding of this approach requires the introduction of some mathematical tools for image processing:

• Computational color model: based on the fact that humans tend to assign a constant color to an object even under changing illumination over time and space (color constancy), the color model used here separates the brightness from the chromaticity component. Figure 5.9 illustrates the proposed color model in three-dimensional RGB color space. Consider a pixel i in the image and let E_i = [E_R(i), E_G(i), E_B(i)] represent the pixel's expected RGB color in the background image. Next, let I_i = [I_R(i), I_G(i), I_B(i)] denote the pixel's RGB color value in the current image that we want to subtract from the background. The line OE_i is called the expected chromaticity line, whereas OI_i is the observed color value. Basically, we want to measure the distortion of I_i from E_i. The proposed color model separates the brightness from the chromaticity component; we do this by decomposing the distortion measurement into two components, brightness distortion and chromaticity distortion, which are described below.

Figure 5.9: Proposed color model in three dimensional RGB color space

• Brightness distortion: the brightness distortion (α_i) is a scalar value that brings the observed color close to the expected chromaticity line. It is obtained by minimizing:

\phi(\alpha_i) = (I_i - \alpha_i E_i)^2    (5.25)

• Color distortion: color distortion is defined as the orthogonal distance between the observed color and the expected chromaticity line. The color distortion of a pixel i is given by:

CD_i = \| I_i - \alpha_i E_i \|    (5.26)

• Color characteristics: CCD sensors linearly transform the infinite-dimensional spectral color space to a three-dimensional RGB color space via red, green and blue color filters. There are some characteristics of the output image, influenced by typical CCD cameras, that we should account for in designing the algorithm:

– color variation: the RGB color value of a given pixel varies over time due to camera noise and illumination fluctuation of the light sources.

– band unbalancing: cameras typically have different sensitivities to different colors. Thus, in order to balance the weights of the three color bands (R, G, B), the pixel values need to be rescaled or normalized by weight values. Here, the pixel color is normalized by its standard deviation (s_i), which is given by:

s_i = [\sigma_R(i), \sigma_G(i), \sigma_B(i)]    (5.27)

where σ_R(i), σ_G(i) and σ_B(i) are the standard deviations of the i-th pixel's red, green and blue values computed over N background frames.

– clipping: since the sensors have a limited dynamic range of responsiveness, the variety of colors is restricted to an RGB color cube, formed by the red, green and blue primary colors as orthogonal axes. On 24-bit images, the gamut of the color distribution resides within the cube ranging from [0, 0, 0] to [255, 255, 255]. Colors outside the cube (negative or greater than 255) cannot be represented; as a result, pixel values are clipped in order to lie entirely inside the cube.

The algorithm is based on pixel modeling and background subtraction and its flowchart is shown in Figure 5.10.

Figure 5.10: Flowchart of the Statistical Non Parametric approach. The first N frames are used to compute, for each pixel, the means and the variances of each color channel, E_i and s_i respectively (1). Then, the brightness distortion α_i and the chromaticity distortion CD_i of the difference between the expected color of a pixel and its value in the current image are computed and normalized (2, 4). Finally, each pixel is classified into four classes C(i) (5) using a decision rule based on the thresholds τ_CD, τ_αlo, τ_α1 and τ_α2 which are computed automatically (3).

Typically, the algorithm consists of three basic steps:

1. Background modeling: the background is modeled statistically on a pixel by pixel basis. A pixel is modeled by a 4-tuple ⟨E_i, s_i, a_i, b_i⟩ where E_i is the expected color value, s_i is the standard deviation of the color value, a_i is the variation of the brightness distortion, and b_i is the variation of the chromaticity distortion of the i-th pixel. The background image and the associated parameters are calculated over a number of static background frames. The expected color value of pixel i is given by:

E_i = [\mu_R(i), \mu_G(i), \mu_B(i)]    (5.28)

where µ_R(i), µ_G(i) and µ_B(i) are the arithmetic means of the i-th pixel's red, green and blue values computed over N background frames. Since the color bands have to be balanced by rescaling the color values by the pixel variation factors, the formulas for calculating the brightness distortion and the chromaticity distortion can be written as:

\alpha_i = \frac{\frac{I_R(i)\mu_R(i)}{\sigma_R^2(i)} + \frac{I_G(i)\mu_G(i)}{\sigma_G^2(i)} + \frac{I_B(i)\mu_B(i)}{\sigma_B^2(i)}}{\left[\frac{\mu_R(i)}{\sigma_R(i)}\right]^2 + \left[\frac{\mu_G(i)}{\sigma_G(i)}\right]^2 + \left[\frac{\mu_B(i)}{\sigma_B(i)}\right]^2}    (5.29)

CD_i = \sqrt{\left(\frac{I_R(i) - \alpha_i\mu_R(i)}{\sigma_R(i)}\right)^2 + \left(\frac{I_G(i) - \alpha_i\mu_G(i)}{\sigma_G(i)}\right)^2 + \left(\frac{I_B(i) - \alpha_i\mu_B(i)}{\sigma_B(i)}\right)^2}    (5.30)

Since different pixels yield different distributions of brightness and chromaticity distortions, these variations are embedded in the background model as a_i and b_i respectively, which are used as normalization factors. The variation of the brightness distortion of the i-th pixel is given by a_i:

a_i = RMS(\alpha_i) = \sqrt{\frac{\sum_{i=0}^{N} (\alpha_i - 1)^2}{N}}    (5.31)

The variation of the chromaticity distortion of the i-th pixel is given by b_i:

b_i = RMS(CD_i) = \sqrt{\frac{\sum_{i=0}^{N} (CD_i)^2}{N}}    (5.32)

In order to use a single threshold for all pixels, we need to rescale α_i and CD_i. Therefore, let

\hat{\alpha}_i = \frac{\alpha_i - 1}{a_i}    (5.33)

and

\hat{CD}_i = \frac{CD_i}{b_i}    (5.34)

be the normalized brightness distortion and the normalized chromaticity distortion respectively.

Figure 5.11: Histogram of the normalized α

Figure 5.12: Histogram of the normalized CD

2. Threshold selection: a statistical learning procedure is used to automatically determine the appropriate thresholds. The histograms of both \hat{α}_i and \hat{CD}_i are built and the thresholds are computed by fixing a certain detection rate. τ_CD is a threshold value used to determine the similarity in chromaticity between the background image and the current observed image, while τ_α1 and τ_α2 are threshold values used to determine the similarities in brightness; τ_αlo is described in the classification step. According to the histograms of the normalized α (Figure 5.11) and CD (Figure 5.12), the values selected for the various thresholds used for pixel classification are given in Table 5.1.

Threshold   Value
τ_CD        100
τ_αlo       -50
τ_α1        15
τ_α2        -15

Table 5.1: Shadow detection thresholds learning result

3. Pixel classification: the chromaticity and brightness distortion components are calculated from the difference between the background image and the current image. Suitable threshold values are applied to these components to determine the classification of each pixel i according to the following definitions:

• Background (B): if it has both brightness and chromaticity similar to those of the same pixel in the background image.

• Shaded background or shadow (S): if it has similar chromaticity but lower brightness than the same pixel in the background image. This is based on the notion of the shadow as a semi-transparent region in the image, which retains a representation of the underlying surface pattern, texture or color value.

• Highlighted background (H): if it has similar chromaticity but higher brightness than the background image.

• Foreground (F): if the pixel has chromaticity different from the expected values in the background image.

Based on these definitions, a pixel is classified into one of the four categories C(i) ∈ {F, B, S, H} by the following decision procedure:

C(i) = \begin{cases} F & \text{if } \hat{CD}_i > \tau_{CD}, \text{ else} \\ B & \text{if } \tau_{\alpha 2} < \hat{\alpha}_i < \tau_{\alpha 1}, \text{ else} \\ S & \text{if } \hat{\alpha}_i < 0, \text{ else} \\ H & \text{otherwise} \end{cases}    (5.35)

A problem encountered here is that if a moving object in the current image contains very low RGB values, such a dark pixel will always be misclassified as shadow. This is because the color point of a dark pixel is close to the origin in RGB space and all chromaticity lines in RGB space meet at the origin, which causes the color point to be considered close or similar to any chromaticity line. To avoid this problem, we introduce a lower bound for the normalized brightness distortion (τ_αlo). The decision procedure then becomes:

C(i) = \begin{cases} F & \text{if } \hat{CD}_i > \tau_{CD} \text{ or } \hat{\alpha}_i < \tau_{\alpha lo}, \text{ else} \\ B & \text{if } \tau_{\alpha 2} < \hat{\alpha}_i < \tau_{\alpha 1}, \text{ else} \\ S & \text{if } \hat{\alpha}_i < 0, \text{ else} \\ H & \text{otherwise} \end{cases}    (5.36)
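The brightness and chromaticity distortions and the decision rule of equation 5.36 can be sketched per pixel as follows; the background statistics (E_i, s_i, a_i, b_i) are assumed to have been estimated over N background frames as described above, and the default thresholds simply restate Table 5.1.

#include <cmath>

enum class PixelClass { Background, Shadow, Highlight, Foreground };

// Per-pixel background statistics of the statistical non-parametric model.
struct BgStats {
    double mu[3];     // expected RGB value E_i             (equation 5.28)
    double sigma[3];  // per-channel standard deviation s_i (equation 5.27)
    double a;         // RMS of the brightness distortion   (equation 5.31)
    double b;         // RMS of the chromaticity distortion (equation 5.32)
};

PixelClass classify(const BgStats& s, const double I[3],
                    double tCD = 100, double tAlphaLo = -50,
                    double tA1 = 15, double tA2 = -15) {
    // Brightness distortion (equation 5.29) and chromaticity distortion (5.30).
    double num = 0, den = 0;
    for (int c = 0; c < 3; ++c) {
        num += I[c] * s.mu[c] / (s.sigma[c] * s.sigma[c]);
        den += (s.mu[c] / s.sigma[c]) * (s.mu[c] / s.sigma[c]);
    }
    double alpha = num / den;
    double cd = 0;
    for (int c = 0; c < 3; ++c) {
        double d = (I[c] - alpha * s.mu[c]) / s.sigma[c];
        cd += d * d;
    }
    cd = std::sqrt(cd);
    // Normalization (equations 5.33 and 5.34), then the rule of equation 5.36.
    double aHat  = (alpha - 1.0) / s.a;
    double cdHat = cd / s.b;
    if (cdHat > tCD || aHat < tAlphaLo)  return PixelClass::Foreground;
    if (aHat < tA1 && aHat > tA2)        return PixelClass::Background;
    if (aHat < 0)                        return PixelClass::Shadow;
    return PixelClass::Highlight;
}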

Figure 5.13: Result of the statistical non-parametric approach: (a) current frame, (b) detection result without shadow removal, (c) shadows

Deterministic non-model based approach

A shadow cast on a background does not change its hue significantly, but mostly lowers the saturation of the points. We also observe that the relationship between pixels when illuminated and the same pixels under shadow is roughly linear. Based on these observations, we exploit the brightness distortion and chrominance distortion of the difference between the expected color of a pixel and its value in the current image; in addition, the saturation component and the ratio of brightness between the current image and the reference image are employed to determine the shadowed background pixels. This method is close to the one proposed by Chen et al. [CL04], which presents a background model initiation and maintenance algorithm exploiting HSV color information. It differs from the statistical non-parametric approach described above in the way it handles its background model: it does not provide a special procedure for updating the initial background model, but instead uses a similar procedure to rebuild the background model periodically, every N frames. In other words, it periodically provides a fresh background model. Figure 5.14 shows the flowchart of this approach. Typically, the algorithm consists of three basic steps:

Figure 5.14: Flowchart of the Deterministic non-model based approach. Periodically, each set of N frames is used to compute, for each pixel, the most representative RGB pixel values, i.e. those with the highest frequencies f_R, f_G and f_B (5). For each frame (1), a conversion from RGB to HSV is applied (2); then each pixel is classified using a two-stage pixel process into three classes C(i) ∈ {F, B, S} (3) using a decision rule based on thresholds τ_H, τ_s, R, M, M_min and M_n(x, y) periodically computed (6).

1. Background maintenance: the initial background model is obtained even if there are moving foreground objects in the field of view, such as walking people, moving cars, etc. In the algorithm, the frequency ratios of the intensity values of each pixel at the same position are calculated over N frames to distinguish moving pixels from stationary ones, and the intensity values with the highest ratios are incorporated to model the background scene. We extend this idea to the RGB color space: each component of RGB is treated with the same scheme. An RGB histogram is created to record the frequency of each component of the pixel. This process is illustrated in Figure 5.15. The RGB component value with the highest frequency ratio in the histogram is assigned as the corresponding component value of the pixel in the background model. In order to maintain the background model, we reinitialize it periodically every N frames (N = 100-200 frames) to adapt to the illumination changes and scene geometry changes in the background scene.

Figure 5.15: RGB histogram of pixel (x, y) over the learning period for background initiation.

Figure 5.16: (A) frame 106, (B) frame 202, (C) estimated background

2. Pixel classification: pixel classification is a two-stage process. The first stage segments the foreground pixels: the brightness distortion between the incoming frame and the reference image is computed for each pixel, and the stage is performed according to the following equation:

F(x, y) = \begin{cases} 1 & \text{if } |I^v(x, y) - B^v(x, y)| \geq 1.1\, M_n(x, y) \;\lor\; |I^v(x, y) - B^v(x, y)| > M_{min} \\ 0 & \text{otherwise} \end{cases}    (5.37)

The second stage consists in the segmentation of the shadow pixels. This is done by introducing the hue and saturation information of the candidate pixels, as well as the ratio of brightness between these pixels and the same pixels in the background image. The shadow points are thus classified by the following decision procedure:

S(x, y) = \begin{cases} 1 & \text{if } I^v(x, y) - B^v(x, y) < M \;\land\; L < |I^v(x, y)/B^v(x, y)| < R \;\land\; |I^H(x, y) - B^H(x, y)| > \tau_H \;\land\; I^s(x, y) - B^s(x, y) < \tau_s \\ 0 & \text{otherwise} \end{cases}    (5.38)

3. Threshold selection: equation (5.37) introduces two thresholds, M_n(x, y) and M_min. M_n(x, y) is the mean value of the brightness distortion of the pixel at position (x, y) over the last N frames. M_min is a minimum mean value introduced as a noise threshold to prevent the mean value from decreasing below a minimum should the background measurements remain strictly constant over a long period of time; M_min is set to 40. Equation (5.38) involves four thresholds: M, R, τ_H and τ_s. R is the ratio between pixels when illuminated and the same pixels under shadow; this ratio is linear according to [RE95]. We experimentally fixed the value of R to 2.2, which is coherent with the range experimentally found by Chen et al. [CL04]. L and M have been empirically set to 1.2 and 20 respectively. τ_H and τ_s are threshold values used to measure the similarities of the hue and saturation between the background image and the current observed image; τ_H is set to 30 and τ_s is set to -10.
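A sketch of the two-stage test of equations 5.37 and 5.38 for a single pixel, with HSV values assumed to be precomputed. The direction of the brightness ratio and the hue condition are written here as a darkening test and a hue-similarity test respectively, which are assumptions of this sketch rather than a literal transcription of equation 5.38.

#include <algorithm>
#include <cmath>

// One pixel in HSV; V and S in [0, 255], H in degrees [0, 360).
struct Hsv { double h, s, v; };

// Hue distance on the circular hue axis.
inline double hue_diff(double h1, double h2) {
    double d = std::fabs(h1 - h2);
    return std::min(d, 360.0 - d);
}

// Stage 1 (in the spirit of equation 5.37): a pixel is a foreground candidate
// when its brightness deviates clearly from the reference background.
inline bool is_candidate(double Iv, double Bv, double Mn, double Mmin) {
    double d = std::fabs(Iv - Bv);
    return d >= 1.1 * Mn || d > Mmin;
}

// Stage 2 (in the spirit of equation 5.38): a candidate is relabeled as shadow
// when it is darker than the background by a bounded, roughly linear factor,
// its saturation drops, and its hue barely changes.
inline bool is_shadow(const Hsv& I, const Hsv& B,
                      double L = 1.2, double R = 2.2,
                      double tauH = 30.0, double tauS = -10.0) {
    double ratio = (I.v > 0) ? B.v / I.v : R + 1;   // greater than 1 when the pixel darkens
    return ratio > L && ratio < R &&
           I.s - B.s < tauS &&
           hue_diff(I.h, B.h) < tauH;
}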

5.5 High level feedback to improve detection methods

Background models are usually designed to be task independent, and this often means that they can use very little high-level information. Several approaches to incorporating information about foreground objects into background maintenance have been proposed. They may be broadly separated into two categories: probabilistic frameworks that jointly model scene and foreground object evolution, and systems consisting of separate modules for scene modeling and high level inference.


Figure 5.17: Result of the deterministic non-model based approach: (a) current frame, (b) detection result without shadow removal, (c) shadows

5.5.1 The modular approach

The usage of high level information in the background modeling framework takes place at the model update level. Instead of updating the background model directly with the classification result, the background model is updated with the classification result post-processed by the high level algorithms (tracking, shadow detection, etc.). The advantage of the modular approach is that it preserves the known algorithms for adaptive background differencing: it only intercepts the model update step and provides a corrected mask built from the high level information, so the previous algorithms remain easy to use. On the other hand, the benefit of the high level information depends on the sensitivity of the underlying background model. In our work we use tracking and shadow detection as high level information providers. We do not explain the whole implementation in the present thesis (this part is documented in the C++ library implementing the different algorithms); we concentrate on the comparison of several algorithms and on the integration with the rest of the framework.

5.6 Performance evaluation

Performance evaluation is an important issue in the development of object detection algorithms. Quantitative measures are useful for optimizing, evaluating and comparing different algorithms. For object detection, we believe there is no single measure for quantifying the different aspects of performance. In this section, we present a set of performance measures for determining how well a detection algorithm output matches the ground truth.

5.6.1 Ground truth generation

A number of semi-automatic tools are currently available for generating ground truth from pre-recorded video. The Open development environment for evaluation of video systems (ODViS) [JWSX02] allows the user to generate ground truth, incorporate new tracking algorithms into the environment and define any error metric through a graphical user interface. The Video Performance Evaluation Resource (ViPER) [DM00] is directed more toward performance evaluation for video analysis systems; it also provides an interface to generate ground truth, metrics for evaluation and visualization of video analysis results. In our experiments the comparison is based on manually annotated ground truth. We drew a bounding box around each individual and also around each group of individuals. Individuals are labeled only once they start moving. Groups are defined as two or more individuals that interact. Since groups cannot be determined only by analyzing the spatial relations between individuals, which makes their detection and tracking very difficult for artificial systems, we decided to restrict the evaluation to individual bounding boxes only. In addition, we labeled the shape of each individual using two labels: one for the individual and another for the moving cast shadow. This two-level labeling (pixel based and object based) is useful to compare the object detection algorithms based on adaptive background differencing.

5.6.2 Datasets

We evaluated the performance of the different algorithms on a collection of datasets of outdoor and indoor scenes. The outdoor scenes mainly show people moving in a parking lot; the acquisition took place at INRIA (Grenoble). Figure 5.18 shows an example of the parking scene with the corresponding visual labeling (we can distinguish the main object and the cast shadow). Figure 5.19 shows a well known indoor dataset: the intelligent room. Using this dataset allowed us to compare our results with published results and to verify the correctness of other results as well.

Figure 5.18: Example frame of the parking ground truth

Figure 5.19: Example frame of the intelligent room ground truth

5.6.3 Evaluation metrics

In order to evaluate the performance of object detection algorithms, we adopted some of the performance metrics proposed in [HA05] and [BER03], as well as other metrics introduced in [VJJR02]. These metrics and their significance are described below. Let G(t) be the set of ground truth objects in a single frame t and let D(t) be the set of output boxes produced by the algorithm; N_G(t) and N_D(t) are their respective cardinalities in frame t.

1. Precision (FAR) and Recall (TRDR):

TRDR = \frac{TP}{TP + FN}    (5.39)

FAR = \frac{TP}{TP + FP}    (5.40)

where a True Positive (TP) is defined as a ground truth point that is located within the bounding box of an object detected by the algorithm. A False Negative (FN) is a ground truth point that is not located within the bounding box of any object detected by the algorithm. A False Positive (FP) is an object detected by the algorithm that does not have a matching ground truth point. A correct match is registered when the bounding boxes of the targets A_obs and A_truth overlap sufficiently:

\frac{A_{obs} \cap A_{truth}}{A_{obs} \cup A_{truth}} \geq T    (5.41)
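The following sketch counts TP, FP and FN for one frame using the bounding-box overlap criterion of equation 5.41 and then derives TRDR and FAR (equations 5.39 and 5.40); the box representation and the greedy one-to-one matching policy are assumptions of the example.

#include <algorithm>
#include <cstdio>
#include <vector>

struct Box { double x, y, w, h; };

// Overlap ratio of equation 5.41 (intersection over union of two boxes).
double overlap(const Box& a, const Box& b) {
    double x1 = std::max(a.x, b.x), y1 = std::max(a.y, b.y);
    double x2 = std::min(a.x + a.w, b.x + b.w), y2 = std::min(a.y + a.h, b.y + b.h);
    double inter = std::max(0.0, x2 - x1) * std::max(0.0, y2 - y1);
    double uni = a.w * a.h + b.w * b.h - inter;
    return uni > 0 ? inter / uni : 0.0;
}

// Greedy matching of detections to ground truth with an overlap requirement T.
void evaluate(const std::vector<Box>& truth, const std::vector<Box>& det,
              double T, int& tp, int& fp, int& fn) {
    std::vector<bool> used(truth.size(), false);
    tp = fp = 0;
    for (const Box& d : det) {
        int best = -1;
        double bestOv = T;
        for (size_t g = 0; g < truth.size(); ++g) {
            double ov = overlap(d, truth[g]);
            if (!used[g] && ov >= bestOv) { best = int(g); bestOv = ov; }
        }
        if (best >= 0) { used[best] = true; ++tp; } else ++fp;
    }
    fn = int(truth.size()) - tp;
}

int main() {
    std::vector<Box> truth = {{10, 10, 40, 80}};
    std::vector<Box> det   = {{12, 14, 38, 76}, {200, 50, 30, 60}};
    int tp, fp, fn;
    evaluate(truth, det, 0.5, tp, fp, fn);
    double recall    = tp / double(tp + fn);     // TRDR, equation 5.39
    double precision = tp / double(tp + fp);     // equation 5.40
    std::printf("TP=%d FP=%d FN=%d recall=%.2f precision=%.2f\n",
                tp, fp, fn, recall, precision);
}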

2. Area-Based Recall for Frame (ABRF): this metric measures how well the algorithm covers the pixel regions of the ground truth. It is first computed for each frame and then weight-averaged over the whole data set. Let U_G(t) and U_D(t) be the spatial unions of the boxes in G(t) and D(t):

U_{G(t)} = \bigcup_{i=1}^{N_{G(t)}} G_i^{(t)}    (5.42)

U_{D(t)} = \bigcup_{i=1}^{N_{D(t)}} D_i^{(t)}    (5.43)

For a single frame t, we define Rec(t) as the ratio of the detected area of the ground truth to the total ground truth area:

Rec(t) = \begin{cases} \text{undefined} & \text{if } U_{G(t)} = \emptyset \\ \frac{|U_{G(t)} \cap U_{D(t)}|}{|U_{G(t)}|} & \text{otherwise} \end{cases}    (5.44)

OverallRec = \begin{cases} \text{undefined} & \text{if } \sum_{t=1}^{N_f} |U_{G(t)}| = 0 \\ \frac{\sum_{t=1}^{N_f} |U_{G(t)}| \times Rec(t)}{\sum_{t=1}^{N_f} |U_{G(t)}|} & \text{otherwise} \end{cases}    (5.45)

where N_f is the number of frames in the ground-truth data set and the |·| operator denotes the number of pixels in the area.

Method   Room     Parking
Recall at T=50%
LOTS     0.3800   0.7500
W4       0.3900   0.0789
MGM      0.3799   0.5132
SGM      0.0000   0.0000
BBS      0.0000   0.0000
BBS-R    0.0092   0.0078
Precision at T=50%
LOTS     0.0800   0.0600
W4       0.0115   0.0001
MGM      0.0000   0.0088
SGM      0.0000   0.0000
BBS      0.0000   0.0000
BBS-R    0.0092   0.0022

Table 5.2: Comparison of recall and precision of the algorithms evaluated on the indoor and outdoor datasets

3. Area-Based Precision for Frame (ABPF): this metric measures how well the algorithm minimizes false alarms. It is first computed for each frame and then weight-averaged over the whole data set. With U_G(t) and U_D(t) defined as above:

U_{G(t)} = \bigcup_{i=1}^{N_{G(t)}} G_i^{(t)}    (5.46)

U_{D(t)} = \bigcup_{i=1}^{N_{D(t)}} D_i^{(t)}    (5.47)

For a single frame t, we define Prec(t) as:

Prec(t) = \begin{cases} \text{undefined} & \text{if } U_{D(t)} = \emptyset \\ 1 - \frac{|U_{G(t)} \cap U_{D(t)}|}{|U_{D(t)}|} & \text{otherwise} \end{cases}    (5.48)

OverallPrec is the weighted average precision over all the frames:

OverallPrec = \begin{cases} \text{undefined} & \text{if } \sum_{t=1}^{N_f} |U_{D(t)}| = 0 \\ \frac{\sum_{t=1}^{N_f} |U_{D(t)}| \times Prec(t)}{\sum_{t=1}^{N_f} |U_{D(t)}|} & \text{otherwise} \end{cases}    (5.49)

where N_f is the number of frames in the ground-truth data set and the |·| operator denotes the number of pixels in the area.

Method   LOTS     W4       MGM      SGM      BBS      BBS-R
Room     0.3731   0.8464   0.5639   0.6434   0.8353   0.7000
Parking  0.7374   0.6614   0.8357   0.5579   0.9651   0.5911

Table 5.3: ABRF metric evaluation of the different algorithms

Method   LOTS     W4       MGM      SGM      BBS      BBS-R
Room     0.1167   0.5991   0.5652   0.8801   0.9182   0.5595
Parking  0.3883   0.7741   0.6536   0.8601   0.9526   0.4595

Table 5.4: ABPF metric evaluation of the different algorithms

4. Average Fragmentation (AF): this metric is intended to penalize an algorithm for multiple output boxes covering a single ground-truth object. Multiple detections include overlapping and non-overlapping boxes. For a ground-truth object G_i^{(t)} in frame t, the fragmentation of the output boxes overlapping G_i^{(t)} is measured by:

Frag(G_i^{(t)}) = \begin{cases} \text{undefined} & \text{if } N_{D(t) \cap G_i^{(t)}} = 0 \\ \frac{1}{1 + \log_{10}(N_{D(t) \cap G_i^{(t)}})} & \text{otherwise} \end{cases}    (5.50)

where N_{D(t) \cap G_i^{(t)}} is the number of output boxes in D(t) that overlap with the ground truth object G_i^{(t)}. Overall fragmentation is defined as the average fragmentation over all ground-truth objects in the entire data set.

5. Average Object Area Recall (AOAR): this metric measures the average area recall of all the ground-truth objects in the data set. The recall of an object is the proportion of its area that is covered by the algorithm's output boxes; objects are treated equally regardless of size. For a single frame t, we define Recall(t) as the average recall over all the objects of the ground truth G(t):

Recall(t) = \frac{\sum_{\forall G_i^{(t)}} ObjectRecall(G_i^{(t)})}{N_{G(t)}}    (5.51)

where

ObjectRecall(G_i^{(t)}) = \frac{|G_i^{(t)} \cap U_{D(t)}|}{|G_i^{(t)}|}    (5.52)

OverallRecall = \frac{\sum_{t=1}^{N_f} N_{G(t)} \times Recall(t)}{\sum_{t=1}^{N_f} N_{G(t)}}    (5.53)

Method   LOTS     W4       MGM      SGM      BBS      BBS-R
Room     0.1500   0.5555   0.2920   0.4595   0.7026   0.3130
Parking  0.7186   0.6677   0.8468   0.5418   0.9575   0.1496

Table 5.5: AOAR metric evaluation of the different algorithms

6. Average Detected Box Area Precision (ADBAP): this metric is the counterpart of metric (5.51) where the output boxes are examined instead of the ground-truth objects. Precision is computed for each output box and averaged over the frame. The precision of a box is the proportion of its area that covers the ground truth objects:

Precision(t) = \frac{\sum_{\forall D_i^{(t)}} BoxPrecision(D_i^{(t)})}{N_{D(t)}}    (5.54)

BoxPrecision(D_i^{(t)}) = \frac{|D_i^{(t)} \cap U_{G(t)}|}{|D_i^{(t)}|}    (5.55)

OverallPrecision = \frac{\sum_{t=1}^{N_f} N_{D(t)} \times Precision(t)}{\sum_{t=1}^{N_f} N_{D(t)}}    (5.56)

Method   LOTS     W4       MGM      SGM      BBS      BBS-R
Room     0.0053   0.0598   0.0534   0.0133   0.0048   0.1975
Parking  0.1453   0.0417   0.0177   0.0192   0.0002   0.0192

Table 5.6: ADBAP metric evaluation of the different algorithms

7. Localized Object Count Recall (LOCR): in this metric, a ground-truth object is considered detected if a minimum proportion of its area is covered by the output boxes. Recall is computed as the ratio of the number of detected objects to the total number of ground-truth objects:

LocObjRecall(t) = \sum_{\forall G_i^{(t)}} ObjDetect(G_i^{(t)})    (5.57)

OverallLocObjRecall = \frac{\sum_{t=1}^{N_f} LocObjRecall(t)}{\sum_{t=1}^{N_f} N_{G(t)}}    (5.58)

ObjDetect(G_i^{(t)}) = \begin{cases} 1 & \text{if } \frac{|G_i^{(t)} \cap U_{D(t)}|}{|G_i^{(t)}|} > OverlapMin \\ 0 & \text{otherwise} \end{cases}    (5.59)

Method   LOTS     W4       MGM      SGM      BBS      BBS-R
Room     0.0056   0.0083   0.0083   0.0056   0.0083   0.0028
Parking  0.0190   0.0127   0.0253   0.0190   0.0253   0.0063

Table 5.7: LOCR metric evaluation of the different algorithms

8. Localized Output Box Count Precision (LOBCP): this is the counterpart of metric (5.56). It counts the number of output boxes that significantly cover the ground truth. An output box D_i^{(t)} significantly covers the ground truth if a minimum proportion of its area overlaps with U_G(t):

LocBoxCount(t) = \sum_{\forall D_i^{(t)}} BoxPrec(D_i^{(t)})    (5.60)

BoxPrec(D_i^{(t)}) = \begin{cases} 1 & \text{if } \frac{|D_i^{(t)} \cap U_{G(t)}|}{|D_i^{(t)}|} > OverlapMin \\ 0 & \text{otherwise} \end{cases}    (5.61)

OverallOutputBoxPrec = \frac{\sum_{t=1}^{N_f} LocBoxCount(t)}{\sum_{t=1}^{N_f} N_{D(t)}}    (5.62)

Method   LOTS     W4       MGM      SGM      BBS      BBS-R
Room     0.5919   0.0592   0.0541   0.0124   0.0043   0.1900
Parking  0.1729   0.0427   0.0206   0.0186   0.0001   0.0174

Table 5.8: LOBCP metric evaluation of the different algorithms

5.6.4 Experimental results

Figure 5.20(a) and Figure 5.20(b) display respectively the precision and the recall of the different methods evaluated on the indoor dataset. Figure 5.21(a) and Figure 5.21(b) display respectively the precision and the recall of the different methods evaluated on the outdoor dataset. Table 5.2 shows precision and recall with an overlap requirement of T = 50% (see eq. 5.41).

Figure 5.20: Room sequence: (a) precision with respect to overlap, (b) recall with respect to overlap

Figure 5.21: Parking sequence: (a) precision with respect to overlap, (b) recall with respect to overlap

According to these results we can make the following remarks. The simple BBS method gives the lower benchmark on precision, since it produces a very high number of false detections. The MGM method has the best recall for high overlap thresholds (T > 0.6); for low overlap thresholds (T < 0.2), LOTS gets the best recall on the indoor dataset, while on the outdoor dataset (parking) LOTS has the best recall overall, which means that its size estimates are quite good. This is due to the connected component analysis performed by these methods. In the indoor scene, BBS-R performs better than BBS and also has higher precision for T < 0.4, but it still has lower recall; like BBS, this method does not seem appropriate for indoor scenes. On the other hand, it performs better on outdoor scenes. The new method performs better than the original BBS algorithm and outperforms SGM and W4, but it is still not as performant as the more complex algorithms MGM and LOTS. Since the new approach is modular, the same scheme could also be applied to those more complex algorithms.

For some of the evaluated sequences there was the problem of sudden camera jitter, which can be caused by a flying object or strong wind. This sudden motion causes the motion detector to fail and identify the whole scene as a moving object; although the system readjusts itself after a while, this causes a large drop in performance.

5.6.5 Comments on the results

The evaluation results shown above were affected by some very important factors:

1. For some of the evaluated sequences there was the problem of sudden camera jitter, which can be caused by a flying object or strong wind. This sudden motion causes the motion detector to fail and identify the whole scene as a moving object; although the system readjusts itself after a while, this causes a large drop in performance.

2. The number of objects in the sequence alters the performance significantly: the optimum performance occurs in the case of a few independent objects in the scene. Increasing the number of objects increases the dynamic occlusion and the loss of track.

Chapter 6

Machine learning for visual object-detection

Contents
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 58
6.2 The theory of boosting . . . . . . . . . . . . . . . . . . . . . 58
6.2.1 Conventions and definitions . . . . . . . . . . . . . . . . . 58
6.2.2 Boosting algorithms . . . . . . . . . . . . . . . . . . . . . 60
6.2.3 AdaBoost . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.2.4 Weak classifier . . . . . . . . . . . . . . . . . . . . . . . 63
6.2.5 Weak learner . . . . . . . . . . . . . . . . . . . . . . . . 63
6.3 Visual domain . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.3.1 Static detector . . . . . . . . . . . . . . . . . . . . . . . 67
6.3.2 Dynamic detector . . . . . . . . . . . . . . . . . . . . . . 68
6.3.3 Weak classifiers . . . . . . . . . . . . . . . . . . . . . . 68
6.3.4 Genetic weak learner interface . . . . . . . . . . . . . . . 75
6.3.5 Cascade of classifiers . . . . . . . . . . . . . . . . . . . 76
6.3.6 Visual finder . . . . . . . . . . . . . . . . . . . . . . . . 77
6.4 LibAdaBoost: Library for Adaptive Boosting . . . . . . . . . . . 80
6.4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . 80
6.4.2 LibAdaBoost functional overview . . . . . . . . . . . . . . . 81
6.4.3 LibAdaBoost architectural overview . . . . . . . . . . . . . 85
6.4.4 LibAdaBoost content overview . . . . . . . . . . . . . . . . 86
6.4.5 Comparison to previous work . . . . . . . . . . . . . . . . . 87
6.5 Use cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.5.1 Car detection . . . . . . . . . . . . . . . . . . . . . . . . 89
6.5.2 Face detection . . . . . . . . . . . . . . . . . . . . . . . 90
6.5.3 People detection . . . . . . . . . . . . . . . . . . . . . . 91
6.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

6.1 Introduction

In this chapter we present a machine learning framework for visual object-detection. We call this framework LibAdaBoost, an abbreviation of Library for Adaptive Boosting. The framework is specialized in boosting. LibAdaBoost supports visual data types and provides a set of tools that are ready for use by the computer vision community. These tools are useful for generating visual detectors using state of the art boosting algorithms. This chapter is organized as follows: section 6.2 gives a general introduction to boosting. Section 6.3 describes the use of boosting for visual object-detection. Section 6.4 introduces LibAdaBoost, its software architecture and the subsequent workflows for learning, validation and test tasks. Section 6.5 describes use cases of visual object-detection with LibAdaBoost; each use case is described by its learning part, its validation part and its test on benchmarked datasets. Section 6.6 concludes the chapter.

6.2 The theory of boosting

Boosting has been presented as a very effective technique to help learning algorithms. As its name implies, this method boosts the hypotheses given by a weak learner. AdaBoost, one of the most interesting boosting algorithms, was first proposed by Schapire and Freund [FS95a] in the 90's. It was further developed in the work of Schapire and Singer [SS98], where it was generalized so that the first version became a special case of the second. A good introduction to AdaBoost is also available in [FS99]. Boosting was first used for visual object detection by Viola and Jones [VJ01b] for detecting faces. Other works followed for the detection of pedestrians [VJS03], cars [KNAL05], license plates [ZJHW06], and other objects [ASG05]. This increasing usage of boosting motivated us to develop a unified platform for different boosting algorithms, data types, and the subsequent weak learners and classifiers.

6.2.1 Conventions and definitions

In this section we introduce some definitions related to the learning process. These definitions are useful for understanding section 6.2.2. Whether we are talking about machines or humans, the learning process can be decomposed into two main parts, as shown in Figure 6.1. First of all, one has to be able to retain previously seen data so that it can be identified when it is seen again. The second step consists in applying the knowledge gained from the first step to new data. This is known as generalization and is the main goal of the learner. Thus we introduce the definition of the generalization error in Definition 1.

Figure 6.1: Learning process: the first step consists in retaining previously seen data so that one is able to identify it when it is seen again. The second step consists in applying the knowledge gained from the previous step to new data; this step is called generalization.

Definition 1 (Generalization error). The generalization error is the error rate of an algorithm trying to apply its expertise to new samples.

Formally, we can define a learning algorithm as in Definition 2.

Definition 2 (Learning algorithm). A learning algorithm is a procedure which can reduce the error on a set of training data [DHS00] as well as minimize the generalization error.

The training set is the input data for the learning algorithm. It is defined in Definition 3.

Definition 3 (Training set). A training set S is defined by (x1, y1), ..., (xn, yn) where each xi belongs to some domain or instance space X, and each label yi belongs to some label set Y.

In this work we focus on learning using classifiers, defined in Definition 4. An example of a classifier would be a voting system which, given an email, is able to classify it as spam or safe.

Definition 4 (Knowledge). We define the knowledge as a classifier C : X → Y which computes, for a given sample x, the appropriate label y.

In the following sections we will be using binary classifiers. A binary classifier is defined in Definition 5.

Definition 5 (Binary classifier). C is a binary classifier if Y = {−1, +1}.

6.2.2 Boosting algorithms

To understand boosting we will use the horse-racing gambler example which is taken from [FS99]. A horse-racing gambler, hoping to maximize his winnings, decides to create a computer program that will accurately predict the winner of a horse race based on the usual information (number of races recently won by each horse, betting odds for each horse, etc.). To create such a program, he asks a highly successful expert gambler to explain his betting strategy. Not surprisingly, the expert is unable to articulate a grand set of rules for selecting a horse. On the other hand, when presented with the data for a specific set of races, the expert has no trouble coming up with a ”rule of thumb” for that set of races (such as, ”Bet on the horse that has recently won the most races” or ”Bet on the horse with the most favored odds”). Although such a rule of thumb, by itself, is obviously very rough and inaccurate, it is not unreasonable to expect it to provide predictions that are at least a little bit better than random guessing. Furthermore, by repeatedly asking the expert’s opinion on different collections of races, the gambler is able to extract many rules of thumb. In order to use these rules of thumb to maximum advantage, there are two problems faced by the gambler: first, how should he choose the collections of races presented to the expert so as to extract rules of thumb from the expert that will be the most useful? Second, once he has collected many rules of thumb, how can they be combined into a single, highly accurate prediction rule? Boosting refers to a general and provably effective method of producing a very accurate prediction rule by combining rough and moderately inaccurate rules of thumb in a manner similar to that suggested above.

Figure 6.2: Boosting algorithm framework


Historical overview

The following summary is based on [FS99]. Originally, boosting started from a theory of machine learning called "Probably Approximately Correct" (PAC) [Val84, KV94]. This model allows us to derive bounds on the estimation error for a choice of model space and given training data. In [KV88, KV89] the question was asked whether a "weak" learning algorithm, which works just slightly better than random guessing (according to the PAC model), can be made stronger (or "boosted") to create a "strong" learning algorithm. The first researcher to develop an algorithm and to prove its correctness was Schapire [Sch89]. Later, Freund [Fre90] developed a more efficient boosting algorithm that was optimal but had several practical problems. Some experiments with these algorithms were done by Drucker, Schapire and Simard [DSS93] on an OCR task.

6.2.3 AdaBoost

The AdaBoost algorithm was introduced in 1995 by Freund and Schapire [FS95a]. AdaBoost solved many of the practical problems that existed in the previous boosting algorithms [FS95a]. Algorithm 1 gives the original AdaBoost algorithm. The algorithm takes as input a training set {(x1, y1), . . . , (xm, ym)} where each xi belongs to some domain X and each label yi is in some label set Y. For our purposes we need only binary classification and thus can assume Y = {+1, −1}; generalization to the multi-class case is outside the focus of this thesis.

Algorithm 1 The AdaBoost algorithm in its original form
Require: a sequence of m labeled samples {(x1, y1), . . . , (xm, ym)} where xi ∈ X and yi ∈ Y = {−1, +1}
1: Initialize the weight vector: D_1(i) = 1/m
2: for t = 1 to T do
3:   Train the weak learner, providing it with the distribution D_t.
4:   Get a weak classifier (weak hypothesis) h_t : X → {−1, +1} with error ε_t = Pr_{i∼D_t}[h_t(x_i) ≠ y_i].
5:   Choose α_t = (1/2) ln((1 − ε_t)/ε_t)
6:   Update: D_{t+1}(i) = D_t(i) × e^{−α_t} if h_t(x_i) = y_i, and D_{t+1}(i) = D_t(i) × e^{α_t} if h_t(x_i) ≠ y_i
7:   Normalize the weights so that D_{t+1} is a distribution: Σ_{i=1}^{m} D_{t+1}(i) = 1.
8: end for
Ensure: Strong classifier H(x) = sign(Σ_{t=1}^{T} α_t h_t(x))

AdaBoost calls a given weak learning algorithm repeatedly in a series of "cycles" t = 1 . . . T. The most important idea of the algorithm is that it holds at all times a distribution over the training set (i.e. a weight for each sample). We will denote the weight of this distribution on training sample (xi, yi) on round t by D_t(i). At the beginning, all weights are set equally, but


on each round, the weights of incorrectly classified samples are increased so that the weak learner is forced to focus on the hard samples in the training set. The role of the weak learner is to find a weak rule (classifier) h_t : X → Y appropriate for the distribution D_t. But what is "appropriate"? The "quality" of a weak classifier is measured by its error, computed according to the weights:

ε_t = Σ_{i : h_t(x_i) ≠ y_i} D_t(i)    (6.1)

The error is measured, of course, with respect to the distribution D_t on which the weak learner was trained, so each time the weak learner finds a different weak classifier. In practice, the weak learner is an algorithm (it does not matter which one) that is allowed to use the weights D_t on the training samples. Coming back to the horse-racing example given above, the instances x_i correspond to descriptions of horse races (such as which horses are running, what the odds are, the track records of each horse, etc.) and the label y_i gives the binary outcome of each race (i.e., whether your horse won or not). The weak classifiers are the rules of thumb provided by the expert gambler, where each time he chooses a different set of races, according to the distribution D_t. Algorithm 2 shows the AdaBoost algorithm running in the head of the gambler.

Algorithm 2 The AdaBoost algorithm running in the head of the gambler
1: If I look at the entire set of races I have seen in my life, I can tell that it is very important that the other horses (other than the one I gamble on) are not too strong. This is an important rule.
2: I will now concentrate on the races where my horse was the strongest horse but nevertheless did not win. I recall that in some of these races it was because there was a lot of wind, and even a strong horse might be sensitive to wind. This cannot be known in advance about a certain horse.
3: In another case the horse I chose was not the strongest, but it won anyway. This was because it was in the outer lane, and it is easier to run there.
Ensure: Resulting strong classifier:
• The other horses (other than the one I gamble on) should not be too strong. Importance of rule: 7 (on a scale of 1 to 10).
• There should not be wind on the day of the race. Importance: 3.
• The horse should be in the outer lane. Importance: 2.

Note that the importance of each weak hypothesis is calculated according to its error during the execution of AdaBoost; this corresponds to the α of the rule in the algorithm (see Algorithm 1). Once the weak hypothesis h_t has been received from the weak learner, AdaBoost chooses a parameter α_t. Intuitively, α_t measures the importance that is assigned to h_t. Note that α_t ≥ 0 if ε_t < 1/2 (which we can assume without loss of generality), and that α_t gets larger as ε_t gets smaller. The distribution D_t is then updated using


the rule shown in Algorithm 1. This rule actually increases the weight of samples misclassified by h_t and decreases the weight of correctly classified samples; thus, the weight tends to concentrate on "hard" samples. The final hypothesis H is a weighted majority vote of the T weak hypotheses, where α_t is the weight assigned to h_t.

Variations on AdaBoost

AdaBoost as shown in the previous section can only handle binary classification problems. Various extensions have been created (for multi-class and multi-label problems), such as the ones described in [FS95b] known as AdaBoost.M1 and AdaBoost.M2; simpler forms of these can also be found in [SS98]. There have also been several important additions to AdaBoost. One of the most interesting, used in language recognition problems, is prior knowledge [RSR+02]. It enables adding human knowledge to improve the classification properties of the algorithm: it basically gives some knowledge to the learner before training. This has been shown to be especially useful when the number of training samples is limited.
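To make the training loop of Algorithm 1 concrete, the following sketch shows a minimal Python implementation of the binary AdaBoost described above. It is only an illustrative sketch, not LibAdaBoost code: weak_learner is assumed to be any callable that returns a classifier h(x) ∈ {−1, +1} trained on the weighted sample set.

    import math

    def adaboost(samples, labels, weak_learner, T):
        """Binary AdaBoost (sketch of Algorithm 1). Labels must be in {-1, +1}."""
        m = len(samples)
        D = [1.0 / m] * m                      # initial uniform distribution
        strong = []                            # list of (alpha_t, h_t)
        for t in range(T):
            h = weak_learner(samples, labels, D)   # train on weighted samples
            err = sum(D[i] for i in range(m) if h(samples[i]) != labels[i])
            err = min(max(err, 1e-10), 1.0 - 1e-10)    # keep the log well defined
            alpha = 0.5 * math.log((1.0 - err) / err)
            # increase the weights of misclassified samples, decrease the others
            D = [D[i] * math.exp(-alpha * labels[i] * h(samples[i])) for i in range(m)]
            Z = sum(D)
            D = [d / Z for d in D]                 # renormalize to a distribution
            strong.append((alpha, h))

        def H(x):                                  # final weighted majority vote
            return 1 if sum(a * h(x) for a, h in strong) >= 0 else -1
        return H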

6.2.4 Weak classifier

A weak classifier depends on the domain X. In this thesis we focus on the visual domain. Definition 6 gives the description of a weak classifier in a generic boosting framework; in section 6.3.3 we present a set of weak classifiers for the visual domain.

Definition 6 (Weak classifier). Let X be a given domain and Y a set of labels; we consider the binary classification case Y = {−1, +1}. A weak classifier h : X → Y is a classifier that is used as a hypothesis in the boosting framework.
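As a concrete (non-visual) illustration of Definition 6, the sketch below shows a decision stump, one of the simplest possible weak classifiers over X = R^d. It is only an example to fix ideas; the visual weak classifiers actually used in this work are described in section 6.3.3.

    class DecisionStump:
        """Weak classifier h : R^d -> {-1, +1} that thresholds a single coordinate."""
        def __init__(self, dim, threshold, polarity=1):
            self.dim = dim              # which coordinate is tested
            self.threshold = threshold  # decision threshold
            self.polarity = polarity    # +1 or -1, orientation of the test

        def __call__(self, x):
            return self.polarity if x[self.dim] > self.threshold else -self.polarity

    # Example: classify 2D points according to their second coordinate
    h = DecisionStump(dim=1, threshold=0.5)
    print(h((0.2, 0.9)))   # +1
    print(h((0.2, 0.1)))   # -1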

6.2.5 Weak learner

The AdaBoost algorithm needs a weak classifier provider, called the weak learner. As described in Definition 7, the weak learner is a learning algorithm which provides a "good" weak classifier at each boosting cycle, where goodness is measured according to the current weights of the samples. Obviously, choosing the best weak classifier at each boosting cycle cannot be done by testing all the possibilities in the weak classifier space. Therefore, a partial (selective) search strategy can be used to find a good weak classifier; this weak classifier is not always the best over the whole space. Such partial strategies can be based on genetic-like approaches or other heuristics.

Definition 7 (Weak learner). A weak learner is a learning algorithm responsible for the generation of a weak classifier (weak hypothesis). A weak learner is independent of the domain X. AdaBoost needs a weak learner with an error rate better than random.


In this section we describe an improvement of a genetic-like algorithm that was first introduced by Abramson [Abr06]. In our work we provide a generic implementation of the algorithm that is completely independent of the domain X.

Genetic-like weak learner

The genetic-like algorithm, given in Algorithm 3, maintains a set of weak classifiers which are initialized as random ones. At each step of the algorithm, a new "generation" of weak classifiers is produced by applying a set of mutations to each of the weak classifiers. All the mutations are tested and the one with the lowest error may replace the "parent" if it has a lower error. In addition, some random weak classifiers are added at each step.

Algorithm 3 The genetic-like algorithm from [Abr06]
Require: An error measurement Γ that matches an error to every weak classifier, a number G of generations to run, and a number N of "survivors" at each generation.
1: Let Υ be a generator of random weak classifiers. Initialize the first generation's weak classifier vector c^1 as: c^1_i ← Υ, i = 1 . . . N
2: for g = 1 to G do
3:   Build the next generation vector c^{g+1} as: c^{g+1}_i ← c^g_i, i = 1 . . . N
4:   for each mutation m in MutationSet do
5:     if m(c^g_i) is valid and Γ(m(c^g_i)) < Γ(c^{g+1}_i) then
6:       c^{g+1}_i ← m(c^g_i), i = 1 . . . N
7:     end if
8:   end for
9:   Enlarge the vector c^{g+1} with N additional weak classifiers, chosen randomly: c^{g+1}_i ← Υ, i = N + 1, . . . , 2N
10:  Sort the vector c^{g+1} such that Γ(c^{g+1}_1) ≤ Γ(c^{g+1}_2) ≤ · · · ≤ Γ(c^{g+1}_{2N})
11: end for
12: return the weak classifier c^{G+1}_1

The genetic algorithm continues to evolve the generations until there is no improvement during G consecutive generations. An improvement, for that matter, is a reduction in the error of the best feature in the generation.

Definition 8 (Error measurement). Given a learning sample set Ω and a classifier Ψ, an error measurement relative to Ω is a function Γ_Ω that matches to Ψ a weighted error relative to the weights in Ω.

In our work, we implemented an advanced version of the previous weak learner (Algorithm 3). The new advanced genetic-like algorithm (AGA) is shown in Algorithm 4.


Algorithm 4 AGA: advanced version of the genetic-like algorithm
Require: A learning sample set Ω and its associated error measurement Γ_Ω (see Definition 8).
1: Let C be a vector of N classifiers; C_k is the k-th element of C.
2: Let C^g be the state of C at the g-th iteration.
3: Let Υ be a generator of random weak classifiers.
4: C^1_i ← Υ, i = 1 . . . N
5: I_NoImprovement ← 0, g ← 1
6: Ψ ← C^1_1
7: while true do
8:   g ← g + 1
9:   Sort the vector C^g such that Γ_Ω(C^g_1) ≤ Γ_Ω(C^g_2) ≤ · · · ≤ Γ_Ω(C^g_N)
10:  if Γ_Ω(Ψ) ≥ Γ_Ω(C^g_1) then
11:    I_NoImprovement ← 0
12:    Ψ ← C^g_1
13:  else
14:    I_NoImprovement ← I_NoImprovement + 1
15:  end if
16:  if I_NoImprovement = N_maxgen then
17:    return Ψ
18:  end if
19:  for i = 1 to N_best do
20:    if Γ(m(C^g_i)) < Γ(C^g_i) then
21:      C^g_i ← m(C^g_i)
22:    end if
23:  end for
24:  for i = N_best + 1 to N_best + N_looser do
25:    if Γ(Υ(C^g_i)) < Γ(C^g_i) then
26:      C^g_i ← Υ(C^g_i)
27:    end if
28:  end for
29:  for i = N_best + N_looser + 1 to N do
30:    C^g_i ← Υ
31:  end for
32: end while
Ensure: the resulting weak classifier Ψ has the lowest weighted error on the learning sample set.


The main differences introduced with AGA (see Algorithm 4) are as follows:

• The size of the population is fixed to N. The population is partitioned into three classes: best to keep, looser to keep, and looser. The best to keep part is of size N_best; it contains the classifiers with the lowest error, which are replaced by the result of the mutation operation with several mutation types (see Algorithm 4, line 19). The looser to keep part is of size N_looser; it contains the classifiers which are replaced by randomized classifiers. A randomized classifier is generated by simple modifications of an initial classifier, so the resulting classifier is not completely random (see Algorithm 4, line 24). The looser part is of size N − (N_best + N_looser); it contains the classifiers which are rejected from the population and replaced by new random ones (see Algorithm 4, line 29).

• The number of iterations is not fixed to a maximum number of generations G. Instead, we introduce the notion of a maximum number of iterations without improvement, N_maxgen: the loop stops once there is no improvement of Ψ during N_maxgen iterations of the weak learner.

Parameter   Description
N           Size of the population
N_best      Number of classifiers to mutate
N_looser    Number of classifiers to randomize
N_maxgen    Maximum number of iterations without improvement

Table 6.1: AGA: parameters description

The AGA algorithm has a set of parameters which are summarized in Table 6.1. We elaborated an experimental study to find a set of best parameters for this algorithm. Our experimental tests show that the AGA algorithm performs better than the previous GA algorithm.
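The following sketch illustrates the structure of the AGA weak learner (Algorithm 4) in Python. It is a simplified illustration, not the LibAdaBoost implementation: random_classifier, mutate and randomize stand for the generic interface described in section 6.3.4, and error is the weighted error measurement Γ_Ω.

    def aga_weak_learner(error, random_classifier, mutate, randomize,
                         N=100, N_best=10, N_looser=10, N_maxgen=20):
        """Advanced genetic-like weak learner (sketch of Algorithm 4)."""
        population = [random_classifier() for _ in range(N)]
        best, no_improvement = population[0], 0
        while True:
            population.sort(key=error)              # lowest weighted error first
            if error(population[0]) <= error(best):
                best, no_improvement = population[0], 0
            else:
                no_improvement += 1
            if no_improvement == N_maxgen:
                return best
            # best-to-keep part: replaced by a mutation when it improves the error
            for i in range(N_best):
                candidate = mutate(population[i])
                if error(candidate) < error(population[i]):
                    population[i] = candidate
            # looser-to-keep part: replaced by randomized variants when they improve
            for i in range(N_best, N_best + N_looser):
                candidate = randomize(population[i])
                if error(candidate) < error(population[i]):
                    population[i] = candidate
            # looser part: replaced by brand new random classifiers
            for i in range(N_best + N_looser, N):
                population[i] = random_classifier()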

6.3 Visual domain

The visual domain consists in working with images. The objective is to find visual objects in a given image. A visual object can be a face, a car, a pedestrian or any other object of interest. Figure 6.3 shows some examples of visual objects: a visual object has a width and a height, and is characterized by a knowledge generated using a learning procedure. In this context we consider a sliding window technique for finding visual objects in input images. Sliding window techniques follow a top-down approach.



Figure 6.3: Visual object-detection examples

These techniques use a detection window of a fixed size and place this detection window over the input image. The algorithm then determines whether the content of the image inside the window represents an object of interest or not. When it finishes processing the window content, the window slides to another location on the input image and the algorithm again tries to determine whether the content of the detection window represents an object of interest or not. This procedure continues until all image locations are checked. To detect an object of interest at different scales, the image is usually scaled down a number of times to form a so-called image pyramid, and the detection procedure is repeated at each scale. In this section we discuss the different elements of a visual detector: sections 6.3.1 and 6.3.2 describe static and dynamic detectors. Section 6.3.3 presents the different visual weak classifiers, from the state of the art as well as from our work. The learning procedure with these weak classifiers is presented in section 6.3.4. Section 6.3.6 describes the test procedure, the so-called finder.
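The sliding-window and image-pyramid procedure described above can be summarized by the following sketch. It is a schematic illustration and not LibAdaBoost code: classify and downscale are placeholders supplied by the caller; a real implementation would also share preprocessing between windows, as discussed in section 6.3.6.

    def detect(image, classify, downscale, win_h=24, win_w=24, scale=1.25, step=4):
        """Scan an image pyramid with a fixed-size detection window.

        classify(window) -> bool decides whether a window contains the object;
        downscale(img, factor) returns img resized by the given factor (< 1).
        Both are injected by the caller (e.g. OpenCV or PIL based helpers)."""
        detections = []
        factor = 1.0
        while image.shape[0] >= win_h and image.shape[1] >= win_w:
            H, W = image.shape[:2]
            for y in range(0, H - win_h + 1, step):
                for x in range(0, W - win_w + 1, step):
                    if classify(image[y:y + win_h, x:x + win_w]):
                        # report the detection in the coordinates of the original image
                        detections.append((int(x * factor), int(y * factor),
                                           int(win_w * factor), int(win_h * factor)))
            factor *= scale
            image = downscale(image, 1.0 / scale)
        return detections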

6.3.1 Static detector

A static detector uses spatial information only and does not take temporal information into account. Thus, the visual object is represented only by an image with dimensions Height × Width. This image sample is called a static image sample and is formally presented in Definition 9.

Definition 9 (Static image sample). A static image sample I^s is a visual sample which represents an image of dimensions Height × Width.

The static image sample is used as input for static detectors. The static detector can be applied to single images as well as to video streams; it handles each image of the video stream independently of the others. The information on the motion is not used, otherwise it would become a dynamic detector, presented in the next section.

6.3.2 Dynamic detector

In order to use the motion information in a video stream, we introduce the notion of dynamic detector. The dynamic detector can be applied to video streams only. The detector uses the current image I and a buffer of the p next images. Thus, the image sample contains image information from multiple image sources, as described in Definition 10.

Definition 10 (Dynamic image sample). A dynamic image sample I^d is a visual sample which represents a collection of images I^d_t, I^d_{t+1}, . . . , I^d_{t+p} with identical dimensions Height × Width. These images correspond to the same window in a video stream.

These images, distributed over the temporal space, contain information on the motion of the visual object. This information is integrated into the knowledge on the visual object. The dynamic image sample is used as input for dynamic detectors.

6.3.3 Weak classifiers

A visual weak classifier (visual feature) has a two-stage architecture (see Figure 6.4): the first stage performs the preprocessing on the input image sample (static or dynamic). The preprocessing result is then used as input to the classification rule. The classification rule is a simple computation function which decides whether the input corresponds to the object of interest or not. In the following sections we describe

Figure 6.4: Visual weak classifier (visual feature) pattern

the most common visual features. We map these features to the architecture pattern described in Figure 6.4. This projection is useful for the subsequent complexity analysis and for the generic implementation described in section 6.4.

Viola and Jones rectangular features

In [VJ01b] Paul Viola and Michael Jones learned a visual detector for face detection using AdaBoost. The learned detector is based on rectangular features (visual weak classifiers) which span a rectangular area of pixels within the detection window, and which can be computed rapidly using a new image representation called the integral image. This section discusses the rectangular features used by Viola and Jones, shown in Figure 6.5. First we discuss the classification rules.


Figure 6.5: Viola and Jones rectangular features

Then we describe the preprocessing stage, which introduces the integral image. Finally, we discuss how these features can be computed using the integral image.

Figure 6.6: The five classification rules used by Viola and Jones placed in the detection window: (a) FvEdge, (b) FhEdge, (c) FvLine, (d) FhLine, (e) FDiagonal.

1. Rectangular features: Figure 6.6 shows the five rectangular features used by Viola and Jones placed in the detection window. To compute a feature, the sum of the pixels in the dark area is subtracted from the sum of the pixels in the light area. Notice that we could also subtract the light area from the dark area; the only difference is the sign of the result. The vertical and horizontal two-rectangle features FvEdge and FhEdge are shown in Figure 6.6(a) and Figure 6.6(b) respectively, and tend to focus on edges. The vertical and horizontal three-rectangle filters FvLine and FhLine are shown in Figure 6.6(c) and Figure 6.6(d) respectively, and tend to focus on lines. The four-rectangle feature FDiagonal in Figure 6.6(e) tends to focus on diagonal lines.

2. Integral image: Viola and Jones propose a special image representation called the integral image to compute the rectangular filters very rapidly. The integral image is in fact equivalent to the Summed Area Table (SAT) used as a texture mapping technique, first presented by Crow in [Cro84].


Figure 6.7: Integral image

Viola and Jones renamed the SAT to integral image to distinguish the purpose of use: texture mapping versus image analysis. The integral image at location (x, y) is defined as the sum of all pixels above and to the left of (x, y) (see Figure 6.7):

ii(x, y) = Σ_{j=0..x} Σ_{k=0..y} I(j, k)

where ii(x, y) is the integral image value at (x, y), and I(x, y) is the original image value.

3. Feature computation: When using the integral image representation previously described, any pixel sum of a rectangular region (see Figure 6.8) can be computed with only four lookups, two subtractions and one addition, as shown in the following equation:

X = (ii(x + w, y + h) + ii(x, y)) − (ii(x + w, y) + ii(x, y + h))

Note that this sum is computed in constant time regardless of the size of the rectangular region.

Figure 6.8: Calculation of the pixel sum in a rectangular region using the integral image

The five features as proposed by Viola and Jones consist of two or more rectangular regions which need to be added together or subtracted


from each other. Thus, the computation of a feature is reduced to a finite number of pixel sums of rectangular regions, which can be done easily and in constant time regardless of the size and position of the feature.

4. Image normalization: Viola and Jones normalized the images to unit variance during training to minimize the effect of different lighting conditions. To normalize the image during detection, they post-multiply the filter results by the standard deviation σ of the image in the detection window, rather than operating directly on the pixel values. Viola and Jones compute the standard deviation of the pixel values in the detection window using the following equation:

σ = sqrt( µ² − (1/N) Σ x² )

where µ is the mean, x is the pixel value and N is the total number of pixels inside the detection window. The mean can be computed using the integral image. To calculate Σ x², Viola and Jones use a squared integral image, which is an integral image computed over the squared image values. Computing σ only requires eight lookups and a few instructions.
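A minimal sketch of the integral image and of the constant-time rectangle sum it enables is given below (using NumPy). It is an illustration of the equations above, not the LibAdaBoost code.

    import numpy as np

    def integral_image(img):
        """Integral image with a leading row/column of zeros, so that
        ii[y, x] = sum of img[0:y, 0:x] (all pixels above and to the left)."""
        ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
        ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
        return ii

    def box_sum(ii, x, y, w, h):
        """Pixel sum of the w x h rectangle with top-left corner (x, y):
        four lookups, two subtractions and one addition, as in the equation above."""
        return (ii[y + h, x + w] + ii[y, x]) - (ii[y, x + w] + ii[y + h, x])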

Control-points features

Figure 6.9: Control-points features


system in a way which cand be experimented using LibAdaBoost, and in our system we chose K = 6. Each feature either works on the original W × H image, on a halfresolution 12 W × 12 H image, or on a quarter resolution 14 W × 14 H image. These two additional scales have to be prepared in advance by downscaling the original image. To classify a given image, a feature examines the pixel values in the control-points x1 . . . xn and y1 . . . ym in the relevant image (original, half or quarter). The feature answers ”yes” if and only if for every control-point x ∈ x1 . . . xn and every controlpoint y ∈ y1 . . . ym , we have val(x) > val(y). Some examples are given in Figure 6.10. The control-points feature is testing the relations between pixels. It seeks for a predefined set of pixels in the image in a certain order. If such is not found, it labels negatively. Because it is order-based and not intensity-based, it does not care about what is the difference between two pixels’ values; all it cares about is the sign of that difference. It is therefore insensitive to luminance normalization. In fact, the feature is insensitive to any order-preserving change of the image histogram, including histogram equalization, which is used by Papageorgiou et al [OPS+ 97]. The


Figure 6.10: Control-points features

The 3-resolution method helps to make the feature less sensitive to noise; a single control-point in the lower resolutions can be viewed as a region and thus detects large image components, offering an implicit smoothing of the input data. One can immediately notice that, to compute the result of an individual feature, an efficient implementation does not always have to check all the m + n control-points: the calculation can be stopped as soon as the condition of the feature is broken, which usually happens well before all the control-points are checked.
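The following sketch shows how a control-points feature could classify a window at one of the three resolutions, including the early exit mentioned above. It is an illustrative sketch; the actual implementation in LibAdaBoost is described in section 6.4.

    def control_points_answer(image, xs, ys):
        """Answer 'yes' iff every control-point in xs is brighter than every one in ys.

        image  : 2D array of pixel values (original, half or quarter resolution)
        xs, ys : lists of (i, j) control-point locations, each of size at most K
        """
        min_x = min(image[i][j] for (i, j) in xs)
        for (i, j) in ys:
            if image[i][j] >= min_x:      # condition broken: stop immediately
                return False
        return True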


Viola and Jones motion-based features

In [VJS03] Viola et al. introduced the first dynamic visual feature, which mixes patterns of motion and appearance for pedestrian detection. This section explains these features, which can be considered a natural generalization of the rectangular visual features presented above. Figure 6.11 shows a system block representation of the motion-based features.

Figure 6.11: Viola and Jones motion-based rectangular features

These features operate on a pair of images [It, It+1]. Motion is extracted from the pair of images using the following filters:

1. ∆ = abs(It − It+1)
2. U = abs(It − It+1 ↑)
3. L = abs(It − It+1 ←)
4. R = abs(It − It+1 →)
5. D = abs(It − It+1 ↓)


Figure 6.12: Viola and Jones motion-based rectangular features

where It and It+1 are consecutive images in time, and {↑, ←, →, ↓} are image shift operators. One type of filter compares sums of absolute differences between ∆ and one of {U, L, R, D}:

f_i = r_i(∆) − r_i(S)


where S is one of {U, L, R, D} and r_i() is a single box rectangular sum within the detection window. These filters extract information related to the likelihood that a particular region is moving in a given direction. The second type of filter compares sums within the same motion image:

f_j = Φ_j(S)

where Φ is one of the rectangle filters shown in Figure 6.12. These features measure something closer to motion shear. Finally, a third type of filter measures the magnitude of motion in one of the motion images:

f_k = r_k(S)

where S is one of {U, L, R, D} and r_k is a single box rectangular sum within the detection window. We also use appearance filters, which are simply rectangle filters operating on the first input image It:

f_m = Φ(It)

The motion filters as well as the appearance filters can be evaluated rapidly using the integral image previously described in section 6.3.3. A feature F is simply a thresholded filter that outputs one of two votes:

F_i(It, It+1) = α if f_i(It, ∆, U, L, R, D) > t_i, and β otherwise    (6.2)

where t_i ∈ R is a feature threshold and f_i is one of the motion or appearance filters defined above. The real-valued α and β are computed during AdaBoost learning (as are the filter, the filter threshold and the classifier threshold). In order to support detection at multiple scales, the image shift operators {↑, ←, →, ↓} must be defined with respect to the detection scale.

Extended rectangular features

In this section we focus on the question whether more complex visual features based on rectangular features would increase the performance of the classifier. Therefore, we propose two categories of features based on the same pattern as the rectangular features. This means that these features use the integral image representation as well as the image normalization based on the squared integral image representation (see Figure 6.13). The visual filters introduced by these features are as follows:

• Box-based rectangular features: these features are based on a basic box which encapsulates another box. The internal box is located in the middle as in Fbox, or at a corner as in {FUL, FDL, FDR, FUR} (see Figure 6.14).

• Grid-based rectangular features: these features are based on the previous rectangular features FvLine, FhLine, and FDiagonal. We introduce a lower granularity by decomposing each rectangular region into several smaller regions (see Figure 6.15).


Figure 6.13: Extended rectangular features

Figure 6.14: Classification rules used within extended rectangular features: (a) FUL, (b) FDL, (c) FDR, (d) FUR, (e) Fbox.

6.3.4 Genetic weak learner interface

The genetic-like weak learner described in section 6.2.5 operates on weak classifiers through a genetic interface. The genetic interface provides three operations: randomize, mutate and crossover. In this work we only handled the first two operations, randomize and mutate. In this section we describe the genetic interface for each of the previous visual features. The visual features are decomposed into two categories: the rectangular features (Viola and Jones rectangular features, motion-based rectangular features and the extended rectangular features) and the control-points features. The first category consists in a set of filters with a generic pattern defined by a rectangular region at a given position in the detection window; this rectangular region is decomposed into multiple rectangular regions, which generates the different filters. The genetic interface is therefore based on this rectangular region and is shared by all the features in the category. The second category is the control-points features, which is described separately.

Rectangular filters

Given an image sample I of width W and height H, and a rectangular filter F_ct of category c and type t, where c and t are given in Table 6.2.


Figure 6.15: Extended rectangular features: (a) FVGrid, (b) FHGrid, (c) FGrid.

The randomize operation consists in moving the rectangular region randomly within a limited area around its current position; it preserves the type and the category of the filter. The mutate operation consists in modifying the dimensions of the rectangular region (scale operation) and in modifying the type of the filter without changing its category.

Control-points filters

Given an image sample I of width W and height H, and a control-points feature F_r, r ∈ {R0, R1, R2}. The randomize operation consists in modifying the position of a red point or a blue point within a limited local area; it preserves the resolution r. The mutate operation consists in adding, removing or moving a control-point (red or blue).
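As an illustration of this genetic interface, the sketch below shows what the randomize and mutate operations could look like for a control-points feature. All names are hypothetical; the real interface in LibAdaBoost covers the rectangular filters in the same way.

    import random

    def randomize_control_points(points, max_shift=2, height=24, width=24):
        """Move each control-point inside a small local area, keeping the resolution."""
        moved = []
        for (i, j) in points:
            di = random.randint(-max_shift, max_shift)
            dj = random.randint(-max_shift, max_shift)
            moved.append((min(max(i + di, 0), height - 1),
                          min(max(j + dj, 0), width - 1)))
        return moved

    def mutate_control_points(points, height=24, width=24, k_max=6):
        """Add, remove or move one control-point (red or blue set alike)."""
        points = list(points)
        op = random.choice(["add", "remove", "move"])
        if op == "add" and len(points) < k_max:
            points.append((random.randrange(height), random.randrange(width)))
        elif op == "remove" and len(points) > 1:
            points.pop(random.randrange(len(points)))
        elif points:
            idx = random.randrange(len(points))
            points[idx] = (random.randrange(height), random.randrange(width))
        return points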

6.3.5 Cascade of classifiers

The attentional cascade of classifiers was first introduced by Viola and Jones [VJ01b] as an optimization. In a single image there is a very large number of sliding windows, of which the vast majority does not correspond to the object of interest; these sliding windows have to be rejected as soon as possible. The cascade, as shown in Figure 6.16, is composed of a sequence of strong classifiers called stages (or layers). The first layers contain few weak classifiers and are very fast; the last layers are more complex and contain more weak classifiers. The cascade is learned in such a way that the first layer rejects a large number of input sliding windows while accepting all positive sliding windows, and the next layers are tuned to reduce the false positive rate while still maintaining a high detection rate.

Figure 6.16: Attentional cascade of classifiers


The cascade of classifiers accelerates the detection process, whereas single layer classifiers maintain better accuracy and false detection rates. In this thesis we focus on single layer classifiers and on the acceleration of these classifiers using specialized hardware.
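The attentional cascade can be summarized by the following evaluation sketch: a window is accepted only if every stage accepts it, and most windows are rejected by the first, cheapest stages. This is a generic illustration, not tied to the LibAdaBoost API.

    def cascade_classify(window, stages):
        """stages: list of (weak_classifiers, threshold) pairs, ordered cheap to complex.

        Each weak classifier returns a weighted vote (alpha, h); a window must pass
        every stage to be reported as a detection."""
        for weak_classifiers, threshold in stages:
            score = sum(alpha * h(window) for alpha, h in weak_classifiers)
            if score < threshold:
                return False       # rejected early, remaining stages never evaluated
        return True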

6.3.6 Visual finder

Definition 11 (Visual data). Visual data can be seen as a collection of samples at the abstract level. In the visual domain, the visual data encapsulates It, [It+1, . . . , It+p], where p = 0 in the case of a static detector and p > 0 in the case of a dynamic detector. The visual data also encapsulates all the subsequent visual data obtained from these images. To obtain the samples (static or dynamic image samples) we use an iterator called the search strategy.

Given a knowledge Cζ relative to a visual object ζ and a visual data source π (see Definition 11), a visual finder is a component that first builds the visual pyramid for the visual data π(t) and does some preprocessing on π(t); this preprocessing is required by the classification step. Next it iterates over the sliding windows made available by an iterator named the search strategy, and each sliding window is classified using Cζ. The visual finder ends with a postprocessing step which applies spatial and temporal filters to the positively classified sliding windows. In the following sections we detail each of these functional blocks, starting with the generic patterns that can be used to group all these blocks together.

Visual finder patterns

In this section we focus on the question of where to put the preprocessing block relative to the search strategy, because it influences the performance of the whole process in terms of memory space and computation time. Figure 6.17(a) shows the first configuration, called global preprocessing, and Figure 6.17(b) shows the second configuration, called local preprocessing. Local preprocessing consists in iterating over the input visual data first: for each sliding window it builds the required preprocessing defined by Cζ, and the sliding window is responsible for storing the preprocessing data. This pattern requires less memory at the cost of more computation time. Global preprocessing consists in building global preprocessing data which is shared between all the sliding windows; each sliding window maintains a reference to the shared preprocessing interface and a position (sx, sy). This pattern requires more memory space and less computation time. These two patterns are constrained by the underlying architecture; from the functional point of view they are equivalent. Thus, in the next sections we refer to the visual finder regardless of its pattern.


Figure 6.17: Generic finder patterns: (a) global preprocessing, (b) local preprocessing

Visual pyramid

In the generic design of the visual detector, we focus on the question of how to find objects at different scales in the same image. First let us recall that the knowledge Cζ is specialized on an object ζ with dimensions h × w, which are the dimensions of the smallest object ζ that can be recognized by Cζ. To detect objects of the same category ζ with larger dimensions h′ × w′ constrained by h/w = h′/w′, the classifier Cζ is applied to the image scaled by a factor α = h/h′.

Figure 6.18: Visual pyramid: (a) object to scale association, (b) face example. ζ corresponds to 'A', and the largest object found in the image is 'E'.

Given a visual data π(t) with embedded image It (in the case of a dynamic detector we consider the same pyramid for all It+p, p = 0 . . . n), let H × W be the dimensions


of It. The largest object ζ that can be recognized in It has dimensions H × W. Therefore, the objects ζ that can be found in this image have dimensions hk × wk constrained by: hk = H . . . h, wk = W . . . w and hk/wk = h/w. We build a set of N images obtained from It by successive scaling with a factor α, where α^N = min(h/H, w/W). This set of images is called the visual pyramid (see Figure 6.18(a)). The size of the visual pyramid is N, calculated by:

N = min( ⌊log_α(h/H)⌋ , ⌊log_α(w/W)⌋ )

Table 6.3 gives the size of the pyramid for a quarter-PAL image with several values of α. The choice of α depends on the requirements of the application.

Visual finder preprocessing

Given a knowledge Cζ, let Ψ be a category of visual features and let fΨ denote a visual feature based on Ψ. For each Ψ, if Cζ contains at least one visual feature fΨ, then the preprocessing block must prepare the required data, as shown in Table 6.4.

Visual finder search strategy

Definition 12 (Visual search strategy). Given a visual data π(t), a search strategy is an iterator over the sliding windows present in π(t). The search may cover all the available sliding windows, in which case we speak of full-search; otherwise it is called partial-search.

We introduced the notion of search strategy as an abstraction of the scan method presented by Viola and Jones [VJ01b]. We call the previous method the generic visual search. It is fully customizable, which helps us compare the performance of different visual features without ambiguity. If we configure the horizontal and vertical steps to zero, it is equivalent to the full-search strategy, as shown in Figure 6.19. Otherwise, Figure 6.19 shows that these parameters dramatically decrease the number of sliding windows in the image; the accuracy of the detection decreases as well. The ideal case is to cover all the sliding windows, but this is a heavy task for traditional processors. The notion of search strategy is useful since it can be extended to non-traditional search methods. For example, in some applications the objects of interest are located in specific regions at different layers of the pyramid; this information can be measured using statistics on detection results obtained from recorded sequences.

Figure 6.19: Generic visual search strategy

Visual finder postprocessing

Given a knowledge Cζ and a visual data π(t), the detector based on Cζ is insensitive to small variations in scale and position, so a large number of detections usually occur around an object of interest ζ (see Figure 6.20(a)). To incorporate these multiple detections into a single detection, which is more practical, we apply a simple


grouping algorithm as in [VJ01b]. This algorithm combines overlapping detection rectangles into a single detection rectangle: two detections are placed in the same set if their bounding rectangles overlap, and for each set the average size and position are determined, resulting in a single detection rectangle per set of overlapping detections (see Figure 6.20(b)).

Figure 6.20: Visual postprocessing: grouping, (a) before and (b) after.
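A minimal version of this grouping step is sketched below: detections whose rectangles overlap are put in the same set, and each set is replaced by its average rectangle. This is a simplified illustration of the postprocessing described above, not the exact implementation.

    def overlap(a, b):
        """True if two rectangles (x, y, w, h) intersect."""
        ax, ay, aw, ah = a
        bx, by, bw, bh = b
        return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

    def group_detections(rects):
        """Merge overlapping detection rectangles into one average rectangle per group."""
        groups = []
        for r in rects:
            for g in groups:
                if any(overlap(r, other) for other in g):
                    g.append(r)
                    break
            else:
                groups.append([r])
        merged = []
        for g in groups:
            n = len(g)
            merged.append(tuple(sum(v[i] for v in g) // n for i in range(4)))
        return merged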

6.4 LibAdaBoost: Library for Adaptive Boosting

6.4.1 Introduction

We started developing LibAdaBoost as a unified framework for boosting; this framework also implements an extension to the computer vision domain. We designed the boosting framework on top of a basic abstraction called the "open machine learning framework". Thus, we consider LibAdaBoost as an open machine learning framework which


provides a boosting framework as well as an extension of this framework for the vision domain. Boosting is used, for instance, for signal processing (analysis and prediction), image and video processing (faces, people, cars) or speech processing (speech recognition). Every year, many new algorithms (boosting algorithms, weak learners and weak classifiers) are proposed in various international conferences and journals. It is often difficult for scientists interested in solving a particular task (say visual object detection) to implement them and compare them to their usual tools. The aim of this section is to present LibAdaBoost, a new machine learning software library available to the scientific community under the LGPL license, which implements a boosting framework and a specialized extension for computer vision. The objective is to ease the comparison between boosting algorithms and to simplify the process of extending them, by extending weak classifiers, weak learners and learners or even adding new ones. We implemented LibAdaBoost in C++ and designed it using well-known patterns, which makes LibAdaBoost easy to take in hand by most researchers. LibAdaBoost is rather large: it consists of more than twenty packages with more than two hundred classes. In this section we focus on the functional description of the different parts of LibAdaBoost and give a sufficient presentation of the global architecture; more advanced technical details are available in [Gho06]. This section is organized as follows: section 6.4.2 presents the machine learning framework abstraction as exposed in LibAdaBoost, the boosting framework built upon it, and the computer vision extension of the boosting framework. An architecture overview of LibAdaBoost is given in section 6.4.3, where we present both the plugin-based and the service oriented approaches. Section 6.4.4 covers the different plugins within LibAdaBoost and regroups them into four levels: architecture, machine learning framework, boosting framework, and computer vision extension. We compare LibAdaBoost with other available tools in section 6.4.5; this is followed by a quick conclusion.

6.4.2 LibAdaBoost functional overview

Figure 6.21 shows a high level functional view of LibAdaBoost. It consists of three conceptual levels: the machine learning framework, the boosting framework and the computer vision extension of this boosting framework. The potential users of LibAdaBoost come from two large communities: the machine learning community and the computer vision community. The machine learning community can improve the first two layers, while computer vision users are more specialized in deploying these algorithms in the computer vision domain. In the following we describe the main concepts that we introduced in each of these frameworks.


Figure 6.21: LibAdaBoost high level functional diagram

Machine learning framework

Figure 6.22: Machine learning framework principal concepts

A machine learning framework consists in providing concepts for describing learning algorithms. We focus on statistical machine learning algorithms. These algorithms can be used to build systems able to learn to solve tasks given a set of samples of that task drawn from an unknown probability distribution, together with some a priori knowledge of the task. In addition, the framework must provide means for measuring the expected generalization performance. Thus, in this machine learning framework (see Figure 6.22), we provide the following main concepts:

• DataSet, Sample and Pre-Interface: these concepts cover data representation. A Sample represents an elementary data input and a DataSet represents a set of Samples which is used as input for the learning system. The preprocessing


interface represents a way of extending the capabilities of a Sample so that it can communicate with higher level concepts, such as the learning system.

• Classifier: this concept represents the knowledge that is used to classify a given Sample. A classifier can be an instance of a neural network, a support vector machine, a hidden Markov model or a boosted strong classifier.

• Learner and Learner Interface: a learner is used to generate a Classifier according to a given criterion and a given DataSet. It tests its generalization using another DataSet (or the same DataSet).

• Metric and Measure Operator: a metric represents a measure of interest such as the classification error, or a more advanced one such as the AUC of a ROC curve. A measure operator is the functional block responsible for the generation of a given metric.
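To make these concepts concrete, the sketch below expresses them as minimal Python interfaces. The names mirror the concepts above, but the signatures are illustrative assumptions and not the actual C++ API of LibAdaBoost.

    from abc import ABC, abstractmethod

    class Sample(ABC):
        """Elementary data input; a DataSet is simply a collection of Samples."""

    class Classifier(ABC):
        @abstractmethod
        def classify(self, sample):
            """Return the label predicted for the given Sample."""

    class Learner(ABC):
        @abstractmethod
        def learn(self, training_set, validation_set):
            """Build a Classifier from a DataSet and estimate its generalization."""

    class Metric(ABC):
        @abstractmethod
        def measure(self, classifier, dataset):
            """Return a measure of interest, e.g. classification error or ROC AUC."""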

Boosting framework

Figure 6.23: Boosting framework principal concepts

A boosting framework, as shown in Figure 6.23, consists of a few collaborating concepts: the boosting algorithm, such as AdaBoost [FS95a], trains several models from a given weak learner using variations of the original DataSet and then combines the obtained models into a weighted sum of classifiers. This first procedure is called the training procedure. The second procedure in the boosting framework is the test procedure, which consists in applying the classifier to sets of samples given as input. These sets have a specific organization of the samples, so we use a Finder which, given a classifier and a source of generic data, applies the classifier to each sample present in the generic data; the samples are provided by a search strategy implemented in the Finder. The Finder has three stages: preprocessing, classification and postprocessing. A non-exhaustive list of the newly introduced concepts is as follows:

• Boosting algorithm (AdaBoost)


• Weak Learner, Genetic Weak Learner and Genetic Interface
• Weak classifier and Weighted sum classifier
• Generic Data
• Finder, Finder preprocessor and Finder postprocessor

Boosting vision extension

Figure 6.24: Boosting vision extension framework principal concepts

The extension of the boosting framework to the visual domain (see Figure 6.24) consists in extending the notion of sample to support the visual object representation, the ImageSample. An image sample can be static (see Definition 9) or dynamic (see Definition 10). The generic Data is extended to Visual Data, which represents an image together with its pyramid and the set of preprocessings computed for the images at the different scales. The weak classifier is extended to a visual weak classifier called Feature. In LibAdaBoost we implemented the rectangular features, the control-points features and the VJ motion-based features. The Visual Finder is responsible for finding the patterns corresponding to a given object of interest in the input image or video stream. A non-exhaustive list of the newly introduced concepts is as follows:

• Image sample
• Visual Data and Visual Pyramid
• Feature
• Visual Finder, Visual pre and post processors
• VJ, CP and VJM

6.4.3 LibAdaBoost architectural overview

Plugin-based architecture

A key design goal in LibAdaBoost is extensibility. We adopted an architecture which favors extensibility and eases adding or extending algorithms. The architecture has three granularity levels: modules, components and environments. The lowest granularity is the module and the highest granularity is the environment. A module can be accessed in LibAdaBoost if it is registered to a given module manager (a similar pattern holds for components and environments).

Figure 6.25: Plugin-based architecture objects instantiation

Figure 6.25 shows this plugin-based architecture. We can identify the following blocks:


• The algorithm module manager handles a set of modules implementing the same given interface (example: a learner).
• The kernel is responsible for handling the set of algorithm module managers. It defines the different families of modules available in the system. The component manager handles the set of available components.
• The options manager handles the options required to pass parameters to these different modules. The environment manager handles the environment objects.
• The console is responsible for the execution of a command handled by the command factory within a given environment.
• The loader is responsible for the initialization of the system: it loads and registers the different available plugins (modules, components, and environments).

Service oriented tools

LibAdaBoost presents two levels: an SDK and a toolkit. The toolkit is based on a set of services which provide high level functionalities called tasks. We implemented three major services: a learning service, a validation service and a test service. The service oriented architecture is characterized by the following properties:

• A service is automatically updated to support new functionalities introduced by plugins.
• A service is an extensible part of LibAdaBoost; the developer can easily add more services.
• The different services encode results in a common format.
• A service can be invoked both from the command line and from a graphical user interface.

Figure 6.26 shows further functional blocks such as the Launcher and the Modeler. Both tools are more advanced environments for launching learning, validation and test tasks. We have partially implemented these tools as a web application.

6.4.4 LibAdaBoost content overview

Figure 6.27 presents the fusion of the functional analysis and the architectural analysis. We identify four categories of building blocks: the blocks related to the plugin-based architecture, which serve to handle the different modules and the communication between them within the framework; the blocks related to the machine learning framework, which are mostly module managers, component managers and environment managers; the boosting framework, which consists


Figure 6.26: LibAdaBoost Service oriented environment

of a set of module managers and modules provided as plugins; and the boosting vision extension, which is a set of plugins implemented on top of the boosting framework.

6.4.5 Comparison to previous work

LibAdaBoost is comparable to Torch and Torch Vision. We provide a more advanced abstraction of the machine learning framework, and the fact that LibAdaBoost is specialized in boosting puts it a step ahead of Torch for this purpose, although at the same time we do not implement other learning algorithms such as SVMs. LibAdaBoost is rich in boosting oriented algorithms. Another advantage of LibAdaBoost is the service oriented approach: we provide ready to use tools to benefit from the existing algorithms. The plugin based approach makes it easy to share and distribute modules.


Figure 6.27: LibAdaBoost content overview

6.5 Use cases

In this section we illustrate three use cases of visual object detection: faces, cars and people. We used LibAdaBoost to solve the task of learning a visual detector for each pattern, and we compared several features for each task. In some cases (car detection) we introduced optimizations to the visual finder. We tried to present in each use case a specific functionality of LibAdaBoost. These use cases are part of larger projects in the domains of video surveillance and intelligent transportation systems.

6.5.1 Car detection

In this section we illustrate how to use LibAdaBoost for learning a visual detector specialized in cars. Car detection is an important task within a wide application area related to intelligent transportation systems. Vision systems based on frontal cameras embedded on a vehicle provide real-time video streams. These video streams are analyzed with dedicated algorithms in order to detect cars, people and other objects. We use boosting for learning a detector specialized in cars. The learning task starts by collecting a set of car samples (1224 samples) and non-car samples (2066 samples), labeled positive and negative respectively. Some examples of these samples are shown in Figure 6.28. A car sample consists in a rear view of a car. The shape of the car is labeled using a bounding box manually selected from the training dataset. The basic size of a car sample is 48x48 pixels in RGB space.

Figure 6.28: Car examples from the learning set: upper line for the positive samples and lower line for the negative examples

(a) Rectangular Features

(b) Control Points

Figure 6.29: Car detection: training and generalization error

Figure 6.29 shows the evolution of the training and generalization errors for two learning tasks: (a) using rectangular features and (b) using control points.

(a) Step 15

(b) Step 160

Figure 6.30: Car detection: voting histograms for the training and validation datasets

LibAdaBoost offers a logging utility which makes it possible to monitor the evolution of the distribution of the voting values using histograms. Figure 6.30 shows the status of the voting histogram at two moments: the first view, at step 15, shows that the positive and negative histograms are mixed; the second view, at step 160, shows that the histograms are separated for the training dataset and still have a small mixed part for the generalization dataset. Finally, Figure 6.31 shows the validation step of the resulting knowledge: the ROC curve is plotted as well as the precision/recall curve. The accuracy variation according to the voting threshold shows that the standard 0.5 value is the best for this detector. The validation result of the detector on the validation dataset is shown as a histogram distribution of the votes, which shows a relatively small error zone.
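The accuracy-versus-threshold curve mentioned above can be reproduced from the raw votes with a simple scan. The sketch below uses our own helper names (not part of LibAdaBoost) and assumes votes are normalized to [0, 1] and labels are coded as +1/-1.

```cpp
#include <cstddef>
#include <vector>

// Accuracy of the strong classifier on a validation set for one candidate
// voting threshold: a sample is predicted positive when its vote reaches
// the threshold.
double accuracyAtThreshold(const std::vector<double>& votes,
                           const std::vector<int>& labels,   // +1 / -1
                           double threshold) {
    std::size_t correct = 0;
    for (std::size_t i = 0; i < votes.size(); ++i) {
        int predicted = (votes[i] >= threshold) ? +1 : -1;
        if (predicted == labels[i]) ++correct;
    }
    return votes.empty() ? 0.0 : static_cast<double>(correct) / votes.size();
}

// Scanning thresholds in [0, 1] reproduces the accuracy-versus-threshold
// curve discussed in the text (for which 0.5 turned out to be the best value).
double bestThreshold(const std::vector<double>& votes,
                     const std::vector<int>& labels) {
    double best = 0.5, bestAcc = 0.0;
    for (double t = 0.0; t <= 1.0; t += 0.01) {
        double acc = accuracyAtThreshold(votes, labels, t);
        if (acc > bestAcc) { bestAcc = acc; best = t; }
    }
    return best;
}
```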

6.5.2 Face detection

In this section we illustrate how to use LibAdaBoost for learning a visual detector specialized in faces. Face detection is an important task within a wide application area related to the mobile and video surveillance domains. We use boosting to learn a detector specialized in faces. The learning task starts by collecting a set of face samples (4870) and non-face samples (14860), labeled positive and negative respectively. Some of these samples are shown in Figure 6.32. A face sample consists of a frontal view of a face. The shape of the face is labeled using a bounding box manually selected from the training dataset. The basic size of a face is 24x24 pixels in RGB space. Figure 6.33 shows the evolution of the training and generalization errors for two learning tasks: (a) using rectangular features and (b) using control points. As before, Figure 6.34 shows the status of the voting histogram at two moments: the first view, at step 15, shows that the positive and negative histograms are mixed; the second view, at step 160, shows that the histograms are separated for the training dataset and still have a small mixed part for the generalization dataset.


Figure 6.31: Car detection: LibAdaBoost validation result

Finally, Figure 6.35 shows the validation step of the resulting knowledge: the ROC curve is plotted as well as the precision/recall curve. The accuracy variation according to the voting threshold shows that the standard 0.5 value is the best for this detector. The validation result of the detector on the validation dataset is shown as a histogram distribution of the votes, which shows a relatively small error zone.

6.5.3 People detection

In this section we illustrate how to use LibAdaBoost for learning a visual detector specialized in people. People detection is an important task within a wide application area related to intelligent video surveillance systems. Real-time video streams are analyzed with dedicated algorithms in order to detect people, track them, and perform further post-processing and high-level analysis. We use boosting to learn a detector specialized in people. The learning task starts by collecting a set of people samples and non-people samples, labeled positive and negative respectively. Some of these samples are shown in Figure 6.36. A people sample consists of a complete view of a person. The shape of the person is labeled using a bounding box manually selected from the training dataset.


Figure 6.32: Face examples from the learning set: upper line for the positive samples and lower line for the negative examples

(a) Rectangular Features

(b) Control Points

Figure 6.33: Face detection: learning and generalization error

The basic size of a person is 36x48 pixels in RGB space. The people model depends on the camera installation: the view angle is closely correlated with the learned knowledge, so a dedicated learning process is required for each camera. This is a disadvantage of the current features (control points and rectangular features). It could be resolved by motion-based features, which are not used in this use case since some technical bugs still have to be fixed. As in the previous sections, we present the evolution of the error for the training as well as for the generalization step; this is shown in Figure 6.37.

6.6 Conclusion

In this chapter we presented a new framework for machine learning. We built a boosting framework on top of it and exposed this boosting framework to the computer vision domain. We made the framework public under the LGPL license.


(a) Step 15


(b) Step 160

Figure 6.34: Face detection: voting histograms for the learning and validation datasets

To the best of our knowledge, this is a new framework which will serve as a reference for the implementation and comparison of various weak classifiers and learning algorithms. We showed some examples of how to use LibAdaBoost for the generation of knowledge. We did not go deeply into the technical details; these are explained in tutorials and in the programming documentation. The framework is operational and ready to be integrated in larger projects as a tool for building boosting-based detectors.


Figure 6.35: Face detection: LibAdaBoost validation result

Category                                  Type
Viola and Jones rectangular features      FvEdge, FhEdge, FvLine, FhLine, FDiagonal
Viola and Jones motion-based features     motion: fj, fk; appearance: fm; mixture: fi
Extended rectangular features             FUL, FDL, FDR, FUR, Fbox, FVGrid, FHGrid, FGrid

Table 6.2: Category and types association


α    0.95   0.90   0.85   0.80   0.75   0.70   0.50   0.25
N    48     23     15     11     8      6      3      1

Table 6.3: Pyramid size for different values of α, image dimensions 384 × 288, ζ size 24 × 24

Visual feature                             Required preprocessing
Viola and Jones rectangular features       IIt, IIt2
Control-points                             R0, R1, R2
Motion-based rectangular features          IIt, IIt2, ∆, II∆, II∆2, U, IIU, IIU2, D, IID, IID2, L, IIL, IIL2, R, IIR, IIR2

Table 6.4: Required preprocessing for each visual feature

Figure 6.36: People examples from the learning set: upper line for the positive samples and lower line for the negative examples


(a) Rectangular Features

(b) Control Points

Figure 6.37: People detection: learning and generalization error

(a) Step 15

(b) Step 160

Figure 6.38: People detection: Voting histogram for learning and validation datasets


Figure 6.39: People detection: LibAdaBoost validation result


Chapter 7

Object tracking

Contents
7.1 Initialization
7.2 Sequential Monte Carlo tracking
7.3 State dynamics
7.4 Color distribution model
7.5 Results
7.6 Incorporating Adaboost in the Tracker
    7.6.1 Experimental results

The aim of object tracking is to establish a correspondence between objects or object parts in consecutive frames and to extract temporal information about objects such as trajectory, posture, speed and direction. Tracking detected objects frame by frame in video is a significant and difficult task. It is a crucial part of our video surveillance systems: without object tracking, the system could not extract cohesive temporal information about objects, and higher-level behavior analysis steps would not be possible. On the other hand, inaccurate foreground object segmentation due to shadows, reflectance and occlusions makes tracking a difficult research problem. Our tracking method is an improved and extended version of the state-of-the-art color-based sequential Monte Carlo tracking method proposed in [JM02].

7.1 Initialization

In order to track objects, it is necessary to know where they are initially. There are three possibilities to consider: manual initialization, semi-automatic initialization, and automatic initialization. For complicated models, or for objects with a constantly moving background, manual initialization is often preferred. However, even with a moving background, if the background is a uniformly colored region, semi-automatic initialization can be implemented. Automatic initialization is possible in many circumstances, such as smart environments, surveillance, or sports tracking with a fixed camera and a fixed background: the background model is used to detect targets based on a large intensity change in the scene and on the color distribution of the appearance of an object.
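A minimal sketch of this automatic initialization strategy for a fixed camera is given below. The threshold value and the single-bounding-box assumption are ours; a real system would also exploit the color distribution of the object appearance as described above.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Pixels whose intensity differs strongly from the background model are
// marked as foreground; the bounding box of the foreground mask gives the
// initial tracking window W.
struct Box { int x0, y0, x1, y1; bool valid; };

Box initializeFromBackground(const std::vector<std::uint8_t>& frame,
                             const std::vector<std::uint8_t>& background,
                             int width, int height, int threshold = 30) {
    Box b{width, height, -1, -1, false};
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int i = y * width + x;
            int diff = std::abs(int(frame[i]) - int(background[i]));
            if (diff > threshold) {                 // large intensity change
                b.x0 = std::min(b.x0, x);
                b.y0 = std::min(b.y0, y);
                b.x1 = std::max(b.x1, x);
                b.y1 = std::max(b.y1, y);
                b.valid = true;
            }
        }
    }
    return b;   // if valid, (x0, y0)-(x1, y1) initializes the tracked window
}
```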

7.2 Sequential Monte Carlo tracking

Sequential Monte Carlo methods have become popular and have already been applied to numerous problems in time series analysis, econometrics and object tracking. In non-Gaussian, hidden Markov (or state-space) models (HMM), the state sequence {x_t; t ∈ N}, x_t ∈ R^{n_x}, is assumed to be an unobserved (hidden) Markov process with initial distribution p(x_0) and transition distribution p(x_t | x_{t-1}), where n_x is the dimension of the state vector. The observations {y_t; t ∈ N*}, y_t ∈ R^{n_y}, are conditionally independent given the process {x_t; t ∈ N} with marginal distribution p(y_t | x_t), where n_y is the dimension of the observation vector. We denote the state vectors and observation vectors up to time t by x_{0:t} ≡ {x_0, ..., x_t} and similarly y_{0:t}. Given the model, the sequence of filtering densities to be tracked is obtained by the recursive Bayesian estimation:

p(x_t \mid y_{0:t}) = \frac{p(y_t \mid x_t)\, p(x_t \mid y_{0:t-1})}{p(y_t \mid y_{0:t-1})}    (7.1)
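For reference, (7.1) is obtained by combining the Chapman-Kolmogorov prediction step with one application of Bayes' rule:

```latex
p(x_t \mid y_{0:t-1}) = \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid y_{0:t-1})\, dx_{t-1},
\qquad
p(x_t \mid y_{0:t}) = \frac{p(y_t \mid x_t)\, p(x_t \mid y_{0:t-1})}{p(y_t \mid y_{0:t-1})}.
```

The first equality propagates the previous filtering density through the transition model; the second corrects it with the new observation.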

In (7.1), p(y_t | x_t) is the likelihood given by the observation model, p(x_t | y_{0:t-1}) is the prior that propagates the past state into the future, and p(y_t | y_{0:t-1}) is the evidence. The Kalman filter can handle Eq. (7.1) analytically if the model is a linear Gaussian state-space model. However, in the case of visual tracking, the likelihood is non-linear and often multi-modal with respect to the hidden states. As a result, the Kalman filter and its approximations work poorly in our case. With sequential Monte Carlo techniques, we can approximate the posterior p(x_t | y_{0:t}) by a finite set of M particles (or samples), {x^i_t}_{i=1...M}. In order to generate samples from this distribution, we propagate samples using an appropriate proposal transition function f(x_t | x_{t-1}, y_t). We set f(x_t | x_{t-1}, y_t) = p(x_t | x_{t-1}), which yields the bootstrap filter. If we denote by {x^i_t}_{i=1...M} fair samples from the filtering distribution at time t, then the new particles, denoted x̃^i_{t+1}, are associated with the importance weights:

\pi^i_{t+1} \propto \frac{p(y_{t+1} \mid \tilde{x}^i_{t+1})\, p(\tilde{x}^i_{t+1} \mid x^i_t)}{f(\tilde{x}^i_{t+1} \mid x^i_t, y_{t+1})}    (7.2)

where \sum_{i=1}^{M} \pi^i_{t+1} = 1. We resample these particles according to their importance weights to generate a set of fair samples {x^i_{t+1}}_{i=1...M} from the posterior p(x_{t+1} | y_{0:t+1}). With the discrete approximation of p(x_t | y_{0:t}), the tracker output is obtained by the Monte Carlo approximation of the expectation:
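A minimal sketch of the weighting and resampling step is given below. The Particle fields and the likelihood callback are placeholders (the actual state and likelihood are those of Sections 7.3 and 7.4), and multinomial resampling is used for simplicity; with the bootstrap choice f = p(x_t | x_{t-1}), the weight of Eq. (7.2) reduces to the likelihood.

```cpp
#include <cstddef>
#include <random>
#include <vector>

struct Particle { double x, y, scale; };   // simplified state vector

// Weight the predicted particles by their likelihood and resample them so
// that the returned set is an (unweighted) fair sample of the posterior.
template <class LikelihoodFn>
std::vector<Particle> weightAndResample(const std::vector<Particle>& predicted,
                                        LikelihoodFn likelihood,
                                        std::mt19937& rng) {
    std::vector<double> w(predicted.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < predicted.size(); ++i) {
        w[i] = likelihood(predicted[i]);    // pi_i proportional to p(y | x_i)
        sum += w[i];
    }
    for (double& wi : w) wi /= sum;         // normalize so the weights sum to 1

    // Multinomial resampling: draw M particles with probability proportional
    // to their importance weights.
    std::discrete_distribution<std::size_t> pick(w.begin(), w.end());
    std::vector<Particle> resampled;
    resampled.reserve(predicted.size());
    for (std::size_t i = 0; i < predicted.size(); ++i)
        resampled.push_back(predicted[pick(rng)]);
    return resampled;
}
```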


\hat{x}_t \equiv E(x_t \mid y_{0:t}), \quad E(x_t \mid y_{0:t}) \approx \frac{1}{M} \sum_{i=1}^{M} x^i_t .    (7.3)

7.3 State dynamics

We formulate the state to describe a region of interest to be tracked. We assume that the shape, size, and position of the region are known a priori and define a rectangular window W. The shape of the region could also be an ellipse or any other appropriate shape, depending mostly on the kind of object to track; in our case we use a rectangle. The state consists of the location and the scale of the window W. The state dynamics varies and depends on the type of motion to deal with. Due to the constant, yet often random, nature of object motion, we choose second-order auto-regressive dynamics as the best approximation of the motion, as in [JM02]. We define the state at time t as a vector x_t = (l_t^T, l_{t-1}^T, s_t, s_{t-1})^T, where T denotes the transpose, l_t = (x, y)^T is the position of the window W at time t in image coordinates, and s_t is the scale of W at time t. We apply the following state dynamics:

x_{t+1} = A x_t + B x_{t-1} + C v_t, \qquad v_t \sim N(0, \Sigma).    (7.4)

The matrices A, B, C and Σ control the effect of the dynamics. In our experiments, we define these matrices in an ad hoc way by assuming constant velocity for the object's motion.
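The following sketch shows how one particle is propagated with these dynamics. The scalar coefficients stand in for the matrices A, B and C, and their values (constant-velocity assumption plus a small amount of noise) are illustrative, not the ones tuned in our experiments.

```cpp
#include <random>

struct State { double x, y, s; };   // position and scale of the window W

// Second-order auto-regressive propagation of Eq. (7.4), applied
// component-wise: x_{t+1} = 2*x_t - x_{t-1} + noise.
State propagate(const State& cur, const State& prev, std::mt19937& rng) {
    std::normal_distribution<double> noise(0.0, 1.0);   // v_t ~ N(0, 1)
    const double a = 2.0, b = -1.0;       // constant-velocity assumption (A, B)
    const double cPos = 2.0, cScale = 0.01;   // noise amplitudes (C)
    State next;
    next.x = a * cur.x + b * prev.x + cPos   * noise(rng);
    next.y = a * cur.y + b * prev.y + cPos   * noise(rng);
    next.s = a * cur.s + b * prev.s + cScale * noise(rng);
    return next;
}
```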

7.4 Color distribution model

This section explains how we incorporate the global nature of color in visual perception into our sequential Monte Carlo framework. We follow the implementation of HSV color histograms used in [JM02], and extend it to our adaptive color model. We use histogramming techniques in the Hue-Saturation-Value (HSV) color space for our color model. Since HSV decouples the intensity (i.e., value) from color (i.e., hue and saturation), it is reasonably insensitive to illumination effects. An HSV histogram is composed of N = N_h N_s + N_v bins and we denote by b_t(k) ∈ {1, ..., N} the bin index associated with the color vector y_t(k) at pixel location k at time t. As pointed out in [JM02], pixels with low saturation and value are not useful to include in the HSV histogram, so we populate the histogram without those pixels. If we define the candidate region in which we build the HSV histogram as R(x_t) ≡ l_t + s_t W, then a kernel density estimate Q(x_t) ≡ {q(n; x_t)}_{n=1,...,N} of the color distribution at time t is given by:

q(n; x_t) = \eta \sum_{k \in R(x_t)} \delta[b_t(k) - n]    (7.5)

where δ is the Kronecker delta function, η is a normalizing constant which ensures \sum_{n=1}^{N} q(n; x_t) = 1, and k ranges over the pixel locations within R(x_t). Eq. (7.5) defines q(n; x_t) as the probability of color bin n at time t. If we denote by Q* = {q*(n; x_0)}_{n=1,...,N} the reference color model and by Q(x_t) a candidate color model, then we need to measure the data likelihood (i.e., similarity) between Q* and Q(x_t). As in [JM02], we apply the Bhattacharyya similarity coefficient to define a distance ζ on HSV histograms:

\zeta[Q^*, Q(x_t)] = \left[ 1 - \sum_{n=1}^{N} \sqrt{q^*(n; x_0)\, q(n; x_t)} \right]^{1/2}    (7.6)

Once we have a distance ζ on the HSV color histograms, we use the following likelihood distribution:

p(y_t \mid x_t) \propto \exp\left(-\lambda\, \zeta^2[Q^*, Q(x_t)]\right)    (7.7)

where λ = 20. The value of λ was determined from our experiments. We also set the number of bins N_h, N_s and N_v to 10.
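The distance and likelihood of Eqs. (7.6) and (7.7) translate directly into code; the sketch below assumes both histograms are already normalized to sum to one.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Bhattacharyya-based distance between the reference model Q* and a
// candidate model Q(x_t), as in Eq. (7.6).
double bhattacharyyaDistance(const std::vector<double>& reference,    // Q*
                             const std::vector<double>& candidate) {  // Q(x_t)
    double bc = 0.0;                        // Bhattacharyya coefficient
    for (std::size_t n = 0; n < reference.size(); ++n)
        bc += std::sqrt(reference[n] * candidate[n]);
    return std::sqrt(std::max(0.0, 1.0 - bc));
}

// Color likelihood of Eq. (7.7), up to a normalizing constant.
double colorLikelihood(const std::vector<double>& reference,
                       const std::vector<double>& candidate,
                       double lambda = 20.0) {
    double zeta = bhattacharyyaDistance(reference, candidate);
    return std::exp(-lambda * zeta * zeta);
}
```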

7.5 Results

This section presents the tracking results of our tracker for a single object. The tracker follows a target over many frames provided a good initialization is made. The following figures show the results. They illustrate the robustness of our tracker under a severe illumination change with a similar object nearby, as well as in cluttered scenes and under partial occlusion; the tracker never loses the target. Figure 7.1 shows the illumination insensitivity of our tracker, which does not lose the target even when there is a drastic illumination change due to camera flashes. The same figure shows that the second-order AR dynamics tends to fail to approximate the target's velocity when the object moves by a small amount, so our tracker may not locate the object exactly. Figure 7.1 and Figure 7.2 show the successful performance of our tracker in a cluttered scene with similar objects nearby (frame 390). These figures also show that the second-order AR dynamics makes our tracker robust when targets make a sudden direction change. Our tracker fails when there are identical objects, such as players of the same team, near the target, because the tracker cannot resolve the association between two identical objects without any prior information. This is an open problem that we need to deal with in the future.

7.6 Incorporating Adaboost in the Tracker

This method was introduced in [Ka04]. The authors adopt the cascaded Adaboost algorithm of Viola and Jones [VJ01a], originally developed for detecting faces. The Adaboost results can be improved if the motion models of the objects are taken into account: by considering only plausible motions, the number of false positives can be reduced. For this reason, we incorporated Adaboost in the proposal mechanism of our tracker. The expression for the proposal distribution is then given by the following mixture:

Q^*_b(x_t \mid x_{0:t-1}, y_t) = \alpha\, Q_{ada}(x_t \mid x_{t-1}, y_t) + (1 - \alpha)\, p(x_t \mid x_{t-1})    (7.8)

where Q_ada is a Gaussian distribution that we discuss in the next paragraph. The parameter α can be set dynamically without affecting the convergence of the particle filter: it is only a parameter of the proposal distribution, and therefore its influence is corrected in the calculation of the importance weights. Note that the Adaboost proposal mechanism depends on the current observation y_t; it is therefore robust to peaked likelihoods. Still, another critical issue must be discussed: determining how close two different proposal distributions need to be for creating their mixture proposal. We can always apply the mixture proposal when Q_ada overlaps a transition distribution modeled by the autoregressive state dynamics. However, if these two distributions do not overlap, there is a distance between their means. If the Monte Carlo estimation of a mixture component of the mixture particle filter overlaps with the nearest cluster given by the Adaboost detection algorithm, we sample from the mixture proposal distribution. If there is no overlap between the Monte Carlo estimation of any mixture component and a cluster given by the Adaboost detection, then we set α = 0, so that our proposal distribution reduces to the transition distribution of the mixture particle filter.
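The sketch below illustrates how a new particle could be drawn from the mixture proposal of Eq. (7.8). The Gaussian spread around the detection and the constant-velocity transition are illustrative choices, not the exact parameters of our implementation.

```cpp
#include <random>

struct TrackState { double x, y; };

// With probability alpha the new state is drawn from a Gaussian centred on
// the nearest Adaboost detection (Q_ada); otherwise it follows the AR
// transition prior p(x_t | x_{t-1}). alpha is set to 0 by the caller when no
// detection overlaps the particle cloud.
TrackState sampleProposal(const TrackState& cur, const TrackState& prev,
                          const TrackState* detection,   // nullptr if no overlap
                          double alpha, std::mt19937& rng) {
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    std::normal_distribution<double> noise(0.0, 1.0);
    if (detection != nullptr && coin(rng) < alpha) {
        // Q_ada: Gaussian around the detection centre
        return {detection->x + 3.0 * noise(rng), detection->y + 3.0 * noise(rng)};
    }
    // transition prior: constant-velocity second-order AR dynamics
    return {2.0 * cur.x - prev.x + 2.0 * noise(rng),
            2.0 * cur.y - prev.y + 2.0 * noise(rng)};
}
```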

7.6.1 Experimental results

We used a Matlab implementation for our experiments and initialized our tracker to track three targets. Figure 7.3 shows the results. In this figure, it is important to note that no matter how many objects are in the scene, the mixture representation of the tracker is not affected and successfully adapts to the change. For objects coming into the scene, Adaboost quickly detects the new object within a short time, typically two frames. The tracker then immediately assigns particles to the object and starts tracking it.


Figure 7.1: Tracking results with displaying all particles


Figure 7.2: Tracking results with displaying the most likely particles


Figure 7.3: Tracker + Adaboost results

Part III Architecture

Chapter 8

General purpose computation on the GPU

Contents
8.1 Introduction
8.2 Why GPGPU?
    8.2.1 Computational power
    8.2.2 Data bandwidth
    8.2.3 Cost/Performance ratio
8.3 GPGPU's first generation
    8.3.1 Overview
    8.3.2 Graphics pipeline
    8.3.3 Programming language
    8.3.4 Streaming model of computation
    8.3.5 Programmable graphics processor abstractions
8.4 GPGPU's second generation
    8.4.1 Programming model
    8.4.2 Application programming interface (API)
8.5 Conclusion

8.1 Introduction

In this part, we focus on hardware acceleration of computer vision algorithms. As target architecture we selected a low-cost solution generally available in all computers: the graphics card. This approach has become an attractive solution for the computer vision community. To the best of our knowledge, we implemented the first visual object detection algorithm on the graphics card. The implemented algorithm performs better than the traditional implementation on a general purpose processor; it is now possible to implement large detectors while maintaining real-time execution. This part is organized as follows: Chapter 8 presents the graphics hardware as well as its abstraction as a stream coprocessor. Chapter 9 presents the mapping of visual object detection onto the graphics hardware as in [HBC06]. The remainder of this chapter is organized as follows: in Section 8.2 we answer the question why GPGPU?. Section 8.3 presents the GPGPU's first generation. The GPGPU's second generation is presented in Section 8.4. Section 8.5 concludes the chapter.

8.2 Why GPGPU?

Commodity computer graphics chips, known generically as Graphics Processing Units or GPUs, are probably today's most powerful computational hardware for the dollar. Researchers and developers have become interested in harnessing this power for general-purpose computing, an effort known collectively as GPGPU (General-Purpose computing on the GPU). In this chapter we summarize the principal developments to date in the hardware and software behind GPGPU, give an overview of the techniques and computational building blocks used to map general-purpose computation to graphics hardware, and survey the various general-purpose computing tasks to which GPUs have been applied. We begin by reviewing the motivation for and the challenges of general-purpose GPU computing: why GPGPU?

8.2.1 Computational power

Recent graphics architectures provide tremendous memory bandwidth and computational horsepower. For example, the flagship NVIDIA GeForce 7900 GTX ($378 in October 2006) boasts 51.2 GB/sec of memory bandwidth; the similarly priced ATI Radeon X1900 XTX can sustain a measured 240 GFLOPS, both measured with GPUBench [BFH04a]. Compare this to 8.5 GB/sec and 25.6 GFLOPS theoretical peak for the SSE units of a dual-core 3.7 GHz Intel Pentium Extreme Edition 965 [Int06]. GPUs also use advanced processor technology; for example, the ATI X1900 contains 384 million transistors and is built on a 90-nanometer fabrication process. Graphics hardware is fast and getting faster quickly. For example, the arithmetic throughput (again measured by GPUBench) of NVIDIA's current-generation launch product, the GeForce 7800 GTX (165 GFLOPS), more than triples that of its predecessor, the GeForce 6800 Ultra (53 GFLOPS). In general, the computational capabilities of GPUs, measured by the traditional metrics of graphics performance, have compounded at an average yearly rate of 1.7x (pixels/second) to 2.3x (vertices/second). This rate of growth significantly outpaces the often-quoted Moore's Law as applied to traditional microprocessors; compare to a yearly rate of roughly 1.4x for CPU performance [EWN04] (Figure 8.1).

Figure 8.1: CPU to GPU GFLOPS comparison

Why is graphics hardware performance increasing more rapidly than that of CPUs? Semiconductor capability, driven by advances in fabrication technology, increases at the same rate for both platforms. The disparity can be attributed to fundamental architectural differences (Figure 8.2): CPUs are optimized for high performance on sequential code, with many transistors dedicated to extracting instruction-level parallelism with techniques such as branch prediction and out-of-order execution. On the other hand, the highly data-parallel nature of graphics computations enables GPUs to use additional transistors more directly for computation, achieving higher arithmetic intensity with the same transistor count. We discuss the architectural issues of GPU design further in Section 8.3.2.

Figure 8.2: CPU to GPU transistors

8.2.2 Data bandwidth

Component                                        Bandwidth
GPU Memory Interface                             35 GB/sec
PCI Express Bus (x16)                            8 GB/sec
CPU Memory Interface (800 MHz Front-Side Bus)    6.4 GB/sec

Table 8.1: Available memory bandwidth in different parts of the computer system

Figure 8.3: The overall system architecture of a PC

The CPU in a modern computer system (Figure 8.3) communicates with the GPU through a graphics connector such as a PCI Express or AGP slot on the motherboard. Because the graphics connector is responsible for transferring all command, texture, and vertex data from the CPU to the GPU, the bus technology has evolved alongside GPUs over the past few years. The original AGP slot ran at 66 MHz and was 32 bits wide, giving a transfer rate of 264 MB/sec. AGP 2x, 4x, and 8x followed, each doubling the available bandwidth, until the PCI Express standard was finally introduced in 2004, with a maximum theoretical bandwidth of 4 GB/sec simultaneously available to and from the GPU. (Your mileage may vary; currently available motherboard chipsets fall somewhat below this limit, around 3.2 GB/sec or less.) It is important to note the vast difference between the GPU's memory interface bandwidth and the bandwidth in other parts of the system, as shown in Table 8.1: there is a vast amount of bandwidth available internally on the GPU. Algorithms that run on the GPU can therefore take advantage of this bandwidth to achieve dramatic performance improvements.
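As a rough, back-of-the-envelope illustration using the figures above (our own arithmetic, for a single 640 x 480 RGBA frame of about 1.2 MB), moving the frame across the graphics connector costs roughly ten times more than reading it from GPU memory:

```latex
t_{\mathrm{PCIe}} \approx \frac{640 \times 480 \times 4\ \mathrm{B}}{3.2\ \mathrm{GB/s}} \approx 0.38\ \mathrm{ms},
\qquad
t_{\mathrm{GPU\ mem}} \approx \frac{640 \times 480 \times 4\ \mathrm{B}}{35\ \mathrm{GB/s}} \approx 0.035\ \mathrm{ms}.
```

This is why GPGPU implementations try to keep intermediate data on the GPU and minimize CPU-GPU transfers.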

8.2.3 Cost/Performance ratio

The graphics hardware industry is driven by the game market. The price of a top graphics card is about $600, and the motherboard required to host it is a mainstream, low-priced part. In contrast, the motherboard required to install multiple processors is much more expensive. The graphics card can almost be considered a free component, since it is a basic part of a standard PC configuration.

8.3 GPGPU's first generation

8.3.1 Overview

Flexible and precise

Modern graphics architectures have become flexible as well as powerful. Early GPUs were fixed-function pipelines whose output was limited to 8-bit-per-channel color values, whereas modern GPUs now include fully programmable processing units that support vectorized floating-point operations on values stored at full IEEE single precision (but note that the arithmetic operations themselves are not yet perfectly IEEE-compliant). High-level languages have emerged to support the new programmability of the vertex and pixel pipelines [BFH.04b, MGAK03, MDP.04]. Additional levels of programmability are emerging with every major generation of GPU (roughly every 18 months). For example, current-generation GPUs introduced vertex texture access, full branching support in the vertex pipeline, and limited branching capability in the fragment pipeline. The next generation will expand on these changes and add geometry shaders, or programmable primitive assembly, bringing flexibility to an entirely new stage in the pipeline [Bly06]. The raw speed, increasing precision, and rapidly expanding programmability of GPUs make them an attractive platform for general-purpose computation.


Figure 8.4: First Generation GPGPU framework


Limitations and difficulties

This power results from a highly specialized architecture, evolved and tuned over years to extract maximum performance on the highly parallel tasks of traditional computer graphics. The increasing flexibility of GPUs, coupled with some ingenious uses of that flexibility by GPGPU developers, has enabled many applications outside the original narrow tasks for which GPUs were designed, but many applications still exist for which GPUs are not (and likely never will be) well suited. Word processing, for example, is a classic example of a pointer-chasing application, dominated by memory communication and difficult to parallelize. Today's GPUs also lack some fundamental computing constructs, such as efficient scatter memory operations (i.e., indexed-write array operations) and integer data operands. The lack of integers and associated operations such as bit shifts and bitwise logical operations (AND, OR, XOR, NOT) makes GPUs ill-suited for many computationally intense tasks such as cryptography (though upcoming Direct3D 10-class hardware will add integer support and more generalized instructions [Bly06]). Finally, while the recent increase in precision to 32-bit floating point has enabled a host of GPGPU applications, 64-bit double precision arithmetic remains a promise on the horizon. The lack of double precision prevents GPUs from being applicable to many very large-scale computational science problems. Furthermore, graphics hardware remains difficult to apply to non-graphics tasks. The GPU uses an unusual programming model (described later in this chapter), so effective GPGPU programming is not simply a matter of learning a new language. Instead, the computation must be recast into graphics terms by a programmer familiar with the design, limitations, and evolution of the underlying hardware. Today, harnessing the power of a GPU for scientific or general-purpose computation often requires a concerted effort by experts in both computer graphics and the particular computational domain. But despite the programming challenges, the potential benefits (a leap forward in computing capability, and a growth curve much faster than that of traditional CPUs) are too large to ignore.

The potential of GPGPU

A vibrant community of developers has emerged around GPGPU (http://GPGPU.org/), and much promising early work has appeared in the literature. GPGPU applications range from numeric computing operations to non-traditional computer graphics processes, physical simulations and game physics, and data mining.

8.3.2 Graphics pipeline

The graphics pipeline is a conceptual architecture showing the basic data flow from a high-level programming point of view. It may or may not coincide with the real hardware design, though the two are usually quite similar. The diagram of a typical graphics pipeline is shown in Figure 8.5. The pipeline is composed of multiple functional stages. Again, each functional stage is conceptual and may map to one or multiple hardware pipeline stages. The arrows indicate the directions of major data flow. We now describe each functional stage in more detail.

Figure 8.5: Graphics pipeline

Application

The application usually resides on the CPU rather than the GPU. It handles high-level work such as artificial intelligence, physics, animation, numerical computation and user interaction. The application performs the necessary computations for all these activities, and sends the necessary commands plus data to the GPU for rendering.

Host and command

The host is the gate keeper of the GPU. Its main functionality is to receive commands from the outside world and translate them into internal commands for the rest of the pipeline. The host also deals with error conditions (e.g. a new glBegin issued without a preceding glEnd) and state management (including context switches). These are all very important functionalities, but most programmers do not need to worry about them unless performance issues show up.

Geometry

The main purpose of the geometry stage is to transform and light vertices. It also performs clipping, culling, viewport transformation and primitive assembly. We now describe these functionalities in detail.

The programmable vertex processor

The vertex processor has two major responsibilities: transformation and lighting. In transformation, the position of a vertex is transformed from the object or world coordinate system into the eye (i.e. camera) coordinate system. Transformation can be concisely expressed as follows:

(x_o, y_o, z_o, w_o) = (x_i, y_i, z_i, w_i) M^T

where (x_i, y_i, z_i, w_i) is the input coordinate, (x_o, y_o, z_o, w_o) is the transformed coordinate, and M is the 4 x 4 transformation matrix. Note that both the input and output coordinates have four components. The first three are the familiar x, y, z Cartesian coordinates. The fourth component, w, is the homogeneous component, and this 4-component coordinate is termed a homogeneous coordinate. Homogeneous coordinates are introduced mainly for notational convenience; without them the equation above would be more complicated. For ordinary vertices, the w component is simply 1.

To make this more concrete, assume the input vertex has a world coordinate (x_i, y_i, z_i). Our goal is to compute its location in the camera coordinate system, with the camera/eye center located at (x_e, y_e, z_e) in world space. After some mathematical derivations, one can see that the 4 x 4 matrix M above is composed of several components: the upper-left 3 x 3 portion is the rotation sub-matrix, the upper-right 3 x 1 portion is the translation vector, and the bottom 1 x 4 vector is the projection part.

In lighting, the color of the vertex is computed from the position and intensity of the light sources, the eye position, the vertex normal, and the vertex color. The computation also depends on the shading model used. In the simple Lambertian model, the lighting can be computed as follows:

(n . l) c

where n is the vertex normal, l is the position of the light source relative to the vertex, and c is the vertex color.

In old-generation graphics machines, these transformation and lighting computations were performed in fixed-function hardware. However, since the NV20, the vertex engine has become programmable, allowing the transformation and lighting computation to be customized in assembly code [Lindholm et al. 2001]. In fact, the instruction set does not even dictate what kind of semantic operation needs to be done, so arbitrary computation can actually be performed in a vertex engine. This allows us to use the vertex engine for non-traditional operations, such as solving numerical equations. Later, we will describe the programming model and applications in more detail.

Primitive assembly

Here, vertices are assembled back into triangles (plus lines or points), in preparation for further operations. A triangle cannot be processed until all its vertices have been transformed and lit. As a result, the coherence of the vertex stream has a great impact on the efficiency of the geometry stage. Usually, the geometry stage has a cache for a few recently computed vertices, so if the vertices come down in a coherent manner, the efficiency will be better. A common way to improve vertex coherency is via triangle strips.

Clipping and culling

After transformation, vertices outside the viewing frustum are clipped away. In addition, triangles with the wrong orientation (e.g. back faces) are culled.

Viewport transformation

The transformed vertices have floating-point eye-space coordinates in the range [-1, 1]. We need to transform this range into window coordinates for rasterization. In the viewport stage, the eye-space coordinates are scaled and offset into [0, height - 1] x [0, width - 1].
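To make the vertex-stage computation above concrete, the following plain C++ sketch applies a 4 x 4 homogeneous transform to a vertex and evaluates the Lambertian term (n . l) c. The row-major matrix layout and the clamping of back-facing contributions are our own conventions, not a description of any particular hardware.

```cpp
#include <algorithm>
#include <array>

using Vec3 = std::array<double, 3>;
using Vec4 = std::array<double, 4>;
using Mat4 = std::array<std::array<double, 4>, 4>;

// Homogeneous transform: out = M * v (row-major M).
Vec4 transform(const Mat4& m, const Vec4& v) {
    Vec4 out{};   // zero-initialized
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            out[r] += m[r][c] * v[c];
    return out;
}

// Lambertian lighting (n . l) c, with negative contributions clamped to zero.
Vec3 lambert(const Vec3& normal, const Vec3& lightDir, const Vec3& color) {
    double ndotl = normal[0] * lightDir[0]
                 + normal[1] * lightDir[1]
                 + normal[2] * lightDir[2];
    ndotl = std::max(0.0, ndotl);
    return {ndotl * color[0], ndotl * color[1], ndotl * color[2]};
}
```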


Rasterization

The primary function of the rasterization stage is to convert a triangle (or line or point) into a set of covered screen pixels. The rasterization operation can be divided into two major stages. First, it determines which pixels are part of the triangle. Second, rasterization interpolates the vertex attributes, such as color, normal, and texture coordinates, into the covered pixels.

Fragment

Before we introduce the fragment stage, let us first define what a fragment is. Basically, a fragment corresponds to a single pixel and includes color, depth, and sometimes texture-coordinate values. For an input fragment, all these values are interpolated from vertex attributes in the rasterization stage, as described earlier. The role of the fragment stage is to process each input fragment so that a new color or depth value is computed. In old-generation graphics machines, the fragment stage had fixed functionality and performed mainly texture mapping. Between the NV20 and NV30, the fragment stage became reconfigurable via register combiners. Since the NV30, the fragment stage has become fully programmable, just like the vertex processor. In fact, the instruction sets of vertex and fragment programs are very similar. The major exception is that the vertex processor has more branching capability while the fragment processor has more texturing capability, but this distinction might only be transitory. Given their programmability and computational power, fragment processors have been embraced both by the gaming community for advanced rendering effects and by the scientific computing community for general-purpose computation such as numerical simulation [Harris 2005]. Later, we will describe the programming model and applications in more detail.

Raster operation

ROP (Raster Operation) is the unit that writes fragments into the frame-buffer. The main functionality of ROP is to efficiently write batches of fragments into the frame-buffer via compression. It also performs alpha, depth, and stencil tests to determine whether the fragments should be written or discarded. ROP deals with several buffers residing in the frame-buffer, including the color, depth, and stencil buffers.

Frame buffer

The frame-buffer is the main storage for the graphics pipeline, in addition to a few scattered caches and FIFOs throughout the pipe. The frame-buffer stores vertex arrays, textures, and the color, depth, and stencil buffers. The content of the color buffer is fed to the display for viewing. The frame-buffer has several major characteristics: size, width, latency, and clock speed. The size determines how many textures and buffers you can store on chip and how big they can be, the width determines how much data you can transfer in one clock cycle, the latency determines how long you have to wait for data to come back, and the clock speed determines how fast you can access the memory. The product of width and clock speed is often termed bandwidth, which is one of the most commonly cited figures when comparing DRAMs. However, for graphics applications, latency is probably at least as important as bandwidth, since it dictates the length and size of all the internal FIFOs and caches used to hide latency. (Without latency hiding, we would see bubbles in the graphics pipeline, wasting performance.) Although the frame-buffer appears to be mundane, it is crucial in determining the performance of a graphics chip. Various tricks have been invented to hide frame-buffer latency. If frame-buffers had zero latency, the graphics architecture would be much simpler; we could have a much smaller shader register file and would not even need a texture cache anymore.

8.3.3 Programming language

Standard 3D programming interfaces

The graphics hardware is accessible through a standard 3D application programming interface (API). Two APIs are currently available: DirectX from Microsoft and OpenGL from Silicon Graphics Incorporated. Both APIs provide an interface to the GPU, and each has a different philosophy. DirectX depends directly on Microsoft's technology. OpenGL is modular and extensible: each 3D hardware vendor can extend OpenGL to support the new functionalities provided by a new GPU. OpenGL is standardized by the OpenGL Architectural Review Board (ARB), which is composed of the main graphics vendors. OpenGL-based solutions have a major advantage over DirectX-based solutions: they are cross-platform. This advantage motivated our choice of OpenGL for 3D programming. We were interested in general-purpose programming on the GPU; we started our development on a Linux platform and recently moved to Windows. For more details on these APIs, advanced material covering OpenGL and DirectX can be found in the literature.

High-level shading languages

A shading language is a domain-specific programming language for specifying shading computations in graphics. In a shading language, a program is specified for computing the color of each pixel as a function of the light direction, surface position, orientation, and other parameters made available by the rendering system. Typically, shading languages are focused on color computation, but some shading systems also support limited modeling capabilities, such as support for displacement mapping. Shading languages developed specifically for programmable GPUs include the OpenGL Shading Language (GLSL), the Stanford Real-Time Shading Language (RTSL), Microsoft's High-Level Shading Language (HLSL), and NVIDIA's Cg. Of these languages, HLSL is DirectX-specific, GLSL is OpenGL-specific, Cg is multi-platform and API-neutral but developed by NVIDIA (and not well supported by ATI), and RTSL is no longer under active development.

8.3.4 Streaming model of computation

Streaming architecture

Stream architectures are a topic of great interest in computer architecture. For example, the Imagine stream processor demonstrated the effectiveness of streaming for a wide range of media applications, including graphics and imaging. The StreamC/KernelC programming environment provides an abstraction which allows programmers to map applications to the Imagine processor. These architectures are specific, and they cope well with the computer vision algorithms previously presented in Part II, but they are not our principal interest; we mention them only to give a complete view of the problem. As described in our motivation in Part I, we are particularly interested in architectures with similar functionalities that are available in general-public graphics hardware.

Stream programming model

The stream programming model exposes the locality and concurrency in media processing applications. In this model, applications are expressed as a sequence of computation kernels that operate on streams of data. A kernel is a small program that is repeated for each successive element in its input streams to produce output streams that are fed to subsequent kernels. Each data stream is a variable-length collection of records, where each record is a logical grouping of media data. For example, a record could represent a triangle vertex in a polygon rendering application or a pixel in an image processing application. A data stream would then be a sequence of hundreds of these vertices or pixels. In the stream programming model, locality and concurrency are exposed both within a kernel and between kernels. The three basic notions are summarized in the definitions below, and a plain C++ analogy follows them.

Definition 13 (Streams). A stream is a collection of data which can be operated on in parallel. A stream is made up of elements. Access to stream elements is restricted to kernels (Definition 14) and the streamRead and streamWrite operators, which transfer data between memory and streams.

Definition 14 (Kernels). A kernel is a special function which operates on streams. Calling a kernel on a stream performs an implicit loop over the elements of the stream, invoking the body of the kernel for each element.

Definition 15 (Reductions). A reduction is a special kernel which accepts a single input stream. It provides a data-parallel method for calculating either a smaller stream of the same type, or a single element value.
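The following plain C++ analogy (not Brook syntax) mirrors these three definitions: the kernel is applied implicitly to every stream element, and the reduction folds a stream into a single value.

```cpp
#include <functional>
#include <numeric>
#include <vector>

// A "stream" is simply a collection of elements of one type.
template <class T>
using Stream = std::vector<T>;

// A "kernel" is applied to every element of the input stream (implicit loop),
// producing an output stream of the same length.
template <class In, class Out>
Stream<Out> runKernel(const Stream<In>& input,
                      std::function<Out(const In&)> kernel) {
    Stream<Out> output;
    output.reserve(input.size());
    for (const In& element : input)
        output.push_back(kernel(element));
    return output;
}

// A "reduction" folds a stream into a single value.
template <class T>
T runReduction(const Stream<T>& input, std::function<T(T, T)> op, T init) {
    return std::accumulate(input.begin(), input.end(), init, op);
}

// Usage example: brighten a stream of gray levels, then sum them.
//   auto bright = runKernel<int, int>(pixels, [](const int& p) { return p + 10; });
//   int total   = runReduction<int>(bright, std::plus<int>(), 0);
```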

8.3.5 Programmable graphics processor abstractions

From an abstract point of view, the GPU is a streaming processor (Figure 8.6), particularly suitable for the fast processing of large arrays [SDK05]. Thus, graphics processors have been used by researchers to enhance the performance of specific non-graphics applications and simulations. Researchers started to use the graphics hardware for general-purpose computation using traditional graphics programming languages: a graphics Application Programming Interface (API), and a graphics language for the kernels known as shaders. In Appendix A we describe a hello world example of using the GPU for general-purpose computing with graphics APIs and a shading language. The graphics API was previously described in Section 8.3.3. It exposes function calls used to configure the graphics hardware. Implementations of these APIs are available for a variety of languages, e.g. C, C++, Java, Python. These APIs always target graphics programming, so one must encapsulate the native API in a library to keep the code modular. As described in Section 8.3.3, shading languages can be used with DirectX (HLSL), with OpenGL (GLSL), or with both (Cg). The programming model of a shader is described later in this section.

Figure 8.6: The graphics hardware abstraction as a streaming coprocessor

It is not always easy for the non-graphics community to use graphics APIs and shading languages. That is why several projects try to further abstract the graphics hardware as a slave coprocessor and to develop a suitable model of computation. Ian Buck et al. [BFH+04b] describe a stream programming language for GPUs called Brook. Brook extends C++ to provide simple data-parallel constructs that allow the GPU to be used as a streaming coprocessor. It consists of two components: a kernel compiler, brcc, which compiles kernel functions into legal Cg code, and a runtime system built on top of OpenGL (and DirectX) which implements the Brook API. In Appendix B we provide a hello world example of Brook for image processing similar to Appendix A.


The programming model of the fragment processor

The execution environment of a fragment (or vertex) processor is illustrated in Figure 8.7. For every vertex or fragment to be processed, the shader program receives the graphics primitives from the previous stage in read-only input registers. The shader is then executed, and the result of rendering is written to the output registers. During execution, the shader can read a number of constant values set by the host processor, read from texture memory (recent GPUs also allow vertex processors to access texture memory), and read and write a number of temporary registers.

Figure 8.7: Programming model for current programmable graphics hardware. A shader program operates on a single input element (vertex or fragment) stored in the input registers and writes the execution result into the output registers.

Implementation constraints

This section discusses basic properties of GPUs which are decisive for the design of efficient algorithms on this architecture. Table 8.2 presents a summary of these properties. Note that these properties depend on the graphics hardware and change every six months. In Table 8.2 we consider the nVIDIA G7 family and the first GeForce FX (5200) [nVI07a].

Property                 CPU          GFX5200              GF7800GTX
Input streams            native 1D    native 1D, 2D, 3D    native 1D, 2D, 3D
Output streams           native 1D    native 2D            native 2D
Gathers                  arbitrary    arbitrary            arbitrary
Scatters                 arbitrary    global               global, emulated
Dynamic branching        yes          no                   yes
Maximum kernel length    unlimited    1024                 unlimited

Table 8.2: Graphics hardware constraints

In the latest GPU generation these properties have changed completely. This global change stems from the fact that the GPU has recently evolved into a generic multiprocessor-on-chip architecture. This will be highlighted in the next section.

8.4 GPGPU's second generation

Figure 8.8 shows the new GPGPU diagram template. This new diagram is characterized by a novel approach: the GPGPU application bypasses the graphics API (OpenGL, DirectX). This novel functionality is provided by CUDA, the new product released by nVIDIA. CUDA stands for Compute Unified Device Architecture; it is a new hardware and software architecture for issuing and managing computations on the GPU as a data-parallel computing device, without the need to map them to a graphics API.

Figure 8.8: Second generation GPGPU framework

The CUDA software stack is illustrated in Figure 8.8. It consists of several layers: a hardware driver, an application programming interface (API) and its runtime, and two high-level mathematical libraries of common usage, CUFFT and CUBLAS. The CUDA API is an extension of the ANSI C programming language, for a minimal learning curve. The operating system's multitasking mechanism is responsible for managing the access to the GPU by several CUDA and graphics applications running concurrently.


CUDA provides general DRAM memory addressing and supports arbitrary scatter operations. CUDA will have a great impact on GPGPU research: researchers can now concentrate on writing applications for a multiprocessor using the standard C programming language, instead of trying to abstract the graphics hardware through graphics APIs and shading languages, and they can put their effort into mapping their algorithms onto the data-parallel architecture using a data-parallel model of computation. This justifies the decomposition of GPGPU into two generations. In the remaining sections we give an overview of the CUDA software based on the CUDA documentation [nVI07b]; the reader is invited to consult the documentation for more details on the architecture. CUDA was released recently, and we did not use it in our work. We mention it in this thesis to show the importance as well as the evolution of the GPGPU approach: GPGPU is now supported by the graphics hardware vendor nVIDIA itself.

8.4.1 Programming model

CUDA models the GPU as a computing device capable of executing a very high number of threads in parallel. The GPU operates as a coprocessor to the host CPU. Both the host and the device maintain their own DRAM, referred to as host memory and device memory respectively. DMA engines are used to accelerate memory copy transactions between the two DRAMs. The device is implemented as a set of multiprocessors on chip. Each multiprocessor consists of several SIMD processors with on-chip shared memory. At any given clock cycle, each processor executes the same instruction but operates on different data. The programming model for this architecture is as follows: the thread is the elementary unit of the execution model; threads are grouped into batches of threads called blocks, and blocks are grouped together into grids. Thus, each thread is identified by its grid id, its block id and its id within the block. Only threads within the same block can safely communicate. The maximum number of threads in a block, the maximum number of blocks in a grid, and the number of grids in the device are parameters of the implementation and vary from one model to another.
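As a small illustration of this indexing scheme (written here as plain C++ arithmetic rather than actual CUDA device code), each thread derives the data element it is responsible for from its block and thread identifiers:

```cpp
// 1-D case: the global element index owned by a thread.
int globalIndex(int blockId, int blockDim, int threadId) {
    return blockId * blockDim + threadId;
}
// Example: a grid of 16 blocks of 256 threads covers 4096 elements; the
// thread with blockId = 3 and threadId = 10 processes element
// 3 * 256 + 10 = 778.
```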

8.4.2 Application programming interface (API)

The goal of the CUDA programming interface is to allow C programmers to easily write programs for execution on the device. The implementation of algorithms on the device with CUDA is outside the focus of the present work; nevertheless, we provide in Appendix C a hello world example of a CUDA program for image processing, similar to the previous hello world examples for Brook and OpenGL/Cg. For more information on the application programming interface, the reader is invited to refer to the CUDA programming guide [nVI07b] (Chapter 4).

8.5 Conclusion

In this chapter we introduced the architecture of the graphics hardware. This architecture is in a migration phase: from an architecture completely based on the graphics pipeline toward a generic architecture based on a programmable multiprocessor on chip. This migration is motivated by the recent usage of graphics hardware for general-purpose computing. In this thesis, we experimented with the first generation of programmable graphics hardware (GeforceFX5200, GeforceFX6600GT and GeforceFX7800GTX). Thus we had to work with a graphics API (OpenGL) and shaders (Cg) to implement programs for general-purpose computing. We also introduced the research activities launched to find the best abstraction of the graphics hardware as a streaming coprocessor (Brook).

Chapter 9

Mapping algorithms to GPU

Contents
9.1 Introduction
9.2 Mapping visual object detection to GPU
9.3 Hardware constraints
9.4 Code generator
9.5 Performance analysis
    9.5.1 Cascade Stages Face Detector (CSFD)
    9.5.2 Single Layer Face Detector (SLFD)
9.6 Conclusion

9.1 Introduction

In this chapter we describe our experience of writing computer vision applications (see Part II) for execution on the GPU. We started the GPGPU activity in 2004 from scratch: we were newcomers to GPU programming, without any previous experience with shading languages and with little experience with OpenGL. We were motivated by the performance of these architectures and by the possibility of implementing parallel algorithms on a commodity PC. Our first source of inspiration was OpenVIDIA [FM05], where Fung and Mann presented an innovative platform which uses multiple graphics cards to accelerate computer vision algorithms. The platform was targeted at Linux only. We took this platform in hand and succeeded in making it operational after several weeks of configuration on a Fedora 4 box, with a GeforceFX5200 on the AGP bus and three GeforceFX5200 on the PCI bus. We used OpenGL and Cg during this part of the work, which served as a training period for graphics APIs as well as for the Cg shading language. The graphics card market was, and still is, very dynamic, and we found that newer graphics cards (Geforce5900, Geforce6800...) performed better than the whole OpenVIDIA platform; moreover, no new graphics cards were produced for the PCI bus. We therefore left the OpenVIDIA platform behind and focused on a single graphics card architecture.

Once we became familiar with graphics languages, we started to look for a higher-level abstraction of the GPU. Our objective was to find an abstraction that would allow us to implement portable applications for several graphics card models and to design applications for a stream model of computation; this will be useful for future implementations of these applications on stream processors. We found several approaches (see Section 8.3.5) and we chose Brook [BFH+04b]. Brook, initially developed for stream processors and adapted to GPUs by Buck et al., was a good choice because it matches our objectives: Brook applications are cross-platform and can be compiled to run on several architectures (nv30, nv35, nv40), and they are stream based, giving a stream-coprocessor abstraction of the GPU. We implemented the image processing filters available in the Camellia image processing library [BS03]. These filters perform better on the GPU than on the CPU; the main handicap was the communication time on the AGP bus. For this first part we did not have a specific complete application to accelerate: the idea was to accelerate the Camellia library on the GPU, so that any application based on Camellia could benefit from the GPU acceleration. We accelerated mathematical morphology operators, linear filters, edge detection filters and color transformation filters. This work is comparable to the approach of Farrugia and Horain [FH06], who developed a framework for the execution of OpenCV filters on the GPU using OpenGL and the GLSL shading language. In their release, they provided the acceleration of the previously described filters, and they developed a framework comparable to OpenVIDIA [FM05]. Their approach requires graphics programming knowledge to add a new filter; it is tightly dependent on the underlying architecture and cannot automatically resolve hardware constraints (number of registers, shader length, output registers...). We therefore believe that using Brook to accelerate a stack of algorithmic filters is a more durable solution that can be adapted to the new technologies brought by new GPUs.

After image processing filters, we focused on more complex algorithms which cannot run in real time on a standard PC. We were particularly interested in visual object detection with the sliding-window technique. This algorithm requires some optimizations to run in real time, and these optimizations decrease the final accuracy, as described in Abramson's thesis [Abr06]. We developed a framework for mapping visual detectors onto the GPU. To the best of our knowledge, we were the first to implement a sliding-window detector for visual objects on the GPU [HBC06]. The first framework was based on OpenGL/Cg and targeted nv30-based GPUs (Geforce5200). It performs poorly on the Geforce5200, but it runs in real time on newer nv40-based GPUs such as the Geforce6600. We also developed an implementation of the framework using Brook, with similar performance.


In this chapter we focus on mapping visual object detectors on the GPU. The remainder of this chapter is organized as follows: Section 9.2 presents the design of the visual detector as a stream application. Section 9.3 presents the hardware constraints and their impact on the stream application design. Section 9.4 presents the code generator framework. A performance analysis task is described in Section 9.5. Section 9.6 concludes the chapter.

9.2 Mapping visual object detection to GPU

In this section we focus on the design of the algorithm using the stream model of computation previously described in Section 8.3.4. Some parts of the application are not suitable for a streaming model of computation, so the first question when writing the application is: which computation should be moved to the GPU? To answer this question we need to study the code of the application, looking for the functions that could be accelerated by the GPU and, more generally, for the parts of the computation that are rich in data parallelism. This is a purely algorithmic activity which consists in modeling the application using streams and kernels. We adopted a hierarchical approach: we identify the initial streams and the main functional parts as blocks with inputs and outputs, where these inputs and outputs represent data streams. We consider the visual detector as described in Section 6.3.6, with the global processing pattern (see Figure 6.17(a)). For this first implementation we use the Control-Points features as visual weak classifiers. We made this choice because we used this configuration for a large set of problems and it was our main focus; nevertheless, most of the subsequent analysis holds for other types of features, and only the preprocessing part changes. Reviewing the main parts of the detector based on the Control-Points features (Figure 9.1), we distinguish the following operations:

Internal Resolution Tree  The Control-Points features are applied to three resolutions of the input frame. This operation is carried out for each input frame.

Preprocessing  The preprocessing step consists in preparing the three resolutions of the input image as described in Section 6.3.3 (a sketch is given below).

Binary Classifier  The binary classifier is a pixel-wise operation. It is a sliding window operation: at each pixel of the input frame, we test whether an object, defined by its upper left corner and its dimensions, is present at this position.

Detection Grouping  Since a given object usually triggers multiple detections around a given position, a gather operation is applied to group the multiple detections corresponding to a unique object.
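As an illustration of the preprocessing block, the sketch below builds the three input resolutions from the original frame. It assumes a simple factor-of-two averaging downsample, standing in for the exact scheme of Section 6.3.3, and the names (Image, buildPyramid) are ours:

#include <vector>
#include <cstddef>

// Minimal grayscale image container used only for this sketch.
struct Image {
    std::size_t width = 0, height = 0;
    std::vector<unsigned char> pixels;   // row-major, one byte per pixel
    unsigned char at(std::size_t x, std::size_t y) const { return pixels[y * width + x]; }
};

// Downscale by a factor of two using 2x2 block averaging (assumed scheme).
static Image halfScale(const Image& src) {
    Image dst;
    dst.width  = src.width  / 2;
    dst.height = src.height / 2;
    dst.pixels.resize(dst.width * dst.height);
    for (std::size_t y = 0; y < dst.height; ++y)
        for (std::size_t x = 0; x < dst.width; ++x) {
            unsigned sum = src.at(2*x, 2*y) + src.at(2*x+1, 2*y)
                         + src.at(2*x, 2*y+1) + src.at(2*x+1, 2*y+1);
            dst.pixels[y * dst.width + x] = static_cast<unsigned char>(sum / 4);
        }
    return dst;
}

// Produce the three resolutions R0, R1, R2 fed to the detector as input streams.
static void buildPyramid(const Image& r0, Image& r1, Image& r2) {
    r1 = halfScale(r0);
    r2 = halfScale(r1);
}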


Figure 9.1: Application codesign, split between CPU and GPU: (a) all CPU; (b) GPU light; (c) CPU/GPU compromise; (d) GPU heavy

The match between the CPU/GPU architecture and the visual detector algorithm can be represented by four logical partitions:

All CPU (see Figure 9.1(a))  All the functional blocks are executed on the CPU. This is the golden reference of the detector.

Light GPU (see Figure 9.1(b))  Only the classification operation is executed on the GPU; the preprocessing and the post-processing are done on the CPU, and the hierarchical resolution tree is constructed on the host processor. This is the minimal configuration for the GPU.

CPU/GPU compromise (see Figure 9.1(c))  If the preprocessing fits the data-parallel architecture well, it is also mapped to the GPU, which yields a balanced functional decomposition between CPU and GPU.

Heavy GPU (see Figure 9.1(d))  The graphics hardware provides functionalities for building multiscale image representations, so only the post-processing operation is executed on the CPU.

The internal resolution tree could be accelerated with the standard OpenGL API, by rendering a texture to a quad smaller than the dimensions of the given texture; this can be done using specific GL options. The gathering operation is a global operation that is not suited to data parallelism, so it is implemented on the CPU side.


Figure 9.2: Graphics pipeline used as a streaming coprocessor. Three services are available: StreamRead, StreamWrite and RunKernel

We adopted the light GPU partition (see Figure 9.1(b)). The main part moved to the GPU is the binary classifier, which applies the same computation to each pixel of the input frame. Algorithm 5 presents the implementation of the cascade as a streaming application. The input streams are R0, R1, R2 and V. The layers are called successively and the result of each layer is transmitted to the next one through the V stream. This part of the implementation runs on the CPU (Figure 9.2). Algorithm 6 gives the pseudo-code of the implementation of a layer as several kernels (shaders) running on the GPU (Figure 9.2).

Algorithm 5 Cascade of classifiers
Require: Intensity image R0
Ensure: a voting matrix
1: Build R1 and R2
2: Initialize V from Mask
3: StreamRead R0, R1, R2 and V
4: for all i such that 0 <= i <= 11 do
5:   V <= RunLayer i, R0, R1, R2, V
6: end for
7: StreamWrite V
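A minimal host-side sketch of Algorithm 5 is given below. The StreamProcessor interface and its method names are ours; they only stand in for the StreamRead, RunKernel and StreamWrite services of Figure 9.2:

#include <vector>

// Abstraction of the GPU seen as a stream coprocessor (hypothetical interface).
struct StreamProcessor {
    virtual ~StreamProcessor() = default;
    // Upload the input streams R0, R1, R2 and the vote stream V to the device.
    virtual void streamRead(const std::vector<float>& r0, const std::vector<float>& r1,
                            const std::vector<float>& r2, const std::vector<float>& v) = 0;
    // Run the kernels implementing layer i; the device-side V stream is updated in place.
    virtual void runLayer(int layer) = 0;
    // Download the final vote stream back to the host.
    virtual void streamWrite(std::vector<float>& v) = 0;
};

// Host-side driver corresponding to Algorithm 5 (12 layers, indices 0..11).
void runCascade(StreamProcessor& gpu,
                const std::vector<float>& r0, const std::vector<float>& r1,
                const std::vector<float>& r2, std::vector<float>& votes) {
    gpu.streamRead(r0, r1, r2, votes);      // 3: StreamRead R0, R1, R2 and V
    for (int i = 0; i <= 11; ++i)           // 4: for all 0 <= i <= 11
        gpu.runLayer(i);                    // 5: V <= RunLayer i, R0, R1, R2, V
    gpu.streamWrite(votes);                 // 7: StreamWrite V
}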

9.3 Hardware constraints

The graphics hardware imposes many constraints on the shaders. On graphics cards implementing NV30 shaders, the shader size is limited to 1024 instructions and the number of constant registers is limited to 16. On more advanced graphics cards implementing NV40, the shader size is unlimited, but the constant registers are limited to 32. These constraints influence the design of the streaming application. Thus, to implement a cascade of binary classifiers, we had to decompose the cascade into several kernels, each kernel corresponding to a layer. To achieve a homogeneous decomposition of the application, we further decomposed each layer into several kernels, called scans; the whole cascade is then equivalent to a list of successive scans, as shown in Figure 9.3. As shown in that figure, the data transmission between layers is done through the V stream, and the intra-layer data transmission is done through the A stream, which accumulates the intermediate feature calculations.

Algorithm 6 Layer: Weighted Sum Classifier
Require: Iterator, R0, R1, R2 and V
Ensure: a voting stream
1: if V[Iterator] = true then
2:   S <= 0
3:   for each feature Fj in the layer do
4:     A <= RunFeature j, R0, R1, R2, V
5:     S <= S + A * WeightOf(Fj)
6:   end for
7:   if S <= Threshold then
8:     return false
9:   else
10:     return true
11:   end if
12: else
13:   return false
14: end if
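A sketch of how a layer could be split into scans under such an instruction budget is given below. The per-feature instruction cost and the greedy packing policy are assumptions of ours, not the exact rule used by our generator:

#include <vector>

struct Feature {
    double weight;              // weight assigned by AdaBoost (placeholder type)
    int instructionCost;        // estimated shader instructions needed to evaluate it
};

// Greedily pack the features of one layer into scans so that each generated
// shader stays under the hardware instruction budget (e.g. 1024 on NV30).
std::vector<std::vector<Feature>> splitLayerIntoScans(const std::vector<Feature>& layer,
                                                      int maxInstructionsPerShader) {
    std::vector<std::vector<Feature>> scans;
    std::vector<Feature> current;
    int used = 0;
    for (const Feature& f : layer) {
        if (!current.empty() && used + f.instructionCost > maxInstructionsPerShader) {
            scans.push_back(current);   // close the current scan: it writes the A stream
            current.clear();
            used = 0;
        }
        current.push_back(f);
        used += f.instructionCost;
    }
    if (!current.empty())
        scans.push_back(current);       // the last scan thresholds the sum and writes V
    return scans;
}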

9.4 Code generator

Our code generator is implemented in the Perl scripting language and generates BrookGPU and Cg/OpenGL programs from the input AdaBoost learning knowledge.

Brook implementation  We decided to generate the code of the application in the BrookGPU language, mainly for two reasons. First, the generated code should be portable to various architectures, including future architectures that are not yet defined; generating high-level programs allows fundamental changes in hardware and graphics APIs, as long as the compiler and runtime of the high-level language keep up with those changes. Second, Brook provides a transparency which makes it easier to write a streaming application.


Figure 9.3: Multiscan technique. We consider the CSFD detector. Layer L0 is composed of two features and is implemented using a single scan S0. Layer L9 is composed of 58 features and is implemented using two scans S10 and S11: S10 produces an intermediate voting result, and S11 produces the vote of L9.

The programmer can concentrate on the application design in terms of kernels and streams without focusing on the graphics-related languages and libraries (Cg, OpenGL). Figure 9.4 shows the final system developed with BrookGPU. The AdaBoost knowledge description serves as input to the code generator, which generates the Brook kernels as well as the main C++ code handling stream initialization and the read/run/write calls. The generated code is a ".br" file, which is compiled by brcc to produce the C++ code as well as the assembly code targeted to the GPU. This C++ file is then compiled and linked with the other libraries to build the executable. The main disadvantage of the Brook solution is that we have to go through the compile/link chain each time the AdaBoost knowledge input is modified.

Cg/OpenGL implementation  In order to test the performance of the BrookGPU runtime, as well as to benefit from additional features provided by the OpenGL library, we also implemented a hand-written version of the application, using Cg to code the kernels as pixel shaders and OpenGL as the API to communicate with the graphics card. The advantage of this solution is that we do not have to go through the compile/link chain when the input AdaBoost knowledge changes. The Cg code generator produces the Cg code representing the visual detector, and this code is loaded dynamically using the Cg runtime, which can load Cg code at run time from files stored on disk.


Figure 9.4: Final system, Brook runtime

In this version, the abstraction of the graphics hardware provided by Brook is not available, so we have to manage our data as textures and our kernels as pixel shaders manually.
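The dynamic loading mentioned above can be sketched with the Cg runtime API as follows; the file name and the entry point name are placeholders for the files produced by our generator:

#include <Cg/cg.h>
#include <Cg/cgGL.h>
#include <cstdio>

// Load one generated pixel shader (e.g. one scan of one layer) at run time.
CGprogram loadScanShader(CGcontext context, const char* cgFile /* e.g. "layer0_scan0.cg" */) {
    // Pick the best fragment profile supported by the installed card/driver.
    CGprofile profile = cgGLGetLatestProfile(CG_GL_FRAGMENT);
    cgGLSetOptimalOptions(profile);

    // Compile the generated Cg source from disk; "main" is the assumed entry point.
    CGprogram program = cgCreateProgramFromFile(context, CG_SOURCE, cgFile,
                                                profile, "main", nullptr);
    if (!program) {
        std::fprintf(stderr, "Cg error: %s\n", cgGetLastListing(context));
        return nullptr;
    }
    cgGLLoadProgram(program);   // upload the compiled shader to the driver
    return program;
}

// Before rendering the viewport-sized quad that triggers the computation:
//   cgGLEnableProfile(profile); cgGLBindProgram(program);
//   then bind the input textures (R0, R1, R2, V) to the sampler parameters with
//   cgGetNamedParameter / cgGLSetTextureParameter / cgGLEnableTextureParameter.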

9.5 Performance analysis

We tested our GPU implementation on an NVIDIA GeForce 6600 GT on PCI Express. The host is an Athlon 64 3500+ at 2.21 GHz with 512 MB of DDR memory.

9.5.1 Cascade Stages Face Detector (CSFD)

As shown in Figure 9.6, the CPU version of the CSFD spends most of its time in the first 6 layers, because the number of windows to test at these layers is still considerable. In the last 6 layers, even though the layers are much more complex, the number of windows to classify is small. Conversely, the GPU implementation spends most of its time in the last layers and little time in the first 6 layers. In the first layers the number of features is small, so each layer requires a single pass on the GPU, which is very fast; the last layers are much more complex and require up to 30 scans per layer (layer 11), so the GPU spends more time classifying the frame.


Figure 9.5: Final System, Cg/OpenGL Runtime

In Figure 9.7(a) we present the speedup for each layer of the cascade. For the first 6 layers the GPU runs faster than the CPU, thanks to the high degree of data parallelism exploited by the GPU. The last 6 layers run faster on the CPU, due to the small number of windows left to classify and the high number of features within these layers. In Figure 9.7(b) we present the profiling of the mixed solution: the first 6 layers run on the GPU and the next 6 layers run on the CPU. Using this decomposition, we reach real-time classification at 15 fps on 415x255 frames.
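A minimal sketch of this heterogeneous schedule is given below; the split point and the two layer evaluators are assumptions standing in for the CPU and GPU runtimes of the detector:

#include <functional>
#include <vector>

struct CandidateWindow { int x, y, scale; };

using LayerEvaluator =
    std::function<std::vector<CandidateWindow>(int layer, const std::vector<CandidateWindow>&)>;

// Mixed CPU/GPU cascade: the data-parallel early layers go to the GPU,
// the feature-heavy late layers (few windows left) stay on the CPU.
std::vector<CandidateWindow> runMixedCascade(std::vector<CandidateWindow> windows,
                                             const LayerEvaluator& runLayerOnGPU,
                                             const LayerEvaluator& runLayerOnCPU,
                                             int totalLayers = 12, int gpuLayers = 6) {
    for (int i = 0; i < totalLayers && !windows.empty(); ++i) {
        windows = (i < gpuLayers) ? runLayerOnGPU(i, windows)
                                  : runLayerOnCPU(i, windows);
    }
    return windows;   // windows accepted by every layer are reported as detections
}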

9.5.2 Single Layer Face Detector (SLFD)

As shown in Table 9.1, on the x86 processor the SLFD with 1600 features requires 18.8 s to classify 92k windows in a 415x255 frame. The CSFD with 12 layers needs only 175 ms. The implementation of the SLFD on the GPU requires 2 s to classify 100k windows in a 415x255 frame, which means that the GPU yields a speedup of about 9.4 over the pure CPU implementation, although it still does not run in real time.


Figure 9.6: Processing Time Profiling


Figure 9.7: (a) Cascaded Face Detection, Speedup plot for GPU to CPU implementation. We show that the GPU performs better than CPU for the first five layers; (b) Processing Time Profiling for the heterogeneous solution

9.6 Conclusion

In this chapter we described the mapping of the visual object detector onto the graphics hardware. We designed the algorithm using the streaming model of computation: we identified the streams and the kernels, and we explained how we generated the code for the kernels and how we handled the storage of the images in streams. We faced some technical challenges, especially with the limited graphics hardware (GeForce FX 5200): memory size and shader length. We accomplished a successful mapping of the algorithm, considering a complex knowledge result issued from a LibAdaBoost learning task, namely a face detector based on the Control-Points features. The methodology we used for the mapping is generic enough to support other kinds of detectors with other types of features. In the present chapter we demonstrated that, for a given visual detector, we can associate several runtimes: a CPU runtime and a GPU runtime. The performance tests showed that the GPU architecture performs better than the traditional CPU for the visual detector. We can conclude that the GPU architecture, together with a streaming model of computation, is a promising solution for accelerating computer vision algorithms.

        Time (ms)    Classification rate (WPS)
CPU     18805.04     4902.73
GPU     2030.12      53122.06

Table 9.1: Single layer face detector with 1600 features; the GPU implementation is about 10 times faster than the CPU implementation

Part IV Application

Chapter 10

Application: PUVAME

Contents
10.1 Introduction
10.2 PUVAME overview
10.3 Accident analysis and scenarios
10.4 ParkNav platform
  10.4.1 The ParkView platform
  10.4.2 The CyCab vehicle
10.5 Architecture of the system
  10.5.1 Interpretation of sensor data relative to the intersection
  10.5.2 Interpretation of sensor data relative to the vehicle
  10.5.3 Collision Risk Estimation
  10.5.4 Warning interface
10.6 Experimental results

10.1 Introduction

In this part we focus on applications. An application is the matching of a given algorithmic pipeline with a given architecture. The general framework for intelligent video analysis described in Chapter 4 is parametrized to cope with the application requirements, and some of the optimizations presented in Chapter 8 can be applied when the architecture includes programmable graphics cards. In this chapter we present an application of intelligent video analysis to people detection in urban areas. It consists in the analysis of video streams from fixed cameras, with the aim of detecting people in the video and communicating this information to a global fusion system, which performs tracking and collision risk estimation.


This application is an integrated part of the French PREDIT (National Research and Innovation Program for Ground Transport Systems) project PUVAME (Protection des Vulnérables par Alarmes ou Manoeuvres). A non-exhaustive list of our contributions to this project is as follows:

• We contributed to the different technical meetings around the requirements definition.

• We contributed to the design of the application using RTMaps (www.intempora.com) as a development and deployment environment.

• We provided the people detection module, based on adaptive background subtraction and machine learning for people detection.

• We created datasets for synchronized multiple-camera acquisition.

The remainder of this chapter is organized as follows: in the next section, we give a brief overview of the PUVAME project. In Section 10.3 we detail the accident analysis carried out by Connex-Eurolum in 2003 and describe the chosen use cases. Section 10.4 presents the experimental platform used to evaluate the solutions we propose. Section 10.5 details the architecture of the system. Experimental results are reported in Section 10.6. Finally, we give some conclusions and perspectives.

10.2 PUVAME overview

In France, about 33% of road victims are Vulnerable Road Users (VRU). In its third framework, the French PREDIT includes VRU safety. The PUVAME project was created to develop solutions to avoid collisions between VRUs and buses in urban traffic. This objective is to be achieved by:

• improving the driver's perception capabilities close to his vehicle, using a combination of offboard cameras observing intersections or bus stops, to detect and track the VRUs present there, together with onboard sensors for the localisation of the bus;

• detecting and assessing dangerous situations, by analyzing the positions of the vehicle and of the VRUs and estimating their future trajectories;

• triggering alarms and related processes inside the vehicle;

• integration on experimental vehicles.

Furthermore, the infrastructure and the bus communicate to exchange information about the position of the bus and the positions of the VRUs present in the environment. This information is used to compute a risk of collision between the bus and the VRUs.


In case of a significant risk, a warning strategy is applied to warn the bus driver and the VRU. The project started in October 2003 and was completed in April 2006. The partners are:

INRIA  Responsible for the following tasks: project management, providing the hardware platform, tracking using the occupancy grid server, system integration.

EMP-CAOR  Responsible for people detection, system architecture under RTMaps and system integration.

Connex-Eurolum  Responsible for accident analysis and system validation.

Robosoft  Responsible for vehicle equipment.

ProBayes  Provided and supported the ProBayes software.

Intempora  Provided and supported RTMaps.

INRETS LESCOT  Responsible for the design of the warning interface for the bus.

10.3 Accident analysis and scenarios

Within the scope of the project, we analysed the accidents that occurred in 2003 between vulnerable road users (pedestrians, cyclists) and buses in a French town. Three kinds of accidents emerged from this study. In 25.6% of the accidents, the vulnerable road user was struck while the bus was leaving or approaching a bus stop. In 38.5% of the accidents, the pedestrian was attempting to cross at an intersection when he was struck by a bus turning left or right. Finally, 33.3% of the accidents occurred when the pedestrian lost his balance on the sidewalk while running for the bus, or was struck by a lateral part of the bus or one of its rear-view mirrors. It was also noticed that in all these cases, most of the impacts occurred on the right side or the front of the bus. With the aim of reducing these kinds of accidents, we proposed three scenarios to reproduce the most frequent accident situations and to find ways to overcome the lack of safety. The first scenario reproduces vulnerable road users struck while the bus arrives at or leaves its bus stop (see Figure 10.1(a)). The second scenario reproduces vulnerable road users struck at an intersection (see Figure 10.1(b)). The third scenario reproduces vulnerable road users struck by the lateral part of the bus, surprised by the sweeping zone. In this thesis we focus on the first two scenarios, where a pedestrian is struck by the right part of the bus. In these two cases, it was decided to use fixed cameras placed at the bus stop or at the road junction in order to detect potentially dangerous behaviors. As most of the right part of the bus is not visible to the driver, it is very important to inform him when a vulnerable road user is located in this blind spot. The cameras cover the entire blind zone.


Figure 10.1: (a) Bus station: the bus arrives at or leaves its bus stop; the vulnerable road users situated in the blind spot near the bus are in danger because they are not seen by the driver and have a high probability of colliding with the bus. (b) Cross intersection: the pedestrian crosses at an intersection while the bus turns right; the vulnerable road user who starts crossing the road may be unseen by the driver.

The information given by the cameras is analyzed and merged with information about the position of the bus. A collision risk estimation is performed and an interface alerts the driver about the danger. A more detailed description of the process is given in Section 10.5. The next section presents the experimental site set up at INRIA Rhône-Alpes, where these two scenarios are under test.

10.4 ParkNav platform


The experimental setup used to evaluate the PUVAME system is composed of two distinct parts: the ParkView platform, used to simulate an intersection or a bus stop, and the CyCab vehicle, used to simulate a bus.

Figure 10.2: The ParkView platform hardware: the detector and map server processes run on three networked GNU/Linux (Fedora Core 2) workstations (a Pentium 4 2.4 GHz with 1 GB RAM, a Xeon 3 GHz with 2 GB RAM and a Pentium 4 1.5 GHz with 1 GB RAM), connected to the cameras and to a WiFi router.

10.4.1 The ParkView platform

The ParkView platform is composed of a set of six off-board analog cameras, installed in a car park so that their fields of view partially overlap (see Figure 10.3), and three Linux workstations in charge of the data processing, connected by a standard Local Area Network (Figure 10.2).


Figure 10.3: (a) Location of the cameras on the parking; (b) Field-of-view of the cameras projected on the ground; (c) and (d) View from the two cameras used

The workstations run a specifically developed client-server software composed of three main parts, called the map server, the map clients and the connectors (Figure 10.4). The map server processes all the incoming observations provided by the different clients, in order to maintain a global high-level representation of the environment; this is where the data fusion occurs. A single instance of the server is running. The map clients connect to the server and provide the users with a graphical representation of the environment (Figure 10.5); they can also process this data further and perform application-dependent tasks. For example, in a driving assistance application, the vehicle's on-board computer runs such a client, specialized in estimating the collision risk. The connectors are fed with the raw sensor data, perform the pre-processing, and send the resulting observations to the map server. Each computer connected to one or several sensors runs such a connector. For the application described here, the data preprocessing basically consists in detecting pedestrians. The video stream of each camera is therefore processed independently by a dedicated detector, whose role is to convert each incoming video frame into a set of bounding rectangles, one per target detected in the image plane (Figure 10.14).


Figure 10.4: The ParkView platform software organization

Figure 10.5: A graphical map client showing the CyCab and a pedestrian on the parking

The set of rectangles detected at a given time constitutes the detector observation, and is sent to the map server. Since the fusion system operates in a fixed coordinate system, distinct from each camera's local system, a coordinate transformation must be performed. For this purpose, each camera has been calibrated beforehand. The result of this calibration is a set of parameters:

• The intrinsic parameters contain the information about the camera optics and CCD sensor: the focal length and focal axis, and the distortion parameters.

• The extrinsic parameters consist of the homography matrix: the 3x3 homogeneous matrix which transforms the coordinates of an image point into the ground coordinate system (a sketch of this projection is given below).

In such a multi-sensor system, special care must be taken over proper timestamping and synchronization of the observations. This is especially true in a networked environment, where the standard TCP/IP protocol would incur its own latencies. The ParkView platform achieves the desired effect by using a specialized transfer protocol, building on the low-latency properties of UDP while guaranteeing in-order, synchronized delivery of the sensor observations to the server.
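As mentioned above for the extrinsic parameters, the following sketch applies such a homography to the foot point of a detected bounding rectangle to obtain ground coordinates; the data layout of the matrix and the choice of the foot point are assumptions of ours:

#include <array>

struct BoundingRect { double x, y, width, height; };   // image coordinates (pixels)
struct GroundPoint  { double x, y; };                   // ground-plane coordinates

// H is a 3x3 homography stored row-major, mapping homogeneous image points
// to homogeneous ground points (obtained from the offline calibration).
GroundPoint projectFootPoint(const std::array<double, 9>& H, const BoundingRect& r) {
    // Use the middle of the bottom edge of the box as the contact point with the ground.
    double u = r.x + 0.5 * r.width;
    double v = r.y + r.height;

    double gx = H[0] * u + H[1] * v + H[2];
    double gy = H[3] * u + H[4] * v + H[5];
    double gw = H[6] * u + H[7] * v + H[8];
    return { gx / gw, gy / gw };                         // dehomogenize
}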

10.4.2 The CyCab vehicle

Figure 10.6: The CyCab vehicle

The CyCab (Figure 10.6) has been designed to transport up to two persons in downtown areas, pedestrian malls, large industrial or amusement parks and airports, at a maximum speed of 30 km/h. It is 1.9 m long, 1.2 m wide and weighs about 300 kg. It is equipped with four steered and driven wheels powered by four 1 kW electric motors. The CyCab can either be driven manually with a joystick or operated fully automatically. It is connected to the ParkView platform by a wireless link: we can send it motor commands and collect odometry and sensor data, so the CyCab is considered a client of the ParkNav platform. It perceives the environment with a SICK laser scanner used to detect and avoid obstacles.

10.5 Architecture of the system

In this section we detail the PUVAME software architecture (Figure 10.7) chosen for the intersection and the bus stop. This architecture is composed of four main parts:

1. First of all, the images of the different offboard cameras are used to estimate the position and speed of each pedestrian present in the crossroad map;

2. The odometry and the Global Positioning System are used to determine the position and speed of the vehicle in the crossroad map;

3. Third, the positions and speeds of the different objects present in the intersection are used to estimate a risk of collision between the bus and each pedestrian;

4. Finally, the level and direction of the risk are sent to the Human Machine Interface (HMI) inside the bus to warn the bus driver.

In the next subsections, we detail these four modules.


Figure 10.7: System diagram

10.5.1 Interpretation of sensor data relative to the intersection

Our objective is to achieve a robust perception using multi-sensor approaches to track the different pedestrians present at the intersection. The whole architecture is depicted in Figure 10.8. As mentioned in Section 10.4, each video camera feed is processed independently by a dedicated detector, whose role is to convert each incoming video frame into a set of bounding rectangles, one per target detected in the image plane. The set of rectangles detected at a given time constitutes the detector observations. The information about the position of each VRU given by each offboard camera is then merged, using an occupancy grid approach, in order to compute a better estimate of the position of each VRU. We use the output of this fusion stage to extract new observations of the VRUs currently present in the environment, we use data association algorithms to update the different VRU positions with the extracted observations, and finally we update the list of VRUs present in the environment. The different parts of the process are detailed in the following paragraphs.

Pedestrian detector  To detect the VRUs present at the intersection, a pedestrian detector subsystem is used. The detector is composed of three components: the first component is a foreground segmentation based on the Multiple Gaussian Model described in Section 5.3.4; the second component is a sliding window binary classifier for pedestrians using the AdaBoost-based learning methods previously described in Chapter 6; the third component is a tracking algorithm using image-based similarity criteria.

Occupancy grid  The construction of the occupancy grid as the fusion of the detector observations given by the different cameras is detailed in [YARL06]; here we only give an overview. The occupancy grid is a generic framework for multi-sensor fusion and modelling of the environment. It was introduced by Elfes and Moravec [Elf89] at the end of the 1980s.


Figure 10.8: Architecture of the pedestrian tracker

An occupancy grid is a stochastic tessellated representation of spatial information that maintains probabilistic estimates of the occupancy state of each cell in a lattice. The main advantage of this approach is the ability to integrate several sensors in the same framework while taking the inherent uncertainty of each sensor reading into account, as opposed to the geometric paradigm, whose method is to categorize the world features into a set of geometric primitives [CSW03]. The alternative that occupancy grids offer is a regular sampling of the space occupancy, which is a very generic representation of space when no knowledge about the shapes of the environment is available. Contrary to a feature-based environment model, the only requirement for building an occupancy grid is a Bayesian sensor model for each cell of the grid and each sensor. This sensor model is the description of the probabilistic relation that links a sensor measurement to the state of the space, which the occupancy grid needs in order to perform the sensor integration. In [YARL06] we propose two different sensor models that are suitable for different purposes, but which underline the genericity of the occupancy grid approach. The problem is that motion detectors give information in the image space, whereas we need knowledge in the ground plane.


We solve this problem by projecting the bounding box onto the ground plane under some hypotheses. In the first model, we essentially suppose that the ground is a plane, that all the VRUs stand on the ground and that the complete VRU is visible to the camera. The second model is more general, as we consider that a VRU could be partially hidden but has a maximum height of 3 meters.


Figure 10.9: (a) An image of a moving object acquired by one of the offboard video cameras and the associated bounding box found by the detector. (b) The occluded zone as the intersection of the viewing cone associated with the bounding box and the ground plane. (c) The associated ground image produced by the system. (d) Ground image after Gaussian convolution with a support size of 7 pixels. (e) Probability of the ground image pixel value, knowing that the pixel corresponds to an empty cell: P(Z|emp) for each cell. (f) Probability of the ground image pixel value, knowing that the pixel corresponds to an occupied cell: P(Z|occ) for each cell.

In both models we first segment the ground plane into three types of regions, occupied, occluded and free zones, using the bounding box information. We then introduce uncertainty management, using a Gaussian convolution, to deal with the position errors of the detector. Finally, we convert this information into probability distributions. Figure 10.9 illustrates the whole construction for the first model and Figure 10.10 shows the first phase for the second model. Figure 10.11 shows experiments (i.e., the resulting occupancy grid) with the first model. The first model is precise, but only when its hypotheses hold; in such cases it is the most suitable for position estimation. With the second model, the position uncertainty surrounds the real position of the detected object, so that with other viewpoints or other sensors, such as laser range-finders or radar, it is possible to obtain a good hull of the object's ground occupation. Thanks to this uncertainty, the second model never gives wrong information about the emptiness of an area, which is a guarantee for safety applications.
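As an illustration of this last step, the sketch below fuses one camera observation into a grid cell using the likelihoods P(Z|occ) and P(Z|emp) of Figure 10.9. It is a minimal per-cell Bayesian update, not the exact inference scheme of [YARL06]:

// Bayesian update of the occupancy probability of one grid cell, given the
// likelihoods of the current ground-image value Z under the two cell states.
// priorOcc is P(occ) before the observation; the function returns P(occ | Z).
double updateCellOccupancy(double priorOcc, double pZgivenOcc, double pZgivenEmp) {
    double num = pZgivenOcc * priorOcc;
    double den = num + pZgivenEmp * (1.0 - priorOcc);
    if (den <= 0.0)
        return priorOcc;            // non-informative observation: keep the prior
    return num / den;
}

// Observations from several cameras can be fused by applying this update
// repeatedly, one camera after the other, on the same cell.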



Figure 10.10: (a) Moving object whose contact points with the ground are occluded. (b) The intersection of the viewing cone associated with the bounding box and the ground plane, which is far from the position of the object. (c) Projection of the entire viewing cone on the ground, in yellow. (d) Projection of the part of the viewing cone that fits the object height hypothesis (in green).

Figure 10.11: The resulting probability that the cells are occupied after the inference process.

Object extraction  Based on the occupancy grid, we extract the objects present in the grid: we extract the moving obstacles by first identifying the moving areas. We proceed by differencing two consecutive occupancy grids in time, as in [BA04].

Prediction  To track the VRUs, a Kalman filter [Kal60] is used for each VRU present in the environment. The speed needed for the prediction phase of each Kalman filter is computed as the difference between the current and the previous position.
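A minimal per-axis constant-velocity Kalman filter is sketched below. The thesis does not specify the state vector or the noise parameters, so the two-state model and the numerical values are illustrative assumptions; the same filter is applied independently to the x and y coordinates:

// One-dimensional constant-velocity Kalman filter (state: position p, velocity v).
struct KalmanAxis {
    double p = 0.0, v = 0.0;                             // state estimate
    double P00 = 1.0, P01 = 0.0, P10 = 0.0, P11 = 1.0;   // state covariance
    double q = 0.1;                                      // process noise (assumed, diagonal)
    double r = 0.5;                                      // measurement noise (assumed)

    // Prediction phase: propagate the state with the constant-velocity model.
    void predict(double dt) {
        p += v * dt;
        double nP00 = P00 + dt * (P10 + P01) + dt * dt * P11 + q;
        double nP01 = P01 + dt * P11;
        double nP10 = P10 + dt * P11;
        double nP11 = P11 + q;
        P00 = nP00; P01 = nP01; P10 = nP10; P11 = nP11;
    }

    // Update phase: correct the state with a position measurement z.
    void update(double z) {
        double y = z - p;                                // innovation
        double S = P00 + r;                              // innovation covariance
        double K0 = P00 / S, K1 = P10 / S;               // Kalman gain
        p += K0 * y;
        v += K1 * y;
        double nP00 = (1.0 - K0) * P00, nP01 = (1.0 - K0) * P01;
        double nP10 = P10 - K1 * P00,  nP11 = P11 - K1 * P01;
        P00 = nP00; P01 = nP01; P10 = nP10; P11 = nP11;
    }
};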


Track-to-track association  To re-estimate the position of each VRU using a Kalman filter, we first need to associate the VRU observations extracted from the occupancy grid with the predicted positions. As there can be at most one observation associated with a given predicted position, we use a gating procedure, which is sufficient for correct assignments. The association is also useful for managing the list of VRUs present in the environment, as described in the next paragraph.

Track management  Each VRU is tagged with a specific ID and its position in the environment. At the beginning of the process, the list of VRUs present in the environment is empty. The result of the association phase is used to update this list. Several cases can appear:

1. An observation is associated with a VRU: the re-estimated position of this VRU is computed with the Kalman filter from the predicted position and this observation;

2. A VRU has no associated observation: the re-estimated position of this VRU is equal to the predicted position;

3. An observation is not associated with any VRU: a new temporary VRU ID is created and its position is initialized to the value of the observation.

To avoid creating VRUs corresponding to false alarms, a temporary VRU is only confirmed (i.e., becomes a definitive VRU) if it is seen during 3 consecutive time steps. As we are using offboard cameras always observing the same environment, two conditions are needed to delete a VRU from the list: it must have been unseen (i.e., no observation has been associated with it) for at least the last 5 time steps, and its position must be outside the intersection. (A sketch of these rules is given below.)
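The confirmation and deletion rules above can be sketched as follows. The Track structure and the intersection test are illustrative; only the 3-frame confirmation and the 5-frame deletion thresholds come from the text:

#include <vector>
#include <algorithm>

struct Track {
    int id;
    double x, y;               // current (re-estimated) position in the crossroad map
    int consecutiveSeen = 0;   // frames with an associated observation
    int consecutiveMissed = 0; // frames without an associated observation
    bool confirmed = false;    // temporary track until confirmed
};

// Assumed intersection area, as an axis-aligned box in map coordinates (placeholder values).
bool isOutsideIntersection(const Track& t) {
    return t.x < 0.0 || t.x > 30.0 || t.y < 0.0 || t.y > 30.0;
}

// Update the bookkeeping of one track after the association step of the current frame.
void updateTrackStatus(Track& t, bool hasObservation) {
    if (hasObservation) {
        ++t.consecutiveSeen;
        t.consecutiveMissed = 0;
        if (t.consecutiveSeen >= 3)     // seen during 3 consecutive instants
            t.confirmed = true;
    } else {
        t.consecutiveSeen = 0;
        ++t.consecutiveMissed;
    }
}

// Remove the tracks that have been unseen for at least 5 instants AND left the intersection.
void pruneTracks(std::vector<Track>& tracks) {
    tracks.erase(std::remove_if(tracks.begin(), tracks.end(),
                                [](const Track& t) {
                                    return t.consecutiveMissed >= 5 && isOutsideIntersection(t);
                                }),
                 tracks.end());
}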

10.5.2 Interpretation of sensor data relative to the vehicle

The goal of this module [KDdlF+ 04] is to compute a precise position of the bus at the intersection. This position is computed using an extended Kalman filter, with odometry data used to predict the position of the vehicle and a GPS used to estimate it. The localisation relies on a precise timestamping of the data, to minimize the effect of latency.

10.5.3 Collision Risk Estimation

Figure 10.12: interface (left) and different warning functions

Figure 10.13: interface (left) and different warning functions

The risk of collision is computed with the Closest Point of Approach (CPA) and Time to Closest Point of Approach (TCPA) method. The Closest Point of Approach is the point where two moving objects are at a minimum distance. To compute this distance, we suppose that the two objects move at constant speed, so at each time we are able to compute the distance between them. To know when this distance is minimal, we look for the instant at which the derivative of the squared distance is zero. This instant is called the Time to Closest Point of Approach and the corresponding distance is called the Closest Point of Approach. In our application, at each time step we compute the Time to Closest Point of Approach and the corresponding distance (i.e., the Closest Point of Approach) between each VRU and the bus. If this distance falls below a given threshold in less than 5 seconds, we compute the orientation of the VRU relative to the bus and generate an alarm for the HMI.
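Under the constant-velocity assumption, the time to closest point of approach has a closed form (the zero of the derivative of the squared distance). A minimal sketch is given below; the 5-second horizon comes from the text, while the distance threshold is a placeholder:

#include <cmath>

struct Vec2 { double x, y; };

struct CpaResult {
    double tcpa;   // time to closest point of approach, in seconds (clamped to >= 0)
    double dcpa;   // distance at that instant (closest point of approach)
};

// Relative position dp = pVru - pBus and relative velocity dv = vVru - vBus.
CpaResult closestPointOfApproach(Vec2 dp, Vec2 dv) {
    double dv2 = dv.x * dv.x + dv.y * dv.y;
    double t = 0.0;
    if (dv2 > 1e-9)                               // derivative of |dp + t*dv|^2 is zero at:
        t = -(dp.x * dv.x + dp.y * dv.y) / dv2;   // t = -(dp . dv) / |dv|^2
    if (t < 0.0)                                  // closest approach already passed
        t = 0.0;
    double cx = dp.x + t * dv.x, cy = dp.y + t * dv.y;
    return { t, std::sqrt(cx * cx + cy * cy) };
}

// Alarm rule used in the text: minimum distance below a threshold within 5 seconds.
bool shouldWarnDriver(Vec2 dp, Vec2 dv, double distanceThreshold /* e.g. a few meters */) {
    CpaResult c = closestPointOfApproach(dp, dv);
    return c.tcpa <= 5.0 && c.dcpa < distanceThreshold;
}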

10.5.4 Warning interface

The interface (Figure 10.12) is made of two parts: one is the bus outline (at the center of the interface), and the other is the possible positions of vulnerable road users (the 6 circles). These circles can be filled with warning pictograms, which show the place where a vulnerable road user could be struck by the bus: in front, middle, back, left or right side of the bus. The representation needs to be very simple in order to be understood immediately by the driver. Moreover, two speakers are added in order to warn the driver quickly about the danger (Figure 10.13).

Figure 10.14: (left) Two pedestrians are crossing the road; they are detected by our system while the CyCab arrives, turning right. (center) The system estimates that there is a high probability of collision with the pedestrian starting to cross the road. (right) It alerts the driver with a signal on the interface.

10.6 Experimental results

We reproduced the scenario shown in Figure 10.1(b) on our experimental site. The bus is played by the CyCab; two pedestrians crossing the road are detected by our system. One of them is in danger because of a high probability of being struck by the CyCab. The driver is alerted by the interface that a pedestrian is potentially in danger (see Figure 10.14).

Part V Conclusion and future work

Chapter 11

Conclusion

Contents
11.1 Overview
11.2 Future work

11.1 Overview

In the present thesis we have presented a generic framework for intelligent video analysis. We have focused on the development of several stages within this framework, adaptive background subtraction and visual object detection, and we have developed basic implementations for the remaining stages (tracking module). We have accelerated some algorithms using a low-cost solution, and we have prototyped an application for urban-zone safety built on top of these medium-level algorithms and the low-cost architecture. This work is composed of four main parts: the development of medium-level algorithms specialized in background subtraction and object detection, the use of the graphics hardware as a target for some algorithms, the design of the algorithms for execution on the graphics hardware and, finally, the integration of the different algorithms within an application. We have worked with state-of-the-art algorithms for background subtraction and visual object detection. We have implemented several approaches for background subtraction, compared their performance and enhanced their accuracy using high-level feedback. We have developed LibAdaBoost, a generic software library for machine learning, extended it into a boosting framework with a plugin-oriented architecture, and extended this framework to the visual domain, implementing the state-of-the-art features and developing new ones. With LibAdaBoost it has been easy to work with the different actors of the boosting framework; thus, we have easily enhanced the genetic algorithm (previously developed by Y. Abramson) as a weak learner. LibAdaBoost makes it possible to objectively compare the different features (performance and accuracy). To the best of our knowledge, LibAdaBoost is the first machine learning library specialized in boosting for the visual domain. The low-cost solution is based on using the graphics hardware for computer vision purposes. This has been a good choice given the performance profile of the application: the graphics hardware has a data-parallelism oriented architecture, interfaced through a streaming model of computation. We have introduced a streaming design of the visual object detection algorithm, and used this design to map the algorithm onto the graphics hardware. To the best of our knowledge, we were the first to implement a visual object detector on the graphics hardware. Finally, we integrated these algorithms within a large application in an industrial context. The PUVAME project was a good opportunity to evaluate them and to experiment with the integration issues. The results are still at a proof-of-concept stage: we aimed at validating the system on some elementary scenarios where we made some simplifications, and we obtained promising results in terms of both accuracy and performance.

11.2 Future work

Despite the various contributions of the present thesis, the complete framework for intelligent video analysis is still not perfect. Many improvements could be introduced at several levels. In this section we present a non-exhaustive list of improvements as future work:

• The generic framework for intelligent video analysis presented in Chapter 4 is still incomplete. In our research we have explored some stages of this framework, but not all of them. Exploring the other stages (object tracking, action recognition, semantic description, personal identification and fusion of multiple cameras) would widen the application range; we could then consider more advanced applications based on the fusion of multiple sensors, as well as a recognition system for controlling high-security areas.

• The adaptive background subtraction algorithms presented in Chapter 5 are not perfect. They have the advantage of being generic, meaning that they do not require any special preprocessing for a given installation; on the other hand, they need to be fused with other algorithms (visual object detection or motion detection) to reach a good accuracy. Recently, new approaches for background subtraction based on machine learning techniques have been proposed: Parag et al. [PEM06] used boosting to select the features that represent a given scene background at the pixel level. We propose to explore this possibility using the boosting framework LibAdaBoost (cf. Chapter 6).


• LibAdaBoost is a complete machine learning framework that provides a boosting framework, extended here to the visual object detection domain. It could be extended to other domains (for instance, as described above, to pixel classification). Its two major strengths are extensibility and robustness, and the library could be used as a reference implementation for boosting-related algorithms.

• In our research, we explored GPGPU for accelerating computer vision algorithms and reached promising performance results for visual object detection. The new graphics hardware generation, starting with the G8 series (NVIDIA), offers a complete solution for general-purpose computing; NVIDIA proposes CUDA as the software solution for GPGPU. We propose to continue the development on the CUDA framework to accelerate the different algorithms introduced in this thesis.

• The PUVAME project is a promising project for improving pedestrian safety in urban zones. The pedestrian detection and tracking system was validated on specific scenarios, and still has shortcomings before being fully operational in real-world scenes. More advanced research must therefore continue on the object detection part, as well as on the object tracking part.

• Video surveillance applications for large-scale installations are migrating to the third generation. The use of smart cameras as distributed agents imposes new architectures and raises new problems related to cooperation, the distribution of the computation and the fusion of information from multiple cameras. Implementing computer vision algorithms on these cameras is a task of its own, since the onboard embedded processors generally have special architectures; the mapping of the algorithms to smart cameras could be guided by a codesign activity aiming to develop new hardware cores for accelerating computer vision tasks. We believe that second-generation video surveillance systems still have a long life ahead, and that systems based on second-generation architectures still have to be developed; thus, the methodology used in this thesis still deserves to be improved and has a promising market.

Chapter 12

French conclusion

Dans le présent manuscrit, on a présenté une chaîne de traitement générique pour l'analyse vidéo intelligente. On a exploré plusieurs implémentations relatives aux différents étages de cette chaîne : modèles adaptatifs du fond d'une scène dynamique à caméra fixe, ainsi que la détection visuelle d'objets par apprentissage automatique en utilisant le boosting. On a pu développer des implémentations de base pour le reste des étapes de la chaîne de traitement. On a accéléré une partie de ces algorithmes en utilisant des architectures à faible coût basées sur les cartes graphiques programmables. On a prototypé une application pour améliorer la sécurité des piétons dans les zones urbaines près des arrêts de bus. Ce travail est composé de quatre parties principales : le développement d'algorithmes spécialisés dans la segmentation du fond et la détection visuelle d'objets, l'utilisation des cartes graphiques pour l'accélération de certains algorithmes de vision, la modélisation des algorithmes pour l'exécution sur des processeurs graphiques comme cible et, finalement, l'intégration des différents algorithmes dans le cadre d'une application. On a travaillé sur des algorithmes relevant de l'état de l'art sur la segmentation du fond et sur la détection visuelle d'objets. On a implémenté plusieurs méthodes pour la segmentation du fond ; ces méthodes sont basées sur le modèle adaptatif. On a étudié la performance de chacune de ces méthodes et on a essayé d'améliorer la robustesse en utilisant le retour d'informations de haut niveau. On a développé LibAdaBoost, une librairie pour l'apprentissage automatique implémentant la technique du boosting en utilisant un modèle générique des données. Cette librairie est étendue pour supporter le domaine visuel. On a implémenté les classifieurs faibles publiés dans l'état de l'art ainsi que nos propres classifieurs faibles. LibAdaBoost a facilité l'utilisation de la technique du boosting pour le domaine visuel. On a pu améliorer facilement l'algorithme génétique (développé par Y. Abramson). LibAdaBoost fournit des fonctionnalités qui permettent une comparaison objective des différents classifieurs faibles, des algorithmes d'apprentissage et de tout autre degré de liberté. À notre connaissance, LibAdaBoost est la première librairie d'apprentissage automatique qui fournit une implémentation du boosting robuste, extensible et qui supporte un modèle de données générique.

La solution à faible coût est basée sur les processeurs graphiques, qui sont exploités pour l'exécution des algorithmes de vision. Il s'est avéré que cette solution est adaptée à ce type d'algorithmes. Le processeur graphique est basé sur une architecture qui favorise le parallélisme de données. Les applications ciblées sur ce processeur sont modélisées en utilisant un modèle de calcul orienté flux de données. Ce modèle de calcul est basé sur deux éléments structuraux qui sont les flux de données (streams) et les noyaux de calcul (kernels). À notre connaissance, nous sommes les premiers à avoir implémenté un détecteur visuel d'objets basé sur le boosting sur un processeur graphique. Finalement, on a intégré ces algorithmes dans une large application dans un contexte industriel. Le projet PUVAME était une bonne opportunité pour évaluer ces algorithmes. On a pu aussi expérimenter la phase d'intégration avec les autres sous-systèmes développés par les autres partenaires. Les résultats restent au niveau de l'étude de faisabilité : on a eu recours à une validation ponctuelle des algorithmes sur des scénarios simplifiés, et on a obtenu des résultats encourageants, au niveau de la robustesse comme des performances.

Malgré les différentes contributions de ce travail, la totalité de la chaîne de traitement reste imparfaite. Plusieurs améliorations pourront avoir lieu. Une liste non exhaustive d'améliorations potentielles pourrait être la suivante :

• La chaîne de traitement pour l'analyse vidéo intelligente présentée dans le chapitre 4 est incomplète. On a exploré quelques étapes de cette chaîne, et il en reste encore plusieurs comme le suivi d'objets, la reconnaissance des objets, la description sémantique du contenu, l'analyse des actions et la collaboration entre plusieurs caméras sous forme de fusion de données. Cependant, cette exploration des autres étapes permettra d'élargir le spectre des applications, qui pourraient être utilisées dans le contrôle d'accès dans des zones à fortes contraintes de sécurité.

• Les algorithmes adaptatifs de segmentation du fond présentés dans le chapitre 5 ne sont pas parfaits. Ils ont l'avantage d'être génériques : ces algorithmes ne demandent pas un paramétrage préalable en fonction de la scène surveillée, mais ils ont besoin d'être fusionnés avec d'autres algorithmes pour arriver à une meilleure performance. Une nouvelle approche basée sur l'apprentissage automatique en utilisant le boosting au niveau du pixel est utilisée par Parag et al. [PEM06]. On propose d'explorer cette solution ; ceci est facilité par notre librairie de boosting LibAdaBoost (cf. chapitre 6).

• LibAdaBoost est une librairie complète dédiée à l'apprentissage automatique. Cette librairie implémente la technique du boosting d'une manière générique ; cette technique est étendue au domaine visuel. Elle pourra être étendue à d'autres types de données (comme décrit ci-dessus, au niveau des pixels pour la segmentation du fond). Cette librairie a deux points avantageux : l'extensibilité et la robustesse. Elle servira comme environnement de référence pour la recherche sur la technique du boosting.

• On a exploré l'utilisation des processeurs graphiques pour l'accélération des algorithmes de vision qui présentent un parallélisme de données. On a obtenu des résultats meilleurs qu'avec le processeur traditionnel. Les nouvelles cartes graphiques de NVIDIA de la série G8 sont programmables moyennant une solution logicielle révolutionnaire qui s'appelle CUDA. Cette solution logicielle permet la programmation des processeurs graphiques avec le langage C ANSI, sans passer par les couches de rendu 3D comme OpenGL et DirectX couplées aux langages de shaders comme Cg. On propose de continuer le développement sur les cartes graphiques à processeurs G8 avec la solution CUDA.

• Le projet PUVAME est prometteur pour l'amélioration de la sécurité des piétons dans les zones urbaines. Malgré la validation sur des scénarios élémentaires, les algorithmes de détection et de suivi devront être améliorés pour obtenir un système opérationnel dans les conditions réelles.

• Les systèmes de vidéosurveillance de grande envergure sont en migration vers la troisième génération. L'utilisation des caméras intelligentes comme agents distribués impose une nouvelle méthode de pensée et soulève de nouvelles problématiques au niveau de la coopération entre ces agents et de la distribution du calcul entre eux. L'écriture des algorithmes de vision pour ces caméras comme cibles est tout un art : les processeurs embarqués sur ces caméras ont des architectures spéciales, et pour le moment il n'y a pas de normes qui unifient ces aspects architecturaux. On pense que, malgré cette migration, la deuxième génération des systèmes de vidéosurveillance a une durée de vie assez longue qui justifie la poursuite des développements dédiés à ces systèmes. La méthodologie utilisée dans le cadre de cette thèse devra être améliorée ; aucune crainte budgétaire, le marché est prometteur.

Part VI Appendices

Appendix A Hello World GPGPU Listing A.1: Hello world GPGPU OpenGL/GLSL 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35

//−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− // www.GPGPU. o r g // Sample Code //−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− // Copyright ( c ) 2004 Mark J . H a r r i s and GPGPU. o r g // Copyright ( c ) 2004 3 Dlabs I n c . Ltd . //−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− // This s o f t w a r e i s p r o v i d e d ’ as−i s ’ , w i t h o u t any e x p r e s s o r i m p l i e d // warranty . In no e v e n t w i l l t h e a u t h o r s be h e l d l i a b l e f o r any // damages a r i s i n g from t h e u s e o f t h i s s o f t w a r e . // // P e r m i s s i o n i s g r a n t e d t o anyone t o u s e t h i s s o f t w a r e f o r any // purpose , i n c l u d i n g commercial a p p l i c a t i o n s , and t o a l t e r i t and // r e d i s t r i b u t e i t f r e e l y , s u b j e c t t o t h e f o l l o w i n g r e s t r i c t i o n s : // // 1 . The o r i g i n o f t h i s s o f t w a r e must not be m i s r e p r e s e n t e d ; you // must not c l a i m t h a t you wrote t h e o r i g i n a l s o f t w a r e . I f you u s e // t h i s s o f t w a r e i n a product , an acknowledgment i n t h e p r o d u c t // documentation would be a p p r e c i a t e d but i s not r e q u i r e d . // // 2 . A l t e r e d s o u r c e v e r s i o n s must be p l a i n l y marked a s such , and // must not be m i s r e p r e s e n t e d a s b e i n g t h e o r i g i n a l s o f t w a r e . // // 3 . This n o t i c e may not be removed o r a l t e r e d from any s o u r c e // distribution . // //−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− // Author : Mark H a r r i s ( harrism@gpgpu . o r g ) − o r i g i n a l helloGPGPU // Author : Mike Weiblen ( mike . w e i b l e n @ 3 d l a b s . com ) − GLSL v e r s i o n //−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− // GPGPU Lesson 0 : ”helloGPGPU GLSL” ( a GLSL v e r s i o n o f ”helloGPGPU ” ) //−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− // // GPGPU CONCEPTS I n t r o d u c e d : //

Part VI: Appendix 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84

// 1 . ) Texture = Array // 2 . ) Fragment Program = Computational K e r n e l . // 3 . ) One−to−one P i x e l t o Texel Mapping : // a ) Data−Dimensioned Viewport , and // b) Orthographic P r o j e c t i o n . // 4 . ) Viewport−S i z e d Quad = Data Stream G e ne ra t o r . // 5 . ) Copy To Texture = f e e d b a c k . // // For d e t a i l s o f each o f t h e s e c o n c e p t s , s e e t h e e x p l a n a t i o n s i n t h e // i n l i n e ”GPGPU CONCEPT” comments i n t h e code below . // // APPLICATION Demonstrated : A s i m p l e post−p r o c e s s edge d e t e c t i o n f i l t e r . // //−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− // Notes r e g a r d i n g t h i s ”helloGPGPU GLSL” s o u r c e code : // // This example was d e r i v e d from t h e o r i g i n a l ”helloGPGPU . cpp ” v1 . 0 . 1 // by Mark J . H a r r i s . I t d e m o n s t r a t e s t h e same s i m p l e post−p r o c e s s edge // d e t e c t i o n f i l t e r , but i n s t e a d implemented u s i n g t h e OpenGL Shading Language // ( a l s o known a s ”GLSL” ) , an e x t e n s i o n o f t h e OpenGL v1 . 5 s p e c i f i c a t i o n . // Because t h e GLSL c o m p i l e r i s an i n t e g r a l p a r t o f a vendor ’ s OpenGL d r i v e r , // no a d d i t i o n a l GLSL development t o o l s a r e r e q u i r e d . // This example was d e v e l o p e d / t e s t e d on 3 Dlabs Wildcat Realizm . // // I i n t e n t i o n a l l y minimized c h a n g e s t o t h e s t r u c t u r e o f t h e o r i g i n a l code // t o s u p p o r t a s i d e −by−s i d e comparison o f t h e i m p l e m e n t a t i o n s . // // Thanks t o Mark f o r making t h e o r i g i n a l helloGPGPU example a v a i l a b l e ! // // −− Mike Weiblen , May 2004 // // // [MJH : ] // This example has a l s o been t e s t e d on NVIDIA GeForce FX and GeForce 6800 GPUs . // // Note t h a t t h e example r e q u i r e s glew . h and g l e w 3 2 s . l i b , a v a i l a b l e a t // h t t p : / / glew . s o u r c e f o r g e . n e t . // // Thanks t o Mike f o r m o d i f y i n g t h e example . I have a c t u a l l y changed my // o r i g i n a l code t o match ; f o r example t h e i n l i n e Cg code i n s t e a d o f an // e x t e r n a l f i l e . // // −− Mark H a r r i s , June 2004 // // R e f e r e n c e s : // h t t p : / / gpgpu . s o u r c e f o r g e . n e t / // h t t p : / / glew . s o u r c e f o r g e . n e t / // h t t p : / /www. x m i s s i o n . com/˜ n a t e / g l u t . html //−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define GLEW_STATIC 1
#include <GL/glew.h>
#include <GL/glut.h>

// forward declarations
class HelloGPGPU;
void reshape(int w, int h);

// globals
HelloGPGPU *g_pHello;

// This shader performs a 9-tap Laplacian edge detection filter.
// (converted from the separate "edges.cg" file to embedded GLSL string)
static const char *edgeFragSource = {
"uniform sampler2D texUnit;"
"void main(void)"
"{"
"   const float offset = 1.0 / 512.0;"
"   vec2 texCoord = gl_TexCoord[0].xy;"
"   vec4 c  = texture2D(texUnit, texCoord);"
"   vec4 bl = texture2D(texUnit, texCoord + vec2(-offset, -offset));"
"   vec4 l  = texture2D(texUnit, texCoord + vec2(-offset,     0.0));"
"   vec4 tl = texture2D(texUnit, texCoord + vec2(-offset,  offset));"
"   vec4 t  = texture2D(texUnit, texCoord + vec2(    0.0,  offset));"
"   vec4 ur = texture2D(texUnit, texCoord + vec2( offset,  offset));"
"   vec4 r  = texture2D(texUnit, texCoord + vec2( offset,     0.0));"
"   vec4 br = texture2D(texUnit, texCoord + vec2( offset, -offset));"
"   vec4 b  = texture2D(texUnit, texCoord + vec2(    0.0, -offset));"
"   gl_FragColor = 8.0 * (c + -0.125 * (bl + l + tl + t + ur + r + br + b));"
"}"
};
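// Note on the filter weights:
//   gl_FragColor = 8*(c - 0.125*(bl+l+tl+t+ur+r+br+b)) = 8*c - (sum of the
//   8 neighbours),
// i.e. a discrete 8-neighbour Laplacian: it is large where the centre texel
// differs from its neighbourhood and close to zero over flat regions.
// The offset of 1.0/512.0 corresponds to one texel of the 512x512 texture
// created in initialize() below.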

// This class encapsulates all of the GPGPU functionality of the example.
class HelloGPGPU
{
public: // methods
    HelloGPGPU(int w, int h)
        : _rAngle(0), _iWidth(w), _iHeight(h)
    {
        // Create a simple 2D texture.  This example does not use
        // render to texture -- it just copies from the framebuffer to the
        // texture.

        // GPGPU CONCEPT 1: Texture = Array.
        // Textures are the GPGPU equivalent of arrays in standard
        // computation. Here we allocate a texture large enough to fit our
        // data (which is arbitrary in this example).
        glGenTextures(1, &_iTexture);
        glBindTexture(GL_TEXTURE_2D, _iTexture);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S,     GL_CLAMP);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T,     GL_CLAMP);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, _iWidth, _iHeight, 0,
                     GL_RGB, GL_FLOAT, 0);

        // GPGPU CONCEPT 2: Fragment Program = Computational Kernel.
        // A fragment program can be thought of as a small computational
        // kernel that is applied in parallel to many fragments
        // simultaneously. Here we load a kernel that performs an edge
        // detection filter on an image.

        _programObject = glCreateProgramObjectARB();

        // Create the edge detection fragment program
        _fragmentShader = glCreateShaderObjectARB(GL_FRAGMENT_SHADER_ARB);
        glShaderSourceARB(_fragmentShader, 1, &edgeFragSource, NULL);
        glCompileShaderARB(_fragmentShader);
        glAttachObjectARB(_programObject, _fragmentShader);

        // Link the shader into a complete GLSL program.
        glLinkProgramARB(_programObject);
        GLint progLinkSuccess;
        glGetObjectParameterivARB(_programObject, GL_OBJECT_LINK_STATUS_ARB,
                                  &progLinkSuccess);
        if (!progLinkSuccess)
        {
            fprintf(stderr, "Filter shader could not be linked\n");
            exit(1);
        }

        // Get location of the sampler uniform
        _texUnit = glGetUniformLocationARB(_programObject, "texUnit");
    }

    ~HelloGPGPU()
    {
    }

    // This method updates the texture by rendering the geometry (a teapot
    // and 3 rotating tori) and copying the image to a texture.
    // It then renders a second pass using the texture as input to an edge
    // detection filter. It copies the results of the filter to the texture.
    // The texture is used in HelloGPGPU::display() for displaying the
    // results.
    void update()
    {
        _rAngle += 0.5f;

        // store the window viewport dimensions so we can reset them,
        // and set the viewport to the dimensions of our texture
        int vp[4];
        glGetIntegerv(GL_VIEWPORT, vp);

        // GPGPU CONCEPT 3a: One-to-one Pixel to Texel Mapping: A Data-
        //                   Dimensioned Viewport.
        // We need a one-to-one mapping of pixels to texels in order to
        // ensure every element of our texture is processed. By setting our
        // viewport to the dimensions of our destination texture and drawing
        // a screen-sized quad (see below), we ensure that every pixel of our
        // texel is generated and processed in the fragment program.
        glViewport(0, 0, _iWidth, _iHeight);

        // Render a teapot and 3 tori
        glClear(GL_COLOR_BUFFER_BIT);
        glMatrixMode(GL_MODELVIEW);
        glPushMatrix();
        glRotatef(-_rAngle, 0, 1, 0.25);
        glutSolidTeapot(0.5);
        glPopMatrix();
        glPushMatrix();
        glRotatef(2.1 * _rAngle, 1, 0.5, 0);
        glutSolidTorus(0.05, 0.9, 64, 64);
        glPopMatrix();
        glPushMatrix();
        glRotatef(-1.5 * _rAngle, 0, 1, 0.5);
        glutSolidTorus(0.05, 0.9, 64, 64);
        glPopMatrix();
        glPushMatrix();
        glRotatef(1.78 * _rAngle, 0.5, 0, 1);
        glutSolidTorus(0.05, 0.9, 64, 64);
        glPopMatrix();

        // copy the results to the texture
        glBindTexture(GL_TEXTURE_2D, _iTexture);
        glCopyTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 0, 0, _iWidth, _iHeight);

        // run the edge detection filter over the geometry texture
        // Activate the edge detection filter program
        glUseProgramObjectARB(_programObject);

        // identify the bound texture unit as input to the filter
        glUniform1iARB(_texUnit, 0);

        // GPGPU CONCEPT 4: Viewport-Sized Quad = Data Stream Generator.
        // In order to execute fragment programs, we need to generate pixels.
        // Drawing a quad the size of our viewport (see above) generates a
        // fragment for every pixel of our destination texture. Each fragment
        // is processed identically by the fragment program. Notice that in
        // the reshape() function, below, we have set the frustum to
        // orthographic, and the frustum dimensions to [-1,1]. Thus, our
        // viewport-sized quad vertices are at [-1,-1], [1,-1], [1,1], and
        // [-1,1]: the corners of the viewport.
        glBegin(GL_QUADS);
        {
            glTexCoord2f(0, 0); glVertex3f(-1, -1, -0.5f);
            glTexCoord2f(1, 0); glVertex3f( 1, -1, -0.5f);
            glTexCoord2f(1, 1); glVertex3f( 1,  1, -0.5f);
            glTexCoord2f(0, 1); glVertex3f(-1,  1, -0.5f);
        }
        glEnd();

        // disable the filter
        glUseProgramObjectARB(0);

        // GPGPU CONCEPT 5: Copy To Texture (CTT) = Feedback.
        // We have just invoked our computation (edge detection) by applying
        // a fragment program to a viewport-sized quad. The results are now
        // in the frame buffer. To store them, we copy the data from the
        // frame buffer to a texture. This can then be fed back as input
        // for display (in this case) or more computation (see more
        // advanced samples.)

        // update the texture again, this time with the filtered scene
        glBindTexture(GL_TEXTURE_2D, _iTexture);
        glCopyTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 0, 0, _iWidth, _iHeight);

        // restore the stored viewport dimensions
        glViewport(vp[0], vp[1], vp[2], vp[3]);
    }

    void display()
    {
        // Bind the filtered texture
        glBindTexture(GL_TEXTURE_2D, _iTexture);
        glEnable(GL_TEXTURE_2D);

        // render a full-screen quad textured with the results of our
        // computation. Note that this is not part of the computation: this
        // is only the visualization of the results.
        glBegin(GL_QUADS);
        {
            glTexCoord2f(0, 0); glVertex3f(-1, -1, -0.5f);
            glTexCoord2f(1, 0); glVertex3f( 1, -1, -0.5f);
            glTexCoord2f(1, 1); glVertex3f( 1,  1, -0.5f);
            glTexCoord2f(0, 1); glVertex3f(-1,  1, -0.5f);
        }
        glEnd();

        glDisable(GL_TEXTURE_2D);
    }

protected: // data
    int           _iWidth, _iHeight; // The dimensions of our array
    float         _rAngle;           // used for animation

    unsigned int  _iTexture;         // The texture used as a data array

    GLhandleARB   _programObject;    // the program used to update
    GLhandleARB   _fragmentShader;

    GLint         _texUnit;          // a parameter to the fragment program
};

// GLUT idle function
void idle()
{
    glutPostRedisplay();
}

// GLUT display function
void display()
{
    g_pHello->update();  // update the scene and run the edge detect filter
    g_pHello->display(); // display the results
    glutSwapBuffers();
}

// GLUT reshape function
void reshape(int w, int h)
{
    if (h == 0) h = 1;

    glViewport(0, 0, w, h);

    // GPGPU CONCEPT 3b: One-to-one Pixel to Texel Mapping: An Orthographic
    //                   Projection.
    // This code sets the projection matrix to orthographic with a range of
    // [-1,1] in the X and Y dimensions. This allows a trivial mapping of
    // pixels to texels.
    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    gluOrtho2D(-1, 1, -1, 1);
    glMatrixMode(GL_MODELVIEW);
    glLoadIdentity();
}

// Called at startup
void initialize()
{
    // Initialize the "OpenGL Extension Wrangler" library
    glewInit();

    // Ensure we have the necessary OpenGL Shading Language extensions.
    if (glewGetExtension("GL_ARB_fragment_shader")      != GL_TRUE ||
        glewGetExtension("GL_ARB_vertex_shader")        != GL_TRUE ||
        glewGetExtension("GL_ARB_shader_objects")       != GL_TRUE ||
        glewGetExtension("GL_ARB_shading_language_100") != GL_TRUE)
    {
        fprintf(stderr, "Driver does not support OpenGL Shading Language\n");
        exit(1);
    }

    // Create the example object
    g_pHello = new HelloGPGPU(512, 512);
}

// The main function
int main()
{
    glutInitDisplayMode(GLUT_DOUBLE | GLUT_RGBA);
    glutInitWindowSize(512, 512);
    glutCreateWindow("Hello, GPGPU! (GLSL version)");

    glutIdleFunc(idle);
    glutDisplayFunc(display);
    glutReshapeFunc(reshape);

    initialize();

    glutMainLoop();

    return 0;
}

Appendix B

Hello World Brook

Listing B.1: BR file

/*---------------------------------------------------------------------------
  GPGPU - Computer Vision
  Adaptive Background Subtraction on Graphics Hardware
  ---------------------------------------------------------------------------
  Copyright (c) 2005 - 2006 Hicham Ghorayeb
  ---------------------------------------------------------------------------
  This software is provided 'as-is', without any express or implied
  warranty. In no event will the authors be held liable for any
  damages arising from the use of this software.

  Permission is granted to anyone to use this software for any
  purpose, including commercial applications, and to alter it and
  redistribute it freely, subject to the following restrictions:

  1. The origin of this software must not be misrepresented; you
  must not claim that you wrote the original software. If you use
  this software in a product, an acknowledgment in the product
  documentation would be appreciated but is not required.

  2. Altered source versions must be plainly marked as such, and
  must not be misrepresented as being the original software.

  3. This notice may not be removed or altered from any source
  distribution.

  ---------------------------------------------------------------------------
  Author: Hicham Ghorayeb (hicham.ghorayeb@ensmp.fr)
  ---------------------------------------------------------------------------
  Notes regarding this "GPU based Adaptive Background Subtraction" source code:

  Note that the example requires OpenCV and Brook

  -- Hicham Ghorayeb, April 2006

  References:
     http://www.gpgpu.org/
  --------------------------------------------------------------------------*/

#include
#include
#include <stdlib.h>
#include
#include "../bbs.h"
#include
#include "TimeUtils/timetools.h"

#define BBS_ALGO_STR            "BASIC BACKGROUND SUBTRACTION"
#define BBS_ALGO_STR_D          "BASIC BACKGROUND SUBTRACTION: Download"
#define BBS_ALGO_STR_K_CLASSIFY "BASIC BACKGROUND SUBTRACTION: Kernel: Classify"
#define BBS_ALGO_STR_K_UPDATE   "BASIC BACKGROUND SUBTRACTION: Kernel: Update"
#define BBS_ALGO_STR_K_INIT     "BASIC BACKGROUND SUBTRACTION: Kernel: Init"
#define BBS_ALGO_STR_U          "BASIC BACKGROUND SUBTRACTION: Upload"

/**************************************************
 * STREAMS DECLARATION
 **************************************************/
int insizex  = BBS_FRAME_WIDTH;
int insizey  = BBS_FRAME_HEIGHT;
int outsizex = BBS_FRAME_WIDTH;
int outsizey = BBS_FRAME_HEIGHT;

iter float2 it<insizey, insizex> = iter(float2(0.0f, 0.0f),
                                        float2(insizey, insizex));

float3 backgroundModel<insizey, insizex>;
float3 inputFrame<insizey, insizex>;
float  foregroundMask<insizey, insizex>;

float alpha     = BBS_ALPHA;
float threshold = BBS_THRESHOLD;

/**************************************************
 * KERNELS DECLARATION
 **************************************************/
kernel void br_kernel_bbs_init(float3 iImg[][],
                               iter float2 it<>,
                               out float3 oModel<>)
{
    oModel = iImg[it];
}

kernel void br_kernel_bbs_classify(float3 iImg[][],
                                   float3 iModel[][],
                                   iter float2 it<>,
                                   float threshold,
                                   out float oMask<>)
{
    float3 regAbsDiff;
    float3 inPixels = iImg[it];
    float3 inModel  = iModel[it];

    regAbsDiff = abs(inPixels - inModel);

    if (all(step(threshold, regAbsDiff)))
        oMask = 255.0f;
    else
        oMask = 0.0f;
}

kernel void br_kernel_bbs_update(float3 iImg[][],
                                 float3 iModel[][],
                                 iter float2 it<>,
                                 float alpha,
                                 out float3 oModel<>)
{
    float3 inPixels = iImg[it];
    float3 inModel  = iModel[it];
    float3 a = float3(alpha, alpha, alpha);

    oModel = a * inPixels + (1 - a) * inModel;
}
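// Taken together, the three kernels above implement a per-pixel running
// average background model. Writing I_t for the current frame (inputFrame)
// and B_t for the background model (backgroundModel):
//
//   init:     B_0 = I_0
//   classify: mask = 255 if step(threshold, |I_t - B_t|) is 1 on all three
//             colour channels (step(a, x) = 1 when x >= a), i.e. the pixel
//             is declared foreground only when R, G and B all differ from
//             the model by at least the threshold; otherwise mask = 0.
//   update:   B_{t+1} = alpha * I_t + (1 - alpha) * B_t
//
// The update is applied uniformly to every pixel, foreground or not, with
// the learning rate given by the global alpha declared above (BBS_ALPHA).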

/**************************************************
 * BRIDGES TO KERNELS
 **************************************************/
void bbs_read_input_streams(float *image)
{
    INIT_BRMM_PERFORMANCE_MEASURE(BBS_ALGO_STR);
    START_BRMM_PERFORMANCE_MEASURE(BBS_ALGO_STR_D);
    streamRead(inputFrame, image);
    STOP_BRMM_PERFORMANCE_MEASURE(BBS_ALGO_STR_D);
}

void bbs_write_output_streams(float *image)
{
    INIT_BRMM_PERFORMANCE_MEASURE(BBS_ALGO_STR);
    START_BRMM_PERFORMANCE_MEASURE(BBS_ALGO_STR_U);
    streamWrite(foregroundMask, image);
    STOP_BRMM_PERFORMANCE_MEASURE(BBS_ALGO_STR_U);
}

void run_bbsInit()
{
    INIT_BRMM_PERFORMANCE_MEASURE(BBS_ALGO_STR);
    START_BRMM_PERFORMANCE_MEASURE(BBS_ALGO_STR_K_INIT);
    br_kernel_bbs_init(inputFrame, it, backgroundModel);
    STOP_BRMM_PERFORMANCE_MEASURE(BBS_ALGO_STR_K_INIT);
}

void run_bbsClassify()
{
    INIT_BRMM_PERFORMANCE_MEASURE(BBS_ALGO_STR);
    START_BRMM_PERFORMANCE_MEASURE(BBS_ALGO_STR_K_CLASSIFY);
    br_kernel_bbs_classify(inputFrame,
                           backgroundModel,
                           it,
                           threshold,
                           foregroundMask);
    STOP_BRMM_PERFORMANCE_MEASURE(BBS_ALGO_STR_K_CLASSIFY);
}

void run_bbsUpdate()
{
    INIT_BRMM_PERFORMANCE_MEASURE(BBS_ALGO_STR);
    START_BRMM_PERFORMANCE_MEASURE(BBS_ALGO_STR_K_UPDATE);
    br_kernel_bbs_update(inputFrame,
                         backgroundModel,
                         it,
                         alpha,
                         backgroundModel);
    STOP_BRMM_PERFORMANCE_MEASURE(BBS_ALGO_STR_K_UPDATE);
}

Listing B.2: GPU subsystem

//-----------------------------------------------------------------------------
// GPGPU - Computer Vision
// Adaptive Background Subtraction on Graphics Hardware
//-----------------------------------------------------------------------------
// Copyright (c) 2005 - 2006 Hicham Ghorayeb
//-----------------------------------------------------------------------------
// This software is provided 'as-is', without any express or implied
// warranty. In no event will the authors be held liable for any
// damages arising from the use of this software.
//
// Permission is granted to anyone to use this software for any
// purpose, including commercial applications, and to alter it and
// redistribute it freely, subject to the following restrictions:
//
// 1. The origin of this software must not be misrepresented; you
// must not claim that you wrote the original software. If you use
// this software in a product, an acknowledgment in the product
// documentation would be appreciated but is not required.
//
// 2. Altered source versions must be plainly marked as such, and
// must not be misrepresented as being the original software.
//
// 3. This notice may not be removed or altered from any source
// distribution.
//
//-----------------------------------------------------------------------------
// Author: Hicham Ghorayeb (hicham.ghorayeb@ensmp.fr)
//-----------------------------------------------------------------------------
// Notes regarding this "GPU based Adaptive Background Subtraction" source code:
//
// Note that the example requires OpenCV and Brook
//
// -- Hicham Ghorayeb, April 2006
//
// References:
//    http://www.gpgpu.org/
//-----------------------------------------------------------------------------

#include "gpu_subsystem.h"
#include "bbs.h"

// Init Subsystem
int gpu_init_subsystem()
{
    // Allocates Float Buffer
    return 0;
}

// Release Subsystem
void gpu_release_subsystem()
{
    // Release Float Buffer
}

// Initialization of the Background Image
void gpu_InitBackground(struct _IplImage *image)
{
    assert((image->width  == BBS_FRAME_WIDTH) &&
           (image->height == BBS_FRAME_HEIGHT) &&
           (image->nChannels == 3));

    float *buffer = NULL;

    buffer = (float*) malloc(image->width * image->height *
                             image->nChannels * sizeof(float));

    // Init Buffer from IplImage
    for (int i = 0; i < image->height; i++){
        for (int j = 0; j < image->width; j++){
            buffer[i * image->width * 3 + j * 3 + 0] =
                (float) image->imageData[i * image->widthStep + j * 3 + 0];
            buffer[i * image->width * 3 + j * 3 + 1] =
                (float) image->imageData[i * image->widthStep + j * 3 + 1];
            buffer[i * image->width * 3 + j * 3 + 2] =
                (float) image->imageData[i * image->widthStep + j * 3 + 2];
        }
    }

    bbs_read_input_streams(buffer);
    run_bbsInit();

    free(buffer);
}

// Call The Algorithm to classify pixels into Foreground and background
void gpu_ClassifyImage(struct _IplImage *image)
{
    assert((image->width  == BBS_FRAME_WIDTH) &&
           (image->height == BBS_FRAME_HEIGHT) &&
           (image->nChannels == 3));

    float *buffer = NULL;

    buffer = (float*) malloc(image->width * image->height *
                             image->nChannels * sizeof(float));

    // Init Buffer from IplImage
    for (int i = 0; i < image->height; i++){
        for (int j = 0; j < image->width; j++){
            buffer[i * image->width * 3 + j * 3 + 0] =
                (float) image->imageData[i * image->widthStep + j * 3 + 0];
            buffer[i * image->width * 3 + j * 3 + 1] =
                (float) image->imageData[i * image->widthStep + j * 3 + 1];
            buffer[i * image->width * 3 + j * 3 + 2] =
                (float) image->imageData[i * image->widthStep + j * 3 + 2];
        }
    }

    bbs_read_input_streams(buffer);
    run_bbsClassify();

    free(buffer);
}

// Retrieve the result of background/foreground segmentation
void gpu_RetrieveResult(struct _IplImage *image)
{
    assert((image->width  == BBS_FRAME_WIDTH) &&
           (image->height == BBS_FRAME_HEIGHT) &&
           (image->nChannels == 1));

    float *buffer = NULL;

    buffer = (float*) malloc(image->width * image->height *
                             image->nChannels * sizeof(float));

    bbs_write_output_streams(buffer);

    // Update IplImage from the Buffer
    for (int i = 0; i < image->height; i++){
        for (int j = 0; j < image->width; j++){
            image->imageData[i * image->widthStep + j + 0] =
                ((buffer[i * image->width + j + 0] > 0.0f) ? 255 : 0);
        }
    }

    free(buffer);
}

// Update the Background Model
void gpu_UpdateBackgroundModel()
{
    run_bbsUpdate();
}
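Listing B.2 only exposes the gpu_* entry points; the driver that ties them to a video source is reproduced in Listing B.4 below. The sketch that follows is not taken from the thesis: it only illustrates the intended call order, and it assumes an OpenCV camera capture that already delivers BBS_FRAME_WIDTH x BBS_FRAME_HEIGHT BGR frames (otherwise the asserts in the functions above fire), with the BBS_FRAME_* constants assumed to come from bbs.h.

// Hypothetical driver sketch (not from the thesis): call order of the gpu_* API.
#include <cv.h>
#include <highgui.h>
#include "gpu_subsystem.h"
#include "bbs.h"

int main()
{
    CvCapture *capture = cvCaptureFromCAM(0);   // assumed video source
    IplImage  *mask    = cvCreateImage(cvSize(BBS_FRAME_WIDTH, BBS_FRAME_HEIGHT),
                                       IPL_DEPTH_8U, 1);
    cvNamedWindow("foreground", 1);

    gpu_init_subsystem();

    IplImage *frame = cvQueryFrame(capture);
    if (!frame) return 1;
    gpu_InitBackground(frame);                  // B_0 = first frame

    while ((frame = cvQueryFrame(capture)) != NULL)
    {
        gpu_ClassifyImage(frame);               // upload frame, run classify kernel
        gpu_RetrieveResult(mask);               // download the foreground mask
        gpu_UpdateBackgroundModel();            // B <- alpha*I + (1-alpha)*B
        cvShowImage("foreground", mask);
        if (cvWaitKey(10) == 27) break;         // ESC quits
    }

    gpu_release_subsystem();
    cvReleaseImage(&mask);
    cvReleaseCapture(&capture);
    return 0;
}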

Listing B.3: CPU subsystem

//-----------------------------------------------------------------------------
// GPGPU - Computer Vision
// Adaptive Background Subtraction on Graphics Hardware
//-----------------------------------------------------------------------------
// Copyright (c) 2005 - 2006 Hicham Ghorayeb
//-----------------------------------------------------------------------------
// This software is provided 'as-is', without any express or implied
// warranty. In no event will the authors be held liable for any
// damages arising from the use of this software.
//
// Permission is granted to anyone to use this software for any
// purpose, including commercial applications, and to alter it and
// redistribute it freely, subject to the following restrictions:
//
// 1. The origin of this software must not be misrepresented; you
// must not claim that you wrote the original software. If you use
// this software in a product, an acknowledgment in the product
// documentation would be appreciated but is not required.
//
// 2. Altered source versions must be plainly marked as such, and
// must not be misrepresented as being the original software.
//
// 3. This notice may not be removed or altered from any source
// distribution.
//
//-----------------------------------------------------------------------------
// Author: Hicham Ghorayeb (hicham.ghorayeb@ensmp.fr)
//-----------------------------------------------------------------------------
// Notes regarding this "GPU based Adaptive Background Subtraction" source code:
//
// Note that the example requires OpenCV and Brook
//
// -- Hicham Ghorayeb, April 2006
//
// References:
//    http://www.gpgpu.org/
//-----------------------------------------------------------------------------

#include "cpu_subsystem.h"

int cpu_init_subsystem()
{
    return 0;
}

void cpu_release_subsystem()
{
}

Listing B.4: Main function

//-----------------------------------------------------------------------------
// GPGPU - Computer Vision
// Adaptive Background Subtraction on Graphics Hardware
//-----------------------------------------------------------------------------
// Copyright (c) 2005 - 2006 Hicham Ghorayeb
//-----------------------------------------------------------------------------
// This software is provided 'as-is', without any express or implied
// warranty. In no event will the authors be held liable for any
// damages arising from the use of this software.
//
// Permission is granted to anyone to use this software for any
// purpose, including commercial applications, and to alter it and
// redistribute it freely, subject to the following restrictions:
//
// 1. The origin of this software must not be misrepresented; you
// must not claim that you wrote the original software. If you use
// this software in a product, an acknowledgment in the product
// documentation would be appreciated but is not required.
//
// 2. Altered source versions must be plainly marked as such, and
// must not be misrepresented as being the original software.
//
// 3. This notice may not be removed or altered from any source
// distribution.
//
//-----------------------------------------------------------------------------
// Author: Hicham Ghorayeb (hicham.ghorayeb@ensmp.fr)
//-----------------------------------------------------------------------------
// Notes regarding this "GPU based Adaptive Background Subtraction" source code:
//
// Note that the example requires OpenCV and Brook
//
// -- Hicham Ghorayeb, April 2006
//
// References:
//    http://www.gpgpu.org/
//-----------------------------------------------------------------------------

#include
#include

#include <stdlib.h>
#include <float.h>
#include <limits.h>
#include
#include
#include
#include
#include
#include
< f l o a t . h> < l i m i t s . h>